|Supercomputing on a Shoestring|
|Originally published October, 1998|
|© 1998, 2005 Carlo Kopp|
Scientific and engineering computing is well and truly a science within itself, and one which is by any measure a voracious glutton of compute performance. The legendary Cray supercomputers, and many others of this ilk, were crafted first and foremost for this purpose.
Traditionally, high performance scientific and engineering computing was very expensive to perform, precisely because highly specialised machinery was required to achieve the needed performance. This long standing idiosyncrasy of such computing work is however in the process of dramatic change, with the increasing proliferation of workstation clusters into the marketplace.
In this month's feature we will briefly review the "traditional" approach to scientific and engineering computing, and then explore in more detail current clustering techniques.
Supercomputers, Minisupercomputers and Microsupercomputers
To best understand the implications of current technology developments, it is useful to take a look at the distant past, specifically the first genuine supercomputers, since these machines defined the paradigm.
Taking the mid sixties as a datum point, the premier high performance computing platforms of that period were Control Data Corporation's CDC 6600 and 7600. These machines used many of the internal features which we take for granted in modern microprocessors, such as pipelining and parallel functional units designed to exploit Instruction Level Parallelism (ILP, with reference to last month's feature on processor architecture and VLIW techniques). Peak floating point arithmetic performance was 4.5 MFLOPS and 15-36 MFLOPS respectively. The first CDC 7600 was delivered in 1969. With several Megabytes of magnetic core memory, these machines represented the peak of their generation.
The limitation of this generation of machines, architecturally, lay in their inability to efficiently handle large arrays of operands, and basic matrix/vector arithmetic, very much the mainstay of scientific and engineering computing. Operations on vectors and matrices typically involve very large numbers of identical floating point operations, executed repetitively, with the indices into the vector or matrix being incremented for each operation. A conventional "scalar" architecture incurs a large overhead in vector/matrix intensive work, since it has to compute new addresses for operands as it steps through the array of data.
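The per-element bookkeeping described above can be illustrated with a small sketch (in modern Python rather than the Fortran of the era; the function names are mine, not from any period library). The scalar-style loop must update an index, and thereby an operand address, on every iteration, whereas a vector operation is conceptually issued once over the whole array:

```python
def saxpy_scalar(a, x, y):
    # Scalar-style loop: the index i (and the implied address arithmetic
    # base + i*stride) must be recomputed for every single element.
    result = [0.0] * len(x)
    for i in range(len(x)):
        result[i] = a * x[i] + y[i]
    return result

def saxpy_vector(a, x, y):
    # "Vector" style: one whole-array operation; a vector unit streams the
    # operands through a pipeline with no per-element index bookkeeping.
    return [a * xi + yi for xi, yi in zip(x, y)]
```

Both produce the same result; the difference on a real vector machine is that the second form maps to a single pipelined hardware operation.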
Dedicated vector processing first appeared in a serious fashion with CDC's Cyber 200 series and Seymour Cray's CRAY-1 machine. In response to a requirement from Lawrence Livermore National Laboratory (LLNL), CDC initiated in 1965 the development of the STAR-100 machine, building on the technology used in the commercial 6600/7600 series machines, but using a new memory architecture designed to maximise storage bandwidth to the processor, a 40 ns clock cycle (25 MHz) and with instructions to operate on vectors as large as 65,536 elements. The STAR-100 was regarded as a disappointment by many, since it was not a hot performer on scalar problems, and the time overheads of setting up for vector operations meant that it performed best with larger rather than smaller array sizes. CDC persevered with the basic architectural model, and produced the redesigned STAR-100A, marketed as the Cyber 203, and later the STAR-100C, which became the Cyber 205.
The Cyber 205 had a 20 ns clock cycle (50 MHz), ECL LSI based processor, a high performance scalar execution unit, 1-4 MWords of main memory accessed through a 512 bit wide bus, and all controlled by microcode. An interesting feature of the design was the use of length matched internal cabling, to defeat clock skew in the hardware. The vector unit achieved peak performance of 100-400 MFLOPS for 32/64 bit operands, with up to 800 MFLOPS achievable for certain 32-bit vector operations (eg scalar product).
Seymour Cray departed CDC during this period to form Cray Research Inc. Cray's first major commercial success was the CRAY-1 machine, which, like the Cyber series, was built around the model of parallel functional units, and a highly pipelined vector processing engine. The CRAY-1 architecture was built around 64 element vector sizes, and the Fortran compiler was responsible for vectorising DO loops and larger operand arrays into segments small enough to be crunched by the vector hardware. Published performance figures for the CRAY-1 suggest that the machine performed best when problems could be efficiently vectorised into arrays of between 15 and 64 elements. Matrix multiplication performance peaked at about 140 MFLOPS, with best performance achieved for array sizes which were multiples of 64.
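The segmentation performed by a vectorising compiler, often called strip-mining, can be sketched in a few lines (a modern Python illustration of the idea, not the compiler's actual transformation): a long loop is broken into strips no larger than the machine's vector register length, and each strip becomes one vector operation.

```python
VLEN = 64  # CRAY-1 vector register length, in elements

def strip_mined_sum(x):
    # Process a long array in vector-register-sized strips, the way a
    # vectorising compiler segments a long Fortran DO loop.
    total = 0.0
    for start in range(0, len(x), VLEN):
        strip = x[start:start + VLEN]  # one strip fits the vector registers
        total += sum(strip)            # stands in for one vector operation
    return total
```

Arrays whose length is a multiple of VLEN fill every strip completely, which is consistent with the CRAY-1 performing best on array sizes that were multiples of 64.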
The Supercomputer paradigm is essentially defined by the Cyber 200 series and early CRAY machines. Very high clock speeds, large internal memory, very wide and fast internal busses, large internal register sets, multiple heavily pipelined execution units, and vector processing execution units designed to operate on large data arrays.
Performance was achieved at considerable cost, since minimising internal propagation delays, avoiding clock skew, and cooling the machines were non-trivial problems. Later CRAYs required that replacement boards be manually tuned for matched signal delays, and immersed in circulating Freon coolant for the closed cycle cooling system. The characteristic circular footprint resulted from an interconnection scheme designed to use the shortest possible bus length. By the mid eighties, this basic design strategy was beginning to reach its limits, since propagation delays in interconnects became the performance bottleneck.
The high cost of large vector machines produced a number of notable alternatives in the marketplace, generally labelled as "mini-supercomputers". The best known were the Convex "Crayettes", which were essentially superminicomputers with a high speed vector processing engine. One manufacturer produced, during the eighties, a vector processing attachment for the VAX-11/780 supermini. The idea was to bolt on an additional VAX-780 cabinet, and extend the machine's SBI backplane into the next cabinet, wherein the vector engine resided. A new microcode load was required to access the vector hardware.
Without doubt the most successful Crayette was the recently discontinued Intel i860 "microsupercomputer". A contemporary of the embedded i960, the i860 was built around a RISC core, with a vector processing engine on chip, and included core facilities for a single board machine, such as an embedded monochrome frame buffer. The i860 was widely used in custom and military signal processing equipment, either on VMEbus boards, or on custom boards. At least two manufacturers supplied add-on i860 boxes for Unix workstations, using an I/O board and ribbon cable to hook the i860 in via a shared memory interface. Since the i860 used a different native instruction set to the workstation, schemes were devised using dedicated libraries, which would remotely invoke the i860 to execute portions of jobs.
The greatest drawbacks of the now classical vector supercomputing model lay in the poor scalability of the hardware with increasing clock speeds, in painful software development, and in often abysmal portability. While many improvements were seen in vectorising Fortran compilers over the life of these machines, applications were typically not very portable between compilers, and if custom libraries were used, often a complete rewrite was required. The Crayettes further suffered in the marketplace, since many users did not fully understand the idiosyncrasies of the computing model and often sought high scalar performance from machines which, other than possessing a high performance vector engine, were no better than their contemporary mini/microcomputers at crunching scalar problems. I supported two i860 based products during the early nineties, and more often than not, customer enquiries resulted in impromptu 2 hour telephone tutorials on the differences between vector and scalar processing. Many clients believed that they merely needed to recompile their existing Fortran program on the new machine.
The performance leaps produced in top end RISC microprocessors during the early nineties, resulting from dramatic clock speed growth, and the adoption of superscalar architectures, meant that multiprocessing RISC machines using essentially workstation technology could effectively compete with supercomputers at the bottom end of the market. This decade has seen the demise of Cray as an entity, and a de facto collapse in the marketplace. The end of the Cold War and thus large scale code cracking and high technology weapons development programs (eg Star Wars) meant that the demand for top end performance declined concurrently with the loss of the bottom end market.
A recently published table of MegaFLOP ratings for the Oxford BSP toolset puts the SGI PowerChallenge and a Pentium Pro cluster (Network of Workstations) at 74 and 60 MFLOPS respectively, for 1M element dot products, and 200 wide matrix multiplies.
Multiprocessing machines built around high speed RISC microprocessors are not without their problems. From a programming perspective, there is little scope architecturally to exploit Instruction Level Parallelism in the manner of supercomputers, since mostly the processors do not have vector instruction support. This means that programs must be parallelised differently, and typically this results in breaking large problems into smaller chunks, computing these, and then finding a method for efficiently combining the results. For many problem types, such as weather simulation, finite element analysis, computational fluid dynamics, stellar dynamics, physics hydrodynamics, this can be achieved albeit with some loss in efficiency. The model relies basically upon the idea that many cheap microprocessors can concurrently match or exceed the MegaFLOP rating of a single vector engine.
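The decompose-compute-combine pattern described above can be sketched with a dot product (a minimal modern Python illustration, using a thread pool as a stand-in for the cluster's processors; a real MPP would distribute the chunks across separate machines):

```python
from concurrent.futures import ThreadPoolExecutor

def partial_dot(chunk):
    # Each worker computes the dot product of its own slice of the data.
    x_part, y_part = chunk
    return sum(a * b for a, b in zip(x_part, y_part))

def parallel_dot(x, y, nworkers=4):
    # Break the problem into chunks, compute them concurrently, then
    # combine the partial results -- the basic decomposition pattern.
    size = (len(x) + nworkers - 1) // nworkers
    chunks = [(x[i:i + size], y[i:i + size]) for i in range(0, len(x), size)]
    with ThreadPoolExecutor(nworkers) as pool:
        return sum(pool.map(partial_dot, chunks))
```

The efficiency loss mentioned above shows up in the final combining step and in any data exchange the chunks require, which is why interconnect bandwidth matters so much in these designs.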
From a hardware/system perspective, the multiprocessing approach produces two basic solutions. For smaller numbers of processors, typically between 4 and 16, large machines can be built in which a very fast and wide bus is used to provide shared access to the main memory. SGI's Challenge series is a good example. Beyond some point the bus will saturate and this imposes a scalability limit on the hardware. In this arrangement, mostly a symmetrical multiprocessing Unix kernel will be run on the machine, each processor executing the kernel as required.
To aggregate much larger numbers of processors, a different interconnection model is required. This is typically one or another variation on the Hypercube scheme, in which a processor can typically access only several of its immediate neighbours, and is usually referred to as a Massively Parallel Processor (MPP). Since in many "divide and conquer" approaches to solving large problems, data mainly needs to be exchanged between neighbouring processors which share a boundary in the problem, this strategy can perform well. Where it performs much less effectively is in situations where data must be exchanged between processors which do not share a direct interconnection, since intermediate nodes (processors) must in effect route the data between the communicating nodes.
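The hypercube's addressing scheme is elegant: in a d-dimensional hypercube, node IDs are d-bit numbers, direct neighbours differ in exactly one bit, and the minimum number of routing hops between two nodes is simply the number of bits in which their IDs differ. A short sketch (function names are mine):

```python
def neighbours(node, dim):
    # Direct neighbours of a node in a dim-dimensional hypercube:
    # flip each of the dim address bits in turn.
    return [node ^ (1 << bit) for bit in range(dim)]

def hops(src, dst):
    # Minimum hops between two nodes = Hamming distance of their IDs,
    # since each hop can correct exactly one differing address bit.
    return bin(src ^ dst).count("1")
```

In an 8-node (3-dimensional) cube, node 0 reaches nodes 1, 2 and 4 directly, but a message to node 7 must be relayed through two intermediate nodes, which is precisely the indirect-exchange penalty described above.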
While a hypercube based machine will typically use a "commodity" microprocessor, such as a Pentium Pro, Alpha, R-N000 or SPARC, these are installed on dedicated processor boards and employ specialised board-to-board busses to implement the internal interconnections. Thus such machines can still be very expensive, especially if configured with large numbers of processors. Moreover, reliability can also become an issue with large numbers of boards, and importantly, interconnections.
COWs, NOWs and A Pile of PCs
By the early nineties the increasing performance of high speed networking hardware and commodity desktop/deskside machines, and the declining cost of both, suggested that another alternative to the custom built MPP was becoming feasible. This alternative was based on the idea of using off-the-shelf commodity machines and networking hardware, to produce a "virtual MPP".
There is much ongoing R&D effort, internationally, in this area and Australia is no exception.
Buzzwords soon evolved to describe this model, Cluster of Workstations (COW), Network of Workstations (NOW) and Pile of PCs (POPC) all appeared during recent years. The shared characteristic of all of these models is the use of standard hardware, hooked up over a standard network, and employing specialised software tools to start up, manage and terminate jobs. A number of toolsets have evolved and these will be discussed further.
The simplest interconnection topology which can be used for a COW/NOW/POPC based MPP is to simply connect all participating processors into a single high speed hub or switch. Since the communication channel is shared by all processors, arbitrary models for exchanging data between processors may be used. Hypercubes of arbitrary order would be an example, with the "neighbourly" relationship between processors defined wholly in software.
The drawback of a single shared channel is clearly evident, since it is a potential performance bottleneck, especially for applications which must frequently communicate between processors. Many alternatives exist.
The simplest alternative is to equip all machines in the cluster with multiple network interfaces, and connect these to a stack of hubs or switches. In this manner the network load can be distributed across multiple interconnections, and if provisions are made in the network interface drivers, the traffic load can be balanced across the multiple networks. Such a scheme, termed a Bonded Dual Net, was tested using two 10 Mbit/s Ethernets in early variants of the Beowulf POPC. The drawback to this approach lies in higher hardware costs for multiple adaptors and hubs/switches, and the need for more complex device drivers.
A more sophisticated approach is to slice the array of machines, each with only a pair of interfaces, across multiple hubs, to produce a de facto hypercube architecture. With 16 machines, each having a pair of network interfaces, and 8 hubs used, any machine can directly address eight of its neighbours in the cluster. This interconnection model is termed a Routed Mesh. It delivers good bandwidth between machines which share a common hub, but loses performance once accesses must be made indirectly, since an intermediate machine must route IP traffic between two hubs.
The performance of the Mesh model can be improved significantly by using switches in addition to hubs, to remove the need for IP routing between mutually unconnected machines. The Switched Mesh model is identical to the Routed Mesh, with the addition of two 4 port switches which interconnect the two sets of four hubs. Connections between mutually unconnected machines flow through the switches and thus incur no computational overhead in software. A variant of this scheme is the Bonded Switched Mesh, in which loads are balanced across multiple paths.
Other alternatives for interconnection also exist. An example would be the tree structured model used in a joint DARPA/NASA Beowulf cluster, in which meta-nodes of 8 machines are clustered, and the meta-nodes interconnected via a 1.2 Gbit/s crossbar switch.
Another important issue is I/O bandwidth to shared storage, of particular importance with very large datasets characteristic of supercomputing applications. Typically machine clusters will have up to several Gigabytes of local storage, and some applications will tolerate mediocre bandwidth to shared storage, since it is only required to transfer the required chunk of the dataset between local and shared storage at job startup and completion time. Applications which must frequently access shared storage however will require higher bandwidth. A typical solution would be to have a large NFS RAID server connected via a pair of interfaces to the switches in a mesh, allowing any single node to connect to the store.
In terms of platforms and operating systems, current clustering toolsets support many Unix variants on often arbitrary workstation or PC hardware, some support NT, and the large and aggressive US Beowulf project (http://www.beowulf.org/) uses Linux on Intel based PCs. The inherent flexibility of the COW/NOW/POPC model means that arbitrary hardware can be exploited, and given the ever improving performance/price ratio of Intel based PCs, these will become the mainstay of such MPP systems.
While the hardware issues in producing effective clusters are not entirely trivial, the real complexity lies in the software. The primary areas of interest span job management, involving starting, monitoring and completion of jobs, and message passing between jobs. A typical approach in current toolsets is to provide a package which includes job management tools, and libraries for message passing which are then integrated into the user's application.
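The message passing abstraction at the heart of these toolsets can be illustrated with a toy sketch (modern Python; the Comm class and its send/recv calls are illustrative of the PVM/MPI style, not any real library's API, and threads stand in for processes on separate cluster nodes):

```python
import queue
import threading

class Comm:
    # Toy message-passing layer: each rank owns a mailbox, and send/recv
    # move messages between ranks, in the spirit of PVM/MPI primitives.
    def __init__(self, nprocs):
        self.boxes = [queue.Queue() for _ in range(nprocs)]

    def send(self, dest, msg):
        self.boxes[dest].put(msg)

    def recv(self, rank):
        return self.boxes[rank].get()  # blocks until a message arrives

def reduce_sum(chunks):
    # Ranks 1..n-1 each sum a chunk and send the result to rank 0,
    # which combines them -- a hand-rolled "reduce" operation.
    comm = Comm(len(chunks))

    def worker(rank):
        comm.send(0, sum(chunks[rank]))

    workers = [threading.Thread(target=worker, args=(r,))
               for r in range(1, len(chunks))]
    for t in workers:
        t.start()
    total = sum(chunks[0])             # rank 0 handles its own chunk
    for _ in range(len(chunks) - 1):
        total += comm.recv(0)          # gather the other ranks' sums
    for t in workers:
        t.join()
    return total
```

Real toolsets add the hard parts this sketch omits: starting the worker processes on remote machines, moving the messages over the network, and cleaning up when a node fails.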
The largest POPC program currently under way is the primarily US Beowulf academic and research program, which is focussed on producing an enhanced Linux kernel designed to support MPP clustering on commodity hardware. Components of the Beowulf package include enhanced fast Ethernet device drivers, network load balancing (Bonding) facilities, support for cluster wide process IDs (PIDs), a pre-pager for virtual memory (which loads pages likely to be accessed before they are accessed), and other performance supporting tools. An important feature of Beowulf is the Distributed Shared Memory (DSM) facility, which produces "virtual" shared memory across multiple machines in a Beowulf cluster.
The Beowulf package is designed to support the widely used PVM and MPI parallel development environments, as well as the Oxford developed BSP environment.
The Parallel Virtual Machine (PVM) environment was initiated by Emory University and Oak Ridge National Laboratory in 1989 to support a model based on heterogeneous hosts forming an MPP. Portability across platforms was a primary requirement, so much so that message passing throughput performance was considered a secondary requirement. The programming interface used in PVM, the API, evolved with the project, and supports the clustering of multiple large multiprocessing machines. Facilities are provided to support fault tolerance in the cluster, and the dynamic addition and removal of hosts.
The Message Passing Interface (MPI) is a more recent environment, developed by a consortium of MPP machine vendors and initially designed by committee in 1993/94. MPI is aimed at homogeneous environments, typical of vendor based solutions, and is heavily optimised to achieve best possible message passing performance, with cross platform interoperability not a consideration. MPI provides support for various topological models, facilities for group communication between processes, and has what is considered to be the best library package of node-node communication routines available at this time. A Motif based front end, XMPI, is available.
More recently an effort has begun to merge the two environments into the PVMPI package.
Another widely used environment is the Local Area Multicomputer (LAM), developed by the Ohio Supercomputer Centre. LAM is available with PVM and MPI APIs. Historically it evolved from the Trollius environment, designed to control and manage an array of Transputers. LAM employs a dedicated microkernel for managing message passing, which runs as a daemon on the host platform.
Closer to home, we have the Clustor environment, codeveloped by Prof David Abramson, then of Griffith University, and Dr Rok Sosic, now CEO of Active Tools. The commercial Clustor environment evolved from the earlier Nimrod effort, and is a system designed to support parametric execution of large computing jobs. These are typically large simulations, which must be run with incrementally adjusted setup parameters. Nimrod was funded by Griffith University and the Co-operative Research Centre for Distributed Systems Technology. At this time Prof Abramson, now at Monash CSSE, has set up a Linux/NT cluster at Monash with ten dual CPU 233 MHz Pentium II machines, and a dual CPU 333 MHz Pentium II cluster controller, running the Clustor package. Readers interested in details should visit http://www.activetools.com/.
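The parametric execution model is easily sketched (a modern Python illustration of the idea only; `run_model` is a hypothetical stand-in, since a real tool like Clustor launches the user's own simulation binary on cluster nodes rather than calling a Python function): every combination of the swept parameters becomes an independent job, farmed out to a pool of workers.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def run_model(params):
    # Hypothetical stand-in for one run of a large simulation;
    # here just a cheap formula so the sketch is self-contained.
    temperature, pressure = params
    return params, temperature * pressure

def parametric_sweep(temperatures, pressures, nworkers=4):
    # Enumerate every parameter combination and farm the runs out to a
    # worker pool, collecting results keyed by parameter set.
    jobs = list(product(temperatures, pressures))
    with ThreadPoolExecutor(nworkers) as pool:
        return dict(pool.map(run_model, jobs))
```

Because the jobs are completely independent, this class of workload scales almost linearly with the number of machines, which is what makes it such a natural fit for commodity clusters.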
At this time, massively parallel computing is in transition from a research and development technology to a genuine production technology. While many aspects of the technology are still maturing, it is becoming an increasingly available and robust tool for solving scientific and engineering computational tasks. Clearly we have yet to see the best the paradigm can offer.