IBM SP2


1 Introduction

The IBM Scalable POWERparallel 2 (SP2) is IBM's current supercomputer series, based on RISC technology. The most famous SP2 around is Deep Blue, the chess computer that almost beat Kasparov in a six-game match in February 1996. In any case, it was an SP2 that controlled the purpose-built board of chess evaluation circuits used in the match. There are currently over 450 SP2 systems installed at customer sites around the world.


2 Hardware

The SP2 system consists of 2 to 512 POWER2 Architecture RISC System/6000 processor nodes, each with its own private memory and its own copy of the AIX operating system, interconnected by a switched network. Each SP2 system also requires a control workstation, a separate RISC System/6000 workstation that serves as the SP2 system console. Parallel processing is based on distributed memory and message passing.


2.1 Processor nodes

There are two different types of processor nodes in an SP2, thin and wide nodes, which can be mixed in one frame. A comparison of the processor nodes is given in Table 1. The thin nodes are meant to be the compute nodes, while the wide nodes are better suited as server nodes. Each node is based on a superscalar POWER2 processor from the RISC System/6000 family.

Table 1 Different processor nodes used in SP2
Processor Type              Thin        Thin 2      Wide        Wide
Clock Speed                 66 MHz      66 MHz      66 MHz      66 MHz
Peak Megaflops              266         266         266         266
Memory Cards                2           2           2           4 or 8
Memory                      64-512 MB   64-512 MB   64-512 MB   64-2048 MB
Memory Bus                  64 bit      128 bit     128 bit     256 bit
Data Cache                  64 KB       128 KB      128 KB      256 KB
Proc. to Data Cache Bus     128 bit     256 bit     256 bit     256 bit
Instruction Cache           32 KB       32 KB       32 KB       32 KB
Disk                        1-9 GB      1-9 GB      1-18 GB     1-18 GB
Microchannel Adapter Slots  4           4           8           8
Level 2 Cache               0-1 MB      0-2 MB      n/a         n/a

The POWER2 processor consists of eight units: an Instruction Cache Unit (ICU), a Fixed-Point Unit (FXU), a Floating-Point Unit (FPU), four Data Cache Units (DCU), and a Storage Control Unit (SCU).

The FXU contains two execution units and handles all integer arithmetic, storage references and logical operations. It also contains the general-purpose registers, a data cache directory and a data translation look-aside buffer. Each execution unit contains an adder and a logical functional unit; the second unit also contains a multiply and divide unit. Two instructions can be executed per clock cycle, but not two multiply or divide instructions. A multiply takes two clock cycles while a divide takes 13 to 17 cycles.
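
As a rough illustration of how these cycle counts show up in practice, the small C timing sketch below contrasts a loop of independent adds, which the two FXU execution units can issue in pairs, with a loop of divides, each of which occupies the divide unit for 13 to 17 cycles. This is our own example, not an SP2-specific benchmark; the iteration count and the use of clock() are arbitrary choices.

    /* fxu_timing.c - illustrative only: contrasts integer add throughput
       with integer divide latency. Not an official SP2 benchmark. */
    #include <stdio.h>
    #include <time.h>

    #define N 10000000L

    int main(void)
    {
        volatile long a = 1, b = 3, s = 0;
        long i;
        clock_t t0, t1, t2;

        t0 = clock();
        for (i = 0; i < N; i++)
            s += a + b;          /* independent adds: up to two per cycle */
        t1 = clock();
        for (i = 0; i < N; i++)
            s += a / b;          /* each divide costs 13-17 cycles on POWER2 */
        t2 = clock();

        printf("adds:    %.2f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("divides: %.2f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
        return 0;
    }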

The FPU also contains two execution units, both double-precision (64-bit), together with the floating-point registers. The two execution units are identical and conform to the IEEE 754 binary floating-point standard. To speed up execution, a fused multiply-add (AxB+C) instruction is available. The multiply-add takes one cycle, so with both units four floating-point operations can be completed per cycle. A hardware square-root instruction is also available, further improving calculation performance. A separate unit for normalizing store data means that a floating-point store effectively takes zero FPU cycles.
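
The kind of inner loop that maps directly onto these multiply-add units is the DAXPY kernel sketched below (our own illustration; the function name is arbitrary and it is assumed the compiler emits fused multiply-adds for the loop body). With both FPU units issuing one fused multiply-add per cycle, the 66 MHz clock corresponds to the peak of roughly 266 megaflops quoted in Table 1.

    /* daxpy.c - illustrative kernel: y = a*x + y.
       Each iteration is one multiply-add, the operation the POWER2 FPU
       executes as a single fused instruction. */
    #include <stddef.h>

    void daxpy(size_t n, double a, const double *x, double *y)
    {
        size_t i;
        for (i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];   /* one fused multiply-add per element */
    }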


2.2 The Switched Network

The IBM SP2 nodes are interconnected by a high-performance packet-switched network. The network is designed to be scalable, with the building block being a two-stage 16 * 16 switch board made up of 4 * 4 bidirectional crossbar switching elements. Each link is bidirectional with a bandwidth of 40 megabytes per second in each direction. The switch uses buffered cut-through wormhole routing to maximize performance. Because the network is bidirectional with any-to-any internode connection, all processors can send messages simultaneously.

As an example, in small systems (up to 64-way, i.e. 64 nodes) only one switch board is required per 16 nodes. A frame is the "box" that houses up to 16 nodes and one switch board. Additional switch stages are required for larger systems. The extra switch boards for these additional stages are packaged in a special switch frame with up to eight switch boards per frame.

An SP2 node connects to the switch board through an intelligent Micro Channel adapter. The adapter has an onboard microprocessor (Intel i860 XR) that offloads some of the work associated with moving messages between nodes. The adapter can move messages to and from processor memory directly via direct memory access (DMA), thus reducing the overhead on the processor node for message processing and significantly improving the sustainable bandwidth. Message cyclic redundancy check (CRC) code generation and checking is also done by the adapter to detect errors in the links, further reducing the overhead on the SP2 node.

The switch always contains at least one stage more than necessary for full connectivity. Since the basic switching element is a 4 * 4 bidirectional crossbar, this extra stage guarantees that there are at least four different paths between every pair of nodes. The redundant paths provide for recovery in the presence of failures (as well as reduce congestion in the switch).

The communication subsystem software complements the hardware capability to provide transparent recovery of lost or corrupted messages. The communication protocol supports end-to-end packet acknowledgment: for every packet sent by a source node, an acknowledgment is returned once the packet has been received by the destination node, so the loss of a packet is detected at the source. The communication subsystem software automatically resends a packet if its acknowledgment is not received within a preset interval of time.
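
The stop-and-wait sketch below illustrates the general acknowledgment-and-retransmit idea in C. It is not the SP2 communication subsystem's actual protocol; the functions, the retry limit and the simulated packet loss are all invented for illustration.

    /* ack_resend.c - illustrative stop-and-wait retransmission. */
    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_RETRIES 5

    /* Placeholder link: transmission always succeeds locally, but the
       acknowledgment is "lost" 25% of the time to simulate link errors. */
    static void send_packet(int seq)  { (void)seq; }
    static int  ack_received(int seq) { (void)seq; return rand() % 4 != 0; }

    /* Send one packet reliably: retransmit until an acknowledgment
       arrives or the retry limit is exceeded. */
    static int reliable_send(int seq)
    {
        int attempt;
        for (attempt = 0; attempt < MAX_RETRIES; attempt++) {
            send_packet(seq);
            if (ack_received(seq))    /* ack returned within the timeout */
                return 0;
            printf("packet %d: no ack, resending\n", seq);
        }
        return -1;                    /* give up and report a link failure */
    }

    int main(void)
    {
        int seq;
        for (seq = 0; seq < 10; seq++)
            if (reliable_send(seq) != 0)
                printf("packet %d: link failure\n", seq);
        return 0;
    }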


2.3 I/O

External servers connect to the SP2 via intermediate gateway nodes. The connection between the gateway nodes and the enterprise servers can be a local area network (LAN) or a high-speed interface such as the high-performance parallel interface (HiPPI), Fiber Channel Standard (FCS), or ATM switches. The external servers provide I/O and file service in response to requests from SP2 compute nodes. By using multiple servers and multiple gateway nodes, the aggregate I/O bandwidth can be scaled.

For high-performance I/O requirements, the SP2 allows I/O and file servers to be integrated into the system by configuring some of the nodes as I/O and file servers. Raw I/O capacity and bandwidth can be arbitrarily increased simply by adding more I/O server nodes.


3 Software and Programming Models

IBM's goal with the SP2 is to support as many as possible of the dominant programming models being used today in technical and commercial parallel processing, and continue to add others over time.

Because of the underlying message-passing architecture, a message-passing programming style is clearly the preferred one for performance on the SP2. Several message-passing libraries callable from FORTRAN and C are supported. The SP2 also supports the data-parallel programming model through High Performance Fortran.
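
As a brief illustration of the message-passing style, the C program below sends a single integer from node 0 to node 1. It assumes that one of the supported libraries provides the standard MPI interface; the compile and launch commands depend on the installation.

    /* mpi_ping.c - minimal point-to-point message-passing example. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, value;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);           /* node 0 sends */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);  /* node 1 receives */
            printf("node 1 received %d from node 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }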

IBM has made no fundamental change to the base RISC System/6000 processor and the AIX operating system for the SP2. This means that any of the major hardware or software options available on the base RISC System/6000 workstations can be installed on an SP2 node. Similarly, several thousand RISC System/6000 applications are available immediately to an SP2 customer.


4 Performance

System performance for a parallel system is determined primarily by its two main building blocks: the individual nodes and the communication subsystem that interconnects them. Node performance is shown in Table 2.

Table 2 SP2 node performance
Benchmark    Thin node   Thin node 2   Wide node   Units
SPECint92    114         122           122         SPEC units
SPECfp92     205         251           260         SPEC units

Table 3 shows a comparison with the established competition - Cray Research's T3D, Thinking Machines' CM5 and Intel's Paragon.

Table 3 Performance for 64-processor systems (GFlops)
SP2 Wide nodes   SP2 Thin nodes   Cray T3D   TMC CM5   Intel Paragon
12.1             9.2              6.4        3.8       2.0


5 Summary

In this short text we have described the IBM SP2's hardware, software and performance. SP2 systems are today being used productively in many different areas, including computational chemistry, crash analysis, electronic design analysis, seismic analysis and as workgroup servers.



This text was written by Oscar Gustafsson, y92oscgu@isy.liu.se, and Anders Wallberg, y91andwa@isy.liu.se, at Heriot-Watt University, 24/04/96.