IBM SP2


1 Introduction

The IBM Scalable POWERparallel 2 (SP2) is IBM's current supercomputer series, based on RISC technology. The most famous SP2 around is Deep Blue, the chess computer that almost beat Kasparov in a six-game match in February 1996. In any case, it was an SP2 that controlled the purpose-built board of chess evaluation circuits used in the match. There are currently over 450 SP2 systems installed at customer sites around the world.


2 Hardware

The SP2 system consists of 2 to 512 POWER2 Architecture RISC System/6000 processor nodes, each with its own private memory and its own copy of the AIX operating system, interconnected by a switched network. Each SP2 system also requires a control workstation, a separate RISC System/6000 workstation that serves as the SP2 system console. Parallel processing is based on distributed memory and message passing.


2.1 Processor nodes

There are two different types of processor nodes in an SP2, thin and wide nodes, which can be mixed in one frame. A comparison of the processor nodes is given in Table 1. The thin nodes are meant to be the compute nodes, while the wide nodes are better suited as server nodes. Each node is based on a superscalar POWER2 processor from the RISC System/6000 family.

Table 1 Different processor nodes used in SP2
Processor Type              Thin        Thin 2      Wide        Wide
Clock Speed                 66 MHz      66 MHz      66 MHz      66 MHz
Peak Megaflops              266         266         266         266
Memory Cards                2           2           2           4 or 8
Memory                      64-512 MB   64-512 MB   64-512 MB   64-2048 MB
Memory Bus                  64 bit      128 bit     128 bit     256 bit
Data Cache                  64 KB       128 KB      128 KB      256 KB
Proc. to Data Cache Bus     128 bit     256 bit     256 bit     256 bit
Instruction Cache           32 KB       32 KB       32 KB       32 KB
Disk                        1-9 GB      1-9 GB      1-18 GB     1-18 GB
Microchannel Adapter Slots  4           4           8           8
Level 2 Cache               0-1 MB      0-2 MB      n/a         n/a

The POWER2 processor consists of eight units: an Instruction Cache Unit (ICU), a Fixed-Point Unit (FXU), a Floating-Point Unit (FPU), four Data Cache Units (DCU), and a Storage Control Unit (SCU).

The FXU contains two execution units and handles all integer arithmetic, storage references and logical operations. It also contains the general-purpose registers, a data cache directory and a data translation look-aside buffer. Each execution unit contains an adder and a logical functional unit; the second unit also contains a multiply and divide unit. Two instructions can be executed per clock cycle, but not two multiply or divide instructions. A multiply takes two clock cycles while a divide takes 13 to 17 cycles.
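
As a rough illustration of how these cycle counts show up in practice, the small C timing sketch below contrasts a loop of independent adds, which the two FXU execution units can issue in pairs, with a loop of divides, each of which occupies the divide unit for 13 to 17 cycles. This is our own example, not an SP2-specific benchmark; the iteration count and the use of clock() are arbitrary choices.

    /* fxu_timing.c - illustrative only: contrasts integer add throughput
       with integer divide latency. Not an official SP2 benchmark. */
    #include <stdio.h>
    #include <time.h>

    #define N 10000000L

    int main(void)
    {
        volatile long a = 1, b = 3, s = 0;
        long i;
        clock_t t0, t1, t2;

        t0 = clock();
        for (i = 0; i < N; i++)
            s += a + b;          /* independent adds: up to two per cycle */
        t1 = clock();
        for (i = 0; i < N; i++)
            s += a / b;          /* each divide costs 13-17 cycles on POWER2 */
        t2 = clock();

        printf("adds:    %.2f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("divides: %.2f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
        return 0;
    }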

The FPU also contains two execution units, both double-precision (64-bit), together with the floating-point registers. The two execution units are identical and conform to the IEEE 754 binary floating-point standard. To speed up execution, a fused multiply-add (AxB+C) instruction is available. The multiply-add takes one cycle, so with both units four floating-point operations can be completed per cycle. A hardware square-root instruction is also available, further improving calculation performance. A separate unit for normalizing store data means that a floating-point store effectively takes zero FPU cycles.
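
The kind of inner loop that maps directly onto these multiply-add units is the DAXPY kernel sketched below (our own illustration; the function name is arbitrary and it is assumed the compiler emits fused multiply-adds for the loop body). With both FPU units issuing one fused multiply-add per cycle, the 66 MHz clock corresponds to the peak of roughly 266 megaflops quoted in Table 1.

    /* daxpy.c - illustrative kernel: y = a*x + y.
       Each iteration is one multiply-add, the operation the POWER2 FPU
       executes as a single fused instruction. */
    #include <stddef.h>

    void daxpy(size_t n, double a, const double *x, double *y)
    {
        size_t i;
        for (i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];   /* one fused multiply-add per element */
    }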


2.2 The Switched Network

The IBM SP2 nodes are interconnected by a high-performance packet-switched network. The network is designed to be scalable, with the building block being a two-stage 16 * 16 switch board made up of 4 * 4 bidirectional crossbar switching elements. Each link is bidirectional with a bandwidth of 40 megabytes per second in each direction. The switch uses buffered cut-through wormhole routing to maximize performance. Because the network is bidirectional with any-to-any internode connection, all processors can send messages simultaneously.

As an example, in small systems (up to 64-way, i.e. 64 nodes) only one switch board is required per 16 nodes. A frame is the "box" that houses up to 16 nodes and one switch board. Additional switch stages are required for larger systems. The extra switch boards for these additional stages are packaged in a special switch frame with up to eight switch boards per frame.

An SP2 node connects to the switch board through an intelligent Micro Channel adapter. The adapter has an onboard microprocessor (Intel i860 XR) that offloads some of the work associated with moving messages between nodes. The adapter can move messages to and from processor memory directly via direct memory access (DMA), thus reducing the overhead on the processor node for message processing and significantly improving the sustainable bandwidth. Message cyclic redundancy check (CRC) code generation and checking is also done by the adapter to detect errors in the links, further reducing the overhead on the SP2 node.

The switch always contains at least one stage more than necessary for full connectivity. Since the basic switching element is a 4 * 4 bidirectional crossbar, this extra stage guarantees that there are at least four different paths between every pair of nodes. The redundant paths provide for recovery in the presence of failures (as well as reduce congestion in the switch).

The communication subsystem software complements the hardware capability to provide transparent recovery of lost or corrupted messages. The communication protocol supports end-to-end packet acknowledgment: for every packet sent by a source node, an acknowledgment is returned once the packet has been received by the destination node, so the loss of a packet is detected at the source. The communication subsystem software automatically resends a packet if its acknowledgment is not received within a preset interval of time.
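
The stop-and-wait sketch below illustrates the general acknowledgment-and-retransmit idea in C. It is not the SP2 communication subsystem's actual protocol; the functions, the retry limit and the simulated packet loss are all invented for illustration.

    /* ack_resend.c - illustrative stop-and-wait retransmission. */
    #include <stdio.h>
    #include <stdlib.h>

    #define MAX_RETRIES 5

    /* Placeholder link: transmission always succeeds locally, but the
       acknowledgment is "lost" 25% of the time to simulate link errors. */
    static void send_packet(int seq)  { (void)seq; }
    static int  ack_received(int seq) { (void)seq; return rand() % 4 != 0; }

    /* Send one packet reliably: retransmit until an acknowledgment
       arrives or the retry limit is exceeded. */
    static int reliable_send(int seq)
    {
        int attempt;
        for (attempt = 0; attempt < MAX_RETRIES; attempt++) {
            send_packet(seq);
            if (ack_received(seq))    /* ack returned within the timeout */
                return 0;
            printf("packet %d: no ack, resending\n", seq);
        }
        return -1;                    /* give up and report a link failure */
    }

    int main(void)
    {
        int seq;
        for (seq = 0; seq < 10; seq++)
            if (reliable_send(seq) != 0)
                printf("packet %d: link failure\n", seq);
        return 0;
    }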


2.3 I/O

External servers connect to the SP2 via intermediate gateway nodes. The connection between the gateway nodes and the enterprise servers can be a local area network (LAN) or a high-speed interface such as the high-performance parallel interface (HiPPI), Fiber Channel Standard (FCS), or ATM switches. The external servers provide I/O and file service in response to requests from SP2 compute nodes. By using multiple servers and multiple gateway nodes, the aggregate I/O bandwidth can be scaled.

For high-performance I/O requirements, the SP2 allows I/O and file servers to be integrated into the system by configuring some of the nodes as I/O and file servers. Raw I/O capacity and bandwidth can be arbitrarily increased simply by adding more I/O server nodes.


3 Software and Programming Models

IBM's goal with the SP2 is to support as many as possible of the dominant programming models being used today in technical and commercial parallel processing, and continue to add others over time.

Because of the underlying message-passing architecture, a message-passing programming style is clearly the preferred one for performance on the SP2. Several message-passing libraries callable from FORTRAN and C are supported. The SP2 also supports the data-parallel programming model through High Performance Fortran.
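
As a brief illustration of the message-passing style, the C program below sends a single integer from node 0 to node 1. It assumes that one of the supported libraries provides the standard MPI interface; the compile and launch commands depend on the installation.

    /* mpi_ping.c - minimal point-to-point message-passing example. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, value;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);           /* node 0 sends */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);  /* node 1 receives */
            printf("node 1 received %d from node 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }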

IBM has made no fundamental change to the base RISC System/6000 processor and the AIX operating system for the SP2. This means that any of the major hardware or software options available on the base RISC System/6000 workstations can be installed on an SP2 node. Similarly, several thousand RISC System/6000 applications are available immediately to an SP2 customer.


4 Performance

System performance for a parallel system is determined primarily by its two main building blocks: the individual nodes and the communication subsystem that interconnects them. Node performance is shown in Table 2.

Table 2 SP2 node performance
Benchmark    Thin node   Thin node 2   Wide node   Units
SPECint92    114         122           122         SPEC units
SPECfp92     205         251           260         SPEC units

Table 3 shows a comparison with the established competition - Cray Research's T3D, Thinking Machines' CM5 and Intel's Paragon.

Table 3 Performance for 64-processor systems (GFlops)
SP2 Wide nodes   SP2 Thin nodes   Cray T3D   TMC CM5   Intel Paragon
12.1             9.2              6.4        3.8       2.0


5 Summary

In this short text we have described the IBM SP2's hardware, software and performance. SP2 systems are today being used productively in many different areas, including computational chemistry, crash analysis, electronic design analysis, seismic analysis and as workgroup servers.



This text was written by Oscar Gustafsson, y92oscgu@isy.liu.se, and Anders Wallberg, y91andwa@isy.liu.se, at Heriot-Watt University, 24/04/96.