Wednesday, August 24, 2011

The IBM Blue Gene/Q project - an 18 core homogeneous SoC for exascale simulation

The Blue Gene/Q SoC integrates 18 homogeneous cores in a 1.47-billion-transistor, 45nm SOI-CMOS chip.

At the Hot Chips Conference held at Stanford University last week, IBM design manager Ruud Haring presented details of the Blue Gene/Q project, an international collaborative effort between geographically dispersed teams of IBM engineers, Columbia University and the University of Edinburgh in Scotland. The primary objective of the project is to develop a system for massively parallel supercomputing that can be used for large-scale scientific and analytics applications, while also laying the groundwork for 'exascale' computing. Exascale refers to the target of achieving a 1000X increase in performance over today's fastest supercomputer. That designation currently belongs to the Tianhe-1A, which was built by Chinese researchers. The U.S. Department of Energy is leading an effort to maintain U.S. leadership in computing, and has contributed funding to the Blue Gene/Q project through the Argonne and Livermore national laboratories.
Supercomputers enable simulation - that is, the numerical computations to understand and predict the behavior of scientifically or technologically important systems - and therefore accelerate the pace of innovation. Simulation enables better and more rapid product design. Simulation has already allowed Cummins to build better diesel engines faster and less expensively, Goodyear to design safer tires much more quickly, Boeing to build more fuel-efficient aircraft, and Procter & Gamble to create better materials for home products. Simulation also accelerates the progress of technologies from laboratory to application. The United States must excel at such tasks to compete in a rapidly developing global economy. - DOE Undersecretary for Science Steven E. Koonin

Haring provided a tour of the roughly 19mm X 19mm die, containing 1.47 billion transistors, which IBM has fabricated in a 45nm SOI (silicon on insulator) CMOS process. The die photo shown here was taken prior to deposition of metal layers. There are a total of 18 identical instances of IBM's A2 processor core: 16 for user applications, 1 for operating system services, and 1 "redundant" core as extra insurance against yield fallout in this complex SoC (system on chip). The processor core was described as being mostly the same as the one in the PowerEN (Power Edge of Network) chip, which IBM presented at the 2010 ISSCC (International Solid State Circuits Conference). The cores run the 64-bit Power instruction set architecture, and are operated at 1.6 GHz with a 0.8 volt supply, though the design is capable of operation at 2.4 GHz. The design team scaled back voltage and clock frequency in order to reduce active power consumption and leakage.

Each processor core on the Blue Gene/Q has a dedicated quad FPU (floating point unit), a 4-wide double-precision SIMD (single instruction multiple data) architecture which can support 8 concurrent operations. The processors share a central 32MB DRAM L2 cache, which is unique in supporting transactional memory, speculative execution, and atomic operations. The left and right sides of the SoC are dedicated to dual memory controllers that support 16 GB of external DDR3 memory at a 1.33 Gb/s data rate, configured as two 16-byte-wide interfaces plus supplemental ECC (error-correcting code) bits.
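The memory figures above imply a peak DDR3 bandwidth that is easy to work out. The short sketch below does the arithmetic; it assumes that "1.33 Gb/s" is the per-pin rate (1.333 billion transfers per second) and that the ECC bits carry no user data. The function and constant names are made up for illustration.

```python
# Figures quoted in the article.
CONTROLLERS = 2            # dual memory controllers
BYTES_PER_TRANSFER = 16    # each interface is 16 bytes wide (ECC bits excluded)

# Assumption: "1.33 Gb/s" is the per-pin data rate, i.e. 1.333e9 transfers/s.
TRANSFERS_PER_SECOND = 1.333e9


def peak_memory_bandwidth_gb_s(controllers=CONTROLLERS,
                               bytes_per_transfer=BYTES_PER_TRANSFER,
                               transfers_per_second=TRANSFERS_PER_SECOND):
    """Peak (not sustained) bandwidth in GB/s, summed over all controllers."""
    return controllers * bytes_per_transfer * transfers_per_second / 1e9


print(f"Peak DDR3 bandwidth: {peak_memory_bandwidth_gb_s():.1f} GB/s")
```

Under those assumptions the two interfaces together peak at roughly 42.7 GB/s; sustained bandwidth in a real system would be lower.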

Built-in chip-to-chip networking enables each Blue Gene/Q chip to communicate with 10 neighbors, through on-chip router logic that operates autonomously from the cores. Packets for complex floating point operations can flow through the network without disturbing the processor cores. A crossbar switch in the center of the die was manually laid out in order to optimize wiring. Peak performance for the chip was specified at 204.8 GFLOPS with 55 W power dissipation.
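The 204.8 GFLOPS figure can be reproduced from numbers given earlier in this post: 16 user cores, a 1.6 GHz clock, and 8 floating point operations per core per cycle. The sketch below assumes the 8 operations come from a fused multiply-add (2 operations) on each of the 4 SIMD lanes, and that the operating system core and the spare core are not counted toward the peak. Names are illustrative.

```python
USER_CORES = 16        # cores available to applications
CLOCK_GHZ = 1.6        # operating frequency
SIMD_LANES = 4         # 4-wide double-precision quad FPU
# Assumption: each lane completes one fused multiply-add (2 flops) per cycle,
# which is how 4 lanes yield the "8 concurrent operations" quoted in the talk.
FLOPS_PER_LANE = 2
CHIP_POWER_W = 55      # quoted power dissipation


def peak_gflops(cores=USER_CORES, clock_ghz=CLOCK_GHZ,
                lanes=SIMD_LANES, flops_per_lane=FLOPS_PER_LANE):
    """Theoretical peak in GFLOPS: cores x GHz x flops per cycle per core."""
    return cores * clock_ghz * lanes * flops_per_lane


def gflops_per_watt(gflops, watts=CHIP_POWER_W):
    """Chip-level energy efficiency at peak."""
    return gflops / watts


peak = peak_gflops()
print(f"Peak: {peak:.1f} GFLOPS")
print(f"Efficiency: {gflops_per_watt(peak):.2f} GFLOPS/W")
```

This is a theoretical peak for the chip alone; it says nothing about sustained application performance or the power drawn by memory and the rest of the system.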

The L2 cache provides the point of coherency between cores. Users can define the start and end of transactions in the L2 cache, and the cache is designed to guarantee "atomicity", which IBM said eliminates the need for locks. Load/store conflicts are automatically detected and reported, so that they can be resolved by software. Haring showed data demonstrating that transactional memory with atomic operations reduced the number of L2-to-processor round trips, cutting the cost of load/store operations from 14,000 CPU cycles to fewer than 1,000.
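For readers unfamiliar with transactional memory, the idea of "detect the conflict, report it, let software resolve it" can be illustrated with a toy software model. This is only an analogy under stated assumptions: it is single-threaded Python, it tracks versions per address, and it has nothing to do with how IBM's L2 hardware is actually implemented. All class and function names are hypothetical.

```python
class VersionedMemory:
    """Toy shared memory that counts committed writes per address."""

    def __init__(self):
        self._data = {}      # address -> value
        self._version = {}   # address -> number of committed writes

    def read(self, addr):
        return self._data.get(addr, 0), self._version.get(addr, 0)

    def commit(self, read_set, write_set):
        """Apply write_set only if nothing in read_set changed meanwhile."""
        for addr, seen_version in read_set.items():
            if self._version.get(addr, 0) != seen_version:
                return False  # conflict: someone else committed first
        for addr, value in write_set.items():
            self._data[addr] = value
            self._version[addr] = self._version.get(addr, 0) + 1
        return True


class Transaction:
    """Buffers stores and remembers the version of everything it loaded."""

    def __init__(self, mem):
        self.mem = mem
        self.read_set = {}
        self.write_set = {}

    def load(self, addr):
        if addr in self.write_set:          # read our own buffered store
            return self.write_set[addr]
        value, version = self.mem.read(addr)
        self.read_set.setdefault(addr, version)
        return value

    def store(self, addr, value):
        self.write_set[addr] = value        # invisible until commit

    def commit(self):
        return self.mem.commit(self.read_set, self.write_set)


def atomic_increment(mem, addr, max_retries=10):
    """Software conflict resolution: retry the transaction until it commits."""
    for _ in range(max_retries):
        tx = Transaction(mem)
        tx.store(addr, tx.load(addr) + 1)
        if tx.commit():
            return True
    return False


mem = VersionedMemory()
t1, t2 = Transaction(mem), Transaction(mem)
t1.store("counter", t1.load("counter") + 1)
t2.store("counter", t2.load("counter") + 1)
print(t1.commit())   # first transaction commits
print(t2.commit())   # second one is rejected because "counter" changed
```

No lock is ever taken: the second transaction simply learns that it lost the race and can be retried, which is the role the article assigns to software. A store to an address that was never loaded is not checked in this toy model.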

The Blue Gene/Q is designed so that the health of each processor core is assessed through a scan chain during manufacturing test. If a single core is bad, a logical-to-physical re-mapping is performed to shift the bad core to the last ID in the 18-core sequence. Haring said that this approach was "totally inspired" by array redundancy. The benefit is increased yield and reduced TTM (time to market), since a perfect die with 1.47B transistors is not required. Redundancy can be invoked at any manufacturing test stage, from wafer test to module, card, or system. The redundancy information can be stored in the chip's eFuses, or in EEPROM (electrically erasable programmable read-only memory), so that a defective core can be shut down at boot time. The operation is totally transparent to system software.
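The re-mapping step can be pictured with a few lines of code. This is a minimal sketch of one plausible scheme, not IBM's actual implementation: it assumes the good cores keep their relative order and the failing core takes the last logical ID, where it can be left powered down. The function name and data layout are invented for illustration.

```python
TOTAL_CORES = 18  # 16 user cores + 1 OS core + 1 spare


def build_core_map(bad_core=None, total=TOTAL_CORES):
    """Return a list where index = logical core ID and value = physical core.

    Assumption: good cores keep their relative order, and the one bad core
    (if any) is moved to the last logical ID.
    """
    physical = list(range(total))
    if bad_core is None:
        return physical                      # perfect die: identity mapping
    if not 0 <= bad_core < total:
        raise ValueError(f"bad_core must be in 0..{total - 1}")
    good = [p for p in physical if p != bad_core]
    return good + [bad_core]


core_map = build_core_map(bad_core=5)
print(core_map[:17])   # physical cores backing logical IDs 0..16
print(core_map[17])    # the defective core, parked at the last logical ID
```

Software that only ever addresses logical IDs 0 through 16 sees 17 working cores either way, which is what makes the scheme transparent.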

At assembly, the Blue Gene/Q chip and 16GB of DDR3 chips are solder-attached directly to the same card, electrically connected through the Blue Gene/Q's integrated DDR3 PHY (physical interface) I/O blocks. The compute cards can use either an air-cooled or water-cooled heat sink. A total of 32 cards, each containing one 16-core processor, constitute a 512 processor node card, and node cards are then paired into a 1024 processor rack.

Design Challenges

Haring emphasized that minimizing power was the overarching challenge for the Blue Gene/Q design methodology. The team employed architecture and logic level clock gating, and tuned the processor cores for low power. Power-aware synthesis was used, and the physical design of the clock networks also placed an emphasis on power efficiency. The processor cores originated in a high-speed custom design methodology, but the rest of the chip was implemented as an ASIC. The merging of custom and ASIC methodologies presented a challenge. Functional verification of the transactional memory was difficult, and required extensive use of cycle simulation and hardware acceleration. The designers built a complete multi-FPGA prototype, which was a key to achieving first-pass silicon success.

Finally, Haring said that he would like to speak with any EDA representatives in the audience to find a way to parallelize test pattern generation with full-chip models, which he said is a weeks-long bottleneck.
