Thursday, November 18, 2010

New Supercomputer Coming

With more than 300,000 compute cores, Blue Waters will achieve peak performance of approximately 10 petaflops (10 quadrillion calculations every second) and will deliver sustained performance of at least 1 petaflop on a range of real-world science and engineering applications.

Blue Waters will have a peak memory bandwidth of nearly 5 petabytes/second, more than 1 petabyte of memory, 18 petabytes of disk storage, and 500 petabytes of archival storage.

The interprocessor communications network will have a bisection bandwidth far greater than what is available today, greatly facilitating scaling to large numbers of compute cores. The high-performance I/O subsystem will enable the solution of the most challenging data-intensive problems.

Blue Waters will be built from the most advanced computing technologies under development at IBM, including the multicore POWER7 microprocessor, high-performance memory subsystem, and an innovative interconnect, as well as advanced parallel file systems, parallel tools and services, and programming environments.

The base system and computing environment will be enhanced through a multi-year collaboration among NCSA, the University of Illinois, IBM, and members of the Great Lakes Consortium for Petascale Computation. The enhanced environment will increase the productivity of application developers, system administrators, and researchers by providing an integrated toolset to use Blue Waters and analyze and control its behavior.

POWER7 processors

IBM's multicore POWER7 processors will:

  • Include eight high-performance cores. Each core provides 12 execution units and can perform up to four fused multiply-adds per cycle.
  • Feature simultaneous multithreading that delivers up to four virtual threads per core.
  • Have three levels of cache: private per-core L1 instruction and data caches (32 KB each), a private 256 KB L2 cache per core, and a 32 MB on-chip L3 cache that can be used as a shared cache or partitioned into dedicated regions for each core, reducing latency.
  • Combine the dense, low-power attributes of eDRAM (for the L3 cache) with the speed and bandwidth advantages of SRAM (for the L1 and L2 caches) for optimized performance and power usage.
  • Have two dual-channel DDR3 memory controllers, delivering up to 128 GB/sec peak bandwidth (0.5 Byte/flop).
  • Support up to 128 GB DDR3 DRAM memory per processor.
  • Have a clock frequency in the 3.5-4 GHz range (see the worked numbers after this list).
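
To see how these figures fit together, here is a quick back-of-the-envelope check in C (my own arithmetic from the bullet points above, assuming the top 4 GHz clock; not vendor data):

    #include <stdio.h>

    int main(void) {
        /* Figures taken from the bullet list above. */
        const double cores_per_chip = 8.0;
        const double fma_per_cycle  = 4.0;   /* fused multiply-adds per core per cycle */
        const double flops_per_fma  = 2.0;   /* one multiply plus one add */
        const double clock_ghz      = 4.0;   /* top of the 3.5-4 GHz range */
        const double mem_bw_gb_s    = 128.0; /* peak DDR3 bandwidth per chip */

        double gflops_per_core = fma_per_cycle * flops_per_fma * clock_ghz; /* ~32 GF */
        double gflops_per_chip = gflops_per_core * cores_per_chip;          /* ~256 GF */
        double bytes_per_flop  = mem_bw_gb_s / gflops_per_chip;             /* ~0.5 */

        printf("peak per core: %.0f GF, per chip: %.0f GF, memory: %.2f B/flop\n",
               gflops_per_core, gflops_per_chip, bytes_per_flop);
        return 0;
    }

At the top of that clock range this gives roughly 32 GF per core and 256 GF per chip, which is where the 0.5 Byte/flop memory ratio comes from.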

Multi-chip modules

The processors will have six fabric bus interfaces to connect to three other processors in a multi-chip module (MCM). These 1-teraflop tightly coupled shared-memory nodes will deliver 512 GB/sec of aggregate memory bandwidth and 192 GB/sec of bandwidth (0.2 Byte/flop) to a hub chip used for I/O, messaging, and switching.
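
The same arithmetic scaled up to the four-chip MCM recovers the node-level figures quoted above (again assuming the 4 GHz, roughly 256 GF/chip case from the previous sketch):

    #include <stdio.h>

    int main(void) {
        const double chips_per_node  = 4.0;
        const double gflops_per_chip = 256.0; /* per-chip peak estimated above (at 4 GHz) */
        const double mem_bw_per_chip = 128.0; /* GB/s of DDR3 bandwidth per chip */
        const double hub_bw_gb_s     = 192.0; /* GB/s from the MCM to the hub chip */

        double node_gflops = chips_per_node * gflops_per_chip; /* ~1024 GF, i.e. ~1 TF */
        double node_mem_bw = chips_per_node * mem_bw_per_chip; /* 512 GB/s */
        double hub_b_per_f = hub_bw_gb_s / node_gflops;        /* ~0.19, i.e. ~0.2 B/flop */

        printf("node: %.0f GF, %.0f GB/s memory, %.2f B/flop to the hub\n",
               node_gflops, node_mem_bw, hub_b_per_f);
        return 0;
    }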

Interconnect

Blue Waters will employ new IBM interconnect technology that combines concepts from IBM's Federation supercomputing switches and InfiniBand. The high-bandwidth, low-latency interconnect will scale to hundreds of thousands of cores. The key component is the hub/switch chip. The four POWER7 chips in a compute node are connected to a hub/switch chip, which serves as an interconnect gateway as well as a switch that routes traffic between other hub chips. The system therefore requires no external network switches or routers, providing considerable savings in switching components, cabling, and power.

The hub chip provides:

  • 192 GB/s to the directly connected POWER7 MCM;
  • 336 GB/s to seven other nodes in the same drawer, on copper connections;
  • 240 GB/s to 24 nodes in the same supernode (composed of four drawers), on optical connections;
  • 320 GB/s to other supernodes, on optical connections; and
  • 40 GB/s for general I/O.

That is a total of 1,128 GB/s of peak bandwidth per hub chip.
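
Adding those allocations up (and dividing out naive per-link averages) is a one-liner; the sketch below uses only the numbers quoted above:

    #include <stdio.h>

    int main(void) {
        /* Per-hub bandwidth allocations quoted above, in GB/s. */
        const double to_local_mcm  = 192.0; /* directly attached POWER7 MCM */
        const double to_drawer     = 336.0; /* 7 other nodes in the drawer (copper) */
        const double to_supernode  = 240.0; /* 24 other nodes in the supernode (optical) */
        const double to_remote     = 320.0; /* other supernodes (optical) */
        const double to_general_io =  40.0; /* general I/O */

        double total = to_local_mcm + to_drawer + to_supernode + to_remote + to_general_io;
        printf("total per hub: %.0f GB/s\n", total);                  /* 1128 GB/s */
        printf("average per drawer link: %.0f GB/s, per supernode link: %.0f GB/s\n",
               to_drawer / 7.0, to_supernode / 24.0);                 /* 48 and 10 GB/s */
        return 0;
    }

The per-link averages (48 GB/s within a drawer, 10 GB/s within a supernode) are just those divisions; the actual per-link engineering figures may differ.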

The Blue Waters interconnect topology is a fully connected two-tier network. In the first tier, every node has a single hub/switch that is directly connected to the other 31 hub/switches in the same supernode. In the second tier, every supernode has a direct connection to every other supernode. These inter-supernode connections terminate at hub/switches, but a given hub/switch is directly connected to only a fraction of the other supernodes. Messages traveling from a hub/switch in one supernode to another supernode in many cases must be routed through another hub/switch in both the sending and receiving supernodes. Multiple routes are possible with direct and indirect routing schemes. The minimum direct-routed message latency in the system is approximately 1 microsecond.

Indirect routes add an intermediate supernode to the path described above, which (depending on the running program's communication pattern) can be useful for avoiding contention for the link that directly connects the two communicating supernodes. A key advantage of the Blue Waters interconnect is that the number of hops that must be taken by a message passed from any supernode to any other supernode in the system is at most three for direct routes and five for indirect routes, independent of the number or location of the processors that are communicating.
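
Here is a minimal sketch of that hop-count argument, counting each hub-to-hub traversal as one hop (an illustrative model of the description above, not IBM's routing implementation):

    #include <stdio.h>

    /* Count hub-to-hub hops for a message between two different supernodes.
     * Direct route:   (intra-supernode hop to the hub owning the link, if needed)
     *                 -> inter-supernode hop
     *                 -> (intra-supernode hop to the destination hub, if needed)
     *                 => at most 3 hops.
     * Indirect route: same pattern, but routed through an intermediate supernode,
     *                 adding one intra- and one inter-supernode hop => at most 5 hops. */
    static int route_hops(int src_hub_owns_link, int dst_hub_owns_link, int indirect) {
        int hops = 0;
        if (!src_hub_owns_link) hops++; /* reach the hub that owns the outgoing link */
        hops++;                         /* first inter-supernode hop */
        if (indirect) hops += 2;        /* hop inside the intermediate supernode, then a
                                           second inter-supernode hop to the destination */
        if (!dst_hub_owns_link) hops++; /* reach the destination hub inside its supernode */
        return hops;
    }

    int main(void) {
        printf("best-case direct:    %d hop(s)\n", route_hops(1, 1, 0)); /* 1 */
        printf("worst-case direct:   %d hop(s)\n", route_hops(0, 0, 0)); /* 3 */
        printf("worst-case indirect: %d hop(s)\n", route_hops(0, 0, 1)); /* 5 */
        return 0;
    }

The best case is a single inter-supernode hop, when the sending hub happens to own the link that terminates at the receiving hub; the worst cases reproduce the three- and five-hop bounds quoted above.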

The hub chip participates in the cache-coherence protocol within the node, enabling it to extract the data to be sent over the interconnect from either the sending processor's caches or main memory, and to place data from the interconnect into either the L3 cache or the main memory of the receiving processor, potentially reducing memory access latency significantly.

The hub chip provides specialized hardware called the Collectives Acceleration Unit (CAU) to significantly reduce the network latency associated with frequently used collective operations, such as barriers and global sums, offloading this work from the processors. The CAU can carry out multiple independent collective operations simultaneously among different groups of processors.
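
For concreteness, these are the kinds of collectives an MPI application issues constantly; the example below is standard MPI and makes no claim about how the Blue Waters software stack maps it onto the CAU:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Global sum: every rank contributes one value and every rank receives
         * the total. Small, latency-sensitive reductions like this are exactly
         * what a collective-offload unit is meant to accelerate. */
        double local = (double)rank, global = 0.0;
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        /* Barrier: all ranks synchronize before proceeding. */
        MPI_Barrier(MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum over %d ranks = %.0f\n", size, global);

        MPI_Finalize();
        return 0;
    }

Compile and launch with the usual MPI tooling (e.g. mpicc and mpirun); whether a given collective is offloaded to the CAU is a property of the system software, not of this code.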

The hub includes hardware support for global address space operations, active messaging, and atomic updates to remote memory locations. These features help improve the performance of one-sided communication protocols as well as partitioned global address space (PGAS) languages, such as X10, UPC, and Co-Array Fortran.
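
To illustrate what "one-sided" means, here is a small remote-memory-access example using standard MPI-2 as a stand-in (the PGAS languages named above express the same idea directly in the language); it assumes at least two ranks:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Each rank exposes one integer of its memory as an RMA window. */
        int local_cell = -1;
        MPI_Win win;
        MPI_Win_create(&local_cell, sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        /* Rank 0 writes directly into rank 1's memory; rank 1 never posts a
         * matching receive. That is the essence of one-sided communication.
         * (Run with at least two ranks, e.g. mpirun -np 2.) */
        MPI_Win_fence(0, win);
        if (rank == 0) {
            int value = 42;
            MPI_Put(&value, 1, MPI_INT, /* target rank */ 1,
                    /* displacement */ 0, 1, MPI_INT, win);
        }
        MPI_Win_fence(0, win);

        if (rank == 1)
            printf("rank 1 received %d via MPI_Put\n", local_cell);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }

In UPC or Co-Array Fortran the same transfer would be written as a plain assignment to a remote array element; the MPI version just makes the one-sided semantics explicit.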

I/O subsystem

The Blue Waters I/O subsystem will provide more than 18 petabytes of online disk storage with a peak I/O rate greater than 1.5 TB/s aggregate. Redundant servers/paths will be configured and advanced RAID will be used in conjunction with IBM's General Parallel File System (GPFS) to ensure high availability and high reliability.
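
Aggregate I/O rates of this scale are reached when many processes write through the parallel file system at once; the MPI-IO sketch below shows that pattern (the filename and sizes are illustrative, and nothing in it is specific to GPFS):

    #include <mpi.h>
    #include <stdio.h>

    #define ELEMS_PER_RANK 1024 /* illustrative chunk size */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Every rank fills its own chunk of data. */
        double buf[ELEMS_PER_RANK];
        for (int i = 0; i < ELEMS_PER_RANK; i++)
            buf[i] = (double)rank;

        /* All ranks open the same file and write their chunks at
         * non-overlapping offsets in one collective call, letting the MPI-IO
         * layer and the underlying parallel file system coordinate the I/O. */
        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat", /* illustrative filename */
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        MPI_Offset offset = (MPI_Offset)rank * ELEMS_PER_RANK * sizeof(double);
        MPI_File_write_at_all(fh, offset, buf, ELEMS_PER_RANK, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }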

A large near-line tape subsystem, eventually scaling to 500 petabytes, is also directly attached to the Blue Waters system. The tape archive will run HPSS, but will use the GPFS-HPSS interface (GHI) to integrate the tape archive into the GPFS namespace. GHI also allows for the transparent migration of data between disk and tape. The apparent disk space to which users will have direct access will be an order of magnitude greater than what is available today, with corresponding bandwidth increases. The file system and archive will be substantially larger, faster, more reliable, and easier to use than similar systems on today's platforms.

The file system will automate many storage and data transfer tasks. Researchers will have a simplified and easily searched view of their data. They will be able to set lifetime information-management policies that establish where their data are stored, how long data are kept in the faster-access file system instead of the tape-based archive, and how the data are backed up and retrieved. This is markedly more efficient than current systems, where researchers must log into multiple systems, manually transfer their data, keep track of where those data are stored, and confirm that transfers have completed successfully.

