02-24-2010 - PetaFLOPS for the Common Man Pt 2 – What do current PetaFLOPS systems look like?

Jeff Layton, Ph.D.
Dell Enterprise Technologist - HPC

To borrow a line from one of my favorite movies, “PetaFLOPS systems…. What do they look like?” (bonus points if you can tell me which movie this is based upon). PetaFLOPS systems currently exist and are in production. So what do these systems look like, and how did we reach PetaFLOPS?

In the November 2009 Top500 list, two systems achieve more than one PetaFLOPS of sustained performance on the Top500 benchmark (HPL), although five systems have a theoretical peak performance above one PFLOPS. Let’s examine the two systems that actually sustain one PFLOPS:

1. Jaguar at Oak Ridge National Laboratory
2. Roadrunner at Los Alamos National Laboratory

The systems are similar in that they both achieved at least one PetaFLOPS, but they diverge in the details of how they got there. This difference is very important moving forward because the two designs represent the two options available for reaching PFLOPS-land.


Jaguar:
Jaguar is a Cray XT5-HE system that couples x86_64 processors with Cray’s SeaStar interconnect. It is built from two partitions: an XT5 partition and an XT4 partition. The XT5 partition has 18,688 two-socket nodes, each with two of the new 6-core AMD Istanbul processors running at 2.6 GHz and 16GB of memory, for a total of 224,256 cores. Its nodes are connected by an interconnect developed by Cray called SeaStar 2+. The official Top500 run used 224,162 cores of the XT5 partition.

Figure 1 below from Cray’s website illustrates the building block that they use in their XT5 systems.
Figure 1 – Seastar 2+ layout

The XT4 partition has 7,832 nodes, each with a single four-core AMD processor running at 2.1 GHz and 8GB of memory (31,328 cores in total). This partition uses the SeaStar 2 interconnect.

Because of the very large number of nodes in the system, a classic fat-tree network topology isn’t practical. It would be very expensive and would involve at least two layers of switches, a very large number of cables, and potentially a unique layout because of cable lengths. Instead, Jaguar uses a 3D Torus network topology that connects each node to its six nearest neighbors. This makes for a very cost-effective network, but one with higher latencies than a fat-tree topology.
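
To make the “six nearest neighbors” idea concrete, here is a minimal Python sketch of how neighbors and hop counts work out in a 3D torus. The torus dimensions and function names are hypothetical, chosen only for illustration; they are not Jaguar’s actual layout or Cray’s routing scheme.

```python
def torus_neighbors(coord, dims):
    """Return the six nearest neighbors of a node in a 3D torus.

    coord is the node's (x, y, z) position and dims is the torus size
    along each axis. Both are hypothetical illustration values.
    """
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            n = list(coord)
            # The modulo implements the wraparound link at each edge.
            n[axis] = (n[axis] + step) % dims[axis]
            neighbors.append(tuple(n))
    return neighbors


def torus_hops(a, b, dims):
    """Minimum hop count between two nodes, using wraparound links."""
    hops = 0
    for pa, pb, size in zip(a, b, dims):
        d = abs(pa - pb)
        hops += min(d, size - d)
    return hops


dims = (8, 8, 8)  # hypothetical torus size, not Jaguar's real dimensions
print(torus_neighbors((0, 0, 0), dims))
print(torus_hops((0, 0, 0), (4, 4, 4), dims))
```

The second function shows where the extra latency comes from: a message between distant nodes has to pass through several intermediate nodes, and the worst-case hop count grows with the size of the torus.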

The Nov. 2009 Top500 ranks Jaguar #1 with a performance of 1.759 PFLOPS (http://www.top500.org/system/performance/10184). Here are some statistics from the Top500 run:

  • 224,162 AMD Opteron cores
    • AMD 6-core 2.6 GHz processors
  • Cray Seastar 2+ interconnect
    • 3D Torus
  • 6.95 MW of power
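
As a rough sanity check on these numbers, the sketch below estimates the theoretical peak of the Top500 run and compares it with the measured HPL result. It assumes each core can complete 4 double-precision floating-point operations per clock cycle; that figure is not stated in this article, so treat the result as an estimate.

```python
# Figures quoted in the article for Jaguar's XT5 partition.
NODES = 18_688
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 6
CLOCK_HZ = 2.6e9
CORES_IN_HPL_RUN = 224_162
HPL_PFLOPS = 1.759

# Assumption (not from the article): 4 double-precision floating-point
# operations per core per clock cycle.
FLOPS_PER_CYCLE = 4

total_cores = NODES * SOCKETS_PER_NODE * CORES_PER_SOCKET
peak_pflops = CORES_IN_HPL_RUN * CLOCK_HZ * FLOPS_PER_CYCLE / 1e15
efficiency = HPL_PFLOPS / peak_pflops

print(f"cores in XT5 partition: {total_cores:,}")
print(f"estimated peak of the run: {peak_pflops:.3f} PFLOPS")
print(f"HPL efficiency: {efficiency:.1%}")
```

Under that assumption the run has a theoretical peak of roughly 2.33 PFLOPS, so HPL sustains about three quarters of peak.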

The Green500 is a new “list” that ranks systems by performance per watt (MFLOPS/W). Jaguar ranks fairly high, at 44th on the Nov. 2009 Green500, achieving 253.07 MFLOPS/W.


Roadrunner:

The IBM Roadrunner was the first system to go faster than one PetaFLOPS on the Top500 benchmark. The design of the system is unusual because it is a hybrid system coupling typical processors with Cell processors (http://en.wikipedia.org/wiki/Cell_processor). This approach allowed them to use far fewer network connections and far fewer CPUs and cores.

Roadrunner is a bladed system combining three different types of blades (a so-called “triblade” system). Figure 2 below from the Wikipedia article on Roadrunner (http://en.wikipedia.org/wiki/IBM_Roadrunner) shows these three blades:

Figure 2 – Triblade layout


The main blade in the lower right (LS21) is a two-socket blade with a dual-core AMD Opteron processor (Opteron 2210 - http://en.wikipedia.org/wiki/Opteron) running at 1.8 GHz in each socket and 16GB of DRAM. The LS21 is connected via two HT links (http://en.wikipedia.org/wiki/Hypertransport) to the Expansion Blade (bottom left). This blade then connects to two QS22 blades (top right and top left) via two PCIe x8 slots each (two slots to each QS22 blade). Each QS22 blade has two Cell processors (PowerXCell 8i) running at 3.2 GHz and 8GB of DRAM. Each Triblade combination is connected to the others via an InfiniBand 4x (DDR) connection (shown on the left of Figure 2).

Three Triblades are put inside an IBM BladeCenter H chassis, and four chassis are put in a single rack. Fifteen racks are then connected to a core 288-port Voltaire IB switch. This grouping is called a Connected Unit (CU). The eighteen CUs are connected to a second tier of eight IB switches. Each CU uses 12 uplinks to these eight core switches. Figure 3 below from Wikipedia illustrates the layout.



Figure 3 – Roadrunner layout


Overall Roadrunner has the following hardware statistics:

  • 6,480 Opteron processors (from the LS21 blades)
    • 51.8 TiB RAM
  • 12,960 Cell processors (6,480 QS22 blades)
    • 51.8 TiB RAM
  • (26) 288-port DDR IB switches
  • 296 racks!
  • 2.35 MW of power

In the latest Top500 (Nov. 2009) Roadrunner hit 1.042 PetaFLOPS, but it only used 17 of the 18 CUs. Using 17 of the CUs meant that it used 6,120 Opteron processors (these have 2 cores each) and 12,240 PowerXCell 8i processors (these have 9 cores each). That meant that a total of 122,400 cores were used to achieve the one PetaFLOPS run.
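
The processor and core counts can be rebuilt from the building blocks described earlier (three Triblades per chassis, four chassis per rack, fifteen racks per CU). The following Python sketch does that arithmetic. The constant and function names are hypothetical, and it assumes every Triblade carries two Opterons (one LS21) and four Cell processors (two QS22 blades), as in the Triblade description.

```python
TRIBLADES_PER_CHASSIS = 3
CHASSIS_PER_RACK = 4
RACKS_PER_CU = 15

OPTERONS_PER_TRIBLADE = 2  # one LS21 blade with two sockets
CELLS_PER_TRIBLADE = 4     # two QS22 blades with two Cell processors each
CORES_PER_OPTERON = 2
CORES_PER_CELL = 9         # 1 PPE + 8 SPEs


def roadrunner_counts(connected_units):
    """Count Triblades, processors, and cores for a number of CUs."""
    triblades = (connected_units * RACKS_PER_CU
                 * CHASSIS_PER_RACK * TRIBLADES_PER_CHASSIS)
    opterons = triblades * OPTERONS_PER_TRIBLADE
    cells = triblades * CELLS_PER_TRIBLADE
    cores = opterons * CORES_PER_OPTERON + cells * CORES_PER_CELL
    return {"triblades": triblades, "opterons": opterons,
            "cells": cells, "cores": cores}


print("full system (18 CUs):", roadrunner_counts(18))
print("Top500 run (17 CUs): ", roadrunner_counts(17))
```

For 18 CUs this gives 6,480 Opterons and 12,960 Cell processors, and for 17 CUs it gives the 122,400 cores used in the Top500 run.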

The latest Green500 list ranked Roadrunner #6 (http://www.green500.org/lists/2009/11/top/list.php?from=1&to=100). It achieved 444.25 MFLOPS/W, one of the highest ratios on the list.

Compare/Contrast

These are the two fastest systems in the world and the only two above one PetaFLOPS sustained performance on the Top500 benchmark (HPL). However, they are fairly different from each other in how they got there.

Jaguar took the route of using conventional CPUs with lots of cores per CPU. To surpass one PetaFLOPS it had to use a huge number of cores (224,162). Assuming two-socket nodes where each socket has 6 cores, that is about 18,681 nodes. A conventional fat-tree topology would have required at least two tiers of switches, if not three. It would also have required a large number of cables and perhaps even a unique layout to minimize cable lengths. Overall, the network would have been expensive and difficult to grow. So Jaguar uses a 3D Torus network, which is cheaper than a fat-tree network and scales much more easily. However, it introduces extra latency because of all of the hops between distant nodes.
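
The “two tiers, if not three” estimate can be sketched with a little arithmetic. The Python below assumes an idealized full-bisection fat tree built from identical switches, for which the usual capacity formula is 2 × (ports / 2)^tiers hosts. The switch sizes are hypothetical examples (288 ports is the switch size mentioned for Roadrunner), and a real design would differ in many details.

```python
import math


def fat_tree_capacity(radix, tiers):
    """Maximum hosts in an idealized full-bisection fat tree.

    Assumes identical switches with `radix` ports each.
    """
    return 2 * (radix // 2) ** tiers


def tiers_needed(hosts, radix):
    """Smallest number of switch tiers that can connect `hosts` nodes."""
    tiers = 1
    while fat_tree_capacity(radix, tiers) < hosts:
        tiers += 1
    return tiers


cores_used = 224_162
cores_per_node = 12  # two sockets with 6 cores each
nodes = math.ceil(cores_used / cores_per_node)
print(f"nodes: {nodes:,}")

for radix in (288, 64):  # hypothetical switch sizes
    print(f"{radix}-port switches: {tiers_needed(nodes, radix)} tiers")
```

With very large 288-port switches two tiers are enough in this idealized model, while smaller 64-port switches push the design to three tiers, along with the extra cables and switches that come with them.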

Roadrunner took a different route, combining CPUs with what are generically called “accelerators.” Accelerators are dedicated computing devices that are different from normal CPUs. In the case of Roadrunner the accelerators are Cell processors. Each Cell processor uses a basic CPU core, called a Power Processing Element (PPE), as a “controller” for specialized processing elements called “Synergistic Processing Elements” (SPEs). This allows the SPEs to focus on computation and not have to worry about tasks that the PPE can perform for them. Each Cell processor in Roadrunner has 8 SPEs and one PPE.

This hybrid configuration of accelerators and regular CPUs allows far fewer network connections, although there are still 3,240 Triblades and therefore 3,240 network connections. The design of Roadrunner keeps network costs down by using a full bisection bandwidth network within each CU and then using over-subscription to connect the CUs to each other.

The systems also differ in how they are programmed. Jaguar is a fairly conventional system with CPUs that can use the existing programming models of OpenMP or MPI without much additional effort. Getting applications to scale to that number of cores, however, is a different story and is independent of the design of the system.

Roadrunner, on the other hand, is much more difficult to program. You have to consider not only how to program the CPUs, but also how the accelerators are utilized. For example, how to move data effectively from the CPU to the PPE and then to the SPEs has to be of paramount concern when writing applications. This isn’t an easy process and usually requires extra work.

One other consideration that you may have missed is the power usage. Roadrunner uses 2.35MW of power to reach just a bit above one PetaFLOPS while Jaguar uses 6.95MW of power to reach about 1.76 PetaFLOPS. While Roadrunner is more energy efficient based on the results from the Green500, they both use a great deal of power.
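
The efficiency gap is easy to reproduce from the rounded figures quoted here. This short Python sketch computes MFLOPS/W for both systems. The function name is hypothetical, and because the power numbers in this article are rounded, the results differ slightly from the official Green500 values.

```python
def mflops_per_watt(pflops, megawatts):
    """Convert sustained PFLOPS and power in MW to MFLOPS per watt."""
    mflops = pflops * 1e15 / 1e6
    watts = megawatts * 1e6
    return mflops / watts


jaguar = mflops_per_watt(1.759, 6.95)
roadrunner = mflops_per_watt(1.042, 2.35)

print(f"Jaguar:     {jaguar:.1f} MFLOPS/W")
print(f"Roadrunner: {roadrunner:.1f} MFLOPS/W")
print(f"Roadrunner / Jaguar: {roadrunner / jaguar:.2f}x")
```

By this estimate Roadrunner delivers roughly 1.75 times as many FLOPS per watt as Jaguar.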

Summary

So we’ve seen two very different ways of reaching PFLOPS-land. Each has its attractive and its not-so-attractive features. Both approaches were clearly effective, because both reached at least one PetaFLOPS.

In the next article of this series we’ll take a look at what a PetaFLOPS class machine would look like today using both approaches and what a PetaFLOPS system in a few years could look like. I think you will be surprised by what can be done today.

-- Dr. Jeff Layton
