The XT4 partition has 7,832 nodes, each with a four-core AMD processor running at 2.1 GHz and 8GB of memory (31,328 cores in total). This partition uses a SeaStar 2 interconnect.
Because of the very large number of nodes in the system, a classic fat-tree network topology isn’t practical. It would be very expensive and would involve at least two layers of switches, a very large number of cables, and potentially a unique layout because of cable lengths. Jaguar instead uses a 3D Torus network topology in which each node is connected to its six nearest neighbors. This makes for a very cost-effective network, but one with higher latencies than a fat-tree topology.
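To make the "six nearest neighbors" idea concrete, here is a minimal sketch of how neighbors can be found in a 3D torus. The function name and the 4x4x4 dimensions are hypothetical and chosen only for illustration; they are not Jaguar's actual layout or routing code.

```python
def torus_neighbors(coord, dims):
    """Return the six nearest neighbors of a node in a 3D torus.

    coord: (x, y, z) position of the node.
    dims:  (nx, ny, nz) size of the torus along each axis.
    The modulo makes each axis wrap around, which is what turns
    a plain 3D mesh into a torus.
    """
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % dims[axis]
            neighbors.append(tuple(n))
    return neighbors


# Hypothetical 4x4x4 torus, used only as an example.
DIMS = (4, 4, 4)
corner = torus_neighbors((0, 0, 0), DIMS)
print(corner)
```

Even a node on the "corner" of the grid has six neighbors because the links wrap around, so every node needs the same small, fixed number of connections no matter how large the system grows.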
The Nov. 2009 Top500 list ranks Jaguar #1 with a performance of 1.759 PFLOPS (http://www.top500.org/system/performance/10184). Here are some statistics from the Top500 run:
- 224,162 AMD Opteron cores
- AMD 6-core 2.6 GHz processors
- Cray Seastar 2+ interconnect
- 6.95 MW of power
The Green500 is a new “list” that ranks systems based on the performance per watt (MFLOPS/W). Jaguar ranks fairly high on the list at 44 on the Nov. 2009 Green500. It achieves 253.07 MFLOPS/W.
Roadrunner:
The IBM Roadrunner was the first system to go faster than one PetaFLOPS on the Top500 benchmark. The design of the system is unique because it is a hybrid system coupling typical processors with Cell processors (http://en.wikipedia.org/wiki/Cell_processor). This approach allowed the designers to use far fewer network connections and far fewer CPUs and cores.
Roadrunner is a bladed system combining three different types of blades (a so-called “triblade” system). Figure 2 below, from the Wikipedia article on Roadrunner (http://en.wikipedia.org/wiki/IBM_Roadrunner), shows these three blades:

Figure 2 – Triblade layout
The main blade in the lower right (LS21) is a two-socket blade with each socket having a dual-core AMD Opteron processor (Opteron 2210 - http://en.wikipedia.org/wiki/Opteron) running at 1.8 GHz and 16GB of DRAM. The LS21 is connected via two HT links (http://en.wikipedia.org/wiki/Hypertransport) to the Expansion Blade (bottom left). This blade then connects to two QS22 blades (top right and top left) via two PCIe x8 slots to each QS22 blade. Each QS22 blade has two Cell processors (PowerXCell 8i) running at 3.2 GHz and 16GB of DRAM. Each Triblade combination is connected to the others via an InfiniBand 4x (DDR) connection (shown on the left of Figure 2).
Three Triblades are put inside an IBM BladeCenter H chassis, and four chassis are put in a single rack. Fifteen racks are then connected to a core 288-port Voltaire IB switch. This grouping is called a Connected Unit (CU). The eighteen CU’s are then connected to a second tier of eight IB switches. Each CU uses 12 uplinks to these eight core switches. Figure 3 below from Wikipedia illustrates the layout.
Figure 3 – Roadrunner layout
Overall, Roadrunner has the following hardware statistics:
- 6,480 Opteron processors (from the LS21 blades)
- 12,960 Cell processors (6,480 QS22 blades)
- (26) 288-port DDR IB switches
- 296 racks!
- 2.35 MW of power
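The processor counts in the list can be cross-checked against the packaging described earlier (Triblades per chassis, chassis per rack, racks per CU). The short sketch below does that arithmetic; the variable names are mine, and it counts only compute Triblades, not I/O or service nodes.

```python
# Packaging figures taken from the text above.
TRIBLADES_PER_CHASSIS = 3
CHASSIS_PER_RACK = 4
RACKS_PER_CU = 15
CONNECTED_UNITS = 18

# Each Triblade has one LS21 (2 Opterons) and two QS22 blades (2 Cells each).
OPTERONS_PER_TRIBLADE = 2
QS22_PER_TRIBLADE = 2
CELLS_PER_QS22 = 2

triblades_per_cu = TRIBLADES_PER_CHASSIS * CHASSIS_PER_RACK * RACKS_PER_CU
triblades = triblades_per_cu * CONNECTED_UNITS
opterons = triblades * OPTERONS_PER_TRIBLADE
qs22_blades = triblades * QS22_PER_TRIBLADE
cells = qs22_blades * CELLS_PER_QS22

print(f"Triblades per CU: {triblades_per_cu}")
print(f"Compute Triblades: {triblades:,}")
print(f"Opteron processors: {opterons:,}")
print(f"QS22 blades: {qs22_blades:,}")
print(f"Cell processors: {cells:,}")
```

The 180 Triblades in a CU fit comfortably on a single 288-port switch, which leaves ports free for the uplinks to the second tier.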
In the latest Top500 (Nov. 2009) Roadrunner hit 1.042 PetaFLOPS but it only used 17 of the 18 CU’s. Using 17 of the CU’s meant that it used 6,120 Opteron processors (these have 2 cores each) and 12,240 PowerXCell 8i processors (these have 9 cores each). That meant that a total of 122,400 cores were used to achieve the one PetaFLOPS run.
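The core count for the 17-CU run follows directly from the per-processor core counts. Here is a small sketch of that calculation; the function name is hypothetical, and the 180 Triblades per CU comes from the packaging described earlier (3 per chassis, 4 chassis per rack, 15 racks).

```python
OPTERON_CORES = 2   # dual-core Opteron
CELL_CORES = 9      # 1 PPE + 8 SPE's per PowerXCell 8i


def roadrunner_cores(connected_units, triblades_per_cu=180):
    """Total cores for a given number of Connected Units.

    Each Triblade contributes 2 Opterons and 4 Cell processors.
    """
    triblades = connected_units * triblades_per_cu
    opterons = triblades * 2
    cells = triblades * 4
    return opterons * OPTERON_CORES + cells * CELL_CORES


cores_17_cu = roadrunner_cores(17)
cores_18_cu = roadrunner_cores(18)
print(f"17 CU's: {cores_17_cu:,} cores")
print(f"18 CU's: {cores_18_cu:,} cores")
```

Note that about 90% of the cores counted this way are Cell cores, which is where most of the floating-point performance comes from.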
The latest Green500 list had Roadrunner ranked #6 (http://www.green500.org/lists/2009/11/top/list.php?from=1&to=100). It achieved a 444.25 MFLOPS/W ratio (one of the highest).
Compare/Contrast
These are the two fastest systems in the world and the only two above one PetaFLOPS sustained performance on the Top500 benchmark (HPL). However, they are fairly different from each other in how they got there.
Jaguar took the route of using conventional CPUs with lots of cores per CPU. To surpass one PetaFLOPS it had to use a huge number of cores (224,162). Assuming a two-socket blade where each socket has 6 cores, there are about 18,681 nodes. Using a conventional fat-tree topology would have required at least two tiers, if not three. It would have required a larger number of cables and perhaps even a unique layout to minimize cable lengths. Overall, the network would have been expensive and difficult to grow. So Jaguar uses a 3D Torus network. It is cheaper than a fat-tree network and can scale much more easily. However, it introduces extra latency because a message may have to make many hops between nodes to reach its destination.
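Both the node count and the "extra hops" point can be illustrated with a few lines of arithmetic. This is a rough sketch under stated assumptions: the torus dimensions below are hypothetical (picked only because they are large enough to hold the nodes), and hop counts assume shortest-path routing along each axis.

```python
import math

CORES_USED = 224_162
CORES_PER_NODE = 2 * 6  # two sockets, six cores each

# Round up, because a partially used node is still a node.
nodes = math.ceil(CORES_USED / CORES_PER_NODE)
print(f"Nodes needed: {nodes:,}")


def torus_hops(a, b, dims):
    """Minimum hops between nodes a and b in a torus of size dims."""
    return sum(
        min(abs(p - q), d - abs(p - q))  # go whichever way around is shorter
        for p, q, d in zip(a, b, dims)
    )


def torus_diameter(dims):
    """Worst-case minimum hop count between any two nodes."""
    return sum(d // 2 for d in dims)


# Hypothetical dimensions, for illustration only.
DIMS = (25, 32, 24)
assert math.prod(DIMS) >= nodes

print(f"Example hops: {torus_hops((0, 0, 0), (24, 1, 12), DIMS)}")
print(f"Worst case hops: {torus_diameter(DIMS)}")
```

In a two- or three-tier fat-tree, a message crosses only a handful of switches, whereas in a torus of this size the worst case is dozens of hops. That difference is the latency penalty paid for the cheaper network.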
Roadrunner took a different route, combining CPUs with what are generically called “accelerators.” Accelerators are dedicated computing devices that differ from normal CPUs. In the case of Roadrunner the accelerators are Cell processors. Each Cell processor uses a basic CPU core, called a Power Processing Element (PPE), as a “controller” for specialized processing elements called “Synergistic Processing Elements” (SPE). This allows the SPE’s to focus on computation and not have to worry about tasks that the PPE can perform for them. The Cell processor in Roadrunner has 8 SPE’s and one PPE.
This hybrid configuration of accelerators and regular CPUs allows far fewer network connections, although there are still 3,240 Triblades (180 in each of the 18 CU’s) and therefore 3,240 network connections. The design of Roadrunner keeps the network costs to a minimum by using a full bisection bandwidth network within each CU. Then it uses over-subscription to connect the CU’s to each other.
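To get a feel for the over-subscription between CU's, you can compare the number of node links in a CU with the number of uplinks leaving it. The sketch below assumes that "12 uplinks to these eight core switches" means 12 uplinks to each of the eight second-tier switches; that reading is my assumption, so treat the resulting ratio as illustrative.

```python
from fractions import Fraction

# 3 Triblades per chassis, 4 chassis per rack, 15 racks per CU.
TRIBLADES_PER_CU = 3 * 4 * 15

# Assumption: 12 uplinks from each CU to EACH of the 8 core switches.
UPLINKS_PER_CORE_SWITCH = 12
CORE_SWITCHES = 8

uplinks_per_cu = UPLINKS_PER_CORE_SWITCH * CORE_SWITCHES
oversubscription = Fraction(TRIBLADES_PER_CU, uplinks_per_cu)

print(f"Node links per CU: {TRIBLADES_PER_CU}")
print(f"Uplinks per CU: {uplinks_per_cu}")
print(f"Over-subscription: {float(oversubscription):.3f}:1")
```

Under that assumption, traffic inside a CU sees full bandwidth while traffic between CU's shares roughly half as many links as there are nodes, which is the trade-off that keeps the second tier down to eight switches.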
The systems also differ in how they are programmed. Jaguar is a fairly conventional system with CPUs that can use the existing programming models of OpenMP or MPI without much additional effort. Getting the applications to scale to the number of cores, however, is a different story and is independent of the design of the system.
Roadrunner, on the other hand, is much more difficult to program. You have to consider not only how to program the CPUs, but also how the accelerators are utilized. For example, how the data moves from the CPU to the PPE and then to the SPE’s in an effective manner has to be a paramount concern when writing applications. This isn’t an easy process and usually requires more work than programming a conventional system.
One other consideration that you may have missed is the power usage. Roadrunner uses 2.35MW of power to reach just a bit above one PetaFLOPS while Jaguar uses 6.95MW of power to reach about 1.76 PetaFLOPS. While Roadrunner is more energy efficient based on the results from the Green500, they both use a great deal of power.
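The energy-efficiency comparison is simple arithmetic: divide sustained performance by power. Here is a quick sketch using the rounded numbers quoted in this article; because the inputs are rounded, the results land close to, but not exactly on, the published Green500 figures (253.07 and 444.25 MFLOPS/W).

```python
def mflops_per_watt(pflops, megawatts):
    """Performance per watt in MFLOPS/W.

    1 PFLOPS = 1e9 MFLOPS and 1 MW = 1e6 W.
    """
    return (pflops * 1e9) / (megawatts * 1e6)


# Rounded figures quoted in the article.
jaguar = mflops_per_watt(1.759, 6.95)
roadrunner = mflops_per_watt(1.042, 2.35)

print(f"Jaguar:     {jaguar:.1f} MFLOPS/W")
print(f"Roadrunner: {roadrunner:.1f} MFLOPS/W")
print(f"Roadrunner is about {roadrunner / jaguar:.2f}x more efficient")
```

So Roadrunner delivers roughly 1.75 times as much performance per watt as Jaguar, even though Jaguar is the faster machine overall.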
Summary
So we’ve seen two very different ways that have been used to reach PFLOPS-land. Each has its attractive features and its not-so-attractive features. Both approaches were clearly effective because they both reached at least one PetaFLOPS.
In the next article of this series we’ll take a look at what a PetaFLOPS class machine would look like today using both approaches and what a PetaFLOPS system in a few years could look like. I think you will be surprised by what can be done today.
-- Dr. Jeff Layton