Drilling Down Into The Power E980
August 20, 2018 Timothy Prickett Morgan
On August 7, IBM announced the two machines that are based on the “Cumulus” scale-up versions of the Power9 processors – the Power E950 midrange box that scales up to four sockets and the Power E980 big iron machine that scales up to 16 sockets. The IBM i platform is not available on the Power E950, so IBM i customers who need more than a two-socket Power9 machine to support their workloads have to make a jump to a single-node implementation of the Power E980.
This is the same situation that customers faced during the Power8 era, so it is nothing new. What is new, of course, is that the Power E980 has more oomph per processor and better NUMA interconnect within the system nodes and across the system nodes, as we discussed last fall when we caught wind of the “Fleetwood/Mack” Power E980 system and discussed briefly in our initial analysis of the Cumulus-based systems last week.
Here is the overview of the Power E980 system, which is arguably the most powerful NUMA system available on the market at the moment, surpassing even IBM’s own System z14 mainframe systems. Both sets of processors are manufactured using 14 nanometer processes by GlobalFoundries, which bought IBM’s Microelectronics chip making division many years ago. Take a look:
As has been the case since the Power5 generation back in 2005, IBM is using a modular approach to building its largest Power-based NUMA servers. There is a set of NUMA links that come off the processors – running at 16 Gb/sec with the Power9 processors – and these are used to cross couple four CPUs into a single image within a single node. Supplemental links are used to lash as many as four compute nodes together to get that 16-way scalability; these are based on the “Bluelink” 25 Gb/sec signaling that IBM created for NUMA connections as well as to hook various peripherals to the compute complex using a number of protocols, such as OpenCAPI and NVLink. Here is what it looks like schematically:
The Cumulus Power9 chip has three X-Bus ports, so all four processors can be linked such that they are only one hop away from each other to access each other’s shared memory. These updated X-Bus links run at twice the speed compared to those used within a Power8 node in a Power E870 or Power E880 machine, and combined with fewer hops between CPUs, this should result in radically better NUMA scaling and tighter coupling of the processors so that more of the CPU cycles are actually used to do real work. The A-Bus links based on the 25 Gb/sec Bluelink ports have four times the bandwidth of those used in the Power8 machines, so this should also help with NUMA scaling. The upshot should be that performance is better on the Power E980 compared to the Power E880 than you might expect based on prior NUMA efficiency. (A lot depends on the latency of those A-Bus links, which may not have come down a lot in the transition to the Power9 chips.)
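To make the one-hop claim concrete, here is a minimal Python sketch of the port arithmetic. It assumes that each X-Bus port carries exactly one point-to-point link to another socket in the same node; the function name `full_mesh` is ours, for illustration only, and is not anything from IBM.

```python
from itertools import combinations

def full_mesh(sockets, ports_per_socket):
    """Return the socket pairs of an all-to-all topology.

    Assumption: one port is consumed per direct link, so every socket
    needs (sockets - 1) ports to reach all of its peers in one hop.
    """
    ports_needed = sockets - 1
    if ports_needed > ports_per_socket:
        raise ValueError(
            f"{sockets} sockets need {ports_needed} ports each, "
            f"only {ports_per_socket} available"
        )
    return list(combinations(range(sockets), 2))

# Four sockets per node, three X-Bus ports per chip (figures from the text).
links = full_mesh(sockets=4, ports_per_socket=3)
print(len(links))  # 6 direct links, so every pair of sockets is one hop apart
```

With only two ports per chip, the same call would raise an error, which is the case where some sockets end up two hops apart.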
There are four different processor options in the Power E980, as follows, and they are all in the P30 software group for IBM i. This is a fairly pricey tier, and it is therefore a big jump from the two-socket Power S924 machine, which is in the much more affordable P20 software group. (Certain models of the entry Power9 machines are in the P05 and P10 software tiers, which are even less expensive per core.) Here are the processors:
The 11-core chip is a little odd for those thinking in base 2, but if the yield fits, sell it. The base clock speed of the cores goes up as the active core count goes down, as you might expect because this is what the Laws of Thermodynamics demand within a relatively constant power and thermal envelope. The processors used in the Power E980 have a maximum of a dozen cores, and they have eight-way threading (what IBM calls SMT8, short for simultaneous multithreading). We have not seen pricing yet, so we are not sure if the chips with a lot of cores cost more or less than those with fewer cores. Our guess is that the chips with eight cores and twelve cores cost more per core than those with 10 or 11 cores. But that is just a guess.
As for memory, the Power9 chip has eight embedded DDR4 memory controllers on it, and with the “Centaur” memory buffers, they can support four memory sticks per memory controller, for a total of 32 memory slots per socket. Memory all runs at 1.6 GHz, which is not all that fast in the DDR4 generation, which can be cranked to 3 GHz in commercial servers, but interleaving memory with buffers usually also means stepping down the speed. You give up some latency to any particular memory stick to hide a lot more latency across the memory sticks from the point of view of the processors. In any event, each socket delivers 230 GB/sec of bandwidth, which works out to an aggregate of 3.6 TB/sec across the compute complex in a fully loaded Power E980 machine with 16 sockets. (Meaning not the maximum memory capacity, but that all memory slots are full and pumping data.)
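The slot and bandwidth figures above are simple multiplication, and a quick Python sketch lets you check them. The constant names are ours; the values are the ones quoted in the text, and we assume decimal units (1 TB/sec = 1,000 GB/sec).

```python
# Figures quoted in the text; names are illustrative.
CONTROLLERS_PER_SOCKET = 8      # embedded DDR4 memory controllers per Power9 chip
STICKS_PER_CONTROLLER = 4       # memory sticks behind each buffered controller
BANDWIDTH_PER_SOCKET_GBS = 230  # peak memory bandwidth per socket, GB/sec
MAX_SOCKETS = 16                # four nodes of four sockets each

slots_per_socket = CONTROLLERS_PER_SOCKET * STICKS_PER_CONTROLLER
aggregate_gbs = BANDWIDTH_PER_SOCKET_GBS * MAX_SOCKETS
aggregate_tbs = aggregate_gbs / 1000  # assumes decimal TB

print(slots_per_socket)  # 32
print(aggregate_tbs)     # 3.68
```

The product comes to 3.68 TB/sec, which the figure in the text rounds down to 3.6 TB/sec.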
The Centaur buffers have an integrated L4 cache memory in them, as they did in the Power8 servers, and that 16 MB of extra cache helps keep memory moving smoothly between DRAM on the system board and the Power9 chip’s memory controllers, which feed into L3 cache. With one buffer per memory controller, each socket has 128 MB of this L4 cache memory and 410 GB/sec of peak bandwidth from the L4 cache to main memory, which works out to 1.6 TB/sec of L4 bandwidth into and out of the memory controllers across a four-socket Power E980 node.
The Power E980 memory feature cards that have 1 TB and 2 TB capacity are not normal in the server industry, and we reckon they are a bit pricey given that they use stacked memory. Any DDR3 or DDR4 CDIMM memory cards bought with prior Power E870 and Power E880 machines can be moved over to the Power E980, but you can only double up the maximum memory on a fully loaded system by shifting to the 512 GB DIMM sizes on that 2 TB feature card. We do not expect many IBM i shops to need to do this, or to be able to afford it. The thing to remember is that DDR3 and DDR4 memory, of different capacities, can be mixed across the system, but the memory cards have to be the same type and capacity within a single node.
Here is the schematic of the Power E980 node, which comes in a 5U enclosure (1U taller than the Power E950 to accommodate all of that interconnect for NUMA):
And here is an exploded view of the components, which we always think is fun:
The Power E980 is very much like the Power E880, if you look at the feeds and speeds of the components in the system. They scale to the same number of sockets and cores, and they both have eight PCI-Express slots, with 545 GB/sec of peak aggregate bandwidth. The difference here is bandwidth, with the Power E980 supporting PCI-Express 4.0 slots, which have twice the bandwidth of the PCI-Express 3.0 slots used in the Power E880. The maximum memory capacity is double, to be sure, but only if you toss out all of your memory and buy cards that are twice as dense. They both scale to 192 cores in a single image using processors with a dozen cores, although here again, the Power9 chip will yield somewhere between 40 percent and 50 percent more compute oomph per core, so in theory, a Power9 machine with 128 cores to 140 cores should be able to do about the same work as a Power8 machine with 192 cores. The Power E980 has faster CAPI and OpenCAPI ports, and twice the I/O bandwidth thanks to the move to PCI-Express 4.0 and Bluelink ports, and it also supports four NVM-Express bays for linking flash more tightly to the compute complex.
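The core equivalence in that comparison follows from dividing the Power8 core count by the per-core uplift. Here is a rough Python sketch, assuming the uplift applies uniformly to every core and that NUMA scaling overhead is the same on both machines, neither of which will hold exactly in practice. The helper name `equivalent_cores` is ours.

```python
def equivalent_cores(power8_cores, per_core_uplift):
    """Power9 cores needed to match a Power8 core count.

    per_core_uplift is fractional: 0.40 means 40 percent more work per core.
    Assumes perfectly linear scaling, which real NUMA machines do not deliver.
    """
    return power8_cores / (1.0 + per_core_uplift)

low_uplift = equivalent_cores(192, 0.40)   # pessimistic end of the range
high_uplift = equivalent_cores(192, 0.50)  # optimistic end of the range

print(round(low_uplift))   # 137
print(round(high_uplift))  # 128
```

That gives a band of roughly 128 to 137 cores, in line with the 128 cores to 140 cores quoted above once you round up to a configuration you can actually buy.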
When we made our guesses about what the Fleetwood/Mack Power E980 system might look like back in October 2017, we pretty much nailed it in terms of the memory scalability and processor core count. We were more aggressive about clock speeds, assuming that IBM could push it further – between 4.25 GHz on the chips with lots of cores and 4.75 GHz on the chips with fewer cores. That didn’t happen, even with the shift from 22 nanometers with the Power8 chips to 14 nanometers with the Power9. We didn’t expect the 11-core option, so that was a surprise, and we don’t know how the performance stacks up for the various core SKUs and combinations yet, so we don’t know what to think there. As soon as we figure it out, we will let you know.
The one thing that we cannot find is detailed pricing on the Power E980. The announcement letter says that the base unit in the Power E980 costs $20,000, but it does not explain what that includes and that is a useless price anyway. We know these machines cost hundreds of thousands to millions of dollars configured.