Inside The IBM “Denali” Power E1080 System
November 8, 2021 Timothy Prickett Morgan
With Big Blue not really shipping its Power10-based big, bad “Denali” Power E1080 system in full volume and with full configurations until December, we knew we would have some time to dig down into the architecture of the Denali systems. And with other coverage that also needed doing out of the way, including our initial story on the Power E1080 machine on September 8 and some follow-ons, we are now eagerly taking the lid off the Denali machine and taking a hard look at it for you.
Seriously. Here is the machine with the lid off of one of the nodes that make up the Power E1080:
OK, we were just kidding. We are not going to start there. We will start with an idea first, and it is this: Even shops that do not need such a beast as the Power E1080 should take comfort in the fact that IBM still makes big iron like this. The reason is that it means Power Systems machines are still valued by large enterprises, which means IBM can generate decent revenue and profits – over the long haul – from these high-end systems. There are not very many companies that make machines this big, and as we have talked about already, this market is expected to be on the upswing as enterprise workloads grow. As long as IBM can spread research and development costs across System z and Power Systems, and up and down the Power Systems line, then it can continue to provide small, medium, and large Power-based systems supporting IBM i, AIX, and Linux. That is an impressive support matrix, even if it is for only 150,000 customers or so.
Let’s start this story by reminding you about the Power10 chip, which is shown below with its core and uncore portions of the processor labeled:
We did a deep dive on the Power10 architecture back in August 2020, and have talked about how this is not the first Power10 design that IBM did – it looks like the third iteration as far as we can tell because of the failed 10 nanometer and 7 nanometer manufacturing processes from Globalfoundries and then the move to the Samsung foundry and its 7 nanometer processes. The resulting Power10 design, even if it is later to the field than IBM had planned, is probably more like what IBM might have done with the Power11 chip, particularly when it comes to the matrix math accelerators (MMAs) and memory encryption.
The Power10 chip has a completely redesigned core and a reimplemented and extended Power instruction set, which was created from the ground up to do more stuff, to do more complex stuff, and to do it burning less energy. The Power10 chip has a total of 16 cores, each with 2 MB of L2 cache and an 8 MB L3 cache segment that is proximate to it. The chip is organized into two “hemispheres” with eight cores and 64 MB of L3 cache each, but presents itself logically to the BIOS and operating system as a 16-core chip. Well, not exactly. Because IBM and Samsung knew that 7 nanometers was going to be tough when it came to chip yields, IBM is only selling machines that have 10, 12, or 15 cores activated, at base speeds of 3.65 GHz, 3.6 GHz, and 3.55 GHz, respectively.
The Power10 chip includes the new OpenCAPI Memory Interface (OMI) to DDR4 memory, which uses the same high speed signaling that is used for NUMA interconnects (IBM still calls it symmetric multiprocessing, or SMP, when it is really NUMA) and other accelerator interconnects that used to be called BlueLink but are now called PowerAXON. That OMI interface simplifies the IBM design and lowers power and chip edge length as well as chip surface area required compared to traditional DDR memory controllers, which is why IBM is doing something that is non-standard. (More on this in a moment.) The bottom of the chip has a pair of PCI-Express 5.0 peripheral controllers, which have double the bandwidth of PCI-Express 4.0 controllers, lane for lane.
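That lane-for-lane doubling is simply the jump from 16 GT/s to 32 GT/s signaling. Here is a quick back-of-the-envelope sketch of the raw per-direction numbers, assuming the standard 128b/130b line coding that both generations use (this is our arithmetic, not an IBM figure):

```python
# Rough per-lane PCI-Express throughput, per direction, showing the
# lane-for-lane doubling from Gen4 to Gen5. These are raw link rates
# after line coding, not protocol-efficient numbers.
ENCODING = 128 / 130  # 128b/130b line coding overhead

def gbs_per_lane(gt_per_s):
    # GT/s -> GB/s per lane, per direction
    return gt_per_s * ENCODING / 8

for gen, rate in [("PCI-Express 4.0", 16), ("PCI-Express 5.0", 32)]:
    lane = gbs_per_lane(rate)
    print(f"{gen}: {lane:.2f} GB/s per lane, {lane * 16:.1f} GB/s for an x16 slot")
```

Run it and the Gen5 x16 slot lands at roughly 63 GB/s per direction, exactly double Gen4.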
The Power10 chip has 18 different layers of metal laid down to create its circuits, which contain 18 billion transistors in a chip with an area of 602 square millimeters. It is interesting to contemplate if the “intentional dudding” of at least one core – and sometimes four cores and sometimes six cores – is also true for the L3 cache segments and the OMI controllers, PCI-Express controllers, and NUMA controllers that are also on the die. Presumably, but perhaps not always. The spec sheets indicate that IBM and Samsung are dudding the cores and the L3 cache segments together. Still, we would be willing to bet that there is a way to get a SKU that has more cache if it works, and when Samsung improves its yields on its 7 nanometer processes, we suspect that IBM will offer machines with the full sixteen cores activated. It is also possible to see some creep up in clock speed over time, too. It depends on what customers want.
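Given that the spec sheets tie the L3 cache segments to their cores, the active cache on each shipping SKU is simple to tally – a quick sketch using the per-core figures above:

```python
# Active cache per Power10 SKU, assuming (per the spec sheets) that L3
# segments are dudded along with their cores, so cache scales with core count.
L2_PER_CORE_MB = 2
L3_SEGMENT_MB = 8

for cores, base_ghz in [(10, 3.65), (12, 3.60), (15, 3.55)]:
    l2 = cores * L2_PER_CORE_MB
    l3 = cores * L3_SEGMENT_MB
    print(f"{cores} cores @ {base_ghz} GHz: {l2} MB L2, {l3} MB L3")
```

The 15-core part, for instance, comes in at 30 MB of L2 and 120 MB of L3; a hypothetical full 16-core part would have the complete 128 MB of L3.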
If Google was widely deploying Power10 chips to run its search engines and databases, you can bet that company would be getting chips with all the bells and whistles on and the knobs turned full on. In fact, we think Google would have been using the 48-core Power10 that was supposed to have materialized a few years ago but didn’t. For all we know, Google is taking a hard look at the “Stratus” dual-chip module (DCM) implementation of Power10 – that’s our codename for it, not IBM’s – which will cram 30 cores into a single socket and four sockets into a single machine for 120 cores in a variant of the future Power E1050 server due in May or June next year if all goes well.
It takes a lot of pins to power up all those cores and the caches and on-chip interconnects, and for completeness and because this is a work of art, here is the underside of the Power10 chip, which we have nicknamed “Cirrus” because Big Blue did not give it a proper codename, that shows all of those pins:
Pretty, isn’t it? Here is what the Power10 socket looks like without the heat sink on the socket:
Now, let’s drill down into the Power10 socket and look at a schematic diagram of its I/O:
All you need to remember is 32 Gb/sec. Just about all of the signaling in this chip is running at 32 Gb/sec, as you can see. IBM adds certain blocks for memory, NUMA, and other I/O. IBM is locking the memory interfaces and the PowerAXON interfaces into their specific functions, but the logic blocks all look the same to us, and we think this is being done in firmware and that IBM could change the balance of memory and I/O in the Power10 systems at will – and might do that in the future if customers need a different mix of the two.
There are twice as many DDR4 memory slots (64 of them per four-socket node) associated with the Power10 sockets compared to the number of “Centaur” buffered memory slots in the Power9 sockets used in the Power E980 system. The way it works out, a single node of either machine tops out at 16 TB of memory capacity, but the memory bandwidth in the Power E1080 node is 1.6 TB/sec, an increase of 77.8 percent over the still very impressive 920 GB/sec that IBM was able to deliver in both the Power8-based Power E880C node and the Power E980 node. When you add up all of the other I/O in the Power E1080, it works out to 576 GB/sec, 5.7 percent higher than the 545 GB/sec in the Power E980 and 125 percent higher than the 256 GB/sec in the Power E880C node. The PCI-Express and NUMA interconnects are running faster and faster, but IBM is also being careful to dial the lanes back to keep power consumption under control and to not supply more I/O than is useful. Many workloads are memory bandwidth constrained, not I/O constrained or compute constrained, so this makes sense.
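Those generational deltas check out with simple arithmetic, using the 409 GB/sec per-socket memory figure from the spec sheets (which, times four sockets, is the 1,636 GB/sec that rounds to the quoted 1.6 TB/sec):

```python
# Sanity-checking the generational bandwidth deltas cited above,
# per four-socket node, in GB/s.
mem_bw = {"Power E880C": 920, "Power E980": 920, "Power E1080": 4 * 409}
io_bw  = {"Power E880C": 256, "Power E980": 545, "Power E1080": 576}

def pct_gain(new, old):
    return (new - old) / old * 100

print(f"Memory: {pct_gain(mem_bw['Power E1080'], mem_bw['Power E980']):.1f}% over Power E980")
print(f"I/O: {pct_gain(io_bw['Power E1080'], io_bw['Power E980']):.1f}% over Power E980")
print(f"I/O: {pct_gain(io_bw['Power E1080'], io_bw['Power E880C']):.0f}% over Power E880C")
```

The memory gain comes out at 77.8 percent and the I/O gain over the Power E880C at exactly 125 percent, matching the figures above.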
Here is the schematic of how four of those Power10 sockets are linked to each other on a Power E1080 node:
This is a very similar design in many ways to the Power E880C and the Power E980, and as we have pointed out before, forms the basis of the Power E1050 that will come next year. Basically, IBM has a two-socket design, a four-socket design, and a 16-socket design, and they are all related and are basically just a change in the topology of components using the integrated NUMA circuits on each chip.
Here is what the four-socket node looks like on the outside:
And here is what it looks like with the cover off:
Now, we will actually talk about this for a bit. The four giant heat sinks are obviously the Power10 processors in the Power E1080 node. To the left of them are all of the I/O peripheral slots, and to the left of them are the densely packed OMI memory sticks, and there are sixteen per socket. (Eight channels and two DDR4 differential DIMMs, or DDIMMs, per channel per socket.) By the way, the fatter memory runs slower than the skinnier memory, which is often the case with the Power Systems line, so you do have to sacrifice some memory bandwidth to get maximum capacity because of the thermal load on the system. The 32 GB and 64 GB DDIMMs run at 3.2 GHz and yield 409 GB/sec of bandwidth per Power10 socket when fully loaded; the 128 GB and 256 GB DDIMMs run at 2.93 GHz and deliver 375 GB/sec of bandwidth per socket. That’s about a factor of 2.5X what you can get in a Xeon SP socket, by the way, and about 1.8X more than what the Power E980 could deliver.
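Those per-socket bandwidth figures work out to ordinary DDR4 arithmetic if you assume each DDIMM presents an effective 64-bit (8-byte) DDR4 data path behind its OMI link – that is our inference from the numbers, not something IBM documents – with sixteen DDIMMs per socket:

```python
# Per-socket memory bandwidth, assuming each DDIMM behaves like a 64-bit
# DDR4 DIMM behind its OMI link (our inference), with 8 channels x 2 DDIMMs.
DDIMMS_PER_SOCKET = 16
BYTES_PER_TRANSFER = 8  # 64-bit data path per DDIMM, assumed

def socket_bw_gbs(gt_per_s):
    return gt_per_s * BYTES_PER_TRANSFER * DDIMMS_PER_SOCKET

# 409.6 GB/s, quoted as 409 GB/s in the specs
print(f"32/64 GB DDIMMs @ 3.2 GT/s: {socket_bw_gbs(3.2):.1f} GB/s per socket")
# 375.0 GB/s, matching the quoted 375 GB/s
print(f"128/256 GB DDIMMs @ 2.93 GT/s: {socket_bw_gbs(2.93):.1f} GB/s per socket")
```

The fact that the math lands within a rounding error of the quoted 409 GB/sec and 375 GB/sec suggests the OMI links are not the bottleneck – the DRAM speed is.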
It is safe to say that the OMI memory is a big differentiator for IBM, and the design is made to intercept higher performance and higher capacity DDR5 main memory when it becomes available, and we fully expect for such memory to be made available on the Power E1080. We are not certain if this can be done with a firmware update or if a new Power E1080C machine will have to be announced with MES upgrades. We shall see. The point, which IBM makes in the chart above, is that with DDR5 memory and OMI interfaces, IBM will be able to get more capacity than other servers can do with DDR5 and get memory bandwidth that is approaching that of the HBM2E stacked memory used in accelerators and certain kinds of HPC/AI processors these days.
Here is one of the OMI memory feature cards, just so you see how different they are from regular DIMMs and how similar they are to the Centaur CDIMMs:
One of the neat new things the Power E1080 does is put NUMA connectors – again, IBM keeps calling them SMP links when they are not – right on the Power10 SCM package.
In the past, these connectors came out of the pins and then out to the motherboard, which was fussy (as all signaling is when it comes off chip) and introduced latencies. The new on-package NUMA ports cut down on latency and improve the reliability of the NUMA signals that flow between processors within a node and processors across nodes.
This signaling is the heart of the Power E1080 system, of course. What is new with the Power E1080 is that the X-buses inside of a node and the A-buses across nodes – that is where the “A” and the “X” in AXON come from, with OpenCAPI being the “O” and Networking being the “N,” which we believe was originally supposed to stand for NVLink before IBM and Nvidia failed to get their act together and win the two exascale supercomputer deals with the US Department of Energy at sites where they were the incumbents – all run at 32 Gb/sec, rather than the mix of 25 Gb/sec and 16 Gb/sec that was the case with the Power E980. The number of NUMA ports has been increased so that all of the sockets in the system are fully connected both horizontally and vertically, and no processor (and its main memory) is more than two hops away from any other processor (and its main memory) within the system. If IBM wanted to create a system with more sockets, some sockets would be three hops away. You would probably cluster together pairs of four-way nodes and cross connect four or eight pairs to make 32-socket or 64-socket machines.
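That two-hop guarantee is easy to verify with a breadth-first search over the topology as described – 16 sockets, fully meshed within each four-socket node and linked to the same socket position in each of the other three nodes (that mapping of “horizontal” and “vertical” is our reading of the description):

```python
# Verifying the two-hop NUMA property: 4 nodes x 4 sockets, fully connected
# within each node, and each socket linked to its counterpart position in
# every other node (our reading of the Power E1080 topology description).
from collections import deque
from itertools import product

sockets = list(product(range(4), range(4)))  # (node, position)

def linked(a, b):
    same_node = a[0] == b[0] and a[1] != b[1]  # intra-node bus
    same_pos = a[1] == b[1] and a[0] != b[0]   # inter-node bus
    return same_node or same_pos

def max_hops(start):
    # Breadth-first search to find the farthest socket from `start`.
    dist = {start: 0}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        for s in sockets:
            if s not in dist and linked(cur, s):
                dist[s] = dist[cur] + 1
                queue.append(s)
    return max(dist.values())

print(max(max_hops(s) for s in sockets))  # -> 2
```

Each socket has six direct neighbors (three in its own node, three counterparts in the other nodes), and every other socket is reachable by hopping within a node and then across, or vice versa – hence the two-hop ceiling.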
So, there is the Denali machine, finally. Up next, we are going to figure out what this thing costs and the performance it delivers, and what IBM and its partners will use as a sales pitch to try to get customers with earlier generations of big Power iron to move to the Denali system.