Power9 Prime Previews Future Power10 Memory Boost
September 30, 2019 Timothy Prickett Morgan
Customers who are deploying Linux on Power Systems iron from Big Blue are about to get a very substantial boost in memory bandwidth as IBM is getting ready to launch the Power9’ (that’s a prime symbol, not an apostrophe or a typo, after the 9) processor. As we previewed back in March 2018, the Power9’ chip will feature a substantially upgraded memory subsystem that has a new, faster SERDES signaling technology, normally used for various kinds of system and accelerator interconnect, that has been tweaked to support memory buffers and therefore DDR4 and future memories.
There is no technical reason why the Power9’ processor can’t run the IBM i and AIX operating systems created by IBM itself, but it looks like Red Hat Enterprise Linux, which IBM now controls thanks to its acquisition of the commercial Linux distributor, will be the key operating system running on the Power9’. The reason why we bring this up is that if you happen to have an IBM i or AIX application that really requires a lot more memory bandwidth than Big Blue is delivering today, then you at least have a shot at asking for an RPQ system that can tide you over until the Power10 iron comes out in 2021. (We suspect it will be early 2021.)
The new Power9’ memory subsystem adheres to what IBM is calling the OpenCAPI Memory Interface, or OMI for short. Jeff Stuecheli, Power Systems architect at the company, and Scott Willenborg, engineer of the Power9’ processor, presented the new OMI memory at the recent Hot Chips conference that was hosted at Stanford University. We were on hand for the presentation and wanted to give you a sense of what IBM has done with the Power9’ chip with regard to memory and what this portends for the Power10 processor.
First, let’s review the memory and I/O embodied in IBM’s Power chips for the past decade. Take a look:
IBM has been using buffered memory for a long time in its Power chips, and this is one of the reasons why it could attach more memory per socket than the typical X86 processor available at the same time as well as drive more memory bandwidth, even though it was using memory that typically ran a lot slower than that used by the X86 competition of the time. The “Centaur” memory buffer was designed for the Power8 and Power8’ processors launched in 2014 and 2016, respectively, and we have talked about the benefits of these memory buffers in the past. As you can see from the table above, the Power7 and Power7+ chips were relatively slow, with only 65 GB/sec of sustained memory bandwidth (as gauged by the STREAM memory bandwidth benchmark test), but the Power8 and Power8’ chips (the latter a tweak that added NVLink to the Power architecture to directly attach Nvidia Tesla GPU accelerators to Power compute complexes) had 210 GB/sec of sustained memory bandwidth per socket.
With the Power9 processors, IBM bifurcated the processor line, with a scale-out version that had DDR4 memory controllers etched on the chip and therefore able to directly drive DDR4 memory – that’s the “Nimbus” P9 SO chip in the table above. The Centaur memory buffer chip was retained in the architecture of the “Cumulus” Power9 SU chip in the table above, and as you can see, ditching the Centaur buffer in the Power9 SO chip dropped the sustained memory bandwidth per socket by 28.6 percent to 150 GB/sec, while the memory bandwidth on the Power9 SU chip stayed the same at 210 GB/sec. By the way, that is about twice as much as Intel can drive with a fat “Skylake” or “Cascade Lake” Xeon SP processor.
With the Power9’ and Power10 chips, the memory bandwidth jump per socket, thanks to the OpenCAPI memory, is going to be not only higher, but substantially higher than Big Blue had planned a year and a half ago. Back then, a single socket of Power9’ was expected to deliver a pretty stunning 350 GB/sec of sustained memory bandwidth, but now that the OpenCAPI memory is being perfected, IBM has bumped that up to a pretty amazing 650 GB/sec. And with the Power10 chip, the plan a year and a half ago was to push OpenCAPI memory to deliver up to 435 GB/sec of sustained memory bandwidth, but in the latest roadmap we see that it is humming along at 800 GB/sec.
By the way, that is almost the same memory bandwidth as what an Nvidia Tesla GPU accelerator can get with stacked High Bandwidth Memory (HBM2), and this could shift some of the balance of compute back to the CPU from the GPU for a number of applications, or at the very least provide a kind of balance of power as well as a balance in engineering on hybrid CPU-GPU machinery.
So how is IBM doing it? Here is how the math works:
On the direct attached memory on the Power9 chips, the memory signals are 72 bits wide (the extra bits above 64 are for error correction) and they run at 2.67 GHz. Registered DIMMs have smaller capacity, so a Power9 socket tops out at around 256 GB with that 150 GB/sec of sustained bandwidth. With load reduced DDR4 memory sticks, the bandwidth is the same, but the memory can be stacked up on the DIMMs and the capacity per socket can be pushed up to 2 TB.
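For those who want to check that arithmetic, here is a quick back-of-the-envelope sketch. The channel count is our assumption (eight DDR4 channels per Nimbus socket), not a figure IBM spelled out in the presentation, and we treat only the 64 data bits as payload since the extra 8 bits carry ECC:

```python
# Rough check on the direct-attached DDR4 numbers for the Power9 SO chip.
# Assumptions (ours, not IBM's stated breakdown): 8 memory channels per
# socket, 64 data bits (8 bytes) per transfer, DDR4 running at 2.67 GT/sec.
CHANNELS = 8
DATA_BYTES_PER_TRANSFER = 64 // 8   # 64 data bits = 8 bytes; ECC bits excluded
TRANSFER_RATE_GT_S = 2.67           # DDR4-2666, giga-transfers per second

peak_gb_s = CHANNELS * DATA_BYTES_PER_TRANSFER * TRANSFER_RATE_GT_S
print(f"Peak per socket: {peak_gb_s:.0f} GB/sec")

# The 150 GB/sec sustained figure cited above is a healthy fraction of
# that theoretical peak, as STREAM-measured bandwidth always is.
print(f"Sustained/peak ratio: {150 / peak_gb_s:.0%}")
```

Run that and the peak works out to roughly 171 GB/sec, which makes the 150 GB/sec STREAM number about 88 percent of peak, a believable efficiency for a direct-attached DDR4 controller.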
With the Power9’ processor, IBM will again be using buffered memory, but is switching to a memory buffer created by chip supplier Microchip that is a lot smaller than the Centaur chip from IBM, and which enables much smaller memory cards than IBM has been building for its Power Systems iron. (More on that in a moment.) The SERDES on the Power9’ chip is 8 bits wide and the differential signaling runs at 25.6 GHz, which works out to 25 Gb/sec after encoding overhead is taken out.
With OpenCAPI memory and its predecessor DMI memory, there is an interplay between memory capacity and bandwidth, and IBM is planning on delivering two different options for the Power9’ chips. With regular OpenCAPI DDR4 memory sticks, the capacity will range from 256 GB to 4 TB per socket and the bandwidth will come in at around 320 GB/sec – about what IBM was saying it could do a year and a half ago. But there is also an option to double up the pipes out from the buffer to a smaller amount of memory, and in that case the capacity drops to between 128 GB and 512 GB per socket, but the bandwidth per socket more or less doubles, up to 650 GB/sec. That’s a fair trade, especially considering that the OpenCAPI Memory Interface only adds somewhere between 5 nanoseconds and 10 nanoseconds of overhead compared to RDIMM or LRDIMM DDR4 memory sticks without buffers. The memory interfaces on HBM2 memory are very wide, at 1,024 bits, and the bandwidth can be pushed up to 1 TB/sec, but the capacity is around 16 GB to 32 GB, which is just not a practical size for main memory in a CPU-based system.
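The OMI math works the same way, just with serial lanes instead of parallel DDR pins. One way to square the 650 GB/sec figure with the lane counts IBM gave at Hot Chips – and this framing is our assumption, not IBM’s published derivation – is to count both directions of the bidirectional lanes:

```python
# Back-of-the-envelope arithmetic for the Power9' OMI memory subsystem.
# Assumptions (ours): 16 OMI channels per socket, 8 lanes per channel,
# 25 Gb/sec per lane after encoding, lanes running in both directions.
CHANNELS = 16
LANES_PER_CHANNEL = 8
GBITS_PER_LANE = 25

per_direction_gb_s = CHANNELS * LANES_PER_CHANNEL * GBITS_PER_LANE / 8
print(f"Raw bandwidth per direction: {per_direction_gb_s:.0f} GB/sec")

# Counting reads and writes together gives the raw aggregate, of which
# the 650 GB/sec sustained figure cited above is a plausible fraction.
aggregate_gb_s = 2 * per_direction_gb_s
print(f"Sustained/raw aggregate: {650 / aggregate_gb_s:.0%}")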
Here’s what these Power9 processors in the family look like in terms of their various I/O, including main memory attach:
The “Cumulus” Power9 SU chips used in the Power E950 and Power E980 systems have 96 lanes of Bluelink I/O – now called PowerAXON – built into the chip, comprised of two banks of 48 lanes running at 25 Gb/sec. The chip also has two banks of four Direct Memory Interface (DMI) SERDES (short for serializer/deserializer) that, as IBM points out, have 3X the bandwidth per chip area of the signaling used in the DDR4 controller on the “Nimbus” Power9 SO processor. This is significant because by using the DMI SERDES, IBM has more room to add Bluelink SERDES. The SMP links shown on the chart above are really NUMA links, and they run at 16 Gb/sec. That’s actually slower than the Bluelink SERDES, which seems odd, but apparently NUMA doesn’t need more than that.
With the Power9’ processor, IBM has an enhanced OpenCAPI memory SERDES that can double up the bandwidth while also running at the same 25 Gb/sec using the same differential signaling, and that is what is allowing the memory bandwidth to increase further.
Here is what the OpenCAPI memory looks like compared to IBM’s Centaur DIMMs and regular DDR4 DIMMs that meet JEDEC memory standards:
The Microchip Smart Memory Controller 1000 is designed to implement the OpenCAPI Memory Interface and is a precursor to the JEDEC DDR5 memory spec, which is expected to add a memory buffer to the architecture for the same reasons that IBM and Intel have used buffered memory for so long on their high-end systems. It’s the only way to add more memory to a DIMM and also jack up the bandwidth per module. It is not clear if the OpenCAPI Memory Interface is a superset of the JEDEC standard or will be incompatible, but the ideas that IBM has developed are clearly going to be used, just as CAPI (running over PCI-Express 3.0 and 4.0 transports) and OpenCAPI (running over 25 Gb/sec Bluelink SERDES) are inspiring Intel with its Compute Express Link (CXL) accelerator attach protocol and transport. While the OMI SERDES on the Power9’ chip can drive up to 650 GB/sec of sustained bandwidth with 16 channels of memory with eight lanes per channel running at 25 Gb/sec, the Microchip SMC 1000 DDR4 memory buffer tops out at 410 GB/sec of sustained bandwidth.
With the Power10 chips, the PowerAXON and OMI SERDES (both based on Bluelink) will have 32 Gb/sec and 50 Gb/sec native signaling, delivering a mix of performance boosts depending on how IBM makes use of them on the Power10 processing complex. The 32 Gb/sec signaling looks like it will be used for the OMI SERDES for DDR5 main memory, presumably with a beefier buffer chip so it can handle the 800 GB/sec that IBM is expecting to deliver in each Power10 socket. We also presume that the number of OMI channels will stay the same at 16, with eight lanes each, just running at 32 Gb/sec instead of 25 Gb/sec. It looks like the Bluelink SERDES will be doubled up to 50 Gb/sec and that the OpenCAPI 5.0 protocol will be supported on these devices and quite likely the NVLink 3.0 ports, even though IBM does not say that. We suspect that Intel’s CXL protocol will also be supported across these interconnect circuits, too. It doesn’t look like IBM will boost the number of Bluelink lanes in the Power10 and will keep it at 96 lanes, probably in two banks like with Power9 and Power9’.
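If you run the same lane arithmetic against our presumed Power10 configuration – 16 OMI channels of eight lanes each, stepped up to 32 Gb/sec per lane, which is our extrapolation rather than a confirmed IBM specification – the 800 GB/sec sustained figure falls right in line with the Power9’ efficiency:

```python
# The Power9' OMI arithmetic, scaled to the presumed Power10 configuration.
# Assumptions (ours, not confirmed by IBM): 16 OMI channels per socket,
# 8 lanes per channel, 32 Gb/sec per lane, lanes counted in both directions.
CHANNELS = 16
LANES_PER_CHANNEL = 8
GBITS_PER_LANE = 32

per_direction_gb_s = CHANNELS * LANES_PER_CHANNEL * GBITS_PER_LANE / 8
aggregate_gb_s = 2 * per_direction_gb_s
print(f"Raw aggregate per socket: {aggregate_gb_s:.0f} GB/sec")

# The 800 GB/sec sustained figure on the roadmap would be a similar
# fraction of raw aggregate bandwidth as the Power9' 650 GB/sec figure.
print(f"Sustained/raw aggregate: {800 / aggregate_gb_s:.0%}")
```

That works out to 512 GB/sec per direction and just over 1 TB/sec of raw aggregate bandwidth per socket, with the 800 GB/sec sustained figure sitting at about 78 percent of it – close to the roughly 81 percent we calculate for Power9’.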
As far as we know, the Power10 chip will have 48 cores with SMT4 threading or 24 cores with SMT8 threading, but IBM has not confirmed this.