The Deal On Power9 Memory For Entry Servers
March 5, 2018 Timothy Prickett Morgan
There are a lot of changes that come with any new Power Systems platform. But perhaps the biggest change – and one that will in some ways make the Power9 platform more competitive with X86 and ARM servers and in others less competitive – is the way IBM is shifting from buffered DDR3 and DDR4 main memory used in Power8 iron to plain vanilla registered DDR4 memory that is commonly used in all servers these days.
Buffered memory had its heyday on high-end NUMA systems, and was necessary to try to balance the needs of memory bandwidth against ever-increasing compute capacity. For a long time now, IBM’s Power Systems have had a pretty significant memory bandwidth advantage over Intel’s Xeon iron, but it always came at a cost because buffered memory is inherently more expensive to manufacture than regular unbuffered memory.
The midrange and high-end NUMA systems based on the Power7 chips, which came out in April 2010, had buffering on each memory stick, and the buffer ASIC hooked into ports on the memory controllers. Each buffer chip was associated with one DIMM and one memory port, so a Power7 chip had two controllers with four ports each to support a total of eight DIMMs per socket. These DDR3 memory sticks ran at 1.07 GHz, and they came in 8 GB, 16 GB, and 32 GB capacities. The memory controllers ran at 6.4 GHz. The memory feature cards had four of these buffered memory chips each.
With the Power8 chips that came out in April 2014, IBM added a funkier memory buffer chip, code-named “Centaur,” which we talked about in detail here. With this buffering, IBM could hang up to four memory sticks off each memory controller etched onto the processors, and that controller could support either DDR3 or DDR4 because the memory interfaces were moved out of the controller and into the Centaur buffer chip. The Centaur chip also had 16 MB of embedded DRAM on the buffer, which was activated on high-end machines (those with 4, 8, 12, or 16 sockets) as an L4 cache memory that sat between main memory and the external storage devices that hooked into the compute complex. That processor interface coming out of the Power8 memory controller had 8 GB/sec of bandwidth, so the processor had an aggregate of 192 GB/sec of sustained bandwidth coming out to the Centaur chips. On the other side of the Centaur chips, there were four memory sticks per buffer, for a total of 32 memory slots running at 1.6 GHz, yielding a total of 410 GB/sec of bandwidth per socket. At first, using 64 GB CDIMM memory sticks, capacity was limited to 512 GB per socket, but eventually IBM shifted to 128 GB memory sticks and the capacity topped out at 1 TB.
The memory sticks in the Power8 machines ran at 1.6 GHz, 50 percent faster than the memory used in the Power7 machines, and combined with the 4X increase in the number of memory sticks, the Power8 machines ended up with 6X the bandwidth at the memory sticks. But again, you can’t drive more bandwidth into the chip than those memory ports on the die can handle, so really, buffering is just about being able to hang more memory off the processor than might otherwise be possible while still getting appreciably high overall performance.
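The 6X figure above is just the product of the two improvements, using the stick counts and clock speeds cited in the text; this is back-of-envelope arithmetic, not an IBM-published number:

```python
# Power7 -> Power8 memory bandwidth improvement, per the figures above.
speedup_clock = 1.6 / 1.07   # DDR4 at 1.6 GHz vs DDR3 at 1.07 GHz, about 1.5X
speedup_slots = 32 / 8       # 32 DIMM slots per socket vs 8 on Power7, 4X

print(round(speedup_clock * speedup_slots, 1))  # → 6.0
```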
Obviously, these memory bandwidth figures are based on having all memory slots full and running at the top speed available. If you use slower memory in one slot, all of the sticks drop down to the slower speed, lowering bandwidth, and if you don’t populate all of the memory slots, the bandwidth drops proportionately there, too.
We detailed the DDR3 memory used with the scale-out versions of the Power7+ and Power8 machines back in June 2014, and we talked about the shift to DDR4 memory on Power8 iron, which boosted capacity to 2 TB per socket using 128 GB sticks, back in October 2016.
That brings us to the “ZZ” Power9 scale-out machines, which were launched in February and which we drilled down into here. As we pointed out in our analysis of the Power S924 server last week, IBM has offered up to 2 TB per socket using 128 GB DDR4 memory sticks. The peak memory bandwidth per socket, however, has dropped to 153 GB/sec, and without that Centaur buffer chip, the sustained memory bandwidth as measured by the STREAM Triad benchmark test is 33.5 percent lower: 173 GB/sec for the Power8 socket versus 115 GB/sec for the Power9. This is moving backwards, to be sure, but it is right on par with what Intel is delivering with its latest “Skylake” Xeon SP chip, which sports only six memory channels per socket and which tops out at 1.5 TB per socket and, in most cases, has a ceiling of 768 GB per socket. IBM wants to draw even on memory bandwidth, win on memory capacity, and use the same industry-standard registered DDR4 DIMMs as everyone else, therefore making building up main memory on the Power9 servers less costly than it would be with buffered CDIMMs.
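That 33.5 percent figure falls straight out of the two STREAM Triad numbers quoted above; a quick sanity check:

```python
# Sustained memory bandwidth per socket, STREAM Triad, in GB/sec (from the text).
power8_sustained = 173.0  # Power8 socket, with Centaur buffers
power9_sustained = 115.0  # Power9 ZZ socket, direct-attach DDR4

drop = (power8_sustained - power9_sustained) / power8_sustained
print(f"Sustained bandwidth drop: {drop:.1%}")  # → 33.5%
```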
On the Power9 ZZ systems, which come with one or two sockets, each processor has eight DDR4 memory controllers, and pairs of controllers are banded together to make quads of memory blocks in the compute complex, like this:
The memory slot count goes down to 16 per socket with the Power9, which accounts for a lot of the bandwidth decrease, but the memory speed on a fully loaded (in terms of memory) socket goes up by 33 percent to 2.13 GHz, which helps mitigate that loss in slots and therefore the loss in memory capacity and bandwidth. There are faster Power9 memory sticks, but you cannot fully populate the slots to use them. Here are the feature codes and the slots:
As you can see, IBM is offering DDR4 memory sticks in 16 GB, 32 GB, 64 GB, and 128 GB capacities. IBM didn’t offer a maximum of 4 TB of memory per socket on the scale-out versions of the Power8 systems, but it could have with 128 GB CDIMMs, and if it wanted to, it could have plunked 32 of its 256 GB CDIMMs onto a single socket and had a whopping 8 TB per socket. This is not exactly appropriate for a server with one or two sockets, but it is fun to think about.
There are rules about memory with the Power9 ZZ scale-out systems. First, any DIMMs in the same quad have to have the same capacity. The minimum configuration on any ZZ system is two 16 GB sticks per socket, and users can install 2, 4, 6, 8, 12, or 16 memory sticks across those quads, balancing across them as they see fit. If you want to use 2.67 GHz memory for lower latency on memory accesses, you can’t use it in more than half of the memory slots in a socket, and you can only use 16 GB memory sticks, which means the memory capacity will top out at 128 GB; peak memory bandwidth in this configuration would be 119 GB/sec, somewhat lower than the 153 GB/sec peak of a fully populated socket. If you want fatter memory sticks than 16 GB but still want to run faster than the base 2.13 GHz speed, you can use 32 GB, 64 GB, and 128 GB sticks running at 2.4 GHz, yielding a peak bandwidth across those sticks of 107 GB/sec. (Those are peak bandwidths; sustained will be lower.)
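The capacity ceilings implied by those plugging rules can be sketched quickly, using the slot counts and stick sizes from the text:

```python
# Per-socket capacity ceilings under the Power9 ZZ plugging rules above.
slots_per_socket = 16

# 2.67 GHz config: at most half the slots, and 16 GB sticks only.
fast_cfg_gb = (slots_per_socket // 2) * 16    # 8 slots x 16 GB
print(fast_cfg_gb)                            # → 128

# 2.13 GHz config: all slots populated with 128 GB sticks.
max_cfg_gb = slots_per_socket * 128           # 16 slots x 128 GB
print(max_cfg_gb)                             # → 2048, or 2 TB per socket
```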
Here is another important difference between the old and new lines. In the Power8 line, on the Power S822, which came in a 2U form factor and therefore needed denser CDIMM memory sticks, IBM only offered 16 GB, 32 GB, and 64 GB memory features, rather than these capacities plus 128 GB sticks as on the Power S814 and Power S824 machines, which had a 4U chassis and therefore more headroom for less dense memory sticks. IBM charged the same prices for these different CDIMM sticks, but it had to have two different manufacturing runs for them, and that means they were more costly to make than if there were just one kind of stick.
If DRAM memory prices had not gone up by more than a factor of 2X in the past year out on the open market, we think that IBM would be charging a lot less for memory on Power9 systems than it is. The densest memory sticks, which use the densest memory chips, are actually 25.4 percent more expensive per gigabyte, as you can see in the table below:
If customers don’t need fat memory sticks, or want the lowest latency and fastest clock speed on the memory, then the 16 GB sticks are 44.2 percent cheaper on the Power9 scale-out iron than on their Power8 predecessors, and are about half the cost per unit of capacity of the original DDR3 memory used in the Power8 machines. The 32 GB DDR4 sticks for Power9 servers are 30.6 percent cheaper than their Power8 DDR4 predecessors, and the 64 GB sticks are 21.5 percent cheaper. So, this is progress. If memory chip prices, and therefore memory DIMM prices, come down, we expect IBM to lower Power9 memory prices even further. But the memory manufacturers are having way too much fun raking in record profits to boost the output of their factories any time soon. So don’t hold your breath.
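For readers who want to reproduce the per-gigabyte comparisons, the arithmetic is simple. Note that the dollar figures below are hypothetical placeholders, not IBM’s actual list prices (those live in the pricing table referenced above); only the formula is the point:

```python
def pct_cheaper(old_price_per_gb: float, new_price_per_gb: float) -> float:
    """Percent reduction in price per gigabyte, old generation vs new."""
    return (old_price_per_gb - new_price_per_gb) / old_price_per_gb * 100

# Made-up illustration: $10.00/GB on Power8 vs $5.58/GB on Power9 would
# yield roughly the 44.2 percent reduction quoted for the 16 GB sticks.
print(round(pct_cheaper(10.00, 5.58), 1))  # → 44.2
```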