The Road Ahead For Power Is Paved With Bandwidth
March 26, 2018 Timothy Prickett Morgan
The gap between what most IBM i customers need from Power Systems hardware – particularly when it comes to compute capacity – and what Big Blue can deliver has been widening since about the Power7 generation back in 2010. But, thanks in large measure to the slowdown in Moore’s Law advances in chip manufacturing processes, the gap is going to start closing. Not so much because IBM wants it to, but rather because of the limits of physics.
And, given that the IBM i platform is a database platform that mainly does transaction processing and analytics, IBM’s shift away from compute and toward memory and I/O enhancements that engender all kinds of acceleration using GPUs, FPGAs, and non-volatile memory is a good thing . . . in theory. In practice, even though such acceleration has been around for years now and is a key strategic aspect of the Power Systems – excuse me, the Cognitive Systems – platform, IBM has not really leveraged GPU and FPGA acceleration for IBM i workloads. The company has done a reasonably good job in promoting the use of flash memory to accelerate storage, and it has moved to standard DRAM rather than its own “Centaur” buffered DRAM to lower the cost of main memory on entry Power9 systems. The combination of larger and cheaper main memory, flash storage, and ample compute will meet the current needs of most IBM i shops within a two-socket server form factor.
But we like to look ahead, and to see the broader spectrum of workloads that could – and should – be running at IBM i shops. Big Blue goes to all of the trouble of rebranding the Db2 for IBM i database, but it doesn’t really do anything to modernize it or to accelerate it like crazy, so it can not only do more work faster but also take on different kinds of work, such as advanced analytics and machine learning. It is with this in mind that we have discussed acceleration in the past, and how IBM could create something we have called the Cognitive Systems/500, a system (not just a machine) very much like the Power9-based AC922 system that IBM launched last December and that is the company’s flagship system for AI and HPC workloads. (You can see our initial coverage on the Power9 architecture here, which is useful for the purposes of this conversation.)
IBM hosted its Think 2018 business partner and customer conference last week in Las Vegas, and at the same time, the company’s partners also had their own shindig at the OpenPower Summit; this was also the same time that Facebook was holding its own Open Compute Summit, so it was a busy week in infrastructure. IBM did a lot of talking about the shiny new Power9 systems and its System z14 mainframes last week, as you might imagine, and frankly, we told you more about the Power9 “ZZ” entry systems in the past few weeks than you would learn at Think 2018. Which means we are doing our jobs, we suppose. But the new thing that did come out of the OpenPower Summit was an updated Power processor roadmap, which Brad McCredie, the founding president of the OpenPower Foundation, an IBM Fellow, and vice president of Cognitive Systems development at IBM, talked a bit about.
Let’s take a look at this roadmap:
The shift from focusing on compute to focusing on I/O started with the Power8+ chip, a name which IBM stopped using because it wanted to emphasize the fact that, although there were some microarchitecture improvements with this chip, the big change was not anything having to do with CPU cores or caches, but rather the integration of the NVLink 1.0 interconnect, which allows for the “Pascal” P100 Tesla GPU accelerators from Nvidia to be tightly coupled with each other and with the Power8+ processors from IBM. This chip was really the proof of concept for the Power9 chip that was launched late last year and that we have been talking about for years now with great anticipation.
As you can see, IBM has taken its own tick-tock approach to updating the Power chip family, but not on quite the same tight schedule and not in precisely the same manner as Intel. With the Intel tick-tock method, the tick is a change in manufacturing process, say a shrink from 45 nanometer chip etching techniques to 32 nanometer methods, while the tock is a change in architecture, such as core design, changes in cache hierarchy, and the addition of accelerators. To mitigate its risk, Intel tried not to do a process change and an architecture change at the same time; too many things can go wrong. These days, as Moore’s Law improvements in manufacturing processes are getting harder to come by, Intel has shifted to a three step method we have called tick-tock-clock, where a process is further refined and performance enhancements come out in a third generation on that process rather than moving on to a subsequent one. There just are not that many steps left after 14 nanometers before we hit the limits of transistor physics – perhaps 10 nanometers, then 7 nanometers, then 5 nanometers, and maybe down to 3 nanometers, give or take a nanometer here or there, or possibly only 7 nanometers down to 4 nanometers as the primary steps beyond 10 nanometers. After that, you are down to etching with individual atoms, and that is just not going to work from a physics or economics perspective.
With the Power7 chips in 2010, IBM did eight cores in 45 nanometers, and then with the Power7+ chip two years later, Big Blue tweaked the microarchitecture a bit while shifting to a much smaller 32 nanometer process, but it used that shrink to lower manufacturing costs and did not boost the core count. The Power8 in 2014 not only did a process shrink to 22 nanometers, but it also did a substantial reworking of the Power cores while at the same time boosting the core count by 50 percent to 12 cores per chip. IBM had versions of the Power8 with six cores and put two of them in a single package to create sockets with 12 cores, while at the same time it also created versions of the Power8 chip with all 12 cores on a single die, which are more expensive. The Power8 chips could use either DDR3 or DDR4 memory, and included eight memory controllers on the die, allowing for a factor of 3.2X increase in memory bandwidth over the Power7 and Power7+ chips. The Power8 chip from 2014 also used PCI-Express 3.0 peripheral slots, and allowed the initial CAPI 1.0 protocol for coherently attaching accelerators to the Power memory complex to run atop the PCI-Express ports. With the Power8+ chip two years later, IBM added 20 Gb/sec signaling on its prototype “Bluelink” generic, high speed I/O ports, and these new I/O lanes were used to create four ports with 20 GB/sec of bandwidth in each direction to make the NVLink 1.0 ports for hooking four Pascal P100 accelerators to the Power8+ CPU, with a rudimentary cache coherency across the CPUs and GPUs.
With the “Nimbus” and “LaGrange” scale-out Power9 chips, which are shipping in the AC922 and ZZ systems that IBM has already announced, a couple of different things happened. IBM shifted to a 14 nanometer process at its partner, GlobalFoundries, which acquired the IBM Microelectronics business four years ago. This process shrink allowed IBM to cram up to 24 skinny cores with SMT4 threading (four threads per core) or 12 fat cores with SMT8 threading (eight threads per core) on a single die. Basically, IBM allowed the cores to be cut in half and the threading level to be cut in half compared to the Power8 chips if that is how customers want to dice and slice the compute, but really, the core count did not go up in the strictest sense. (12 cores with SMT8 is what the Power8 and Power8+ supported, and this is also all that the Power9 supports.)
For the entry Power9 machines, IBM also wanted to get its memory costs down, so it shifted away from its own Centaur buffered memory and L4 cache and to plain vanilla DDR4 main memory. There are still eight memory controllers on the Power9 SO die, but without the buffers the sustained memory bandwidth drops by 29 percent, from 210 GB/sec per socket with the Power8 and Power8+ to 150 GB/sec for the Power9 SO variants. Also with the Power9 SO chips, IBM is boosting the Bluelink lane speed to 25 Gb/sec – as fast as current Ethernet switch chips can signal – which across six NVLink ports on the Power9 SO processor works out to 300 GB/sec of aggregate bandwidth. The shift to 48 lanes of PCI-Express 4.0 peripheral I/O doubles the bandwidth for other peripherals. The OpenCAPI 3.0 protocol runs on the Bluelink lanes – they don’t have to support NVLink 2.0 exclusively – and the updated CAPI 2.0 protocol can run atop PCI-Express 4.0.
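That 300 GB/sec figure falls out of simple lane arithmetic. Here is a sketch of the math, assuming the usual eight lanes per NVLink port (a “brick,” per Nvidia’s published layout) and counting both directions of traffic:

```python
# Back-of-the-envelope check of Bluelink/NVLink bandwidth on Power8+ and Power9 SO.
# Assumption (ours): eight lanes per NVLink port, per Nvidia's published brick layout.

LANES_PER_PORT = 8

def port_gbytes_per_sec(lane_gbps, lanes=LANES_PER_PORT):
    """Bandwidth of one port in one direction, in GB/sec (8 bits per byte)."""
    return lane_gbps * lanes / 8

# NVLink 1.0 on Power8+: 20 Gb/sec lanes, four ports to the P100 GPUs
nvlink1_port = port_gbytes_per_sec(20)       # 20 GB/sec per direction per port
nvlink1_total = nvlink1_port * 2 * 4         # both directions, four ports
print(nvlink1_total)                         # 160 GB/sec aggregate

# NVLink 2.0 on Power9 SO: 25 Gb/sec lanes, six ports
nvlink2_port = port_gbytes_per_sec(25)       # 25 GB/sec per direction per port
nvlink2_total = nvlink2_port * 2 * 6         # both directions, six ports
print(nvlink2_total)                         # 300 GB/sec, the roadmap figure
```

The 160 GB/sec aggregate for four NVLink 1.0 ports matches the published peak for the Pascal P100, which is a decent sanity check on the lane-count assumption.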
With the “Cumulus” Power9 chips used in the future scale up variants of Power Systems with four, eight, 12, or 16 sockets in a single system image, IBM will keep the Centaur DDR4 buffered memory and the sustained memory bandwidth will stay the same at 210 GB/sec per socket. The chart implies that there will be variants available with 24 skinny cores sporting SMT4 threading, but as far as we know, IBM only plans to offer 12 fat cores with SMT8 threading in these machines. Also, the chart implies that the bandwidth on the Bluelink ports will stay the same, but some of those Bluelink ports are going to be used up in creating the NUMA shared memory across sockets, as we have pointed out in our analysis of the future “Fleetwood” sixteen-socket Power E980 system that will come out in the third quarter. There will still be some Bluelink ports left over to run OpenCAPI or NVLink ports for those who want to do this for hooking in GPU, FPGA, or flash accelerators.
That brings us to the Power9+ kickers due starting in 2019, which will have the same core counts as the Power9 variants and be implemented in an improved 14 nanometer process from GlobalFoundries. While there will be some microarchitecture improvements with these Power9+ chips, there will not be a substantial change in performance, we are guessing, at least not in the core compute complex. But there will be enhancements in the memory subsystem and in the protocols running on the I/O subsystem that will allow those cores to be better fed with data and therefore get more work done on the compute capacity that is there. This is also a kind of performance improvement.
The PCI-Express 4.0 and Bluelink I/O subsystems will be the same on this future Power9+ family of chips in terms of the signaling rates and peak and sustained bandwidths. But the CAPI, OpenCAPI, and NVLink protocols will all be updated to run better. IBM is also shifting to a new memory subsystem, and McCredie offered some hints on how it might be done in this slide:
If this chart is to be taken literally, the Bluelink ports are shown on the southbound I/O coming off the Power compute complex, providing OpenCAPI and NVLink ports as we know they do. But this also shows Bluelink ports going northbound, up to JEDEC-compliant memory buffers. We suspect that this could be for DDR4 or DDR5 DRAM memory, but it could also be for persistent non-volatile storage, such as ReRAM or 3D XPoint, which is also byte addressable like DRAM.
If we were IBM, we might preview Bluelink-attached DDR4 DRAM, with memory buffers sitting between the compute complex and the DRAM, in these Power9+ chips. To get the 67 percent increase in memory bandwidth shown with the Power9+ chips, the number of memory controllers could be pushed up by 50 percent to a dozen per die, and the number of memory slots could be increased by 50 percent as well, to 48 sticks per socket with buffers or 24 sticks without them. If you boost the memory speed to 3.8 GHz from the current 3.2 GHz when all of the slots are occupied, as is the case with the Power9 ZZ machines, you can get that bandwidth. This is just conjecture, of course. The DDR5 specification will be done this year, and for all we know, IBM will do something like this with early and geared down DDR5 memory sticks. DDR5 memory will offer twice the capacity and twice the bandwidth of DDR4 when it does arrive.
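Our conjecture can be sanity checked with simple scaling arithmetic. All of the inputs below are our own assumptions from the discussion above, not IBM disclosures, and we are scaling sustained bandwidth linearly, which is optimistic:

```python
# Rough scaling check for the conjectured Power9+ memory subsystem.
# Inputs are our assumptions, not IBM disclosures.

power9_so_sustained = 150     # GB/sec per socket on Power9 SO (figure cited above)
controller_scale = 12 / 8     # eight controllers pushed up to a conjectured twelve
clock_scale = 3.8 / 3.2       # conjectured memory clock bump, in GHz

scaled = power9_so_sustained * controller_scale * clock_scale
print(round(scaled))                                  # roughly 267 GB/sec
print(round(scaled / power9_so_sustained - 1, 2))     # a 0.78, or 78 percent, gain
```

A 78 percent theoretical gain leaves comfortable headroom above the 67 percent increase shown on the roadmap, even after some derating for sustained versus peak bandwidth.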
We think the increase in memory slots with the Power9+ chips is a stop-gap maneuver, because Big Blue has to wait for DDR5 until the Power10 chips come out in 2020 or 2021, and moreover that it is hedging its bets here in this Power roadmap. But we think the Power10 will be designed to have a much more flexible memory subsystem underpinned by 50 Gb/sec Bluelink signaling and that it will also support 32 Gb/sec Bluelink signaling for other aspects of the system. (The chart suggests that these two speeds will be available.) By moving to DDR5 with the Power10, it will be able to drop the number of memory slots back down to 32 with buffers and 16 without buffers while boosting the speed a bit to 3.4 GHz on the memory to reach that 435 GB/sec of main memory bandwidth shown in the chart.
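The 435 GB/sec figure on the chart is at least in the right neighborhood for a DDR5 transition. As a sketch, assuming (our assumptions, not IBM’s) that DDR5 delivers its spec target of twice the bandwidth of DDR4 and that the buffered Power9 baseline carries over:

```python
# Sanity check of the Power10 memory bandwidth figure shown on the roadmap.
# Assumptions (ours): DDR5 doubles DDR4 bandwidth per the JEDEC target, and
# the buffered Power9 sustained baseline carries forward unchanged.

power9_buffered = 210         # GB/sec sustained per socket with Centaur buffers
ddr5_scale = 2                # DDR5 target: double the bandwidth of DDR4
clock_scale = 3.4 / 3.2       # the modest speed bump conjectured above, in GHz

estimate = power9_buffered * ddr5_scale * clock_scale
print(round(estimate))        # roughly 446 GB/sec, near the 435 GB/sec on the chart
```

The estimate lands a bit above the chart’s 435 GB/sec, which is what you would expect if the roadmap number is a sustained figure with some derating baked in.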
The Power10 chip will also support PCI-Express 5.0 peripherals, which again will double up the I/O bandwidth compared to PCI-Express 4.0. The number of cores has yet to be determined with the Power10 chip, and that is because the world’s major HPC and AI centers have not figured out what they might need three years from now. We think it is highly unlikely that IBM will do more than a 50 percent increase to 36 skinny cores with SMT4 or 18 fat cores with SMT8. We also doubt it will keep the core count the same if the chips are implemented in the 7 nanometer processes from GlobalFoundries, as we suspect they will be. GlobalFoundries, by the way, is itself hedging on its 7 nanometer manufacturing techniques in the Malta, New York fab. It will have conventional immersion lithography with 7 nanometers as well as its first run of extreme ultraviolet lithography at 7 nanometers. So IBM will have two ways to get there, one potentially less risky but the other with a lot fewer manufacturing steps and complexity and therefore at a lower cost.
The task now is to get IBM to actually redesign the IBM i software stack, and various Linux applications that can run in conjunction with IBM i, to take full advantage of these capabilities. Green screen applications with even the worst possible screen scrapers cannot even dent this kind of compute and I/O, and disk arrays can’t touch it, either. Imagine a two-socket system with 72 cores, many terabytes of main memory and tens of terabytes of addressable non-volatile storage, and petabytes to tens of petabytes of flash storage, with a terabyte per second of bandwidth within the system bus and maybe two or three times that of aggregate memory and I/O bandwidth piping into the compute complex. It is time for IBM i to go HPC and AI.