Power9 Gets Ready To Roll In Systems In 2017
August 29, 2016 Timothy Prickett Morgan
In about a year or so, a radically different Power processor family will be embedded as the motors in the Power Systems machines that will drive IBM i applications into the future. The forthcoming Power9 chips, which IBM’s top techies unveiled at the Hot Chips conference in Silicon Valley last week, are as always packed with lots of technical innovation. But that is not the main thing that IBM i shops should be ebullient about.
The real innovation that will drive the Power platform forward is the OpenPower Foundation and the fact that Big Blue has rearchitected its chips and its business model for those processors so they can be used in a broader array of machines in the datacenter. This, as we have pointed out many times, helps ensure the longevity of the Power processor in general and therefore the Power Systems-IBM i platform in particular.
Having said that, IBM has not forgotten its IBM i and AIX customers as it tweaks the Power processor to make it more suitable for cloud builders (like its own SoftLayer as well as Rackspace Hosting) and hyperscalers (like Google). As it turns out, there are not just two variants of the Power9 processor–one designed for scale-out clusters of machines that have one or two sockets and another for scale-up, shared memory NUMA machines with four or more sockets–but four versions. In addition to the variants of the chip for different levels of scalability–the heftier Power9 SU chip for NUMA machines has more SMP circuitry on it and less of other kinds of I/O than the Power9 SO chip–IBM will also be offering versions of each chip with either 12 or 24 cores per chip. These will all be single chips, not dual-chip modules as was the case with the Power8 SO processors that were used in the Power Systems LC machines for running Linux on bare metal or atop the OpenKVM hypervisor and in the plain vanilla Power Systems machines that run IBM i, AIX, and Linux atop the PowerVM hypervisor.
Interestingly, IBM will be tuning up and down the virtual threads, implemented using simultaneous multithreading, which IBM has had in its chips in one form or another since the “Northstar” PowerPC chips back in the late 1990s. IBM has done this dynamically to date, but it is hard-coding it in the two different flavors of Power9 chips.
The Power9 chips with 24 cores will have their SMT backstepped to a maximum of four virtual threads per core, which will yield better single-threaded performance as well as some multithreading, while the 12-core chips will have a higher level of threading, up to eight virtual threads per core like the Power8 chips, offering lots of threads for workloads like Java virtual machines and relational databases that love many threads as much as they like higher clock speeds. We think these lower cored, higher threaded Power9 chips will have higher clock speeds, and perhaps more cache, than the 24-core versions. We also think that the Power9 SO versions will focus on lower thermals while pushing single-threaded performance up as high as can be expected given that constraint. What we know is that search engine indexing likes lots of cores in a box and the strongest cores possible. Ahem.
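The thread math is worth noting: whichever flavor you pick, the total thread count per socket is the same. A quick back-of-the-envelope sketch in Python (using only the core and SMT counts cited above) makes the point:

```python
# Hardware threads per socket for the two Power9 core/SMT configurations,
# per the figures IBM disclosed at Hot Chips 28.
variants = {
    "24-core SMT4": (24, 4),   # stronger single-thread performance
    "12-core SMT8": (12, 8),   # more threads per core, like Power8
}

for name, (cores, smt) in variants.items():
    print(f"{name}: {cores * smt} threads per socket")
# Both configurations expose 96 hardware threads per socket; the tradeoff
# is how those threads are packed onto stronger or more numerous cores.
```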
The question is this: Will IBM’s 24 cores with Power9 SO beat Intel’s 28 cores with the future “Skylake” Xeon E5 v5 processors due next year as well? (We will explore the server possibilities with Power9 chips in a future story. So stay tuned.) For now, IBM is focusing on talking about the new architecture of the Power9 chips and it is not about to pre-announce products.
Pick Your Flavor
The Power9 chip, as we have explained, comes in two flavors, the SO and SU variants. The Power9 SO machines will be optimized to have a lot of their I/O bandwidth allocated to hooking various kinds of storage and compute accelerators to the processing complex, with a nominal amount for the NUMA links (which IBM still calls SMP for some reason even though it is not) that lash two processor sockets together.
These Power9 SO chips, importantly, will allow for regular DDR4 memory sticks to be plugged directly into the Power9 memory controllers. With the Power8 chips, all of the memory was linked to the DDR memory controllers on the chip through an intermediary “Centaur” buffer chip that also had a segment of L4 cache on it. That buffered architecture allowed for more memory sticks to fan out from each of the eight DDR ports on the Power8 chip than would have been possible without buffers, and it also allowed IBM to drive memory bandwidth up quite high. With the Power9 SO chips now allowing for directly attached memory, the overall system complexity and cost will go down, but memory bandwidth will be sacrificed. You can see that in this chart below:
On the Power9 SO chips, each of the eight DDR memory ports can support two memory sticks per channel, delivering an aggregate memory bandwidth into and out of the processor of 120 GB/sec. With 256 GB memory sticks, which IBM already sells but no one else delivers in volume, main memory would top out at 4 TB, which is the architectural limit of the Power8 and Power9 scale-out chips, according to Brian Thompto, senior technical staff member for the Power processor design team at IBM, who gave the Power9 presentation at the Hot Chips 28 conference. While that direct attached memory does not give the Power9 SO systems the benefit of L4 cache, because it is directly piped to the controller it offers lower latency.
With the Power9 SU systems, the Centaur buffers are still used (they will likely be enhanced with more L4 cache and other features to support faster DDR4 memory), and that means each Power9 SU socket will be able to support twice as many memory sticks, at 32. The main memory bandwidth into and out of the processor complex nearly doubles, to 230 GB/sec, and the maximum capacity with 256 GB memory sticks works out to 8 TB. With the 64 GB sticks that will be common, the memory per socket will be 2 TB, and other features, like chipkill error correction and lane sparing, will make IBM’s big NUMA boxes–and indeed any others that might be manufactured by OpenPower partners–more resilient and reliable.
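The capacity figures for both flavors fall straight out of the stick counts. A short sketch in Python, using only the DIMM counts and capacities quoted above, reproduces the per-socket limits:

```python
# Per-socket memory capacity for Power9, from the DIMM counts cited above.

# Power9 SO: 8 direct-attach DDR4 ports, 2 sticks per channel = 16 sticks.
so_sticks = 8 * 2
print(so_sticks * 256 // 1024, "TB")   # 16 sticks x 256 GB = 4 TB

# Power9 SU: Centaur buffers double the fan-out to 32 sticks per socket.
su_sticks = 32
print(su_sticks * 256 // 1024, "TB")   # 32 sticks x 256 GB = 8 TB
print(su_sticks * 64 // 1024, "TB")    # common 64 GB sticks = 2 TB
```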
So, to be very clear: there are Power9 SO chips aimed at standalone entry machines and clusters of one-socket and two-socket nodes, and there are Power9 SU chips aimed at machines with four, eight, or 16 sockets. Both will come with either 24 cores and SMT4 threading or 12 cores and SMT8 threading, and the two flavors will differ in how they attach to main memory and in whether their I/O is used for SMP/NUMA links or for other kinds of peripherals.
The Power9 chip is implemented in a 14 nanometer chip making process from Globalfoundries, which was basically paid $1.5 billion to take over that business from IBM a few years back and which is committed to helping Big Blue bring several generations of Power chips to market in the coming years. The Power9 chip has 8 billion transistors, a little less than twice the 4.2 billion transistors in the original Power8 chip (which did not have the NVLink GPU accelerator interconnect on it), which was implemented in 22 nanometer processes. To get double the number of cores on the die plus a slew of new I/O links, the L3 cache could not be doubled up to 192 MB. Instead, the Power9 chip has 120 MB of L3 cache, implemented in embedded DRAM (eDRAM) as was used in the Power7, Power7+, and Power8 chips. The on-die interconnect that lashes the L3 cache segments together across the 24 cores on the SMT4 die delivers 7 TB/sec of aggregate bandwidth.
The Power9 chip has 48 ports running at 25 Gb/sec, which are generic and which are used as the foundation for remote SMP links for servers with more than two sockets and to link accelerators to the chip over a New CAPI interface as well as NVLink 2.0 interfaces. Machines with remote SMP links to hook up four, eight, or sixteen sockets do not have these links available for accelerators, which stands to reason because they are basically database engines anyway. Local SMP for two-socket machines runs on dedicated 16 Gb/sec links, and there are 48 lanes of PCI-Express 4.0 peripheral support to run the CAPI 2.0 protocol for other accelerators as well as to link other peripherals like network cards or disk controllers to the processor complex. That CAPI 2.0 port running on PCI-Express 4.0 has 4X the bandwidth of the CAPI 1.0 port running on PCI-Express 3.0. CAPI 2.0 is not, however, as fast as New CAPI. The PCI-Express 4.0 lanes offer 192 GB/sec of aggregate bandwidth per socket, while the 25 Gb/sec ports have an aggregate of 300 GB/sec of bandwidth into the chip. (Those are duplex figures.)
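Both aggregate bandwidth figures can be sanity-checked from the lane and port counts. Here is a sketch in Python, assuming roughly 2 GB/sec per PCI-Express 4.0 lane per direction (16 GT/sec with 128b/130b encoding, the usual rule of thumb) and counting both directions since the figures are duplex:

```python
# Sanity-check the duplex bandwidth figures from the lane and port counts.

# 48 lanes of PCI-Express 4.0 at ~2 GB/sec per lane, per direction.
pcie_lanes = 48
pcie_gbs_per_lane = 2                        # GB/sec, one direction
print(pcie_lanes * pcie_gbs_per_lane * 2)    # duplex: 192 GB/sec

# 48 generic ports at 25 Gb/sec, per direction (divide by 8 bits/byte).
ports = 48
port_gbps = 25                               # Gb/sec, one direction
gbs_one_way = ports * port_gbps / 8          # 150 GB/sec each way
print(gbs_one_way * 2)                       # duplex: 300 GB/sec
```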
So how will the Power9 chip perform? Well, that is going to depend. But to give us an idea, Thompto showed off a chart that pitted the Power8 against the Power9. In this case, to get an apples-to-apples comparison, IBM’s performance techies ran some benchmarks on the Power8 chip with 12 cores running at a 4 GHz frequency, and then ran the same tests on a Power9 chip with 12 cores; both had SMT8 threading turned on.
As you can see from the chart above, the average performance increase across multiple workloads, on a socket-by-socket basis, is higher than 1.5X and lower than 2X; it looks like around 1.8X to my eye. This is all due to architectural differences, and given that, it is a very big jump in performance. But, with the Power9 SO chips coming in the second half of 2017 and the Power9 SU chips coming sometime in 2018, there is also a big gap in years, given that the Power8 chips debuted in April 2014. And there is no guarantee that IBM can deliver that kind of performance jump across all core counts, cache sizes, and platform types. The jump is, however, about as big as the jump from Power7 to Power8, and that means IBM’s architects deserve a lot of credit.
Now comes the fun bit: Trying to figure out what the Power9 systems will look like, and how they will stack up to the Xeon competition. . . .