IBM Drops Power10 Into Big, Bad Iron First
September 8, 2021 Timothy Prickett Morgan
After a long, long wait – it has been nearly four years since the first Power9 machine was launched in the “Witherspoon” Power AC922 supercomputing node back in December 2017 based on the 24-core “Nimbus” Power9 processor and just over three years since the high-end “Fleetwood” Power E980 system debuted with the 12-core “Cumulus” Power9 variant – the first machines out of IBM based on the Power10 processor are being unveiled.
The expected three-year cadence between processor generations set by IBM a long time ago has worked out almost precisely for high-end Power Systems shops, with today’s launch of the “Denali” Power E1080. We are calling the 16-core variant of the chip, which only has 15 active cores because of yield issues that all chip designers have to deal with as they move to 7 nanometer foundry manufacturing, the “Cirrus” variant of the Power10 chip because IBM didn’t get around to giving it a proper name.
If we don’t get the real names, we will also name the 30-core dual-chip module (DCM) variant of Cirrus – perhaps “Stratus” is a good name – which has 32 physical cores across its two dies but only exposes 30 of them to the software because of the 7 nanometer yield issues. But we will have time, since the word on the street is that IBM won’t get these chips into the field for its so-called “scale out” Power10 systems – what we would call entry systems – until late in the second quarter of 2022. The “ZZ” entry systems using the Nimbus Power9 chips, which you know as the Power S914, the Power S922, and the Power S924 machines, appeared in February 2018, while the “Boston” Power LC921 and LC922 machines, which were Linux-only systems, appeared in May 2018. So that gap between Power9 and Power10 entry machines is somewhere between three and a half and four years. Chalk this delay up to having to change the Power10 design at least once (probably twice), and its process technology twice (two different 7 nanometer foundries, with GlobalFoundries initially and Samsung finally), and its foundry partner once (Samsung is making the Power10 chips now).
Considering the complexity of the Power10 chip, the disruptions caused by IBM’s foundry issues, and the sheer difficulty of moving to 7 nanometer designs at all, it is amazing that the Power10 chip isn’t later than it is. We discussed this a month ago in the story called Balancing Supply And Demand For Impending Big Power10 Iron, in which we warned you that IBM would be doing a gradual rollout of even these big iron systems, which will be useful to the roughly 14,000 customers that are using NUMA systems from the Power7, Power7+, Power8, and Power9 generations.
This is the smallest part of IBM’s Power Systems installed base, accounting for maybe 10 percent of installed machines across IBM i, AIX, and Linux, but it is the biggest part of the Power Systems revenue stream these days, with IBM not winning any of the big exascale-class supercomputer deals with Power10 as it did with Power9. (If that had happened, we would have seen a Power AC1022 launching in November of this year, I can assure you. Or perhaps even now.)
The point is, the first Power10 processor and its machine are launching today, and it is the beginning of the new cycle. From this machine, we will learn much and speculate about much more across the entire product line, which will be filled out probably by May 2022, we think. That lineup should include a four-socket Power E1050 that we want IBM i to be supported on (as it was not on the Power E850 and Power E950), as well as entry Power10 iron that could be called the Power S1014, the Power S1022, and the Power S1024 in the mainstream lineup, followed by Power S1021LC and Power S1022LC Linux-only variants and possibly the H Series follow-ons for SAP HANA workloads, the Power H1021 and Power H1022. We still do not know if IBM will offer SMT4 versions of the Power10 cores, which would give 30 cores per socket in a single chip module and 60 cores in a dual-chip module.
We shall see.
As we go to press, we have talked with the top brass at IBM about the new Power E1080 system, but we have not yet received all of the feeds and speeds of the system or seen the announcement letters or the redbooks for the system. So for now, we are going to give you a high-level overview and then follow up with a lot more detail and analysis after we get our hands on some more data.
“Obviously, we are going to deliver unmatched performance – and there is no secret there – with the best scaling in the world – again,” says Steve Sibley, vice president of global offering management for Cognitive Systems at IBM. And if we didn’t believe this was going to be true, we wouldn’t have quoted him saying it. “But because agility and flexibility are so important – and particularly so since the pandemic – we are doing so with the same pay-per-use consumption model and even being able to share those environments across Power9 and Power10 and protecting the data in motion and at rest with encryption from the core out to the cloud. Other themes include efficient scaling and sustainability, and then streamlining the insights businesses need by bringing AI on the chip itself and running machine learning inference and even training right next to the transaction environments.”
We will explore many of these themes individually because what is said about the Power E1080 will be true of the other Power10 machines as well.
But to get us started, let’s talk about the basic feeds and speeds. We have not seen the official CPW and rPerf commercial transaction processing benchmarks for the IBM i and AIX environments as yet, but IBM says very generally that the Power E1080 based on the 15-core Power10 has up to 30 percent more performance per core and up to 50 percent more performance per socket than the Power E980 based on the 12-core Power9 chip. Because the Power E980 and Power E1080 are both based on four socket nodes and up to four nodes in a full system for a maximum of 16 sockets, the performance per system increase is also a maximum of 50 percent.
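The topology arithmetic in that paragraph can be sketched in a few lines of Python. This is only a back-of-the-envelope check using the socket, node, and core counts quoted above; the "implied per-core gain" it prints is our own derivation from the 50 percent per-socket figure, not a number IBM has published, and IBM's two "up to" claims need not refer to the same configuration.

```python
# Topology as described in the article: four-socket nodes, up to four nodes.
SOCKETS_PER_NODE = 4
MAX_NODES = 4


def max_cores(cores_per_socket):
    """Maximum core count of a fully configured system."""
    return SOCKETS_PER_NODE * MAX_NODES * cores_per_socket


e980_cores = max_cores(12)    # 12-core Power9 in the Power E980
e1080_cores = max_cores(15)   # 15-core Power10 in the Power E1080
core_ratio = e1080_cores / e980_cores

# Assumption: the "up to 50 percent" per-socket uplift is taken at face value
# and split into a core-count factor and an average per-core factor.
SOCKET_UPLIFT = 1.50
implied_per_core = SOCKET_UPLIFT / core_ratio

print(f"Power E980 max cores:  {e980_cores}")
print(f"Power E1080 max cores: {e1080_cores}")
print(f"core-count ratio:      {core_ratio:.2f}X")
print(f"implied per-core gain: {implied_per_core:.2f}X")
```

Because the socket and node counts are identical on both machines, the per-system uplift is necessarily the same as the per-socket uplift, which is the point the paragraph makes.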
The per-core performance of the Power Systems line has been increasing steadily since the Power4 first debuted in 2001, and this chart that IBM showed off to business partners gauged the performance of the top-end, full-blown system in each generation since the 64-core, 32-socket Power 595 machine based on the dual-core Power5 processor from way back in 2004. These relative performance metrics are based on the AIX rPerf test, but the CPW scaling would be roughly the same on the IBM i platform. Take a look:
Now, be careful with this chart, since I read it wrong the first time and didn’t realize this was showing per core performance increases.
The Power E1080 has a total of 240 cores and delivers a factor of 26.1X more performance than a 64-core Power 595. We base this on the AIX Relative Performance (rPerf) online transaction processing tests that IBM has published in the wake of the Power E1080 announcement three weeks ago. The p5 595 server using 32 sockets of the Power5 chip running at 1.9 GHz had a rating of 306.21 on the rPerf test, while the Power E1080 with 240 cores across its 16 sockets has a rating of 7,998.6 on the rPerf test. The 6.9X shown above is just for the microarchitectural changes and clock speed boost of the cores themselves, which means the remaining factor of roughly 3.8X (26.1X divided by 6.9X) is coming from the larger number of cores (240 versus 64) and increased efficiencies of the interconnect to keep those cores fed.
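For those who want to check the ratios, here is a minimal Python sketch that uses only the rPerf ratings and core counts quoted above. Treating the system-level gain as the product of a per-core factor and a core-count factor is our assumption about how to decompose it, and the function name `decompose` is made up for this illustration.

```python
# rPerf ratings and core counts as quoted in the article.
P5_595_RPERF = 306.21   # 64-core p5 595, Power5 at 1.9 GHz
E1080_RPERF = 7998.6    # 240-core Power E1080
P5_595_CORES = 64
E1080_CORES = 240


def decompose(old_rperf, old_cores, new_rperf, new_cores):
    """Split a system-level speedup into per-core and core-count factors."""
    system = new_rperf / old_rperf
    core_count = new_cores / old_cores
    per_core = (new_rperf / new_cores) / (old_rperf / old_cores)
    return system, per_core, core_count


system, per_core, core_count = decompose(
    P5_595_RPERF, P5_595_CORES, E1080_RPERF, E1080_CORES
)

print(f"system-level speedup: {system:.1f}X")
print(f"per-core speedup:     {per_core:.2f}X")
print(f"core-count factor:    {core_count:.2f}X")
```

The per-core and core-count factors multiply, rather than add, to give the system-level figure, so the part of the gain not explained by faster cores is a factor of a little under 4X.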
In any event, over the 17 years between the Power 595 and the Power E1080, that has been a pretty steady performance increase, and one that has gotten steeper and steadier since Power8 came on the scene in 2014. The gaps are big between generations, but so are the performance jumps.
To get you started thinking about Power10 systems, here is a good summary chart to give you a feel of how the Power8-based Power E880C stacks up against its follow-on, the Power E980, and the shiny new Power E1080:
The first thing I saw is that there are variants of the Power10 with 10, 12, or 15 cores activated, and Sibley tells me that the yields on the 7 nanometer process are better than expected and, importantly, the yields on the 15-core Power10 chips are better than the yields were on the 12-core Power9 chips etched using GlobalFoundries’ 14 nanometer processes (which stands to reason since one core is intentionally assumed to be bad). What is important is what Sibley said next, which is that the pricing on the chips with more cores would not be as extravagant as it was during the 22 nanometer Power8 and 14 nanometer Power9 generations. We will let you know when we get our hands on some pricing info.
The other thing that immediately jumps out at me is that the Power10 chip in the Power E1080 has twice as many memory channels and twice as many memory slots as the Power9 machine, and that its memory seems to be moving a little bit slower but the aggregate bandwidth for the socket and the system is up by a factor of 1.79X compared to the Power9-based Power E980. The machines both tap out at 16 TB per node and 64 TB per system, and IBM is using less dense and presumably cheaper memory to get that capacity. And in doing so, by having twice as many memory channels, it can deliver nearly twice as much memory bandwidth and still lower the power consumption in the overall machine. This is a correct set of tradeoffs. You will also note that IBM is using DDR4 memory, not DDR5 memory, in the Power E1080, packaged up in the OpenCAPI Memory Interface, or OMI, DIMM form factor, and presumably when DDR5 memory comes out, an upgrade with higher performance and higher density will be available.
You will also note that the overall I/O bandwidth was only up by 5.6 percent even though the machine has switched from PCI-Express 4.0 to PCI-Express 5.0 controllers. The reason why is that IBM is using half as many PCI-Express 5.0 lanes as it was using PCI-Express 4.0 lanes to deliver about the same bandwidth. With the Power11 chip due in about three years, IBM will be switching to full on, 16-lane PCI-Express 5.0 slots that will provide greater than 60 GB/sec of aggregate bandwidth each, compared to around 31.5 GB/sec for 16 lanes of PCI-Express 4.0 or eight lanes of PCI-Express 5.0.
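The lane math here is easy to reproduce. The sketch below assumes the standard per-lane signalling rates (16 GT/sec for PCI-Express 4.0, 32 GT/sec for PCI-Express 5.0) and 128b/130b line encoding, and it ignores packet and protocol overhead, so treat the results as one-direction ceilings rather than measured throughput. The helper name `pcie_gb_per_sec` is made up for this illustration.

```python
# Raw per-lane signalling rates in gigatransfers per second.
GT_PER_SEC = {"4.0": 16.0, "5.0": 32.0}

# 128b/130b line encoding: 128 payload bits for every 130 bits on the wire.
ENCODING = 128 / 130


def pcie_gb_per_sec(gen, lanes):
    """Approximate one-direction bandwidth in GB/sec, ignoring protocol overhead."""
    return GT_PER_SEC[gen] * ENCODING / 8 * lanes


for gen, lanes in [("4.0", 16), ("5.0", 8), ("5.0", 16)]:
    print(f"PCI-Express {gen} x{lanes}: {pcie_gb_per_sec(gen, lanes):.1f} GB/sec")
```

Halving the lane count while doubling the per-lane rate leaves the slot bandwidth where it was, which is why the generational switch alone does not move the aggregate I/O number much.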
Finally, look at the clock frequencies on the Power10 chip. The 10-core has a base speed of 3.65 GHz and turbos up to 3.9 GHz; the 12-core has a base speed of 3.6 GHz and turbos up to 4.15 GHz; and the 15-core has a base speed of 3.55 GHz and turbos up to 4 GHz. These speeds are absolutely consistent with the Power9 chips, and when you factor in that the Power10 instruction set allows more work to get done per clock – and more different kinds of work, now including matrix math for machine learning and other high performance computing routines – and has more cores in the box, IBM has clearly been able to pull a lot of levers to get that incremental performance that keeps it more or less on track with the expected Moore’s Law improvements.
It would be better if IBM upgraded processors every two years for the sake of its marketing, but it is upgrading every three years for the sake of its hardware and software engineers and for its customers, who apparently do not want to consume new hardware faster than that rate – and often slower than this. They tend to skip generations, in fact, and thus the Power E1080 is aimed more at customers using high-end Power7, Power7+, and Power8 systems than it is at those using the Power E980.
We have lots more coverage of the Power E1080 to come, so stay tuned.