Max Thread Room
September 28, 2020 Timothy Prickett Morgan
For a lot of organizations that buy servers and create systems out of them, the overall throughput of each single machine is the most important performance metric they care about. But for a lot of IBM i shops and indeed even System z mainframe shops, the performance of a single core is the most important metric because most IBM i customers do not have very many cores at all. Some have only one, others have two, three, or four, and most do not have more than that although there are some very large Power Systems running IBM i. But that is on the order of thousands of customers against a base of 120,000 unique customers.
We are, therefore, particularly interested in how the performance of the future Power10 processors will stack up against the prior generations of Power processors at the single core level. It is hard to figure this out with any precision, but in its presentation in August at the Hot Chips conference, Big Blue gave us some clues that help us make a pretty good estimate of where the Power10 socket performance will be. We can work backwards from there to get a sense of where the Power10 cores could end up in terms of the Commercial Processing Workload (CPW) benchmark ratings that IBM uses to gauge the relative performance of IBM i systems.
Some review is in order to get an appreciation of the performance leaps IBM has been able to make in recent generations. With the move from the Power7+ to the Power8, the latter of which was launched in 2014, the raw integer performance at a socket level went up by a factor of 2.2X, performance on commercial workloads as gauged by TPC-C and other tests went up by about 2.7X, and performance on Java workloads went up by a little more than 2.5X. These numbers were given out at Hot Chips in 2013. In 2016, a year before the Power9 chip was launched and when Big Blue was previewing the performance of the Power9 socket, IBM said the per socket performance of Power9 would be about 1.8X higher in terms of raw integer performance and a little bit higher for commercial applications. With Power10, compared to Power9, here is what Big Blue says to expect on a per socket basis in terms of performance:
Thanks to the increase in cores (from 12 in Power9 to 16 with Power10 but only 15 will be activated to improve the effective yields on the 7 nanometer processes that will be employed to make them at Samsung’s foundries) and a complete reworking of the cores from the ground up, the raw integer performance of the Power10 socket is north of 3X and on enterprise workloads – what used to be called commercial in the prior presentations – it is going up by 3X flat. IBM has used a design spec of 4 GHz for the past several generations of Power processors, and I believe it is the same with Power10.
Not that this matters for IBM i shops today, but the vector and matrix math performance of the Power10 chip socket is going to increase by even more, thanks to wider math units and mixed precision support within them, which is important for machine learning training and inference workloads. Machine learning inference will be important within a lot of applications in the coming years, including those running on IBM i. Mark my words. So even though this is not particularly relevant today, it is worth noting, so here is the data so you have it:
Here is why this will be important. Inference is not necessarily going to be a super-heavy workload like it is for the hyperscalers, and it is very likely to be embedded in all aspects of the software stack, from self-monitoring and self-healing features in operating systems, databases, and middleware to AI algorithms sprinkled throughout the applications in the systems of record and to the AI routines that are a significant part – if not the dominant part – of the systems of engagement. In many cases, the machine learning training workload will require dedicated hardware like hybrid CPU-GPU systems to run, but the machine learning inference, which is the application based on the neural network model rather than the thing that builds the model, will run just fine on CPUs – provided they have support for the mixed precision floating point and integer math that these models use to boost throughput. The Power10 chip will pack 10X the raw double-precision (64-bit) floating point math of Power9 at the same 4 GHz clock speed, which is impressive in its own right, and will have 10X the oomph on single precision (32-bit) floating point, which is used for genomics, signal processing, and machine learning inference. But by supporting the BFloat16 format created by Google for its TPU inference engines (which has the dynamic range of 32-bit FP but only takes up half as much space in a math unit because it has lower precision), inference will run more than 15X faster on Power10 than on Power9, and using even lower precision 8-bit integer formats (INT8), the inference performance goes up by what looks like 22X. This means a lot of companies will not have to deploy inference accelerators based on FPGAs, GPUs, or custom ASICs. This runs counter to IBM’s whole accelerated computing message for the past decade, but there will still be many organizations that need these devices because their inference workloads are so huge.
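To make the BFloat16 tradeoff concrete, here is a minimal Python sketch of the format: it keeps the sign bit, all eight exponent bits, and only the top seven mantissa bits of a 32-bit float. The helper names are our own, and the conversion simply truncates the low 16 bits, which is an assumption made for brevity; real hardware typically rounds to nearest instead.

```python
import struct


def float32_bits(x: float) -> int:
    """Return the IEEE 754 single-precision bit pattern of x as an integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def to_bfloat16(x: float) -> float:
    """Illustrative conversion: keep only the upper 16 bits of the float32 pattern.

    This truncates toward zero (an assumption for simplicity); the sign bit and
    the full 8-bit exponent survive, but only 7 mantissa bits remain.
    """
    bits = float32_bits(x) & 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]


if __name__ == "__main__":
    # Very large and very small magnitudes survive; fine detail near 1.0 does not.
    for value in (1.0, 1.001, 3.0e38, 1.0e-30):
        print(value, "->", to_bfloat16(value))
```

In this sketch, 1.001 collapses to 1.0 because the spacing between representable values near 1.0 is 1/128, while magnitudes near the top and bottom of the 32-bit range stay within about 1 percent of their original values.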
It always comes down to cases in IT, as you well know.
Which brings us to the history of the Power family of chips running OS/400 and IBM i since 2001, when the dual-core Power4 chip first shipped and put IBM on track to take over the proprietary and Unix server business as it shrank mightily at the same time. (Did IBM win and the other guys, namely Hewlett Packard and Sun Microsystems, lose? I think both happened.) Take a look at this table:
In the table above, we show the interplay of time, chip process technology, transistor count, cores in the chips with the most simultaneous multithreading (which IBM had way back in the 1990s with the “Star” family of PowerPC chips used in the AS/400 line but which was not in the Power4 and Power4+ chips), the maximum threads available per socket given that SMT per core, clock speed, and ultimately per core performance on the CPW test. These are the salient characteristics of the monolithic chip available in each generation. IBM has had both chiplets and multi-chip modules at many times in the past, so this is not new at all. Power5 and Power5+ had dual-chip modules, and Power6+ and Power7+ did, too. Power8 had a half chiplet, which had only six cores instead of twelve, and two of these were put into one socket.
The Power10 chip will come in a dual-chip module (DCM) variant, as we have previously explained, and it will be an absolute screamer. In the table above, the Power10 data is for the single chip implementation, which has 16 cores but only 15 of them will be activated. IBM is just being honest that at least one out of 16 cores coming out of the Samsung fabs is going to be a dud, right off the bat, because of the newness of the 7 nanometer process and Samsung’s learning curve in making fat chips like Power10. When IBM and Samsung move to 4 nanometer processes – we think IBM will skip the 5 nanometer generation – with Power11, we think IBM will be able to boost the size of the Power11 chiplet a little and still double up the core count on the chiplet while keeping the power consumption in check by keeping a governor on the clock speed. We think IBM will have a Power11 DCM and quite possibly even a QCM – that is a quad-chip module – which will significantly boost the per socket performance. We would not be surprised to see Power10 and Power11 share a common socket, but that would probably limit Power11 to a DCM.
Given all of this, and the rough performance data that IBM has supplied above, we expect a single Power10 core to have somewhere around 43,000 CPWs of oomph. That is about the same performance as an eight-way, 16-core Power4+ system would have had back in 2002, to give you a frame of reference. That’s a 20X improvement in performance per core over 20 years, which is not too shabby at all. And a single socket of these Power10 cores, if all 16 could be fired up and used, would have about as much performance as a 12-socket Power7 or Power7+ system from 2010 and 2012, respectively, and about the same as a six-socket Power8 machine (if IBM made one) or a three-socket Power9 machine (which IBM definitely does not make on purpose, but you could populate a Power E950 with only three processors to get there).
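As a rough cross-check on those socket comparisons, the generation-to-generation raw integer factors quoted earlier (2.2X from Power7+ to Power8, about 1.8X from Power8 to Power9, and north of 3X from Power9 to Power10) can simply be chained together. The sketch below does that in Python; treating "north of 3X" as exactly 3.0 is our assumption, and the names are illustrative rather than anything IBM publishes.

```python
# Raw integer per-socket uplift factors quoted in this article, newest first.
# Each entry reads: (older generation, factor to reach the next generation).
# "North of 3X" is taken as exactly 3.0 here, which is an assumption.
SOCKET_FACTORS = [
    ("Power9", 3.0),   # Power9  -> Power10
    ("Power8", 1.8),   # Power8  -> Power9
    ("Power7+", 2.2),  # Power7+ -> Power8
]


def sockets_equivalent_to_power10(factors):
    """Return how many sockets of each older generation match one Power10 socket."""
    equivalents = {}
    running = 1.0
    for generation, factor in factors:
        running *= factor  # compound the uplift back through the generations
        equivalents[generation] = running
    return equivalents


equivalents = sockets_equivalent_to_power10(SOCKET_FACTORS)

if __name__ == "__main__":
    for generation, count in equivalents.items():
        print(f"One Power10 socket ~ {count:.1f} {generation} sockets")
```

Chaining the integer factors lands at 3, about 5.4, and about 11.9 sockets, which is in the same ballpark as the three-socket, six-socket, and 12-socket comparisons, with the caveat that these are vendor-quoted ratios and not measured CPW ratings.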
We don’t think IBM can wring as much performance out of a Power11 core as it did in the jump from Power9 to Power10, which was a complete rewrite that will yield on the order of 2.4X more raw oomph per core, on average. But we do think IBM can do 1.5X or so per core with the jump from Power10 to Power11, and that puts the CPWs per core at around 65,000 or so. That is about the same as a single Power7+ processor with 12 cores from 2012.
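The per-core arithmetic behind these figures is simple enough to sketch. Assuming the enterprise socket uplift of 3X, 12 cores on the Power9 chip, and 15 active cores on Power10 (all figures from this article), the per-core gain works out to the 2.4X cited above, and our 1.5X guess for Power11 turns 43,000 CPWs into roughly 65,000. The function and constant names below are our own, and the CPW numbers are estimates, not IBM ratings.

```python
POWER9_CORES = 12                # cores per Power9 chip, per the article
POWER10_ACTIVE_CORES = 15        # 16 on the die, 15 activated
SOCKET_UPLIFT = 3.0              # enterprise workloads, Power9 -> Power10
POWER10_CPW_PER_CORE = 43_000    # our estimate, not an IBM rating
POWER11_PER_CORE_UPLIFT = 1.5    # our guess for Power10 -> Power11


def per_core_uplift(socket_uplift: float, old_cores: int, new_cores: int) -> float:
    """Convert a per-socket uplift into a per-core uplift.

    The socket gain is divided by the growth in core count, leaving the part
    of the gain that comes from each core doing more work.
    """
    return socket_uplift * old_cores / new_cores


def project_cpw(cpw_per_core: float, uplift: float) -> float:
    """Scale a per-core CPW estimate by an assumed generational uplift."""
    return cpw_per_core * uplift


core_gain = per_core_uplift(SOCKET_UPLIFT, POWER9_CORES, POWER10_ACTIVE_CORES)
power11_cpw = project_cpw(POWER10_CPW_PER_CORE, POWER11_PER_CORE_UPLIFT)

if __name__ == "__main__":
    print(f"Power9 -> Power10 per-core uplift: {core_gain:.1f}X")
    print(f"Projected Power11 CPW per core: {power11_cpw:,.0f}")
```

The same two functions can be rerun with different core counts or uplift guesses to see how sensitive the per-core estimate is to each assumption.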
Hopefully, IBM will unlock this performance in the future Power10 and Power11 processors for IBM i workloads at a reasonable price – and find lots of other work for the system to do so customers buy more capacity, not less. You can’t buy a quarter of a core, after all. With all of this performance, the trick is to get customers to use it.