Performance Choices For Power7+ Servers Could Be Complicated
September 10, 2012 Timothy Prickett Morgan
In last week’s issue of The Four Hundred, I gave you the low-down on the forthcoming Power7+ processors from IBM, which are expected to be launched in new Power Systems machines before the end of the year. This information came from a presentation Big Blue’s chip experts made at the Hot Chips 24 conference, and it was refreshing to see IBM talk about the chips a bit before they came out in systems, giving customers some time to ponder their options.
If you are an IBM i customer, you are going to need the time, because the situation could get a little complicated. Some variants of the forthcoming Power7+ systems will be optimized for higher-speed single-chip modules (SCMs), which have one eight-core die in a socket, and others will be optimized for dual-chip modules (DCMs), which put two dies in a single package and double up the cores and caches at a slightly lower clock speed. The DCMs, like prior double-stuffed sockets in the Power5+ and Power6+ generations, run at a lower clock speed or else the sockets would burst into flame, but the extra cores and threads allow the double-stuffed system to do more work on multithreaded jobs.
Conceptually, here is how Scott Taylor, one of the chip architects who worked on both the Power7 and Power7+ processors, illustrated the differences between the SCMs and DCMs.
You will remember that with the Power7 chip, IBM also had a quad-chip module, or QCM, that put four eight-core Power7 chips on a single package, which was used in the “Blue Waters” Power 775 machine that was going into the University of Illinois’ National Center for Supercomputing Applications. IBM canceled the Blue Waters contract with the University of Illinois last summer, presumably because it could not make money on the deal, and Cray subsequently won a $188 million deal to supply a Blue Waters machine that will pack somewhere between 10 and 20 petaflops of floating-point oomph using a hybrid CPU-GPU design. I said at the time that IBM should resurrect this machine as a 1.5 million CPW server drawer for IBM i-based clouds. Considering that the stupidly fast Torrent hub/switch interconnect can lash together 2,048 Power 775 nodes, you could build a cloud in one data center with roughly 2.1 billion (that’s billion with a B) CPWs of aggregate computing oomph across 1,365 compute nodes with 349,440 cores, plus 342 storage nodes. At a P05 software tier for the cores, the hardware and the software might cost $3 billion. Now, with the Power7+, IBM could double up the core count for the same price by octo-stuffing the Power 775 processor modules.
That’s silliness, of course. IBM is not going to do that unless some very big supercomputer centers pay it to do so, and it is fair to say that it will never certify IBM i on a Power 775 node or any follow-on.
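Silly or not, the cloud arithmetic is easy to check. Here is a minimal Python sketch that uses only the figures quoted above; the constant and function names are my own, and it assumes every compute node delivers the same 1.5 million CPW rating, which is a simplification.

```python
# Figures quoted in the article for the hypothetical Power 775 cloud.
CPW_PER_NODE = 1_500_000   # "1.5 million CPW server drawer"
COMPUTE_NODES = 1_365
STORAGE_NODES = 342
TOTAL_CORES = 349_440
MAX_NODES = 2_048          # nodes the interconnect can lash together


def aggregate_cpw(nodes: int, cpw_per_node: int) -> int:
    """Total CPW, assuming every compute node has the same rating."""
    return nodes * cpw_per_node


def cores_per_node(total_cores: int, nodes: int) -> int:
    """Cores implied per compute node; raises if the split is uneven."""
    if total_cores % nodes != 0:
        raise ValueError("cores do not divide evenly across nodes")
    return total_cores // nodes


total_cpw = aggregate_cpw(COMPUTE_NODES, CPW_PER_NODE)
implied_cores = cores_per_node(TOTAL_CORES, COMPUTE_NODES)
nodes_used = COMPUTE_NODES + STORAGE_NODES

print(f"Aggregate CPW: {total_cpw:,}")
print(f"Cores per compute node: {implied_cores}")
print(f"Nodes used: {nodes_used} of {MAX_NODES}")
```

The product comes to a little over 2 billion CPW, which is where the 2.1 billion figure comes from after rounding, and the core count works out to 256 cores per compute node, consistent with eight QCMs of four eight-core chips each.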
But it is clear that the double-stuffed Power7+ sockets present customers with some interesting options, and depending on pricing, which is a complete unknown at this point, they might be as economically interesting as they are technically appealing.
IBM has not given precise clock speeds for the Power7+ processors, but Taylor said at the Hot Chips conference that the shrink to 32 nanometer processes from the 45 nanometer processes used to make the Power7 processors would allow for the clock speeds to rise by 25 percent with the Power7+ chips. The Power7 chips run at between 3 GHz and 4 GHz at the moment, with the 3 GHz chips used in entry Power 720 rack and tower systems and PS701 blades and the 4 GHz chips running in the Power 795 beast. So we can expect Power7+ clock speeds to be in the range of 3.75 GHz to 5 GHz, yielding at least a 25 percent boost in single-thread performance, and very likely more than that since the L3 cache has been expanded by a factor of 2.5 to 80 MB across an eight-core chip.
Taylor also said in passing that the DCM variants of the chips would run at approximately the same speed as the current Power7 chips. So that is between 3 GHz and 4 GHz, yielding roughly the same potential clock cycles for apps but with 2.5 times the L3 cache and other tweaks (like larger main memory, presumably faster PCI-Express I/O, and on-chip accelerators for memory compression and other functions). So even at the same clock speeds, it seems likely that the DCM variants of the Power7+ chips will yield more performance than the current Power7s of equivalent clock speeds.
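The clock speed projections above are simple scaling, sketched here in Python. The names are hypothetical, and the sketch assumes the 25 percent uplift applies evenly across the whole Power7 range, which IBM has not confirmed.

```python
# Current Power7 clock range in GHz, per the article.
POWER7_RANGE_GHZ = (3.0, 4.0)
SHRINK_UPLIFT = 0.25  # clock gain attributed to the 32 nm shrink


def scale_range(clock_range, uplift):
    """Apply a fractional clock uplift to a (low, high) GHz range."""
    low, high = clock_range
    return (low * (1 + uplift), high * (1 + uplift))


# SCMs get the full uplift; DCMs are assumed to stay at Power7 speeds.
scm_range = scale_range(POWER7_RANGE_GHZ, SHRINK_UPLIFT)
dcm_range = scale_range(POWER7_RANGE_GHZ, 0.0)

print(f"Power7+ SCM: {scm_range[0]:.2f} to {scm_range[1]:.2f} GHz")
print(f"Power7+ DCM: {dcm_range[0]:.2f} to {dcm_range[1]:.2f} GHz")
```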
What I can tell you for sure about Power7 versus Power7+ SCM and DCM performance is this: The Power7+ chips deliver considerably more. Here’s a chart that Taylor tossed up in his presentation, showing the relative performance normalized across the same number of cores, presumably using top-end parts in the same thermal bands to make the comparison fair (he was not exactly sure what the comparisons were, and admitted this):
If the bars on this chart are proportional to the performance increase, then a Power7+ SCM is delivering around 32 percent better performance on ERP software and about 37 percent better performance on raw integer work than the Power7 SCM it replaces. For work where threads and L3 cache matter, the performance gain from the Power7 to the Power7+, core for core and SKU for SKU, is around 82 percent for OLTP workloads and around 55 percent for Java workloads.
As you can see from the chart, raw integer work and ERP software performance do not spike as much using the Power7+ DCMs as do database and Java work, where threads and L3 cache are king. A Power7+ DCM has more than twice the performance of a Power7 chip running at about the same clock speed, which just goes to show you how important cache really is.
But why stop there?
With the power gating features on the Power7+ chips (which I detailed in last week’s issue), the Turbo Core mode clock speed for these Power7+ chips, whether an SCM or a DCM, could be considerably higher still.
In the prior Power7 machines, only a few of which have Turbo Core mode, the clock speed uplift from turning off half the cores within the die was only 5.6 to 6.3 percent, which is not a lot of extra clock speed. But when you turn off half the cores, you keep all of the L3 cache on, and that doubles the L3 cache available to each remaining core. This can have a dramatic effect on the performance of cache-sensitive workloads like Java apps and the databases they smack.
IBM now has power gating on individual cores and caches on the Power7+ chip, which it did not have with the Power7s. Imagine that it can not only turn off elements not in use, but also boost the clock speeds a bit higher in Turbo Core mode, and do so on the fly. Imagine again a special key that might permanently block activating all of the cores on a machine equipped with either the SCM or DCM versions of the Power7+ chips. Let’s see what happens.
Maybe Turbo Core mode can boost clock speeds on the SCMs to between roughly 4.1 GHz and 5.5 GHz, which is a 10 percent bump in performance from the clocks alone. And having 80 MB spread across those four cores in the modified Power7+ (and remember that Intel’s new Xeon E5 chip has 20 MB of L3 cache across eight cores running at a top speed of 2.9 GHz) should also boost performance. On OLTP work, if 25 percentage points of the 82 percent gain come from clock speeds, then boosting the L3 cache by a factor of 2.5 adds another 57 points or so.
Cache misses on such a fast processor are a big deal, which is why IBM has 48 MB of L3 cache across six cores on the new z12 mainframe engine, which clocks at 5.5 GHz, plus another 384 MB of L4 cache front ending that L3 cache.
Now ask yourself this: What would be the effect of boosting the L3 cache per core by another factor of two? By my simple math, it should be around 45 percent more oomph, if that L3 cache can be kept fed from the chip interconnect and main memory. So, a Turbo Core version of the Power7+ chip with all of its L3 caches on and four cores power gated and shut down might yield another 55 percent more oomph per core (45 percent from the cache plus 10 percent from the clocks). When you do the math on this hypothetical Power7+ processor, that gives you 40 percent more OLTP throughput per socket with half as many cores as the Power7 chip it would replace.
In other words, that’s 40 percent more work done than an eight-core Power7 socket with half the IBM i software bill because you are only running on four cores with the Power7+ chip.
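For those who want to replay that math, here is a rough Python sketch. It is my reading of the estimate, not an IBM model: the names are hypothetical, the gains within one step are added together, and the Turbo Core gain is assumed to compound on top of the Power7+ per-core OLTP gain read off the chart.

```python
# Estimates from the article, as fractions over a Power7 core.
OLTP_GAIN = 0.82         # Power7+ SCM over Power7 on OLTP, read off the chart
CLOCK_GAIN = 0.25        # portion attributed to higher clocks
EXTRA_CACHE_GAIN = 0.45  # guess for doubling L3 per core again
TURBO_CLOCK_GAIN = 0.10  # guess for the Turbo Core clock bump

POWER7_CORES = 8
TURBO_CORES = 4          # half the cores power gated

# Portion of the OLTP gain left over for the 2.5X larger L3 cache.
cache_gain = OLTP_GAIN - CLOCK_GAIN

# Turbo Core mode: cache and clock effects added together.
turbo_gain = EXTRA_CACHE_GAIN + TURBO_CLOCK_GAIN

# Per-core throughput relative to one Power7 core (assumed to compound).
per_core = (1 + OLTP_GAIN) * (1 + turbo_gain)

# Socket throughput relative to a full eight-core Power7 socket.
socket_ratio = per_core * TURBO_CORES / POWER7_CORES

print(f"L3 cache share of OLTP gain: {cache_gain * 100:.0f} points")
print(f"Turbo Core per-core gain: {turbo_gain * 100:.0f} percent")
print(f"Socket gain over Power7: {(socket_ratio - 1) * 100:.0f} percent")
```

Under those assumptions the four-core socket comes out about 41 percent ahead of the eight-core Power7 socket, which is the 40 percent figure after rounding. If the gains combine some other way in the real silicon, the answer moves accordingly.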
Test that idea out for a bit. Mull it around. And then ask yourself why IBM doesn’t offer such a machine, and call up your IBM sales rep and ask for one. And don’t forget to add flash to it so the machine keeps the main memory and cache memories well fed.
The same thing could be done with the DCMs as well, of course. You could double stuff the sockets, turn off half or even three quarters of the cores, and run the cores that do work in Turbo Core mode all the time. IBM could even sort through the bins, find Power7+ chips where many of the cores are duds but the L3 caches are fine and the good cores run at high speed, and make special database versions of the Power Systems machines. Imagine a DCM version with four cores activated, with the cores running at between 3.3 GHz and 4.4 GHz, with 160 MB across those four cores, or a stunning 40 MB per core. How much work would such a socket do? It looks to me like it might do 25 percent more work than an eight-core Power7 socket, at about half the IBM i software bill.
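To put the cache numbers side by side, here is a small Python sketch of L3 cache per active core for the configurations discussed in this story. The labels and names are mine, and the Power7 figure is derived from the statement that 80 MB is 2.5 times the prior cache.

```python
POWER7_PLUS_L3_MB = 80.0                  # per eight-core Power7+ chip
POWER7_L3_MB = POWER7_PLUS_L3_MB / 2.5    # derived: Power7+ has 2.5X the L3

# (total L3 in MB, active cores) for each configuration in the article.
CONFIGS = {
    "Xeon E5, 8 cores": (20.0, 8),
    "Power7 SCM, 8 cores": (POWER7_L3_MB, 8),
    "Power7+ SCM, 8 cores": (POWER7_PLUS_L3_MB, 8),
    "Power7+ SCM, 4 cores (hypothetical)": (POWER7_PLUS_L3_MB, 4),
    "Power7+ DCM, 4 cores (hypothetical)": (2 * POWER7_PLUS_L3_MB, 4),
}


def l3_per_core(total_mb: float, active_cores: int) -> float:
    """L3 cache in MB available to each active core."""
    return total_mb / active_cores


per_core_mb = {name: l3_per_core(mb, cores) for name, (mb, cores) in CONFIGS.items()}

for name, mb in per_core_mb.items():
    print(f"{name}: {mb:.1f} MB per core")
```

This only compares cache capacity. It says nothing about whether the chip interconnect and main memory can keep that much cache fed, which is the big if in the whole thought experiment.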
This is a thought experiment, and one that IBM should play around with before it launches the Power7+ servers. Even if it is not formally launched, I see no reason why IBM could not offer it on a special bid basis.