IBM Power7+ Chips Give Servers A Double Whammy
September 4, 2012 Timothy Prickett Morgan
If you are in the middle of acquiring a Power7-based server to run IBM i, AIX, or Linux workloads, you might want to give it a rethink about right now. Because IBM has some interesting twists that are coming out with the next-generation Power7+ processors that will dramatically affect their bang for the buck and therefore what you should be paying now for any processing capacity or whole systems that you acquire.
IBM has been vague about exactly when the Power7+ chips and their new Power Systems servers will ship, but has said that the new processors will come out before the end of the year. As is the case with most moves to new processors, customers can expect better price/performance on the new machines relative to the old ones. In many cases, IBM holds prices for raw processing and operating systems the same and just boosts the performance of the processors and the scalability of the systems. This was certainly the case with the jump from Power6+ to Power7 in early 2010, and given that IBM only wants to compete on price with its PowerLinux iron, which has lower pricing for processors and much lower pricing for memory and disk on machines that are restricted to running Linux, I expect IBM to keep prices more or less the same for comparable machines with the same socket and core count.
But there is a new twist with the Power7+ chips: IBM is offering variants of the systems that exploit the die shrink (from the 45 nanometer process used to etch the Power7 chips down to the 32 nanometer process used to make the Power7+ chips) to slow down the clocks and cram two whole Power7+ chips into a single socket. IBM did this CPU doubling with the Power5+ processors in October 2005 (although i5/OS V5R4 could not run on the machines) and again with the Power6+ chips in October 2008 (although IBM did not admit these were Power6+ chips at the time, you will remember). I can't think of everything, but in hindsight, given the process shrink and the certain knowledge I had over the past several months that the Power7+ would have eight cores and a lot more cache, I should have expected that IBM would be double-stuffing the Power7+ sockets to give customers twice as many cores and threads for workloads that need this. Double-stuffing two whole chips into one package is a hell of a lot smarter than creating a big and complex 16-core processor that would have poor yields due to its size on any new process and therefore be prohibitively expensive.
The wonder is why Intel doesn't double-stuff its sockets. Advanced Micro Devices has done so for the past two generations with its Opteron 6100 and 6200 processors, and Hewlett-Packard double-stuffed as well, both when it made its own PA-RISC chips and when Itanium core counts were too low, to boost the thread and core count.
The trouble with double-stuffing is that you have to lower the clock speed so the sockets don't fry. With the Power5+ double-stuffers, IBM dropped the clock speed down from the nominal 1.9 GHz to 1.5 GHz, but could get twice as many cores and threads in each socket, thereby allowing somewhere between 55 and 65 percent more work to be done in the system, provided that system's applications (like databases and Java virtual machines) are heavily multithreaded. With the Power6+ double-stuffed variants sold only in the Power 560 server, which was allowed to run i5/OS and IBM i, IBM put two dual-core Power6+ chips in a single socket running at 3.6 GHz, which was quite a bit slower than the 4.4 GHz and 5 GHz Power6+ chips available in the Power 570 machine. But IBM could deliver around 20 percent more oomph per socket running DB2/400 database applications as gauged by the Commercial Performance Workload (CPW) benchmark test. It is a pity that IBM doesn't charge for OS/400 and IBM i on a per-socket basis. This extra performance was welcome, but with per-core pricing the software bill would be twice as high for 20 percent more throughput, and at the prices IBM charges for OS/400 and IBM i, that would be just plain idiotic.
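To put numbers on that trade-off, here is a quick sketch (in Python, with a hypothetical helper name) using the Power5+ figures above: twice the cores, clocked down from 1.9 GHz to 1.5 GHz, lands right in that 55 to 65 percent range in the ideal case where the workload scales perfectly across threads.

```python
# Rough throughput arithmetic for a double-stuffed socket, using the
# Power5+ numbers in the text. socket_speedup is a hypothetical helper;
# real gains also depend on how well the workload scales across threads.

def socket_speedup(cores_factor, old_clock_ghz, new_clock_ghz):
    """Ideal throughput ratio: more cores, each running at a slower clock."""
    return cores_factor * (new_clock_ghz / old_clock_ghz)

# Power5+ double-stuffer: twice the cores, clocked down from 1.9 to 1.5 GHz.
ideal = socket_speedup(2, 1.9, 1.5)
print(f"Ideal gain per socket: {(ideal - 1) * 100:.0f} percent")
```

The same arithmetic explains the Power6+ case: two 3.6 GHz dual-core chips per socket against one 4.4 GHz chip gives you more aggregate cycles, but far less than double.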
Depending on what IBM does with software pricing on IBM i 7.1, these double-stuffers might be interesting, and I will share some thoughts on that in next week's issue.
But what is immediately useful is what Scott Taylor, a chip architect who worked on both the Power7 and Power7+ processors (and on the power-saving features of the designs in particular), told the audience at the Hot Chips conference last week in Cupertino, California: the clock speeds on the Power7+ chips will be 25 percent higher than on similar models in the Power7 family of processors, which come with four, six, or eight cores running at between 3 GHz and 4 GHz at the moment. So that puts your Power7+ range at 3.75 GHz to 5 GHz, and that is a 25 percent boost in throughput for the same operating system license dollar, and maybe the same hardware dollar with slightly lower memory and disk prices, too.
So, as I say, be more careful about what you are paying for anything you buy now. There's nothing wrong with a Power6, Power6+, or Power7 machine. Just don't pay too much for it if you need the capacity right now and cannot wait until later this year.
The Power7+ feeds and speeds
The Power7+ chip is implemented in IBM's familiar high-k metal gate processes, which include copper and silicon-on-insulator technologies as well. The 32 nanometer process shrink allows IBM to cram 2.1 billion transistors in a die that has a surface area of 567 square millimeters. Taylor said that because it takes only two transistors to implement a bit in embedded DRAM (eDRAM) instead of six transistors for static RAM (SRAM), the Power7 and Power7+ chips alike (as well as the z11 and z12 processors for mainframes, which also use eDRAM for L3 cache memory) are a lot smaller than they might otherwise be or, alternatively, have a lot more L3 cache memory than IBM could cram onto the chip using SRAM. The fact that eDRAM runs slower than SRAM is more than offset by the big wads of L3 cache that are available on the die. In the case of the Power7 chips, this eDRAM was one of the key factors that enabled IBM to double the performance of the chip over the Power6 processors–that and a new pipeline and core design, of course. IBM is boosting the L3 cache on the Power7+ to 80 MB, which no one even comes close to offering and which will again have a big positive impact on performance. That is a factor of 2.5 increase in the L3 cache, as I have explained in past issues of The Four Hundred.
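That transistor math is easy to check. The sketch below is a back-of-the-envelope estimate only, ignoring cache tags, ECC bits, and peripheral logic: applying the two-versus-six transistors-per-bit figures to the 80 MB L3 cache shows that an SRAM implementation alone would eat nearly twice the Power7+'s entire 2.1 billion transistor budget.

```python
# Back-of-the-envelope transistor counts for the Power7+'s 80 MB L3 cache,
# using the 2-transistors-per-bit eDRAM versus 6-transistors-per-bit SRAM
# figures from the text. Ignores tags, ECC, and peripheral circuitry.

L3_BYTES = 80 * 1024 * 1024
L3_BITS = L3_BYTES * 8

edram = L3_BITS * 2   # about 1.34 billion transistors
sram = L3_BITS * 6    # about 4.03 billion, versus a 2.1 billion chip budget

print(f"eDRAM: {edram / 1e9:.2f}B transistors, SRAM: {sram / 1e9:.2f}B")
```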
Other than a bunch of tweaks and tunings and a math unit that has double the single-precision floating point math performance of the Power7 chip, the Power7+ core is more or less the same. At the bottom of the core element of the chip is the 256 KB of L2 cache memory. Above that on the right are two load store units (LSU) and a condition register unit (CRU), a branch register unit (BRU), and an instruction fetch unit (IFU). Each Power7+ core has 32 KB of L1 instruction cache and 32 KB of L1 data cache. The instruction scheduling unit (ISU), which is where the out-of-order execution in the chip gets handled, is on the top right, alongside four double-precision vector math units (VSUs); in the top middle are the two fixed point units (FXUs). Above them is the decimal floating point unit (DFU) that does base-10 money math. There is a new element called an NCU, and I have no idea what that does.
The Power7+ chip does have some big differences other than the increased L3 cache that is wrapped around each core. Here's what it looks like with its labels on (you've seen it in The Four Hundred unlabeled, of course):
The chip has two memory controllers that hang off the L3 cache and chip interconnect fabric in the center uncore region of the chip. In the middle at the bottom of the chip are remote SMP links out to other processor sockets, which are used to preserve cache coherency across multiple server nodes and present a single memory space to an operating system or hypervisor. At the top of the middle of the chip are local SMP links that allow the cores to talk to each other, share work, and maintain cache coherency within a die, as well as a nifty new set of accelerator functions that all of the cores can access through the SMP interconnect and use to significantly speed up certain math-heavy operations with very specific algorithms.
There are three new accelerators that come with the Power7+ chip, and here is how they link to each other and to the memory and SMP features of the Power7+ chips:
The 842 accelerator is designed, as I have suggested, to handle the compression and decompression associated with using the Active Memory Expansion memory compression algorithm that IBM shipped with the Power7 servers running only AIX two years ago. The name refers to the 8, 4, and 2 byte parsing that the memory compression algorithm uses, which can double the effective main memory of an AIX system. By putting an accelerator into the Power7+ iron, IBM can offload this compressing and uncompressing from the processor and memory controller, freeing the core up to do other work. It should also be possible to invoke 842 compression/decompression for memory on behalf of IBM i and Linux, and considering the high prices IBM charges for Power Systems memory, IBM should be taken to task for not offering it on IBM i at least. (PowerLinux boxes already have cheap memory, so this doesn’t matter as much.)
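For the curious, here is a toy sketch of that 8/4/2-byte parsing idea: try to match the next 8 bytes against recently seen data, fall back to 4-byte and then 2-byte matches, and emit a literal otherwise. This illustrates the parsing strategy only and is emphatically not the real 842 bitstream format; the byte costs are made-up placeholders.

```python
# Toy illustration of 8/4/2-byte parsing, NOT the real 842 algorithm.
# Match the longest recently-seen chunk (8, then 4, then 2 bytes) and
# charge a flat placeholder cost for references and literals alike.

def toy_842_cost(data: bytes) -> int:
    """Return an approximate compressed size in bytes."""
    seen = {8: {}, 4: {}, 2: {}}   # chunk -> last position it was observed
    i, cost = 0, 0
    while i < len(data):
        for size in (8, 4, 2):
            chunk = data[i:i + size]
            if len(chunk) == size and chunk in seen[size]:
                cost += 2          # pretend a back-reference costs 2 bytes
                break
        else:
            size = 1
            cost += 2              # literal byte plus template overhead
        # remember the chunks starting at this position for later matches
        for s in (8, 4, 2):
            if i + s <= len(data):
                seen[s][data[i:i + s]] = i
        i += size
    return cost

page = b"ABCDEFGH" * 512           # highly repetitive 4 KB "memory page"
print(f"{len(page)} bytes -> ~{toy_842_cost(page)} bytes")
```

On a repetitive page like this, almost everything after the first few bytes becomes an 8-byte back-reference, which is the kind of behavior that lets Active Memory Expansion roughly double effective memory on compressible workloads.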
The Power7+ chip also has three accelerators that can process AES encryption as well as SHA (Secure Hash Algorithm) hashing routines, work which can now be offloaded from the processors to these accelerators. The Asymmetric Math Function (AMF) units, of which there are also three, implement RSA and ECC cryptography algorithms in various bit levels commonly used in software. The accelerators are cross-linked so they can directly share data, and the AMF and AES/SHA accelerators have their own I/O buffers so they can hold the results of operations locally and pass them on to another accelerator for further processing. So, you could compress a bit of data coming in from memory and then encrypt it, for instance, passing it back out through the I/O stack to disk or flash.
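The work these engines offload is work that otherwise burns core cycles in software. As a rough stand-in, here is what the compress-then-hash half of such a pipeline looks like using only Python's standard library: zlib plays the role of the compression engine and hashlib the SHA engine (AES is omitted because it would need a third-party library).

```python
# A software stand-in for the chained accelerator pipeline described above:
# compress a payload, then hash the compressed result. On Power7+ hardware
# these steps would run on the 842 and SHA engines instead of the cores.
import hashlib
import zlib

payload = b"transaction record " * 200   # hypothetical data headed to disk

compressed = zlib.compress(payload)                 # the compression step
digest = hashlib.sha256(compressed).hexdigest()     # the SHA step

print(f"{len(payload)} -> {len(compressed)} bytes, SHA-256 {digest[:16]}...")
```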
The Power7+ chip also has a random number generator (RNG in the graph) that is a “true hardware entropy generator” in that it is based on random fluctuations in the electronics of the chip itself and therefore the 64-bit random numbers it creates cannot be algorithmically reverse engineered, according to Taylor. There is also an MCD unit (not sure what that stands for) that predicts whether a memory access is local to the chip or on another system board if the machine has multiple boards.
Finally, the Power7+ chip has much more sophisticated control over the power consumed by the chip. For the first time, IBM has put power gating on each core, allowing cores to be shut down individually and managed in much more sophisticated ways.
The Power7+ chip has three different power modes. In nap mode, the clocks are stopped on a processor’s core execution units and all L2 and L3 cache segments are left running. This saves about 10 percent of the electricity that a Power7+ chip would burn, and it takes about 10 microseconds to fire up the cores and get them talking to the caches. During this nap mode, the caches are kept coherent across all of the cores in the system, whether they are local or remote. This is the same nap mode that the Power7 chip had.
With the Power7+, there are two more modes, one called deep sleep and the other called winkle, after Rip Van Winkle, of course. (Perhaps the Power7+ chip is code-named "Irving," after the author who penned that story about a man who fell asleep for twenty years and missed the American Revolution?) In deep sleep mode, the core is powered off and the L2 cache is flushed into the L3 cache and shut down as well. The L3 cache is left running. This reduces the energy consumption of the core and associated L2 cache and L3 cache segment by 80 percent; the downside is that it requires a re-initialization and a restore operation to wake the core module back up, and this takes around 4.5 milliseconds. (That's a long time in computer terms.) The winkle mode goes a bit further, flushing the L3 cache segment to the other seven L3 cache segments after emptying its L2 cache. The cores and the L2 cache and L3 cache segment are all power gated and turned completely off. This reduces the power consumption on the chiplet by 95 percent (you have to keep some things powered up so they know to turn the caches and cores back on), and it takes around 7 milliseconds to wake up from this winkle state and start doing useful work.
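Those savings-versus-wake-latency figures imply a break-even calculation: a short idle burst only justifies a nap, while a long one justifies winkle. Here is a sketch using the numbers above; the 100 watt nominal chiplet power is a made-up placeholder, and wake-up is simplistically charged at full power for the whole latency.

```python
# Break-even sketch for the three Power7+ idle states, using the savings
# percentages and wake-up latencies from the text. NOMINAL_W is a made-up
# placeholder; real chiplet power depends on the part and the clock.

NOMINAL_W = 100.0   # hypothetical nominal power for one core chiplet, watts

MODES = {               # mode -> (fraction of power saved, wake latency in s)
    "nap":        (0.10, 10e-6),
    "deep sleep": (0.80, 4.5e-3),
    "winkle":     (0.95, 7e-3),
}

def best_mode(idle_s: float) -> str:
    """Pick the mode that saves the most net energy over an idle window."""
    def net_joules(saved, wake_s):
        # energy saved while idle, minus the cost of waking back up
        return NOMINAL_W * saved * idle_s - NOMINAL_W * wake_s
    return max(MODES, key=lambda m: net_joules(*MODES[m]))

for idle in (100e-6, 1e-3, 1.0):
    print(f"idle {idle:>8.6f} s -> {best_mode(idle)}")
```

The takeaway matches the design intent: the hypervisor can afford winkle only when it knows a chiplet will be idle for a long stretch, which is exactly why all three states coexist.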
The energy savings that come through the deep sleep and winkle states are part of the reason IBM can crank the clocks on Power7+ chips, and it will be very interesting to see what kind of Turbo Core mode the Power7+ chips will have given all this. It may be possible to take a 16-core double-stuffer, turn off half the cores in each chip but leave the L3 caches on, and get a more powerful IBM i database engine. It will all depend on the feeds and speeds (and IBM i licensing) whether such a thing makes sense. Throw in flash memory on the system board and you could do something truly interesting.
We’ll see soon enough.