The Feeds and Guessed Speeds of Power7
September 14, 2009 Timothy Prickett Morgan
Just before The Four Hundred went on hiatus, I told you that IBM would be making a presentation about the forthcoming Power7 processors at the Hot Chips 21 conference hosted by the IEEE at Stanford University. And indeed, the top techies working on the eight-core Power7 processors did raise the curtains a little bit about the future brains of the Power Systems lineup, which are due in 2010.
IBM confirmed to me a little more than a year ago that the Power7 chips would have at least eight cores and would be made using a 45 nanometer process in Big Blue’s East Fishkill, New York, wafer baker. And in August, as IBM was preparing for the Hot Chips event and knew the analysts prebriefed about Power7 would start talking to the press whether or not it wanted them to, the company said that Power7 would come with four, six, or eight cores with as many as four instruction threads per core. The company also said that it would support up to 1,000 logical partitions per server on Power7-based machines, and that current Power 570 and Power 595 machines would be able to upgrade to the Power7 equivalents while preserving serial numbers. This, as I explained in detail in August, makes the bean counters happy because existing infrastructure does not have to be written off, and it makes IBM happy because now it can sell current top-end Power6 or Power6+ machines to customers who know they can upgrade to Power7 later if they need to.
So, what do we now know about the Power7 chip that is new? Plenty. First, according to the presentation given by Ron Kalla, chief engineer of the Power7 chip, and Balaram Sinharoy, chief core architect for the Power7, the chip will be made using a 45 nanometer lithography process that incorporates copper and silicon-on-insulator transistor doping techniques. The chip, according to Kalla, will have 1.2 billion transistors and will offer the “equivalent function” of a chip with 2.7 billion transistors. (I am not sure what this means, to be honest.) The chip weighs in at 567 square millimeters in area, and is just a little bit wider than it is deep.
The most interesting thing about the Power7 chip is that it includes 32 MB of embedded DRAM as a shared L3 cache for the processor cores. This is the same size cache that IBM had been offering in multi-chip packaging with Power5 and Power6 generations of machines, but now it is not only on the chip, but right smack dab in the middle of it, acting as a huge information exchange between the eight cores on the chip. By my estimates, looking at the picture of the fully loaded Power7 chip, the cores account for about 53 percent of the surface area of the chip, the local SMP links, remote SMP links and I/O, and two four-channel DDR3 main memory controllers take up about 35 percent of the space, leaving the L3 cache and chip interconnect to eat up about 12 percent of the real estate on the chip.
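To put those eyeballed percentages into physical terms, here is a minimal sketch that multiplies them by the 567 square millimeter die area given in the presentation. The shares are the estimates from the paragraph above, read off a chip photo, not IBM figures, and the dictionary keys are illustrative names only.

```python
# Die area reported in the Hot Chips presentation, in square millimeters.
DIE_AREA_MM2 = 567

# Eyeballed shares of the die from the chip photo (estimates, not IBM data).
AREA_SHARES = {
    "cores": 0.53,
    "smp_links_io_memory_controllers": 0.35,
    "l3_cache_and_interconnect": 0.12,
}


def area_breakdown(die_area_mm2, shares):
    """Convert fractional shares of the die into approximate areas in mm^2."""
    return {name: round(die_area_mm2 * share, 1) for name, share in shares.items()}


breakdown = area_breakdown(DIE_AREA_MM2, AREA_SHARES)
for name, area in breakdown.items():
    print(f"{name}: about {area} mm^2")
```

On these estimates the cores take up roughly 300 square millimeters, the links, I/O, and memory controllers a bit under 200, and the L3 cache plus interconnect something under 70.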
The Power7 core has a dozen execution units, which are binary compatible with previous Power6 and Power6+ chips, and a revamped instruction pipeline that is necessary to cope with the many threads and many cores in the chip. The Power6 chips had two cores, each with two threads. The Power7 chip will have, at 32 threads, eight times the virtual instruction streams of the Power6 per socket. So things have to change to accommodate that. The Power7 core has two fixed point units, two load store units, four double-precision floating point units, one decimal floating point unit (for doing the kind of money math popular on business applications), and one vector unit (for doing the matrix math popular in nuclear simulations and weather forecasts). The cores support out-of-order instruction execution, as prior generations of Power chips have, and have 32 KB of L1 instruction cache, 32 KB of L1 data cache, and 256 KB of L2 cache.
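The thread arithmetic, and the tally of execution units, can be checked with a short sketch. The figures come straight from the paragraph above; the function and variable names are illustrative. Note that the units named in the text add up to ten, so two of the dozen are not enumerated here (IBM's own materials also count a branch unit and a condition register unit, which would account for the difference).

```python
def hardware_threads(cores, threads_per_core):
    """Total hardware threads per socket."""
    return cores * threads_per_core


power6_threads = hardware_threads(cores=2, threads_per_core=2)
power7_threads = hardware_threads(cores=8, threads_per_core=4)
thread_ratio = power7_threads // power6_threads

# Execution units per core, exactly as enumerated in the article text.
LISTED_UNITS = {
    "fixed_point": 2,
    "load_store": 2,
    "double_precision_floating_point": 4,
    "decimal_floating_point": 1,
    "vector": 1,
}
listed_total = sum(LISTED_UNITS.values())
unlisted = 12 - listed_total  # units counted in "a dozen" but not named in the text

print(f"Power6: {power6_threads} threads, Power7: {power7_threads} threads")
print(f"Ratio per socket: {thread_ratio}x")
print(f"Units listed: {listed_total}, not listed: {unlisted}")
```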
As for the L3 cache on chip, which could account for a lot more of the transistors than the eDRAM area would imply, each core has an L-shaped segment of that cache that is loosely affiliated with it and that weighs in at 4 MB. Any core can also reach the L3 cache segments affiliated with the other cores, which is a lot quicker on the chip than going out through the memory controller and out to the DDR3 main memory. The difference between searching a remote L3 cache segment and going to main memory is like the difference between a heartbeat and waiting for a cab in New York on a rainy day.
As with prior Power designs, IBM is really concerned with pumping up the memory and SMP bandwidth. IBM says that it can get 100 GB/sec of sustained memory bandwidth per chip socket, and that the SMP interconnect of a 32 socket machine has an aggregate bandwidth of 360 GB/sec.
Here’s where it gets interesting. Rather than make one Power7 chip that is supposed to fit all jobs, IBM will be tweaking the chips that come out of the fab. Some of them will have the full eight DDR3 memory channels activated, while others will only sport four. Some will have half-width SMP buses, while others will sport the standard widths and others still will have four-wide SMP buses. The precise way these are mixed and matched with the core count is not clear, but IBM did say that for blade and rack servers with two or four processor sockets, the Power7 chips would have only one memory controller activated and three 4-byte local SMP links, all in a single chip organic package. That might imply a maximum of four cores for these machines, if I understand the schematics correctly, but six cores is also possible. I would guess that these chips will be half-duds, with maybe 8 MB or 16 MB of the L3 cache working.
Midrange and high-end Power Systems boxes using the Power7 chip will have both memory controllers activated, plus three 8-byte local links and two 8-byte remote links. These chips will be packaged up in single-chip glass ceramic packages and will be used in the standard Power Systems product line. It is my guess that IBM will offer six or eight core variants here, with 16 MB or 32 MB of L3 cache activated.
There is another variant of the Power7 chip that will not appear in standard machines and which seems to be destined for the “Blue Waters” massively parallel supercomputer being built by IBM for the National Center for Supercomputing Applications at the University of Illinois. This is a multichip module (MCM) that will be a standalone SMP node that has four entire Power7 chips, with all eight of their memory controllers activated, plunked into a single ceramic package. These chips will all sport three 16-byte local links that glue the chips together. And there is every reason to believe these chips, which were developed under the code-name “Q7,” will end up running at higher clock speeds than the MCM packaging should allow.
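The three packaging variants described above can be laid side by side in a small data structure. This is only a sketch of what the article reports: the class and field names are made up for illustration, a value of None marks something the presentation summary does not state, and the four-channels-per-controller figure comes from the earlier description of the two four-channel DDR3 controllers.

```python
from dataclasses import dataclass
from typing import Optional

# Each on-chip memory controller is four-channel DDR3, per the article.
CHANNELS_PER_CONTROLLER = 4


@dataclass(frozen=True)
class Power7Variant:
    target: str
    package: str
    chips_per_package: int
    memory_controllers_per_chip: int
    local_links: int
    local_link_bytes: int
    remote_links: Optional[int]       # None where the article gives no figure
    remote_link_bytes: Optional[int]  # None where the article gives no figure

    @property
    def memory_channels_per_chip(self):
        return self.memory_controllers_per_chip * CHANNELS_PER_CONTROLLER

    @property
    def local_link_width_bytes(self):
        """Combined width of all local SMP links on one chip."""
        return self.local_links * self.local_link_bytes


VARIANTS = [
    Power7Variant("blade and 2/4-socket rack servers", "single-chip organic",
                  1, 1, 3, 4, None, None),
    Power7Variant("midrange and high-end Power Systems", "single-chip ceramic",
                  1, 2, 3, 8, 2, 8),
    Power7Variant("supercomputer node (MCM)", "four-chip ceramic module",
                  4, 2, 3, 16, None, None),
]

for v in VARIANTS:
    print(f"{v.target}: {v.memory_channels_per_chip} DDR3 channels per chip, "
          f"{v.local_link_width_bytes} bytes of local SMP link width")
```

Laid out this way, the local SMP link width doubles at each step up the line, from 12 bytes to 24 bytes to 48 bytes per chip.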
As part of the Power7 design, cores on the chip can be dynamically turned on or off (provided IBM has enabled them, of course), and the core frequencies of the chips can also be turned up and down in an effort to reallocate energy. The threads on the chips can also be enabled and disabled as needed, ranging from a low of one thread up to the full four threads. I wonder if IBM will allow machines with fewer threads turned on to run at slightly higher clock speeds?
From what I can see in the presentation, the individual per-core performance of the Power7 cores will go up by about 20 percent, even though I expect clock speeds in the range of 3 GHz to 4 GHz, much less than the 4.2 GHz to 5 GHz of the Power6 and Power6+ cores. Multithreading is the answer. And so is adding cores. Kalla said that IBM would be able to boost the per-socket performance of a Power Systems box by a factor of four, getting Big Blue back on the roadmap to where it needs to be after the Power6+ totally fizzled in terms of delivering the performance expected. I still think IBM expected to get clock speeds up to 6 GHz or higher with the Power6+ and opted instead to double up the cores to boost system oomph. But no one will ever cop to that.
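A back-of-the-envelope sketch shows how these numbers hang together. It assumes naive linear scaling with core count, which real SMP systems do not deliver, and it pairs the top of the guessed Power7 clock range (4 GHz) with the top Power6 speed (5 GHz) purely for illustration. How IBM arrived at its factor of four was not spelled out, so treat this as a sanity check and nothing more.

```python
def naive_socket_gain(old_cores, new_cores, per_core_gain):
    """Per-socket speedup if performance scaled linearly with core count."""
    return (new_cores / old_cores) * per_core_gain


def implied_per_clock_gain(per_core_gain, old_ghz, new_ghz):
    """Work per clock cycle a core must gain to offset a lower clock speed."""
    return per_core_gain / (new_ghz / old_ghz)


# About 20 percent more per-core performance, going from two cores to eight.
socket_gain = naive_socket_gain(old_cores=2, new_cores=8, per_core_gain=1.2)

# Assumed clock pairing: 5 GHz Power6 versus a guessed 4 GHz Power7.
per_clock = implied_per_clock_gain(per_core_gain=1.2, old_ghz=5.0, new_ghz=4.0)

print(f"Naive per-socket gain: {socket_gain:.1f}x (IBM's stated figure: 4x)")
print(f"Implied per-clock gain per core: {per_clock:.2f}x")
```

The naive product comes to 4.8 times, a bit above the factor of four Kalla cited, and at the assumed clock speeds each core would have to do about 50 percent more work per cycle, which is where the extra threads come in.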
In next week’s issue, I will take a stab at what the future Power7 machines might look like in terms of specs and performance compared to the current Power Systems lineup.