More Power7 Details Emerge, Thanks to Blue Waters Super
July 21, 2008 Timothy Prickett Morgan
The great and wonderful thing about big government-sponsored supercomputing projects is not just the exotic technology that such massive projects cook up, but also the fact that people like to brag about what they are up to inside the small circle of supercomputing centers in academia and governments. Eventually, some of the information about future HPC projects and the products they will be based on leaks out to the rest of the world, giving us a glimpse into the technologies that might be deployed in some modified form in commercial servers.
And so it is with IBM’s future Power7 processors. Big Blue has been pretty quiet about Power7, except for a slip back in May in the Austin American-Statesman, the hometown paper in the Texas town where IBM designs Power processors, in which IBM confirmed that a future supercomputer called “Blue Waters” being built for the National Center for Supercomputing Applications at the University of Illinois would employ Power7 chips with eight cores. I didn’t see this report until I started poking around last week after The Register ran a story based on the internal IBM specs of the Blue Waters machine, which you can read here. According to that report, the Power7 chip will be implemented in a 45 nanometer process and will have eight processor cores per chip, each with four threads per core, running at 4 GHz. The story went on to say that the machine would have 38,900 Power7 processors (that would be 311,200 cores) and deliver 10 petaflops of aggregate number-crunching performance, with a total of 620 TB of main memory.
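The figures in that report can be sanity-checked with some quick arithmetic. The sketch below is only a back-of-the-envelope check, not anything from IBM or NCSA: it assumes decimal units (1 petaflops = 10^15 flops, 1 TB = 10^12 bytes), treats the 10 petaflops as a peak aggregate number, and uses variable names of my own choosing. Note that the 311,200-core total only works out with eight cores per chip.

```python
# Back-of-the-envelope check of the reported Blue Waters figures.
# Assumptions (not from the source): decimal units and peak performance.

chips = 38_900            # Power7 processors, per the report
cores_per_chip = 8        # needed to reach the reported 311,200 cores
threads_per_core = 4      # per the report
clock_hz = 4e9            # 4 GHz, per the report
peak_flops = 10e15        # 10 petaflops aggregate, per the report
memory_bytes = 620e12     # 620 TB of main memory, per the report

total_cores = chips * cores_per_chip
total_threads = total_cores * threads_per_core

flops_per_core = peak_flops / total_cores
flops_per_cycle_per_core = flops_per_core / clock_hz
memory_per_core_gb = memory_bytes / total_cores / 1e9

print(f"cores:                 {total_cores:,}")
print(f"threads:               {total_threads:,}")
print(f"gigaflops per core:    {flops_per_core / 1e9:.1f}")
print(f"flops per cycle/core:  {flops_per_cycle_per_core:.2f}")
print(f"memory per core (GB):  {memory_per_core_gb:.2f}")
```

Under those assumptions, the reported numbers imply roughly 32 gigaflops and about 2 GB of main memory per core, or about eight floating point operations per clock cycle per core at 4 GHz.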
According to NCSA’s own presentation of the Blue Waters project, which was awarded this past spring, the design goal is for more than 200,000 processor cores (based on “multicore Power7” chips) and more than 800 TB of main memory (with at least 32 GB of main memory per SMP node); the initial goal is for 100 Gb/sec of external bandwidth, eventually quadrupling that to 400 Gb/sec.
I have read the story in The Register many times over, and I have also listened very carefully to what IBM has been saying about chips for the past two decades. And I think that people are talking about two different generations of Blue Waters machines, and I think some of the data might be mixed up. Before I get into that, let’s take a look at the latest Power chip roadmap that I can scrape up and talk about that for a second, because there is some new data on this and it is important:
First, this roadmap includes relative performance metrics compared to the Power5 generation of processors. That’s new. Second, see the Power6+ area. See anything interesting there? It says “Multi-Core,” which is the language in the chip racket for “four or more cores.” So the way I read this roadmap is that IBM is now planning to get a four-core Power6+ processor into the field that will roughly double the performance of the current Power6 chips. That’s new data as far as I know. But this now explains why IBM gutted the out-of-order instruction stream for Power5 and Power5+ chips and replaced it with the in-order instruction stream (which has special cases for out-of-order execution). The Power5 chip had multithreading that allowed five instructions per cycle on alternate threads, but the tweaks in the Power6 chip allow for seven instructions per cycle–five on one thread and two on the other–and this results in more work getting done. The instruction stream was also tweaked so IBM could jack up the clock speeds. Remember, when Power6 was projected to come to market in late 2006 or early 2007, it was supposed to run at 6 GHz. To get that 2X performance boost on a Power6+ chip in 2009, that would mean getting clock speeds up to 6 GHz and doubling the core count. IBM could get this core count doubling by packaging two dual-core Power6 chips in the same socket, much as it did with the Power5+ quad-core modules (QCM) used in its supercomputers.
With the Power7 chips, which are due in 2010, we know that the chips have a multicore design–meaning four or more cores–but IBM’s roadmap and its top techies have been talking about an “advanced hybrid core design.” I have been led to believe that by hybrid core, IBM means bringing special functions that used to be done on the motherboard or elsewhere in computer networks back onto the chip, but not necessarily inside the processor core. (Think of the VMX AltiVec vector math units or the decimal math units on the Power6 chip and you get the right idea.) This hybrid approach lets the Power core do the work it is designed to do–run applications–while these adjunct chips assist with specific functions.
Now, the other thing this roadmap shows is that the jump from Power6 to Power7 results in only a factor of five boost in performance, and only a factor of 2.5 compared to this Power6+ processor, which could just end up being a quad-core chip or most likely a quad-core module with two Power6 dual-cores in a single package. If I were IBM, I would be moving the shared 32 MB L3 cache chip into this module with the Power6+ and start making my way toward integrating L3 cache memory right inside the Power7 chip. Look at the integration effort over time. Power4 chips had L1 and L2 caches and their controllers on the chip as well as L3 cache controllers, with GX bus controllers, main memory controllers, and main memory external to the chip; Power5 brought in a GX distributed switch and the main memory controller onto the chip; and Power6 pulled in a new fabric bus controller and a GX bus controller. With the future “PERCS” supercomputer that IBM has been working on for the U.S. government (and which Power7 systems fulfill) from its Austin labs, IBM is promising a highly configurable processor chip, a switchless architecture, ultradense packaging for improved performance and reduced cost–including opto-electronics on the chip modules.
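The relative performance factors read off the roadmap chain together, which is easy to check. The sketch below simply multiplies the two generational steps as I read them; the variable names are mine, and the factors are rough roadmap readings, not measured results.

```python
# Rough generational performance factors as read off the roadmap.
# These are the article's approximate readings, not benchmark data.

power6_plus_vs_power6 = 2.0   # Power6+ roughly doubles Power6
power7_vs_power6_plus = 2.5   # Power7 is about 2.5X Power6+

# Chaining the two steps gives the Power6-to-Power7 jump.
power7_vs_power6 = power6_plus_vs_power6 * power7_vs_power6_plus

print(f"Power7 vs Power6: {power7_vs_power6:.1f}X")
```

So the factor of five from Power6 to Power7 is just the 2X step to Power6+ compounded with the 2.5X step after it.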
People that I spoke to last week who are familiar with IBM’s Blue Waters project suggested that the clock speed on the Power7 chips might be more in the range of 3 GHz to 4 GHz, not just simply 4 GHz, and they also suggested that these initial Power7 chips could actually be delivered in the Blue Waters machine sometime between the fall of 2009 and the spring of 2010.
I could not confirm that Power7 would be a quad-core chip or an eight-core chip, but here’s my gut instinct on this one. Power6+ will cram two Power6 chips into one socket and pretest the 45 nanometer processes IBM is perfecting for Power7. Power7 will be a quad-core chip proper with some kind of funky configurable cores, and it may even plug into Opteron or Xeon processor sockets (or both). The PERCS project and its related TRIPS project explain the kind of thing that IBM might be up to when I say funky configurable cores; I wrote about these way back in 2004 (see IBM, UT Austin Designing Morphable Teraflops Chip) and in 2003 (see Cray, IBM, Sun Split Phase Two of $146 Million DARPA Super Deal). If there is a true octo-core Power7, I don’t think that will happen until Power7+, and I think that this is what people are yammering about in reaction to the piece in The Register. I also happen to think that this will be two quad-core Power7 chips in a single socket, shared in some funky and elegant way that makes them look like one giant chip. Probably with all of the cache memory on the chip or at least inside the package.
While I will concede that IBM might have shifted gears and is going to start cookie-cutting cores onto a chip, like Intel and Sun Microsystems are going to do, I have a problem with this. The Power6 instruction stream was gutted so IBM could crank up clock speeds. The reduction in clock speeds in the eight-core Power7 relative to even current top-end Power6 chips seems to imply it is two faster quad-core Power7 chips that have been geared down so they don’t burn up in the module. This is how the Power5+ QCMs worked, and this delivers a modest net improvement in performance with a big reduction in electricity used and heat dissipated. Which is exactly what you would expect for a supercomputer.
But for commercial customers–and one of the key goals of the PERCS project is to design supercomputers with standard, not exotic, parts–this lowering in clock speed would have a dramatic effect on performance. Batch jobs would run slower, as would monolithic code not designed to consume the 32 threads in a Power7 chip complex. IBM also has to offer faster Power7 chips for these customers, since they actually pay its bills. All that these supercomputer deals do is fund IBM’s chip and system research–to the tune of hundreds of millions of dollars. The Defense Advanced Research Projects Agency, the research arm of the U.S. military, ponied up $244 million for the PERCS research, and NCSA is shelling out another $208 million for the derivative Blue Waters project, which is being paid for out of National Science Foundation money. So this is nothing to shake a stick at. But you have to remember that IBM is trying to preserve a $5 billion-a-year commercial Power Systems business, too. And an eight-core, 4 GHz Power7 machine might not do the trick on database and transactional workloads, even with twice as many threads per core. But who knows?
I’ll keep digging for more information on Power7 until it adds up. IBM surely knows, as will its partners and customers sooner or later.