IBM Uncloaks Power6 Chip Details
October 16, 2006 Timothy Prickett Morgan
It’s the fall, so that means it is time for the major designers of processors and other related electronics to converge on Silicon Valley for the In-Stat Microprocessor Forum. IBM’s Brad McCredie, who works for the Systems and Technology Group as the chief architect of the future Power6 processor, gave his presentation at the forum last week, and divulged a lot of the inner workings of the device. There’s a lot of stuff in this chip, which is what you would expect from a device with around 750 million transistors.
First and foremost, McCredie said that the Power6 chip was on track for delivery in mid-2007, and he confirmed that the chip would appear first in IBM’s System p server line. The System p line, formerly known as the pSeries, supports the AIX Unix variant as well as Linux and, for a small number of customers, a few processor cores’ worth of i5/OS (IBM’s proprietary operating system, formerly known as OS/400). Executives in IBM’s System i division, which makes and sells servers that predominantly run i5/OS (but which also support AIX and Linux), hinted a month ago that the System i line (formerly known as the iSeries) might not see the Power6 chip first. The Power4 chips appeared first in the pSeries, and the Power5 chips appeared first in the iSeries. So it is clearly the System p division’s turn to get the new technology first–if you want to think about the IT business as being fair. (It isn’t.) What can be honestly said is that the System p division needs more performance and better control over power consumption more than does the System i division, and this is probably why IBM will put the Power6 chip in AIX and Linux machines first.
The Power6 chip is, like the Power4 and Power5 chips before it, a dual-core design. McCredie explained in his presentation that the move to a 65 nanometer process (which IBM has perfected with its Power-based Cell and “Xenon” Xbox processors for Sony and Microsoft game consoles, respectively) yields about 30 percent more performance compared to the 90 nanometer process used in the Power5+ chips. IBM is using a 10-layer 65 nanometer process with silicon-on-insulator (SOI) and with low-k dielectric on the first eight layers of the chip. Both SOI and low-k were perfected by IBM and have allowed Big Blue to create some of the most sophisticated processor designs in the world. McCredie also explained that the Power6 design includes transistors that are tailored for ultra-low power consumption, as well as separate voltage supplies for different areas of the chip, so parts of the chip can be shut down when they are not in use. (This is unlike processors from days gone by, where all transistors were hot whether or not you were making them do practical work. Even with the design IBM has put together, all parts of the chip consume some juice, even when they are not stressed.)
The Power6 cores are modified versions of the Power5+ cores, as it turns out. The Power6 core has 64 KB of L1 instruction cache, 64 KB of L1 data cache, two integer execution units, two floating point units, and a branch execution unit. Like Power5 and Power5+, the Power6 core has two instruction threads, which are enabled by a feature called simultaneous multithreading, or SMT. With SMT, you can make a single core look like two virtual processors to an operating system, and therefore goose the amount of work the chip can do. With the SMT features on the Power5 chips, IBM was able to see a 30 percent to 40 percent boost in performance on some workloads, particularly thread-sensitive database work. The Power6 pipeline has essentially the same depth as the Power5+ pipeline. The SMT features have been improved significantly, and apparently IBM is seeing as much as 55 percent performance improvement on two virtualized threads compared to a single thread with SMT disabled. On integer applications, IBM is seeing about a 40 percent gain in the SMT implementation on Power6 on a thread basis. The bigger caches and dedicated caches for each core seem to help SMT performance as well.
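To put those SMT percentages in perspective, here is a back-of-the-envelope sketch in Python. It assumes, purely for illustration, that each quoted gain is the total throughput of one core running two threads relative to the same core running one thread with SMT disabled, and that the two threads split the core evenly. The function name `per_thread_speed` is made up for this example and does not come from IBM.

```python
def per_thread_speed(smt_gain: float, threads: int = 2) -> float:
    """Relative speed of each thread when `threads` share one core.

    smt_gain is the fractional throughput gain with SMT on
    (0.55 means 55 percent more total work than one thread alone).
    Assumption: the threads split the core's throughput evenly.
    """
    total_throughput = 1.0 + smt_gain
    return total_throughput / threads


# Gains quoted in the article; the labels are ours.
for label, gain in [
    ("Power5, low end", 0.30),
    ("Power5, high end", 0.40),
    ("Power6, best case", 0.55),
]:
    speed = per_thread_speed(gain)
    print(f"{label}: each thread runs at about {speed:.3f}x single-thread speed")
```

Under those assumptions, a 55 percent gain means each of the two threads runs at roughly three-quarters of the speed it would have with the core to itself, while the core as a whole gets more work done.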
The Power6 cores have some other new features. First, IBM has added AltiVec VMX vector co-processors to each core, similar to the VMX units that are in the PowerPC 970MP processors. Each core also borrows some mainframe-class error recovery electronics, including a core error collection unit, a core restart unit, and a recovery unit. Basically, if instructions get mashed up, the Power6 core can try to figure out what went wrong on the fly, and if it can’t, it can move the entire state of the core as it was running over to a new core and try it again. Soft errors are retried, and if that doesn’t fix the problem, then the hardware is changed for a hard error (like a bit flipping in memory). The processor state is checked every cycle and a checkpoint is saved before work progresses; then both error corrected (ECC) and non-ECC circuits are checked. If there is a soft error, it rolls back to the checkpoint and tries again, and if that doesn’t fix the problem, the checkpoint state is moved over to a new processor core in the complex. (Some IBMers were hinting at triple redundancy in the Power6 chip, but this is not what triple redundancy means.)
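The checkpoint, retry, and move-to-another-core sequence described above can be sketched as a toy simulation in Python. This is only a software illustration of the control flow, not how the Power6 hardware is implemented; the `Core` class, the `SoftError` exception, and the `run_with_recovery` function are all hypothetical names, and the single-retry policy is an assumption.

```python
from dataclasses import dataclass, field


class SoftError(Exception):
    """Hypothetical transient fault detected while executing an instruction."""


@dataclass
class Core:
    name: str
    faults_remaining: int = 0  # simulation only: how many times this core will fault
    state: dict = field(default_factory=dict)

    def execute(self, instruction):
        if self.faults_remaining > 0:
            self.faults_remaining -= 1
            raise SoftError(f"{self.name}: fault during {instruction}")
        register, value = instruction
        self.state[register] = value


def run_with_recovery(cores, instructions, max_retries=1):
    """Run on cores[0]; on an error roll back and retry, then move to a spare core."""
    active = 0
    for instruction in instructions:
        checkpoint = dict(cores[active].state)  # save known-good state first
        retries = 0
        while True:
            try:
                cores[active].execute(instruction)
                break
            except SoftError:
                cores[active].state = dict(checkpoint)  # roll back to the checkpoint
                if retries < max_retries:
                    retries += 1  # treat as a soft error: retry on the same core
                    continue
                active += 1  # retry failed: treat as a hard error
                if active >= len(cores):
                    raise RuntimeError("no spare cores left")
                cores[active].state = dict(checkpoint)  # move the state to a new core
                retries = 0
    return cores[active]


program = [("r1", 1), ("r2", 2)]

# One transient fault: the retry on the same core succeeds.
finished = run_with_recovery([Core("core0", faults_remaining=1), Core("core1")], program)
print(finished.name, finished.state)

# Persistent faults: the checkpointed state moves to the spare core.
finished = run_with_recovery([Core("core0", faults_remaining=5), Core("core1")], program)
print(finished.name, finished.state)
```

In the first run the work finishes on `core0`; in the second it finishes on `core1`, with the same final register values either way.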
Interestingly, the Power6 core has a new kind of execution unit–one dedicated to doing decimal math. You know, money math. An awful lot of the witchcraft that programmers have to do involves coping with data that is stored in either decimal or integer form while the mathematical operations on that data are performed using binary (base 2) floating point math units. So IBM has created a decimal floating point (DFP) math unit for each Power6 core (that’s base 10). According to IBM’s analysis of data stored in large commercial databases, about 55 percent of the numeric data in databases is in binary coded decimal (BCD) format, and another 43 percent is in integer format, often coded as decimal integers.
Instead of using binary floating point units and then software to get things into decimal format, the Power6 will see decimal operations and just do decimal math in base 10. This can speed up these decimal math operations by a factor of 2 to 7, according to IBM’s initial benchmark tests. The DFP unit required IBM to add about 50 new instructions to the Power instruction set architecture. Presumably the COBOL and RPG compilers made by IBM will be tweaked to use the DFP units.
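Why base 10 matters for money math is easy to see in a few lines of Python. The sketch below uses Python's standard `decimal` module, which does decimal arithmetic in software; it illustrates the problem a decimal unit addresses and says nothing about IBM's actual instructions.

```python
from decimal import Decimal

# Binary floating point cannot represent 0.10 or 0.20 exactly,
# so the sum picks up a tiny error.
binary_sum = 0.10 + 0.20

# Decimal arithmetic keeps the values exactly as written.
decimal_sum = Decimal("0.10") + Decimal("0.20")

print(binary_sum == 0.30)              # False
print(decimal_sum == Decimal("0.30"))  # True
print(decimal_sum)                     # 0.30
```

Software libraries like this one are the kind of extra work that a hardware decimal unit is meant to take off the programmer's and the compiler's hands.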
Each Power6 chip has two cores, and each core will have its own dedicated 2 MB L2 cache memory. With the Power4 chips, IBM had 1.44 MB of shared L2 cache for the two cores on the chip, and with the Power5 design, IBM increased the shared L2 cache to 1.9 MB. As with the Power5, each Power6 core can have a 32 MB L3 cache assigned to it. The Power6 chip also includes two memory controllers on chip as well as an L3 memory controller and an L3 directory cache. IBM is allowing one or two memory controllers to be turned on, as necessary in the system design, and they can be configured to run at full or half width, too. The Power6 chips will come with L3 cache in three different configurations–those without L3 cache, those with the L3 cache on the module (as was done with the Power4 and Power5 chips), and those that have the L3 cache off the module.
The chip has a GX+ I/O controller, and a sophisticated interconnection fabric for linking multiple Power6 chips together into systems. The chip has 80 GB/sec of bandwidth going into L3 cache, 75 GB/sec of bandwidth going into main memory, 20 GB/sec of bandwidth going into the GX+ I/O bus, 50 GB/sec of bandwidth into an internode connection fabric (which links chips in two-socket and four-socket machines gluelessly) and 80 GB/sec of fabric interconnection for connecting up to eight of these four-socket boxes gluelessly into a 32-socket, 64-core machine. (This is called intranode linking.) Two Power6 chips can be plugged together into a very tight pair (this is new), and so can four sockets, as was the case with the Power4 and Power5. But the Power6 chip also includes a two-tiered memory coherency protocol that has significantly lower latencies compared to the Power5.
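The arithmetic behind the largest configuration can be checked with a few lines of Python. The constants come from the figures in this article (dual-core chips, four-socket boxes, up to eight boxes, two SMT threads per core); the variable names are ours.

```python
CORES_PER_CHIP = 2     # each Power6 chip is dual-core
SOCKETS_PER_NODE = 4   # a four-socket box
MAX_NODES = 8          # up to eight boxes linked gluelessly
THREADS_PER_CORE = 2   # two SMT threads per core

max_sockets = SOCKETS_PER_NODE * MAX_NODES
max_cores = max_sockets * CORES_PER_CHIP
max_threads = max_cores * THREADS_PER_CORE

print(max_sockets, "sockets,", max_cores, "cores,", max_threads, "threads")
```

That works out to 32 sockets and 64 cores, matching the machine size cited above, and 128 hardware threads with SMT turned on.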
McCredie reiterated at Microprocessor Forum that the Power6 chip will run in the 4 GHz to 5 GHz clock speed range, which IBM divulged this year–and added that it would be closer to 5 GHz.