Power8 Processor Packs A Twelve-Core Punch–And Then Some

September 9, 2013 Timothy Prickett Morgan

Well, IBM i shops hankering for some more processing oomph, Big Blue has a very big jump in performance coming your way with the future Power8 processors. Just after The Four Hundred went on vacation at the end of August, the techies who designed the Power8 chip took the podium at the Hot Chips conference, hosted by the IEEE every summer for the past 25 years at Stanford University, and revealed the major aspects of the design of the successor to the current Power7 and Power7+ processors.

As I was watching the presentation by Jeff Stuecheli, chief nest architect for the Power8 chip (that refers to the non-core portions of the die, with the cores being akin to eggs), it struck me immediately that this single Power8 processor, with its dozen cores, was akin to an entire AS/400e 650-2243 system from the summer of 1997.

At its design point clock speed of 4 GHz, the Power8’s cycle time is a whopping 32 times shorter than that of the 125 MHz (remember when megahertz used to matter?) PowerPC Apache processors used in the AS/400e 650. That 12-way AS/400e 650 machine was rated at 2,340 CPWs, and the following year that system was upgraded to 262 MHz PowerPC Northstar processors with an early implementation of simultaneous multithreading (SMT) that allowed the 12-way AS/400e 650 system to push up to 4,550 CPWs of aggregate raw OS/400 performance.

Stuecheli said in his presentation that based on early benchmark tests, the Power8 chip running at 4 GHz had somewhere around 2.5 times the performance of a Power7+ chip at the socket level, due to a mix of factors that I will explain in a bit. Based on my extrapolations from CPW ratings for Power 740 systems with a single Power7+ chip running at 4.2 GHz, I would peg a single Power8 chip running at 4 GHz at around 155,000 CPWs, give or take a few thousand. In other words, the AS/400e 650, the top-end box from 15 years ago, fits in the error bars of a guesstimation of the single chip performance of a Power8 chip.
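For those who like to check the arithmetic, here is a minimal Python sketch of the comparison above. It uses only the figures quoted in this article; the 155,000 CPW number is the extrapolation made here, not an official IBM rating, and the variable names are just illustrative.

```python
# Figures quoted in the article. The Power8 CPW value is an estimate,
# not a published IBM rating.
POWER8_CLOCK_MHZ = 4000
APACHE_CLOCK_MHZ = 125
AS400E_650_CPW_1997 = 2340      # 12-way, PowerPC Apache
AS400E_650_CPW_1998 = 4550      # 12-way, PowerPC Northstar
POWER8_CPW_ESTIMATE = 155_000   # single chip at 4 GHz, extrapolated

# Clock speed ratio between the two generations.
clock_ratio = POWER8_CLOCK_MHZ / APACHE_CLOCK_MHZ

# How many whole AS/400e 650 systems one Power8 chip would stand in for.
ratio_vs_1997 = POWER8_CPW_ESTIMATE / AS400E_650_CPW_1997
ratio_vs_1998 = POWER8_CPW_ESTIMATE / AS400E_650_CPW_1998

# The upgraded 650's rating as a share of the Power8 estimate,
# which is why it fits inside "give or take a few thousand".
share_of_estimate = AS400E_650_CPW_1998 / POWER8_CPW_ESTIMATE

print(f"Clock ratio: {clock_ratio:.0f}x")
print(f"CPW ratio vs 1997 system: {ratio_vs_1997:.1f}x")
print(f"CPW ratio vs 1998 system: {ratio_vs_1998:.1f}x")
print(f"1998 system as share of estimate: {share_of_estimate:.1%}")
```

The gap in CPWs is bigger than the gap in clock speed, which is what you would expect given the changes in cores, caches, and threading over 15 years.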

Just let that soak in a little bit. And think about how large the IBM midrange business might be, and how prevalent IBM i might be, had Big Blue priced this technology very aggressively to extend its customer base rather than to extract profits so aggressively from it. Q might be a very popular letter out there in the data centers of the world, and in an alternative universe where IBM listened to us, it is.

We have to deal with what is, not just what can and cannot be, and what I can tell you is that the Power8 chip is yet another leap forward for the Power Systems business and demonstrates, yet again, that Big Blue knows a thing or two about designing processors. This is one big bad chip, and it is aimed at big bad systems for sure.

Hopefully, IBM will be able to gear it down for entry and midrange customers with more modest processing needs–and do so at an affordable price. This will be particularly important for the IBM i customers that Big Blue wants to retain as well as the Linux and, to a smaller extent, AIX customers that the company wants to attract. As you well know, most IBM i shops have entry or midrange systems with two, four, or eight cores running IBM i. One of these Power8 chips is more than enough for them. I will get into the possibilities in next week’s issue.

The feeds and speeds of the Power8

The Power8 chip starts out with the Power7+ core that IBM rolled out this time last year and finished putting across the Power Systems product line as the summer was getting started. IBM is moving from the 32 nanometer process used to etch its Power7+ chips to a 22 nanometer process with 15 metal layers. Both processes use the high-k metal gate and silicon-on-insulator (SOI) techniques IBM perfected many years ago in previous generations of Power chips. This is a pretty big shrink–the same one that Intel has made with its Xeon processors–and it gives Big Blue lots of options. It could have allowed IBM to cram as many as 16 cores on a die, if it were willing to sacrifice some on-chip L3 cache memory, perhaps. It is always hard to guess these things from the outside.

As it turns out, IBM went with a dozen cores and cut back on the L3 cache a little bit so it could add PCI-Express 3.0 controllers and a new Coherent Accelerator Processor Interface, which rides atop that PCI transport, to the die. These integrated PCI-Express controllers–there are two of them–and their CAPI overlay replace the current GX++ bus used on Power chips. The GX++ is a funky variant of 20 Gb/sec InfiniBand that comes off the die and then talks to an I/O bridge that in turn talks to PCI devices. Each PCI-Express controller runs at 8 Gb/sec, so this is actually a slight drop in raw bandwidth. But this is a much better way to do things, and will make integration with PCI devices simpler.

And, as The Four Hundred already divulged, the CAPI overlay will allow for the sharing of memory between Power8 processors and auxiliary coprocessors linked to them over CAPI. IBM no doubt wants makers of network controllers, GPUs, FPGAs, and DSPs to adopt CAPI’s Processor Service Layer so their devices can talk to the coherence bus on the Power8 chip and act like they were right there on the die. That, if anything, is what the OpenPower consortium announced a few weeks ago is all about.

The Power8 chip also has a new technology-neutral memory controller that puts a lot of the electronics dealing with specific signaling for particular memory types (DDR3, DDR4, or whatever) out onto the buffer chips that IBM has always used with the most recent Power chips.

As expected, IBM has goosed the SMT capabilities of the Power motor with the Power8, and will be able to manage as many as eight threads per core. Each of those threads looks like an individual processor to the IBM i, AIX, or Linux operating system, and the core uses interleaving techniques during stalls and other activities that slow down processing on a single thread to squeeze more work out of the machine. This threading is dynamic and automatic and can scale back to a single actual thread as well as run with two, four, or eight threads activated. Databases and Java application servers like threads; RPG and COBOL apps are more threaded than in past years but have their limits.
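To put numbers on that, here is a small Python sketch of how many logical processors the operating system would see from one twelve-core chip in each SMT mode. It assumes all twelve cores are active and running in the same mode, which is a simplification for illustration.

```python
# One Power8 chip, as described in the article.
CORES_PER_CHIP = 12
SMT_MODES = (1, 2, 4, 8)  # threads per core the chip can run with

# Each hardware thread looks like one processor to the operating system.
# Assumption: every core is active and uses the same SMT mode.
logical_cpus = {mode: CORES_PER_CHIP * mode for mode in SMT_MODES}

for mode, count in logical_cpus.items():
    print(f"SMT{mode}: {count} logical processors per chip")
```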

Each Power8 core has 64 KB of L1 data cache (twice what is on the Power7 and Power7+ chips) and 32 KB of L1 instruction cache. Each core on the die has a 512 KB L2 cache segment allocated to it, which juts up against the 96 MB of shared L3 cache, which is implemented in embedded DRAM. As with the Power7 and Power7+ chips, this eDRAM takes fewer transistors to implement a memory cell, but those cells move a little slower than the SRAM that is used in the L1 and L2 caches. IBM has doubled up the data buses from L1 to L2 cache to 64 bytes, and the cores also have improved branch prediction and prefetching of data and instructions as well as larger issue queues. The core has two load store units (LSUs), a condition register unit (CRU), a branch register unit (BRU), and two instruction fetch units (IFUs). For math, the Power8 chip has two fixed-point units (FXUs), a decimal floating point unit (DFU), and two vector math units (VMXs). There is also one cryptographic unit, which is sort of like a specialized math unit if you think about it. A single thread on a Power8 chip running at the 4 GHz clock speed has about 1.6 times the performance of the Power7 core’s thread running at an equivalent clock speed, according to Stuecheli.

The Power8 chip has two main memory controllers, one on either side, and thanks in part to the aggregate 128 MB of eDRAM L4 cache on the Centaur memory buffer chips, IBM can drive 230 GB/sec of sustained bandwidth into and out of those two main memory controllers. This is a lot of memory bandwidth, and once again shows where IBM differentiates from other chip makers.

Those Centaur chips, so called because they are half L4 cache memory and half DDR3 memory controller, are probably the wave of the future for how main memory will be implemented in upcoming systems from IBM and others. It is very simple. Processors change every year, and main memory technology changes more slowly, like every four or five years if we are lucky. It is much better to have a generic memory bus coming out of the CPU and to put the memory scheduling logic, caching structures, and energy management features of a specific memory technology out on the buffer chips that in turn talk to main memory sticks rather than have them on the controller on the die. So with the Centaur, IBM has broken the main memory controller into two and put half of the circuits on the buffer chip while also adding 16 MB of L4 cache memory to sit between the main memory and the remaining generic memory controller on the Power8 die.

Each Power8 processor can have up to eight Centaur chips hanging off it, which yields a maximum of 128 MB of L4 cache per socket. Each Power8 chip has eight high-speed memory channels, running at 9.6 GB/sec, and each Centaur chip can drive four DDR3 ports for a total of 410 GB/sec of aggregate peak bandwidth coming off the DRAM into the Centaur chip’s L4 cache memory. IBM will be supporting 32 GB DDR3 memory sticks with the initial Power8 systems, which yields a maximum of 1 TB of memory capacity per socket. That’s twice the memory of the current Power7+ systems on the market.
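The cache and capacity totals fall straight out of the per-chip counts. Here is a minimal Python sketch of that arithmetic; it assumes one memory stick per DDR3 port, which is the configuration that makes the 1 TB figure work out and is an assumption rather than something IBM has spelled out.

```python
# Per-socket figures from the article.
CENTAURS_PER_SOCKET = 8
L4_CACHE_PER_CENTAUR_MB = 16
DDR3_PORTS_PER_CENTAUR = 4
DIMM_CAPACITY_GB = 32

# Assumption: exactly one DIMM hangs off each DDR3 port.
DIMMS_PER_PORT = 1

l4_cache_per_socket_mb = CENTAURS_PER_SOCKET * L4_CACHE_PER_CENTAUR_MB

dimms_per_socket = CENTAURS_PER_SOCKET * DDR3_PORTS_PER_CENTAUR * DIMMS_PER_PORT
memory_per_socket_gb = dimms_per_socket * DIMM_CAPACITY_GB
memory_per_socket_tb = memory_per_socket_gb / 1024

print(f"L4 cache per socket: {l4_cache_per_socket_mb} MB")
print(f"DIMMs per socket: {dimms_per_socket}")
print(f"Main memory per socket: {memory_per_socket_gb} GB "
      f"({memory_per_socket_tb:.0f} TB)")
```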

Those Power8 memory controllers support transactional memory, by the way. This transactional memory made its debut with the zEnterprise EC12 mainframes announced last year. With regular memory, you lock down resources to avoid contention when transactions are pumped through the system. But with transactional memory, you do your work and assume (correctly) that most of the time there is no contention and then if you do find contention, you back out and wait and redo the work. On mainframes, IBM saw as much as a 45 percent performance boost from transactional memory on DB2 database and virtualized server workloads running on the EC12.
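The do-the-work-then-check idea is easier to see in code. What follows is a software analogy written in Python, not a model of how the Power8 or EC12 hardware actually implements transactional memory; the `VersionedCell` class and `run_transaction` function are hypothetical names made up for this sketch. Work happens without holding a lock, and a version counter is checked at commit time to detect contention and trigger a retry.

```python
import threading


class VersionedCell:
    """A value plus a version counter used to detect conflicting commits."""

    def __init__(self, value):
        self.value = value
        self.version = 0
        # Held only for the brief read and commit steps, never during the work.
        self._guard = threading.Lock()

    def read(self):
        """Return a consistent (value, version) snapshot."""
        with self._guard:
            return self.value, self.version

    def try_commit(self, expected_version, new_value):
        """Commit only if nobody else committed since the snapshot was taken."""
        with self._guard:
            if self.version != expected_version:
                return False  # contention detected; caller must redo the work
            self.value = new_value
            self.version += 1
            return True


def run_transaction(cell, update, max_retries=10_000):
    """Optimistically apply update(value); back out and retry on conflict.

    Returns the number of attempts the transaction needed.
    """
    for attempt in range(1, max_retries + 1):
        value, version = cell.read()
        new_value = update(value)  # the "work", done without any lock held
        if cell.try_commit(version, new_value):
            return attempt
    raise RuntimeError("gave up after too many conflicts")


if __name__ == "__main__":
    counter = VersionedCell(0)

    def worker():
        for _ in range(1000):
            run_transaction(counter, lambda v: v + 1)

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # No increment is lost, because a stale commit is rejected and retried.
    print(counter.value)
```

The bet is the same one the hardware makes: when conflicts are rare, the cost of the occasional redo is smaller than the cost of locking everything up front.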

The chip interconnect on the Power8 die that links the L3 banks to each other has 150 GB/sec of bandwidth per direction per L3 cache segment (there are 12 of them, one for each core) when running at 4 GHz. IBM has created a NUMA-like scheduling system for the L2 and L3 caches that keeps hot data migrating into L2 from L3 cache segments. You can move data into the L3 cache from L4 cache on a single Power8 chip at 128 GB/sec and out to the L4 cache at 64 GB/sec. The pipes between the L3 and L2 caches run at 128 GB/sec both ways. The Power8 can move data out of the cores to the L2 cache at 64 GB/sec, but can move data from the L2 cache to the core at four times the speed, or 256 GB/sec. Add it up across twelve cores running at 4 GHz, and you have 4 TB/sec of L2 cache bandwidth and 3 TB/sec of L3 cache bandwidth.
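Here is one plausible way those aggregate numbers add up, sketched in Python. It assumes the L2 total counts both directions between each core and its L2 segment, and the L3 total counts both directions between each L2 and L3 segment. The L2-to-L3 link speed is taken as 128 GB/sec each way, which is an assumption of this sketch; the totals are rounded in the text.

```python
CORES = 12

# Per-core link speeds in GB/sec at 4 GHz.
L2_TO_CORE_GBS = 256   # loads from L2 into the core
CORE_TO_L2_GBS = 64    # stores from the core out to L2
L3_TO_L2_GBS = 128     # assumed figure for this sketch
L2_TO_L3_GBS = 128     # assumed figure for this sketch

# Aggregate both directions across all twelve cores.
l2_total_gbs = CORES * (L2_TO_CORE_GBS + CORE_TO_L2_GBS)
l3_total_gbs = CORES * (L3_TO_L2_GBS + L2_TO_L3_GBS)

print(f"L2 aggregate: {l2_total_gbs} GB/sec, roughly {l2_total_gbs / 1000:.1f} TB/sec")
print(f"L3 aggregate: {l3_total_gbs} GB/sec, roughly {l3_total_gbs / 1000:.1f} TB/sec")
```

Under those assumptions the sums land close to the rounded 4 TB/sec and 3 TB/sec figures quoted above.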

Next week, I will do a thought experiment about how this chip might be used in future Power Systems iron, since IBM is not yet ready to talk about that.