IBM Pushes Performance Up, Energy Down With Power8
February 17, 2014 Timothy Prickett Morgan
The IEEE hosted its International Solid-State Circuits Conference last week in San Francisco, which is generally loaded with coming-out parties for all kinds of server processors. IBM's top techies from the Power Systems division were on hand to show off some more of the feeds and speeds of the forthcoming Power8 chips, expected sometime around the middle of this year.
Many of the feeds and speeds of the Power8 chip were divulged last year at the Hot Chips conference at Stanford University, which The Four Hundred reported on at the time. Some more details of the Power8 chip were revealed at ISSCC last week, which we will now give to you.
First, a recap. The Power8 chip is implemented in IBM's 22 nanometer chip fabrication process, which is combined with its silicon-on-insulator and copper wiring technologies to increase the power efficiency of the transistors on the circuit. The Power8 chip has 12 cores, which we knew, and IBM now says that the die has 4.2 billion transistors across its cores and 96 MB of on-chip eDRAM L3 cache. By comparison, the Power7 chip, implemented in a 45 nanometer copper/SOI process, had 1.2 billion transistors and only 32 MB of eDRAM L3 cache. IBM may be a process node or so behind Intel, which has had 22 nanometer processes in production since last year and which is getting ready to ship 14 nanometer server chips later this year, but Big Blue's chip techies and chip making plant in East Fishkill, New York, are doing better than GlobalFoundries or Taiwan Semiconductor Manufacturing Corp are doing. Moore's Law is still working in East Fishkill, even if IBM can't afford to keep up with Intel anymore because it has lost a lot of the merchant silicon business to ARM upstarts, Intel, and others.
Modern processors do not have pins as we old folks know them, but rather little copper pads that provide the link between the processor package and the chip socket. The Power8 chip has a stunning 15,823 pins, with 5,982 of them used for powering various parts of the die, 7,742 of them for ground, and 2,099 of them for signals. Each Power8 core has 32 KB of L1 instruction cache and 64 KB of L1 data cache plus a 512 KB SRAM L2 cache; the L3 cache is spread around the cores and comes to 96 MB in total. The die has a DDR4 main memory controller, support for transactional memory, and various coprocessors for memory compression (still not supported by IBM i but working for AIX) and virtual memory management. The memory interface on the Power8 die runs 50 percent faster than that of the Power7+ chip, but uses 50 percent less power; it can drive up to 72 memory lanes running at 9.6 Gb/sec, and the socket-to-socket interface runs at 4.8 Gb/sec–which is 50 percent faster than Power7+ again. There's some logic to this, with the Power8 chip, at 12 cores, having 50 percent more cores than the Power7+ chip.
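Those figures are internally consistent, as a quick back-of-the-envelope check shows. This is a sketch using only the numbers quoted above; nothing here is measured:

```python
# Sanity check on the Power8 pin and memory-lane figures quoted
# in the article (all inputs come from the article itself).

power_pins, ground_pins, signal_pins = 5_982, 7_742, 2_099
total_pins = power_pins + ground_pins + signal_pins
print(total_pins)  # 15823, matching the stated total pin count

# 72 memory lanes at 9.6 Gb/sec per lane gives the raw signaling rate:
lanes, gbit_per_lane = 72, 9.6
raw_gbit = lanes * gbit_per_lane   # about 691 Gb/sec raw
raw_gbyte = raw_gbit / 8           # about 86 GB/sec
print(raw_gbit, raw_gbyte)
```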
The Power8 chip has two on-die PCI-Express 3.0 controllers and an overlay called the Coherent Accelerator Processor Interface, which rides atop that PCI transport and which links external accelerators into the main memory of the Power8 chip so they can share a virtual memory space. This is important because the big latency and programming issue with using any kind of accelerator is having to shift from the CPU main memory to an accelerator’s memory and back. With virtual shared memory, there is no shifting.
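To see why skipping those copies matters, here is a deliberately simplified timing model. This is not CAPI's actual programming interface, and the numbers are invented purely for illustration:

```python
# Toy model of offload cost: a conventional PCI-attached accelerator
# pays a copy into its memory and a copy back out; a coherent
# accelerator sharing virtual memory with the CPU pays neither.
# Function name and millisecond figures are hypothetical.

def offload_time(compute_ms, copy_ms, coherent):
    # Coherent accelerators operate on CPU main memory directly,
    # so the to-and-from transfers disappear.
    return compute_ms if coherent else compute_ms + 2 * copy_ms

print(offload_time(10, 4, coherent=False))  # 18
print(offload_time(10, 4, coherent=True))   # 10
```

The shorter the compute step relative to the copies, the bigger the relative win from coherence.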
The Power8 chip is a true I/O monster. On the die, the chip interconnect has 12 segments–one for each core–that each have 150 GB/sec of bandwidth in each direction, for a total of 3.6 TB/sec of bandwidth in and out of the SMP interconnect and L3 cache memory. The SMP interconnect and PCI-Express controllers deliver an aggregate of 951 GB/sec (or 7.6 Tb/sec if you want to look at it that way) of bandwidth in and out of the chip to the outside motherboard. That is 1.6 times the bandwidth of the Power7+ chip, and again, this stands to reason given the extra cores and fatter L3 cache. There is also 128 MB of eDRAM L4 cache on the “Centaur” memory buffers used on high-end systems, which also accelerates performance and keeps things in balance as data flows from disks into the system and back out again.
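The arithmetic behind those headline numbers checks out. A quick sketch with the quoted figures:

```python
# Verifying the on-die and off-chip bandwidth figures quoted above.
segments = 12                  # one interconnect segment per core
gb_per_dir = 150               # GB/sec in each direction per segment
on_die_tb = segments * gb_per_dir * 2 / 1000
print(on_die_tb)               # 3.6 TB/sec across the SMP interconnect

off_chip_gb = 951              # GB/sec in and out of the socket
off_chip_tbit = off_chip_gb * 8 / 1000
print(off_chip_tbit)           # about 7.6 Tb/sec, as the article notes
```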
Interestingly, the Power8 chip has its own pet PowerPC chip embedded on it, with its own 512 KB SRAM cache, and it is used as an on-chip controller running firmware that provides real-time control of the chip. This firmware watches for changes in the workloads running on the chip and adjusts the frequency and voltage of the cores on the die as needed to maximize performance within a given thermal envelope. This on-die controller can react to workload changes about 100 times faster, generally acting under the timescale of a typical operating system timeslice. Which is precisely what you want.
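IBM has not published the controller's actual algorithm, but a governor of this general kind can be pictured as a control loop like the following hypothetical sketch. The function name, thresholds, and step sizes are invented for illustration only:

```python
# Hypothetical sketch of a frequency/voltage governor of the sort
# the on-chip controller runs. Everything here is illustrative;
# IBM's real firmware and its policies are not public.

def adjust_frequency(freq_ghz, utilization, power_w, power_budget_w,
                     step=0.1, f_min=2.5, f_max=4.2):
    """Raise the clock when cores are busy and there is power
    headroom; lower it when the power/thermal budget is exceeded."""
    if power_w > power_budget_w:
        return max(f_min, freq_ghz - step)      # throttle back
    if utilization > 0.9 and power_w < 0.9 * power_budget_w:
        return min(f_max, freq_ghz + step)      # ramp up
    return freq_ghz                             # hold steady

f = 3.5
f = adjust_frequency(f, utilization=0.95, power_w=150, power_budget_w=200)
print(f)  # steps upward while headroom remains
```

The point of doing this in on-die firmware rather than in the operating system is the reaction time: the loop can run far more often than an OS scheduler tick.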
Based on the charts and art in IBM's presentations, the company has tested Power8 chips running at between 2.5 GHz and 5.5 GHz. It remains to be seen what clock speeds IBM can deliver with appreciable yields out of East Fishkill. At 4 GHz, a Power8 socket will deliver around 2.5 times the performance of a Power7+ socket, and that is a hell of a big jump. Part of that has to do with having more cores, but part also with boosting the multithreading from four to eight threads per core. At that same 4 GHz test speed for Power8, the single-thread performance of a core is about 1.6 times that of a thread running on a Power7 (not Power7+) chip.
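Treating the per-thread figures for Power7 and Power7+ as roughly comparable, the quoted factors compose plausibly. This is back-of-the-envelope arithmetic on IBM's claims, not a benchmark:

```python
# Rough composition of the quoted performance factors (all inputs
# are from the article; the comparison glosses over the Power7
# versus Power7+ distinction in the per-thread number).
core_ratio = 12 / 8       # Power8 has 50 percent more cores than Power7+
single_thread = 1.6       # per-core gain vs. Power7 at the 4 GHz test speed
print(round(core_ratio * single_thread, 2))  # 2.4, near the ~2.5x socket claim
```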
Suffice it to say, the new chips will pack a wallop. And as I have discussed before, the new Power8 systems give IBM a chance to rethink how it packages and prices the Power Systems-IBM i stack and to try to get its entire base to move forward, rather than just those who are pushed to the cutting edge by the demands of their applications.