Power8 Packs More Punch Than Expected

August 18, 2014 Timothy Prickett Morgan

Here is something you don’t see every day in the systems business. IBM is getting even better performance out of the new Power8 processors that were launched back in April than it anticipated. Systems performance engineer Alex Mericas, who works in IBM’s Systems and Technology Group, gave a presentation at the Hot Chips 26 conference in Silicon Valley last week, revealing that the Power8 was delivering a little more oomph than expected.

As The Four Hundred previously reported, IBM revealed a lot of the details around the Power8 chip about a year ago at the Hot Chips 25 conference, showing off the core count and memory and I/O bandwidth of its 12-core Power engine.

We didn’t know this at the time, but what IBM was talking about then was a monolithic 12-core Power8 chip that it now refers to as the Power8 Scale-Up version. This one, Mericas explained, has all of the cores on the same die, plus large memory capacity of up to 1 TB per socket and very high memory bandwidth to go with it. The Power8 Scale-Up also sports 32 PCI-Express 3.0 links, 16 of which can be configured as Coherent Accelerator Processor Interface (CAPI) ports to hook into accelerator co-processors based on GPUs, DSPs, FPGAs, and anything else you might want to put on a PCI card and hook into the virtual memory of the Power8 complex.

The chips that actually shipped in the initial Power8-based machines are what IBM calls the Power8 Scale-Out processors, and in this case, IBM actually chops a Power8 chip in two (presumably because half of it has something messed up on the die) with six cores on each half and the full complex of local SMP links and accelerators on each half instead of shared across all twelve cores on the monolithic die. The Scale-Out chip is a little wider than the Scale-Up version because of this. It also has 48 PCI-Express 3.0 slots, with up to 32 of these being configurable as CAPI ports.

Mericas took performance data from last year’s Hot Chips, which showed the expected performance of the Power8 Scale-Up chip that was in development at the time, and added to it the actual measured performance of the Power8 Scale-Out on various workloads. The estimates were made comparing a Power7+ and a Power8 at the same 4 GHz baseline clock speed. For the real-world tests, IBM pitted a two-socket Power 740+ server with two eight-core Power7+ processors against a two-socket Power S824 machine based on two six-core Scale-Out multichip modules. In the real world, that Power7+ chip runs at 4.2 GHz and the Power8 chip runs at 3.52 GHz, so the comparison only makes sense if the chips are normalized to a common clock speed. There is a 3.6 GHz version of the Power 740+, and it seems likely that IBM either clocked them both at 3.6 GHz or overclocked the Power8 up to 4 GHz and slowed down the Power7+ to 4 GHz to make the comparisons fair. Here is the result:

As you can see, the memory bandwidth as delivered by the initial Power8 systems is a bit higher than expected (about 12 percent in my estimate of the chart above), and so is the Java performance (about 15 percent). Floating point math operations are about what Big Blue expected they would be, but integer performance is a little bit lower (it looks to be around 10 percent lower) compared to expectations. The commercial performance (which is very likely based on the TPC-C or SAP SD transaction processing tests, but IBM does not say) also came in a bit shy of expectations, about 8 percent by my eye.
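For readers who want to see what normalizing to a common clock involves, here is a minimal sketch. It assumes performance scales linearly with frequency, which is a simplification, and IBM has not said which method it used. The function name and the raw scores of 100 are hypothetical; only the clock speeds come from the figures cited above.

```python
def normalize_to_clock(score, measured_ghz, target_ghz):
    """Rescale a measured score to a target clock speed.

    Assumes throughput scales linearly with frequency, which real
    workloads only approximate (memory-bound code scales less).
    """
    if measured_ghz <= 0 or target_ghz <= 0:
        raise ValueError("clock speeds must be positive")
    return score * target_ghz / measured_ghz


# Hypothetical raw scores of 100 for both machines; the clock speeds
# (4.2 GHz, 3.52 GHz, and the 4 GHz baseline) are from the article.
power7_plus_at_4ghz = normalize_to_clock(100.0, measured_ghz=4.2, target_ghz=4.0)
power8_at_4ghz = normalize_to_clock(100.0, measured_ghz=3.52, target_ghz=4.0)

print(f"Power7+ normalized to 4 GHz: {power7_plus_at_4ghz:.1f}")
print(f"Power8 normalized to 4 GHz:  {power8_at_4ghz:.1f}")
```

Under that linear assumption, a Power8 result measured at 3.52 GHz gets scaled up by about 14 percent and a Power7+ result measured at 4.2 GHz gets scaled down by about 5 percent before the two are compared.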

Such wiggling between theoretical and modeled performance and actual performance can be attributed to a few factors, and one of them might have to do with the difference between a monolithic chip and two half chips sharing the same package. Perhaps more significantly, the fact that IBM can reckon the performance of such workloads in its models and come even close to the numbers, with a variance of 8 to 15 percent up or down, is pretty remarkable.

IBM did not talk about how the performance of these various workloads would vary based on the operating system, and presumably these are all performance metrics based on IBM’s AIX Unix variant running atop the processor. Generally speaking, for a lot of workloads, there is not a significant performance difference among AIX, Linux, and IBM i on Power iron up to a certain level of scalability, but across multiple processors, AIX and Linux scale better than IBM i does.

In addition to talking about actual versus expected performance on Power8 chips relative to a Power7+ baseline, Mericas also talked a bit about the batch performance of the Power8 processor. And yes, in case you missed it, batch is back in vogue. Well, to be fair, it never really left, as IBM i and System z mainframe shops know full well. But what a lot of people don’t realize is that Hadoop and its MapReduce protocol, so commonly used for big data analytics these days, is a batch processing system. And as IBM tries to position Power8 clusters as a perfect place to run modern workloads like Hadoop, batch performance therefore matters a great deal.

To stress test the Power8 under batch processing conditions, IBM grabbed an internal benchmark that emulates batch tasks performing compression under conditions where the response time for the transactions is important. IBM compared a Power7+ core to a Power8 core with one thread dedicated to the task and no other threads on a core doing any other work. In another set of runs, IBM turned on the maximum number of threads for each core (four for the Power7+ and eight for the Power8) and ran the batch job again.

For single-threaded batch work on this unnamed workload, the Power8 core delivered 2.3X the throughput on the batch work and a 56 percent lower response time compared to the Power7+ core. (These tests were done on two sixteen-core systems, presumably at the same clock speeds or with the data normalized for clock speeds.) If you turn on simultaneous multithreading (SMT) on the Power7+ core, the Power8 with no threading has an 82 percent lower response time on the batch work and delivers 1.4X more throughput. And if you compare the Power7+ with SMT4 (which means four threads of SMT) to Power8 with SMT8 (eight threads per core), then the Power8 core has a 31 percent lower response time and delivers 2.9X the throughput of the Power7+ core.
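A "percent lower response time" figure is easy to misread next to a throughput multiple, so here is a small sketch that converts the percentages quoted above into the same "times faster" form. The function name and the comparison labels are mine, not IBM's, and the 1.4X figure is taken as quoted even though "1.4X more" could be read more than one way.

```python
def response_speedup(percent_lower):
    """Convert 'N percent lower response time' into a speedup factor.

    A response time that is N percent lower is (1 - N/100) of the
    baseline, so the work completes 1 / (1 - N/100) times faster.
    """
    if not 0 <= percent_lower < 100:
        raise ValueError("percent_lower must be in [0, 100)")
    return 1.0 / (1.0 - percent_lower / 100.0)


# (label, percent lower response time, throughput multiple), as quoted above.
comparisons = [
    ("Power8 ST vs Power7+ ST", 56, 2.3),
    ("Power8 ST vs Power7+ SMT4", 82, 1.4),
    ("Power8 SMT8 vs Power7+ SMT4", 31, 2.9),
]

for label, pct, throughput in comparisons:
    print(f"{label}: response {response_speedup(pct):.2f}x faster, "
          f"throughput {throughput}x")
```

Read this way, the 56 percent reduction in the single-threaded case works out to a response time roughly 2.3 times faster, in line with the 2.3X throughput figure, while the SMT8 case shows throughput growing much faster than response time improves.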

The lesson here is that performance, for both online and batch work, depends on how threaded your software is, and you can tune the system to get a desired mix of throughput and response time based on how many threads per core you activate. You are in control of those knobs, and SMT can be dynamically allocated as conditions change. This is a far cry from the old days when you added a few disk arms or a few sticks of memory to try to rejigger throughput and response time, or worse yet, did a processor upgrade based on some generic relative performance metric like RAMP-C or CPW and then did not see the expected performance gains after spending the money. IBM i shops have to know how their apps are threaded and set goals, and then see if they can tune the hardware threading to get the results they need for both batch and online work.