IBM Clarifies Power8 Chippery And Performance

September 8, 2014 Timothy Prickett Morgan

A few weeks back, The Four Hundred gave you some details about the performance of Power8 systems and how they were doing a little bit better than IBM had expected on a series of benchmark tests compared to the prior generation of Power7+ machinery. In that story I also discussed the two versions of the Power8 chip. IBM now wants to clarify some of the specifics of those chips and of the benchmark results.

Here’s the background. Systems performance engineer Alex Mericas, who works in IBM’s Systems and Technology Group, gave a presentation at the Hot Chips 26 conference in early August talking about the Power8 processor. In the story Power8 Packs More Punch Than Expected, which was based on his presentation, I suggested that the dual-chip module versions of the Power8 chip, which are used in what are called the Power8 Scale-Out systems by IBM and which were announced in April, were actually cut from a single Power8 die. It is not completely illogical to think this, considering that the Power8 Scale-Up chip, as IBM is calling the one it showed off a year earlier at Hot Chips 25 and which will be used in larger enterprise-class systems, has a maximum of 12 cores and each half of the dual-chip module, or DCM in the IBM lingo, has a maximum of six cores.

“We actually have two different chips,” explains Mericas. “What we did is we actually cut the design by roughly 60 percent. But they are actually two whole completely different chips in the Scale-Out systems versus the Scale-Up. We are not making one chip and then sorting them to see which ones to chop up.”

Both sets of chips are implemented using IBM’s 22 nanometer processes and fabbed at its chip plant in East Fishkill, New York.

Mericas is not involved in the chip manufacturing part of IBM Microelectronics, so he could not confirm my speculation that yields are likely to be better on a smaller die than on a larger one, which seems logical to me. Better yields could be a benefit of having a smaller six-core variant of the Power8 chip, but they are not the reason to create one in the first place.

“For the Scale-Out systems, the main thing we were worried about was having enough I/O for the system,” Mericas explains. “The original Scale-Up chip would only have 32 lanes of PCI-Express per socket. And particularly since we are trying to push our CAPI interface, we would end up eating up a good bit of our I/O and potentially not leave enough I/O for what these small systems really need. And with this two chip design, we were able to optimize this Scale-Out system around having more I/O capability.”

The Power8 Scale-Up chip has all of the cores on the same die, plus a memory capacity of up to 1 TB per socket and very high memory bandwidth to go with it. The Power8 Scale-Up also sports 32 PCI-Express 3.0 lanes per socket, with 16 of those lanes being configurable as Coherent Accelerator Processor Interface (CAPI) ports to hook into accelerator co-processors based on GPUs, DSPs, or FPGAs on a PCI card and link them into the virtual memory of the Power8 complex. The Power8 Scale-Out chip is a little wider on the I/O, and has 48 PCI-Express 3.0 lanes per socket, with up to 32 of them being configurable as CAPI ports. So if you want to run an accelerator in a fat x16 slot, you can put two accelerators into a Power8 Scale-Out machine compared to one for the Scale-Up variant, and in either case you have another 16 lanes of PCI-Express 3.0 per socket to hook other peripherals like disk controllers or network interface cards into the system.
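The lane arithmetic in the paragraph above can be checked with a quick sketch. The lane counts come from the figures cited in this story; the class and method names are made up for illustration and have nothing to do with any IBM tooling, and the sketch assumes every accelerator takes a full x16 slot.

```python
from dataclasses import dataclass

X16_SLOT_LANES = 16  # width of the "fat x16 slot" mentioned in the text


@dataclass(frozen=True)
class SocketIO:
    """Per-socket I/O budget (hypothetical helper, figures from the article)."""
    name: str
    pcie_lanes: int          # total PCI-Express 3.0 lanes per socket
    capi_capable_lanes: int  # lanes that can be configured as CAPI ports

    def x16_accelerators(self) -> int:
        # Number of full x16 CAPI accelerators that fit in the CAPI-capable lanes.
        return self.capi_capable_lanes // X16_SLOT_LANES

    def lanes_left_for_peripherals(self) -> int:
        # Lanes remaining after the x16 accelerators take their share.
        return self.pcie_lanes - self.x16_accelerators() * X16_SLOT_LANES


SCALE_UP = SocketIO("Power8 Scale-Up", pcie_lanes=32, capi_capable_lanes=16)
SCALE_OUT = SocketIO("Power8 Scale-Out", pcie_lanes=48, capi_capable_lanes=32)

for socket in (SCALE_UP, SCALE_OUT):
    print(
        f"{socket.name}: {socket.x16_accelerators()} x16 accelerator(s), "
        f"{socket.lanes_left_for_peripherals()} lanes left for other peripherals"
    )
```

Either way the socket ends up with 16 lanes free for disk controllers and network cards; the difference is that the Scale-Out part fits a second x16 accelerator.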

The decision to make two Power8 chips was made at a very early design stage, well before first silicon, and had been part of the Power8 strategy for quite some time ahead of the launch in April, Mericas said.

Some other things to note. The table presenting the relative performance of Power7+ and Power8 systems set the Power8 Scale-Up data shown at Hot Chips 25 alongside Power8 Scale-Out data run on the new machines launched in April. These results were not normalized to a specific clock frequency, which was not clear in the data shown. Rather, IBM ran the tests on the fastest 16-core Power 740+ system and on the fastest 24-core Power8 S824 system and showed the relative performance. The commercial workload that is probably most relevant to IBM i shops was a mathematical average of a bunch of different benchmarks; Mericas did not divulge which ones.
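To see why the lack of normalization matters, here is a rough sketch of how a system-level ratio could be scaled down to a per-core, per-clock figure. The core counts (16 and 24) are from the systems named above; the system-level ratio and the clock speeds below are placeholders chosen purely for illustration, not IBM's numbers, and the function name is hypothetical.

```python
def per_core_per_clock_ratio(
    system_ratio: float,
    old_cores: int,
    new_cores: int,
    old_ghz: float,
    new_ghz: float,
) -> float:
    """Scale a system-level performance ratio (new/old) by core count and clock.

    Assumes performance scales linearly with cores and frequency, which real
    workloads rarely do exactly; treat the result as a rough guide only.
    """
    return system_ratio * (old_cores / new_cores) * (old_ghz / new_ghz)


# Core counts from the article: 16-core Power 740+ versus 24-core Power8 S824.
# PLACEHOLDER inputs (not from IBM): a 2.0x system ratio and equal 4.0 GHz clocks.
example = per_core_per_clock_ratio(
    system_ratio=2.0, old_cores=16, new_cores=24, old_ghz=4.0, new_ghz=4.0
)
print(f"per-core, per-clock ratio: {example:.2f}")
```

With those made-up inputs, a headline doubling at the system level shrinks considerably once the extra eight cores are factored out, which is why it helps to know that the published figures compare whole systems.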

I just noticed something else interesting in the presentation from Hot Chips 26 that I missed a few weeks ago, and that is the big improvement in hardware encryption with the Power8 chips:

The Power8 core and chip have accelerators for the AES encryption and SHA hashing algorithms, both on the chip and in the core, and the RAID CRC/syndrome checksum accelerator that is new with Power8 is in the core as well. The interesting bit in that chart is that Power8 has user-mode instructions to accelerate common algorithms, and on the SPECjbb2013 Java application benchmark the tweaks to the hardware encryption led to a 10 percent improvement in performance. Also, look at how few cycles per byte the SHA and AES functions take on Power8 as implemented in hardware compared to doing these algorithms in software on Power7 and Power7+ processors. This is a radical improvement.