Volume 23, Number 30 -- September 9, 2013

Power8 Processor Packs A Twelve-Core Punch--And Then Some

Published: September 9, 2013

by Timothy Prickett Morgan

Well, IBM i shops hankering for some more processing oomph, Big Blue has a very big jump in performance coming your way with the future Power8 processors. Just after The Four Hundred went on vacation at the end of August, the techies who designed the Power8 chip took the podium at the Hot Chips conference, hosted by the IEEE every summer for the past 25 years at Stanford University, and revealed the major aspects of the design of the successor to the current Power7 and Power7+ processors.

As I was watching the presentation by Jeff Stuecheli, chief nest architect for the Power8 chip (that refers to the non-core portions of the die, with the cores being akin to eggs), it struck me immediately that this single Power8 processor, with its dozen cores, was akin to an entire AS/400e 650-2243 system from the summer of 1997.

At its design point clock speed of 4 GHz, the Power8's cycle time is a whopping 32 times lower than that of the 125 MHz (remember when megahertz used to matter?) PowerPC Apache processors used in the AS/400e 650. That 12-way AS/400e 650 machine was rated at 2,340 CPWs, and the following year that system was upgraded to 262 MHz PowerPC Northstar processors with an early implementation of simultaneous multithreading (SMT) that allowed the 12-way AS/400e 650 system to push up to 4,550 CPWs of aggregate raw OS/400 performance.

Stuecheli said in his presentation that based on early benchmark tests, the Power8 chip running at 4 GHz had somewhere around 2.5 times the performance of a Power7+ chip at the socket level, due to a mix of factors that I will explain in a bit. Based on my extrapolations from CPW ratings for Power 740 systems with a single Power7+ chip running at 4.2 GHz, I would peg a single Power8 chip running at 4 GHz at around 155,000 CPWs, give or take a few thousand. In other words, the AS/400e 650, the top-end box from 15 years ago, fits in the error bars of a guesstimation of the single chip performance of a Power8 chip.
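
To put rough numbers on that comparison, here is a back-of-the-envelope sketch in Python. It uses only the figures quoted above. The 155,000 CPW value is the extrapolated estimate from this article, not an IBM rating, and the variable names are made up for illustration.

```python
# Back-of-the-envelope comparison using only figures quoted in the article.
# The Power8 CPW value is an extrapolated estimate, not an official rating.

POWER8_CLOCK_MHZ = 4000           # 4 GHz design point
APACHE_CLOCK_MHZ = 125            # PowerPC Apache in the AS/400e 650

AS400E_650_APACHE_CPW = 2340      # 12-way system, summer of 1997
AS400E_650_NORTHSTAR_CPW = 4550   # 12-way system after the Northstar upgrade
POWER8_ESTIMATED_CPW = 155000     # single chip, estimate only

clock_ratio = POWER8_CLOCK_MHZ / APACHE_CLOCK_MHZ
cpw_ratio_apache = POWER8_ESTIMATED_CPW / AS400E_650_APACHE_CPW
cpw_ratio_northstar = POWER8_ESTIMATED_CPW / AS400E_650_NORTHSTAR_CPW

print(f"Clock speed ratio: {clock_ratio:.0f}x")
print(f"Estimated CPW ratio vs. Apache 650: {cpw_ratio_apache:.0f}x")
print(f"Estimated CPW ratio vs. Northstar 650: {cpw_ratio_northstar:.0f}x")
```

The clock has gone up by a factor of 32, but the estimated throughput of one socket has gone up by roughly twice that against the original 12-way box, which is the effect of more work per clock, more threads, and a lot more cache.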

Just let that soak in a little bit. And think about how large the IBM midrange business might be, and how prevalent IBM i might be, had Big Blue priced this technology very aggressively to extend its customer base rather than to extract profits so aggressively from it. Q might be a very popular letter out there in the data centers of the world, and in an alternative universe where IBM listened to us, it is.

We have to deal with what is, not just what can and cannot be, and what I can tell you is that the Power8 chip is yet another leap forward for the Power Systems business and demonstrates, yet again, that Big Blue knows a thing or two about designing processors. This is one big bad chip, and it is aimed at big bad systems for sure.



Hopefully, IBM will be able to gear it down for entry and midrange customers with more modest processing needs--and do so at an affordable price. This will be particularly important for the IBM i customers that Big Blue wants to retain as well as the Linux and, to a smaller extent, AIX customers that the company wants to attract. As you well know, most IBM i shops have entry or midrange systems with two, four, or eight cores running IBM i. One of these Power8 chips is more than enough for them. I will get into the possibilities in next week's issue.

The feeds and speeds of the Power8

The Power8 chip starts out with the Power7+ core that IBM rolled out this time last year and finished putting across the Power Systems product line as the summer was getting started. IBM is moving from the 32 nanometer process used to etch its Power7+ chips to a 22 nanometer process with 15 metal layers. Both processes use the high-k metal gate and silicon-on-insulator (SOI) techniques that IBM perfected many years ago in previous generations of Power chips. This is a pretty big shrink--the same one that Intel has made with its Xeon processors--and it gives Big Blue lots of options. It could have allowed IBM to cram as many as 16 cores on a die, perhaps, if it was willing to sacrifice some on-chip L3 cache memory. It is always hard to guess these things from the outside.

As it turns out, IBM went with a dozen cores and cut back on the L3 cache a little bit so it could add PCI-Express 3.0 controllers and a new Coherent Accelerator Processor Interface, which rides atop that PCI transport, to the die. These integrated PCI-Express controllers--there are two of them--and their CAPI overlay replace the current GX++ bus used on Power chips. The GX++ is a funky variant of 20 Gb/sec InfiniBand that comes off the die and then talks to an I/O bridge that in turn talks to PCI devices. Each PCI-Express controller runs at 8 Gb/sec, so this is actually a slight drop in raw bandwidth. But this is a much better way to do things, and will make integration with PCI devices simpler.

And, as The Four Hundred already divulged, the CAPI overlay will allow for the sharing of memory between Power8 processors and auxiliary coprocessors linked to them over CAPI. IBM no doubt wants makers of network controllers, GPUs, FPGAs, and DSPs to adopt CAPI's Processor Service Layer so their devices can talk to the coherence bus on the Power8 chip and act like they were right there on the die. That, if anything, is what the OpenPower consortium announced a few weeks ago is all about.

The Power8 chip also has a new technology-neutral memory controller that puts a lot of the electronics dealing with specific signaling for particular memory types (DDR3, DDR4, or whatever) out onto the buffer chips that IBM has always used with the most recent Power chips.



As expected, IBM has goosed the SMT capabilities of the Power motor with the Power8, which will be able to manage as many as eight threads per core. Each of those threads looks like an individual processor to the IBM i, AIX, or Linux operating system, and the core interleaves them during stalls and other events that slow down processing on a single thread to squeeze more work out of the machine. This threading is dynamic and automatic and can scale back to a single actual thread as well as run with two, four, or eight threads activated. Databases and Java application servers like threads; RPG and COBOL apps are more threaded than in past years but have their limits.
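
As a quick illustration of what those SMT modes mean for the operating system, the sketch below counts the logical processors a single twelve-core Power8 chip would present in each mode. It is simple multiplication on the figures in this article, and the names are hypothetical.

```python
# Logical processors presented by one Power8 chip in each SMT mode.
# Assumes all twelve cores on the chip are active in the same SMT mode.

CORES_PER_CHIP = 12
SMT_MODES = (1, 2, 4, 8)   # threads per core

logical_processors = {mode: CORES_PER_CHIP * mode for mode in SMT_MODES}

for mode, count in logical_processors.items():
    print(f"SMT{mode}: {count} logical processors per chip")
```

In the top SMT8 mode, one socket looks like 96 processors to IBM i, AIX, or Linux.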

Each Power8 core has 64 KB of L1 data cache (twice what is on the Power7 and Power7+ chips) and 32 KB of L1 instruction cache. Each core on the die has a 512 KB L2 cache segment allocated to it, which butts up against the 96 MB of shared L3 cache, which is implemented in embedded DRAM. As with the Power7 and Power7+ chips, this eDRAM takes fewer transistors to implement a memory cell, but those cells move a little slower than the SRAM that is used in the L1 and L2 caches. IBM has doubled up the data buses from L1 to L2 cache to 64 bytes, and the cores also have improved branch prediction and prefetching of data and instructions as well as larger issue queues. The core has two load store units (LSUs), a condition register unit (CRU), a branch register unit (BRU), and two instruction fetch units (IFUs). For math, the Power8 chip has two fixed-point units (FXUs), a decimal floating point unit (DFU), and two vector math units (VMXs). There is also one cryptographic unit, which is sort of like a specialized math unit if you think about it. A single thread on a Power8 chip running at the 4 GHz clock speed has about 1.6 times the performance of the Power7 core's thread running at an equivalent clock speed, according to Stuecheli.
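
For those who like to see the cache hierarchy totted up, here is a small sketch that works out chip-level totals from the per-core figures above. It assumes all twelve cores are active and that the shared L3 is divided evenly into one segment per core; the names are illustrative only.

```python
# Chip-level cache totals derived from the per-core figures in the article.
# Assumes twelve active cores and an even split of the shared L3 cache.

CORES = 12
L1_DATA_KB = 64
L1_INSTRUCTION_KB = 32
L2_PER_CORE_KB = 512
L3_SHARED_MB = 96

l1_total_kb = CORES * (L1_DATA_KB + L1_INSTRUCTION_KB)
l2_total_mb = CORES * L2_PER_CORE_KB / 1024
l3_per_core_mb = L3_SHARED_MB / CORES

print(f"Total L1 cache: {l1_total_kb} KB")
print(f"Total L2 cache: {l2_total_mb:.0f} MB")
print(f"L3 cache per core: {l3_per_core_mb:.0f} MB")
```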

The Power8 chip has two main memory controllers, one on either side, and thanks in part to the aggregate 128 MB of eDRAM L4 cache on the Centaur memory buffer chips, IBM can drive 230 GB/sec of sustained bandwidth into and out of those two main memory controllers. This is a lot of memory bandwidth, and once again shows where IBM differentiates from other chip makers.

Those Centaur chips, so called because they are half L4 cache memory and half DDR3 memory controller, are probably the wave of the future for how main memory will be implemented in upcoming systems from IBM and others. It is very simple. Processors change every year, and main memory technology changes more slowly, like every four or five years if we are lucky. It is much better to have a generic memory bus coming out of the CPU and to put the memory scheduling logic, caching structures, and energy management features of a specific memory technology out on the buffer chips, which in turn talk to the main memory sticks, rather than have them in the controller on the die. So with the Centaur, IBM has broken the main memory controller into two and put half of the circuits on the buffer chip while also adding 16 MB of L4 cache memory to sit between the main memory and the remaining generic memory controller on the Power8 die.

Each Power8 processor can have up to eight Centaur chips hanging off it, which yields a maximum of 128 MB of L4 cache per socket. Each Power8 chip has eight high-speed memory channels, running at 9.6 GB/sec, and each Centaur chip can drive four DDR3 ports for a total of 410 GB/sec of aggregate peak bandwidth coming off the DRAM into the Centaur chip's L4 cache memory. IBM will be supporting 32 GB DDR3 memory sticks with the initial Power8 systems, which yields a maximum of 1 TB of memory capacity per socket. That's twice the memory of the current Power7+ systems on the market.
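
The per-socket memory figures above can be checked with a few lines of arithmetic. This sketch assumes one 32 GB memory stick on each DDR3 port, which is the configuration that reproduces the 1 TB maximum; the per-port bandwidth is simply the quoted 410 GB/sec divided across the ports and is not a figure IBM has given. All names are hypothetical.

```python
# Per-socket memory arithmetic from the figures in the article.
# Assumption: one 32 GB DDR3 stick per DDR3 port on each Centaur chip.

CENTAURS_PER_SOCKET = 8
L4_PER_CENTAUR_MB = 16
DDR3_PORTS_PER_CENTAUR = 4
STICK_CAPACITY_GB = 32
AGGREGATE_DRAM_BANDWIDTH_GBS = 410   # peak, DRAM into the Centaur L4 caches

l4_total_mb = CENTAURS_PER_SOCKET * L4_PER_CENTAUR_MB
ddr3_ports = CENTAURS_PER_SOCKET * DDR3_PORTS_PER_CENTAUR
max_capacity_gb = ddr3_ports * STICK_CAPACITY_GB
implied_port_bandwidth_gbs = AGGREGATE_DRAM_BANDWIDTH_GBS / ddr3_ports

print(f"L4 cache per socket: {l4_total_mb} MB")
print(f"DDR3 ports per socket: {ddr3_ports}")
print(f"Maximum memory per socket: {max_capacity_gb / 1024:.0f} TB")
print(f"Implied peak bandwidth per DDR3 port: {implied_port_bandwidth_gbs:.1f} GB/sec")
```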

Those Power8 memory controllers support transactional memory, by the way. This transactional memory made its debut with the zEnterprise EC12 mainframes announced last year. With regular memory, you lock down resources to avoid contention when transactions are pumped through the system. But with transactional memory, you do your work and assume (correctly) that most of the time there is no contention and then if you do find contention, you back out and wait and redo the work. On mainframes, IBM saw as much as a 45 percent performance boost from transactional memory on DB2 database and virtualized server workloads running on the EC12.
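
The do-the-work-then-check idea is easier to see in code. What follows is a software analogy only, sketched in Python under my own assumptions; it is not how the Power8 or EC12 hardware implements transactional memory. The `VersionedCell` class and `run_transaction` function are hypothetical names invented for this example.

```python
import threading


class VersionedCell:
    """A value plus a version counter used to detect conflicting writes."""

    def __init__(self, value):
        self.value = value
        self.version = 0
        # Guards only the brief snapshot and commit steps, not the work itself.
        self._commit_lock = threading.Lock()

    def read(self):
        # Take a consistent snapshot of the value and its version.
        with self._commit_lock:
            return self.value, self.version

    def try_commit(self, expected_version, new_value):
        # Succeed only if nobody else has committed since our snapshot.
        with self._commit_lock:
            if self.version != expected_version:
                return False
            self.value = new_value
            self.version += 1
            return True


def run_transaction(cell, update, max_retries=1000):
    """Optimistically apply update(value); back out and redo on contention."""
    for attempt in range(1, max_retries + 1):
        value, version = cell.read()
        new_value = update(value)            # do the work without holding a lock
        if cell.try_commit(version, new_value):
            return attempt                   # how many tries it took
    raise RuntimeError("too much contention; giving up")


def worker(cell, count):
    for _ in range(count):
        run_transaction(cell, lambda v: v + 1)


balance = VersionedCell(0)
threads = [threading.Thread(target=worker, args=(balance, 1000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(balance.value)
```

Every increment eventually commits, so the four workers together leave the cell at 4,000, and in the common case where there is no contention each update goes through on the first try.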

The chip interconnect on the Power8 die that links the L3 banks to each other has 150 GB/sec of bandwidth per direction per L3 cache segment (there are 12 of them, one for each core) when running at 4 GHz. IBM has created a NUMA-like scheduling system for the L2 and L3 caches that keeps hot data migrating into L2 from L3 cache segments. You can move data into the L3 cache from L4 cache on a single Power8 chip at 128 GB/sec and out to the L4 cache at 64 GB/sec. The pipes between the L3 and L2 caches run at 128 GB/sec both ways. The Power8 can move data out of the cores to the L2 cache at 64 GB/sec, but can move data from the L2 cache to the core at four times the speed, or 256 GB/sec. Add it up across twelve cores running at 4 GHz, and you have 4 TB/sec of L2 cache bandwidth and 3 TB/sec of L3 cache bandwidth.
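
Here is one way to add it up, as a hedged sketch. It assumes that the L2 total counts traffic between each core and its L2 cache in both directions, that the L3 total counts traffic between L2 and L3 in both directions, and that the L3-to-L2 pipes run at 128 GB/sec each way, which is the figure that reproduces the 3 TB/sec total. The names are illustrative.

```python
# Aggregate cache bandwidth across twelve cores at 4 GHz.
# Assumption: the L2 total is core<->L2 traffic and the L3 total is
# L2<->L3 traffic, each summed over both directions and all cores.

CORES = 12
L2_TO_CORE_GBS = 256
CORE_TO_L2_GBS = 64
L3_TO_L2_GBS = 128   # assumed; this value reproduces the 3 TB/sec total
L2_TO_L3_GBS = 128

l2_total_gbs = CORES * (L2_TO_CORE_GBS + CORE_TO_L2_GBS)
l3_total_gbs = CORES * (L3_TO_L2_GBS + L2_TO_L3_GBS)

print(f"L2 cache bandwidth: {l2_total_gbs} GB/sec ({l2_total_gbs / 1024:.2f} TB/sec)")
print(f"L3 cache bandwidth: {l3_total_gbs} GB/sec ({l3_total_gbs / 1024:.2f} TB/sec)")
```

Under those assumptions the L2 figure comes to 3,840 GB/sec, which rounds up to the 4 TB/sec quoted, and the L3 figure comes to 3,072 GB/sec, or exactly 3 TB/sec.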

Next week, I will do a thought experiment about how this chip might be used in future Power Systems iron, since IBM is not yet ready to talk about that.





Copyright © 1996-2013 Guild Companies, Inc. All Rights Reserved.