Crazy Idea # 542: Port IBM i To The Mainframe
September 18, 2017 Timothy Prickett Morgan
In case you didn’t know it, and why would you care, IBM launched the new System z14 mainframe back in July, was talking about the new z14 motors it uses in August, and has just last week launched the LinuxOne “Emperor II” Linux-only mainframe variant of that platform. The machine, as always, has some impressive engineering. And it got me to thinking. Which is always dangerous.
Here is a crazy idea. No, this is really crazy, unlike some of my other inspirations, which of course make perfect sense. Maybe IBM should converge the Power Systems and System z lines, and not the way we usually think about it.
Over the years, we have spent a lot of time talking about how IBM should port OS/400 and then IBM i to one hardware platform or another – some Xeon or Opteron X86 architecture, or maybe an ARM architecture – but we have never considered the obvious thing and the one that Big Blue tried to do itself a few times in its history and failed at. Maybe it should converge its hardware and port IBM i – and maybe AIX, too – to the mainframe. If the mainframe is, as IBM says, the best and most scalable and most secure platform for Linux, then maybe it should be the best and most scalable and most secure platform for IBM i and maybe AIX, too. We are always trying to take the operating systems down to the next cheapest and usually more exciting hardware architecture, but maybe this time, for some good reasons, IBM i should move up the hardware scale rather than down.
Or, because I am greedy when it comes to IBM i, maybe both. But anyway. Here’s my thinking.
The new z14 processors are true beasts, and they are, as I look at the specs, a bit like the overclocked Power9 chips I was thinking about as being very useful for IBM i shops only a few weeks ago. In fact, hearing the z14 briefing at Hot Chips back in August out in Silicon Valley is what got me to thinking about overclocking IBM i machines in the first place. IBM is pushing the thermal limits here and, like other makers of brawny-core motors, is trading away massive core scalability in exchange for cranking the clocks up as high as possible without melting the server. The Power9 chip embodies the other side of that trade, with 24 cores in its “Nimbus” scale out version for machines with one, two, or four sockets and 12 cores in its “Cumulus” scale up version for machines with four, eight, or sixteen sockets.
The z14 chips have a lot of funky new stuff in them, and IBM has not in any way stopped innovating with its mainframe processors. IBM keeps a pretty good pace with the System z processor roadmap, roughly on par with the Power line and not quite as aggressive as we have seen with Intel’s Xeon chips.
The z11 chips came out in 2010, and they had four cores running at a top speed of 5.2 GHz; this chip was implemented in a 45 nanometer process. A single z11 core delivered about 1,200 MIPS of raw computing capacity running full out.
The z12 chip came out in the summer of 2012, and it had six cores clocking in at 5.5 GHz, with each core delivering about 1,600 MIPS of performance; it was made using 32 nanometer processes, and IBM used the process shrink to goose the clock speed by 6 percent and to boost the core count by 50 percent. The z12 chip had a lot of architectural enhancements, including a new out-of-order execution pipeline and much larger on-chip caches to further increase single-threaded performance.
The z13 chip, which came out in January 2015, had a maximum of eight cores and its 4 billion transistors were implemented in a 22 nanometer process. It ran at a top speed of 5 GHz but even with that 10 percent clock speed drop it offered a 10 percent performance bump per core thanks to other tweaks in the core design, including better branch prediction and better pipelining in the core. The z13 chip also has much larger caches, which IBM feels is the best way to secure good performance on a wide variety of workloads that are heavy on I/O and processing. To be precise, the z13 core has 96 KB of L1 instruction cache and 128 KB of L1 data cache. The L2 caches on the most recent generations of mainframe chips are split into data and instruction caches, and in this case have been doubled to 2 MB each. The on-chip L3 cache, which is implemented in embedded DRAM (eDRAM) as on the Power7 and Power8 chips, has been increased by a third to 64 MB shared across the eight cores. And the L4 cache that is parked out on the SMP node controller chip in the System z13 machine has been boosted to 480 MB, a 25 percent increase. The System z13 tops out at 10 TB of main memory, three times that of the predecessor zEnterprise EC12 machine. All of these changes in the cache hierarchy smooth out the SMP scalability of the system, and a top-end System z13 had about 40 percent more aggregate MIPS than the largest System zEnterprise EC12 from two and a half years ago. I estimate that zEnterprise EC12 machine using the z12 processors topped out at 75,000 MIPS of total capacity, and the System z13 peaked at 110,000 MIPS.
The z14 chip, which has 6.1 billion transistors, is etched using the latest 14 nanometer techniques from Globalfoundries, which bought the IBM Microelectronics division three years ago and which is contracted to make Power and z motors until 2020. The z14 chip has ten cores, a 25 percent increase, and some of that die shrink is also allowing IBM to boost the clock speeds by a mere 4 percent to get a little extra oomph and still stay within the 400 watts to 500 watts of peak heat that the z14 central processor emits; it is typically more like 300 watts to 350 watts, according to Christian Jacobi, distinguished engineer and chief architect of System z processor development at IBM, who unveiled the z14 at the Hot Chips conference.
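The generation-to-generation percentages quoted above are easy to check with a few lines of arithmetic. Here is a minimal sketch in Python that uses only the core counts and top clock speeds given in this article; the table layout and function names are illustrative choices of mine, not anything from IBM.

```python
# Core counts and top clock speeds (GHz) as quoted in the article.
# The data layout and helper names are illustrative assumptions.
GENERATIONS = [
    ("z11", 4, 5.2),
    ("z12", 6, 5.5),
    ("z13", 8, 5.0),
    ("z14", 10, 5.2),
]

def pct_change(old, new):
    """Percentage change going from old to new."""
    return (new - old) / old * 100.0

def generation_deltas(gens=GENERATIONS):
    """Return (chip, core change %, clock change %) for each successor chip."""
    rows = []
    for (_, cores0, clock0), (name, cores1, clock1) in zip(gens, gens[1:]):
        rows.append((name, pct_change(cores0, cores1), pct_change(clock0, clock1)))
    return rows

if __name__ == "__main__":
    for name, cores, clock in generation_deltas():
        print(f"{name}: cores {cores:+.1f}%, clock {clock:+.1f}%")
```

The z12 row works out to 50 percent more cores and a clock bump a little under 6 percent, the z13 row to a clock drop of about 9 percent (rounded to 10 percent above), and the z14 row to 25 percent more cores and a 4 percent clock bump.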
With the z14 chip, IBM has tweaked the z core pipeline with an optimized, second generation of two-thread simultaneous multithreading (SMT), and has not pushed this to four or even eight threads per core as has been the norm on the past several generations of Power chips. The floating point units on the die have twice the memory bandwidth, and the branch predictors and instruction buffers both have performance improvements. There are a large number of gorpy improvements in the guts of the chip, and the L1 data and instruction caches on the z14 cores are now 128 KB each, the L2 data cache has been doubled again to 4 MB per core, the L2 instruction cache stays at 2 MB per core, and the shared eDRAM L3 cache on the die that spans all ten cores has been doubled as well to 128 MB. Interestingly, the z14 chip has single precision and quad precision vector math units as well as circuits to support pauseless garbage collection, which helps with middleware such as Java and PHP servers. There is data encryption, hashing, and data compression built into the z14 cores, too.
That’s the compute complex. The System z architecture also includes a separate system control, or SC, chip that includes the NUMA circuits for clustering six CP chips within a single drawer and across up to four drawers, for a maximum of 24 sockets in a single system image with up to 32 TB of main memory to play in. Here is what the SC chip, which is exactly the same size as the CP chip at 696 square millimeters, looks like:
The SC chip for the z14 processor has a total of 672 MB of L4 cache on it, which is carved up into four blocks. (The Power8 and Power9 chips used in the scale-up NUMA boxes put the L4 cache into the “Centaur” memory buffer chips, and the NUMA interconnect is distributed across each CPU die.) The z14 chip has two PCI-Express 3.0 controllers on each CP, for a total of 48 controllers, and up to 40 of these can have x16 adapters in them to fan out to I/O drawers that support up to 320 distinct peripheral cards.
Here is how the System z14 drawer is stacked up:
As you can see, the base z14 system drawer has six of the CPs, which are water cooled using cold plates, and one SC, which is air-cooled. There is room for 8 TB of main memory in each drawer, with eight memory slots per CP for a total of 48 slots per drawer. IBM has RAID data protection on the memory, and Big Blue is using 192 GB memory sticks to get 9 TB of physical memory into the box and then striping and protecting it to get it down to 8 TB of usable space. Customers could combine RAID memory with data compression on the memory and push usable capacity further, but that is very workload dependent. The top end z14 system, at 32 TB of usable capacity, has more than three times the main memory capacity of the biggest z13, which hit 10 TB usable. Interestingly, an LPAR can span as large as 16 TB with the z14.
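The drawer memory math can be sketched out as follows. This is just arithmetic on the figures quoted above; the constant and function names are my own, and the usable fraction is simply the ratio those figures imply, not a published IBM overhead number.

```python
# Per-drawer memory figures as quoted in the article.
# Constant and function names are illustrative, not IBM terminology.
CPS_PER_DRAWER = 6          # CP chips in one drawer
SLOTS_PER_CP = 8            # memory slots behind each CP
DIMM_GB = 192               # capacity of each memory stick
USABLE_TB_PER_DRAWER = 8    # usable capacity after RAID protection
MAX_DRAWERS = 4             # drawers in a top-end system

def drawer_memory():
    """Derive slot count, physical capacity, and usable share for one drawer."""
    slots = CPS_PER_DRAWER * SLOTS_PER_CP
    physical_tb = slots * DIMM_GB / 1024   # 1 TB taken as 1,024 GB
    return {
        "slots": slots,
        "physical_tb": physical_tb,
        # Ratio implied by the quoted figures, not an official overhead number.
        "usable_fraction": USABLE_TB_PER_DRAWER / physical_tb,
        "system_usable_tb": USABLE_TB_PER_DRAWER * MAX_DRAWERS,
    }

if __name__ == "__main__":
    for key, value in drawer_memory().items():
        print(f"{key}: {value}")
```

With these inputs, 48 slots of 192 GB sticks come to 9 TB physical per drawer, about eight ninths of which is usable, and four drawers give the 32 TB usable maximum.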
IBM says that, after all the tweaks, the z14 core delivers about 1,832 MIPS, or about 8.1 percent better than the 1,695 MIPS that a single z13 motor could do. Add it all up, and a fully loaded 24-socket System z14 that has the maximum of 170 cores dedicated to processing (some CP cores are allocated to process I/O) can deliver about 148,000 MIPS of aggregate capacity, according to IBM’s Large Systems Performance Reference.
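To put those numbers side by side, here is a rough sketch that computes the per-core uplift and compares the quoted aggregate rating with a naive product of core count and single-core MIPS. That naive product is my own illustration, not an IBM metric; the gap presumably reflects the fact that big SMP configurations do not scale linearly.

```python
# MIPS figures as quoted in the article; names are illustrative.
Z13_CORE_MIPS = 1_695
Z14_CORE_MIPS = 1_832
Z14_MAX_CORES = 170
Z14_AGGREGATE_MIPS = 148_000

def core_uplift_pct(old=Z13_CORE_MIPS, new=Z14_CORE_MIPS):
    """Per-core MIPS improvement, in percent."""
    return (new - old) / old * 100.0

def naive_linear_mips(cores=Z14_MAX_CORES, per_core=Z14_CORE_MIPS):
    """What the aggregate would be if every core added a full core of MIPS."""
    return cores * per_core

def aggregate_vs_naive(aggregate=Z14_AGGREGATE_MIPS):
    """Quoted aggregate as a fraction of the naive linear product."""
    return aggregate / naive_linear_mips()

if __name__ == "__main__":
    print(f"per-core uplift: {core_uplift_pct():.1f}%")
    print(f"naive linear MIPS: {naive_linear_mips():,}")
    print(f"aggregate / naive: {aggregate_vs_naive():.2f}")
```

The uplift lands at about 8.1 percent, and the 148,000 MIPS aggregate is a bit less than half of the 311,440 MIPS that simple multiplication would suggest.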
Now, I am not suggesting for a second that the typical IBM i shop needs something like a fully loaded z14, although there are probably a few companies running IBM i that are raising an eyebrow at the thought. . . . No, I am thinking of what might happen if you had a single drawer with a single CP and SC chip each, and topped out the memory at 1.3 TB of main memory with that RAID protection.
As I pointed out with my analysis of the z13 chip more than two years ago, IBM’s own performance documents from the Power4 generation say that to calculate the rough equivalent performance on IBM i workloads, take the MIPS and multiply by seven and that will give you an approximate ranking on the Commercial Performance Workload (CPW) test that IBM uses for OS/400 and IBM i database and transaction processing work. That means a top-end System z13-NE1 model with 141 cores active doing CP work was rated at about 735,000 CPWs and a top-end System z14-M05 would come in at just a tad bit over 1 million CPWs with all 170 cores doing work at peak capacity. The low-end z14-M01 has 33 CPs fired up doing compute work and has 8 TB of maximum memory, but we are talking about something with maybe eight cores running CP work and two cores running I/O. That would be something around 12,300 usable MIPS, or about 86,000 CPWs of oomph on a single socket machine.
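The MIPS-to-CPW rule of thumb is simple enough to write down as a one-line function. This is a sketch of the approximation described above, using the multiplier of seven and the z14 MIPS figures from this article; the function name is hypothetical, and the results are only as good as that rough ratio.

```python
# Rule-of-thumb multiplier cited in the article for converting MIPS to CPW.
CPW_PER_MIPS = 7

def mips_to_cpw(mips, factor=CPW_PER_MIPS):
    """Approximate CPW rating from a MIPS figure (rough estimate only)."""
    return mips * factor

if __name__ == "__main__":
    # Top-end z14 at the quoted aggregate capacity.
    print(f"z14-M05: {mips_to_cpw(148_000):,} CPWs")
    # Hypothetical single-socket z14 with eight cores doing CP work.
    print(f"single socket: {mips_to_cpw(12_300):,} CPWs")
```

That gives 1,036,000 CPWs for the top-end machine and 86,100 CPWs for the hypothetical single-socket box.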
Here’s some perspective. A 256-core Power 795 using 4 GHz Power7 chips had about 1.6 million CPWs, and a Power E880C with 192 Power8 cores running at 4.02 GHz delivers 2,069,000 CPWs. Roughly speaking, the Power E880 is delivering 12,000 CPWs per core while the new System z13-NE1 is delivering around 6,000 CPWs per core, at least based on my MIPS estimates and the MIPS-to-CPW ratios. Everything comes down to cases, and the important thing is that both the Power8 and z13 systems offer lots of capacity. (IBM has sophisticated Parallel Sysplex clustering to lash multiple System z machines into a single compute engine, too, and IBM has not really talked about its DB2 for i Multisystem clustering for about 15 years. But as I have said many times, it should.) The other thing to remember is that the performance numbers for the Power 795 and Power E880 have four-way and eight-way SMT turned on, respectively, and this significantly boosts performance on thread-friendly workloads – by about 50 percent moving from two to eight virtual threads, according to the internal IBM data that I have seen.
To compare that to a real entry machine, a Power S822 with four cores running at 4.15 GHz delivers 52,700 CPWs, and a Power S814 with four cores running at 3.02 GHz delivers 39,500 CPWs.
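For a like-for-like look at the small end, the sketch below divides each CPW rating by its core count. The Power figures are the ones quoted above; the single-socket z14 entry is the hypothetical eight-core configuration estimated earlier at about 86,000 CPWs, so treat that row as an assumption and not a measured result.

```python
# (CPW rating, cores) for each machine. The z14 row is a hypothetical
# configuration estimated in the article, not a shipping product rating.
MACHINES = {
    "Power S822 (4 cores, 4.15 GHz)": (52_700, 4),
    "Power S814 (4 cores, 3.02 GHz)": (39_500, 4),
    "Hypothetical single-socket z14 (8 CP cores)": (86_000, 8),
}

def cpw_per_core(machines=MACHINES):
    """Return a mapping of machine name to CPWs per core."""
    return {name: cpw / cores for name, (cpw, cores) in machines.items()}

if __name__ == "__main__":
    for name, value in cpw_per_core().items():
        print(f"{name}: {value:,.0f} CPWs per core")
```

On these inputs the S822 comes in at 13,175 CPWs per core, the S814 at 9,875, and the hypothetical z14 socket at 10,750, which puts it between the two Power8 machines per core but well ahead of either on total capacity.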
Here is the point. A 5.2 GHz z14 engine with all kinds of goodies for data compression and encryption built in might make for a better IBM i platform, with crazy single-thread and batch performance, than even a Power9. A lot depends on how tightly IBM has coupled IBM i, and AIX, to the iron. We know that Linux runs well on both Power and z architectures, so IBM has the skills to help make IBM i, and maybe AIX, run well on System z iron. A long time ago, before IBM had Power RISC processors, AIX did run on the mainframe, and it even had RPG compilers way back when. Maybe IBM needs a unified system platform again. And maybe, for the sake of fun contemplation and possibly implementation, at least in certain cases, the z motors might be a better host for real IBM i workloads than the Power9 motors that are aimed at hyperscale datacenters and HPC customers, and than low-end X86 iron for that matter.
It’s a thought, and IBM founder TJ Watson did admonish us all to THINK.