The Other IBM Big Iron That Is On The Horizon
August 23, 2021 Timothy Prickett Morgan
The Hot Chips conference is underway this week. Historically it has been held at Stanford University, but this year, as was the case last year, it is being run virtually thanks to the coronavirus pandemic. There are a lot of new chips being discussed in detail, and one of them is not the forthcoming Power10 chip from IBM, which is expected to make its debut sometime in September and which was one of the hot items at last year’s Hot Chips event.
The one processor that IBM is talking about, however, is the “Telum” z16 processor for System z mainframes, and unlike in times past, IBM is revealing the latest in its epically long line of mainframe central processing units (1964 through 2022, and counting) before it is launched in systems rather than after. We happen to think IBM had hoped to be able to ship the Telum processors and their System z16 machines before the end of 2021, and that the transition from 14 nanometer processes at former foundry partner GlobalFoundries to 7 nanometer processes at current foundry partner Samsung has delayed the z16 introduction from its usual cadence. As it stands, the z16 chip will come out in early 2022, after the Power10 chips with fat cores (meaning eight threads per core and only 15 cores per chip) come to market. The skinny Power10 cores (four threads per core but 30 cores on a die) used in so-called “scale out” systems are not expected until the second quarter of 2022. It is rough to change foundries and processes and microarchitectures all at the same time, so a delay from the original plan for both z16 and Power10 is to be expected.
It will be up to a judge to accept or reject IBM’s lawsuit against GlobalFoundries, which we talked about back in June, and it will be up to a jury to decide economic damages should Big Blue prevail and win its case in the courts. Or, Intel could buy GlobalFoundries, settle the case, and have IBM as its foundry partner. There are a lot of possible scenarios here. The good news is that IBM and Samsung have been able to get the z16 and Power10 processors designed and are ramping production on the Samsung 7 nanometer process, trying to drive up yields. If IBM could not deliver these chips in the near term, it would not be saying anything at this point. Like when the process shrinks with the Power6+ or the Power7+ were not panning out, for instance.
The Telum z16 processor is interesting from a technology standpoint because it shows what IBM can do and what it might do with future Power processors, and it is important from an economic standpoint because the System z mainframe still accounts for a large percentage of IBM’s revenues and an even larger share of its profits. (It is hard to say with any precision.) As the saying goes around here, anything that makes IBM Systems stronger helps IBM i last longer.
Besides, it is just plain fun to look at enterprise server chips. So, without further ado, take a gander at the Telum z16 processor:
According to Ross Mauri, the general manager of the System z product line, “Telum” refers to one of the weapons sported by Artemis, the Greek goddess of the hunt, known for bow hunting but also for her skill with the javelin. This particular javelin has to hit its enterprise target and help Big Blue maintain its mainframe customer base and make them enthusiastic about investing in new servers. The secret sauce in the Telum chip, as it turns out, will be an integrated AI accelerator chip that was developed by IBM Research and that has been modified and incorporated into the design, thus allowing for machine learning inference algorithms to be run natively and in memory alongside production data and woven into mainframe applications.
This is important, and bodes well for the Power10 chip, which is also getting its own native machine learning inference acceleration, albeit of a different variety. The z16 chip has an on-die mixed-precision accelerator for floating point and integer data, while the Power10 chip has a matrix math overlay for its vector math units. The net result is the same, however: Machine learning inference can stay within the compute and memory footprint of the server, and that means it will not be offloaded to external systems or external GPUs or other kinds of ASICs and will therefore be inside the bulwark of legendary mainframe security. There will be no compliance or regulatory issues because the customer data feeding the machine learning inference and the response or recommendation from that inference will all be in the same memory space. For this reason, we have expected a lot of machine learning inference to stay on the CPUs of enterprise servers, while machine learning training will continue to be offloaded to GPUs and sometimes other kinds of ASICs or accelerators. (FPGAs are a good alternative for inference.)
The Telum chip measures 530 square millimeters in area and weighs in at about 22.5 billion transistors. By Power standards, the z16 cores are big fat ones, with lots of registers, branch target table entries, and such, which is why IBM can only get eight fat cores on that die. The Power10 chip, which we have nicknamed “Cirrus” because IBM had a lame name for it, uses the same 7 nanometer transistors and gets sixteen fat cores (or 32 skinny cores) on a die that weighs in at 602 square millimeters but has only 18 billion transistors. The Telum chip will have a base clock speed of more than 5 GHz, which is normal for recent vintages of mainframe CPUs.
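For the curious, those die size and transistor figures imply quite different densities on the same 7 nanometer process. Here is the quick arithmetic; our read, not IBM’s, is that the gap is largely a function of how much of the Telum die is dense SRAM cache:

```python
# Back-of-envelope transistor density comparison using only the figures
# quoted above: 530 mm^2 and 22.5 billion transistors for Telum, versus
# 602 mm^2 and 18 billion transistors for Power10.

telum_area_mm2, telum_transistors = 530, 22.5e9
power10_area_mm2, power10_transistors = 602, 18.0e9

telum_density = telum_transistors / telum_area_mm2 / 1e6      # millions per mm^2
power10_density = power10_transistors / power10_area_mm2 / 1e6

print(round(telum_density, 1))    # ~42.5 million transistors per mm^2
print(round(power10_density, 1))  # ~29.9 million transistors per mm^2
```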
A whole bunch of things have changed with the Telum design compared to the z14 and z15 designs. IBM has used special versions of the chip, called System Assist Processors, or SAPs, to act as external I/O processors, offloading work from the Central Processors, or CPs, which actually do the compute. With the Telum design, IBM is doing away with this and tightly coupling the chips together with on-die interconnects, much as it has done with Power processors for many generations. Mainframe processors in recent years also had lots of dedicated L3 cache and an external L4 cache that also housed the system interconnect bus (called the X-Bus). The z15 chip, implemented in GlobalFoundries 14 nanometer processes, had a dozen cores and 256 MB of L3 cache, plus 4 MB of L2 data cache and 4 MB of L2 instruction cache allocated to each core. Each core had 128 KB of L1 instruction cache and 128 KB of L1 data cache. It ran at 5.2 GHz, and supported up to 40 TB of RAID-protected DDR4 main memory across a complex of 190 active compute processors.
With the z16 design, the cache hierarchy is being flattened and brought down to the cores. Each core has a 32 MB L2 cache, which is made possible in part because the branch predictors on the front end of the chip have been redesigned. The core has four pipelines and supports SMT2 multithreading, but there is no physical L3 cache or physical L4 cache any more. Rather, according to Christian Jacobi, distinguished engineer and chief architect of the z16 processor, the chip implements a virtual 256 MB L3 cache across those physical L2 caches and a virtual 2 GB L4 cache across the eight chips in a system drawer. How this cache is all carved up is interesting, and it speaks to the idea that caches often are inclusive anyway (meaning everything in L1 is also in L2, everything in L2 is also in L3, and everything in L3 is also in L4), which is a kind of dark silicon. Why not determine the hierarchy on the fly based on actual workload needs?
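The capacity arithmetic behind those virtual caches works out neatly from the figures Jacobi gave, as this quick sketch shows:

```python
# Capacity arithmetic behind the virtual caches described above: eight
# 32 MB per-core L2 caches on a chip pool into the virtual 256 MB L3,
# and eight chips in a drawer pool into the virtual 2 GB L4.

cores_per_chip = 8
l2_per_core_mb = 32
chips_per_drawer = 8

virtual_l3_mb = cores_per_chip * l2_per_core_mb
virtual_l4_gb = virtual_l3_mb * chips_per_drawer / 1024

print(virtual_l3_mb)  # 256 MB virtual L3 per chip
print(virtual_l4_gb)  # 2.0 GB virtual L4 per drawer
```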
To make this virtual L3 cache, there are a pair of 320 GB/sec rings. Two chips are linked together in a single package using a synchronous transfer interface, shown at the bottom two thirds of the Telum chip, and four sockets of these dual-chip modules (DCMs) are interlinked in a flat, all-to-all topology through on-drawer interfaces and fabric controllers, which run across the top of the Telum chip. At the bottom left is the AI Accelerator, which has more than 6 teraflops of mixed precision integer and floating point processing power that is accessible through the z16 instruction set, rather than through the weird offload model used by CPUs that push machine learning inference out to GPUs, FPGAs, or custom ASICs. This accelerator, says Jacobi, takes up a little less real estate on the chip than a core does. And clearly, if IBM wanted to raise the ratio, it could add more accelerators. This ratio is interesting in that it shows how much AI inference IBM expects – and its customers expect – to be woven into their applications.
That is the key insight here.
This on-chip AI Accelerator has 128 compute tiles that can do 8-way half precision (FP16) floating point SIMD operations, which are optimized for the matrix multiplications and convolutions used in neural network inference. The AI Accelerator also has 32 compute tiles that implement 8-way FP16/FP32 units that are optimized for activation functions and more complex operations. The accelerator also has what IBM calls an intelligent prefetcher and write-back block, which can move data into an internal scratchpad at more than 120 GB/sec and store data out to the processor caches at more than 80 GB/sec. The two collections of AI math units have what IBM calls an intelligent data mover and formatter, which prepares incoming data for compute and then writes it back after it has been chewed on by the math units, and this has an aggregate of 600 GB/sec of bandwidth.
That’s an impressive set of numbers for a small block on the chip, and a 32-chip complex (four drawers, each with four sockets of DCMs) can deliver over 200 teraflops of machine learning inference performance. (There doesn’t seem to be INT8 or INT4 integer support on this device, but don’t be surprised if IBM turns it on eventually, thereby doubling and quadrupling the inference performance for some use cases that have relatively coarse data.)
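The aggregate math is straightforward, and it implies that the per-chip rate only has to run slightly above the quoted 6 teraflops floor for the full complex to clear 200 teraflops:

```python
# Aggregate inference throughput for a full System z16 complex, using
# the topology described above: two chips per DCM, four DCM sockets per
# drawer, four drawers.

chips = 2 * 4 * 4                # 32 chips in the full complex
per_chip_tflops_floor = 6.0      # "more than 6 teraflops" per chip, as quoted

floor_tflops = chips * per_chip_tflops_floor
per_chip_needed = 200 / chips    # per-chip rate needed to clear 200 teraflops

print(chips)            # 32
print(floor_tflops)     # 192.0 teraflops at the quoted floor
print(per_chip_needed)  # 6.25 teraflops per chip clears 200
```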
Jacobi says that a z16 socket with an aggregate of 16 cores will deliver 40 percent more performance than a z15 socket, which had 12 cores. If you do the math, the extra cores alone account for a 33 percent increase; the rest comes from microarchitecture tweaks and the process shrink. We don’t expect the clock speed to be more than a few hundred megahertz higher than the frequencies used in the z15 chip, in fact. There may be some refinements in the Samsung 7 nanometer process further down the road that allow IBM to crank it up and boost performance with some kickers. The same thing could happen with Power10 chips, by the way.
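Doing that math explicitly: dividing the 40 percent total gain by the 33 percent core count gain leaves about 5 percent per core from the microarchitecture and process combined.

```python
# Decomposing the quoted 40 percent per-socket gain from z15 (12 cores)
# to z16 (16 cores): the added cores alone are a 33 percent uplift, and
# the residual is per-core gain from microarchitecture and process.

z15_cores, z16_cores = 12, 16
total_gain = 1.40                        # 40 percent per socket, as quoted

core_count_gain = z16_cores / z15_cores  # 1.333...
per_core_gain = total_gain / core_count_gain

print(round((core_count_gain - 1) * 100, 1))  # 33.3 percent from core count
print(round((per_core_gain - 1) * 100, 1))    # 5.0 percent per core
```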
One final thought, and it is where the rubber hits the road with this AI Accelerator. A customer in the financial services industry worked with IBM to adapt its recurrent neural network (RNN) to the AI Accelerator, allowing it to do inference on the machine for a credit card fraud model. This workload was simulated on a z16 system simulator, so take it with a grain of salt. But it illustrates the principle:
With only one chip, the simulated System z16 machine could handle 116,000 inferences per second with an average latency of 1.1 milliseconds, which is acceptable throughput and latency for a financial transaction not to be stalled by the fraud detection and for it to be done in real time rather than after the fact. With 32 chips in a full System z16 machine, the AI Accelerator could scale linearly, yielding 3.5 million inferences per second with an average latency of 1.2 milliseconds. That’s a scalability factor of 94.3 percent of perfect linear scaling, and we think this has as much to do with the flat, fast topology in the new z16 interconnect and with the flatter cache hierarchy as it has to do with the robustness of the AI Accelerator.
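That 94.3 percent figure falls right out of the quoted throughput numbers:

```python
# Checking the scalability figure quoted above: 116,000 inferences per
# second on one chip versus 3.5 million inferences per second on the
# full 32-chip System z16 complex.

one_chip_ips = 116_000
full_system_ips = 3_500_000
chips = 32

perfect_linear = one_chip_ips * chips          # 3,712,000 inferences/sec
efficiency = full_system_ips / perfect_linear * 100

print(perfect_linear)        # what perfect linear scaling would deliver
print(round(efficiency, 1))  # 94.3 percent of perfect linear scaling
```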
Soon, we will see how the Power10’s native matrix math units and caches and interconnects compare.