Setting The Stage For The Next Decade Of Processing
August 12, 2019 Timothy Prickett Morgan
It is no secret that Moore’s Law is causing all kinds of grief for chip designers working in all parts of the IT stack. It was bad enough to run out of clock scaling when Dennard scaling broke down, and the industry has done a great job in making processors more parallel and allowing them to offload processing to various kinds of accelerators, either on the die, in the package, or in the chassis over high speed interconnects. But even this is running out of gas as processors keep pushing up against the reticle limits of lithography machines because the transistors can’t get small enough fast enough.
Despite the massive engineering challenges and the massive economic commitment that is required to design a chip and build a fab to etch it, this remains a time of excitement and optimism when it comes to advancing the state of the art in data processing. We have always conquered our barriers before – moving from bipolar manufacturing techniques to CMOS was no joke, and was very difficult for system makers, for example – and so we have this sense that we will always be able to do it. It’s hard to argue against history, and it is helpful to remember that the Universe is not actually a casino and we are not betting against the house when we are being optimistic. We are part of the house, and it wins when we win. (That’s not precisely a theology, but it might as well be.)
As I have been saying for quite some time, this is an important year coming up for processing. The most important thing about it, interestingly, is that the world’s largest processor maker, Intel, has had very big problems getting its 10 nanometer chip manufacturing technology to market, and has apparently had to scrap it and go back to the physics drawing board to get something into the field late this year for Core PC chips and next year for Xeon server chips. Intel thinks of itself as a chip manufacturer first and a chip designer second, so this is particularly embarrassing. But this is incredibly tough to do, and everyone has something go wrong sometime.
But the reality is that there are only three companies left that have advanced fabs – GlobalFoundries backed away from its 7 nanometer and 5 nanometer efforts a year ago, and IBM gave up and sold its Microelectronics division to GlobalFoundries years before that. This leaves Samsung Electronics, which makes DRAM and flash memory as well as Arm processors for its smartphones and tablets and which is IBM’s fab partner for the 7 nanometer technologies that will be used to make the future Power10 chip, plus Intel and Taiwan Semiconductor Manufacturing Corp. Intel’s 10 nanometer process is roughly analogous to the 7 nanometer processes from Samsung and TSMC. The latter company has 7 nanometer chips coming into the field, starting with the new “Rome” Epyc chip from AMD, which was launched last week. Samsung is working with IBM to perfect its 7 nanometer process and the two plan to work together to roll out the Power10 processor sometime next year. This is a good partnership between IBM and Samsung because it gives Samsung a relatively low volume part to perfect its 7 nanometer process and get paid to do so, thus lowering its development costs before it moves to the process with its higher volume Arm CPUs and DRAM, where it cannot afford to have low yields.
The Rome Epyc processors, which I have covered here in an initial pass at The Next Platform, are interesting in that they are actually not one processor, but a multichip module with nine chips on the package. There are eight eight-core compute complex dies (or CCDs, as AMD calls them), which are etched in that 7 nanometer process. These link directly through signaling ports on the chiplets to a central I/O and memory controller hub, which is basically taken straight from the prior generation “Naples” Epyc processors, with everything centralized and reblocked. This I/O and memory hub chip is even implemented in a slightly improved 14 nanometer process from GlobalFoundries, which made the Naples server chips for AMD using that very well-known and well-yielding process. This not only mitigated some risk for AMD, but also allows the I/O and memory controllers, which have to push signals off the package and out to I/O slots and memory sticks, to run at a higher voltage and power and thus to run more efficiently than they would if they had been shrunk to 7 nanometers on a monolithic die. While we have had multichip modules for a long time in the industry – IBM did it way back with Power5+, you will recall, and Intel and AMD have done this before, too – this is the first time we have a mixed-process MCM. This is the wave of the future.
The funny bit is that the Rome Epyc processor looks like an eight-socket NUMA server that someone put through the wash and shrunk in the dryer. It really is a system, with a central chipset for linking multiple processing elements to I/O and memory. (And the difference is getting blurred there, too, but that is a different story for another day.)
AMD is under severe pressure to cut the price of processing to get some market share away from Intel, which utterly dominates server computing. By moving to MCMs instead of monolithic designs, AMD can create what looks like a big, fat processor from multiple smaller dies, which are easier to make, yield better, and cost a lot less per unit of raw performance. It probably costs half as much to make the collection of nine chips in the Rome MCM as it would to make a monolithic Rome die; the extra packaging costs probably add back another 10 percent or maybe 20 percent. So it is not so much that AMD is getting Rome into the field six to nine months ahead of Intel’s future “Ice Lake” Xeon SP processors, but rather that a single-socket 64-core Rome can beat the performance of any two-socket first generation “Skylake” Xeon SP system and just about every second generation “Cascade Lake” Xeon SP system.
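The yield argument behind that cost claim can be sketched with a simple Poisson defect model, in which the probability of a die being good falls off exponentially with its area. The defect density and die areas below are illustrative assumptions, not AMD’s numbers; the point is only that eight small dies waste far less silicon per good part than one big one:

```python
import math

def yield_rate(area_cm2, d0=0.1):
    # Poisson defect model: fraction of good dies, with d0 defects per cm^2
    # (d0 = 0.1 is an assumed, illustrative defect density)
    return math.exp(-d0 * area_cm2)

CCD_AREA = 0.74          # cm^2 per compute chiplet -- an assumed figure
N_CCDS = 8               # Rome puts eight CCDs on the package
mono_area = N_CCDS * CCD_AREA  # a hypothetical monolithic die of the same compute

# Silicon consumed per good part scales as area divided by yield
chiplet_cost = N_CCDS * CCD_AREA / yield_rate(CCD_AREA)
mono_cost = mono_area / yield_rate(mono_area)

print(f"chiplet/monolithic silicon cost ratio: {chiplet_cost / mono_cost:.2f}")
```

With these assumed inputs the chiplet approach consumes roughly 60 percent of the silicon per good part, before the I/O die (on a cheap, mature 14 nanometer process) and the extra packaging are added back in – broadly consistent with the half-the-cost-plus-packaging estimate above.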
Let that sink in for a second.
AMD’s prices for Rome are also considerably lower than those Intel charges for Cascade Lake, which is another blow that AMD has landed on Intel as it has shifted to chiplet architectures before Intel is ready to. AMD can charge less because its chips cost less. This is an important idea, and one that could help the IBM i base as it looks ahead to Power10 and Power11 chips.
As far as we know, IBM is not implementing a chiplet architecture for the Power10 processors, but there is no reason that it could not do so, or decide to do it at a later date. Some Power8 chips were MCMs with two six-core chips in the package to make a dozen cores (these were aimed mostly at volume Linux workloads in two-socket servers), while others had a single chip with a dozen cores (and lots of NUMA links to scale up the memory and compute footprint of a single machine). Even if IBM does not create a Power10 chiplet for the initial launch in 2020, it could do so in a follow-up. I think it would be far more useful to have a special four-core Power10 chip that ran at 6 GHz with overclocking, lots and lots of L2, L3, and L4 cache, and water cooling – making batch jobs and database transactions run like a bat out of hell – than to create a 48-core processor with SMT4 threading and a 24-core processor with SMT8 threading that only a relative handful of customers in the IBM i base would care about. We need stronger threads, not more of them, in the IBM i base.
AMD has learned this lesson with the Rome Epycs. It creates Rome packages with two, four, six, or eight compute dies on them, with the L3 cache scaling up with the dies to keep them balanced; it can also use dies where not all of the cores can be activated, and thus really drop the price because it is recycling a chip that might otherwise be thrown out. So, for instance, the Rome Epyc 7252 has only two of the compute dies on the package and only four cores on each die working, but they run at a pretty respectable 3.1 GHz, have 64 MB of L3 cache (twice as much cache per core as the top-end 64-core Epyc 7742 processor), and cost only $475. The big, bad Epyc 7742 has an impressive 64 cores and 128 threads and 256 MB of L3 cache, but at $6,950 its bang for the buck is less than half that delivered by the Epyc 7252. It would be cool to overclock the 7252 to 4 GHz, boosting its performance by around 30 percent; even with a price hike to $600, the bang for the buck would stay constant.
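The bang-for-the-buck comparison above can be checked with some back-of-the-envelope math, using cores times clock speed as a crude throughput proxy. The prices and the 7252’s 3.1 GHz clock come from the text; the 7742’s 2.25 GHz base clock is an assumption supplied here for the arithmetic:

```python
def ghz_cores_per_dollar(cores, ghz, price):
    # Crude throughput-per-dollar proxy: cores x clock, divided by list price
    return cores * ghz / price

epyc_7252 = ghz_cores_per_dollar(8, 3.1, 475)     # two CCDs, four cores each
epyc_7742 = ghz_cores_per_dollar(64, 2.25, 6950)  # 2.25 GHz base clock assumed
oc_7252 = ghz_cores_per_dollar(8, 4.0, 600)       # hypothetical overclocked part

print(f"7252: {epyc_7252:.4f}  7742: {epyc_7742:.4f}  OC 7252: {oc_7252:.4f}")
```

By this proxy the 7252 delivers more than twice the oomph per dollar of the 7742, and the hypothetical 4 GHz part at $600 lands within a few percent of the stock 7252 – which is why the bang for the buck stays essentially constant.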
Our point is, chiplets are the wave of the future and they would let IBM do a better job of addressing the very diverse needs of its IBM i, AIX, and Linux customers.
But shifting to a chiplet architecture may also allow Big Blue to do something else: create hybrid architectures that are based entirely on Power technology rather than on a mix of CPUs and accelerators such as Nvidia Tesla GPU coprocessors or Xilinx Alveo FPGA coprocessors. We have nothing against these devices, and they have done wonderful things for HPC and AI workloads the world over. But imagine, instead, that a very fast Power10 motor with a few overclocked cores was the core batch processing and operating system engine, and that selected routines could be offloaded to a more parallel compute complex made of many, many Power10 chiplets that looked like a parallel accelerator. Unlike a GPU or an FPGA, which require special programming, such a complex would need only special tuning for software that is already written for the Power platform. IBM could easily create a compute complex with eight six-core chiplets in a 7 nanometer process – it has already done two six-core chiplets in 22 nanometers for Power8, and it could have, in theory, done four six-core chiplets for Power9 but did not as far as we know. That 48-core complex would make a very sweet accelerator for many integer and floating point operations, which could be offloaded from the very fast serial processors in the Power10 core compute complex. These chips would plug into the same system, have the same access to main memory, and be coherent across their caches, although they might have their workloads isolated by hypervisors because they have very different balances of threads and clock speeds. Such a device might make a great database accelerator for complex joins and queries, for instance, and it would not require any major changes in system or application code.
It’s something to think about. And now, before the future is already here.