The NUMA NUMA [Song] Tax
October 1, 2018 Timothy Prickett Morgan
When you have a wafer of chips, at least in theory all of the transistors cost the same on the wafer. But sometimes, when transistors perform certain functions, they are worth more. And in some cases, such as the electronics that couple multiple cores on a die across a shared L3 cache, or that gang up processors across multiple sockets to give applications a larger, shared main memory to run in, those circuits are worth a lot more.
This lashing together of compute components across shared memories – Non-Uniform Cache Access, or NUCA, for the on-chip coupling and Non-Uniform Memory Access, or NUMA, for the multi-socket coupling – makes up a significant portion of the die area and transistor budget, and therefore should represent a significant portion of the overall cost, and then the price, of a given processor. NUMA predates NUCA, and in a sense, NUCA is just an on-chip version of NUMA. A modern processor is basically a NUMA server implemented on a single socket, like the 12-socket RS/6000 S70, AS/400 740-2070, and AS/400 S40-2208 systems from the late 1990s based on the “Northstar” PowerPC chip – remember those? The big difference is that those machines had 100X less memory and 15X lower clock speeds.
We get a lot more for our Power Systems money these days, to be sure. But relative to the market at large, Power Systems running IBM i and AIX are still paying a premium compared to Windows Server and Linux systems with equivalent performance and functionality.
Back then, because IBM published list prices as had been its habit – compelled by law after settlements of several antitrust lawsuits decades earlier – we could actually calculate the incremental cost of the NUMA expansion in the machines. Now as then, customers didn’t just pay extra for the NUMA chipsets that glued multiple processor sockets into a single system image; they also paid ever-higher system software license costs. Moreover, performance does not scale perfectly as NUMA levels increase from 2 to 4 to 8 to 12 sockets; the incremental performance improvement gets smaller as the NUMA cluster gets larger, and this is something that affected, and continues to affect, all NUMA machines, not just IBM Power Systems. I will say that the NUMA interconnects have gotten better, faster, and more efficient over time, and more of the aggregate CPU clocks can actually get work done if the compute complexes are not memory or I/O constrained.
I will give you a for instance from back then to prove a point, and to show that IBM used to be generous. Back in the first quarter of 1999, a single socket Northstar system, the AS/400 730-2065, had a single 262 MHz – yes that is megahertz – processor and it cost $390,000; it was in a P30 software tier and rated at 560 CPWs of performance. That worked out to $697 per CPW. The AS/400 730-2066 had two of the same processors and was rated at 1,050 CPWs but cost $650,000, or $619 per CPW. Moving to four of the Northstar processors in the AS/400 730-2067 meant having an aggregate of 2,000 CPWs, and the cost dropped to $515 per CPW with a price tag of $1.03 million; the software group jumped to P40. These were for full-on machines, with both 5250 and client/server workloads able to run full out. (You could get machines with varying levels of crippling on either style of workload, and that changed the economics a lot.) The AS/400 740-2069 put eight of these processors in a single system image, which had 3,660 aggregate CPWs at a cost of $1.44 million, or $393 per CPW, and the software tier rose to P50. And finally the big bad AS/400 740-2070 had a dozen of these chips and was rated at 4,550 CPWs, and at a cost of $1.68 million, that worked out to $369 per CPW for the P50 tier system.
Obviously, compute was much more constrained, from the point of view of chip design, and IBM wanted to encourage companies to invest in ever larger systems, and so it made the CPWs cheaper as companies wanted more of them. The NUMA tax, therefore, was all in the software tier elevation costs and in the gradually decreasing incremental performance as the NUMA cluster grew.
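For anyone who wants to check the arithmetic, here is a minimal Python sketch that recomputes the cost per CPW and the per-processor yield from the 1999 figures quoted above. The processor count of eight for the 740-2069 is an assumption based on the 2, 4, 8, 12 progression, and the "scaling efficiency" column (CPWs per processor relative to the single-processor machine) is an illustrative metric of our own, not an IBM figure.

```python
# 1999 AS/400 Northstar configurations as quoted in the article:
# (model, processors, list price in USD, CPW rating).
# Eight processors for the 740-2069 is an assumption (2, 4, 8, 12 progression).
SYSTEMS = [
    ("730-2065", 1, 390_000, 560),
    ("730-2066", 2, 650_000, 1_050),
    ("730-2067", 4, 1_030_000, 2_000),
    ("740-2069", 8, 1_440_000, 3_660),
    ("740-2070", 12, 1_680_000, 4_550),
]


def summarize(systems):
    """Return cost per CPW and per-processor scaling for each machine."""
    # CPWs delivered by one processor in the smallest machine (the baseline).
    base_cpw_per_proc = systems[0][3] / systems[0][1]
    rows = []
    for model, procs, price, cpw in systems:
        cpw_per_proc = cpw / procs
        rows.append({
            "model": model,
            "procs": procs,
            "usd_per_cpw": price / cpw,
            "cpw_per_proc": cpw_per_proc,
            # Illustrative metric: 1.0 would mean perfect linear scaling.
            "scaling_efficiency": cpw_per_proc / base_cpw_per_proc,
        })
    return rows


rows = summarize(SYSTEMS)
for r in rows:
    print(f'{r["model"]}: {r["procs"]:>2} procs, '
          f'${r["usd_per_cpw"]:.0f}/CPW, '
          f'{r["cpw_per_proc"]:.0f} CPW/proc, '
          f'efficiency {r["scaling_efficiency"]:.2f}')
```

The quoted prices are rounded, so a computed value can differ from the quoted one by a dollar. The efficiency column makes the point about diminishing returns concrete: each added tier of processors delivers fewer CPWs per processor than the one before it, even as the price per CPW falls.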
Fast forward to today. A Power S914 with a single Power9 chip with four cores – a mere four cores – can crank through 52,500 CPWs, or about 13,125 per core. A single Power9 core has nearly three times the performance of a 12-way NUMA AS/400 740 series server. That is just crazy, but then again, it isn’t. It is just Moore’s Law, going wide on the cores and threads after it is no longer possible to go deep on the clock speeds. (At 2.3 GHz, the base clock speed on that Power9 core is about 8.8X higher than that of the Northstar from days gone by. But it is not 10 GHz or 20 GHz as we thought was possible many years ago.) A single socket “Cumulus” Power9 machine with all 12 cores activated and running at 2.9 GHz would deliver about 70,000 CPWs, we estimate, and 12 of these sockets running at 3.55 GHz would deliver 2.05 million CPWs. We estimate that this machine, configured up, would cost a few million bucks. IBM has not provided pricing of any sort for this machine, but clearly we are down in the range of a buck or two per CPW. Call it a factor of 200X improvement (meaning lowering) in the cost of raw compute.
But that is not the question. What I want to know, and what I cannot yet figure out, is whether the incremental cost of scaling machines across sockets is now higher as the machines scale all the way from the Power S914 through to the Power E980.
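The then-versus-now comparison can be reproduced with a few lines of Python. This is a sketch using only the figures quoted in this article; the "per-clock gain" at the end is our own derived ratio, not an IBM metric, and the cost comparison simply brackets the "buck or two per CPW" estimate.

```python
# Figures quoted in the article.
NORTHSTAR_MHZ = 262          # AS/400 730-2065 processor clock
NORTHSTAR_CPW = 560          # single-processor 730-2065 rating
AS400_12WAY_CPW = 4_550      # AS/400 740-2070 rating
AS400_12WAY_USD_PER_CPW = 369

S914_MHZ = 2_300             # Power S914 base clock
S914_CPW = 52_500
S914_CORES = 4

# CPWs per Power9 core in the four-core Power S914.
cpw_per_core = S914_CPW / S914_CORES

# One Power9 core versus the entire 12-way AS/400 740-2070.
core_vs_12way = cpw_per_core / AS400_12WAY_CPW

# Clock speed ratio, Power9 versus Northstar.
clock_ratio = S914_MHZ / NORTHSTAR_MHZ

# Derived (our own ratio): per-core throughput gain not explained by clock speed.
per_clock_gain = (cpw_per_core / NORTHSTAR_CPW) / clock_ratio

# Cost improvement if a big Power9 machine lands at $1 or $2 per CPW (an estimate).
cost_improvement = {usd: AS400_12WAY_USD_PER_CPW / usd for usd in (1, 2)}

print(f"CPW per Power9 core: {cpw_per_core:,.0f}")
print(f"One core vs 12-way 740-2070: {core_vs_12way:.2f}X")
print(f"Clock ratio: {clock_ratio:.1f}X")
print(f"Per-clock gain per core: {per_clock_gain:.2f}X")
for usd, factor in cost_improvement.items():
    print(f"At ${usd}/CPW the cost improvement is about {factor:.0f}X")
```

The per-clock figure is a rough way of seeing how much of the per-core gain comes from architecture and threading rather than raw frequency.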
Back in April, I did a price/performance analysis of the “ZZ” Power9 entry systems that can run IBM i. It showed that, depending on the configuration, the cost per CPW for the hardware of a base system and its base IBM i license with a reasonable number of users rose a little as the system got more powerful thanks to core scaling within a socket, but sometimes got a lot more expensive as the number of sockets grew. In general, it was the software cost that made this true, and that is because software costs scale up with GDP and inflation (software being created and supported by people) while processor, memory, and storage costs scale with Moore’s Law (getting less expensive over time, cut in half every 36 months in the Power Systems line). Here is a reminder of what data we do have:
The only thing we have to compare is the Power S914, the Power S922, and the Power S924, all three of which can run IBM i on all of their cores. These were prices with all cores activated to run IBM i and for base hardware with a reasonable amount of memory and disk. As you can see, the cost per CPW for the hardware rises from the Power S914 to the Power S924 a bit, and drops with the Power S922, which is a denser machine with less expandability and therefore has a bit of a price cut on hardware. But the IBM i software costs triple or quadruple, depending on the points you compare, between the one-socket Power S914 and the two-socket Power S924, and this is a kind of NUMA tax. There is no IBM i allowed on the four-socket Power E950, but we suspect that the hardware costs rise here, and so would the software tier costs if IBM i were allowed. We have no pricing on the E980, but we strongly suspect that IBM charges more per unit of capacity for these machines, which scale up to 16 sockets and 64 TB of main memory, and not just for the systems software but also for the hardware. We can’t prove it; it is a hunch. We will try to see if this is the case, because it would represent an important reversal in IBM’s pricing strategy.
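To see why the software share of the bill keeps growing, here is a small Python sketch of the two cost curves described above. The 36-month halving period for hardware comes from the text; the 3 percent annual growth rate for software is a purely hypothetical placeholder for "GDP and inflation," not a figure from IBM or from this analysis.

```python
def hardware_cost_factor(years, halving_months=36):
    """Fraction of the starting hardware cost left after `years`,
    assuming cost per unit of capacity halves every `halving_months`."""
    return 0.5 ** (years * 12 / halving_months)


def software_cost_factor(years, annual_growth):
    """Multiple of the starting software cost after `years` of compound growth."""
    return (1 + annual_growth) ** years


# Hypothetical stand-in for GDP plus inflation; not a number from the article.
ASSUMED_SOFTWARE_GROWTH = 0.03

for years in (3, 6, 9):
    hw = hardware_cost_factor(years)
    sw = software_cost_factor(years, ASSUMED_SOFTWARE_GROWTH)
    print(f"After {years} years: hardware x{hw:.3f}, software x{sw:.3f}, "
          f"software-to-hardware ratio x{sw / hw:.1f}")
```

Whatever growth rate you plug in for software, the ratio between the two curves widens with every hardware generation, which is the mechanism behind software dominating the cost per CPW.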
Stay tuned, and if you know something, say something.
[Editor’s note: A heartfelt shout out to Gary Brolsma, who lip-synched the Numa Numa Song with such joy and enthusiasm, and whose video was the first thing I ever saw on YouTube. You are an inspiration, even after all of these years.]