Moore’s Law and the Performance Wall
October 5, 2009 Timothy Prickett Morgan
Moore’s Law, one of the most famous observations in the IT industry, could soon be running out of gas. Some might even argue that for certain applications, the ability to cram twice as many transistors onto a piece of silicon every two years or so, thanks to the drumbeat of advances in semiconductor manufacturing technology (an observation made by Intel co-founder Gordon Moore in 1965), has already hit a wall. For the Power Systems server running the current i 6.1 or the future i 7.1, the headlights out in front of us may already be showing the brick barrier we are facing.
The barrier is software, not hardware, but when you hit that barrier, it is going to bend a lot of metal just the same. And it is a lot harder to solve the problem of using parallelism in the software to make use of the parallelism in hardware than it has been for three decades to shrink chips, crank clocks, and move more and more of the electronics in a system onto a single piece of silicon. But there are other problems, such as the diminishing returns that chip makers get from process shrinks per unit of investment in plant and equipment. It takes billions of dollars to do a process shrink, which is why you have seen so many chip makers dumping their fabs and so many fabs consolidating their work in the past decade. Those are the economic and technical challenges of making “faster” computers, a term I use in a generic way to mean delivering more performance, whether it comes through clock speeds on single chips or through the hardware parallelism that comes from plunking multiple cores, memory controllers, cache memory, and soon main memory and networking onto a single chip.
To its credit, IBM was the first to figure out some of the limits of just shrinking chips and cranking clocks to make servers run faster, which is why the “Northstar” family of 64-bit PowerPC processors had the world’s first implementation of simultaneous multithreading, or SMT, in a commercial processor. That was back in 1997, by the way, long before Intel’s relatively weak implementation of SMT, called HyperThreading, came to market. Anyway, with SMT, the instruction pipeline in the execution units of the chip is virtualized, in this case to look like two virtual pipelines (commonly called threads) and hence like two virtual processors to the operating system and its applications. This is very clever stuff: when an instruction on one thread stalls waiting for data, the processor can work on the other thread instead of just sitting there doing nothing. SMT can boost performance by anywhere from 20 to 40 percent for a single chip.
With the Power4 line of chips in 2001, IBM went from having two virtualized cores in a chip (through SMT) to putting two actual cores, running at a top speed of 1.3 GHz, and their L1 and L2 cache memories on the chip (that was with a 180 nanometer process). Power4+ jumped to a 130 nanometer process, cranked the clocks to 1.7 GHz, and pulled the main memory controllers onto the chip. Power5 kept the same 130 nanometer process, boosted the L2 cache to 1.9 MB (up from 1.44 MB with the Power4 and 1.5 MB with Power4+), and added back SMT with an improved implementation. Power5+ chips were based on a 90 nanometer process, and clock speeds ramped up to 2.2 GHz; this shrink was used not only to crank clock speeds, but also to put two whole chips in a single package to fight against the multiple-core X64 chips that were coming on the market. With the Power6 chip, IBM shifted to a 65 nanometer process, kept two cores per chip, and used the shrink to goose the clock speed up to 5 GHz max while adding 8 MB of L2 cache and L3 cache controllers to the chip. The Power6+ chip, which IBM never copped to announcing, came out in October 2008 and April 2009, allowing some clock speed goosing and some minor tweaks to the chip architecture.
As previously reported in The Four Hundred, the future Power7 chip coming from IBM is going to have a whole lot of cores, a whole lot of threads, and a whole lot of sockets. But I am not so sure that it will have a whole lot of increased performance for the kinds of monolithic applications running on the i and AIX platforms, particularly the i boxes, many of which are running application code that still thinks there is only one thread in the computer, no matter how many IBM weaves into the system. So adding threads and cores may not give you the kind of performance boost that IBM’s Commercial Processing Workload (CPW) relative performance metrics imply. Remember, CPW is more a measure of aggregate processing capacity than it is of single-thread performance.
The DB2 for i database can scale across many threads and cores, as most databases do. When IBM put out CPW ratings for the Power 595 box running i 6.1 in June 2008, it told me that the benchmarks run to come up with the 300,000 CPW rating for the system were actually the result of splitting the Power 595 in half and using 32 cores in each half (with a total of 64 threads running at 5 GHz) to get 150,000 CPWs of capacity per half. The database could scale beyond that, on a special bid basis, but IBM did not say how far it could scale. Perhaps it might only deliver 225,000 CPWs on a 64-core, 128-thread Power 595 image, I guessed.
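To put a number on that guess, here is a minimal Python sketch of the arithmetic. The 150,000 and 300,000 CPW figures are the ones quoted above, the 225,000 CPW figure is only the guess made above and not an IBM measurement, and the function name scaling_efficiency is hypothetical.

```python
# CPW figures quoted above for the Power 595 running i 6.1.
CPW_PER_32_CORE_HALF = 150_000      # measured on each 32-core, 64-thread half
RATED_CPW_64_CORES = 300_000        # published rating: the two halves added together
GUESSED_CPW_SINGLE_IMAGE = 225_000  # the guess from the text, NOT an IBM number


def scaling_efficiency(actual_cpw: float, base_cpw: float, factor: float) -> float:
    """Fraction of ideal linear scaling achieved when resources grow by `factor`."""
    return actual_cpw / (base_cpw * factor)


if __name__ == "__main__":
    # Two independent halves scale perfectly by construction.
    rated = scaling_efficiency(RATED_CPW_64_CORES, CPW_PER_32_CORE_HALF, 2)
    # A single 64-core image delivering the guessed figure would fall short.
    guessed = scaling_efficiency(GUESSED_CPW_SINGLE_IMAGE, CPW_PER_32_CORE_HALF, 2)
    print(f"rated (two halves): {rated:.0%} of linear")
    print(f"guessed single image: {guessed:.0%} of linear")
```

If the guess were right, doubling the cores in one image would deliver only three quarters of the capacity that linear scaling promises.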
Well, with the Power7 chips coming next year, IBM has to get the multithreading fixed in DB2 for i or get a whole lot of excuses ready for why customers buy more cores and threads, running at lower clock speeds, and don’t see performance go up. A midrange Power7 box might have four sockets, eight cores per socket, and four threads per core, for a total of 128 threads. We already know that i 6.1 doesn’t do so well above 64 threads, for reasons that the company has not yet explained. (Maybe all databases have trouble beyond 32 threads, and the AS/400 people are just more honest about it.)
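The thread count for that hypothetical midrange box is simple multiplication, sketched below in Python. The function names hardware_threads and single_thread_share are made up for illustration, and the configuration is the speculative one described above, not an announced IBM product.

```python
def hardware_threads(sockets: int, cores_per_socket: int, threads_per_core: int) -> int:
    """Total hardware threads the operating system sees."""
    return sockets * cores_per_socket * threads_per_core


def single_thread_share(total_threads: int) -> float:
    """Share of the machine's threads that a strictly single-threaded job can occupy."""
    return 1 / total_threads


if __name__ == "__main__":
    # Speculative midrange Power7 box: 4 sockets, 8 cores each, 4 threads per core.
    total = hardware_threads(sockets=4, cores_per_socket=8, threads_per_core=4)
    print(f"hardware threads: {total}")
    print(f"share used by one single-threaded job: {single_thread_share(total):.2%}")
```

The second figure is the uncomfortable one: code that only ever runs on one thread can occupy less than one percent of the threads in such a box, however much aggregate capacity the machine has.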
There’s an even larger problem, perhaps. If you are programming your AS/400, iSeries, System i, or Power Systems i box using Java, then your life is a bit easier when it comes to making use of the threads inside a heavily threaded box such as the Power7 servers will be. The Java Virtual Machine and the Java code that runs inside of it were designed to eat threads for breakfast, which they had to be able to do to make up for the fact that Java code, even when it is compiled into bytecodes, is nowhere near as efficient as compiled RPG or COBOL code is when it is running down on the iron. The layer of abstraction in the JVM that gives you portability and simplicity comes at a performance cost.
RPG may be a different story. While IBM has added some multithreading features into ILE RPG as the core and thread counts have gone up in AS/400s and their progeny, I get the impression from IBM’s own WDSc documentation that RPG and parallelism do not play well together. “Normally, running an application in multiple threads can improve the performance of the application,” the WDSc documentation warns. “In the case of ILE RPG, this is not true in general. In fact, the performance of a multithreaded application could be worse than that of a single-thread version when the thread-safety is achieved by serialization of the procedures at the module level.”
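The effect that documentation describes can be illustrated with a small sketch. This is Python, not ILE RPG, and it does not show how the RPG runtime actually implements thread safety; MODULE_LOCK is a hypothetical stand-in for serialization at the module level, and time.sleep stands in for real work. The point is only that when every call into a module must hold one lock, adding threads adds no concurrency.

```python
import threading
import time

# Hypothetical stand-in for module-level serialization: one lock guards every call.
MODULE_LOCK = threading.Lock()


class ConcurrencyMeter:
    """Records the largest number of threads inside a procedure at the same time."""

    def __init__(self):
        self._guard = threading.Lock()
        self._inside = 0
        self.peak = 0

    def __enter__(self):
        with self._guard:
            self._inside += 1
            self.peak = max(self.peak, self._inside)
        return self

    def __exit__(self, *exc):
        with self._guard:
            self._inside -= 1
        return False


def procedure(meter, work_seconds):
    """The 'work': sleeping stands in for a procedure that takes some time."""
    with meter:
        time.sleep(work_seconds)


def serialized_procedure(meter, work_seconds):
    """Same work, but only one thread at a time may enter the module."""
    with MODULE_LOCK:
        procedure(meter, work_seconds)


def run(target, n_threads=4, work_seconds=0.05):
    """Run `target` on n_threads threads; return (peak concurrency, elapsed seconds)."""
    meter = ConcurrencyMeter()
    threads = [
        threading.Thread(target=target, args=(meter, work_seconds))
        for _ in range(n_threads)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return meter.peak, time.perf_counter() - start


if __name__ == "__main__":
    for name, target in [("unserialized", procedure), ("serialized", serialized_procedure)]:
        peak, elapsed = run(target)
        print(f"{name}: peak concurrency {peak}, elapsed {elapsed:.2f}s")
```

In the serialized case the peak concurrency never exceeds one, so four threads take roughly four times as long as one call, and the application also pays for creating and scheduling threads that bought it nothing.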
I am by no means a programming expert. But it seems to me that this will need to change for RPG to continue to be a useful programming language as more cores and threads are built into Power-based systems to increase their capacity. Either that, or IBM is going to have to crank out special chips with lower core and thread counts and higher clock speeds for its i shops. Or take RPG and drop it into a virtual environment akin to a JVM and let it make use of threads. I think it will be a cold day in hell (much colder than in Rochester, Minnesota, in fact) before IBM makes special chips, and a portable, virtualized RPG compiler and runtime environment is something IBM does not seem inclined to create, but which it should have created more than a decade ago, before Java was even a thought inside of James Gosling’s head at Sun Microsystems. IBM might just suggest that you run multiple copies of your applications in logical partitions to support different sets of users. I dunno.
I have no idea what IBM will or can do, but I do know what you should do. If you are an IT manager, you should be talking to your programmers about how the Power7 architecture might affect your workloads, and then you should ring up your business partners or IBM, whoever you work with, just to start the planning cycle now. Getting the most out of the Power7 hardware dollars you spend might not be as simple as doing an upgrade and getting back to the grind. It might even mean putting off a move to Power7 iron and sticking with Power6 or Power6+ boxes as you dig through your code and see how parallelization can and cannot be used to make your applications run faster as well as offer more capacity to support more work.