Threading The Needle Of Power8 Performance
June 2, 2014 Timothy Prickett Morgan
I am going to take a break from the price/performance comparisons between earlier Power machines and the new Power8 systems and talk about something a little bit different from but still intimately related to the performance that customers can expect from the latest IBM i operating system releases as they run on the new iron and on older Power7 and Power7+ machines. The issue is threading, and IBM has done some pretty clever things to help the operating system and processors juggle more work and boost throughput.
Most modern processors have some form of simultaneous multithreading, or SMT, on them. With SMT, the processor pipeline is broken into two or more virtualized instruction pipelines that look and feel exactly like real pipelines to the operating system and its applications. By doing so, certain applications that run better over multiple threads, such as database management systems and Java application servers, can boost their performance significantly. Single-threaded jobs would not see the other threads, and they would perform worse on machines with SMT, so there is always a bit of a balancing act. Luckily, most modern implementations of SMT allow users to turn it on and off, and with the Power8 chips, it can be dynamically changed as applications, middleware, and databases require. Earlier generations of Power processors had two threads per core, and in fact IBM’s AS/400e “NorthStar” processors from the late 1990s were among the first RISC processors to introduce SMT. The Power7 chips had four threads per core, which IBM calls SMT4, and the Power8 has eight threads per core, which it calls SMT8. So not only does Power8 have 50 percent more cores than Power7+, it also has twice as many threads per core. The combination of the two is largely why the performance of the Power8 socket is roughly 2X to 2.5X that of the Power7 chip from early 2010.
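To put numbers on that combination, here is a rough sketch of the per-socket thread arithmetic. The core counts are taken from figures quoted elsewhere in this article (eight cores per Power7-class socket, 24 cores across a two-socket Power8 box); the names in the code are made up for illustration and are not anything IBM ships.

```python
# Back-of-the-envelope thread math per socket.
# Core counts come from figures quoted in the article; names are illustrative only.
POWER7_PLUS = {"cores": 8, "smt": 4}   # SMT4: four threads per core
POWER8 = {"cores": 12, "smt": 8}       # SMT8: eight threads per core


def threads_per_socket(chip: dict) -> int:
    """Hardware threads per socket = cores per socket * threads per core."""
    return chip["cores"] * chip["smt"]


core_ratio = POWER8["cores"] / POWER7_PLUS["cores"]
thread_ratio = threads_per_socket(POWER8) / threads_per_socket(POWER7_PLUS)

print(threads_per_socket(POWER7_PLUS))  # 32
print(threads_per_socket(POWER8))       # 96
print(core_ratio, thread_ratio)         # 1.5 3.0
```

Note that the thread count triples while the quoted socket performance gain is 2X to 2.5X. More threads do not translate one-for-one into more throughput.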
I will get into the SMT features more in a moment, but the question you have to ask yourself is: How many threads can the operating system itself see? Just because the threads are there in the processors in the box does not necessarily mean that the operating system can make full use of them.
Remember that earlier i5/OS and IBM i operating systems could not span the full extent of Power Systems hardware–what I called being short-sheeted on the thread count when it happened with Power7 systems and IBM i 6.X software four years ago. Just to refresh your memory, here is how it looked with IBM i 6.1 and 7.1 on Power6 and Power7 iron four years ago:
IBM i 6.1 running on Power6 and Power6+ chips could only span 32 cores and 64 threads, which is not very much by modern standards where a big bad box potentially has hundreds of cores and possibly thousands of threads, all in a single system image. (Meaning, a single memory space that all processors share.) If you ran IBM i 6.1 on Power7 chips, the thread maximum doubled to 128 but the core maximum stayed at 32. With IBM i 7.1 running on Power7 chips, the max was 32 cores and 128 threads, but with special RPQ support from IBM, you could push it up to 64 cores and 256 threads. As you can see, AIX 7.1 could span the biggest Power 795 and create a single image across its 256 cores and 1,024 threads. You were basically limited to an eighth of such a machine with IBM i 7.1 and a quarter with special assistance from IBM. Big Blue has never provided much of a guarantee about the performance you can expect as you scale the operating systems across more cores and threads. It is not linear at a 45 degree angle, that much is for sure.
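The eighth and the quarter fall straight out of the limits quoted above. Here is a minimal sketch of that arithmetic, using only the numbers in this article; the function and constant names are hypothetical.

```python
from fractions import Fraction

# Top-end Power 795 as quoted in the article: 256 cores, 1,024 threads.
POWER_795_CORES = 256
POWER_795_THREADS = 1024


def share_of_machine(cores: int, threads: int) -> tuple:
    """Return the fraction of the Power 795's cores and threads a partition can span."""
    return (Fraction(cores, POWER_795_CORES),
            Fraction(threads, POWER_795_THREADS))


base = share_of_machine(32, 128)   # IBM i 7.1 standard limit on Power7
rpq = share_of_machine(64, 256)    # IBM i 7.1 with the RPQ from IBM

print(base)  # (Fraction(1, 8), Fraction(1, 8))
print(rpq)   # (Fraction(1, 4), Fraction(1, 4))
```

The core share and the thread share come out the same in both cases because the limits assume SMT4 on every core.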
Now, let’s move ahead to IBM i 7.1 Technology Refresh 8 (TR8) and the even shinier IBM i 7.2. Take a look at this table:
This table shows cores per partition, but remember, the Power Systems machines are always running the PowerVM hypervisor and a machine that you perceive as having no logical partitions and no virtualization is really a box with the hypervisor on it that is running only one logical partition. As you can see, on Power7 (and therefore also Power7+) processors, you have the same ST, SMT2, SMT4 thread support and the same base 32 cores and 128 threads as a maximum for IBM i 7.1 TR8 as you had for the original 7.1 release. You can push that up to 64 cores and 256 threads with the RPQ from IBM Lab Services. On Power8 iron, IBM i 7.1 TR8 adds support for the SMT8 threading on each core. On the largest Power8 machine on the market thus far–a two-socket machine with 24 cores and 192 threads–IBM i can see the whole thing and span it as one partition if need be. IBM adds, as you can see, that the theoretical maximum number of threads for IBM i 7.1 TR8 is 256 threads per partition.
With IBM i 7.2, which is brand new and which does not have a Technology Refresh yet, IBM has pushed up the scalability on Power7 machines. You can now span up to 96 cores and 384 threads with the RPQ from Lab Services. That is a 12-socket Power7 machine with all eight cores in each socket. That gets you up into the Power 780 zone, just to make it concrete. On Power8 machinery, IBM is supporting the maximum 24 cores in the two-socket systems, but says the operating system is already good to go with 48 cores when needed. So you know there is a future Power8 machine with four sockets coming down the pike. The thread count tops out at 192 threads for the current machines, but IBM says that with SMT8 mode, it can make IBM i 7.2 run across up to 768 threads. That is 96 cores across eight sockets with SMT8 turned on; or, it could be 12 sockets with only eight cores turned on in each. The math works.
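For anyone who wants to check that math, here is a small sketch that multiplies it out. The configurations are the ones quoted in the paragraph above; the helper name is an assumption for illustration, not an IBM tool.

```python
def max_threads(sockets: int, cores_per_socket: int, smt: int) -> int:
    """Hardware threads = sockets * cores per socket * SMT threads per core."""
    return sockets * cores_per_socket * smt


# Power7 with the Lab Services RPQ: 12 sockets, eight cores each, SMT4
power7_rpq = max_threads(12, 8, 4)

# Current two-socket Power8: 24 cores total, SMT8
power8_two_socket = max_threads(2, 12, 8)

# Two ways to reach the IBM i 7.2 ceiling of 96 cores in SMT8 mode
eight_socket = max_threads(8, 12, 8)
twelve_socket = max_threads(12, 8, 8)

print(power7_rpq)                   # 384
print(power8_two_socket)            # 192
print(eight_socket, twelve_socket)  # 768 768
```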
Presumably there will be tweaks to IBM i 7.2 with Technology Refreshes to span even larger iron when it becomes available.
Now, back to SMT for a second, which is the other aspect of threading that affects performance. Conceptually, here is what is going on and the aggregate performance scaling that can be achieved, in theory:
The important thing is that you can switch between multithreading modes dynamically and you can mix different modes inside of a logical partition for different parts of the application and system code base. This is both clever and handy. Here is the effect of SMT on transaction workloads, according to IBM, using an unnamed benchmark test:
As you can see, the activation of progressively finer-grained virtual threads allows for the Power8 processor to get more work done. IBM’s Commercial Performance Workload (CPW) test, which is used to gauge the relative performance of OS/400, i5/OS, and IBM i machines, was run on the Power8 systems in SMT8 mode, and IBM cautions that if you drop back to SMT4 mode, your performance for online transaction processing work will be lower. (Presumably, this will also hold true as you drop back to SMT2 mode or drop all the way back to turning virtual threading off.) IBM did not provide a CPW table to show the effects of SMT levels on IBM i workloads on Power8 machines, but it did create one for AIX workloads using the Relative Performance (rPerf) test:
I don’t think CPW and rPerf differ much in the extra throughput they show from each successive SMT level upgrade. But it is always best to test this with your own applications, of course.
As you can see, the incremental gains of SMT diminish as more threads are added to each core. This is true of all chips, not just the Power8 processor. It seems unlikely that IBM will increase the thread count on cores with Power9 chips, unless it can get more performance out of the machines than the jump from SMT4 to SMT8 mode. It is hard to figure what IBM will do with Power8+ and Power9, to be honest with you, but that is a conversation for another day.