IBM Puts More Power7 Iron Through the Java Test Paces
June 14, 2010 Timothy Prickett Morgan
Back in March, in the wake of the initial Power7-based server announcements from IBM that came out in early February, I walked you through the available benchmarks for the new Power 750 midrange box running the SAP data warehousing benchmark and the SPECjbb2005 benchmark from the Standard Performance Evaluation Corp. Since that time, IBM has launched Power7-based blade servers and done tests on the larger Power 770 and 780 machines to show their Java performance on the SPECjbb2005 workload.
Because there is less gaming in the SPECjbb2005 Java application benchmark (or so it looks from my armchair software engineer’s perspective), the performance differences among Power7 boxes running i 6.1, AIX 6.1, and Linux 2.6 are not as significant as the Commercial Performance Workload (CPW) ratings on i boxes and the Relative Performance (rPerf) ratings on AIX and Linux machines imply. I can’t prove it, but I think that the AIX folks are doing all kinds of tuning on the TPC-C online transaction processing benchmark test to show high performance numbers, since IBM has been hell-bent on knocking the wind out of Hewlett-Packard and Sun Microsystems (now part of the Oracle collective) in the Unix racket. I think that whatever IBM is doing to show such good performance has something to do with de-randomizing the TPC-C data stream such that data needed by processors to do transactions is shuttled there even before they ask for it. If this is indeed what IBM is doing, it must not be against the rules, or the TPC would have revoked IBM’s benchmarks. But whatever IBM is doing, I have a hard time believing that the performance gap between i and AIX is as large as the CPW and rPerf tests (both of which are derivative of the TPC-C test) imply.
The good news, if you are an i shop trying to defend your platform, is that the performance gap between i, AIX, and Linux machines is not as large on the SPECjbb2005 test. Although to be honest, there is a gap, and I have to imagine that IBM could close it further than it has.
You can see the full listing of machines tested using the SPECjbb2005 benchmark here. I distilled the most recent Power-based machines that IBM has tested using SPECjbb2005 into a single table to help you reckon how the Power7 machines running the i operating system stack up. (And in a future issue of The Four Hundred, I will roll in all the latest X64-based servers from the key players plus some funky clustered machines with very high throughput to give you some food for thought.)
When IBM did its first batch of SPECjbb2005 tests on the Power 750s, it did everything it could to not allow for a direct comparison between i 6.1.1, AIX, and Linux running the Java test on the box. IBM tested i 6.1.1 on a Power 750 with two Power7 processors (that’s 16 cores and 64 threads) running at 3.3 GHz, which was able to process 976,223 business operations per second, or BOPS, on the test. AIX and Linux were tested on a four-socket, 32-core, 128-thread configuration, and did 2.48 million and 2.41 million BOPS, respectively. As I said back in March, if Java performance scales with the CPW ratings, then a Power 750 running i 6.1.1 and using the fully loaded (instead of half loaded) server with 32 cores and using the faster 3.55 GHz parts (instead of the 3.3 GHz ones) should be able to do about twice the work of the i box that IBM tested, or around 1,992,471 BOPS. And if that is the case, that would indicate there is a performance penalty that i shops are paying by using the 32-bit JVM running in the PASE AIX runtime inside of i 6.1.1 instead of running it natively in the real AIX, something on the order of a 20 percent penalty by this math.
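For readers who want to check that math, here is a minimal sketch in Python. It uses only the figures quoted above; the AIX and Linux results are the rounded "2.48 million" and "2.41 million" values, so the percentages are approximate, and the function and variable names are mine rather than anything from IBM or SPEC.

```python
# Figures quoted in the paragraph above (AIX and Linux values are rounded).
I_TESTED_BOPS = 976_223        # i 6.1.1 on 16 cores at 3.3 GHz
I_PROJECTED_BOPS = 1_992_471   # estimate for i on 32 cores at 3.55 GHz
AIX_BOPS = 2_480_000           # AIX on 32 cores
LINUX_BOPS = 2_410_000         # Linux on 32 cores


def penalty(slower: float, faster: float) -> float:
    """Fractional shortfall of `slower` relative to `faster`."""
    return 1.0 - slower / faster


# How much the tested i result was scaled up to reach the projection.
scale = I_PROJECTED_BOPS / I_TESTED_BOPS

print(f"projection scale factor: {scale:.2f}x")
print(f"projected i vs AIX penalty:   {penalty(I_PROJECTED_BOPS, AIX_BOPS):.1%}")
print(f"projected i vs Linux penalty: {penalty(I_PROJECTED_BOPS, LINUX_BOPS):.1%}")
```

The penalty is measured against the faster platform, which is why the projected i result is divided by the AIX or Linux result and not the other way around.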
Well, it looks like my estimate was a bit high, but not that far off the mark, because Big Blue did the right thing and ran the SPECjbb2005 test on a fat Power 780 with an identical hardware configuration under i 6.1.1, AIX 6.1, and Linux 2.6 (in this case, Novell’s SUSE Linux Enterprise Server 11), and the performance penalty for being on i relative to AIX was 17 percent.
You can see for yourself in this updated SPECjbb2005 table relating to Power Systems machines. The old Power6 and Power6+ machines and the Power 750 from the table I made in March are above the light blue line, and all the new machines tested are below the light blue line. IBM tested the new Power Systems 702 blade server (the double-wide with two sockets), a Power 770, and three different configurations of the Power 780. Only one machine (the Power 780 with half the cores in the box running at 4.14 GHz in Turbo Boost mode and with the maximum of 128 threads that i 6.1.1 supports) actually pitted all three operating systems against each other on the same exact iron. IBM did not see fit to test i 6.1.1 or the even better i 7.1 on all the machines as it did AIX 6.1.
But people need to make some sort of comparison, so I used the delta between i and AIX on the Turbo Boosted Power 780 to reckon what an i operating system might do on some of the configurations tested with AIX 6.1, which already supports 256 threads. You have to go to a special-bid version of i 7.1 to span 256 threads, as I explained in detail back in February, and while AIX 7.1, SLES 11 SP1, and Red Hat Enterprise Linux 6 will all span 256 cores and 1,024 threads, i 6.1.1 and i 7.1 will not. If you want to run on a future Power 795 behemoth, you will have to partition your machine. (See i/OS Gets Short Sheeted with Power7 Thread Counts for more on that.)
On the Power 770 and Power 780 machines, I estimated the performance using the 256-thread special bid version of i 7.1, and did so by keeping the CPW-to-SPEC BOPS ratio as close as I thought reasonable (in the range of 11 to 12). Doing that, the performance penalty for running i 7.1 and the 32-bit JVM inside the PASE AIX runtime (instead of the 64-bit native JVM for OS/400 and i) is probably closer to 25 percent. But it could end up being as low as 15 percent because SPECjbb2005 could scale better across 64-core, 256-thread Power7 machines running i than does the CPW workload that is related to (but not the same as) SPECjbb2005 and TPC-C. It’s a tough call to estimate this, so I split the difference based on the initial CPW ratings for the larger configurations of the Power 770 and Power 780 boxes, which are based on splitting a 64-core machine into two 32-core partitions, by the way. With Java application workloads, partitions are fine, but with database-driven OLTP workloads, partitioning the database is not cool. But all the cool vendors do it to push the performance limits on the TPC-C tests, so why shouldn’t IBM do the same, given that it has limited thread support with i 6.1, i 6.1.1, and i 7.1?
The performance gap on 256-thread configurations between i and AIX could be a lot larger than 25 percent if the DB2 for i database, for whatever reason, just doesn’t scale well across threads or the OS itself does not. We really won’t be able to find out until later this year, when IBM supports fatter memory cards on the Power 770, 780, and 795 machines. IBM can ship the cores to make big boxes, but it will not get a 256 GB DDR3 memory card into the field (allowing for balanced performance across 32, 64, 128, and 256 cores) until November of this year. That’s why IBM’s TPC-C test for the Power 780 only activated one-quarter of the cores in the box, because with current memory cards, the maximum memory across a four-chassis Power 780 is 512 GB. To activate all the cores in its bigger boxes and get balanced performance, IBM needs fatter memory cards.
One more thing to notice. The 32-core Power 595 box that IBM tested a few years back running i5/OS V5R4, which could handle 1.53 million BOPS, was comparatively awful at running the SPECjbb2005 test, as were similar Power 570 and Power 595 configurations. These machines used the dual-core Power6 or Power6+ chips, which had 36 MB of off-chip L3 cache. The Power7 chips have 32 MB of on-chip eDRAM L3 cache and a much more efficient caching setup that Java just loves. This eDRAM cache is what helps Java perform well even though IBM has cut back the clock speeds, from a range of 4.2 GHz to 5 GHz on the Power6/Power6+ chips to a range of 3 GHz to 4.14 GHz on the Power7s. The net result is that the BOPS processed per core has gone up on similar classes of boxes. A Power 550 running AIX could do just under 44,000 BOPS per core, but a Power 750 with Power7 clock speeds that are 15.5 percent lower (down to 3.55 GHz from 4.2 GHz with the Power6) is able to handle 77.6 percent more BOPS per core. And, in moving from dual-core to octo-core chips, the machine is able to crank through seven times the work. This is real progress, since software is priced per core. But imagine the progress if IBM charged for software based on the socket. . . . On Power 770 and Power 780 machines, the BOPS per core has grown similarly compared to the Power 570 and Power 595 machines.
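The per-core arithmetic in that last paragraph can be sketched in a few lines of Python. The inputs are the clock speeds, core counts, and 77.6 percent per-core gain quoted above; the sketch assumes the two machines being compared have the same number of sockets, and the per-clock figure at the end is my own derived number, not one from the benchmark reports.

```python
# Inputs quoted in the paragraph above.
P6_GHZ, P7_GHZ = 4.2, 3.55                   # Power 550 vs Power 750 clocks
P6_CORES_PER_CHIP, P7_CORES_PER_CHIP = 2, 8  # dual-core vs octo-core
BOPS_PER_CORE_GAIN = 0.776                   # 77.6 percent more BOPS per core

# Relative drop in clock speed going from Power6 to Power7.
clock_drop = (P6_GHZ - P7_GHZ) / P6_GHZ

# Assumption: same socket count, so the core count scales with cores per chip.
core_ratio = P7_CORES_PER_CHIP / P6_CORES_PER_CHIP
throughput_ratio = core_ratio * (1 + BOPS_PER_CORE_GAIN)

# Derived here, not in the source: per-core work normalized by clock speed.
per_clock_ratio = (1 + BOPS_PER_CORE_GAIN) / (P7_GHZ / P6_GHZ)

print(f"clock speed drop:          {clock_drop:.1%}")
print(f"machine throughput ratio:  {throughput_ratio:.1f}x")
print(f"per-core, per-clock ratio: {per_clock_ratio:.2f}x")
```

Four times the cores multiplied by 1.776 times the work per core is where the "seven times the work" figure comes from.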