Itanium 2 Servers Will Rival iSeries Machines in OLTP Power

by Timothy Prickett Morgan

Intel raised the curtain on the forthcoming "McKinley" Itanium 2 processor last week, which more or less fulfills the promises Intel made way back in 1996 for the "Merced" first-generation Itanium chip. The company says that the 1 GHz Itanium 2 will offer from 1.5 to 2 times the performance of the 800 MHz Merceds, which shipped last year, and that Itanium 2 will ship in mid-year. The scuttlebutt is that Itanium 2 will debut on July 8.
Unlike many chips built by Intel, Hewlett-Packard, IBM, Sun Microsystems, Fujitsu, and Advanced Micro Devices, the Itanium 2 processor is not just a shrunken version of the prior generation of chip, made with a more advanced manufacturing process that lets a vendor crank up the clock speed. The Itanium 2 has substantial architectural changes that yield the 50 percent to 100 percent performance improvement over the first-generation Merced chips on compiled Itanium applications. At 1 GHz, the clock speed on the McKinley is only 25 percent higher than the 800 MHz clock speed on the Merced, which was also available at a lower 733 MHz clock speed. This increased clock speed does not come from an improved chip-making process, however; it is in fact the target top clock speed of the original Merced chips.

Both the Merced and McKinley chips are made using a 0.18 micron process, and Intel is not expected to move to the leading-edge 0.13 micron process, used in its low-power laptop and server processors, until the "Madison" and "Deerfield" generation of Itaniums (presumably to be called the Itanium 3) in 2004. Madison and Deerfield are slightly different versions of the same chip, which is why they count as a single generation. Deerfield is a low-cache, low-power version of Madison, just as the "Tualatin" Pentium III for laptops is a stripped-down, shrunken version of the regular Pentium III processor.

Here are the big differences between Itanium and Itanium 2. The Merced Itanium processor had 25 million transistors, with the off-chip cache accounting for another 300 million transistors, or 325 million in all. The McKinley chip has a total of 221 million transistors, with the lower count mostly coming from the reduction in L3 cache memory size, from 4 MB to 3 MB. The Merced chip had 32 KB of L1 instruction and data cache, 96 KB of L2 cache, and either 2 MB (733 MHz) or 4 MB (800 MHz) of L3 cache. The McKinley chip has the same 32 KB L1 cache, a bigger 256 KB L2 cache, and a 3 MB L3 cache.
The Merced chips could process six instructions per clock cycle, in theory, but in practice this probably didn't happen. Let's go from the outside into the guts of Merced. The chip had 4 MB of off-chip L3 cache memory that was connected to main memory through a 64-bit, 266 MHz system bus that yielded 2.1 GB/sec of bandwidth. The Merced had 10 pipeline stages and nine instruction issue ports that, in turn, fed into 328 registers. These registers fed into four integer units, three branch units, two floating point units, two SIMD units, and two load/store units, which ran at 800 MHz.

With McKinley, Intel has tripled the system bus bandwidth, moved a smaller (but still quite large) L3 cache onto the chip, removed two pipeline stages, added issue ports, and tweaked the various computing units inside the chip, so a 1 GHz processor comes closer to actually processing those six theoretical instructions per clock cycle. Specifically, the McKinley chip has 3 MB of L3 cache on the chip, which is linked to a 128-bit, 400 MHz system bus with 6.4 GB/sec of aggregate bandwidth. The McKinley has eight pipeline stages feeding into 11 issue ports, which, in turn, feed into the same 328 registers. These pass off instructions and data to six integer units, three branch units, two floating point units, one SIMD unit, two load units, and two store units.

The increase in integer units and clock speed alone accounts for close to 90 percent more throughput on integer workloads. Having the L3 cache on die, rather than within the Itanium packaging on separate chips, cuts the L3 memory latency (how long it takes to move data from L3 to the chip or from the chip up to L3) by half. The increased system bus bandwidth is what has allowed the dual floating point units in the Itanium chips to do the work they could have been doing all along. The Merced was bandwidth-crippled from the get-go, and McKinley proves it.
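The bus and integer-unit figures above can be sanity-checked with a little arithmetic. The sketch below is only a back-of-the-envelope illustration: it assumes the quoted bus clocks are effective transfer rates, and that peak integer throughput scales with the number of integer units times the clock speed. The function names are invented for this example.

```python
# Rough check of the peak bus bandwidth and integer throughput figures.
# These are theoretical ceilings, not measured results.

def bus_bandwidth_gb_per_sec(width_bits: int, clock_mhz: float) -> float:
    """Peak bandwidth: bytes per transfer times transfers per second, in GB/sec."""
    bytes_per_transfer = width_bits / 8
    return bytes_per_transfer * clock_mhz * 1e6 / 1e9

def peak_integer_ratio(units_new: int, mhz_new: float,
                       units_old: int, mhz_old: float) -> float:
    """Ratio of peak integer throughput, assuming it scales with units * clock."""
    return (units_new * mhz_new) / (units_old * mhz_old)

merced_bus = bus_bandwidth_gb_per_sec(64, 266)       # 2.128, quoted as 2.1 GB/sec
mckinley_bus = bus_bandwidth_gb_per_sec(128, 400)    # 6.4 GB/sec
integer_gain = peak_integer_ratio(6, 1000, 4, 800)   # 1.875

print(f"Merced bus:   {merced_bus:.1f} GB/sec")
print(f"McKinley bus: {mckinley_bus:.1f} GB/sec ({mckinley_bus / merced_bus:.1f}x)")
print(f"Peak integer throughput gain: {(integer_gain - 1) * 100:.1f} percent")
```

Under these assumptions, the bus ratio works out to roughly three times, and the integer gain to 87.5 percent, which is where "close to 90 percent" comes from.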
Why this was the case is unclear, but my guess is that Intel's i870 server chipset was late and buggy, so Intel had to graft Merced onto the i460GX workstation chipset just to get it out the door last June. This chipset was fine for two-way and four-way workstations, but it was not designed for servers in the same way that the i870 chipset, now called the E8870 chipset, was designed to handle server workloads.

Java and in-memory database applications have benefited most among all applications in the jump from Merced to McKinley, according to initial benchmark test results provided by Intel, with almost twice the performance. On security workloads such as SSL encryption, the McKinley offered about 50 percent more performance, with 20 percent coming from the higher clock speed and the remaining 30 percent coming from increased bandwidth and tweaks in the guts of the chip. Performance on Linpack Fortran benchmarks was a little more than 50 percent better on McKinley than on Merced, and performance on the SPECint2000 integer and SPECfp2000 floating point benchmarks was up by about 70 percent and 75 percent, respectively. Computer-aided engineering and ERP applications that were compiled for Merced will see around 75 percent more performance.

Oddly enough, online transaction processing performance (by which we presume Intel means the TPC-C benchmark) will increase by only 50 percent. However, this could yield a big improvement in scalability for Itanium-based enterprise servers. If these per-chip improvements in online transaction processing can be passed through to the full systems, which were presumably designed with enough memory and system bandwidth to feed McKinley in the first place, the performance of a 16-way Itanium server, such as IBM's 16-way "Man-O-War" xSeries server, set to debut with the McKinleys, could hit as much as 210,000 transactions per minute (TPM).
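The McKinley OLTP projections in this article are simply the Merced estimates multiplied by the 50 percent per-chip gain. Here is a minimal sketch of that arithmetic, assuming the gain passes through to the full system undiminished; the Merced baselines are the article's own theoretical estimates for four-way and 16-way machines, and the variable names are made up for illustration.

```python
# Project McKinley TPM from the article's Merced estimates.
# Assumption: the 50 percent per-chip OLTP gain carries over to whole systems.

OLTP_GAIN = 1.5  # 50 percent more OLTP throughput per chip

# Theoretical Merced-based TPM estimates cited in the article.
merced_tpm = {"4-way": 45_000, "16-way": 140_000}

def project_tpm(baseline_tpm: float, gain: float = OLTP_GAIN) -> float:
    """Scale a baseline TPM estimate by a per-chip performance gain."""
    return baseline_tpm * gain

mckinley_tpm = {size: project_tpm(tpm) for size, tpm in merced_tpm.items()}

for size, tpm in mckinley_tpm.items():
    print(f"{size} McKinley server: about {tpm:,.0f} TPM")
```

If real systems lose some of that gain to memory or I/O bottlenecks, the projected figures would be upper bounds rather than expectations.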
The few 16-way Itanium servers from Unisys, Bull, and HP that use the Merced processors could, in theory, handle around 140,000 TPM. Very few of these Merced-based machines have been sold, because operating system, middleware, and application support for the Merced-class Itanium servers is scarce. A four-way Itanium server using the 800 MHz Merced chips could handle about 45,000 TPM, which means that a four-way McKinley server should handle about 67,500 TPM. These McKinley servers will support 64-bit implementations of Windows, Linux, and a number of different Unix environments, including HP's HP-UX variant. IBM's AIX and Sun's Solaris will not run on these McKinley machines, even though both vendors have done the ports, because neither vendor wants to launch its operating system on the platform until customers demand it.

Just to put those McKinley OLTP performance numbers in perspective, let's compare them with existing iSeries servers. While IBM does not say this publicly very often, the CPW performance ratings it gives for the iSeries and AS/400 line are based on the TPC-C benchmark. There are some changes that make them distinct, but if you multiply the CPW rating by a factor of 10, you get the theoretical TPM rating for a full configuration of an iSeries or AS/400 server; if you shave off about 5 percent or so, you get something that more closely resembles the performance IBM has delivered on actual TPC-C tests. So a four-way Model 820 server using S-Star processors running at 600 MHz and equipped with either 2 MB or 4 MB of L2 cache memory can process a little more than 35,000 TPM on the TPC-C test. Model 820 machines using the slower 500 MHz I-Star processors equipped with 4 MB of L2 cache can handle about 30,500 TPM.
If IBM delivered the 668 MHz S-Star processors in Model 820s, they could probably do about 39,000 TPM, and if IBM got aggressive and put the 750 MHz S-Stars in the Model 820 chassis, it could put a machine in the field that could handle about 43,750 TPM. It would take an S-Star running at about 1.15 GHz to offer equivalent performance to a McKinley server in four-way configurations. IBM, as far as I know, has no intention of offering S-Star processors at this high a clock speed in either the iSeries or pSeries lines; the 668 MHz and 750 MHz S-Stars are available in the pSeries line, but IBM has not made clear whether it intends to use these processors in the iSeries line. And while IBM does have 1.1 GHz and 1.3 GHz Power4 processors that could easily create four-way servers as powerful as the McKinley machines, IBM has not made its plans known for entry or midrange Power4-based servers in the iSeries family.

IBM did not offer a 16-way iSeries machine using the 600 MHz S-Stars, but the 24-way machine could handle just under 200,000 TPM, which is roughly equivalent to the expected performance of 16-way servers using the 1 GHz McKinley processors. The new iSeries Model 890 Power4-based server, which begins shipping next week in limited quantities, is rated at 20,000 CPW with 16 1.3 GHz processors activated, 29,300 CPW with 24 processors activated, and 37,400 CPW with all 32 processors activated. Because the Regatta server chassis on which the Model 890 is based is so different from the Condor chassis that the prior S-Star-based iSeries and pSeries machines use, it is hard to say for sure what the estimated TPC-C online transaction processing ratings are for these servers. However, if you assume the same relationship between CPW and TPC-C transactions per minute, then a 16-way Model 890 will offer about 190,000 TPM on the TPC-C test, which is in the same ballpark as a 16-way McKinley server.
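The S-Star estimates for four-way Model 820s given earlier are consistent with scaling throughput linearly with clock speed from the 600 MHz baseline of roughly 35,000 TPM. That linear scaling is an assumption made here for illustration, not something IBM has stated, and it ignores memory and I/O effects. The helper names below are hypothetical.

```python
# Linear clock-speed scaling from the four-way, 600 MHz S-Star Model 820.
# Assumption: OLTP throughput rises in direct proportion to clock speed.

BASE_MHZ = 600
BASE_TPM = 35_000            # "a little more than 35,000 TPM" in the article
MCKINLEY_4WAY_TPM = 67_500   # the article's four-way McKinley projection

def scaled_tpm(target_mhz: float) -> float:
    """Estimate TPM at a different clock speed by linear scaling."""
    return BASE_TPM * target_mhz / BASE_MHZ

def mhz_needed(target_tpm: float) -> float:
    """Clock speed needed to reach a target TPM under the same assumption."""
    return BASE_MHZ * target_tpm / BASE_TPM

print(f"668 MHz S-Star: about {scaled_tpm(668):,.0f} TPM")
print(f"750 MHz S-Star: about {scaled_tpm(750):,.0f} TPM")
print(f"Clock needed to match McKinley: about {mhz_needed(MCKINLEY_4WAY_TPM):,.0f} MHz")
```

The results land close to the 39,000 TPM, 43,750 TPM, and 1.15 GHz figures quoted in the text.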
The 24-way Model 890 will be able to crank out about 280,000 TPM, and the 32-way Model 890 should hit about 355,000 TPM. The iSeries will be able to run OS/400 and Linux workloads, but the AIX Unix variant is still years away on the iSeries.
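The Model 890 estimates follow from the CPW rule of thumb described earlier: multiply the CPW rating by 10, then shave off about 5 percent. Here is a short sketch of that conversion applied to the three Model 890 ratings quoted in this article. Treating the discount as exactly 5 percent is an assumption, since the text says only "about 5 percent or so", and the function name is made up.

```python
# Convert CPW ratings to rough TPC-C estimates using the article's rule of thumb.
# Assumption: the "about 5 percent" haircut is taken as exactly 5 percent.

CPW_TO_TPM = 10   # theoretical TPM per CPW
DISCOUNT = 0.05   # shave off about 5 percent

def estimate_tpm(cpw: float) -> float:
    """Rough TPC-C transactions per minute implied by a CPW rating."""
    return cpw * CPW_TO_TPM * (1 - DISCOUNT)

# Model 890 CPW ratings by number of activated 1.3 GHz processors.
model_890_cpw = {16: 20_000, 24: 29_300, 32: 37_400}

model_890_tpm = {ways: estimate_tpm(cpw) for ways, cpw in model_890_cpw.items()}

for ways, tpm in model_890_tpm.items():
    print(f"{ways}-way Model 890: about {tpm:,.0f} TPM")
```

The 24-way and 32-way results come out slightly different from the rounded figures in the text, which is expected given how loose the 5 percent discount is.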
Last Updated: 6/10/02 Copyright © 1996-2008 Guild Companies, Inc. All Rights Reserved.