The X Factor: High-End Chips Draw Even, Vendors Prepare to Differentiate
August 7, 2006 Timothy Prickett Morgan
It has taken a long time, a lot of planning and roadmaps, and some coincidental delays in products and surprises in the performance delivered by chip makers, but the four major high-end processors used in big servers–the IBM Power, Sun Microsystems’ UltraSparc, the Intel Itanium, and the AMD Opteron–have more or less drawn even in terms of the high performance and dual-core capabilities that five years ago were only available from IBM. But the situation may not last for long, as these same chip makers are preparing to take their chips in different technological directions in the coming years to push the performance envelope further.
Chip makers and server vendors have been talking for a long time about consolidating functions formerly performed by separate circuits on a motherboard. In a very real sense, IBM’s introduction of dual-core processors in October 2001 was a watershed event, although it was hardly the first time circuits from outside a CPU were integrated into a CPU. Floating point math units, L1 and L2 caches, memory controllers, and chip interconnection schemes have all been moved on chip. But with dual-core chips, vendors could in effect get a baby symmetric multiprocessing server into a single socket, thereby doubling the performance of a machine. When it became apparent that ramping up clock speed beyond 4 GHz was not going to be easy using 130 nanometer, 90 nanometer, and even 65 nanometer processes, every chip maker was forced to look at its product roadmap and take the dual-core approach.
So here we are in the summer of 2006, and IBM has finally rounded out its System p5 server line with the Power5+ chips, which come in modules with one, two, or four chips (and thus two, four, or eight cores) in a single semiconductor package. The Power5+ chips are nowhere near the 3 GHz and higher clock speeds that IBM expected to be able to deliver using 90 nanometer technologies, topping out at 2.3 GHz in the high-end multichip modules (MCMs) used in the p5 595 servers and at 2.2 GHz in the dual-core modules (DCMs) used in the midrange p5 570s. The Power5+ MCMs are nearly a year late, too, as they were expected in the fall of 2005. These delays have pushed out Power6, too. The Power5+ chips are also available at 1.9 GHz and 2.1 GHz in various packages, and are even being made in a quad-core module (QCM) package that puts two Power5+ chips in a single socket (although running at a slower 1.5 GHz) for customers who need more threads for their applications.
The Power5 and Power5+ chips both have very efficient implementations of simultaneous multithreading, which carves up the chip’s pipeline and makes it look like two virtual pipelines instead of one physical pipeline. (Intel’s implementation of this is called HyperThreading, which is nowhere near as efficient as IBM’s virtual threads, and thus far, neither Sun nor AMD seems inclined to go this route with its UltraSparc and Opteron chips, respectively.) The chips have a shared 1.9 MB L2 cache on chip, plus integrated memory controllers and cache directories.
The Power5 processors offer roughly double the performance of the Power4 processors that date from 2001, and next year’s Power6 processors, which will be dual-core chips with lots of new features that will run at 4 GHz to 5 GHz, are expected to double the per-core performance again. Rather than going to four cores in a single socket on a single piece of silicon, IBM is moving to a 65 nanometer process and cranking up the clock. Yes, that is a very 1998 strategy. IBM could, however, jump to a four-core chip with the Power7 generation. Why is IBM now cranking the clock instead of boosting cores? Well, for one thing, IBM is a mainframe maker, and it knows a thing or two about batch jobs. While spreading many workloads around a machine with dozens of cores works great for infrastructure or transactional workloads, when it comes to batch performance, the single biggest factor that determines how well or poorly a batch job will run is clock speed. IBM apparently wants to be able to really crank it out on batch workloads compared to the competition with the Power6 chips. And if Power6 is indeed the foundation on which IBM will converge its mainframes into the Power server line (as many expect), then batch performance will be important.
With the dual-core “Montecito” Itanium 9000 processors that were announced last month, Intel has finally delivered an Itanium processor family with decent thermal properties and very good performance. This is the first dual-core Itanium, and the first Itanium to use HyperThreading, and like IBM’s Power chips, the move to dual cores actually delivers twice the performance of the prior single-core Itanium chip, known as “Madison.”
The Montecito chip has had a number of different design specs, and it is arguably at least 18 months late coming to market–more if you don’t want to be generous, and a lot more if you believe that Intel should have been able to get dual-core Itaniums into the field in 2002, when it had been obvious for years that the Power4 chips would have substantial advantages over Itanium chips. And, like the Power5+ chips, the Montecito Itanium processors have come out at a substantially lower clock speed than was planned. Originally, the target for Montecito was 2 GHz and higher, but it has come out at 1.6 GHz. There are six different Montecito chips, which offer different L3 cache sizes, different clock speeds, and different prices.
Each Montecito core has 16 KB of L1 instruction cache and 16 KB of L1 data cache, 256 KB of L2 data cache and 1 MB of L2 instruction cache, plus 12 MB of L3 cache. Intel is not unifying the L2 or L3 caches, as it could have done and probably will with future chips. In October 2005, when Montecito was delayed yet again, Intel dropped the 667 MHz front side bus that was expected for the high-end Montecito part, along with its “Foxton” Speed Burst technology, which would have allowed the Montecito to kick it up into high gear for short bursts if the system had enough room in its thermals to allow it. The Montecitos, as delivered today, have 400 MHz and 533 MHz front side bus speeds. They also have Intel’s VT hardware-assisted virtualization electronics and the “Pellston” error correction technology, which is now called Cache Safe Technology. With Cache Safe, a bad transistor in that large 24 MB cache block (which accounts for the vast majority of the 1.72 billion transistors in the Montecito chip) can be isolated in the event that a two-bit error hits the cache. (Chip makers and server makers have long since figured out how to cope with a single-bit error, but a two-bit error can cause a machine to crash. Cache Safe is a recognition that with such a large cache, the odds of a two-bit error go way up, so you have to cope with it gracefully.)
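The single-bit versus two-bit distinction comes from the SECDED (single error correct, double error detect) codes that cache ECC is built on. Here is a minimal sketch using a Hamming(7,4) code plus an overall parity bit–a real cache protects much wider words, but the mechanics are the same: one flipped bit is corrected in place, while two flipped bits can only be flagged as uncorrectable, which is the case a feature like Cache Safe has to handle gracefully.

```python
# SECDED sketch: Hamming(7,4) plus an overall parity bit (8 bits total).

def encode(d):
    """Encode 4 data bits into an 8-bit SECDED codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4                   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4                   # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4                   # covers positions 4, 5, 6, 7
    word = [p1, p2, d1, p3, d2, d3, d4] # Hamming positions 1..7
    return word + [sum(word) % 2]       # append overall parity bit

def decode(cw):
    """Return (data bits, status) where status is clean/corrected/uncorrectable."""
    word, overall = list(cw[:7]), cw[7]
    # Recompute the three parity checks; the failing ones spell out the
    # position of a single-bit error (the "syndrome").
    s1 = word[0] ^ word[2] ^ word[4] ^ word[6]
    s2 = word[1] ^ word[2] ^ word[5] ^ word[6]
    s3 = word[3] ^ word[4] ^ word[5] ^ word[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    parity_ok = (sum(word) % 2) == overall
    if syndrome == 0 and parity_ok:
        status = "clean"
    elif not parity_ok:
        status = "corrected"            # one flipped bit: fix it in place
        if syndrome:                    # (syndrome 0 means the overall
            word[syndrome - 1] ^= 1     #  parity bit itself flipped)
    else:
        status = "uncorrectable"        # two flipped bits: detect, don't fix
    return [word[2], word[4], word[5], word[6]], status
```

Flipping any one bit of a codeword yields `"corrected"` with the original data recovered; flipping two yields `"uncorrectable"`, at which point the hardware must retire the cache line rather than return bad data.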
On some workloads, particularly those involving lots of mathematical calculations, the Montecitos will go toe-to-toe with the Power5+ chips. The performance data is a bit thin at the moment for other workloads, but IBM is expected to do superbly on the TPC-C online transaction processing test, cresting above 4 million transactions per minute (TPM), while a 64-socket Superdome server should hit 2.5 million TPM and maybe as high as 3 million TPM using the Montecito chip. No one has tested Montecito on the SPECjbb2005 Java benchmark yet, so it is hard to compare it to UltraSparc and Opteron chips. But the differences are not expected to be substantial across a mix of benchmarks.
Looking ahead, Intel will deliver a kicker to Montecito that runs at a higher clock speed, maybe 2.5 GHz or 2.6 GHz and maybe in 2007. And further out into 2008, Intel is expected to get quad-core “Tukwila” Itaniums into the field, which will be competing against IBM’s Power6+, Sun’s “Rock” massively multithreaded future Sparc chips, and the Rev G quad-core Opterons. The Tukwila chip will have an on-chip memory controller and a new bus called the Common System Interconnect that smells an awful lot like the HyperTransport interconnect developed by AMD for the Opterons.
With the delivery of the “Panther” UltraSparc-IV+ dual-core processors last year, Sun finally got a dual-core Sparc RISC processor to market that did not have embarrassingly poor performance. In fact, in some cases (based on thin benchmark data, of course), the Panther chips offer comparable performance to Power, Itanium, and Opteron chips.
The UltraSparc chip family, like the Itanium line from Intel, has been plagued by delays since 2000, which was, sadly for Intel and Sun, just when IBM finally started to get its Power chip act together. The “Jaguar” UltraSparc-IV chips delivered in 2004 were Sun’s first dual-core chips, but the UltraSparc-IV+ chips did not come to market until September 2005, at 1.5 GHz–a bit slower than the 1.6 GHz to 1.8 GHz expected and about nine months later than Sun had planned.
But the chip offers such good performance that many of Sun’s customers and salespeople could finally breathe a sigh of relief. The dual-core 1.5 GHz Panther chip offers about five times the performance of the original 900 MHz single-core “Cheetah” UltraSparc-III processors that came to market in 2001, and does so within a 90-watt power envelope even though the chip has 2 MB of on-chip L2 cache and an in-package 32 MB L3 cache. The Panthers are the first chips Sun has designed with on-chip L2 cache and the first to have an L3 cache. These innovations, as well as other tweaks, are what give the Panthers competitive performance. Why Sun didn’t do this in 2003 or 2004, only Sun knows. Chip designs take at least five years, and they can’t just be turned around on a whim.
Sun hasn’t said much about its future “Rock” Sparc chips, but plans to make some announcements regarding them and the “Supernova” server designs that will make use of these chips before 2006 is over. What is known is what these chips are not. The Rock processors will not be a quad-core implementation of the UltraSparc chips; in fact, Sun will be using dual-core and quad-core chips from partner Fujitsu in the so-called Advanced Product Line, which is essentially the future Sparc64 servers from Fujitsu that Sun is rebadging as a stop-gap until Rock gets here.
The Rock chips will be made using a 65 nanometer process, and I happen to think that Rock really is something like a cluster of Sparc T1 multicore chips all glued together to look like a massive CPU as far as software is concerned. While Sun will not say this, the “Niagara” Sparc T1 processors were designed so they can be interconnected into SMP-like clusters on a single piece of silicon or together like a regular SMP on a single cell board. The Niagara-II kickers to the Sparc T1s will go from eight to 16 cores and from 32 to 64 threads, so ganging up a bunch of these would be a formidably threaded box. In fact, four of these chips lashed together to look like a single, virtual CPU would have 256 threads, which is about all that any operating system these days can handle. If Sun could get the processor clock speeds up to 2 GHz or more on such a device–if indeed this is what Rock might be–then it would be a very powerful processor.
Some rumors have suggested that the Rock chip will have four Sparc processing elements, each with four cores and a floating point co-processor, for a total of 16 cores on a single chip. This sounds an awful lot like a Niagara chip with higher clock speeds, math co-processors, and some steroids. All that Sun has said about the Rock chips is that they will deliver 30 times the performance of machines based on the 1.2 GHz UltraSparc-III processors. The chip also has a feature called hardware scout, which will use otherwise idle execution resources to run ahead and pre-fetch data into the cache after a cache miss, thereby improving efficiency.
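The run-ahead idea behind hardware scout can be illustrated with a toy memory-trace model. The 100-cycle miss latency, the prefetch depth of three, and the address trace are all made-up numbers for illustration, not Sun's design: the point is that while the main thread is stalled on a miss, the scout keeps walking the instruction stream and prefetches upcoming addresses, so later misses overlap with the current one instead of each paying full memory latency.

```python
# Toy model of "hardware scout" (run-ahead) execution over a memory trace.

MISS_LATENCY = 100   # illustrative cycles to fetch a line from memory

def run(trace, cached, scout=False):
    """trace: sequence of addresses; cached: addresses already in cache.
    Returns total cycles: 1 cycle per access plus a stall per miss."""
    cached = set(cached)
    cycles = 0
    for i, addr in enumerate(trace):
        cycles += 1
        if addr in cached:
            continue
        cycles += MISS_LATENCY          # stall while the line is fetched
        cached.add(addr)
        if scout:
            # During the stall, the scout runs ahead in the instruction
            # stream and prefetches the next few addresses, hiding their
            # latency under the miss that is already in flight.
            for future in trace[i + 1 : i + 4]:
                cached.add(future)
    return cycles

trace = list(range(10))                  # 10 accesses, all cold misses
print(run(trace, cached=[]))             # 10 + 10*100 = 1010 cycles
print(run(trace, cached=[], scout=True)) # 10 + 3*100  =  310 cycles
```

With scouting, only every fourth access actually stalls in this trace; the rest hit on lines the scout already pulled in, which is exactly the efficiency gain the feature is after.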
While the Opteron processor arguably lags somewhat in performance compared to 64-bit RISC and Itanium processors, and has not been available in scalable servers with the reliability features that are common in enterprise servers and have been adopted in varying degrees in the Power, Sparc, and Itanium lines, AMD seems to understand that the Opteron has to grow up a bit and be more than just a great processor for two-socket servers and workstations.
AMD has brought on a new team of developers and managers that have a deep knowledge of mainframe-class servers, and it is widely expected to start putting more RAS features into the Opteron designs and the servers that use them.
Within weeks, AMD will take the wraps off its Rev F Opteron designs, code-named “Santa Rosa.” The Rev F chips will have the same integrated 1 MB L2 cache per core and the same HyperTransport links as the current Rev E Opterons. They will, however, sport a new on-chip memory controller that supports DDR2 main memory. The Rev F chips will come in 68-watt Opteron HE and 95-watt Opteron standard variants. These chips will use a new 1,207-pin land grid array instead of the 940-pin and 939-pin packages we are all used to seeing.
To get to quad-core processors, AMD will be moving to a 65 nanometer process in 2007, which will include a totally revamped Opteron core, code-named “Deerhound,” also known as Rev G. These Rev G chips make the jump from HyperTransport 1.0 to the HyperTransport 3.0 interconnect, and have a new architecture that incorporates L3 cache. The rumor is that the L3 cache will be a few megabytes in size, and that it will be on the die, not in the package alongside the die. There is speculation that the Rev G chips will allow glueless interconnection of up to 32 cores–and maybe even more–without needing extra chipsets to do it, all through the improved HyperTransport. The original Opterons from years back allowed glueless interconnection of up to eight Opteron cores, but few vendors pushed it beyond two or four. It is anyone’s guess where the clock speeds will land for the Rev Gs, but in excess of 2.5 GHz, and maybe well over 3 GHz, is a good bet.
All in all, it looks like there will be plenty of competition in 2007, 2008, and beyond.