Cray Announces XT4, XMT Supercomputers
Published: November 13, 2006
by Timothy Prickett Morgan
The SuperComputing 2006 show is being held in Tampa, Florida, this week, and supercomputer maker Cray is the first major vendor in the space out the door with new products. And perhaps the two new Cray machines being announced this week--the XT4 and the XMT--will be among the more interesting supercomputers on display, even if they do not dominate the semi-annual Top 500 list of supercomputers that will also come out this week.
The XT4 and XMT machines represent the next step in a converged supercomputer architecture that Cray began talking about two years ago. That architecture will culminate, if the technology and the funding all go well, with the "Cascade" line of machines, which bring together many different kinds of supercomputing processors into a shared memory system that has enough intelligence to deploy applications on the right kinds of processors within that system. Cascade is about more than just converging hardware and choosing Linux for head and compute nodes in a cluster. It is about adding intelligence to the system so it can better manage lots of different kinds of workloads and, perhaps most significantly, get the most out of the money that the supercomputer centers of the world spend on all of this exotic gear.
The convergence strategy that Cray has been working on for several years is known by the code-name "Rainier." The idea, according to Steve Scott, Cray's chief technology officer, is to take the "SeaStar" interconnect that Cray created for the "Red Storm" massively parallel supercomputer project, mix it with Opteron or MTA compute nodes and I/O Opteron blades, and then build a cluster with a shared global file system that has a single point of login and a single set of administration tools. The Rainier systems have not, as yet, completely consolidated the Cray product lines.
The XT3 Opteron-based scalar machines, which were derivatives of the Red Storm, have been upgraded to the "Hood" XT4 machines and now can be used in the same infrastructure--the SeaStar interconnect and the chassis designs--as the "ThreadStorm" kickers to the MTA processors. The Tera MultiThreaded Architecture processors are used by the United States government for unknown purposes, but probably to sift through mountains of data for patterns and to crack codes and encryption. The XT4 uses the same 3D torus interconnect as the XT3, and breaks a machine down into application or compute modules and service or I/O modules. Cray has yet to get the field programmable gate array technology, which it got through its OctigaBay acquisition several years ago, into the Rainier machines, and it cannot yet get its X1E vector processors into the boxes, either.
Scott says that the integration is about 80 percent accomplished with the XT4 and XMT machines announced today. With the Cascade project, which follows Rainier, the integration will be complete, including a fully integrated machine with a shared global memory space that all of the architectures can address and, importantly, can use to pass data between different kinds of processors at memory--rather than I/O or network--speeds. Cascade machines will have FPGAs to boost performance of specific algorithms as well as vector processors to speed up matrix math.
The XT4 systems that Cray will put into the field today are based on server nodes derived from Advanced Micro Devices' Opteron 1000 series dual-core processors, which are really glorified Athlon 64 X2 processors. Cray will support quad-core Opteron 1000s when they are available around the middle of next year.
By using a single socket, dual-core chip instead of two single-core chips, the XT4's performance is not impacted by the need to keep cache coherence in a server node that implements symmetric multiprocessing. "The Opteron 1000s have a much lower cost and they give much better performance because they have less cache coherence overhead," explains Scott. Moreover, each processor socket has its own dedicated memory bandwidth and its own interconnect, which means there is less latency as nodes in the parallel machine communicate with each other.
The XT4 can have up to 30,000 dual-core Opteron 1000s, and the SeaStar2 interconnection chip plugs right into the HyperTransport links on the Opteron processors to make a very fast mesh of processors and memory for applications to play with. Each compute node runs a stripped down Linux kernel that has been tweaked to reduce operating system jitter; the kernels are synchronized so they all do their housekeeping work simultaneously, rather than forcing nodes to sit at a barrier waiting for each other to finish cleaning up.
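The payoff of synchronizing housekeeping is easy to see in a toy model (a sketch for illustration, not Cray's actual scheduler): in a bulk-synchronous job, every step ends at a barrier, so each step takes as long as the slowest node. If housekeeping is staggered randomly across nodes, some node is almost always doing it, and every step pays the penalty; if it is gang-scheduled, only one step in many does.

```python
# Toy model of OS jitter in a bulk-synchronous parallel job (a sketch,
# not Cray's scheduler). Each step ends in a barrier, so a step costs
# `base + delay` if ANY node is doing housekeeping during that step.
# Housekeeping runs every `period` steps on each node, at that node's
# offset within the period.

def total_time(offsets, steps=1000, period=100, base=1, delay=5):
    t = 0
    for step in range(steps):
        hk = any((step - off) % period == 0 for off in offsets)
        t += base + delay if hk else base
    return t

nodes = 1000
# Staggered: the 1,000 nodes cover every offset, so every step is slow.
unsynced = total_time(offsets=[n % 100 for n in range(nodes)])
# Gang-scheduled: all nodes clean house together, 1 step in 100 is slow.
synced = total_time(offsets=[0] * nodes)

print(unsynced, synced)  # the staggered run is several times slower
```

With these illustrative numbers the staggered run takes nearly six times as long as the synchronized one, even though each node does exactly the same amount of housekeeping either way.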
The SeaStar2 chip implements the router network linking the 30,000 processors into that 3D torus. By moving to the Opteron 1000s, the main memory bandwidth of the XT4 is twice that of the XT3, and the network injection rate--a measure of the interconnection bandwidth--is also twice that of the XT3, coming in at 2 GB per sec per socket.
The XT4 will begin shipping before the end of the fourth quarter of this year. Scott says that Cray has sold XT3 cabinets numbering in the high 300s, not including the Red Storm machine, of course. Since each XT3 cabinet holds 96 sockets, that works out to nearly 40,000 Opteron processor sockets in aggregate, which means Cray has a bunch of customers who want to upgrade to much faster boxes. With the XT4, in fact, Cray can build a petaflops system with 288 cabinets--about twice the physical size of Red Storm. Such a machine would have 27,500 Opteron 1000 processors and eat about 5.2 megawatts of power to deliver around 1.15 petaflops: twice the footprint of Red Storm, but more than ten times its performance.
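The figures above hang together if you run the arithmetic, using the stated 96 sockets per cabinet (the cabinet count of 380 is an illustrative midpoint of "high 300s", and the per-socket figure at the end implies the quad-core parts mentioned earlier):

```python
# Sanity-check the cabinet and performance figures quoted in the
# article. 96 sockets per cabinet is the stated XT3 density; other
# inputs are the article's own (rounded) numbers.
SOCKETS_PER_CABINET = 96

xt3_cabinets = 380                       # illustrative "high 300s" midpoint
xt3_sockets = xt3_cabinets * SOCKETS_PER_CABINET
print(xt3_sockets)                       # ~36,500, i.e. "nearly 40,000"

pf_sockets = 288 * SOCKETS_PER_CABINET   # 27,648 slots in 288 cabinets;
print(pf_sockets)                        # the 27,500 figure leaves a few
                                         # slots for service nodes

# Implied per-socket performance at 1.15 petaflops:
per_socket_gflops = 1.15e15 / 27_500 / 1e9
print(round(per_socket_gflops, 1))       # ~41.8 gigaflops per socket,
                                         # which implies quad-core chips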
As for pricing, Cray doesn't talk specifics. "An XT4 will definitely cost more per socket or teraflops than a whitebox cluster," explains Scott, "but it is a very different machine." Basic machines are in the range of $1 million for a starter box, and it is not unusual to spend tens of millions of dollars on an XT3; it won't be unusual on an XT4, either.
But you get something for your money. With traditional Linux clusters, each node runs a full Linux instance--and that is where the operating system jitter comes in. And in a lot of cases, clusters use machines with two or four processor sockets, so memory bandwidth has to be shared among the sockets in each node; ditto for I/O and networking. Moreover, different software tools and hardware are supported by different vendors, whereas Cray provides an integrated system with server nodes, storage, operating systems, and a programming environment.
The "Eldorado" XMT kicker to the MTA-2 supercomputer uses the "ThreadStorm" MTA processor, which is interesting in one regard. Cray has repackaged the old Tera MTA chip so it can plug right into a Rev E 940-pin Opteron socket. (This design was in the works two years before AMD started up the "Torrenza" project to get other chip makers to adopt the Opteron socket.) This means the ThreadStorm processor can plug into the XT4 frame and make use of SeaStar2 and HyperTransport interconnects.
By moving to the Opteron socket and the XT4 interconnect, the XMT machine has a lot more scalability than the MTA-2 box it replaces. Only two of the MTA-2s were sold (as far as anyone knows), but the scalability, more standard packaging, and reduced price of the XMT machines relative to the MTA-2s may give this old Tera architecture a new lease on life.
Like the MTA-2 chip, the ThreadStorm chip has lots of threads on it--in this case, 128 threads per processor. The ThreadStorm chip is different from other heavily multithreaded designs on the market today in that it was intended to be ganged up in a shared memory system, not used as a standalone chip in an isolated node. The MTA-2 box scaled up to 256 processors and 1 TB of memory, but the XMT will scale up to 8,192 ThreadStorm processors and 128 TB of shared memory for those processors. (Remember, this is globally addressable memory shared across nodes that each run their own operating system instance, not cache-coherent shared memory for an SMP setup running a single operating system instance.) Each Eldorado blade has four ThreadStorm chips, a SeaStar2 interconnect chip that links the blade to other blades in the parallel cluster, and up to 64 GB of main memory. The blades run a variant of BSD Unix called Unicos on the compute nodes and standard SUSE Linux from Novell on the service nodes.
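Those scaling figures are consistent with the stated blade layout, and they also imply an enormous hardware thread count at the top end:

```python
# Check the XMT scaling figures against the stated blade layout:
# 4 ThreadStorm chips and up to 64 GB of memory per blade, with
# 128 hardware threads per chip.
CHIPS_PER_BLADE = 4
GB_PER_BLADE = 64
THREADS_PER_CHIP = 128

max_processors = 8_192
blades = max_processors // CHIPS_PER_BLADE           # 2,048 blades
total_memory_tb = blades * GB_PER_BLADE // 1024      # 128 TB, as stated
hardware_threads = max_processors * THREADS_PER_CHIP # over a million threads

print(blades, total_memory_tb, hardware_threads)
```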
The XMT machine does not excel at floating point math. But the highly threaded nature of the MTA, MTA-2, and ThreadStorm processors makes them excellent boxes on which to run random database searches against multi-terabyte, memory-resident databases. "This is the kind of workload that brings a conventional processor to its knees," says Scott. In early tests, the customer using the Eldorado machine is seeing a 100-fold improvement in performance.
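A back-of-the-envelope model shows why heavy multithreading pays off on this kind of irregular, memory-bound work. A single thread chasing dependent pointers stalls for the full memory latency on every load; with many hardware threads sharing one pipeline, the core issues another thread's load while the rest wait. The memory latency below is an illustrative number, not a Cray spec; only the 128-threads-per-processor figure comes from the article.

```python
# Idealized model of latency hiding with hardware multithreading.
# Each thread must wait `mem_latency` cycles between its own dependent
# loads, but the core can issue one load per cycle across threads, so
# in steady state a load issues every max(1, latency/threads) cycles.
# Assumes each thread has its own independent chain of work.

def cycles(num_loads, mem_latency, threads):
    issue_interval = max(1.0, mem_latency / threads)
    return num_loads * issue_interval

LATENCY = 128  # illustrative DRAM latency in cycles, not a Cray spec
loads = 1_000_000
single = cycles(loads, LATENCY, threads=1)
threaded = cycles(loads, LATENCY, threads=128)  # ThreadStorm thread count

print(single / threaded)  # 128.0: latency is fully hidden in this model
```

In this idealized model the speedup equals the thread count; real workloads fall short of that, which squares with the roughly 100-fold improvement Scott cites.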
Which is one reason why Cray is looking to service providers and other interested resellers of supercomputing iron to peddle the new XMT servers on its behalf. Scott says that Cray will sell the XMTs directly to governments, but that this machine is well suited to fraud detection and other kinds of deep analysis of random data that businesses are increasingly coping with.
The XMT will ship in the first half of 2007.
Looking further down the road to the "Baker" designs in the Rainier project, Cray is putting the finishing touches on a new server blade called "Granite," which is expected to be built under Phase III of the High Productivity Computing Systems supercomputing project being sponsored by the Defense Advanced Research Projects Agency, the research and development arm of the U.S. Department of Defense that gave us the Internet, among many other things. The Phase III contract has not been awarded yet, by the way, and it is running late apparently because Congress can't get its act together. In any event, the Granite blade will merge the capabilities of the X1E vector processors created by Cray and the multithreaded processors created by Tera in what Scott promises to be a very innovative manner.
With all of this talk, you might be wondering what happened to the OctigaBay machines, now called the XD1s. The FPGA add-ons that this Linux-Opteron machine offered--and Red Storm did not--will eventually be added to the Rainier machines. And the Linux software infrastructure that OctigaBay created has gone into the current and future Rainier boxes. A new network interconnect chip coming with the "Baker" machine in 2009--the first Cascade box--is heavily influenced by the technology inside the XD1, too. Eventually, Cray will run only Linux on all portions of its machines. No more Unix.