
News While It's Still Hot
November 28, 2005

Cray's CTO Plans Its Future Converged Iron

by Timothy Prickett Morgan

If you didn't think that Microsoft was serious about getting into the high performance computing market--what we used to call the supercomputer market--a few weeks ago when it debuted a beta version of Windows for HPC clusters, the fact that Microsoft has just hired away the chief scientist at Cray shows that not only is Microsoft deadly serious, but that Cray is perhaps in a very tough place in its long history.

Still, Cray has lots of smart people, and it will proceed with its plans to make specialized Linux-Opteron supercomputers as well as its vector machines and the future "Cascade" supers, whatever they might end up being.

Burton Smith, who has taken a job at Microsoft, was chief scientist at Cray and the founder and chief scientist of Tera Computer. Tera created a multithreaded supercomputer that embodies some of the goals of the general purpose processors coming out today from chipmakers and server makers with their own chips. As is usually the case with supercomputers, ideas tested on supers eventually make their way, a decade or two later, to regular servers.

The MultiThreaded Architecture (MTA) supercomputer conceived by Smith and sold by Tera Computer made him famous and made Tera enough money that it could acquire the Cray vector supercomputer business in March 2000 from the then-struggling (and still struggling) Silicon Graphics. SGI had bought most of Cray in February 1996 in an effort to corner the market on indigenous supercomputing here in the United States. (Sun Microsystems bought the remaining piece, a Sparc-based super under development at Cray, and turned it into the high-end server business that made it billions of dollars in profits during the dot-com boom.) By the time Tera bought Cray, the MTA processors were made with a very expensive gallium arsenide process and reportedly cost around $1.5 million apiece. The top-end 16-processor machine at the San Diego Supercomputer Center cost around $6 million after heavy discounts, and it was cranky. But the MTA design had inherent parallelization and a single memory space for Fortran applications to frolic within, and it commanded the respect of the supercomputing community and of the National Security Agency, which bought some of the MTAs.

While Smith, as Cray's chief scientist, has been the point man on the Cascade project, the MTA has been pushed aside by the "Red Storm" Linux-Opteron supercomputer project at Sandia National Laboratories, which has been commercialized as the Cray XT3, the acquired OctigaBay Linux-Opteron line, which has been commercialized as the XD1, and the continuing development of the Cray X1 multi-streaming processor (MSP) vector machines. The plan for Cascade, which is due in 2010 or so, is to deliver a completely new architecture HPC system that has sustained petaflops performance. Cascade is being funded by the Defense Advanced Research Projects Agency (DARPA), the real inventor of the Internet and one of the big government agencies in the United States besides the Department of Energy that funds primary supercomputer research in the hopes of getting very powerful iron to play with.

Steve Scott, who was the chief architect at Cray before Tera bought it and the main designer of the Cray X1 vector machines and the Cray T3E parallel machines, joined the original Cray in 1992 right after getting his PhD in computer architecture from the University of Wisconsin. Back then, Seymour Cray, perhaps the greatest genius in supercomputing, was still in charge. Scott is Cray's current chief technology officer, and how that is distinct from chief scientist is something that no longer matters now that Smith has left Cray.

Much has been made of Cray's problems in delivering the "Red Storm" Linux-Opteron supercomputer to Sandia over the past year. The current Red Storm machine has 10,880 Opteron processors running at 2 GHz, which delivers a peak performance of 43.5 teraflops. While that is not the best in the world in aggregate teraflops, the efficiency of the Red Storm machine, at least as measured by the Linpack Fortran benchmarks, is very high. Red Storm delivers 36.19 teraflops of sustained performance, which means the machine is hitting 83 percent of its peak performance. This is a little bit better than the BlueGene/L super over at Lawrence Livermore National Laboratory and only a little bit less than the "Columbia" Itanium-Linux cluster from SGI running at NASA Ames. But the efficiency of the Red Storm machine just blows away that of the "Thunderbird" Xeon-InfiniBand Linux cluster running right next to Red Storm at Sandia, which has 64.5 peak teraflops of power but delivers only 38.3 teraflops of sustained performance--59 percent of peak. Cray says that the Red Storm design can scale to around 60,000 Opteron cores, which should mean Red Storm can scale to 300 teraflops or more of peak performance, depending on how well the "SeaStar" 3D interconnect, created by Cray to lash Opteron processors and their HyperTransport links together, works.
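The efficiency figures above are simply the ratio of sustained Linpack performance to theoretical peak. A quick sketch using the teraflops numbers quoted in the text:

```python
def linpack_efficiency(sustained_tflops: float, peak_tflops: float) -> float:
    """Return sustained performance as a percentage of theoretical peak."""
    return 100.0 * sustained_tflops / peak_tflops

# Figures quoted above for the two machines at Sandia.
red_storm = linpack_efficiency(36.19, 43.5)     # Cray XT3 "Red Storm"
thunderbird = linpack_efficiency(38.3, 64.5)    # Xeon-InfiniBand cluster

print(f"Red Storm:   {red_storm:.0f}% of peak")    # 83% of peak
print(f"Thunderbird: {thunderbird:.0f}% of peak")  # 59% of peak
```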

There is still a place for purpose-built supercomputers in the world, like Red Storm, to be sure, but Scott has a sense of humor about the experience with the Red Storm machine. "This is not what you would choose as a first customer shipment of a new design," he explains. "But it is fully installed and running and we've got our money." And that is the point. And equally important is the fact that the Red Storm machine, which has a proprietary interconnect called SeaStar that was developed by Scott and his colleagues, has been commercialized by Cray as the XT3. So far, Cray has sold 16 of these XT3 machines, including an earthquake modeling system rated at 10 teraflops at the Pittsburgh Supercomputer Center called Big Ben, which will have twice the processing capacity of the "LeMieux" Alpha-Tru64 cluster it replaces, and a 20-teraflops combustion simulator at Oak Ridge National Laboratories (which also has recently upgraded its Cray X1E system to an 18.5 teraflops configuration). ORNL, says Scott, plans to boost its Cray system capacity to 100 teraflops in 2006 and to 250 teraflops by 2007 as researchers from all over the country want access to do all kinds of simulations. Cray has sold 27 of the vector-style X1E machines to date.

In early 2004, Cray bought OctigaBay, an upstart Linux-Opteron supercomputer maker with a very sophisticated design that included a lot of field programmable gate arrays (FPGAs), which are add-on co-processors that can massively speed up certain algorithms used in specific supercomputer applications. The OctigaBay machines were designed to scale to 12,000 Opteron cores, but Cray has largely positioned them as an entry machine compared to the XT3. Scott has high hopes for the FPGA approach. In recent tests using a single Opteron core equipped with a single FPGA running some customer code, the XD1 was able to provide a 28X performance improvement on the code when the FPGA was activated. "This kind of speed up is what has driven people to build special-purpose hardware. But real hardware development like this is very, very expensive." FPGAs split the difference. They allow a relatively general purpose machine, like a Linux-Opteron cluster with a fast interconnect, to be custom-programmed for specific algorithms running on FPGAs. "The real crunchy bits run on the FPGAs," explains Scott, "and you can tweak the algorithms on the fly." This is probably why Cray has sold 90 of the XD1 machines since they became commercially available in the fall of 2004.
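The 28X figure applies to the offloaded kernel itself; the end-to-end gain depends on how much of the run that kernel occupies, which Amdahl's law makes plain. A minimal sketch (the workload fractions below are hypothetical, not from Cray's tests):

```python
def amdahl_speedup(fraction_accelerated: float, kernel_speedup: float) -> float:
    """Overall speedup when only a fraction of the runtime is accelerated."""
    return 1.0 / ((1.0 - fraction_accelerated) + fraction_accelerated / kernel_speedup)

# Hypothetical workload mixes, using the 28X FPGA kernel speedup cited above.
for frac in (0.50, 0.90, 0.99):
    overall = amdahl_speedup(frac, 28.0)
    print(f"{frac:.0%} of runtime in the FPGA kernel -> {overall:.1f}X overall")
```

The lesson matches Scott's "real crunchy bits" remark: the FPGA pays off when the offloaded algorithm dominates the runtime.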

While Cray wants to pursue as many big deals as it can and make a name for itself on the Top 500 supercomputer rankings, Scott's philosophy on where supercomputing is headed is a bit different. "Scientists have to start coupling different models and bring different kinds of simulations to bear so they can solve different kinds of problems--not just a bigger problem," he says. "And no matter what, you want to be able to fit the computer to the application, not the reverse." That's why Cray has many different kinds of supercomputers.

The work at Oak Ridge is a case in point. It involves running computational fluid dynamics (CFD) and turbulence simulations on a mix of vector and parallel machines to deliver the first integrated, 3D, CFD-turbulence simulation of a premixed methane-air Bunsen burner flame. I know, that sounds very exciting. But when these simulations, which have never been done in this way before, help make the fuel ignition in your car work better and help you boost your gas mileage, you will have the bit-twiddlers to thank for it.

Looking ahead, Cray has a few things cooking. First, the National Security Agency wants a kicker to its MTA machine, and sometime in 2006, a kicker dubbed "Eldorado" will be delivered as the MTA-3. What Scott did not say is that Eldorado will be an MTA machine that uses Red Storm's SeaStar interconnection scheme and cabinetry. It will literally plop four MTA processors onto a Red Storm motherboard that has been equipped with a set of four SeaStar2 interconnection processors. Just as is the case with Red Storm, the front-end service and I/O nodes in the machine will run Linux, while the back-end nodes will run their own operating system and the applications. In the case of Red Storm, that back-end operating system is "Catamount," a stripped-down lightweight kernel, and on the MTA machines, the back-end OS is MTX, a BSD Unix variant created by Tera. The MTA-2 had processors running at 220 MHz and scaled to 256 processors and 1 TB of shared main memory, but the Eldorado machine will be a lot larger and more powerful, with clock speeds on the MTA-3 chips boosted to 500 MHz and system scalability radically increased to 8,192 processors and 128 TB of shared main memory. The MTA-3 processors and chipsets taped out in early 2005, according to other sources, and prototype systems were expected around now. A machine with 500 to 1,000 processors was expected to be delivered to the NSA near the end of the first quarter of 2006. After being burned by chip supply problems from former partner IBM on the X1 and X1E engines, Cray has gone to Taiwan Semiconductor Manufacturing Company to fab the Eldorado processors.

Cray just moved to dual-core processors with its X1E machines this year. Each multichip module in the X1E supercomputer has two processors, which each have four multistreaming processors that run at 1.13 GHz and deliver 18 gigaflops of performance on vector supercomputer workloads. The kicker to the X1E is code-named "Black Widow." In June of this year, Uncle Sam ponied up $17 million to be matched dollar for dollar by Cray to create the Black Widow machines. The Black Widow machine is expected to prototype in the first quarter of 2006 with deliveries in 2007 and a kicker in 2008 called Black Widow-II. Scott was a little thin on the details of what would go into the Black Widow processors, except to say that it would have substantially improved scalar performance as well as improved vector performance. The Black Widow machines, probably to be sold as the X2, will almost certainly make use of a 3D mesh interconnection scheme like Red Storm, if not the same exact interconnect (the X1E has a 2D interconnect, which limits its scalability). Oak Ridge was supposed to be getting a Black Widow machine in the 120 teraflops range by about this time, and that 250 teraflops plan calls for the installation of an X2 machine, as it turns out. In a very real sense, Oak Ridge is helping design the future Cray product lines. The Black Widow processors will be fabbed by Texas Instruments using a 90 nanometer process.

This year, Cray delivered dual-core Opterons for the XD1 supercomputers, and it plans to do the same with the Red Storm-derived XT3 supers in 2006. Further down the road--and Scott will not be precise--the XT3 and XD1 lines will converge with the "Adams" and "Baker" Cray systems. This convergence will apparently be gradual. The reason Cray is doing this is simple. The XD1 interconnect has lower latency, and the XD1 has a much cleaner Linux approach in that its compute nodes run stock Linux. So the Adams and Baker machines will do this, too. The XT3 systems have a 3D mesh interconnect, which allows high scalability, but the system design is too expensive to scale down below a few hundred processors. So the guts of the communication ASICs in the XD1s will be used to create parts of the future, converged Linux-Opteron machines, according to Scott. From the looks of things, the Adams platform will converge the XT3 and XD1 product lines, while the Baker platform will fold in the MTA-3 and vector MSP processors as well. That way, Cray customers will be able to pick and choose the processor elements they want to run their applications. There will be kickers to both the Black Widow machines, which only support MSPs, and the Baker boxes, and then the Cascade machines will come out around 2010.

Not much is known about Cascade, and Scott didn't want to do anything to spoil the fun. But some details leak out, even among the super-secret HPC community, because people just have to brag. The Cascade architecture will apparently couple together two levels of processing elements, the MSPs for heavyweight processing and multiple lightweight processors networked together to assist the vector engine. These lightweight engines will have many processors per die and be heavily multithreaded, which would lead one to believe that they will either be Opteron or MTA chips--maybe even Opteron chips with some special MTA sauce tossed in for flavor.

As part of the Cascade project, Cray is creating a new programming language called Chapel, which is short for Cascade High Productivity Language and which Cray hopes will simplify the programming of parallel applications on machines such as Cascade.

Sponsored By

Clusterworx® Whitepaper

High performance Linux clusters can consist of hundreds or thousands of individual components. Knowing the status of each CPU, memory, disk, fan, and other components is critical to ensure the system is running safely and effectively.

Likewise, managing the software components of a cluster can be difficult and time consuming for even the most seasoned administrator. Making sure each host's software stack is up to date and operating efficiently can consume much of an administrator's time. Reducing this time frees up system administrators to perform other tasks.

Though Linux clusters are robust and designed to provide good uptime, occasionally conditions lead to critical, unplanned downtime. Unnecessary downtime of a production cluster can delay a product's time to market or hinder critical research.

Since most organizations can't afford these delays, it's important that a Linux cluster comes with a robust cluster monitoring tool that:
  • Provides essential monitoring data to make sure the system is operational.
  • Eliminates repetitive installation and configuration tasks to reduce periods of downtime.
  • Provides powerful features, but doesn't compromise on usability.
  • Automates problem discovery and recovery before events become critical.

This paper discusses the features and functions of Clusterworx® 3.2. It details how Clusterworx® provides the necessary power and flexibility to monitor over 120 system components from a single point of control. The paper also discusses how Clusterworx® reduces the time and resources spent administering the system by improving software maintenance procedures and automating repetitive tasks.

High Performance Monitoring

Each cluster node has its own processor, memory, disk, and network that need to be independently monitored. This means individual cluster systems can consist of hundreds or thousands of different components. The ability to monitor the status and performance of all system components in real time is critical to understanding the health of a system and to ensure it's running as efficiently as possible.

Because so many system components need to be monitored, one of the challenges of cluster management is to efficiently collect data and display system health status in an understandable format. For example, let's say a cluster system has 100 nodes and is running at 97 percent usage. It's very important to know whether 100 nodes are running at 97 percent usage or whether 97 nodes are running at 100 percent usage while three nodes are down.
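The 100-node example above is the whole argument for per-node monitoring: an aggregate average cannot tell a lightly loaded healthy cluster from a saturated one with dead nodes. A minimal sketch (the node counts are the hypothetical figures from the text, not Clusterworx output):

```python
# Two hypothetical 100-node clusters with the same 97 percent aggregate usage.
healthy = [97.0] * 100                 # 100 nodes each running at 97%
degraded = [100.0] * 97 + [0.0] * 3    # 97 nodes pegged at 100%, 3 nodes down

def summarize(name: str, loads: list, down_threshold: float = 1.0) -> None:
    """Print the aggregate average plus the per-node detail it hides."""
    avg = sum(loads) / len(loads)
    down = sum(1 for load in loads if load < down_threshold)
    print(f"{name}: average {avg:.1f}%, nodes down: {down}")

summarize("healthy", healthy)    # average 97.0%, nodes down: 0
summarize("degraded", degraded)  # average 97.0%, nodes down: 3
```

Both clusters report the same average; only the per-node view exposes the three failed nodes.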

Clusterworx® provides real-time analysis of over 120 essential system metrics from each node. Data is displayed in easy-to-read graphs, thumbnails, and value tables. Clusterworx® collects data from groups of nodes to spot anomalies, then drills down to single node view to investigate problems. This allows users to determine exactly what the problem is before taking corrective action.

Clusterworx® also tracks the power and health state of each node and displays its status using visual markers in a node tree view throughout the user interface. Power status shows whether the node is on, off, provisioning, or in an unknown state. The health state tracks informational or warning messages and critical errors. Health state messages are displayed in a message queue on the interface.

Clusterworx®'s comprehensive monitoring and easy-to-read charts and graphs allow users to quickly assess the state of each node and the overall system at a glance, while providing the necessary information to make informed decisions about the cluster system.

To read the rest of this whitepaper, please visit

Editors: Dan Burger, Timothy Prickett Morgan, Alex Woodie
Publisher and Advertising Director: Jenny Thomas
Advertising Sales Representative: Kim Reed
Contact the Editors: To contact anyone on the IT Jungle Team
Go to our contacts page and send us a message.

Copyright © 1996-2008 Guild Companies, Inc. All Rights Reserved.
Guild Companies, Inc. (formerly Midrange Server), 50 Park Terrace East, Suite 8F, New York, NY 10034