ClearSpeed Tweaks Math Coprocessors, Shows Off Benchmarks
Published: May 8, 2007
by Timothy Prickett Morgan
ClearSpeed Technology, which makes math co-processor boards for workstations and servers, has added a new accelerator board to keep pace with the peripheral slot changes in motherboards. But perhaps more significantly for its efforts to sell its products, ClearSpeed has run some benchmarks that show the kind of performance customers running real number-crunching workloads can expect to see, and how they can trade those performance gains against a smaller server footprint in clustered supercomputer environments by using the company's Advance accelerators.
The existing Advance board is called the X620, and it has two 96-core CSX600 math processors on each PCI-X peripheral card; this is a three-quarter length, half height PCI-X card. The new card, called the E620, is a PCI-Express x8 peripheral card that is a half-length card that is still full height. The Advance cards burn about 25 watts of power per board, and deliver around 55 gigaflops of double-precision, 64-bit floating point math power. The CSX600 chip has four banks of 24 cores running at 210 MHz, with about one quarter of the chip being floating point units and the rest being memory and supporting interfaces and ports. The whole shebang has 128 million transistors and each CSX600 chip burns about 10 watts.
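The board figures quoted above can be cross-checked with a little arithmetic. The following is only a sketch using the approximate numbers in this article; the constant and function names are made up for illustration.

```python
# Approximate figures quoted in the article for the Advance board.
CHIPS_PER_BOARD = 2      # two CSX600 chips per card
BANKS_PER_CHIP = 4       # four banks of cores per chip
CORES_PER_BANK = 24
BOARD_GFLOPS = 55.0      # double-precision gigaflops per board (approximate)
BOARD_WATTS = 25.0       # power draw per board (approximate)


def cores_per_board() -> int:
    """Total math cores on one Advance board."""
    return CHIPS_PER_BOARD * BANKS_PER_CHIP * CORES_PER_BANK


def gflops_per_watt() -> float:
    """Double-precision gigaflops delivered per watt, per board."""
    return BOARD_GFLOPS / BOARD_WATTS


if __name__ == "__main__":
    print(f"cores per board: {cores_per_board()}")
    print(f"gigaflops per watt: {gflops_per_watt():.1f}")
```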
According to Peter ffoulkes, director of outbound marketing for ClearSpeed, the extra bandwidth that comes with the new PCI-Express card doesn't really help all that much on the kinds of HPC workloads that customers buy the Advance cards to goose. "You would think that the bandwidth would help, but when doing matrix math, the amount of computation completely overwhelms the I/O," says ffoulkes.
Both the existing PCI-X and new PCI-Express cards cost around $8,000 in single unit quantities, with unspecified but significant volume discounts. This is, says ffoulkes, about what a field programmable gate array co-processor costs for an X86 server, and it is less than half of what IBM is charging for its QS20 blade server, which has two of its "Cell" Power vector processors on it. "Everything is in the same order of magnitude of price except for the graphics processing units, which are an order of magnitude cheaper than any of these options," says ffoulkes. Of course, as popular as the GPU idea is as a co-processor for HPC workloads, GPUs only offer single-precision floating point math and they are not yet supporting 64-bit math. So they have their limits, too.
In addition to launching the new PCI-Express card, ClearSpeed has also launched a random number generator for the cards, which supports the Monte Carlo simulations that the financial services industry uses to balance its portfolios and make money off your money. The company has also updated its vector math library, called CSXL, to the 2.5 release, which provides native support for Microsoft's Windows platform; Linux was supported in prior releases. The random number generator was an add-on before CSXL 2.5. ClearSpeed has also created a tool called Visual Profiler, which allows programmers to look deep into the way the code is running on their X64 systems with Advance cards and see where the bottlenecks are in the system that might be inhibiting performance.
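To see why a fast random number generator matters for Monte Carlo work, consider a minimal sketch of pricing a European call option by simulation. This is not ClearSpeed's code and does not use CSXL; it assumes a standard geometric Brownian motion model and uses Python's built-in generator, and the function name and parameters are hypothetical. Every sample consumes one random draw, so the generator sits in the innermost loop.

```python
import math
import random


def mc_european_call(spot, strike, rate, vol, maturity, n_samples, seed=0):
    """Estimate a European call price by Monte Carlo simulation.

    Assumes the stock follows geometric Brownian motion (an assumption of
    this sketch, not something the article specifies).
    """
    rng = random.Random(seed)                      # seeded for repeatability
    drift = (rate - 0.5 * vol * vol) * maturity    # risk-neutral drift term
    diffusion = vol * math.sqrt(maturity)          # scale of the random shock
    payoff_sum = 0.0
    for _ in range(n_samples):
        z = rng.gauss(0.0, 1.0)                    # one normal draw per sample
        terminal = spot * math.exp(drift + diffusion * z)
        payoff_sum += max(terminal - strike, 0.0)  # call payoff at expiry
    # Discount the average payoff back to today.
    return math.exp(-rate * maturity) * payoff_sum / n_samples


if __name__ == "__main__":
    price = mc_european_call(100.0, 100.0, 0.05, 0.2, 1.0, 200_000, seed=42)
    print(f"estimated call price: {price:.2f}")
```

Because each sample is independent, the loop splits cleanly across many cores, which is what makes this kind of workload a natural fit for a math accelerator.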
To help it better sell its Advance boards, ClearSpeed ran a Monte Carlo simulation on an X64 server with two 3 GHz processors equipped with 3 GB of main memory and Red Hat Enterprise Linux 4. The simulation, which included 400 million data samples using a European pricing model for stocks, could run at 6.4 million samples per second with one X64 core activated, and doubled to 12.9 million samples per second with two cores turned on. Then, ClearSpeed slipped in one of its Advance cards, and the rate rose to 130.9 million samples per second (which means the job that took 60 seconds with a single CPU core finished in 2.9 seconds with one Advance card). A second Advance board added to this server nearly doubled performance to 260 million samples per second, a third card drove it to 386.1 million samples per second, and the fourth card pushed it up to 505.1 million samples per second. While that last card didn't deliver a linear performance improvement, this is an incredible increase in the speed at which the Monte Carlo simulation can run--from 60 seconds down to 0.8 seconds.
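The throughput figures above can be turned into run times, speedups, and scaling efficiency with a few lines of arithmetic. This is a sketch based only on the sample rates quoted in the article; the names are made up for illustration. Note that the run times computed from the quoted rates differ slightly from the rounded times in the text (400 million samples at 6.4 million per second works out to 62.5 seconds, not 60).

```python
TOTAL_SAMPLES = 400.0   # millions of samples in the simulation
BASELINE_RATE = 6.4     # million samples/s on one X64 core
ONE_CARD_RATE = 130.9   # million samples/s with one Advance card

# Rates quoted in the article, keyed by number of Advance cards.
CARD_RATES = {1: 130.9, 2: 260.0, 3: 386.1, 4: 505.1}


def runtime_seconds(rate: float) -> float:
    """Time to process the whole simulation at a given sample rate."""
    return TOTAL_SAMPLES / rate


def speedup(rate: float) -> float:
    """Speedup relative to a single X64 core."""
    return rate / BASELINE_RATE


def scaling_efficiency(cards: int, rate: float) -> float:
    """Fraction of perfect linear scaling achieved with `cards` boards."""
    return rate / (cards * ONE_CARD_RATE)


if __name__ == "__main__":
    for cards, rate in CARD_RATES.items():
        print(
            f"{cards} card(s): {runtime_seconds(rate):.2f} s, "
            f"{speedup(rate):.1f}x one core, "
            f"{scaling_efficiency(cards, rate):.1%} of linear"
        )
```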
ClearSpeed has also run the industry-standard Linpack benchmark on a cluster of X64-based ProLiant DL380 servers with two dual-core "Woodcrest" Xeon 5150 processors (which run at 3 GHz) and 14 GB of memory. (Linpack is, of course, the Fortran matrix math benchmark that is used to rank the Top 500 supercomputers.) A four-node cluster of these DL380 machines was able to do 114.8 gigaflops on the Linpack test, but adding an X620 Advance card to each node more than doubled performance to 251.3 gigaflops. To get roughly the same performance (258.3 gigaflops, specifically), it took nine of those ProLiant DL380 server nodes, which took up more than twice as much space and burned more than twice as much power. (The four-node cluster burned an average of 1,722 watts, the accelerated four-node cluster burned an average of 1,750 watts, and the nine-node cluster without accelerators burned 3,875 watts.) Clearly, it makes more sense to use the co-processor than to double up on the servers. And if you look carefully, you can see that offloading some work to the Advance co-processor boards actually makes the server CPUs a little bit less hot: four boards at about 25 watts apiece would add roughly 100 watts on their own, yet the accelerated cluster drew only 28 watts more than the plain four-node cluster.
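The Linpack comparison is easier to weigh as performance per watt. The sketch below uses only the gigaflops and wattage figures quoted above; the configuration labels and function names are made up for illustration.

```python
# (gigaflops, average watts) for each cluster configuration in the article.
CONFIGS = {
    "4 nodes": (114.8, 1722.0),
    "4 nodes + Advance cards": (251.3, 1750.0),
    "9 nodes": (258.3, 3875.0),
}


def gflops_per_watt(gflops: float, watts: float) -> float:
    """Linpack gigaflops delivered per watt of average power."""
    return gflops / watts


def power_ratio(config_a: str, config_b: str) -> float:
    """How many times more power config_a draws than config_b."""
    return CONFIGS[config_a][1] / CONFIGS[config_b][1]


if __name__ == "__main__":
    for name, (gflops, watts) in CONFIGS.items():
        print(f"{name}: {gflops_per_watt(gflops, watts):.4f} gigaflops/watt")
    ratio = power_ratio("9 nodes", "4 nodes + Advance cards")
    print(f"9 nodes vs. accelerated 4 nodes: {ratio:.2f}x the power")
```

On these numbers, the accelerated cluster delivers a bit more than twice the gigaflops per watt of either unaccelerated configuration.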
RELATED STORIES
ClearSpeed Ships New Math Accelerator, Inks Deal with IBM
ClearSpeed Ships Advance Co-Processors in Giant Sun Supercomputer
Sun, NEC, and AMD Partner for 50 Teraflops Opteron Cluster