The Cognitive Systems/500 2018 Edition
June 12, 2017 Timothy Prickett Morgan
With the Power9 processor not coming to the IBM i and AIX platforms until sometime early in 2018, rather than right now as many of us were led to believe would be happening, we have some extra time on our hands. So we should all – including the executives who run IBM Systems and its subordinate Cognitive Systems division (formerly known as Power Systems) – take this opportunity to take a hard look at how the IBM i platform is packaged and priced and how a modern integrated platform should be architected.
The IBM i customer base needs the Cognitive Systems division, however nebulous Big Blue’s use of that cognitive term may seem, to succeed. Cognitive is no different in its broad applicability (if you want to be charitable) or vagueness (if you don’t feel like it) from e-commerce or deep computing or cloud or any of the many other types of computing that IBM has said its various systems are designed for. This is how marketing works, although the reality is often less dramatic than the hype. Less drama is fine for conservative businesses that can’t afford to be cut by the bleeding edge. But these same companies are perfectly fine being just behind those that do blaze the trails.
As we have pointed out many times in the past, the System/38 and AS/400 businesses of days gone by were innovators in many different kinds of technologies, even if they were largely deployed by small or midrange businesses instead of the Global 2000. So it is not that IBM can’t take bleeding edge technologies and package them in such a way that “normal” companies can use them. The real genius is to take such technologies and make them part of a system in such a way that they are utterly transparent.
The System/38 was the first commercial relational database management system, and RPG and COBOL were tweaked from prior systems to make use of that relational database technology conceived by IBM researchers in the 1970s and brought to market in many forms by many companies and open source efforts in the following years. (Much as Google invented the MapReduce technique of crunching unstructured data stored in object file systems and wrote a paper explaining how it worked, then Yahoo cloned it and open sourced it, and finally others such as Cloudera, HortonWorks, and MapR have hardened and extended that software, known as Hadoop, wrapped it in services, and currently make north of a half billion dollars a year supporting it.) Single-level storage, where main memory and permanent storage such as disk and now flash are addressed as a single address space, is still an innovation that is only in the System/38 and its progeny; various memory and disk technologies made their debut in the IBM midrange, and so have myriad database and programming techniques. We have had our share of innovation.
Technology is not the problem with the IBM i platform, which has gotten much better at absorbing new stuff, with the integration of Ruby, Python, and Node.js being but three more modern examples. I think that IBM is good at following the industry’s lead, but it has perhaps become a little complacent about following rather than leading, and it has also forgotten that the key thing that the System/38 and AS/400 did was absorb new technologies, created by IBM or by others, and integrate them in such a way that they were truly part of the system and in many cases invisible. This is what made the System/38 innovative and, 10 years later, what made the AS/400 the dominant midrange platform in the world at a time when computing was not nearly as pervasive as today. One could argue that the AS/400 in particular is why sophisticated relational-based transaction processing became pervasive in the first place.
Nearly three decades later, of course, the world is much larger than transaction processing. The Web created a whole new presentation layer as well as a means of extending the corporate datacenter out into the broader world to reach anyone and everyone. And now, the world is being driven by analytics using a mish-mash of structured and unstructured data and, even more recently, by machine learning algorithms that can make associations in large sets of data that begin to look and smell like artificial intelligence. Self-training machine learning algorithms running against large datasets are the new backends of the datacenter. They not only drive various kinds of identification of unstructured data such as speech, text, image, and video, as well as translations between these different forms, but they also drive the search engines and recommendation engines that make the suggestions that drive the transactions. They are the beginning and the end of data; the transaction is just a tiny, almost invisible, slice in the middle. It is vital, though, since that is how the bills get paid, so one must not equate the amount of compute required for a job with its absolute importance.
My point is this: A revamped and refreshed Power system, built for the cognitive era, should reflect this reality, and it ignores this reality at its peril. So, I have a few suggestions as to how to create what I am calling the Cognitive Systems/500 2018 Edition, which is a system aimed at small and medium companies that does all the smart things that hyperscalers like Google and Facebook do with regard to unstructured data and machine learning, and that high performance computing centers do with regard to simulation and modeling, as well as the transaction processing and application serving that we are all well aware of.
As I have explained before, IBM has been laying the foundation of such a system as part of the US Department of Energy’s “Summit” and “Sierra” massively parallel CPU-GPU hybrid systems. IBM was the first to break through the petaflops barrier, using a combination of AMD Opteron chips and Cell Power vector coprocessors in the “Roadrunner” system it built for the DOE’s Los Alamos National Laboratory back in the 2000s; the machine was decommissioned in 2013, but the ideas about hybrid computing that IBM and the DOE created live on in the top supercomputers running in the world today. As it turns out, a GPU processor is itself massively parallel, and Nvidia has done a brilliant job taking the GPUs used for displaying images and video on our computer screens and making them work as math coprocessors for HPC simulations and models.
This work was done in parallel with IBM’s Roadrunner system, and Nvidia’s CUDA environment brought hybrid computing to the masses and helped to speed up scientific and technical applications – the kinds used by manufacturers, drug makers, as well as national supercomputing labs – by an order of magnitude over CPU-only setups. About five years ago, the hyperscalers, who are all hypercompetitive with each other when it comes to image recognition and natural language processing, figured out that by moving their machine learning training algorithms to GPUs they could not only speed up the training of their models, but also throw more data at the training and thus start improving the effectiveness of the models. In five short years, these machine learning techniques have resulted in applications that can not only best humans in just about any kind of recognition task, but can also teach themselves how to play games and, we think, in the not too distant future, write code or write articles about technology as it advances or do just about any other task I can think of where people derive paychecks today.
Yup. Let that sink in.
This is a revolution that no one is planning, and no one is doing damage control. And what I am about to suggest will hasten this to a certain degree. So, if I were you, I would buy some land and learn how to grow some food.
The nodes at the heart of the Summit supercomputer, which are code-named “Witherspoon,” have a tremendous amount of performance. They have two 24-core Power9 processors and six of Nvidia’s latest “Volta” generation of GV100 GPU coprocessors, each with a tremendous amount of floating point and machine learning power. Each Volta coprocessor has 16 GB of stacked and integrated HBM2 main memory that delivers 900 GB/sec of bandwidth – about 2.5X that of the Power8 chip, just to give you a reference. The Volta GPU has 7.5 teraflops of double precision (64-bit) floating point performance and 15 teraflops of single precision (32-bit) floating point performance. These floating point operations are commonly used by HPC applications for all kinds of simulations and models, and could be used at manufacturers to design products or at all companies to take in telemetry streaming in from all manner of products and do predictive maintenance on them. The Volta chips also have special math units that do 8-bit integer math, which is used by certain machine learning algorithms, and also have new Tensor Core units that specialize in the tensor matrix math that is at the heart of machine learning and do it very well. A single Volta delivers 120 trillion operations per second of such performance, which is an order of magnitude better than GPUs using 64-bit floating point can do.
The Summit machine has 4,600 of these nodes, each of which has a total of 512 GB of DDR4 main memory, another 800 GB of unspecified non-volatile memory (we think it is flash, not Intel’s and Micron Technology’s 3D XPoint memory-addressable persistent memory), and 96 GB of HBM2 memory across those six GPUs, which are lashed together in such a way that they can share memory using NVLink interconnects. The NVLink ports on the Power9 chip hook the GPU complex to the CPU complex in such a way that they have a kind of single level storage across the DDR4 and HBM2. Which is neat. By the way, that Summit machine is expected to deliver over 200 petaflops at double precision floating point, and to cost somewhere north of $100 million. That is on the order of $22,000 per node, unless IBM has renegotiated the price to be higher. As the numbers stand, this price implies that Uncle Sam is getting a hell of a price cut. Street price for such a node would probably be more like $150,000 if I had to guess.
It’s a good guess, not a wild one.
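For readers who want to check the back-of-the-envelope math above, here is a minimal Python sketch that recomputes the per-node and system-level figures. Every constant is taken from the estimates quoted in this article, not from final Summit specifications; the function name `summit_estimates` and the dictionary keys are made up for illustration, and the petaflops figure counts only the GPUs, ignoring whatever the Power9 chips contribute.

```python
# Figures quoted in the article; 2017 estimates, not final Summit specs.
NODES = 4_600
GPUS_PER_NODE = 6
HBM2_GB_PER_GPU = 16
FP64_TFLOPS_PER_GPU = 7.5
SYSTEM_PRICE_USD = 100_000_000       # "somewhere north of $100 million"
GUESSED_STREET_PRICE_USD = 150_000   # the author's guess for one node


def summit_estimates():
    """Recompute the article's rough per-node and system-wide numbers."""
    hbm2_per_node_gb = GPUS_PER_NODE * HBM2_GB_PER_GPU
    # GPU-only peak; CPU floating point is deliberately left out.
    gpu_fp64_petaflops = NODES * GPUS_PER_NODE * FP64_TFLOPS_PER_GPU / 1_000
    price_per_node_usd = SYSTEM_PRICE_USD / NODES
    implied_discount = 1 - price_per_node_usd / GUESSED_STREET_PRICE_USD
    return {
        "hbm2_per_node_gb": hbm2_per_node_gb,
        "gpu_fp64_petaflops": gpu_fp64_petaflops,
        "price_per_node_usd": price_per_node_usd,
        "implied_discount": implied_discount,
    }


if __name__ == "__main__":
    for name, value in summit_estimates().items():
        print(f"{name}: {value:,.2f}")
```

Under these assumptions the sketch gives 96 GB of HBM2 per node, roughly 207 petaflops of GPU double precision across the machine, a little under $22,000 per node, and an implied discount of about 85 percent against the guessed street price.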
So there is the foundational building block for the Cognitive Systems/500 machine that I want to use to revitalize the Power systems (note the lower case S there) business.
Now, here is what I want to do with it, in no particular order:
- Add 3D XPoint memory DIMMs that plug into the memory slots of the system and 3D XPoint SSDs linked to the compute complex by NVM-Express slots on the PCI-Express bus. Get rid of disks and replace them with flash drives. 3D XPoint is a technology that looks and smells like slow main memory but has the persistence of flash and disk (meaning data does not disappear when the power goes down and it does not burn as much energy maintaining state) and a cost that lies somewhere in between. Intel has struggled to get 3D XPoint to market, but Intel will prevail and will use the technology to give its own systems a competitive advantage. You can create larger memory spaces for data and applications to play in and make systems more resilient and less dependent on slower disk storage. I am personally over spinning rust, and only hyperscalers that have to store exabytes of cat videos give a care about disks. Everyone else can use flash as disk and add 3D XPoint in the memory hierarchy to stretch main memory affordably and thereby radically improve the performance, thermals, and economics of systems. Think of main memory as Level 0 cache and 3D XPoint as main memory if that helps. Or think of 3D XPoint as extended main memory. Use whatever metaphor you want.
- Add an integrated HPC environment, accelerated by GPUs, that hooks into the warehousing and manufacturing systems at manufacturers. This should have been done decades ago, when the HPC applications were running on RISC/Unix workstations and/or clusters of RISC/Unix servers. It’s never too late, though. Line up the key simulation and modeling application vendors and integrate this onto clusters of Witherspoon Power9-Volta servers. Make it a single, integrated workflow on a single cluster. By the way, fraud detection and high frequency trading algorithms at financial services firms are just another kind of simulation, and these are also accelerated by hybrid CPU-GPU systems.
- Add an integrated machine learning environment, accelerated by GPUs. All of the key machine learning frameworks use GPU acceleration, and parts of the Watson question-answer system now do, too. All companies need to use machine learning in some fashion. Again, make it part of the integrated system. While Facebook has Torch and Caffe, Microsoft has the Cognitive Toolkit, and IBM has whatever hodge-podge is behind Watson, Google’s TensorFlow framework is getting a lot of support.
- Preserve the legacy IBM i and AIX environments in logical partitions, but add in a Linux-based, PASE-like Docker container environment. Obviously, we don’t want to throw anything away in the Cognitive Systems/500. Just like IBM had System/36 and System/38 runtimes and database overlays in the AS/400, we need to carve out a place where IBM i and AIX live, side-by-side, with native, hyperscale-inspired Docker container environments.
- Add GPU acceleration for the database. Databases love parallel architectures, and this is the third wave of innovation coming to hybrid CPU-GPU iron. MapD, Kinetica, Sqream Technologies, BlazingDB, and Brytlyt are the innovators here, and MapD has just open sourced its code. We would prefer for GPU acceleration for the database to be native, but if the system has to move data from DB2 for i to MapD or some other GPU database to speed up complex queries against datasets with tens or hundreds of billions of rows, so be it. It sure beats what happened with SQL Server and OLAP processing, where the vast majority of customers in the 2000s moved this work off proprietary and Unix systems where they had their main databases and onto Windows Server.
- Integrated Hadoop and Spark data analytics. Same story here. With 48 cores in a node and maybe a dozen nodes, this is a fairly powerful Hadoop batch analytics or Spark in-memory analytics setup. Why not run these workloads natively on some nodes in a Cognitive Systems/500 cluster?
- NUMA is not enough, even if we love it for its simpler programming model. I love the fact that NUMA clustering tightly couples multiple processor and memory complexes into what looks like a giant workstation with a single address space. This is great for certain kinds of programs. But IBM has various kinds of clustering, including DB2 Multisystem for database clustering, that should be the norm. It wouldn’t hurt to learn a bit about relational database design from the likes of Google Spanner and its clone, CockroachDB. These are very sophisticated database clusters that are easy to expand and that automagically move data around the cluster (which can be geographically distributed and synchronized by atomic clocks if you want to go crazy) where applications need it. NUMA machines are expensive and are not modern. IBM needs to bring the Cognitive Systems/500 into the state of the art for clustering for HPC, machine learning, database, and Web applications.
- Integrated object storage for the Integrated File System. The System/38 and AS/400 were, from the get-go, examples of an object-based system. I am not talking about that. What I am talking about is supporting a native, integrated object file system with erasure coding for data protection that supports the Amazon S3 object protocol for unstructured data. If we want companies to dump this stuff into the Cognitive Systems/500, it has to be native. The open source Ceph object store is probably what I would use, since it also has file system and block addressing overlays. Think of this as a hyperscale Integrated File System.
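To make the erasure coding idea in the object storage item above a little more concrete, here is a minimal Python sketch of its simplest form: an object is split into k data shards plus one XOR parity shard, so any single lost shard can be rebuilt from the survivors. This is an illustration only. The function names are hypothetical, it is not how Ceph or any S3-compatible store implements data protection, and production object stores generally use stronger codes that tolerate several simultaneous failures.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> List[bytes]:
    """Split data into k equal-size shards and append one XOR parity shard."""
    shard_len = -(-len(data) // k)                # ceiling division
    padded = data.ljust(shard_len * k, b"\0")     # zero-pad the last shard
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    return shards + [reduce(xor_bytes, shards)]


def decode(shards: List[Optional[bytes]], original_len: int) -> bytes:
    """Rebuild the object, repairing at most one shard marked as None."""
    missing = [i for i, shard in enumerate(shards) if shard is None]
    if len(missing) > 1:
        raise ValueError("single parity can only repair one lost shard")
    repaired = list(shards)
    if missing:
        survivors = [shard for shard in shards if shard is not None]
        # XOR of all surviving shards equals the lost shard.
        repaired[missing[0]] = reduce(xor_bytes, survivors)
    # Drop the parity shard and strip the padding.
    return b"".join(repaired[:-1])[:original_len]


if __name__ == "__main__":
    payload = b"unstructured object payload"
    stored = encode(payload, k=4)
    stored[2] = None                              # simulate a failed drive
    assert decode(stored, len(payload)) == payload
```

The design trade-off is capacity versus resilience: with four data shards and one parity shard, the storage overhead is 25 percent instead of the 100 percent or more that full replication costs, at the price of some compute during writes and rebuilds.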
So, there you have it. I know that is a lot, but much of the hard work is done. This is all about integration, support, and sales pitch. I presume IBM is still good at this. Prove me right, Big Blue. And good people of the IBM i community, tell me what you think about all this.