Dances With Elephants
May 13, 2013 Timothy Prickett Morgan
If you are a typical IBM i shop, you have spent decades turning the transactional data in your application stack into information that business managers and other workers within the company can not only use to keep the money rolling in and the products rolling out, but to do a better job of managing those two processes. And you are probably pretty proud of the fact that it takes relatively modest resources to accomplish this, and you are no doubt wondering what all of this yammering about so-called “big data” is all about.
Well, the first thing big data is absolutely about is IT vendors smelling an opportunity to sell you something you think you need because all the cool kids are doing it. Generally, that means using tools like the Hadoop big data muncher, which mimics an earlier generation of the infrastructure that Google created to chew through all of the web pages it hoovered off the Internet to create its search engine. This data is unstructured, meaning that it does not fit comfortably in the rows and columns of a relational database management system, although there is no reason, technically, that you could not use a relational database to create a search engine. (Indeed, DB2 for i has a search engine function built into it for your databases.) The trouble is that turning a relational database into a search engine, or making it handle any other job that processes volumes of unstructured data, takes a lot of iron and adds up to a lot of money.
That is why Google created the Google File System for storing huge sets of data and the MapReduce method of chewing through it. Doug Cutting read Google's papers on MapReduce and GFS (which were published after the Googleplex had already moved on to another generation of data store and indexing technology) and decided to clone them as an open source Apache project, naming it Hadoop after a stuffed elephant carried around by one of his kids. Cutting took the project with him to Yahoo, which put engineering muscle behind it for its own backend systems and helped build a community around the code, and today Hadoop is all the rage for dealing with big data.
Over the past few years Hadoop has been layered with data warehousing features, then with ad hoc query capabilities that were not quite what most businesses were used to, and finally with true SQL-compliant query capabilities. In effect, and once again, a flat file data store has been transformed into something that looks and smells more like a relational database.
Hadoop has a few clever features that make it a cheap alternative for data storage and for batch mode and interactive processing against that data. For one thing, Hadoop runs on cheap iron, and even though it is written in Java, it runs reasonably fast so long as the data sets do not get too large. The current Hadoop data muncher can scale to around 4,000 nodes before its NameNode (kind of like the file allocation table on a disk drive, except it is for the Hadoop Distributed File System) and its JobTracker (kind of like Advanced Job Scheduler for IBM i) start to hit their limits. That's a pretty big cluster, although it is nothing compared to the 100,000 nodes or so that Google is said to cram into one data center to run one workload.
The idea with Hadoop is to store data cheaply, because you are going to want to have lots of it and you are going to keep at least three copies of the data for data munching in parallel and for high availability. With cheap 3.5-inch SATA disks pushing up to 3 TB and now 4 TB of capacity, these are preferred over physically smaller, more reliable, and more costly 2.5-inch SAS drives. Hyperscale modular servers like those from Super Micro, Dell, and Silicon Graphics that are tuned for jobs like Hadoop have one drive per core if at all possible. With the FatTwin machines from Super Micro, for instance, the machine can have four nodes in a 4U chassis, with each node having eight disk drives. If you go with four-core Xeon E5-2600 processors, you can have a balanced system and cram four nodes into one chassis, for a total of 32 cores and 128 TB of disk. At the extreme limits of Hadoop, you could, in theory, use such machines to build a 32,000 core and 128 PB Hadoop system in about 100 server racks, with a little room left over for switches.
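To make that arithmetic concrete, here is a back-of-the-envelope sizing sketch in Python. The hardware figures come straight from the paragraph above; the assumption that ten 4U chassis go into a rack is mine, to leave room for switches:

```python
# Back-of-the-envelope Hadoop cluster sizing, using FatTwin-style
# figures: 4 nodes per 4U chassis, 8 cores (two four-core E5-2600
# sockets) and 8 drives per node, 4 TB per drive.
nodes_per_chassis = 4
cores_per_node = 8
drives_per_node = 8          # one drive per core
tb_per_drive = 4

cores_per_chassis = nodes_per_chassis * cores_per_node
tb_per_chassis = nodes_per_chassis * drives_per_node * tb_per_drive

# Scale out to Hadoop's practical limit of around 4,000 nodes.
target_nodes = 4000
chassis = target_nodes // nodes_per_chassis
total_cores = chassis * cores_per_chassis
total_pb = chassis * tb_per_chassis / 1000

# Assume ten 4U chassis per rack, leaving slots for switches.
racks = chassis // 10

print(cores_per_chassis, tb_per_chassis)   # 32 cores, 128 TB per chassis
print(total_cores, total_pb, racks)        # 32000 cores, 128 PB, 100 racks
```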
Any way you cut that, this is a supercomputer. And like supercomputers, there is a scatter-gather method to toss work out to the processors, gather up results for many small calculations (or data sorts with Hadoop) and merge them. With supercomputer clusters, you generally have the data set in a separate high speed file system and you parcel out bits of data needed for a simulation to the nodes in the machine, and you do calculations, see how all of the other calculations on the other nodes affect the next step in the calculation, and many, many hours later you have the simulation of air over a wing or combustion in a car engine.
With MapReduce, you have the unstructured data out there in triplicates, with its distribution to specific drives in the cluster managed by the NameNode, and if you lose the NameNode, by the way, you lose the Hadoop file system–which is a very bad thing. Losing data is a very bad thing too, particularly if you have told upper management that keeping all of the operational and clickstream data from your online systems and merging it with transactional data is vital to understanding customer behavior and predicting it. Hence the triplicates of data spread around the Hadoop file system.
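Those triplicates are not something you code by hand, by the way; three-way copies are simply HDFS's default replication factor, set in the cluster's hdfs-site.xml configuration file:

```xml
<!-- hdfs-site.xml: how many copies of each block HDFS keeps -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```

Crank that value up for hotter data or down to save disk, and HDFS re-replicates blocks accordingly.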
The neat thing about Hadoop is that it moves data munching jobs to the data and uses the core associated with a particular drive to do the munching rather than trying to move data out of an external file system and onto a server node. This simple twist means you can write Hadoop jobs in Java and still get real work out of them. Like, for instance, creating the Watson question-answer machine, which ran on Hadoop (with lots of extensions to make it work in real time instead of batch mode). I am going to gloss over the technical details of Watson, but I explained them here two years ago. This is a very different kind of data processing than IBM i shops are accustomed to, and it uses radically different kinds of data and algorithms to extract some kind of insight out of the hodge-podge of bits, too.
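To give a feel for the map-and-reduce shape of these jobs, here is the classic word count example, sketched in plain Python with the map, shuffle, and reduce phases spelled out. Real Hadoop jobs implement Mapper and Reducer classes in Java, and the framework does the shuffling across nodes; this is only a toy illustration of the idea:

```python
from collections import defaultdict

def map_phase(record):
    # Mapper: emit a (word, 1) pair for every word in an input record.
    return [(word, 1) for word in record.split()]

def shuffle(pairs):
    # Shuffle: group values by key; Hadoop does this between map and
    # reduce, moving pairs across the network to the right reducer.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reducer: collapse all the counts for one word into a total.
    return key, sum(values)

records = ["big data big iron", "big elephants"]
mapped = [pair for r in records for pair in map_phase(r)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)  # {'big': 3, 'data': 1, 'iron': 1, 'elephants': 1}
```

In a real cluster, each mapper runs against the blocks stored on its own node's drives, which is the data locality trick described above.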
But with the SQL-like layers that IBM is putting into its BigInsights distribution, called Big SQL, as well as the Hawq alternative from Pivotal (just spun out of VMware and EMC last week) for its Hadoop distribution, and Impala from Hadoop distie Cloudera, you are not going to have to be an expert in writing Java MapReduce algorithms to merge HDFS data with information stored in the relational databases that underpin your production systems. You will just do a database join using SQL commands and run an SQL query against the joined tables.
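The payoff looks something like the query below, shown here against SQLite purely for illustration, with invented table and column names. Pretend the clicks table is clickstream data landed in HDFS and the orders table is your production database; once a SQL layer is in place, merging the two is one plain join, with no MapReduce code in sight:

```python
import sqlite3

# 'clicks' stands in for clickstream data in HDFS; 'orders' stands in
# for the production relational database. Names are invented.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE clicks (customer_id INTEGER, page TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO clicks VALUES (1, '/elephants'), (1, '/pricing'), (2, '/home');
    INSERT INTO orders VALUES (1, 250.0), (2, 99.0);
""")

# One ordinary SQL join instead of a hand-written Java MapReduce job.
rows = db.execute("""
    SELECT o.customer_id, o.amount, COUNT(c.page) AS pages_viewed
    FROM orders o
    JOIN clicks c ON c.customer_id = o.customer_id
    GROUP BY o.customer_id, o.amount
    ORDER BY o.customer_id
""").fetchall()
print(rows)  # [(1, 250.0, 2), (2, 99.0, 1)]
```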
The goal is that you won’t even know you are running Hadoop, in fact. Just like the goal with the original System/38 was that you would not even know you were running a relational database management system. In the case of the AS/400, IBM layered a flat-file database onto the relational database to create a System/36 emulation environment so System/36 applications could run unchanged on top of the integrated relational database. In the case of these modified Hadoops, you are taking a distributed flat file data store and giving it SQL capabilities so it looks and feels like a relational database. Both are very clever. Both took very smart people, and a lot of them, to engineer.
The Big SQL, Hawq, and Impala SQL layers work differently from each other, but they will do one important thing that was also done on mainframes in the 1970s, which radically expanded their usefulness: they will move Hadoop from batch mode to interactive mode. Queries that would take hours or days can now take minutes or seconds against vast amounts of data. And I am beginning to think that it won't be too long before HDFS becomes the basis of real transaction processing systems, obviating the need to have both relational and Hadoop data stores. The reason is simple: You want all of your data in one place, in one format, and accessible for both MapReduce and SQL processing. You don't want to be moving your data from one machine to another, as we do with data warehouses today. It is all just too hard.
So what do Hadoop and all of this talk about big data have to do with IBM i customers? Well, to be honest, as more than a few people have put it: it's not the size of your data, but how you use it.
There has always been big data: web logs, system logs, ad tracking logs, app logs, and so on and so on. The difference is that you used to throw it all away, or only store pieces of it for a short period of time in an Excel spreadsheet, a SQL Server database, a data mart, or a data warehouse. Big data is really about trying to keep all of the data generated by all of your systems and applications, and then figuring out how to mash it up with third party data to derive some kind of insight that will let you better serve customers or find new ones. And a Hadoop cluster, particularly with these new-fangled SQL extensions, can do it for 1/100th the price per terabyte of a traditional data warehouse (that's two orders of magnitude) and deliver near real-time performance on top of that.
This is a big deal, and it is going to change your IT department–either with or without you.
Next week, I will tell you what steps you can take to get familiar with Hadoop and start making use of it in your IBM i shop.