Making Hadoop Elephants Drink From Silverlake

May 20, 2013 Timothy Prickett Morgan

In last week’s issue, I talked generally about big data and the use of tools like the Hadoop big data muncher created by Yahoo to emulate the very early unstructured data processing and related file system created by Google to power its search engine. While you do not have to take a snapshot of the Internet and index it continuously, as Google did when the Google File System and the MapReduce batch processing method were created, you do have your own big data challenges.

Well, they are probably more accurately described as midrange data problems, to be honest, and if you want to be really precise, you might call them midrange unstructured data challenges.

Luckily, there are plenty of tools out there to help you get started playing around with big data. What you need to do first is take a deep breath and not try to convince your CEO and CFO that you need tens or hundreds of thousands of dollars to buy a bunch of shiny new iron and a commercial Hadoop distribution from Cloudera, IBM, MapR Technologies, Pivotal, or HortonWorks, or even to fire up a cluster running the open source Apache distribution. You can play around a bit before making such a big commitment.

If you are committed to Hadoop, there are far easier ways to get started. Amazon Web Services, the cloud computing arm of the e-tailing giant and by far the dominant public cloud in the world, has for several years sold a service called Elastic MapReduce (EMR) that is a hosted version of Hadoop. It runs on top of the EC2 compute cloud and uses the S3 object storage to hold unstructured data, and you can use the freebie M3 distribution from MapR, which runs atop the Hadoop Distributed File System, or the commercial-grade M5 distribution, which has a funky NFS-alike file system that allows you to mount data sets just like you would on any other Unix system. For your purposes at an experimental stage, it makes no sense to go with M5. Go with the freebie M3 version that comes with the EMR service. (This is basically a packaging up of all the key components of the Apache Hadoop stack.) The way EMR works is that you have to pay for EC2 capacity, which ranges from small virtual machine slices up to big fat instances that are heavy on storage and memory; these cost anywhere from 6 cents to $4.60 per hour. The EMR software price, which ranges from 1.5 cents to 69 cents per hour, is extra, and so is the S3 storage. A 10 TB slice of S3 will run you $668, and the more data you store in S3, the lower the price per unit of capacity.
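
To get a feel for how those three charges add up, here is a back-of-the-envelope sketch in Python. It uses only the price ranges quoted above; the cluster size and run time are made up for illustration, and pairing the lowest EMR surcharge with the lowest EC2 rate (and the highest with the highest) is my assumption, not something from the AWS price list.

```python
# Rough cost sketch for a transient EMR cluster, using only the 2013 price
# ranges quoted in the article. Treat every figure as illustrative.

def emr_job_cost(nodes, hours, ec2_rate, emr_rate):
    """Each node pays the EC2 hourly rate plus the EMR software surcharge."""
    return nodes * hours * (ec2_rate + emr_rate)

# Hypothetical experiment: 10 nodes running for 8 hours.
# Assumption: the low EMR surcharge goes with the low EC2 rate, and so on.
cheapest = emr_job_cost(nodes=10, hours=8, ec2_rate=0.06, emr_rate=0.015)
priciest = emr_job_cost(nodes=10, hours=8, ec2_rate=4.60, emr_rate=0.69)

# S3 is billed separately; the article quotes $668 for a 10 TB slice.
s3_per_tb = 668 / 10

print(f"smallest instances: ${cheapest:.2f}")
print(f"largest instances:  ${priciest:.2f}")
print(f"S3 storage: about ${s3_per_tb:.2f} per TB at the 10 TB level")
```

At the small end of the range, an afternoon of experimenting costs pocket change; the big instances are another matter.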

This is a transient cluster on EMR, by the way, and that means it evaporates once your job is run. You can build a permanent Hadoop cluster on EC2 and S3 if you want. In both cases, you can use the autoscaling features of EC2 to ramp capacity up and down as needed. You have to manage this, and decide on the mix of on-demand, reserved, and spot instances you will need to run your workloads. If you are monkeying around, buying spot instances (ones AWS can’t sell at the moment) is the cheapest way to go. You can read a good comparison of EMR versus permanent EC2-Hadoop costs here. EMR is about a quarter the price of setting up Hadoop yourself (that’s just for the infrastructure), and if you reserve capacity or use a mix of reserved and spot slices, you can cut the cost even further. The point is, we are talking about maybe several thousand dollars per year to experiment.

Microsoft is working with HortonWorks to create a similar Hadoop service on the Windows Azure public cloud called HDInsight, which is still in tech preview. Like Amazon’s EMR service, HDInsight will let you fire up Hadoop and the Hive query tool, which has an ODBC adapter to link it to other databases, spreadsheets, and other analytics tools. HDInsight costs 32 cents per hour for the head node for Hadoop management and 16 cents per hour for each compute node, and this is with a 50 percent discount that is available during tech preview. A baby cluster with one head node and four compute nodes can give you Hadoop for $8,568 per year during the tech preview phase, and that will double to $17,136 for a full year when the service goes into production. Microsoft is also offering discounts for companies that make a six-month (18.7 percent) or 12-month (31.3 percent) commitment. That puts you in the range of $11,781 for a full year of committed usage once the tech preview is over.
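
If you want to rerun that HDInsight math for a different cluster size, a minimal sketch looks like this. The hourly rates and discounts are the ones quoted above; the 744-hour billing month is my assumption, and the function name is made up. The results land within a few dollars of the figures in the paragraph, with the difference down to rounding.

```python
# Sketch of the HDInsight yearly cost arithmetic, under stated assumptions.
HEAD_RATE = 0.32       # $/hour for the head node, tech preview price
COMPUTE_RATE = 0.16    # $/hour per compute node, tech preview price
HOURS_PER_MONTH = 744  # assumption: a 31-day billing month

def yearly_cost(compute_nodes, preview=True, commitment_discount=0.0):
    """Yearly cost of one head node plus the given number of compute nodes."""
    hourly = HEAD_RATE + compute_nodes * COMPUTE_RATE
    if not preview:
        hourly *= 2  # the preview price reflects a 50 percent discount
    return hourly * HOURS_PER_MONTH * 12 * (1 - commitment_discount)

preview_year = yearly_cost(4)
production_year = yearly_cost(4, preview=False)
committed_year = yearly_cost(4, preview=False, commitment_discount=0.313)

print(f"tech preview:        ${preview_year:,.0f}")
print(f"production:          ${production_year:,.0f}")
print(f"12-month commitment: ${committed_year:,.0f}")
```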

If you want to go the Google route and use some of the same technologies that the company uses to provide its own services to end users, you can take a spin on BigQuery. This is essentially a big database store that Google has created as a back-end for a number of its services, based on a system called Dremel, that allows for SQL-like queries against unstructured data, just as various SQL layers are doing for Hadoop. BigQuery was launched in 2010 and came out of beta a year ago.

If you don’t want to try to figure out BigQuery on your own, you can go to a French company called BIME, which has created a service by the same name that is a front-end for BigQuery, makes it more usable, and itself runs on Amazon Web Services. The BIME service uses data connections to suck information into BigQuery from various production systems so it can be analyzed and displayed in dashboards. Pricing ranges from $180 to $720 per month for the BIME (pronounced “beam”) service.

Google has just this week announced Cloud Datastore, which is aimed at truly unstructured data sets. This service is based on Google’s BigTable distributed storage system, which is used as the back-end for web indexes, Google Earth, and Google Finance, and uses Google’s Megastore high availability replication, which does automatic data replication across multiple Google data centers to make sure you don’t lose data. Cloud Datastore has a free tier that gives you 50,000 read and write operations, 200 indexes, and 1 GB of data stored per month. This seems like a very good place to monkey around and you can’t beat the price. And perhaps more significantly, you don’t have to learn Hadoop.

Ditto for the data munching service from Splunk, which went public last year and has been rumored to be an acquisition target for Big Blue. Like Hadoop, the Splunk service borrows Google’s MapReduce mechanism, which breaks data into chunks, chews on them, and then aggregates the results to come up with a final data set that fulfills a batch query to the system. Splunk has its own data store, however, and does not rely on HDFS. This data store can ingest both structured data from databases and unstructured data generated by applications (clickstreams, web logs, application logs, and such). And just as Google created Sawzall and Yahoo created Pig to be high-level query languages that could in turn be converted to MapReduce instructions to chew on data scattered around the cluster, Splunk has its own search language, called Search Processing Language, that rides atop the MapReduce layer and is far easier to use than writing MapReduce instructions by hand in Java.
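
If the chunk, chew, and aggregate idea is new to you, here is a toy sketch of it in plain Python. To be clear, this is not Hadoop’s or Splunk’s API, and it runs in one process on one machine; the log lines and function names are made up. It only shows the shape of the technique: split the input, map each chunk to key-value pairs, then reduce the pairs to totals.

```python
from collections import defaultdict

def chunk(records, size):
    """Break the input into fixed-size pieces, as a cluster would across nodes."""
    for i in range(0, len(records), size):
        yield records[i:i + size]

def map_phase(lines):
    """Emit a (status_code, 1) pair for every log line in one chunk."""
    return [(line.split()[-1], 1) for line in lines]

def reduce_phase(pairs):
    """Aggregate the emitted pairs into one count per key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Hypothetical web log lines: method, path, HTTP status code.
logs = [
    "GET /index.html 200",
    "GET /missing 404",
    "POST /login 200",
    "GET /index.html 200",
    "GET /admin 403",
]

mapped = []
for piece in chunk(logs, size=2):   # each piece could run on a different node
    mapped.extend(map_phase(piece))

result = reduce_phase(mapped)
print(result)
```

On a real cluster, the map calls run in parallel next to the data and the framework shuffles the pairs to the reducers, which is the part that languages like Pig and Splunk’s search language hide from you.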

Splunk has a Free Edition, which can only ingest 500 MB of data per day and which is intended for developers to play around with to see how to use it. It includes ad hoc reporting, dashboards, and other features, but not the full set. The Enterprise Edition has all the bells and whistles. Splunk is available as a standalone product you install on your own machines, or as a service that runs on a cloud. It has plug-ins and dashboards that cover a wide variety of industries and use cases, ranging from plugging into Hadoop clusters to providing dashboards for various aspects of IT management, security, and compliance. The Enterprise Edition costs $5,000 for a perpetual license (not including support fees) and $2,000 for an annual subscription (which includes support fees). The Splunk Storm cloud service is free if you want to store 1 GB of aggregate data, and if you want to push it up to 2 GB, you are talking about $20 a month; 1 TB of data storage on the Splunk cloud will run you $3,000 per month.

This is by no means an exhaustive list of the Hadoop and other big data options that you have at your disposal as you contemplate how to add this type of processing to your IT operations. The point is to start somewhere, start small, and learn how these new batch and interactive processing techniques work, and then figure out how to embed them in your own business. You will have to get over the idea of shipping your operational data over encrypted links to a public cloud if you want to use many of these services. Or find on-premises alternatives.

This is a good place to start if you want to chew on unstructured data. And if you have better ideas, please share. I am learning this stuff, too. And more importantly, if you have figured out how to make DB2 for i and SQL Server play nice with Hadoop and other big data munchers in your shop, don’t be afraid to share your insights.