What Does IBM’s Embrace Of Apache Spark Mean To IBM i?
September 21, 2015 Alex Woodie
IBM is actively embracing a new software framework called Apache Spark as the core engine powering its predictive analytics applications going forward. The big data world has gone absolutely gaga over Spark, and Big Blue has even put Spark on the venerable z/OS mainframe. But where does that leave its little brother, the IBM i platform?
Apache Spark, if you’re not familiar, is hands down the hottest technology at the moment in the world of big data analytics. The open source project is so popular because it implements a relatively easy-to-use framework for running in-memory analytic workloads across a distributed cluster. Benchmarks have shown Spark (which sports APIs for Java, Scala, Python, and R) is 10 to 100 times faster than MapReduce, which up to this point has been the predominant programming framework for Apache Hadoop.
While Spark doesn’t require Hadoop, most companies are using Spark within a Hadoop cluster, which provides the persistent storage mechanism needed for saving the raw data inputs for Spark, as well as the outputs from Spark. (Spark is an in-memory system, after all.) When you buy a Hadoop distribution from vendors like Cloudera, Hortonworks, or MapR Technologies, you get the Apache Spark binaries, as well as Spark technical support, as part of the package.
While IBM isn’t a Hadoop pure-play like the other vendors mentioned above, it does sell and support its own Hadoop distribution, called InfoSphere BigInsights. IBM’s Hadoop distro runs on Linux running on Power and X86. (IBM also supports a distribution of Hadoop for the z/OS mainframe called zDoop, which is based on software from Veristorm, but that’s another story.)
Earlier this year IBM made a major commitment to support Spark in various parts of its business, including its BigInsights Hadoop distribution and running as a standalone service on the BlueMix platform cloud. And in a blog post last month, Joel Horwitz, IBM’s director of portfolio marketing for big data and analytics, announced that Spark was coming to Power.
“Today, IBM announced that with IBM Open Platform with Apache Hadoop it is bringing Spark along with the Open Development Platform to IBM’s industry-leading systems portfolio, including IBM Power Systems, the only server platform designed from the ground up to handle the demands of big data,” Horwitz wrote.
Since disposing of its X64 server business, IBM has been much more inclined to do direct comparisons between Power and Intel X64 servers, so it’s not surprising to see Big Blue opening the marketing floodgates with the opportunity presented by Spark.
“Power8 is the industry leader in system bandwidth, multithreading, and caching with capabilities that are four times what x86 platforms typically offer, capabilities that will dramatically improve the ability of Apache Spark to process data and iterate quickly,” Horwitz writes. “With Spark on Power Systems, we have the ability to shrink the latency between data and the point of interaction while reducing infrastructure costs and complexity.”
Our own @TDaytonPM did an analysis of the Power-versus-Xeon throwdown on Spark over at The Platform, which you can see here.
One of the great things about Spark is that it includes a suite of subprojects for various big data workloads, including machine learning, graph analytics, real-time streaming, and traditional SQL queries. According to Horwitz, compared to X64, Spark on Power delivers twice the performance for ML, graph, and streaming workloads, and three times the performance for SQL workloads. The addition of hardware accelerators, such as GPUs or other devices attached through the Coherent Accelerator Processor Interface (CAPI) links on Power8 chips, makes Spark run even faster.
It is clear that Spark on Power will benefit IBM customers whose business models demand the capability to quickly process huge volumes of unstructured or semi-structured data. It’s not a stretch to see how companies in the oil and gas business or hedge funds, for example, can put big data processing to use to boost the bottom line.
It’s less clear how companies with more traditional business models–which typify IBM i shops–will benefit from this vast big-data processing power. Spark is really good at enabling users to combine and manipulate vast amounts of data and do so quickly and relatively painlessly (at least compared to the contorted processing pretzels that MapReduce would often force you into). That goes far beyond the basic reporting requirement of the typical IBM i shop, and it even goes beyond the more advanced data warehousing workloads that are still fairly rare on the platform.
But there are possibilities on the horizon, particularly among IBM i shops that have consumer-facing businesses, including those in retail, banking and insurance, and healthcare. (The big data game is mostly about B2C, although there are some B2B use cases beginning to emerge.) It all comes down to finding creative ways to manipulate new data streams with existing data, with the goal of improving some aspect of the business.
In retail, a company could combine clickstreams from an e-commerce website with the customer order history to figure out what a customer is likely to buy. In banking, a company could use Spark’s fast machine learning capabilities to spot anomalous transactions that could indicate fraudulent card activity. In healthcare, a hospital could use Spark to analyze the effectiveness of various treatments compared to the costs they incur. You could do all this without Spark, but the promise is that Spark makes it easier.
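As a rough illustration of the banking scenario above, here is a minimal sketch of flagging out-of-pattern card transactions. It is plain Python rather than Spark, purely to stay dependency-free; a Spark job would express the same group-by-card-then-score shape with its own APIs. The function name, the sample data, and the threshold of 2.0 standard deviations are all hypothetical choices made for this example, not anything IBM or the Spark project prescribes.

```python
from collections import defaultdict
from statistics import mean, pstdev


def flag_anomalies(transactions, threshold=2.0):
    """Return (card_id, amount) pairs that sit far from that card's usual spend.

    transactions: iterable of (card_id, amount) tuples.
    threshold: hypothetical cutoff, in standard deviations from the card's mean.
    """
    transactions = list(transactions)

    # Group amounts by card, the same step a cluster framework would
    # distribute across many machines.
    by_card = defaultdict(list)
    for card_id, amount in transactions:
        by_card[card_id].append(amount)

    flagged = []
    for card_id, amount in transactions:
        amounts = by_card[card_id]
        if len(amounts) < 2:
            continue  # not enough history to judge this card
        mu = mean(amounts)
        sigma = pstdev(amounts)
        # Score each transaction against the card's own spending pattern.
        if sigma > 0 and abs(amount - mu) / sigma > threshold:
            flagged.append((card_id, amount))
    return flagged


# Made-up sample data: card A has one purchase far outside its usual range.
sample = [
    ("A", 20.0), ("A", 25.0), ("A", 22.0), ("A", 18.0),
    ("A", 24.0), ("A", 21.0), ("A", 900.0),
    ("B", 50.0), ("B", 55.0), ("B", 52.0), ("B", 48.0),
]
suspicious = flag_anomalies(sample)
```

A simple per-card score like this is only a starting point. Real fraud detection would draw on far richer features and trained models, which is where a machine learning library running across a cluster earns its keep.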
It’s doubtful that Spark itself will ever run directly on IBM i (although it’s open source, so you never really know). However, having Spark running in a Linux partition on the Power Systems box could give IBM i shops a reason to experiment with Spark–if not for the proximity of the production data, then the existence of Power Systems technical skills that eliminate the need to add yet one more hungry X64 server to the nest.
And while IBM i shops aren’t going to download Spark directly to a production Power Systems server, having Spark available on the box–or a separate box on the same LAN–could ultimately prove too tempting not to use. Spark is really the first in-memory data processing framework to catch fire among developers. That, combined with the dropping price of RAM, could ignite some interesting new approaches to all sorts of computing problems.
Rob Thomas, vice president of product development at IBM Analytics, has been doing a lot of work in the IBM Rochester lab recently. While none of it apparently impacts the IBM i side of the house, Thomas is taking a broad view of the impact of Spark, and that very definitely could have an effect on the 125,000 or so companies around the world that still rely on IBM i to run their businesses.
“We believe Spark is going to be as revolutionary to data as Linux was to operating systems and IT,” Thomas told IT Jungle recently. “That’s why we talk about Spark as being the analytic operating system for the enterprise. With Spark you get a unified programming model via Scala. Suddenly a client can access all the data that they have in the organization from a common framework. That’s very powerful.”