What Does ‘Big Data’ Mean for IBM i?
November 12, 2013 Alex Woodie
When IBM i 7.1 Technology Refresh 7 (TR7) ships on Friday, it will contain several updates to the DB2/400 database designed to help it handle big data, including an expansion of SQL indexes, easier movement to SSDs, and tools to track the growth of tables over time. But what exactly does big data mean on the IBM i? We set out to find some answers.
The traditional definition of big data centers on the three “Vs,” which refer to the volume, velocity, and variety of data. IBM sometimes likes to add a fourth “V” to the mix to represent veracity, or the lack thereof.
Data volumes are no doubt increasing, but that’s been true since the first 8-bit processors started to make their way into businesses. What’s different today is that data volumes are starting to get really, really big. The numbers are actually mind-boggling. According to IBM, every day the world generates 2.5 quintillion bytes (the equivalent of 2.5 exabytes, or 2,500,000,000,000,000,000 bytes) of data. Data volumes are roughly doubling every year.
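The unit conversion above is easy to sanity-check with a few lines of arithmetic (an illustrative sketch using decimal SI units, where one exabyte is 10^18 bytes):

```python
# 2.5 quintillion bytes per day, expressed in exabytes (decimal SI units)
bytes_per_day = 2_500_000_000_000_000_000  # 2.5 quintillion bytes
exabyte = 10 ** 18                         # 1 EB = 10^18 bytes

print(bytes_per_day / exabyte)  # 2.5 exabytes per day

# If volumes roughly double every year, ten years of doubling
# multiplies the daily figure by 2^10, i.e., about a thousandfold
print(bytes_per_day / exabyte * 2 ** 10)  # 2560.0 exabytes per day
```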
The variety of data is getting wider and more disparate, too. Structured data, such as transactions logged into DB2/400 or other relational database systems, is growing, but at a slower rate than less-structured data types, such as HTML Web pages, pictures taken with smartphones, social media posts, and PDFs. IBM estimates that, by 2020, more than 40 percent of all data will be machine-generated data coming from Web servers, RFID and GPS sensors, financial transactions, medical devices, HVAC systems, and other machines that will encompass the so-called “Internet of Things.”
As people try to capture all these increasing data volumes and data types, the velocity becomes apparent. These pieces of data, such as Web clickstreams, call detail records, and transaction information, are valuable, but that value can diminish as the data ages. Hence, it's important to act on data as it arrives, or soon thereafter.
New data processing paradigms have emerged to help people store and process these big new data sets. The most popular is Apache Hadoop, an open source framework that enables users to turn ordinary x86 Linux servers into huge distributed clusters that can apply supercomputing-like capabilities against petabytes' worth of unstructured data.
Then there are new NoSQL and NewSQL databases, such as MongoDB, that can easily handle semi-structured data and also scale out horizontally in a fault-tolerant manner more easily than their relational cousins. Hadoop and the NoSQL/NewSQL databases are changing the economics of data storage and processing, and have become the building blocks of a new paradigm of big data-driven applications.
Big Data on IBM i
So where does the IBM i server fit into this new big data landscape? As you might imagine, you’re not going to run Hadoop or NewSQL on IBM i; those products run primarily on Linux. The proprietary nature of IBM i means it’s shielded somewhat from the big data goings-on of the wider IT universe. The fact that the IBM i server is primarily used by brick-and-mortar companies, as opposed to companies that make their money from the Web, also helps to keep the platform grounded in a firmer reality.
But on the other hand, there’s no doubt that IBM i is being impacted by the explosion of information. While the general IT world goes ga-ga for anything with “Hadoop” in the name, and NewSQL database companies continue to sprout up like mushrooms after a spring rain, organizations are counting on their IBM i servers to quietly deal with steadily increasing data volumes, if not necessarily varieties or velocities.
The biggest big data issue facing IBM i shops is growth of structured data stored in the DB2/400 relational database, according to IBM i experts with IBM and SEQUEL Software who talked with IT Jungle for this story.
It used to be fairly rare for IBM i customers to have super massive databases, but now it's become quite common, according to Mike Stegeman of SEQUEL Software, the Help/Systems subsidiary. “It seemed at one time to be gradual growth, but then all of a sudden it exploded,” he said.
One SEQUEL Software customer had a requirement to access a single database file that had a billion records in it. The file supported a critical transactional system that, due to the structure of the file and the database tables, could not be purged, he said.
“With the IBM i, a lot of it is the transactional data,” Stegeman said. “We’re getting these customers who have these extremely large files on the i, and maybe some other databases that we can access. That’s kind of what their pain points are, and they want to have a tool that’s easy to use and can access the information without breaking the bank.”
Another common big data-related pain point has to do with partitioned tables. There's a limit to the number of records that can be stored in a single table, which leads some IBM i shops to utilize table partitioning. However, some business intelligence tools can't support partitioned tables, and must run a separate query against each partition, according to SEQUEL Software, which touts its capability to run a single query against a partitioned table as a competitive advantage.
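The distinction between per-partition queries and a single query over all partitions can be illustrated in miniature. The sketch below simulates it with SQLite (since DB2 for i isn't available here); the table and column names are hypothetical, and the view over a UNION ALL stands in for what transparent access to a partitioned table looks like:

```python
# Illustrative sketch: "one query vs. a query per partition",
# simulated with SQLite. All names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two "partitions" of an orders table, split by year
cur.execute("CREATE TABLE orders_2012 (id INTEGER, amount REAL)")
cur.execute("CREATE TABLE orders_2013 (id INTEGER, amount REAL)")
cur.executemany("INSERT INTO orders_2012 VALUES (?, ?)", [(1, 100.0), (2, 250.0)])
cur.executemany("INSERT INTO orders_2013 VALUES (?, ?)", [(3, 75.0)])

# A tool without partition support must query each partition separately
# and combine the results itself...
totals = [cur.execute(f"SELECT SUM(amount) FROM {t}").fetchone()[0]
          for t in ("orders_2012", "orders_2013")]
print(sum(totals))  # 425.0

# ...whereas a single object spanning all partitions lets one query
# cover everything at once
cur.execute("CREATE VIEW orders AS "
            "SELECT * FROM orders_2012 UNION ALL SELECT * FROM orders_2013")
print(cur.execute("SELECT SUM(amount) FROM orders").fetchone()[0])  # 425.0
```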
The IBM i server excels as a database machine, and since that database is relational in nature, people aren't going to try to squeeze every data type into it. There is some growth in the storage of binary large objects (BLOBs) and character large objects (CLOBs) on IBM i, but it appears to be minimal outside of specific industries (such as healthcare, with its requirement to store diagnostic images) and certain ERP systems (such as SAP's Business Suite running on IBM i, an admittedly uncommon pairing). Many customers are, however, starting to store large numbers of PDFs in the Integrated File System (IFS), which is worth noting.
Big Data Causes on IBM i
In the wider big data world, the big data phenomenon is being driven by the desire (and the new capability) to detect and exploit business opportunities in much shorter timeframes. Companies like Facebook, Google, and Twitter use big data technologies to serve ads based on all sorts of things they know about their users, while Netflix and Amazon use it to make product recommendations based on their collected intelligence.
Things are a little different on IBM i. In the IBM i world, the big growth of mostly relational data appears to be driven by two things: regulation and forecasting.
Purging unused data from DB2/400 used to be a standard part of good housekeeping on the platform. But today less than 20 percent of IBM i shops purge their data on a regular basis, according to informal polls taken by Help/Systems’ vice president of technical services, Tom Huntington.
“You have these various regulations and people aren’t able to purge their data,” he says. “[Through PowerTech] we see more and more people who are struggling with how to keep audit data around it.”
The combination of the declining cost of storage and the availability of new data warehousing technology like Hadoop is impacting IBM i shops and what data they decide to keep, Stegeman says.
“You don’t need a whole floor on a gigantic skyscraper just to hold your hard drives to handle all your data,” he says. “They’re keeping their history around longer, either for auditing purposes or to find out how the company is doing overall. Or they may say, ‘Hey we’re not using it now, but maybe we will two or five years down the road.'”