Why You Need a Data Warehouse
April 24, 2012 Alex Woodie
If your IBM i shop is like most, it uses some type of business software, such as an ERP system built atop DB2/400, to process transactions. And while it may excel at processing data quickly and efficiently, the setup is not ideal for answering detailed questions that a manager may have about the business, i.e. running analytical workloads. For various reasons, a separate data warehouse will be a far superior solution–especially if it’s implemented on an IBM i server, says data warehousing expert Alan Jordan.
There are lots of good reasons to run a data warehouse. Perhaps the most obvious reason is to keep analytical workloads separate from production systems, reducing the chance that running a large query in the middle of the day will bring your transaction processing capability–your business–to its knees. Having a place where you consolidate data from multiple sources, such as separate SQL Server or Oracle databases, is another common driver.
But there are several not-so-obvious reasons for implementing a data warehouse. For starters, the architecture of a typical transactional database simply doesn’t fit well with analytical uses, says Alan Jordan, a vice president of Coglin Mill, a developer of extract, transform, and load (ETL) software for the IBM i platform.
“It’s one of those ‘You can’t get there from here’ things,” Jordan tells IT Jungle. “The database was not designed for human use. Its structure, the third normal form (3NF), was designed for the application to use. So if I have a reporting or business requirement where I need to analyze the data in this way, the operational data just doesn’t support that type of analysis.”
The 3NF principle, as implemented in the database schema used by most major ERP systems, mandates that different pieces of information–such as customer name, customer number, and customer address, product name, product number, product description, etc–are stored in separate tables within the database.
This setup works very well for submitting transactions through an application, as the ERP software can quickly join dozens of tables to assemble all the bits of information needed to complete an order. It’s fast and efficient. But this efficient database design becomes very cumbersome when humans are submitting the queries, rather than the application that was originally designed to work with the database.
“When I want to create a report that uses information such as names and descriptions and other attributes, instead of just pointing to one table and saying, ‘Here produce me the report from this table,’ you end up needing to join 15 to 20 tables together,” he says. “So the person using the reporting tool, if it’s pointing directly against the production database, has to work out how all these tables relate to each other and how to join them to get the data. That’s a complexity.”
But when you build a data mart, you change the database design, Jordan says. “You throw away those 3NF principles and build it as such that it’s easy to query, rather than easy to process transactions,” he says. “I’m not processing transactions with the data any more–I’m querying it, so I purposely design it for ease of querying. It’s a completely different philosophy on how to design the tables.”
That brings up the second not-to-obvious reason to implement a data warehouse: cryptic names. A good data mart design will implement easy-to-understand file and field names in place of the cryptic names that most production databases use. These cryptic names will often be found in RPG-based applications that were originally developed for the S/3X or AS/400 servers in the late 1970s and 1980s. In those days RPG had a limitation that restricted column and field names to six characters.
“How descriptive can you be–as far as what is the meaning or value of this particular piece of data–in six characters?” Jordan asks. “You may have good text descriptions of the names, but not every query or reporting tool will show you those text descriptions.”
“So If I’m the report writer, and I’m charged with using all this operational data, first I have 20 tables to mange instead of two or three. But also I have all these cryptic column names. There’s no real description or documentation that tells me what all the pieces of the data mean. It’s a challenge just understanding the data.”
The way that time is recorded in ERP databases is a third not-so-obvious reason to implement a dedicated data warehouse on the IBM i platform. Time is a major element of nearly every query or report, but it can be frustratingly difficult to extract with nearly any ERP system originally developed in the 1990s, before time became a normalized data type in DB2/400.
“When you have numerical field, it might be ordered as century-year-month-day, or century-month-day-year. There might be 10 different formats it might use,” he says. How do know what quarter something happened in? You have to go through complex groups to look at that data, that number, to know you’re in quarter one or in month two or in the year 2012. I’ve got multiple steps to go through. It’s not impossible to do. It’s just one more challenge that I have to handle in the reporting. But when you do it in a data warehouse, you convert those dates into true dates.”
These three items–the 3NF structure of databases, cryptic names, and complex time stamps–conspire to increase the complexity level for anybody attempting to query a production database. Reports become unwieldy and time that could be spent thinking about ways to improve the business is spent trying to understand cryptic file names and archaic time codes.
This takes away from the whole idea of business intelligence.
“Just like any other customer that implements business intelligence, I’m trying to better understand what’s going on in my business,” Jordan says. “I want to provide better customer service, I want to reduce cost, I want to reduce waste and identify problem areas–which areas are underperforming, which products or which brands am I not earning the correct margin on. But this information is not obvious at a transactional level.”
Without a data warehouse, users are on their own.
“They have to become an IT expert to understand the complexity of the raw data, or they have to get IT involved to write the reports. So the value of your reporting tool is compromised severely,” he says. “But once you have data that’s truly designed for reporting, it’s so much easier to work with. Now your business community has an easy-to-understand database. It’s easy to query and they’re empowered to do their business analysis job easily and efficiently.”
And contrary to popular belief, data warehousing doesn’t have to cost millions to implement and maintain and can be efficiently done with the database you’re already running–DB2/400.
“The industry talks about Oracle and Teradata. They’ve got the Gartner mindshare,” Jordan says. “It’s unfortunate that a lot of shops listen to that, drink the Kool-Aid, and not realize that IBM i on Power Systems and DB2 is perfectly suited to 90 percent of DW applications. If you’re a large telco or somebody where you have literally billions of transactions every day, then you might have a scalability issue and you need to go to one of those large-scale data appliances like Terradata or Netezza. Maybe at that end, iSeries isn’t scalable. But for your average casino or your average distribution company, there is absolutely no reason to take the data off the platform.”