Data Warehouse Mistakes Begin with i Avoidance
April 5, 2010 Dan Burger
If you ask Greg Veal what platform is best for creating a data warehouse, he’ll tell you it’s whatever platform you are using for your operational data. For readers of The Four Hundred, that platform is the IBM AS/400, which is officially referred to as the Power System running IBM i these days. Veal is intimately familiar with the IBM i. His experience dates back to the System/32, but his ideas are thoroughly modern.
From the viewpoint of a consultant who specializes in data warehousing projects and works on those that are big enough to warrant the hiring of an outside expert, Veal says the number of shops building data warehouses is increasing. However the number of shops building them on the i platform is dwarfed by those that choose to build them on Microsoft SQL Server.
“The most popular route is to port data to SQL Server,” is Veal’s observation.
“The three choices are: leaving the data on the i, which I believe is the best option; moving it to Oracle, which is the second best option except that it is expensive; and moving it to SQL Server. I used to think of SQL Server as a third-rate database, but it has come quite a ways and is pretty capable now.”
Of the data warehouse projects he sees, all of which have an IBM i as the main operational system, close to 100 percent involve moving data from DB2 for i, the integrated database in i5/OS and i, to SQL Server.
The problem with that strategy is that companies have a staff that knows IBM i and that knowledge is being wasted in many cases. There is a lack of understanding on the part of IT directors and business executives regarding the platform, the skills of the staff, and the benefit of using familiar tools and languages.
“Unfortunately, they don’t consider building the warehouse on i. They consider Oracle and SQL Server and they believe that Oracle is a little more robust, but they believe SQL Server will do the job for a lot less money,” he says.
Cost is certainly a consideration, but it can be misleading. IBM, as it often does with projects that it controls, helps propagate the belief that a data warehouse is an expensive proposition. Big Blue prefers to build its data warehouse projects using a framework that allows many plug-in features and automation for large-scale data transformations. The big enterprise shops will sign up for these types of packages and hire IBM services to come in and do the project.
Much can be accomplished without the highest levels of complexity and expense, Veal says. And the project manager in him offers some advice on how projects should be funded internally.
“All too often, IT funds these projects. That is a huge mistake,” he says. “The business or the department that benefits from the project should take responsibility for the project and pay for it. For instance, who knows how valuable adding another column of data is? The user does. If it’s on the IT budget, the added column idea might get cut. If it’s on the beneficiary department’s budget, it might get added.”
A better chance for success with a data warehousing project comes from involving business users. Staff is often working overtime doing two jobs while the project is in progress because the end users get involved in recommendation committees and helping with analyses. When the project isn’t owned by the executive in charge of the department that benefits, problems arise and IT will take the heat and get hit with the expenses.
“Politics factor into these things, and it works much better when IT is a paid assistant in the effort,” Veal says. “IT should be responsible for infrastructure and delivering the data accurately, but should not be the stewards of the data or the owners of the data. The business users validate whether the answers they are getting are right. It needs to be their project.”
It’s not uncommon for companies to ask for cost estimates and completion schedules prior to launching into a data warehouse project. That’s good business sense, right? In reality, Veal says, those estimates are almost always faulty. The person providing the estimate will typically tell the customer what they want to hear. No one should be surprised when projects don’t come in on budget and on schedule. If estimates were accurate, they would not often be approved. As it is, jobs are frequently awarded to the best liar.
Good project management can draw a box around one data mart and certain requirements, but projects seldom follow the original blueprint. “Scope creep,” those factors that expand the requirements and the cost of the project and the timeline for getting it done, can be counted for changing the final project enough to allow low-ball cost estimates to recover expenses that may have been left out of original estimates.
Pitfalls? Sure, there will always be mistakes made and toes will be stubbed. And all projects have some natural fluidity. Getting a handle on the variables, however, is important. Wading into the deep end of the pool before learning to swim is always more risky. You might want to consider a lifeguard.
“I have a bias because I am a consultant,” Veal admits, “but at the same time, you need to find someone who has done it successfully and has a track record. You want to make sure that knowledge transfers to your staff. And you bring in extra bodies, if necessary, to do the maintenance work while you get some of your internal people working on this project. When you own this project, it is a living thing. It doesn’t end because once people realize the power and the potential benefits, the project grows. If you get into this in a big way, understand that you will need someone for eternity–not just for the duration of the project. This person needs to be knowledgeable about building and managing data warehouses and how to optimize performance. You need a database administrator who knows data warehouses. Data warehouses are not built the same way as databases.”
This is a fairly typical scenario that Veal sees: Someone or some group within an organization says, “Hey, we could do some really cool reporting things and get away from printed reports or PDF conversions of those reports. We could actually have a system that allows the user to put in criteria. It can be done with SQL Reports and Reporting Services. So let’s put our data on SQL Server. We can do these reports that are more flexible for the user.”
That starts a project rolling. The company has staff with some SQL Server skills and they start putting ideas together without an in-depth look at the options. There’s an assumption that data is simply moved from one place to another and then users get a reporting tool. After the wheels are put in motion, the realization comes that data transformation is necessary and that someone needs to understand DB2 and how to use the SQL Query Engine.
Here’s another truism about data warehouse projects. When it comes to data cleansing, Veal says, expect to spend three to four times the effort than you first think you need to spend.
“This depends on what you are coming from,” he says, “but if you are coming from homegrown systems, you often have big-time troubles. People have data that will not go into a data warehouse.” As Veal points out, data warehouses require hierarchical structures. If-then logic can be done with code, but a data mart doesn’t have code. It has data. For data to be transferred into a data mart, it needs to be accessible without business logic stuck in the middle.
One form of data “dirtiness” is data that no longer has the relations that allow it to be asked any question and come back with an accurate answer.
“I am a huge advocate of writing your own programs to do data transformation,” Veal says. “Why? The vendors will say power users should do the definitions for the transformation. But you don’t want power users to be rule-based programmers every time a rule gets changed. The users will have to know a language. Each time they do it, they have to remember how to do it. Programmers, on the other hand, do it everyday, so it’s easy for them to make rule-based changes. Almost nobody’s data is so simplistic that the tools can do a good job with rules-based transformations. The vendors provide hooks for writing programs to handle the exceptions, but if you have to write RPG programs to handle the exceptions, and have users do the rest, why not let people do it in the language they know? I believe in custom ETL. It is cost efficient and it gets done better.”
There probably should be a lot of similarities from one data warehouse project to another, but there isn’t. Different companies have different ideas about what they want to do, how much they want to spend, what they think they need.
It bothers Veal that regardless of the size and scope of the project, most shops are unaware of the capabilities and the power of IBM i. Although it’s considered a workhorse, it’s known as a business machine that does transaction processing. Many of its most ardent proponents have not kept up with the advances the operating system and the hardware. This is symptomatic of companies that are not planning for the future.
“I believe we are in an information age–an information economy–and companies need to strongly informationalize, which means building data warehouses and taking advantage of business intelligence. If companies want to thrive, they need to invest in becoming knowledge organizations. Employees should get accurate and timely information quickly and easily. The thinking that all that’s needed are dashboards for the top-level managers is shortsighted. They are not seeing how powerful information in the hands of all the workers.”
Veal regularly presents sessions on data warehousing topics at the COMMONannual meetings. He’s also a subject matter expert for COMMON’s Business Intelligence/Database course of study, and an instructor at the RPG World conferences. For more information on him see: www.gregveal.com.
Data Warehouses: Know One When You See One?