Data Warehouses: Know One When You See One?
January 26, 2009 Dan Burger
People may have problems defining a data warehouse. They may have problems designing and building a data warehouse, too. But that hasn’t slowed down the desire to get a better grasp on data and use it more effectively to drive the business and preserve revenues. “What many people describe as a data warehouse is actually a data mart or something else,” says Bill O’Connell, chief technology officer of data warehousing within IBM’s Information Management division.
What distinguishes a data warehouse is an enterprise design and a definition of the business problem to be solved, O’Connell says. The emphasis is on consolidating data in an environment that can grow in a linear, scalable way from the low terabytes up to the hundreds of terabytes. It means being able to do everything from heavy-duty data mining to operational analytics within the same system.
Although data warehouse projects are often associated with large enterprises and big budgets, this is a notion disproved by many companies in the SMB arena. “There is a cost involved in building an enterprise data warehouse system,” O’Connell says. “It’s not just IT infrastructure or the cost of the warehouse. You have to consider data governance, stewardship, business processes, and your organization structure. All those factor into the maturity of the system and the business.”
Large enterprises were leading the data warehouse adoption early on because they could afford it, but the motivation came from understanding how these projects could ultimately be used to drive business.
From where O’Connell sits, he sees data warehouses reaching the SMB shops “where they have always done analytics. I wouldn’t call it warehousing, but that’s changing. They are building small warehouses now.”
Putting an accurate gauge on the current data warehouse market isn’t particularly easy. The latest IDC report on data warehouse software is based on numbers from 2007, which seems like eons ago. The figures speak loudly, however, showing a 15 percent gain in software revenues compared to 2006. Sales of data warehousing software also delivered double-digit increases in the previous two years.
Based on the revenue numbers relating to 2007, IDC ranked IBM number one in the category of data warehouse generation tools. Big Blue was followed by SAS Institute, Informatica, Microsoft, and Oracle. In the data warehouse management tools category, again based on revenue, the leader was Oracle, followed by IBM, Microsoft, Teradata, and SAS.
The complete list of data warehouse software vendors is indicative of the thriving nature of this market. These revenue leaders are the top of the pyramid, and it’s no coincidence that they design, market, and sell databases in almost every case.
“The future of the data warehouse platform software market remains bright,” says Dan Vesset, vice president of business analytics research at IDC. “As various business intelligence and analytics projects remain high on the priority lists of organizations of all sizes, the demand for data warehouse platform software to support these business intelligence and analytics projects is likely to continue to grow.”
Among System i users, data warehouse projects probably lag behind the market as a whole, but the same reasons for implementing one still apply.
“From its beginnings as the AS/400, the box has enabled users to do a lot of things in terms of reports and queries that other platforms and databases were unable to do or could only do with great difficulty,” says Alan Jordan, vice president at Coglin Mill. “For a long time, the AS/400 was ahead of its time and maybe people came to rely on that too much and for too long. Data warehousing and business intelligence technology has advanced and left many of the iSeries and System i users behind.”
“A lot of organizations think all they need is a reporting tool,” Jordan continues. “They lived with Query/400 for years. And now there are all sorts of reporting tools with great capabilities. But there is a reason for having a well-designed business intelligence architecture, which is what a data warehouse is.”
Bill Langston, the director of marketing at New Generation Software, has a slightly different perspective on what the System i user is looking for. His view is more toward the line of business user or the departmental user, an area where data marts are popular.
“We find companies are very focused on a combination of real-time analysis and reporting based on live, production data in areas like shipping, logistics, inventory, and customer service and historical performance trends, in areas like finance and sales,” Langston says. “Data marts are the preferred way to obtain the historical information. Mid-market companies, especially today, don’t have many dedicated business analysts and even senior managers in these companies are often very hands-on when it comes to day-to-day operations. As a result, they tend to be more tactical in their perspective and much more cautious about data warehousing.”
The terms data warehouse and data mart are often used interchangeably, which is fine with some people and argued about by others. Without going down that road too far, let’s just say that you should always make certain you are on the same page with whomever you are talking to when these terms come up.
“The whole issue with the System i is you want to exploit that operating system,” O’Connell says. “Because it’s now running on Power hardware, users can put AIX and Linux on it. The benefit is that we can integrate databases to the operating system and it becomes a ‘black box’ approach.”
The types of data warehouse implementations that O’Connell typically encounters involve multiple platforms, but the System i is no stranger. He calls IBM’s low-end SMB customers the System i “sweet spot,” particularly in Europe. Most often the data warehouse handles less than 10 terabytes of information in a departmental situation rather than an enterprise-wide system.
“A lot of companies rely on the i because it is very simple and it is a hands-off box,” O’Connell says. “If you are going to take data off an i and do analytics off of that, and decide what applications will be brought into a single warehouse, you will put the data warehouse on the i. The people have the skills and are used to it. In an enterprise-class warehouse, I’m bringing together many different sources, and many different lines of business and different departments–some running on i, some on x, some on p, and some mainframe–that’s a different game.”
“We wouldn’t use a System i to build a warehouse that scales up into the tens of terabytes of raw data,” O’Connell says. “Those are very complex and are usually done on System p or System x. They use a lot of small servers and they grow the warehouses linearly by adding servers.”
Of course, there is no good reason why you could not use DB2 clustering technology to cluster a bunch of low-end Power Systems i boxes together to make a big database engine to run a data warehouse on. DB2 Multisystem exists, and has for 13 years.
O’Connell says the task of implementing a data warehouse is no more difficult on a System i than any other platform.
“I can do anything [in terms of data analytics] on a System i that I could do on a System x or System p,” he says. “Some of the tools may not run on every server, but in a client server relationship, I can run the tools somewhere else. The tools can be accessing data on the i no matter where they are running.”
There has been some feedback from System i users that they would prefer tools that run native, however.
Building data warehouses can be fairly simple or very complex. It depends on what O’Connell calls the maturity of a warehouse. A first phase might be a system that does ad hoc and batch reports. This differs from what most organizations are doing now, because building a data warehouse requires scrubbing the data–eliminating the “garbage in, garbage out” factor. As the data warehouse “matures,” it gains the capability to analyze data. It explains why something happened in addition to telling you what happened. As the data warehouse advances and more functionality is added, it can do predictive and discovery analysis as well as matching and mirroring data, which means taking past data and comparing it to on-the-fly, current data.
Often overlooked in the excitement of building a data warehouse and gaining valuable business intelligence capabilities is the taking-care-of-business side.
“You can’t just build an enterprise warehouse right away,” O’Connell warns. “To do this right requires changes in business processes, changes in the business itself, and in organizational structure. All these things must happen in parallel. Helping customers move along as fast as possible is what we do. Learn to deal with their data, sunset old applications, consolidating environment, reconciling the data, understand the data in relation to data governance, publishing that data around metadata representation so the business can see it. All that must happen as well. The technology is the easy part. But what it takes to exploit this is much more complex.”
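To make the scrubbing step concrete, here is a minimal sketch of what a first-phase load might do before any report runs. It is an illustration only, not something O’Connell described: the function name `scrub`, the `order_id` and `amount` fields, and the rejection rules are all assumptions chosen for the example.

```python
def scrub(rows):
    """Split raw rows into (clean, rejected) before loading the warehouse.

    Hypothetical rules: a row needs a non-empty, unique order_id and a
    numeric, non-negative amount. Everything else is set aside for review.
    """
    clean, rejected, seen = [], [], set()
    for row in rows:
        order_id = str(row.get("order_id", "")).strip()
        try:
            amount = float(row.get("amount"))
        except (TypeError, ValueError):
            rejected.append(row)  # amount is missing or not a number
            continue
        if not order_id or order_id in seen or amount < 0:
            rejected.append(row)  # blank key, duplicate key, or negative amount
            continue
        seen.add(order_id)
        clean.append({"order_id": order_id, "amount": amount})
    return clean, rejected


raw = [
    {"order_id": " A1 ", "amount": "10.5"},
    {"order_id": "A1", "amount": "3"},     # duplicate of A1
    {"order_id": "", "amount": "4"},       # no key
    {"order_id": "B2", "amount": "abc"},   # garbage amount
    {"order_id": "D4", "amount": 7},
]
clean, rejected = scrub(raw)
```

Keeping the rejected rows rather than silently dropping them gives whoever owns the data something to correct at the source, which is where the “garbage in” originates.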
Among the misconceptions about what a data warehouse truly is, the big-picture concept is the one thing that is most often missing.
Data warehousing is not business intelligence or a fancy derivative of queries and reports. In Jordan’s words, a data warehouse “stores all the relevant information that can be used for business intelligence. It is detail-level information without the irrelevant data such as control fields and other meaningless codes that are useless from a business analysis perspective. The data becomes understandable and error free. It’s quality control on data.”
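Jordan’s point about control fields and meaningless codes can be sketched in a few lines. The field names, the code table, and the function `to_warehouse_row` below are hypothetical, invented for illustration; a real warehouse load would take them from the source application’s own file layouts.

```python
# Hypothetical housekeeping columns that mean nothing to a business analyst.
CONTROL_FIELDS = {"rec_status", "last_upd_pgm", "batch_no"}

# Hypothetical lookup that turns an internal code into a readable value.
REGION_NAMES = {"01": "North", "02": "South"}


def to_warehouse_row(source_row):
    """Keep detail-level business data, drop control fields, decode codes."""
    row = {k: v for k, v in source_row.items() if k not in CONTROL_FIELDS}
    code = row.pop("region_cd", None)
    row["region"] = REGION_NAMES.get(code, "Unknown")
    return row


source = {
    "cust": "Acme",
    "region_cd": "02",
    "sales": 100,
    "rec_status": "A",
    "batch_no": 17,
}
print(to_warehouse_row(source))
```

The detail stays at the level of the individual record; only the columns an analyst cannot use are removed, and the cryptic code is replaced with a name.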
Correcting misconceptions and fine-tuning definitions so that everyone is speaking the same language is part of the process O’Connell goes through with just about every customer. Beyond that, he has a few questions and some all-purpose advice ready.
“When I work with a customer, I look at two things,” he says. “Let’s determine where you want to be in five years. Then start building toward that. We always build a warehouse one application at a time. That way we get value back quickly. The cost of putting up the initial stages of a warehouse should be short. It should be running in three months. And you get value right away. Then you add another application or another function and then that’s up and running in another three months. And value follows that. When it stops growing, it becomes a cost center. But it should always be growing and applications should always be added to increase value. We have warehouses that have dozens of applications going live every week.”