RODIN Simplifies Maintenance of Data Warehouses
November 8, 2005 Alex Woodie
Keeping large data warehouses up-to-date with the latest data from production systems should be easier for iSeries shops running the new release of RODIN, an extract, transform, and load (ETL) tool from Coglin Mill. With RODIN V4R1M6, the company has introduced a new capability to identify just the files that have been changed since the last update, saving time and resources during nightly updates.
RODIN is a native OS/400 ETL tool designed specifically for building data warehouses (as well as data marts and operational data stores) on iSeries servers. The tool supports other databases besides DB2/400 as sources of data, and its RPG programs execute the same cleansing and normalization routines on this outside data that they do on native OS/400 data. But the target of these data transformations is always the OS/400 server, which the Rochester, Minn., company views as a superior platform for data warehousing.
Because a data warehouse is supposed to provide that “single version of the truth” about a company’s data, it makes sense that its owners take pains to keep them up to date with the latest data from the various ERP systems they’re connected to. This becomes especially important as the lines between what is an “offline” data warehouse used periodically for business intelligence activities begin to blur with what is an online operational data store that’s feeding downstream production applications, such as a CRM system.
For most types of transactional data, it’s a fairly simple matter to identify the new entries, deletions, or other changes that need to be updated on the data warehouse, says Alan Jordan, vice president of development at Coglin Mill. There are usually date stamps and timestamps attached to these transactions that make it easy to find changes to this kind of data.
But it’s more challenging to find updates to non-transactional data, such as changes made to a customer status field, or a new telephone number or address, Jordan says. Historically, companies have had to re-load entire tables to be sure that they’re grabbing the latest, greatest pieces of information for the data warehouse.
For customers with just a few thousand customer files, it was no big deal to reload the tables every night, even if there were few, if any, changes. But for large companies with a high volume of updates to apply every night, this “brute force” method of keeping their data warehouse up-to-date was unnecessarily resource intensive, Jordan says.
Coglin Mill’s research and development team in Australia applied itself to the problem, and the solution they came up with is called “change data capture,” or CDC. CDC finds changes by analyzing the journal log in OS/400. While there are no timestamps on changes to non-transactional data, IBM‘s journaling function still picks up any changes made to that data, as long as journaling is turned on. The CDC job then outputs a table that tells RODIN exactly which areas of the data warehouse to update, avoiding the need to reload entire tables.
The CDC is a “very lightweight job” compared to the brute force approach of re-loading entire tables, Jordan says. “It simplifies your processing. The logic to identify the changes is vastly simplified by CDC.”
In addition to the new CDC feature, RODIN V4R1M6 gains two other notable, enhancements, including double byte character support. DBCS is an important feature as Coglin Mill tries to break into the Japanese market. The company currently has a very large Japanese customer (an unnamed carmaker) using RODIN to update data from six iSeries Model 870s to another Model 870 serving as a data warehouse.
What’s notable about this implementation is the sheer number of ETL jobs running. These six iSeries servers support 1,000 locations, and each location has about 200 tables that need to be uploaded to the data warehouse each night. While RODIN already had what Jordan considered to be excellent scalability, the company had to do some tweaks in RODIN to allow it to support 200,000 ETL jobs in one update.
There may or may not be a final modification to V4R1 before the company unveils V5R1, currently slated for late March. This feature would provide a new way to access non-DB2/400 databases, and would be based on IBM WebSphere technology. If one of Coglin Mill’s customers doesn’t need this immediately, this feature will be included with V5R1. The main new capability that Coglin Mill is building into V5R1 is support for remote journaling. This will enable users to perform the CDC routines remotely, on the data warehouse, as opposed to the source system.
Companies today are finding all sorts of innovative uses for data warehouses beyond “a place for running queries against,” Jordan says. “It’s become critical for their business.” As far as data warehousing on the iSeries goes, business has picked up a bit for Coglin Mill since the downturn following the dot-com bubble burst and 9/11, Jordan says, but the sweet spot has shifted.
Instead of large manufacturers, distributors, and retailers, the main class of business using iSeries for data warehousing are those that put high value on security and availability–banks, financial services, and insurance–and a lot of the recent growth has come from small- to medium-size companies, as opposed to the large-cap customers Coglin Mill concentrated on in the late 1990s.
The iSeries has a perception problem when it comes to data warehousing, Jordan says. “If you’re a medium to large customer, and you go to IBM and say, ‘I want to build a data warehouse, what do I do?’ they’ll say ‘Put it on pSeries,’ even if you’re an iSeries shop,” he says. “Their corporate message is no ETL for iSeries.”
ETL solutions such as Informatica‘s PowerCenter (which we wrote about last week) and IBM’s Ascential tools are good, and support the iSeries as a source or target, Jordan says. The only problem with those solutions, from an iSeries perspective, is they don’t run natively on iSeries. “We believe we’re the best on iSeries, and iSeries is the best solution for data warehousing,” he says.
RODIN V4R1M6 is available now. The company sells licenses in various ways, including by LPAR and by CPW rating. One-year subscriptions for the software start at $15,000. For more information, visit www.coglinmill.com.