The Data Quality Inflection Point
September 4, 2007 Alex Woodie
Got data quality issues? If you’re like many businesses that have struggled with the exponential growth of data over the last few years, you have your fair share of bad data hiding across your servers. But that shouldn’t relegate you to living in the gray world of questionable data, argues Arvind Parthasarathi, a senior director of solutions at data management tool vendor Informatica. In fact, we find ourselves at an inflection point that allows us to take control and turn the data quality problem around, Parthasarathi explains in this Q&A.
Alex Woodie: Do you get the sense that people are taking data quality seriously? Do you think it’s getting the attention it deserves?
Arvind Parthasarathi: We’re at a very interesting point in the history of data quality. You could argue that data quality has been a problem since way before computers. With the advent of computers, that problem became a lot more challenging because now the amount of data that we’re dealing with is much larger. Over the past 20 to 30 years, data has been growing at an exponential rate, and the same questions that we’ve been asking for literally hundreds of years are taking on new importance.
We’re at an inflection point now for data quality, and it’s coming from a couple of things. It’s partly due to the Internet and the communication that we have, because now data is flowing throughout the entire organization. Another reason for the inflection point is the world of compliance. The adage about going to jail for bad data is definitely true. The compliance and governance regime is where we have a non-monetary incentive to get this problem solved.
AW: What different forms does bad data take, and what steps can you take toward fixing them?
AP: We measure data quality along a number of dimensions. The first one is just consistency. Another one is accuracy. You look at the conformity of data. There is also the completeness of the data. You need to have a structured picture of where your problems are, so one of the first things we recommend is a data quality assessment. Think of it as a red-yellow-green report card. Based upon your business priorities, we identify the first thing you want to tackle, then the next thing you want to tackle. It behooves us to deliver quick value for what we do. I can’t come in and say, “Go with me on a four-year journey to solve data quality,” where you’ll see the benefits in that timeframe. It’s all about, “What can we come in and do now?”
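A rough sketch of what such a red-yellow-green report card might compute. The records, the completeness and conformity checks, and the grading thresholds below are all illustrative assumptions, not Informatica’s actual assessment tooling:

```python
# Hypothetical data quality report card: score records along two of
# the dimensions mentioned (completeness, conformity) and roll each
# score up to a red/yellow/green grade. Thresholds are made up.
import re

def completeness(records, field):
    """Fraction of records where the field is present and non-empty."""
    filled = sum(1 for r in records if r.get(field))
    return filled / len(records)

def conformity(records, field, pattern):
    """Fraction of records whose field matches an expected format."""
    ok = sum(1 for r in records if re.fullmatch(pattern, r.get(field, "")))
    return ok / len(records)

def grade(score, green=0.95, yellow=0.80):
    """Map a 0-1 score to a report-card color (assumed cutoffs)."""
    return "green" if score >= green else "yellow" if score >= yellow else "red"

customers = [
    {"name": "Acme Corp", "zip": "60614"},
    {"name": "Widgets Inc", "zip": "6061"},   # malformed ZIP code
    {"name": "", "zip": "10001"},             # missing name
]

report = {
    "name completeness": completeness(customers, "name"),
    "zip conformity": conformity(customers, "zip", r"\d{5}"),
}
for dimension, score in report.items():
    print(f"{dimension}: {score:.0%} -> {grade(score)}")
```

The same scores, tracked over time per business priority, are what would let an assessment say “tackle this first, then that.”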
AW: What are the sources of bad data? Does it come down to garbage in, garbage out, or are there other more insidious places it comes from?
AP: It goes back to what you mean by bad. My bad may not be your bad. Clearly there are a number of human-introduced errors. It could be at the point of entry, or it could be as the data is going through a business process. It also could be because, in today’s world, there are a lot of extended enterprises, so a lot of functions are outsourced. It’s not only your organization and your processes, but other people’s data sets and processes that impact your quality of data.
AW: What new approaches is Informatica taking to improve data quality?
AP: You can look at the data and say, this is broken, let’s go fix it. But the next generation of data quality, what we’re implementing with some of our customers now, is preventative data quality, which means, How do we get in front of this business process to prevent these errors from being introduced? For a lot of ERP-type applications, you have data coming in at the point of entry, or a system analyst will key in an order in an SAP or Oracle screen. We can actually trap that interface or trap that interrupt right at the point where they’re entering the data into SAP or Oracle, and it punches out into our data quality environment. We can cleanse that order and make sure it’s correct and along the lines of what the company wants. I think we’ll see more and more of that moving forward.
It’s like spring cleaning in your house. You can say every year we’re going to do spring cleaning and we’re just not going to worry about cleaning our house in between. But if every time you wash a dish or have a meal you do a little bit of work as part of the process, you never get into a situation where you have to do these massive spring cleanings. And the massive spring cleanings are expensive. Companies don’t want to do them. It’s hard to free up that kind of time, effort, and money, whereas it’s much easier to instrument data quality into everyday business processes.
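The preventative, point-of-entry idea above can be sketched as a simple gatekeeper: cleanse the record, validate it, and only commit it if it passes. The field names, the rules, and the `submit_order` hook are all hypothetical stand-ins, not Informatica’s or SAP’s actual interfaces:

```python
# Hypothetical preventative data quality: intercept a record at the
# point of entry, cleanse it, and reject it if it still fails
# validation, so bad data never reaches the backend system.

def cleanse_order(order):
    """Normalize an incoming order: trim whitespace, upcase codes."""
    cleaned = dict(order)
    cleaned["customer_id"] = cleaned.get("customer_id", "").strip()
    cleaned["country"] = cleaned.get("country", "").strip().upper()
    return cleaned

def validate_order(order):
    """Return a list of problems; an empty list means it may be committed."""
    problems = []
    if not order["customer_id"]:
        problems.append("missing customer_id")
    if len(order["country"]) != 2:
        problems.append("country is not an ISO 3166 two-letter code")
    if order.get("quantity", 0) <= 0:
        problems.append("quantity must be positive")
    return problems

def submit_order(order, commit):
    """Gatekeeper: only clean, valid orders reach the commit callback."""
    cleaned = cleanse_order(order)
    problems = validate_order(cleaned)
    if problems:
        raise ValueError("; ".join(problems))
    commit(cleaned)

orders = []  # stand-in for the downstream ERP system
submit_order({"customer_id": " C-1001 ", "country": "us", "quantity": 3},
             orders.append)
print(orders)
```

The “little bit of work every time” from the spring-cleaning analogy is exactly this per-record cleanse-and-check, done inline instead of as a massive batch cleanup later.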
AW: Do you find errors congregate in some areas more than others?
AP: Data quality issues usually manifest themselves in business processes, which usually span multiple applications, and that’s the number one reason why they tend to be insidious. You take a process that has 10 steps, and you can look at the individual steps and say we’re doing a good job here, but when you put it all together and look at data quality across [the process], you realize it’s broken. And that’s the value of the notion of a data quality competence center or a data quality foundation that many of our customers are putting in.
AW: What’s the first step in putting together a data quality competence center?
AP: The number one step is you need to get buy-in from the business. While there are tools and all these things that software can provide, ultimately, like most IT projects, you really require the customer to buy in. And that’s where I think the inflection point is helping, because a lot of these initiatives are driven by the businesses. They have chief risk officers and compliance officers, and certain companies actually have data czars. Having that kind of formal organizational infrastructure is the first step, and from that everything else can flow: you get the support, the buy-in, the resources, and the budget, and then you can kick off your software implementation.
AW: What are some common pitfalls customers might run into as they start a data quality project?
AP: To come back to the notion of an inflection point . . . this is not something that has been tried and tested. This is not something that organizations have been solving for the last 40 years. As people are coming up to speed on this, one of the first things they find is there are a lot of problems they can go solve. As the output of a data quality assessment, we identified at one customer almost 63 different projects that they could go do. So let’s not boil the ocean, let’s pick one that has high impact to the business, and the best ROI, so we can start proving the model.
AW: What kind of results can businesses expect from a data quality initiative?
AP: It varies from industry to industry, but data quality projects tend to have very, very high ROI. The results can be fairly phenomenal. At the same time, customers get cagey about that, and there are competitive reasons why they want to keep that stuff in-house. But generic studies that have been done in the retail industry show that simply getting a handle on the core issues, improving invoices and supplier contracts and things like that, can actually be worth from 1 to 3 percent of total sales in some industries. The interesting part is that it’s not that there’s a line item on the balance sheet that says “data quality losses,” or something like that. It’s that by improving data quality you begin to see improvements in all the line items on the balance sheet. People productivity improves, asset management improves, potential sales improve, losses shrink. Competitively, you get better off. That’s why the percentages are very high.
AW: Do you find the same types of data quality problems in older mainframe-based applications as in newer Windows-type machines?
AP: Clearly a mainframe system that has been somewhat neglected over the past 15 years will have more than its share of problems. But if you go and solve the mainframe problems alone, I don’t think you’ll get the bang for the buck. If you take any system, there’s this notion of data decaying. If you have a customer database, people move, people change their phone numbers, people get married, people get divorced. That’s one of the reasons that it affects all systems. Now, it affects some systems more than others. But you have to solve it for the entire business process.
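The data decay idea can be illustrated in a few lines: records that haven’t been re-verified within some window get flagged for re-validation, because people have moved or changed numbers in the meantime. The one-year window below is an assumed figure, not something from the interview:

```python
# Illustrative "data decay" check: customer records go stale over
# time, so anything not verified within the window is flagged.
from datetime import date, timedelta

STALE_AFTER = timedelta(days=365)  # assumed decay window of one year

def stale_records(records, today):
    """Return the records whose last verification is older than the window."""
    return [r for r in records if today - r["last_verified"] > STALE_AFTER]

customers = [
    {"name": "Lee", "last_verified": date(2007, 6, 1)},
    {"name": "Park", "last_verified": date(2005, 3, 15)},
]
print([r["name"] for r in stale_records(customers, date(2007, 9, 4))])
```

Run periodically across every system that holds customer data, a check like this is what turns decay from a mainframe-only cleanup into a whole-business-process concern.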
AW: Can you tell me a little bit about the Informatica Data Quality tools, formerly Similarity?
AP: That was a company that Informatica bought in January of 2006. Over the past year and a half, we’ve really expanded and grown that business. It’s a very strategic business for Informatica. It’s one where there’s a lot of investment and focus. The product itself is an end-to-end data quality platform. If you look at the notion of data quality, there are six steps that you have to go through. You have to discover what’s going on; this is the notion of assessment. Then there’s the notion of data profiling: get the data out, try to understand what’s going on, and tie it into the assessment. From there you move on to a definition stage, where you define where you want to get to. That’s usually business users saying, for example, that if you have an address database, there’s no point in being 100 percent accurate if there’s no value to the business.
Improvement is the next cycle, where you want to start cleansing everything. Now that I’ve figured out what the problem is, defined where I need to be, and assessed where I am, I go about improving it. Then I establish a steady state, which is the monitor stage. I don’t want this to be a one-off thing; I want to be able to instrument this into my business process. So the functionality of the product allows you to do that entire lifecycle.
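The assess-define-improve-monitor loop described here can be sketched in miniature. The 95 percent target, the profiling rule, and the cleansing pass below are placeholders, not the product’s actual functionality:

```python
# Minimal sketch of the lifecycle: profile (discover), compare against
# a defined target, improve when needed, and keep monitoring.

TARGET = 0.95  # "define": where the business decided it wants to be

def profile(values):
    """'Discover': measure the share of well-formed (non-blank) values."""
    good = sum(1 for v in values if v and v.strip())
    return good / len(values)

def improve(values):
    """'Improve': placeholder cleansing pass (trim whitespace, drop blanks)."""
    return [v.strip() for v in values if v and v.strip()]

def monitor(values):
    """'Monitor': flag when quality drifts below the defined target."""
    score = profile(values)
    return score >= TARGET, score

raw = ["60614", " 10001 ", "", "94107", None]
ok, score = monitor(raw)   # assess the current steady state
if not ok:
    raw = improve(raw)     # cleanse when it drifts below target
print(monitor(raw))        # → (True, 1.0)
```

Because `monitor` is just a function over the live data, it can be wired into the everyday business process rather than run as a one-off project, which is the steady state the answer describes.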
AW: Why wouldn’t every business want to have 100 percent accurate data all the time?
AP: You see industries where you need 99 percent uptime, or 99 percent data quality, because you can’t afford even a single error. There are industries that have a life aspect to them, healthcare, pharmaceutical, aerospace, where you just can’t be wrong. And there are other industries where a much lower level is perfectly tolerable. In fact, if you take a consumer-driven environment, you may not be able to get above a certain percentage because your user base, especially if you have millions of people, is just continually in flux. It becomes a little bit of a return-on-investment type curve: how much more do I want to spend? Some customers say, “In my industry 50 to 70 percent is normal, and I’m okay being at 50 percent. I’m not willing to spend the additional money to get to 70 percent.” And others say, “Okay, fine, I want to be at 70 percent because I want to lead my industry and that’s how it helps my business process.” So it depends on the industry and the customer. If you’re a direct marketing company, and your entire bread and butter is having this customer database, then clearly that’s probably going to be a big component of it, but for a lot of other industries that may not be the case.