Jeff Jonas Explores the Nature of Data in COMMON Keynote
Corrected and Updated: May 21 and June 24, 2009
by Alex Woodie
Jeff Jonas knows a lot about data. Not only does the chief scientist of IBM's Entity Analytics group have a lot of actual data in his head, but he knows how to manipulate it and to get answers to security-related questions for governmental agencies and Las Vegas casinos. But without a breakthrough in how we store and query data, we'll soon be overwhelmed with more data than we can handle, leading to a decrease in understanding, Jonas told COMMON attendees during a keynote address at last month's convention.
At first glance, it appeared an interesting decision by COMMON managers to invite Jonas--an expert in creating IT security systems, with a focus on identity detection systems--to talk to a group of business-minded AS/400-types. Sure, the resident of Las Vegas, Nevada, probably worked with AS/400s while helping some of the biggest casinos on the Strip get a better grasp on identifying employees, vendors, and guests who might be participating in scams (it's a well-known fact that the biggest casinos run AS/400 iron). But as a security expert, Jonas' expertise is only tangentially related to AS/400 technology. Or is it?
In fact, Jonas' experience working with casinos and governmental agencies led to some interesting observations that cut across technological boundaries. During his keynote, Jonas--who blogs at www.jeffjonas.typepad.com--wowed the audience with enlightening stories about the nature of data, secrets of data mining, and the types of technological breakthroughs that are necessary if the IT industry wants to continue to claim that it's widening the breadth of what's knowable by users, companies, governments--by human beings--not shrinking it. From a business-oriented, AS/400 point of view, this has applicability to business intelligence and security.
First, Jonas established his bona fides with his audience. He founded Systems Research and Development (SRD) in 1983. There, he developed a technology called Non-Obvious Relationship Awareness (NORA) that can be used to spot similar identities across two or more databases. It was (and apparently still is) used in Vegas casinos. In 2001, SRD received funding from In-Q-Tel, which Jonas described as the private venture capital arm of the CIA, to help find criminals. After 9/11, Jonas was called to work with government agencies to help find terrorists. In 2005, SRD was bought by IBM and turned into IBM's Entity Analytics Solutions group. He's a true IT security geek and an accomplished triathlete, with a somewhat imposing demeanor and a rapid-fire way of talking.
IBM distinguished engineer Jeff Jonas, from his YouTube video on distributed security.
During his keynote, Jonas related a story that provided a good entry into the types of questions about data that he wrestles with. On a trip to Washington, D.C., Jonas spoke with a counter-terrorism intelligence analyst at a governmental agency. "What do you wish you could have if you could have anything?" Jonas asked her. Answers to my questions faster, she said. "It sounds reasonable," Jonas told the audience, "but then I realized it was insane." Insane, because "What if the question was not a smart question today, but it's a smart question on Thursday?" Jonas says.
The point is, we cannot assume that the data needed to answer the query existed and had been recorded before the query was asked. In other words, it's a timing problem. "I said, 'What are the chances you could have every smart question, every day?'" Jonas asked. It's not a trivial question, and it doesn't have an easy answer. But it is Jonas' goal, and however technically difficult it may be, he says it is attainable.
According to Jonas, organizations need to be asking questions constantly if they want to get smarter. If you don't query your data and test your previous assumptions with each new piece of data that you get, then you're not getting smarter.
Jonas related an example of a financial scam at a bank. An outside perpetrator is arrested, but investigators suspect he may have been working with somebody inside the bank. Six months later, one of the employees changes their home address in the payroll system to the same address that appears in the case. How would they know that occurred, Jonas asked. "They wouldn't know. There's not a company out there that would have known, unless they're playing the game of data finds data and the relevance finds the user."
This led Jonas to expound his first principle. "If you do not treat new data in your enterprise as part of a question, you will never know the patterns, unless someone asks."
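The bank example can be illustrated with a minimal Python sketch in which every incoming record is stored and, at the same moment, used as a query against everything already stored. This is only an illustration of the "data finds data" idea, not Jonas' or IBM's implementation; the `ContextStore` class, the source names, the record IDs, and the address are all hypothetical.

```python
from collections import defaultdict


class ContextStore:
    """Toy index in which every incoming record is also treated as a query."""

    def __init__(self):
        # Maps (field, normalized value) -> list of (source, record_id) seen so far.
        self._index = defaultdict(list)

    @staticmethod
    def _normalize(value):
        # Lowercase and collapse whitespace so trivially different spellings match.
        return " ".join(str(value).lower().split())

    def ingest(self, source, record_id, fields):
        """Store a record and return earlier records that share any field value."""
        matches = []
        for field, value in fields.items():
            key = (field, self._normalize(value))
            for other in self._index[key]:
                if other != (source, record_id):
                    matches.append((field, other))
            self._index[key].append((source, record_id))
        return matches


store = ContextStore()
# Hypothetical data: a fraud case records the perpetrator's address.
first = store.ingest("fraud_cases", "case-1", {"address": "12 Elm St"})
# Months later, payroll records an employee's change of address.
later = store.ingest("payroll", "emp-7", {"address": "12  elm st"})
print(first)  # nothing to match yet
print(later)  # the new payroll record "finds" the earlier case record
```

In this sketch nobody has to ask whether an employee lives at the perpetrator's address; the payroll update itself surfaces the connection the moment it arrives.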
Constantly asking questions and evaluating new pieces of data can help an organization overcome what Jonas calls enterprise amnesia. "The smartest your organization can be is the net sum of its perceptions," Jonas told COMMON attendees.
Getting smarter by asking questions with every new piece of data is the same as putting a picture puzzle together, Jonas said. This is something that Jonas calls persistent context. "You find one piece that's simply blades of grass, but this is the piece that connects the windmill scene to the alligator scene," he says. "Without this one piece that you asked about, you'd have no way of knowing these two scenes are connected."
Sometimes, new pieces reverse earlier assertions. "The moment you process a new transaction (a new puzzle piece) it has the chance of changing the shape of the puzzle, and right before you go to the next piece, you ask yourself, 'Did I learn something that matters?'" he asks. "The smartest your organization is going to be is considering the importance right when the data is being stitched together."
Another project (not related to the government, but a commercial effort) had Jonas assisting an organization in compiling a database that correlated the identities of Americans with pieces of data from public records (such as property records, DMV records, phone books, etc.). He knew there were about 300 million people in the U.S. But as Jonas started loading the data into his warehouse, the machine soon counted more than 300 million Americans. "We keep loading it, and pretty soon it says there are 600 million people in America--and if the number kept climbing to three billion, it surely would be a piece of junk. But my theory was it would collapse," he said.
He was right. Consider what happens when two records appear to describe two different people who happen to share the same name. "What happens is a third record shows up in the future that works like glue, which causes them to collapse," he said. Eventually, "the more data we loaded, the fewer number of people there were."
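The "glue" effect can be sketched in a few lines of Python. This is a deliberately naive illustration, not a description of NORA or of the system Jonas built: it assumes that two records belong to the same person whenever they share any identifying value, and it uses a union-find structure to merge them. The `EntityResolver` class, the record IDs, and the phone and address values are hypothetical.

```python
class EntityResolver:
    """Toy resolver: records sharing any identifying value collapse into one entity."""

    def __init__(self):
        self._parent = {}      # union-find parent pointer per record id
        self._first_seen = {}  # (field, value) -> first record id carrying it

    def _find(self, rid):
        # Walk up to the representative record, shortening the path as we go.
        while self._parent[rid] != rid:
            self._parent[rid] = self._parent[self._parent[rid]]
            rid = self._parent[rid]
        return rid

    def add(self, rid, identifiers):
        """Add a record; merge it with any entity sharing one of its identifiers."""
        self._parent[rid] = rid
        for item in identifiers.items():
            if item in self._first_seen:
                a = self._find(rid)
                b = self._find(self._first_seen[item])
                self._parent[a] = b
            else:
                self._first_seen[item] = rid

    def entity_count(self):
        return len({self._find(rid) for rid in self._parent})


resolver = EntityResolver()
counts = []
# Two records for people with the same name; the name alone is not used
# as a merge key here, so they look like two different people.
resolver.add("r1", {"phone": "555-0100"})
counts.append(resolver.entity_count())
resolver.add("r2", {"address": "9 Oak Ave"})
counts.append(resolver.entity_count())
# A third record carries both values and glues the first two together.
resolver.add("r3", {"phone": "555-0100", "address": "9 Oak Ave"})
counts.append(resolver.entity_count())
print(counts)
```

With these three hypothetical records the entity count rises from one to two and then falls back to one, which is the same shape as the collapse Jonas describes: more records, fewer people.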
But large numbers can also work against you. At another federal agency (he wouldn't say which), Jonas got to thinking: What if they had a very large data warehouse in the basement with 4 exabytes (EB) of data, and it was expanding at the rate of 5 TB per minute? "You sit there and you realize you don't get to Friday night and run a batch job to answer the question of what does it all mean," he says. "You could use all the computing power and energy on Earth and you wouldn't be able to do it." The "it" he is referring to, of course, is seeing how each new piece of data affects all the other pieces of data.
"What's happening is data volumes are growing at this pace, yet an organization's ability to make sense of them isn't keeping up," Jonas said. "Today, say you can make sense of 7 percent of what's available, and in a few years it might be 4 percent, and in a few years after that it might be one percent. So the percentage of what's knowable is on the decline."
So while the sum of our knowledge is increasing, the ratio of what's knowable to the data that's available is getting smaller. Without some new technology to help "stitch things together," as Jonas puts it, we'll soon be wallowing in gobs of structured and unstructured data, with no discernible path out.
"I think the only way forward is going from applying algorithms to individual transactions, to first placing information in context--pixels to pictures--and only applying algorithms after one sees how the transaction relates to the other data," he said. "It's the only way that I can see that it's going to close this sense-making gap."
There is one thing software vendors can do to make their sense-making products more useful for the coming information explosion, Jonas said: Unify the data and the tools people use to query it.
Jonas sees this type of technology--loading queries into a database as data--helping to overcome the counter-terrorism intelligence analyst's dilemma of knowing when a question can be answered. "This is a nice and easy method that enables a future piece of data to find the question," he said in a follow-up e-mail after this story was first published. "In other words, if the question asked by the user has no answer today…if a piece of data that can answer the question arrives tomorrow, the system can alert the user that their question is now true."
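One way to picture "loading queries into a database as data" is the following Python sketch, in which an unanswered question is stored as a plain record and every new piece of data is checked against the stored questions. It is an assumption-laden illustration rather than the technology Jonas describes: the `StandingQueries` class, the `matches` helper, the user name, and the phone numbers are all hypothetical, and a query is simplified to a set of field values that a record must contain.

```python
def matches(query, fact):
    """A query is plain data: a dict of field values the fact must contain."""
    return all(fact.get(field) == value for field, value in query.items())


class StandingQueries:
    """Toy store that keeps unanswered questions and tests them against new data."""

    def __init__(self):
        self._facts = []    # every record seen so far
        self._pending = {}  # question id -> (user, query)

    def ask(self, qid, user, query):
        """Answer from existing data if possible; otherwise keep the question."""
        hits = [fact for fact in self._facts if matches(query, fact)]
        if not hits:
            self._pending[qid] = (user, query)
        return hits

    def add_fact(self, fact):
        """Store a new record and return alerts for questions it now answers."""
        self._facts.append(fact)
        alerts = []
        for qid, (user, query) in list(self._pending.items()):
            if matches(query, fact):
                alerts.append((user, qid, fact))
                del self._pending[qid]
        return alerts


db = StandingQueries()
# Today the question has no answer, so it is kept as data.
today = db.ask("q1", "analyst", {"phone": "555-0100"})
# An unrelated record arrives: no alert.
unrelated = db.add_fact({"phone": "555-0199", "source": "directory"})
# Later, a record arrives that answers the stored question.
alerts = db.add_fact({"phone": "555-0100", "source": "directory"})
print(today, unrelated, alerts)
```

The point of the sketch is the direction of the lookup: the user asks once, and it is the arriving data that later finds the question and triggers the alert.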
IBM has shown a lot of interest lately in developing so-called "smart" sensor technology that sounds a lot like what Jonas is proposing. But is such a self-aware system even technologically attainable? "I see [the technological challenges] as trivial," he says in his e-mail follow-up. "This works well and is quite attainable."
This article has been corrected. In paragraph 13, the records on Americans that Jonas once analyzed for a private company came from public sources, such as property records and the phone book, not credit cards or employment rolls, as the story originally stated. Also, in paragraph 15, Jonas never worked for the government agency mentioned in the story, and had no direct knowledge of A) whether there was in fact a data warehouse in the basement, and B) how big that data warehouse might be if there, indeed, was one. Jonas also clarified several other statements attributed to him in the story.