Humans Fight, But Watson’s Chips Beat Quiz Champs
February 21, 2011 Timothy Prickett Morgan
Like a lot of Americans, the only time I ever watch the Jeopardy! game show is if I happen to be exhausted on a Friday night, I happen to have my work done early, and I happen to feel like spending a little cash to go down to the local beer and burger joint with the wife and kids. And when we do go, we inevitably fight over who gets to sit on the benches that face the TV so we can watch the game clues and try to come up with the questions and play along.
But last week, from Monday through Wednesday, my kids and I sat absolutely riveted to the TV at 7 p.m., watching the contest between the all-time Jeopardy! champs, Ken Jennings and Brad Rutter, who took on IBM’s Watson question-answer machine (what some people are calling a supercomputer) in a two-round, three-day championship tournament that pitted men against machine on the slippery puns of this game show, which is as old as IBM’s System/360 mainframe.
When I talked with IBM principal investigator David Ferrucci back in April 2009, when the “Blue J” project was launched to the world as the Watson QA system, he told me that the machine would be processing natural language like we humans do. And in a classic example of how tough it is to actually communicate in this world, what I heard when Ferrucci said that was “audio and visual text processing,” while what Ferrucci meant was “accepting textual input and breaking sentences down into components.” Watson did not hear Jeopardy! host Alex Trebek read the clues and could not hear when a human contestant gave a response that was wrong and then inadvertently repeated it. (Over the course of three days, that happened once, I believe.)
I was a bit disappointed by this, but David Gondek, one of the researchers at IBM’s TJ Watson Research Center in Hawthorne, New York, where the three-day tournament was played in a full mock-up of the Jeopardy! set, explained to me that IBM felt that the challenges in parsing sentences, trying to come up with responses, and creating a category and betting strategy were hard enough without adding speech recognition for the three different people that Watson would have been able to hear, if it had ears. As it turns out, the moment a clue is revealed and Trebek starts reading it out loud (human players Jennings and Rutter could read it on their screens), the same clue is transmitted as a text message to Watson. The buzzer is dead for all three players until Trebek finishes reading the clue and a human operator behind the panel flips a switch that turns the clue screen border red. At that moment, the player buzzers are enabled, and if a player hits a buzzer before the clue screen turns red, their buzzer button is put into stasis for a half second, thus giving someone else the jump.
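The buzzer mechanics described above amount to a simple state machine, which can be sketched like this (a minimal illustration of the article’s description, not the actual Jeopardy! hardware; the class and timings are mine):

```python
LOCKOUT = 0.5  # seconds a jumped-gun buzzer stays dead, per the article

class Buzzer:
    """Toy model of the Jeopardy! buzzer lockout described above."""
    def __init__(self):
        self.live_at = None      # time the operator arms the buzzers
        self.locked_until = {}   # per-player lockout expiry times

    def arm(self, t):
        # the human operator flips the switch, turning the border red
        self.live_at = t

    def press(self, player, t):
        if t < self.locked_until.get(player, 0.0):
            return False  # still locked out from an earlier early buzz
        if self.live_at is None or t < self.live_at:
            self.locked_until[player] = t + LOCKOUT  # jumped the gun
            return False
        return True  # valid buzz

b = Buzzer()
print(b.press("ken", 1.0))     # early buzz -> False, locked until 1.5
b.arm(1.2)
print(b.press("ken", 1.4))     # still inside the half-second lockout -> False
print(b.press("watson", 1.4))  # buzzers are live, not locked -> True
```

This is one plausible reading of the rule; the real system presumably re-locks on each premature press, but the half-second penalty shown here is the key detail.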
Jennings is known to be at one with the Jeopardy buzzer, which is one reason why he was able to win 74 straight Jeopardy! games. I happen to think that after Jennings had been playing for a while the other human players got psyched out as they took their turns trying to take him down, probably jumping the gun as many times as they missed the buzz. Watson doesn’t have emotions, can’t be psyched, and viciously ruled the buzzer during most of the three days of play.
Watson’s original hardware was a few racks of BlueGene/P massively parallel supercomputing nodes, but last year, knowing that the Jeopardy! challenge would be the biggest infomercial for Power Systems that Big Blue could ever wish for, the company switched to the decidedly midrange Power 750 servers. Watson’s processing nodes are four-socket Power 750 machines, which use the latest eight-core Power7 chips from IBM. There are nine machines in each rack, for a total of 90 servers and 2,880 cores. The cores are spinning at 3.55 GHz.
The cluster has 16 TB of main memory and 4 TB of clustered storage and has been stuffed with some 200 million pages of text data. It is not clear if Watson uses flash or disk drives, or what connects the server nodes together. (Gondek is not a hardware guy.) But I suspect that the machine has flash drives and uses either 40 Gb/sec InfiniBand or 10 Gigabit Ethernet switches to link the nodes. I would be using InfiniBand and its Remote Direct Memory Access (RDMA) capability if I were designing the hardware. What I can tell you is that the data is distributed across multiple nodes, and there is redundancy in the way the data is spread around so that a node failure doesn’t kill the machine as it plays.
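The hardware figures above hang together arithmetically, which is worth a quick back-of-envelope check (the per-node memory figure is my own division, not an IBM number):

```python
# Back-of-envelope check of the Watson cluster figures cited above.
servers = 90            # Power 750 nodes: nine per rack, so ten racks
sockets_per_server = 4  # four-socket machines
cores_per_socket = 8    # eight-core Power7 chips at 3.55 GHz

total_cores = servers * sockets_per_server * cores_per_socket
print(total_cores)  # 2880, matching the article

total_memory_tb = 16
mem_per_node_gb = total_memory_tb * 1024 / servers
print(round(mem_per_node_gb))  # roughly 182 GB of main memory per node
```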
IBM’s researchers chose Novell’s SUSE Linux Enterprise Server 11 operating system to run on the nodes, which has slightly better performance on certain kinds of HPC work than either AIX or IBM i. The secret sauce in Watson is a set of software that IBM calls DeepQA, and as you can see from here, the company is not saying much about precisely what this stack of code is. But I have been able to piece together a few things.
The Apache Software Foundation was bragging that Watson makes use of the Hadoop data chunking program to organize information and make it accessible in a parallel–and therefore superfast–fashion. (Hadoop is an open source analog to Google’s proprietary MapReduce data chunking technique, implemented by geeks at Yahoo after they read a paper describing it.)
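The MapReduce idea behind Hadoop is simple enough to sketch in a few lines: mappers emit key-value pairs in parallel, and reducers collapse the pairs per key. This is a toy word count, not anything from Watson or Hadoop itself:

```python
from collections import defaultdict

def map_phase(docs):
    # mapper: emit a (word, 1) pair for every word in every document
    for doc in docs:
        for word in doc.lower().split():
            yield word, 1

def reduce_phase(pairs):
    # reducer: sum the counts for each key
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["the answer is a question", "the question is the answer"]
print(reduce_phase(map_phase(docs)))
# {'the': 3, 'answer': 2, 'is': 2, 'a': 1, 'question': 2}
```

In a real Hadoop cluster, the map and reduce phases run across many nodes with the framework handling shuffling and fault tolerance, which is what makes the “parallel–and therefore superfast” part true at scale.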
Hadoop organizes the information, but it is Apache UIMA, short for Unstructured Information Management Architecture, that allows for unstructured information–text, audio, and video streams in theory, but text in the Watson example–to be analyzed and run through natural language parsing algorithms to figure out what is going on in that text. IBM started the UIMA effort in 2005, and the OmniFind semantic search engine in its DB2 data warehouses, for instance, is based on it. Since then, IBM has proposed UIMA as a standard and converted it to an open source project.
UIMA has frameworks for Java and C++, and Gondek says that most of the analytic algorithms created for Watson were written in Java, such as the question analysis, passage scoring, and confidence estimation routines. There is a mix of C and C++ for algorithms where speed is important, and Prolog is also used in the question analysis. There are about a million lines of code in these routines.
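UIMA’s core pattern is a pipeline of annotators that each read a shared analysis structure (the CAS: document text plus accumulated annotations) and add their own markup. UIMA itself is Java and C++; the sketch below is a schematic Python analog of that pattern, with hypothetical annotators of my own invention, not Watson’s:

```python
class CAS:
    """Toy Common Analysis Structure: the text plus accumulated annotations."""
    def __init__(self, text):
        self.text = text
        self.annotations = []  # (type, begin_offset, end_offset) triples

def tokenizer(cas):
    # hypothetical annotator: mark the character span of each token
    start = 0
    for tok in cas.text.split():
        begin = cas.text.index(tok, start)
        cas.annotations.append(("Token", begin, begin + len(tok)))
        start = begin + len(tok)

def capitalized_finder(cas):
    # hypothetical annotator: flag capitalized tokens as name candidates
    for kind, b, e in list(cas.annotations):
        if kind == "Token" and cas.text[b].isupper():
            cas.annotations.append(("NameCandidate", b, e))

cas = CAS("Watson played Jeopardy against Ken Jennings")
for annotator in (tokenizer, capitalized_finder):  # run the pipeline in order
    annotator(cas)
print([cas.text[b:e] for k, b, e in cas.annotations if k == "NameCandidate"])
# ['Watson', 'Jeopardy', 'Ken', 'Jennings']
```

The point of the architecture is that each stage only depends on the annotations laid down before it, so stages written in different languages, as in Watson’s Java/C++/Prolog mix, can be chained together.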
So you can’t just take Hadoop and UIMA and create your own Watson Jeopardy!-playing machine. Sorry.
In addition to searching data in memory and parsing clues, Watson was also taught, with pattern recognition software, how to take the words in a clue and figure out what category of clue it is; meaning, is it looking for a person, place, or thing? Is it geography or a movie? These algorithms learned which kinds of words tend to appear in which kinds of categories by being fed the data from more than 15,000 clue-response sets from real Jeopardy! games. Once they figure out the category, Watson sets loose hundreds of algorithms that were created, largely by trial and error, to help it best sift through its data to find the right answer for particular kinds of clues. By chewing through those same clue-response sets, Watson learned which algorithms work best for specific Jeopardy! categories. These algorithms, which took years to develop, are what gave Watson confidence in its answers, or showed when it did not have confidence. The different algorithms are given different weights for different categories of questions, and the overall probabilities shown during the tournament are some kind of weighted average of all these statistics.
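The weighting scheme described above can be illustrated with a toy combiner. The scorers, scores, weights, and category here are all hypothetical stand-ins; IBM has not published the actual DeepQA scoring model:

```python
def confidence(scores, weights):
    # weighted average of per-algorithm scores for one candidate answer
    total_w = sum(weights)
    return sum(s * w for s, w in zip(scores, weights)) / total_w

# Three made-up scorers rate the candidate answer "Toronto" on a
# hypothetical "U.S. Cities" clue: passage match, answer-type match,
# and a geography checker.
scores = [0.2, 0.4, 0.1]
weights_us_cities = [1.0, 3.0, 2.0]  # weights learned for this category

print(round(confidence(scores, weights_us_cities), 3))  # 0.267
```

The idea is that the same scorers run on every clue, but the learned per-category weights decide how much each one’s opinion counts toward the on-screen probability bar.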
Here are the clever bits about Watson. First, the eureka moment that turned Watson into a much better player than it was originally came when the IBM Research team figured out that, unlike a normal search engine, which gives every term in a collection of words it is tracking down and cross-linking the same weight, Watson would have to learn how to zoom in on the important words, and do so quickly. This helps it identify the category and reduce the number of possible answers, which helps the machine come up with the answer quickly.
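One simple way to get that “zoom in on the important words” effect is to weight terms by how rare they are across a corpus, a TF-IDF-style heuristic. Watson’s actual term weighting is more sophisticated and not public; this is just the flavor of the idea:

```python
import math

corpus = [
    "this president signed the emancipation proclamation",
    "this film won the academy award",
    "the capital of this state is albany",
]

def idf_weights(clue, docs):
    # weight each clue term by inverse document frequency: terms that
    # appear in few documents get high weight, filler words get ~zero
    weights = {}
    for term in set(clue.split()):
        df = sum(term in d.split() for d in docs)
        weights[term] = math.log((len(docs) + 1) / (df + 1))
    return weights

w = idf_weights("this president signed the proclamation", corpus)
print(round(w["president"], 2), round(w["the"], 2))  # 0.69 0.0
```

Filler words like “this” and “the” appear everywhere and score zero, while “president” and “proclamation” dominate, which is exactly the narrowing of the search space the paragraph above describes.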
Another key insight was to limit the data. Rather than just suck all of the data out of the Internet (which would be vastly more information than the machine needs), Watson relies on Wikipedia, the Bible, the Oxford English Dictionary, and various encyclopedias that summarize a lot of different data as its information sources. Feeding it raw data–such as novels or full technical manuals–would only end up confusing the machine and making it worse at playing the game, Gondek explained to me. By restricting itself to what are essentially encyclopedic resources that have already culled down lots of data about zillions of things, Watson has something akin to CliffsNotes for the very broad domain encompassed by Jeopardy!
If you missed the Jeopardy! challenge, you can watch it on YouTube here, although you will have to hunt and peck around for the different video pieces. (I am fairly certain these are protected by copyright and may not be available when you go to view them.)
If you want to review the clues and responses, check out the J-Archive, a community-driven site that posts the clues and responses to thousands of games. The first day’s game is number 3575 in the archive, from February 14. The second part of the first round is number 3576 from February 15, and the final day, which was for a whole, normal Jeopardy! match, is number 3577.
I kept track of the scores in real time, and built a table showing how each player progressed over the three days. Humanity did OK in two of the four rounds, but got whupped in two others, as you can see:
As Ken Jennings, who actually beat Watson in a full beta test run of the game back in January (IBM didn’t tell us that ahead of the show), put it as he wrote out his Final Jeopardy! response: “I, for one, welcome our new computer overlords.”
It will be interesting to see how many doctors, lawyers, and middle managers welcome a Watson into their offices when the system is reprogrammed to do question-answer analysis on medicine, the law, and business.