Humans Fight, But Watson’s Chips Beat Quiz Champs

    February 21, 2011, by Timothy Prickett Morgan

    Like a lot of Americans, the only time I ever watch the Jeopardy! game show is if I happen to be exhausted on a Friday night, I happen to have my work done early, and I happen to feel like spending a little cash to go down to the local beer and burger joint with the wife and kids. And when we do go, we inevitably fight over who gets to sit on the benches that face the TV so we can watch the game clues and try to come up with the questions and play along.

    But last week, from Monday through Wednesday, my kids and I sat absolutely riveted to the TV at 7 p.m., watching the contest between the all-time Jeopardy! champs, Ken Jennings and Brad Rutter, who took on IBM’s Watson question-answer machine (what some people are calling a supercomputer) in a two-round, three-day championship tournament that pitted men against machine on the slippery puns of this game show, which is as old as IBM’s System/360 mainframe.

    In my conversation with IBM principal investigator David Ferrucci back in April 2009, when the “Blue J” project was launched to the world as the Watson QA system, Ferrucci told me that the machine would be processing natural language like we humans do. And in a classic example of how tough it is to actually communicate in this world, what I heard when Ferrucci said that was “audio and visual text processing,” while what Ferrucci meant was “accepting textual input and breaking sentences down into components.” Watson did not hear Jeopardy! host Alex Trebek read the clues, and it could not hear when a human contestant gave a response that was wrong, which is how it came to inadvertently repeat one. (Over the course of three days, that happened once, I believe.)

    I was a bit disappointed by this, but David Gondek, one of the researchers at IBM’s TJ Watson Research Center in Hawthorne, New York, where the three-day tournament was played on a full mock-up of the Jeopardy! set, explained to me that IBM felt the challenges in parsing sentences, coming up with responses, and creating a category and betting strategy were hard enough without adding speech recognition for the three different people Watson would have been able to hear, if it had ears. As it turns out, the moment a clue is revealed and Trebek starts reading it out loud (human players Jennings and Rutter could read it on their screens), the same clue is transmitted as a text message to Watson. The buzzer is dead for all three players until Trebek finishes reading the clue and a human operator behind the panel flips a switch that turns the clue screen border red. At that moment, the player buzzers are enabled, and if a player hits a buzzer before the clue screen turns red, that player’s buzzer is locked out for a half second, thus giving someone else the jump.
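
    That lockout rule is easy to model. Here is a minimal sketch in Java, assuming a flat half-second penalty per early press; the class and method names are mine, not anything IBM or the show’s producers have published:

        // Simplified model of the buzzer rule described above: buzzers are
        // dead until the operator enables them, and a press that jumps the
        // gun locks that player out for half a second.
        import java.util.HashMap;
        import java.util.Map;

        public class BuzzerGate {
            private static final long LOCKOUT_MS = 500;
            private long enabledAtMs = Long.MAX_VALUE;  // switch not flipped yet
            private final Map<String, Long> lockedUntilMs = new HashMap<>();

            // The human operator flips the switch; the border turns red now.
            public void enable(long nowMs) {
                enabledAtMs = nowMs;
            }

            // Returns true if this press rings in; pressing early earns a lockout.
            public boolean press(String player, long nowMs) {
                if (nowMs < enabledAtMs) {              // jumped the gun
                    lockedUntilMs.put(player, nowMs + LOCKOUT_MS);
                    return false;
                }
                return nowMs >= lockedUntilMs.getOrDefault(player, 0L);
            }
        }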

    Jennings is known to be at one with the Jeopardy! buzzer, which is one reason why he was able to win 74 straight Jeopardy! games. I happen to think that after Jennings had been playing for a while, the other human players got psyched out as they took their turns trying to take him down, probably jumping the gun as many times as they missed the buzz. Watson doesn’t have emotions, can’t be psyched out, and viciously ruled the buzzer during most of the three days of play.

    Watson’s original hardware was a few racks of BlueGene/P massively parallel supercomputing nodes, but last year, knowing that the Jeopardy! challenge would be the biggest infomercial for Power Systems that Big Blue could ever wish for, the company switched to decidedly midrange Power 750 servers. Watson’s processing nodes are four-socket Power 750 machines, which use the latest eight-core Power7 chips from IBM. There are nine machines in each rack, for a total of 90 servers and 2,880 cores (90 servers times four sockets times eight cores per socket). The cores spin at 3.55 GHz.

    The cluster has 16 TB of main memory and 4 TB of clustered storage, and it has been stuffed with some 200 million pages of text data. It is not clear whether Watson uses flash or disk drives, or what connects the server nodes together. (Gondek is not a hardware guy.) But I suspect that the machine has flash drives and uses either 40 Gb/sec InfiniBand or 10 Gigabit Ethernet switches to link the nodes; I would be using InfiniBand and its Remote Direct Memory Access (RDMA) capability if I were designing the hardware. What I can tell you is that the data is distributed across multiple nodes, and there is redundancy in the way the data is spread around so that a node failure doesn’t kill the machine as it plays.

    IBM’s researchers chose Novell’s SUSE Linux Enterprise Server 11 operating system to run on the nodes; it has slightly better performance on certain kinds of HPC work than either AIX or IBM i. The secret sauce in Watson is a set of software that IBM calls DeepQA, and the company is not saying much about precisely what this stack of code is. But I have been able to piece together a few things.

    The Apache Software Foundation was bragging that Watson makes use of the Hadoop data chunking program to organize information and make it accessible in a parallel, and therefore superfast, fashion. (Hadoop is an open source analog to Google’s proprietary MapReduce data chunking technique, implemented by geeks at Yahoo after they read a paper describing Google’s approach.)
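
    To give a flavor of the MapReduce pattern Hadoop implements, here is the canonical Hadoop word count job. Watson’s actual Hadoop jobs have not been published, so treat this as an illustration of the technique, not of IBM’s code:

        // Classic Hadoop word count: the map phase emits (word, 1) pairs in
        // parallel across the cluster, and the reduce phase sums the counts
        // for each word. This illustrates the pattern, not Watson's jobs.
        import java.io.IOException;
        import java.util.StringTokenizer;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class WordCount {
            public static class TokenizerMapper
                    extends Mapper<Object, Text, Text, IntWritable> {
                private final static IntWritable ONE = new IntWritable(1);
                private final Text word = new Text();
                public void map(Object key, Text value, Context context)
                        throws IOException, InterruptedException {
                    StringTokenizer itr = new StringTokenizer(value.toString());
                    while (itr.hasMoreTokens()) {
                        word.set(itr.nextToken());
                        context.write(word, ONE);  // emit (word, 1)
                    }
                }
            }

            public static class IntSumReducer
                    extends Reducer<Text, IntWritable, Text, IntWritable> {
                private final IntWritable result = new IntWritable();
                public void reduce(Text key, Iterable<IntWritable> values,
                        Context context) throws IOException, InterruptedException {
                    int sum = 0;
                    for (IntWritable val : values) sum += val.get();
                    result.set(sum);
                    context.write(key, result);  // emit (word, total)
                }
            }

            public static void main(String[] args) throws Exception {
                Job job = Job.getInstance(new Configuration(), "word count");
                job.setJarByClass(WordCount.class);
                job.setMapperClass(TokenizerMapper.class);
                job.setCombinerClass(IntSumReducer.class);
                job.setReducerClass(IntSumReducer.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);
                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }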

    Hadoop organizes the information, but it is Apache UIMA, short for Unstructured Information Management Architecture, that allows unstructured information (text, audio, and video streams in theory, but text in the Watson example) to be analyzed and run through natural language parsing algorithms to figure out what is going on in that text. IBM started the UIMA effort in 2005, and the OmniFind semantic search engine in its DB2 data warehouses, for instance, is based on it. Since then, IBM has proposed UIMA as a standard and converted it to an open source project.
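
    The UIMA programming model is easy to sketch: the framework hands each document to an annotator’s process() method, and the annotator attaches typed annotations to spans of text. The toy annotator below, which merely flags capitalized words using the stock Annotation type, is my own illustration; Watson’s annotators are far more elaborate and are not public:

        // A minimal UIMA annotator: process() is called once per document,
        // and each regex match is recorded as an annotation over that span.
        import java.util.regex.Matcher;
        import java.util.regex.Pattern;
        import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
        import org.apache.uima.jcas.JCas;
        import org.apache.uima.jcas.tcas.Annotation;

        public class CapitalizedWordAnnotator extends JCasAnnotator_ImplBase {
            private static final Pattern CAP_WORD = Pattern.compile("\\b[A-Z][a-z]+\\b");

            @Override
            public void process(JCas jcas) {
                Matcher m = CAP_WORD.matcher(jcas.getDocumentText());
                while (m.find()) {
                    // A real pipeline would use a type generated from its own
                    // type system descriptor rather than the base Annotation.
                    Annotation ann = new Annotation(jcas, m.start(), m.end());
                    ann.addToIndexes();
                }
            }
        }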

    UIMA has frameworks for Java and C++, and Gondek says that most of the analytic algorithms created for Watson, such as the question analysis, passage scoring, and confidence estimation routines, were written in Java. There is a mix of C and C++ for algorithms where speed is important, and Prolog is used in the question analysis. There are about a million lines of code in these routines.

    So you can’t just take Hadoop and UIMA and create your own Watson Jeopardy!-playing machine. Sorry.

    In addition to searching data in memory and parsing clues, Watson was also taught, with pattern recognition software, how to take the words in a clue and figure out what category of clue it is; meaning, is it looking for a person, place, or thing? Is it geography or a movie? These algorithms learned which kinds of words tend to appear in which kinds of categories by being fed the data from over 15,000 clue-response sets from real Jeopardy! games. Once they figure out the category, Watson sets loose hundreds of algorithms that were created, largely by trial and error, to help it best sift through its data to find the right answer for particular kinds of clues. By chewing through those 15,000 clue-response sets, Watson also learned which algorithms work best for specific Jeopardy! categories. These algorithms, which took years to develop, are what gave Watson confidence in its answers, or showed when it did not have confidence. The different algorithms are given different weights for different categories of questions, and the overall probabilities shown during the tournament are some kind of average of all these statistics.
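
    In code, that weights-per-category idea might look something like the sketch below. The Scorer interface, the weight table, and the simple weighted average are all my own invention to illustrate the shape of the thing, not DeepQA’s actual machinery:

        // Illustrative per-category weighted scoring: each algorithm grades a
        // candidate answer, and weights learned for the clue's category blend
        // the grades into a single confidence figure.
        import java.util.List;
        import java.util.Map;

        public class ConfidenceEstimator {
            interface Scorer {
                String name();
                double score(String clue, String candidate);  // returns 0.0 to 1.0
            }

            // e.g. weightsByCategory.get("U.S. CITIES").get("passage-overlap")
            private final Map<String, Map<String, Double>> weightsByCategory;

            ConfidenceEstimator(Map<String, Map<String, Double>> weightsByCategory) {
                this.weightsByCategory = weightsByCategory;
            }

            // Weighted average of all the scorers, using weights learned for
            // this category from past clue-response sets.
            double confidence(String category, String clue, String candidate,
                    List<Scorer> scorers) {
                Map<String, Double> weights =
                        weightsByCategory.getOrDefault(category, Map.of());
                double weighted = 0.0, totalWeight = 0.0;
                for (Scorer s : scorers) {
                    double w = weights.getOrDefault(s.name(), 0.0);
                    weighted += w * s.score(clue, candidate);
                    totalWeight += w;
                }
                return totalWeight == 0.0 ? 0.0 : weighted / totalWeight;
            }
        }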

    Here are the clever bits about Watson. First, the eureka moment that turned Watson into a much better player than it was originally came when the IBM Research team figured out that, unlike a normal search engine, which gives every term it is tracking down and cross-linking the same weight, Watson would have to learn how to zoom in on the important words in a clue, and do so quickly. This helps it identify the category and cut down the number of possible answers, which helps the machine come up with the answer quickly.
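
    A standard way to make rare, informative words count for more than common ones is inverse document frequency weighting. Nobody at IBM said Watson used exactly this formula, so take the snippet below as a stand-in for the general “zoom in on the important words” idea:

        // Inverse document frequency: idf(t) = ln(N / df(t)), so a term that
        // appears in few documents ("Stoker") far outweighs a term that
        // appears in nearly all of them ("the").
        import java.util.HashMap;
        import java.util.HashSet;
        import java.util.List;
        import java.util.Map;

        public class TermWeights {
            static Map<String, Double> idf(List<List<String>> docs) {
                Map<String, Integer> docFreq = new HashMap<>();
                for (List<String> doc : docs)
                    for (String term : new HashSet<>(doc))   // count each doc once
                        docFreq.merge(term, 1, Integer::sum);
                Map<String, Double> weights = new HashMap<>();
                for (Map.Entry<String, Integer> e : docFreq.entrySet())
                    weights.put(e.getKey(),
                            Math.log((double) docs.size() / e.getValue()));
                return weights;
            }
        }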

    Another key insight was to limit the data. Rather than just sucking all of the data out of the Internet (which would be vastly more than the 200 million pages Watson holds), Watson relies on Wikipedia, the Bible, the Oxford English Dictionary, and various encyclopedias that summarize a lot of different data as its information sources. Feeding it raw data, such as novels or full technical manuals, would only end up confusing the machine and making it worse at playing the game, Gondek explained to me. By restricting itself to what are essentially encyclopedic resources that have already culled down lots of data about zillions of things, Watson has something akin to CliffsNotes for the very broad domain encompassed by Jeopardy!

    If you missed the Jeopardy! challenge, you can watch it on YouTube, although you will have to hunt and peck around for the different video pieces. (I am fairly certain these are protected by copyright and may not be available when you go to view them.)

    If you want to review the clues and responses, check out the J-Archive, a community-driven site that posts the clues and responses to thousands of games. The first day’s game is number 3575 in the archive, from February 14. The second part of the first game is number 3576, from February 15, and the final day, which was a whole, normal Jeopardy! match, is number 3577.

    I kept track of the scores in real time and built a table showing how each player fared over the three days. Humanity did OK in two of the four rounds, but got whupped in the other two, as the round-by-round totals show:

                                  Ken Jennings    IBM Watson    Brad Rutter
    Round One Finals:                   $2,000        $5,000         $5,000
    Round Two Final Jeopardy:       bet $2,400      bet $947     bet $5,000
                                         right         wrong          right
    Round Two Finals:                   $4,800       $35,734        $10,400
    Round Four Final Jeopardy:      bet $1,000   bet $17,973     bet $5,600
                                         right         right          right
    Round Four Finals:                 $19,200       $41,413        $11,200
    Grand Totals:                      $24,000       $77,147        $21,600

    As Ken Jennings, who actually beat Watson in a full beta test run of the game back in January (IBM didn’t tell us that ahead of the show), put it beneath his Final Jeopardy response: “I, for one, welcome our new computer overlords.”

    It will be interesting to see how many doctors, lawyers, and middle managers welcome a Watson into their offices when the system is reprogrammed to do question-answer analysis on medicine, the law, and business.

    RELATED STORIES

    Humans $4,600, Watson $4,400 in Jeopardy! Beta Test Round

    IBM’s Watson Supercomputer to Play Jeopardy! and Challenge Humanity

    IBM Gets Hybrid with Servers, Talks Up BAO Boxes


