Info Builders Prophesies World Series Winner with Predictive Analytics
October 20, 2009 Alex Woodie
When it comes to making predictions, there are as many techniques as there are people. Some may involve a twig, a piece of string, and a really strong hunch, while others use giant supercomputers and trillions of bytes of data. One company trying to democratize the data-driven approach is i OS business intelligence software vendor Information Builders, which recently crunched baseball statistics to come up with the most probable winner of the World Series. Hint: the winner’s initials start with “LA,” not “NY.”
Unlike any other sport, baseball is a game deeply rooted in statistics. More than one hundred years’ worth of data has been meticulously recorded, putting the results of every at-bat of every inning of every game in every season on hand for posterity. This data is available free of charge from several sites on the Web, which made it a fun and easy test bed for WebFOCUS RStat, a new predictive analytics component of Info Builders’ business intelligence software suite.
To get started with its World Series prediction project, Info Builders downloaded the statistics that it figured would make the most sense for its purpose. The company restricted its search to the teams that have made the playoffs since divisional play began in 1969, or fewer than 200 teams, according to Kevin Quinn, vice president of product marketing at New York City-based Info Builders.
The goal of the exercise was to determine which statistics correlated most closely with the teams that have won the World Series, and then to crunch the data to determine the winner of the World Series. In other words, the software looks at what was the most common statistical denominator among teams that won the World Series in the past, and applies that to the present teams and their statistical footprints.
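As a rough illustration of that first step, the sketch below ranks a few statistics by how strongly they correlate with a won/lost-the-title flag. It is not RStat's code: the `pearson` and `rank_stats` helpers are hypothetical names, and the team rows are made-up numbers, not real baseball statistics.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up playoff teams: illustrative numbers only, not real statistics.
# "champion" is 1 if the team won the World Series, else 0.
teams = [
    {"win_pct": 0.630, "era": 3.40, "walks": 560, "champion": 1},
    {"win_pct": 0.590, "era": 3.90, "walks": 610, "champion": 0},
    {"win_pct": 0.610, "era": 3.55, "walks": 540, "champion": 1},
    {"win_pct": 0.570, "era": 4.10, "walks": 600, "champion": 0},
    {"win_pct": 0.580, "era": 3.80, "walks": 575, "champion": 0},
    {"win_pct": 0.600, "era": 3.60, "walks": 590, "champion": 1},
]

def rank_stats(rows, stat_names, target="champion"):
    """Sort statistics by the absolute value of their correlation with the target."""
    target_values = [row[target] for row in rows]
    scores = {
        name: pearson([row[name] for row in rows], target_values)
        for name in stat_names
    }
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

ranking = rank_stats(teams, ["win_pct", "era", "walks"])
for name, r in ranking:
    print(f"{name}: {r:+.2f}")
```

A strong correlation on a handful of rows proves little, which is why the real exercise drew on every playoff team since 1969 and still needed human judgment about which results were plausible.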
Info Builders pulled all kinds of data from the archives, including things like batting averages, ERA, and runs scored, according to Quinn, who for the record is not a Yankees fan. (“There are 4 million people from New York that hate the Yankees. They’re called Mets fans.”)
Then came the hard part: Figuring out the best way to interpret the data, which can sometimes resemble art more than science. “You basically play with the data,” Quinn says. “You try a couple of different algorithms until you see an algorithm that seems to come up with something that seems logical. That’s what we did.”
Some of the approaches didn’t work. For example, one algorithm said that winning percentage and the number of team walks were the most predictive of a World Series crown. If that were the case, RStat predicted, the New York Yankees had the highest probability–19 to 20 percent–of winning it all. However, the tool also found that every other team had a 0 percent probability, which didn’t make any sense.
“That’s possible with any software. You throw so much stuff at it that it doesn’t mean anything,” Quinn says. “That’s why there’s a little bit of work that goes into predictive analytics. You need to have an understanding of the data, what generally is considered to make sense. You need to narrow things down from your own logic standpoint, then you start to come up with models that come up with predictions that seem to make sense.”
In the end, Info Builders settled on a group of statistics believed to be the most indicative of a World Series winner. They included winning percentage, runs scored, batting average, total extra base hits, ERA, and fielding percentage.
After running all of the teams’ season stats through decision tree and linear logistical regression algorithms, RStat determined that the Los Angeles Dodgers had a 34 percent chance of winning the World Series, compared to 32 percent for the Los Angeles Angels of Anaheim, and 29 percent for the New York Yankees. The next closest teams, including defending World Series champion Philadelphia Phillies–which currently hold a 2-1 lead over the hated Dodgers (editor’s prerogative) in the NLCS–had chances lower than 15 percent.
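For readers curious what a logistic regression does with data like this, here is a minimal sketch in plain Python. It is not RStat's implementation, and it omits the decision tree entirely. The training rows are made up: each stat is coded +1 if a hypothetical past playoff team was above the field average and -1 if below, and `fit_logistic` and `predict` are illustrative names.

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def fit_logistic(features, labels, learning_rate=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias

def predict(weights, bias, x):
    """Estimated probability that a team with features x wins the title."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))

# Made-up history: [winning percentage, ERA], each coded +1 (above the
# playoff-field average) or -1 (below). Label 1 means the team won it all.
history = [
    ([1, -1], 1), ([1, -1], 1),    # won a lot, low ERA
    ([1, 1], 1), ([1, 1], 0),      # won a lot, high ERA
    ([-1, -1], 1), ([-1, -1], 0),  # won less, low ERA
    ([-1, 1], 0), ([-1, 1], 0),    # won less, high ERA
]
features = [x for x, _ in history]
labels = [y for _, y in history]

weights, bias = fit_logistic(features, labels)

# Hypothetical current contenders, scored independently of one another.
contenders = {"Team A": [1, -1], "Team B": [1, 1], "Team C": [-1, 1]}
chances = {name: predict(weights, bias, x) for name, x in contenders.items()}
for name, p in chances.items():
    print(f"{name}: {p:.0%}")
```

Each team is scored on its own here, so the probabilities need not sum to 100 percent across the field.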
Obviously, unless you’re in the T-shirt business, who wins the World Series will not be terribly significant to the future of your company. But substitute the search for a World Series winner with a search for the optimal inventory level, or the search for the hottest sales region, and you can see how this software may apply to your business.
“What we’re trying to show here is the software can be used for any purpose,” Quinn says. “Baseball is a fun thing, but it can be used to predict everything from what students are the most likely to graduate from university to using it to predict the best time to discount prices to maximize profits and sales.”
And while the RStat software runs on Windows or Linux, there’s nothing to prevent System i shops from using it to analyze their historical data, housed in Info Builders’ WebFOCUS software running on the System i.