Yesterday I discovered (and ordered!) a new book from O'Reilly called Baseball Hacks by Joseph Adler. I've got a bookshelf full of O'Reilly books on other computer subjects, so I'm very excited to see this one. The web site for the book has a couple of sample hacks.
Last year I spent some time getting the Lahman database into MySQL so I could fool around with some advanced baseball statistics. The Lahman database is a Microsoft Access database, and doesn't allow re-distribution, so for an open-source advocate like me, it isn't exactly the best source for baseball information. It took me a few days to get it all into MySQL successfully, and none of my improvements could be distributed.
Well, from reading the sample hacks, I discovered there's a less restrictive database that's also available for MySQL (a free database server). In addition, the author of Baseball Hacks shows how to connect a MySQL database to the fantastic statistical package R. R is also free, and is incredibly powerful. I also found a previous article by the same author. Some of what appears below is based on that article.
Anyway, I can't wait to get the book to see what's in it, but in the meantime I did a very simple analysis comparing payroll to wins for the 2005 season. Team payrolls that year ranged from a low of $29.7 million for the Tampa Bay Devil Rays to the Yankees' astronomical $208.3 million. The team with the second-highest payroll, the Red Sox, spent only $123.5 million on players in 2005. What does all that money buy? I'm sure the owners hope it'll buy them enough wins to make it to the playoffs, and maybe win the World Series. The White Sox, who won the Series in 2005, were 13th in payroll at $75.2 million.
It turns out that payroll doesn't account for much of whether a team wins or loses. In a linear regression of wins on payroll, payroll explained only 24% of the variation in wins in 2005. For comparison, a team's hits and earned run average together explain 72% of the variation in wins. Obviously, getting lots of hits and keeping your opponent from scoring runs will contribute to winning a lot of games.
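To make "explained 24% of the variation" concrete: that figure is the R-squared of a straight-line fit of wins on payroll. Here is a minimal sketch of the calculation in plain Python. It is not the R code behind this post, the function names are my own, and the four data points are made-up toy numbers, not 2005 figures.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(x, y):
    """Share of the variation in y explained by the fitted line."""
    intercept, slope = fit_line(x, y)
    mean_y = sum(y) / len(y)
    # Variation left over after the fit (residual sum of squares).
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    # Total variation of y around its mean.
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Toy data only (hypothetical): payroll in $ millions, season wins.
toy_payroll = [30, 60, 90, 120]
toy_wins = [70, 85, 80, 95]

print(round(r_squared(toy_payroll, toy_wins), 3))
```

An R-squared of 0.24 means roughly three quarters of the team-to-team differences in wins are left unexplained by payroll alone.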
But what I want to see is whether a team did better than expected based on their player spending. The Yankees didn't wind up with the best record in baseball, despite spending more than twice as much as every other team in baseball except the Red Sox. How badly did they under-perform?
Not that badly, actually. The plot below shows the relationship between payroll and wins for 2005. The straight line is the regression line, the best linear fit to the data. The team abbreviations on the plot show how each team actually performed. Teams above the line played better than their payrolls would have predicted. Those below did worse.
For example, look how far the Chicago White Sox (CHA) are above the regression line. The Cardinals also wind up well above what we would expect based solely on their salaries (and that's with Scott Rolen on the DL the whole season!). Also check out the Cleveland Indians. They have a lot of very talented younger players who aren't eligible for arbitration yet.
You can see the Yankees over on the right, far from all the other teams. Based on their payroll, they should have won 102 games in 2005, but only managed 95. The Kansas City Royals were much worse, winning only 56 games when their player salaries predicted 75. It's easy to explain why teams like the Dodgers or Giants didn't do well in 2005 (their highly paid players were injured for most of the year), but something else must be going on with Seattle and Kansas City.
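The "should have won" numbers are just the regression line evaluated at each team's payroll, and the gap between actual and predicted wins is the residual: 95 - 102 = -7 for the Yankees, 56 - 75 = -19 for the Royals. Here is a minimal sketch of that calculation in plain Python. Again, this is not the R code used for the plot, the function names are my own, and the team codes and numbers are hypothetical toy data.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def wins_vs_expected(teams, payroll, wins):
    """Return (team, predicted wins, actual - predicted), best first."""
    intercept, slope = fit_line(payroll, wins)
    rows = []
    for team, pay, won in zip(teams, payroll, wins):
        predicted = intercept + slope * pay
        # Positive residual: more wins than the payroll predicts.
        rows.append((team, predicted, won - predicted))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Toy data only (hypothetical teams): payroll in $ millions, season wins.
toy_teams = ["T1", "T2", "T3", "T4"]
toy_payroll = [30, 60, 90, 120]
toy_wins = [70, 85, 80, 95]

for team, predicted, diff in wins_vs_expected(toy_teams, toy_payroll, toy_wins):
    print(f"{team}: predicted {predicted:.1f}, difference {diff:+.1f}")
```

Sorting by the residual puts the teams that got the most wins for their money at the top and the biggest under-performers at the bottom.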
What does all this tell us about baseball? Well, I'd argue that this metric (payroll vs. wins) tells us something about how effective a team's front office is. Smart general managers will pick up talent that is undervalued by the market, buying more wins than they're paying for. Also, teams with a good farm system can "grow their own" talent, rather than having to buy it on the free-agent market. Teams like Cleveland and Oakland are good examples of this. George Steinbrenner's spending should have been enough to buy a World Series championship, but the Yankees' front office overpaid for its veteran talent, and in 2005 those players didn't live up to their high salaries.
If you want to see the R code I used to generate the plot, you can download it from the link.