I recently saw a pair of blog posts showing how to make heatmaps with straight R and with ggplot2. Basketball doesn’t really interest me, so I figured I’d attempt to do the same thing for the 2010 Oakland Athletics 40-man roster. Results are at the bottom of the post.
First, I needed to get the 40-man roster:
$ w3m -dump "http://oakland.athletics.mlb.com/team/roster_40man.jsp?c_id=oak" > 40man
Then trim it down so it’s just a listing of the players’ names.
Next, get the Baseball Databank (BDB) database from http://baseball-databank.org/, then convert the MySQL dump with mysql2pgsql.perl and load it into a PostgreSQL database.
A Python script reads the names from the roster, and dumps a CSV file of the batting and pitching data for the past two seasons for the players passed in.
$ cat 40man_names | ./get_two-year_batter_stats.py
The batting data looks like this:
name , age, g, ba, obp, slg, ops, rc, hrr, kr, bbr
Daric Barton (1B) , 25, 194, 0.238, 0.342, 0.365, 0.707, 73, 0.017, 0.173, 0.134
Travis Buck (RF) , 27, 74, 0.223, 0.289, 0.392, 0.682, 28, 0.035, 0.202, 0.073
Chris Carter (LF) , 28, 13, 0.261, 0.320, 0.261, 0.581, 1, 0.000, 0.360, 0.080
I’ve used the counting stats in the BDB to calculate batting average (ba), on-base percentage (obp), slugging percentage (slg), OPS (on-base percentage + slugging percentage), runs created (rc), home run rate (hrr), strikeout rate (kr) and walk rate (bbr).
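For anyone wanting to reproduce these columns, here is a rough Python sketch of how the rate stats could be derived from counting stats. The exact formulas in my script aren't shown in this post, so everything here is an assumption: the function name `batting_rates` is hypothetical, plate appearances are simplified to AB + BB + HBP + SF, the three rates are taken per plate appearance, and runs created uses the basic (H + BB) × TB / (AB + BB) form, which may not be the variant behind the table above.

```python
def batting_rates(ab, h, doubles, triples, hr, bb, so, hbp=0, sf=0):
    """Derive rate stats from counting stats.

    All formulas are assumptions for illustration; callers must make sure
    ab and the plate-appearance total are non-zero.
    """
    singles = h - doubles - triples - hr
    tb = singles + 2 * doubles + 3 * triples + 4 * hr  # total bases
    pa = ab + bb + hbp + sf                            # simplified plate appearances
    obp = (h + bb + hbp) / pa
    slg = tb / ab
    return {
        "ba": round(h / ab, 3),
        "obp": round(obp, 3),
        "slg": round(slg, 3),
        "ops": round(obp + slg, 3),              # summed before rounding
        "rc": round((h + bb) * tb / (ab + bb)),  # basic runs created
        "hrr": round(hr / pa, 3),                # home runs per plate appearance
        "kr": round(so / pa, 3),                 # strikeouts per plate appearance
        "bbr": round(bb / pa, 3),                # walks per plate appearance
    }


# Hypothetical batting line, not taken from the database
stats = batting_rates(ab=500, h=150, doubles=30, triples=5, hr=15, bb=50, so=100)
print(stats)
```

OPS is summed from the unrounded OBP and SLG, so it can differ by a point from adding the two rounded columns.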
And the pitching data:
name , age, g, ip, w, l, sv, wp, lp, wf, era, k9, bb9, hr9
Brett Anderson (P) , 22, 30, 175.33, 11, 11, 0, 0.37, 0.37, 0.00, 4.06, 7.70, 2.36, 1.03
Andrew Bailey (P) , 26, 68, 83.33, 6, 3, 26, 0.09, 0.04, 0.04, 1.84, 9.83, 2.92, 0.54
Jerry Blevins (P) , 27, 56, 60.00, 1, 3, 0, 0.02, 0.05, -0.04, 3.75, 8.70, 3.30, 0.60
Here I’ve calculated innings pitched (ip), winning percentage (wp), losing percentage (lp), win frequency (wf), earned run average (era), strikeouts per nine innings (k9), walks per nine (bb9), and home runs given up per nine innings (hr9). All these stats are for the last two Major League seasons.
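A minimal Python sketch of these calculations follows. It rests on some assumptions: the function name `pitching_rates` is hypothetical, innings are derived from outs recorded (the BDB's IPouts column), and wp and lp are computed per game pitched, which is what the sample rows above are consistent with (11 wins in 30 games gives 0.37). The games, wins, losses and saves in the example come from the Andrew Bailey row; the other counting stats are back-computed from that row's rates and should be treated as illustrative.

```python
def pitching_rates(g, w, l, sv, ipouts, er, so, bb, hr):
    """Derive the pitching columns from counting stats (assumed formulas)."""
    ip = ipouts / 3          # innings pitched from outs recorded
    wp = w / g               # wins per game pitched
    lp = l / g               # losses per game pitched

    def per_nine(count):
        return 9 * count / ip

    return {
        "ip": round(ip, 2),
        "w": w, "l": l, "sv": sv,
        "wp": round(wp, 2),
        "lp": round(lp, 2),
        "wf": round(wp - lp, 2),   # win frequency: (W - L) / G
        "era": round(per_nine(er), 2),
        "k9": round(per_nine(so), 2),
        "bb9": round(per_nine(bb), 2),
        "hr9": round(per_nine(hr), 2),
    }


# g, w, l, sv from the sample row; ipouts, er, so, bb, hr are back-computed
bailey = pitching_rates(g=68, w=6, l=3, sv=26, ipouts=250,
                        er=17, so=91, bb=27, hr=5)
print(bailey)
```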
Finally, generate the heat maps in R. For batting statistics:
library(ggplot2)   # plotting
library(reshape2)  # melt()
library(plyr)      # ddply()
library(scales)    # rescale()
mlb <- read.csv('batting.csv')
# Order players by OPS so the best hitters end up at the top of the plot
mlb$name <- with(mlb, reorder(name, ops))
# Long format: one row per (player, statistic) pair
mlb.m <- melt(mlb)
# Rescale each statistic to 0..1 so the columns share one colour scale
mlb.m <- ddply(mlb.m, .(variable), transform, rescale = rescale(value))
(p <- ggplot(mlb.m, aes(variable, name)) +
    geom_tile(aes(fill = rescale), colour = "white") +
    scale_fill_gradient(low = "gold", high = "darkgreen"))
base_size <- 14
p + theme_grey(base_size = base_size) + labs(x = "", y = "") +
    scale_x_discrete(expand = c(0, 0)) + scale_y_discrete(expand = c(0, 0)) +
    theme(legend.position = "none", axis.ticks = element_blank(),
          axis.text.x = element_text(size = base_size * 0.8, angle = 0, hjust = 0.5, colour = "black"),
          axis.text.y = element_text(size = base_size * 0.8, lineheight = 0.9, colour = "black", hjust = 1))
Pitching statistics are handled the same way, except that the line that orders the data frame becomes:
mlb$name <- with(mlb, reorder(name, 1/(era+0.1)))
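The rescale step is what makes the tiles comparable: each statistic is stretched onto a 0 to 1 range within its own column, so a column of ages and a column of batting averages can share one colour gradient. Here is a small Python sketch of that idea, assuming plain min-max scaling. The helper name `rescale_column` and the choice to return 0.5 for a constant column are mine for illustration; the values are the ops and kr columns from the three sample batting rows.

```python
def rescale_column(values):
    """Min-max scale a list of numbers onto the range 0..1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: no spread to show, so use the midpoint (a choice made here)
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# ops and kr for Barton, Buck and Carter, from the sample batting rows
columns = {
    "ops": [0.707, 0.682, 0.581],
    "kr": [0.173, 0.202, 0.360],
}

# Each statistic is rescaled independently of the others
rescaled = {name: rescale_column(vals) for name, vals in columns.items()}
for name, vals in rescaled.items():
    print(name, [round(v, 3) for v in vals])
```

Because the scaling is per column, the colours only say how a player ranks within the roster on that statistic, not whether the raw number is good in absolute terms.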
A’s batting heatmap, ordered by OPS
A’s pitching heatmap, ordered by ERA
You have to keep the number of games (or innings pitched for pitchers) in mind when you look at these charts. I don’t even know who some of those guys are, probably because they’ve only barely played in the majors. It might make some sense to split the pitching plot into plots for starters and relievers, but I’d need a good way to determine a pitcher’s status (innings pitched divided by games beyond some threshold, perhaps?).
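The innings-per-game idea could be sketched like this in Python. The function name `pitcher_role` and the cutoff of 3.0 innings per appearance are hypothetical, not something I've tuned against real data; the inputs are the ip and g values from the three sample pitching rows.

```python
def pitcher_role(ip, g, threshold=3.0):
    """Guess a pitcher's role from average innings per appearance.

    threshold is a hypothetical cutoff; pitchers at or above it are
    treated as starters.
    """
    if g == 0:
        return "unknown"  # no appearances, nothing to classify
    return "starter" if ip / g >= threshold else "reliever"


# (name, innings pitched, games) from the sample pitching rows
sample = [
    ("Brett Anderson", 175.33, 30),
    ("Andrew Bailey", 83.33, 68),
    ("Jerry Blevins", 60.00, 56),
]

roles = {name: pitcher_role(ip, g) for name, ip, g in sample}
print(roles)
```

A swingman who both starts and relieves would land near the cutoff, so a heuristic like this would still need a sanity check against games started where that column is available.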
As for the A’s, I like their pitching, but have serious doubts about their offense. I sure hope some of the younger guys on this chart start reaching their power potential because having Jack Cust as your only offensive weapon doesn’t bode well for the team scoring runs.