The following is a document-style version of a presentation I gave at work a couple weeks ago. It's a little less useful for a general audience because you don't have access to the same database I have, but I figured it might be useful for someone who is looking at using dplyr or in manipulating the GHCND data from NCDC.
Today we’re going to briefly take a look at the GHCND climate database and a couple new R packages (dplyr and tidyr) that make data import and manipulation a lot easier than using the standard library.
For further reading, consult the vignettes for dplyr and tidyr, and download the cheat sheet:
The GHCND database contains daily observation data from locations around the world. The README linked above describes the data set and the way the data is formatted. I have written scripts that process the station data and the yearly download files and insert it into a PostgreSQL database (noaa).
The script for inserting a yearly file (downloaded from http://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/) is here: ghcn-daily-by_year_process.py
Without going into too much detail on the subject (read Hadley Wickham’s paper) for more information, but the basic idea is that it is much easier to analyze data when it is in a particular, “tidy”, form. A Tidy dataset has a single table for each type of real world object or type of data, and each table has one column per variable measured and one row per observation.
For example, here’s a tidy table representing daily weather observations with station × date as rows and the various variables as columns.
Getting raw data into this format is what we’ll look at today.
R libraries & data import
First, let’s load the libraries we’ll need:
library(dplyr) # data import library(tidyr) # column / row manipulation library(knitr) # tabular export library(ggplot2) # plotting library(scales) # “pretty” scaling library(lubridate) # date / time manipulations
dplyr and tidyr are the data import and manipulation libraries we will use, knitr is used to produce tabular data in report-quality forms, ggplot2 and scales are plotting libraries, and lubridate is a library that makes date and time manipulation easier.
Also note the warnings about how several R functions have been “masked” when we imported dplyr. This just means we'll be getting the dplyr versions instead of those we might be used to. In cases where we need both, you can preface the function with it's package: base::filter would us the normal filter function instead of the one from dplyr.
Next, connect to the database and the three tables we will need:
noaa_db <- src_postgres(host="mason", dbname="noaa") ghcnd_obs <- tbl(noaa_db, "ghcnd_obs") ghcnd_vars <- tbl(noaa_db, "ghcnd_variables")
The first statement connects us to the database and the next two create table links to the observation table and the variables table.
Here’s what those two tables look like:
## Observations: 29404870 ## Variables: ## $ station_id (chr) "USW00027502", "USW00027502", "USW00027502", "USW0... ## $ dte (date) 2011-05-01, 2011-05-01, 2011-05-01, 2011-05-01, 2... ## $ variable (chr) "AWND", "FMTM", "PRCP", "SNOW", "SNWD", "TMAX", "T... ## $ raw_value (dbl) 32, 631, 0, 0, 229, -100, -156, 90, 90, 54, 67, 1,... ## $ meas_flag (chr) "", "", "T", "T", "", "", "", "", "", "", "", "", ... ## $ qual_flag (chr) "", "", "", "", "", "", "", "", "", "", "", "", ""... ## $ source_flag (chr) "X", "X", "X", "X", "X", "X", "X", "X", "X", "X", ... ## $ time_of_obs (int) NA, NA, 0, NA, NA, 0, 0, NA, NA, NA, NA, NA, NA, N...
## Observations: 82 ## Variables: ## $ variable (chr) "AWND", "EVAP", "MDEV", "MDPR", "MNPN", "MXPN",... ## $ description (chr) "Average daily wind speed (tenths of meters per... ## $ raw_multiplier (dbl) 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0....
Each row in the observation table rows contain the station_id, date, a variable code, the raw value for that variable, and a series of flags indicating data quality, source, and special measurements such as the “trace” value used for precipitation under the minimum measurable value.
Each row in the variables table contains a variable code, description and the multiplier used to convert the raw value from the observation table into an actual value.
This is an example of completely “normalized” data, and it’s stored this way because not all weather stations record all possible variables, and rather than having a single row for each station × date with a whole bunch of empty columns for those variables not measured, each row contains the station × data × variable data.
We are also missing information about the stations, so let’s load that data:
fai_stations <- tbl(noaa_db, "ghcnd_stations") %>% filter(station_name %in% c("FAIRBANKS INTL AP", "UNIVERSITY EXP STN", "COLLEGE OBSY")) glimpse(fai_stations)
## Observations: 3 ## Variables: ## $ station_id (chr) "USC00502107", "USW00026411", "USC00509641" ## $ station_name (chr) "COLLEGE OBSY", "FAIRBANKS INTL AP", "UNIVERSITY ... ## $ latitude (dbl) 64.86030, 64.80389, 64.85690 ## $ longitude (dbl) -147.8484, -147.8761, -147.8610 ## $ elevation (dbl) 181.9656, 131.6736, 144.7800 ## $ coverage (dbl) 0.96, 1.00, 0.98 ## $ start_date (date) 1948-05-16, 1904-09-04, 1904-09-01 ## $ end_date (date) 2015-04-03, 2015-04-02, 2015-03-13 ## $ variables (chr) "TMIN TOBS WT11 SNWD SNOW WT04 WT14 TMAX WT05 DAP... ## $ the_geom (chr) "0101000020E6100000A5BDC117267B62C0EC2FBB270F3750...
The first part is the same as before, loading the ghcnd_stations table, but we are filtering that data down to just the Fairbanks area stations with long term records. To do this, we use the pipe operator %>% which takes the data from the left side and passes it to the function on the right side, the filter function in this case.
filter requires one or more conditional statements with variable names on the left side and the condition on the right. Multiple conditions can be separated by commas if you want all the conditions to be required (AND) or separated by a logic operator (& for AND, | for OR). For example: filter(latitude > 70, longitude < -140).
When used on database tables, filter can also use conditionals that are built into the database which are passed directly as part of a WHERE clause. In our code above, we’re using the %in% operator here to select the stations from a list.
Now we have the station_ids we need to get just the data we want from the observation table and combine it with the other tables.
Here’s how we do it:
fai_raw <- ghcnd_obs %>% inner_join(fai_stations, by="station_id") %>% inner_join(ghcnd_vars, by="variable") %>% mutate(value=raw_value*raw_multiplier) %>% filter(qual_flag=='') %>% select(station_name, dte, variable, value) %>% collect() glimpse(fai_raw)
In order, here’s what we’re doing:
- Assign the result to fai_raw
- Join the observation table with the filtered station data, using station_id as the variable to combine against. Because this is an “inner” join, we only get results where station_id matches in both the observation and the filtered station data. At this point we only have observation data from our long-term Fairbanks stations.
- Join the variable table with the Fairbanks area observation data, using variable to link the tables.
- Add a new variable called value which is calculated by multiplying raw_value (coming from the observation table) by raw_multiplier (coming from the variable table).
- Remove rows where the quality flag is not an empty space.
- Select only the station name, date, variable and actual value columns from the data. Before we did this, each row would contain every column from all three tables, and most of that information is not necessary.
- Finally, we “collect” the results. dplyr doesn’t actually perform the full SQL until it absolutely has to. Instead it’s retrieving a small subset so that we can test our operations quickly. When we are happy with the results, we use collect() to grab the full data.
The data is still in a format that makes it difficult to analyze, with each row in the result containing a single station × date × variable observation. A tidy version of this data requires each variable be a column in the table, each row being a single date at each station.
To “pivot” the data, we use the spread function, and we'll also calculate a new variable and reduce the number of columns in the result.
fai_pivot <- fai_raw %>% spread(variable, value) %>% mutate(TAVG=(TMIN+TMAX)/2.0) %>% select(station_name, dte, TAVG, TMIN, TMAX, TOBS, PRCP, SNOW, SNWD, WSF1, WDF1, WSF2, WDF2, WSF5, WDF5, WSFG, WDFG, TSUN) head(fai_pivot)
## Source: local data frame [6 x 18] ## ## station_name dte TAVG TMIN TMAX TOBS PRCP SNOW SNWD WSF1 WDF1 ## 1 COLLEGE OBSY 1948-05-16 11.70 5.6 17.8 16.1 NA NA NA NA NA ## 2 COLLEGE OBSY 1948-05-17 15.55 12.2 18.9 17.8 NA NA NA NA NA ## 3 COLLEGE OBSY 1948-05-18 14.40 9.4 19.4 16.1 NA NA NA NA NA ## 4 COLLEGE OBSY 1948-05-19 14.15 9.4 18.9 12.2 NA NA NA NA NA ## 5 COLLEGE OBSY 1948-05-20 10.25 6.1 14.4 14.4 NA NA NA NA NA ## 6 COLLEGE OBSY 1948-05-21 9.75 1.7 17.8 17.8 NA NA NA NA NA ## Variables not shown: WSF2 (dbl), WDF2 (dbl), WSF5 (dbl), WDF5 (dbl), WSFG ## (dbl), WDFG (dbl), TSUN (dbl)
spread takes two parameters, the variable we want to spread across the columns, and the variable we want to use as the data value for each row × column intersection.
Now that we've got the data in a format we can work with, let's look at a few examples.
Find the coldest temperatures by winter year
First, let’s find the coldest winter temperatures from each station, by winter year. “Winter year” is just a way of grouping winters into a single value. Instead of the 2014–2015 winter, it’s the 2014 winter year. We get this by subtracting 92 days (the days in January, February, March) from the date, then pulling off the year.
Here’s the code.
fai_winter_year_minimum <- fai_pivot %>% mutate(winter_year=year(dte - days(92))) %>% filter(winter_year < 2014) %>% group_by(station_name, winter_year) %>% select(station_name, winter_year, TMIN) %>% summarize(tmin=min(TMIN*9/5+32, na.rm=TRUE), n=n()) %>% filter(n>350) %>% select(station_name, winter_year, tmin) %>% spread(station_name, tmin) last_twenty <- fai_winter_year_minimum %>% filter(winter_year > 1993) last_twenty
## Source: local data frame [20 x 4] ## ## winter_year COLLEGE OBSY FAIRBANKS INTL AP UNIVERSITY EXP STN ## 1 1994 -43.96 -47.92 -47.92 ## 2 1995 -45.04 -45.04 -47.92 ## 3 1996 -50.98 -50.98 -54.04 ## 4 1997 -43.96 -47.92 -47.92 ## 5 1998 -52.06 -54.94 -54.04 ## 6 1999 -50.08 -52.96 -50.98 ## 7 2000 -27.94 -36.04 -27.04 ## 8 2001 -40.00 -43.06 -36.04 ## 9 2002 -34.96 -38.92 -34.06 ## 10 2003 -45.94 -45.94 NA ## 11 2004 NA -47.02 -49.00 ## 12 2005 -47.92 -50.98 -49.00 ## 13 2006 NA -43.96 -41.98 ## 14 2007 -38.92 -47.92 -45.94 ## 15 2008 -47.02 -47.02 -49.00 ## 16 2009 -32.98 -41.08 -41.08 ## 17 2010 -36.94 -43.96 -38.02 ## 18 2011 -47.92 -50.98 -52.06 ## 19 2012 -43.96 -47.92 -45.04 ## 20 2013 -36.94 -40.90 NA
See if you can follow the code above. The pipe operator makes is easy to see each operation performed along the way.
There are a couple new functions here, group_by and summarize. group_by indicates at what level we want to group the data, and summarize uses those groupings to perform summary calculations using aggregate functions. We group by station and winter year, then we use the minimum and n functions to get the minimum temperature and number of days in each year where temperature data was available. You can see we are using n to remove winter years where more than two weeks of data are missing.
Also notice that we’re using spread again in order to make a single column for each station containing the minimum temperature data.
Here’s how we can write out the table data as a restructuredText document, which can be converted into many document formats (PDF, ODF, HTML, etc.):
sink("last_twenty.rst") print(kable(last_twenty, format="rst")) sink()
|winter_year||COLLEGE OBSY||FAIRBANKS INTL AP||UNIVERSITY EXP STN|
Finally, let’s plot the minimum temperatures for all three stations.
q <- fai_winter_year_minimum %>% gather(station_name, tmin, -winter_year) %>% arrange(winter_year) %>% ggplot(aes(x=winter_year, y=tmin, colour=station_name)) + geom_point(size=1.5, position=position_jitter(w=0.5,h=0.0)) + geom_smooth(method="lm", se=FALSE) + scale_x_continuous(name="Winter Year", breaks=pretty_breaks(n=20)) + scale_y_continuous(name="Minimum temperature (degrees F)", breaks=pretty_breaks(n=10)) + scale_color_manual(name="Station", labels=c("College Observatory", "Fairbanks Airport", "University Exp. Station"), values=c("darkorange", "blue", "darkcyan")) + theme_bw() + # theme(legend.position = c(0.150, 0.850)) + theme(axis.text.x = element_text(angle=45, hjust=1)) print(q)
To plot the data, we need the data in a slightly different format with each row containing winter year, station name and the minimum temperature. We’re plotting minimum temperature against winter year, coloring the points and trendlines using the station name. That means all three of those variables need to be on the same row.
To do that we use gather. The first parameter is the name of variable the columns will be moved into (the station names, which are currently columns, will become values in a row named station_name). The second is the name of the column that stores the observations (tmin) and the parameters after that are the list of columns to gather together. In our case, rather than specifying the names of the columns, we're specifying the inverse: all the columns except winter_year.
The result of the gather looks like this:
fai_winter_year_minimum %>% gather(station_name, tmin, -winter_year)
## Source: local data frame [321 x 3] ## ## winter_year station_name tmin ## 1 1905 COLLEGE OBSY NA ## 2 1907 COLLEGE OBSY NA ## 3 1908 COLLEGE OBSY NA ## 4 1909 COLLEGE OBSY NA ## 5 1910 COLLEGE OBSY NA ## 6 1911 COLLEGE OBSY NA ## 7 1912 COLLEGE OBSY NA ## 8 1913 COLLEGE OBSY NA ## 9 1915 COLLEGE OBSY NA ## 10 1916 COLLEGE OBSY NA ## .. ... ... ...
The plot is produced using ggplot2. A full introduction would be a seminar by itself, but the basics of our plot can be summarized as follows.
ggplot(aes(x=winter_year, y=tmin, colour=station_name)) +
aes defines variables and grouping.
geom_point(size=1.5, position=position_jitter(w=0.5,h=0.0)) + geom_smooth(method="lm", se=FALSE) +
geom_point draws points, geom_smooth draws fitted lines.
scale_x_continuous(name="Winter Year", breaks=pretty_breaks(n=20)) + scale_y_continuous(name="Minimum temperature (degrees F)", breaks=pretty_breaks(n=10)) + scale_color_manual(name="Station", labels=c("College Observatory", "Fairbanks Airport", "University Exp. Station"), values=c("darkorange", "blue", "darkcyan")) +
Scale functions define how the data is scaled into a plot and controls labelling.
theme_bw() + theme(axis.text.x = element_text(angle=45, hjust=1))
Theme functions controls the style.
For more information:
Linear regression, winter year and minimum temperature
Finally let’s look at the significance of those regression lines:
summary(lm(data=fai_winter_year_minimum, `COLLEGE OBSY` ~ winter_year))
## ## Call: ## lm(formula = `COLLEGE OBSY` ~ winter_year, data = fai_winter_year_minimum) ## ## Residuals: ## Min 1Q Median 3Q Max ## -19.0748 -5.8204 0.1907 3.8042 17.1599 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -275.01062 105.20884 -2.614 0.0114 * ## winter_year 0.11635 0.05311 2.191 0.0325 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 7.599 on 58 degrees of freedom ## (47 observations deleted due to missingness) ## Multiple R-squared: 0.07643, Adjusted R-squared: 0.06051 ## F-statistic: 4.8 on 1 and 58 DF, p-value: 0.03249
summary(lm(data=fai_winter_year_minimum, `FAIRBANKS INTL AP` ~ winter_year))
## ## Call: ## lm(formula = `FAIRBANKS INTL AP` ~ winter_year, data = fai_winter_year_minimum) ## ## Residuals: ## Min 1Q Median 3Q Max ## -15.529 -4.605 -1.025 4.007 19.764 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -171.19553 43.55177 -3.931 0.000153 *** ## winter_year 0.06250 0.02221 2.813 0.005861 ** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 7.037 on 104 degrees of freedom ## (1 observation deleted due to missingness) ## Multiple R-squared: 0.07073, Adjusted R-squared: 0.06179 ## F-statistic: 7.916 on 1 and 104 DF, p-value: 0.005861
summary(lm(data=fai_winter_year_minimum, `UNIVERSITY EXP STN` ~ winter_year))
## ## Call: ## lm(formula = `UNIVERSITY EXP STN` ~ winter_year, data = fai_winter_year_minimum) ## ## Residuals: ## Min 1Q Median 3Q Max ## -15.579 -5.818 -1.283 6.029 19.977 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -158.41837 51.03809 -3.104 0.00248 ** ## winter_year 0.05638 0.02605 2.164 0.03283 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 8.119 on 100 degrees of freedom ## (5 observations deleted due to missingness) ## Multiple R-squared: 0.04474, Adjusted R-squared: 0.03519 ## F-statistic: 4.684 on 1 and 100 DF, p-value: 0.03283
Essentially, all the models show a significant increase in minimum temperature over time, but none of them explain very much of the variation in minimum temperature.
I spent some time this weekend playing with a couple interesting new R packages that should help with some of the difficulty manipulating data with the base packages. Getting data into a format appropriate for plotting or running statistical models often seems to take more time than anything else, and the process can be very frustrating because of seemingly non-sensical error messages from R.
R guru Hadley Wickham has written some of the best packages for data manipulation (reshape2, plyr) and plotting (ggplot2). He’s got a new pair of packages (tidyr and dplyr) and a theory of getting data into the proper format (http://vita.had.co.nz/papers/tidy-data.pdf) that look very promising.
A couple other tools I wanted to look at include a new way of piping data from one operation to another (magrittr), which is a basic part of the Unix philosophy of having many tools that do one thing well and stringing them together to do neat things, and an interactive graphic package, ggvis, that should be really good for data investigation.
For this investigation, I’m looking at the names of our dogs, past and present, and how popular they are as people names in the United States. This data comes from the Social Security Administration and is available in the R package babynames.
install.packages("babynames") library(babynames) head(babynames) Source: local data frame [6 x 5] year sex name n prop 1 1880 F Mary 7065 0.07238359 2 1880 F Anna 2604 0.02667896 3 1880 F Emma 2003 0.02052149 4 1880 F Elizabeth 1939 0.01986579 5 1880 F Minnie 1746 0.01788843 6 1880 F Margaret 1578 0.01616720
The data has the number of registrations (n) and proportion of total registrations (prop) for each year from 1880 through 2013 for all name and sex combinations.
What I want to see is how popular out dog names are. All of our dogs have been adopted, but some we chose names for them (Nika, Piper, Kiva), and the rest came with names we didn’t change (Deuce, Buddy, Koidern, Lennier, Martin, Monte and the second Piper). Dog mushers often choose a theme for a litter of puppies, which accounts for some of the unusual names (Deuce came from a litter of classic cars, Lennier came from a litter of Babylon 5 character names, Koidern from a litter of Yukon River tributary names).
So what I want to do is subset the babynames database to just the names of our dogs, combine male and female names together, and plot the popularity of these names over time.
Here’s how that’s done using the magrittr pipe operator (%>%):
library(dplyr) library(magrittr) dog_names <- babynames %>% filter(name %in% c("Nika", "Piper", "Buddy", "Koidern", "Deuce", "Kiva", "Lennier", "Martin", "Monte")) %>% group_by(year, name) %>% summarise(prop=sum(prop)) %>% transform(name=factor(name)) %>% ungroup() %>% arrange(name, year)
We are assigning the result of all the pipes to dog_names. In order, we take the babynames data set and filter it by our dog’s names. Then we group by year and name, and summarize the proportion values by sum. At this point we have data with just our dog’s names, with the proportions of both male and female baby names combined. Next, we convert the name variable to a factor, remove the grouping, and sort by name and year.
Here’s what it looks like now:
> head(dog_names) year name prop 1 1893 Buddy 4.130866e-05 2 1894 Buddy 5.604573e-05 3 1896 Buddy 8.522045e-05 4 1898 Buddy 3.784782e-05 5 1899 Buddy 4.340353e-05 6 1900 Buddy 6.167015e-05
Now we’ve got tidy filtered and sorted data, so let’s plot it. I’ve been using ggplot2 for many years, and I think it’s the best way to produce publication quality figures. But usually you want to do some investigation of the data before doing that, and doing this in ggplot2 involves many cycles of code manipulation, plotting, viewing in order to see what you’ve got and how you want the final version to look.
ggvis is a new package that displays data interactively in a web browser. It also supports the pipe operator, so you can pipe the data directly into the plotting routine. It’s somewhat similar to ggplot2, but has some new conventions that are required in order to handle interactivity. Here’s a plot of my dog names data. The first part is the same as before, but I’m piping the result directly into ggivs.
library(ggvis) babynames %>% filter(name %in% c("Nika", "Piper", "Buddy", "Koidern", "Deuce", "Kiva", "Lennier", "Martin", "Monte")) %>% group_by(year, name) %>% summarise(prop=sum(prop)) %>% transform(name=factor(name)) %>% ungroup() %>% arrange(name, year) %>% ggvis(~year, ~prop, stroke=~name, fill=~name) %>% # layer_lines(strokeWidth:=2) %>% layer_points(size:=15) %>% add_axis("x", title="Year", format="####") %>% add_axis("y", title="Proportion of total names", title_offset=50) %>% add_legend(c("stroke", "fill"), title="Name")
Typically, I prefer to include lines and points in a timeseries plot like this, but I couldn't get ggvis to color the lines and the points without some very strange fill artifacts.
Here’s what I’d consider to be a high quality version of this, generated with ggplot:
library(ggplot2) library(scales) q <- ggplot(data=dog_names, aes(x=year, y=prop, colour=name)) + geom_point(size=1.75) + geom_line() + theme_bw() + scale_colour_brewer(palette="Set1") + scale_x_continuous(name="Year", breaks=pretty_breaks(n=10)) + scale_y_continuous(name="Proportion of total names", breaks=pretty_breaks(n=10)) rescale <- 0.50 svg("dog_names_ggplot2.svg", height=9*rescale, width=16*rescale) print(q) dev.off()
I think the two plots are pretty similar, and I’m impressed with how good the ggvis plot looks and how similar the language is to ggplot2. And I really like the pipe operator compared with a long list of individual statements or the way you add things together with ggplot2.
Both plots suffer from having too many groups (seven), which means it becomes difficult to interpret the colors on the plot. Choosing a good palette is key to this, and is one of those parts of figure production that can really take a long time. I don’t think my choices in the ggplot2 version is optimal, but I got tired of looking. The other problem is the collection of dog names with very low proportions among human babies. Because they’re all overlapping near the axis, this data is obscured. Both problems could be solved by stacking two plots on top of each other, one with the more popular names (Martin, Piper, Buddy and Monte) and one with the less popular ones (Deuce, Kiva, Nika) using different scales for the proportion axis.
What does the plot show? Among our dog’s names, Martin was the most popular, but it’s popularity has been declining since the 60s, and the name Piper has been increasing since 2000. Both Monte and Buddy were popular in the past, but have declined to low levels recently.
For reference, here are the number of babies in 2013 that were given names matching those of our dogs:
babynames %>% filter(name %in% c("Nika", "Piper", "Buddy", "Koidern", "Deuce", "Kiva", "Lennier", "Martin", "Monte") & year==2013) %>% group_by(year, name) %>% summarise(n=sum(n)) %>% transform(name=factor(name)) %>% ungroup() %>% arrange(desc(n)) year name n 1 2013 Piper 3166 2 2013 Martin 1330 3 2013 Monte 81 4 2013 Nika 67 5 2013 Buddy 21 6 2013 Kiva 18 7 2013 Deuce 5