mon, 09-jan-2017, 09:49


The latest forecast discussions for Northern Alaska have included warnings that we are likely to experience an extended period of below normal temperatures starting at the end of this week, and yesterday’s Deep Cold blog post discusses the similarity of model forecast patterns to patterns seen in the 1989 and 1999 extreme cold events.

Our dogs spend most of their time in the house when we’re home, but if both of us are at work they’re outside in the dog yard. They have insulated dog houses, but when it’s colder than −15° F, we put them into a heated dog barn. That means one of us has to come home in the middle of the day to let them out to go to the bathroom.

Since we’re past the Winter Solstice, and day length is now increasing, I was curious to see if that has an effect on daily temperature, hopeful that the frequency of days when we need to put the dogs in the barn is decreasing.


We’ll use daily minimum and maximum temperature data from the Fairbanks International Airport station, keeping track of how many years the temperatures are below −15° F and dividing by the total to get a frequency. We live in a cold valley on Goldstream Creek, so our temperatures are typically several degrees colder than the Fairbanks Airport, and we often don’t warm up as much during the day as in other places, but minimum airport temperature is a reasonable proxy for the overall winter temperature at our house.


The following plot shows the frequency of minimum (the top of each line) and maximum (the bottom) temperature colder than −15° F at the airport over the period of record, 1904−2016. The curved blue line represents a best fit line through the minimum temperature frequency, and the vertical blue line is drawn at the date when the frequency is the highest.

Frequency of days with temperatures below −15° F

The maximum frequency is January 12th, so we have a few more days before the likelihood of needing to put the dogs in the barn starts to decline. The plot also shows that we could still reach that threshold all the way into April.

For fun, here’s the same plot using −40° as the threshold:

Frequency of days with temperatures below −40°

The date when the frequency starts to decline is shifted slightly to January 15th, and you can see the frequencies are lower. In mid-January, we can expect minimum temperature to be colder than −15° F more than half the time, but temperatures colder than −40° are just under 15%. There’s also an interesting anomaly in mid to late December where the frequency of very cold temperatures appears to drop.

Appendix: R code


noaa <- src_postgres(host="localhost", dbname="noaa")

fairbanks <- tbl(noaa, build_sql("SELECT * FROM ghcnd_pivot
                                  WHERE station_name='FAIRBANKS INTL AP'")) %>%

save(fairbanks, file="fairbanks_ghcnd.rdat")

for_plot <- fairbanks %>%
           dte_str=format(dte, "%d %b"),
           min_below=ifelse(tmin_c < -26.11,1,0),
           max_below=ifelse(tmax_c < -26.11,1,0)) %>%
    filter(dte_str!="29 Feb") %>%
    mutate(doy=ifelse(leap_year(dte) & doy>60, doy-1, doy),
           doy=(doy+31+28+31+30)%%365) %>%
    group_by(doy, dte_str) %>%
    mutate(n_min=sum(ifelse(!, 1, 0)),
           n_max=sum(ifelse(!, 1, 0))) %>%
    summarize(min_freq=sum(min_below, na.rm=TRUE)/max(n_min, na.rm=TRUE),
              max_freq=sum(max_below, na.rm=TRUE)/max(n_max, na.rm=TRUE))

x_breaks <- for_plot %>%
    filter(doy %in% seq(49, 224, 7))

stats <- tibble(doy=seq(49, 224),
                pred=predict(loess(min_freq ~ doy,
                                   for_plot %>%
                                       filter(doy >= 49, doy <= 224))))

max_stats <- stats %>%
    arrange(desc(pred)) %>% head(n=1)

p <- ggplot(data=for_plot,
            aes(x=doy, ymin=min_freq, ymax=max_freq)) +
    geom_linerange() +
    geom_smooth(aes(y=min_freq), se=FALSE, size=0.5) +
    geom_segment(aes(x=max_stats$doy, xend=max_stats$doy,
                     y=-Inf, yend=max_stats$pred),
                 colour="blue", size=0.5) +
                       limits=c(49, 224),
                       labels=x_breaks$dte_str) +
    scale_y_continuous(name="Frequency of days colder than −15° F",
                       breaks=pretty_breaks(n=10)) +
    theme_bw() +
    theme(axis.text.x=element_text(angle=30, hjust=1))

# Minus 40
for_plot <- fairbanks %>%
           dte_str=format(dte, "%d %b"),
           min_below=ifelse(tmin_c < -40,1,0),
           max_below=ifelse(tmax_c < -40,1,0)) %>%
    filter(dte_str!="29 Feb") %>%
    mutate(doy=ifelse(leap_year(dte) & doy>60, doy-1, doy),
           doy=(doy+31+28+31+30)%%365) %>%
    group_by(doy, dte_str) %>%
    mutate(n_min=sum(ifelse(!, 1, 0)),
           n_max=sum(ifelse(!, 1, 0))) %>%
    summarize(min_freq=sum(min_below, na.rm=TRUE)/max(n_min, na.rm=TRUE),
              max_freq=sum(max_below, na.rm=TRUE)/max(n_max, na.rm=TRUE))

x_breaks <- for_plot %>%
    filter(doy %in% seq(63, 203, 7))

stats <- tibble(doy=seq(63, 203),
                pred=predict(loess(min_freq ~ doy,
                                   for_plot %>%
                                       filter(doy >= 63, doy <= 203))))

max_stats <- stats %>%
    arrange(desc(pred)) %>% head(n=1)

q <- ggplot(data=for_plot,
            aes(x=doy, ymin=min_freq, ymax=max_freq)) +
    geom_linerange() +
    geom_smooth(aes(y=min_freq), se=FALSE, size=0.5) +
    geom_segment(aes(x=max_stats$doy, xend=max_stats$doy,
                     y=-Inf, yend=max_stats$pred),
                 colour="blue", size=0.5) +
                       limits=c(63, 203),
                       labels=x_breaks$dte_str) +
    scale_y_continuous(name="Frequency of days colder than −40°",
                       breaks=pretty_breaks(n=10)) +
    theme_bw() +
    theme(axis.text.x=element_text(angle=30, hjust=1))
tags: weather  climate  temperature  R 
sat, 19-nov-2016, 15:50


So far this winter we’ve gotten only 4.1 inches of snow, well below the normal 19.7 inches, and there is only 2 inches of snow on the ground. At this point last year we had 8 inches and I’d been biking and skiing on the trail to work for two weeks. In his North Pacific Temperature Update blog post, Richard James mentions that winters like this one, with a combined strongly positive Pacific Decadal Oscillation phase and strongly negative North Pacific Mode phase tend to be a “distinctly dry” pattern for interior Alaska. I don’t pretend to understand these large scale climate patterns, but I thought it would be interesting to look at snowfall and snow depth in years with very little mid-November snow. In other years like this one do we eventually get enough snow that the trails fill in and we can fully participate in winter sports like skiing, dog mushing, and fat biking?


We will use daily data from the Global Historical Climate Data set for the Fairbanks International Airport station. Data prior to 1950 is excluded because of poor quality snowfall and snow depth data and because there’s a good chance that our climate has changed since then and patterns from that era aren’t a good model for the current climate in Alaska.

We will look at both snow depth and the cumulative winter snowfall.


The following tables show the ten years with the lowest cumulative snowfall and snow depth values from 1950 to the present on November 18th.

Year Cumulative Snowfall (inches)
1953 1.5
2016 4.1
1954 4.3
2014 6.0
2006 6.4
1962 7.5
1998 7.8
1960 8.5
1995 8.8
1979 10.2
Year Snow depth (inches)
1953 1
1954 1
1962 1
2016 2
2014 2
1998 3
1964 3
1976 3
1971 3
2006 4

2016 has the second-lowest cumulative snowfall behind 1953 and is tied for second with 2014 for snow depth with 1953, 1954 and 1962 all having only 1 inch of snow on November 18th.

It also seems like recent years appear in these tables more frequently than would be expected. Grouping by decade and averaging cumulative snowfall and snow depth yields the pattern in the chart below. The error bars (not shown) are fairly large, so the differences between decades aren’t likely to be statistically significant, but there is a pattern of lower snowfall amounts in recent decades.

Decadal average cumulative snowfall and snow depth

Now let’s see what happened in those years with low snowfall and snow depth values in mid-November starting with cumulative snowfall. The following plot (and the subsequent snow depth plot) shows the data for the low-value years (and one very high snowfall year—1990), with each year’s data as a separate line. The smooth dark cyan line through the middle of each plot is the smoothed line through the values for all years; a sort of “average” snowfall and snow depth curve.

Cumulative snowfall, years with low snow on November 18

In all four mid-November low-snowfall years, the cumulative snowfall values remain below average throughout the winter, but snow did continue to fall as the season went on. Even the lowest winter year here, 2006–2007, still ended the winter with 15 inches of snow on the groud.

The following plot shows snow depth for the four years with the lowest snow depth on November 18th. The data is formatted the same as in the previous plot except we’ve jittered the values slightly to make the plot easier to read.

Snow depth, years with low snow on November 18

The pattern here is similar, but the snow depths get much closer to the average values. Snow depth for all four low snow years remain low throughout November, but start rising in December, dramatically in 1954 and 2014.

One of the highest snowfall years between 1950 and 2016 was 1990–1991 (shown on both plots). An impressive 32.8 inches of snow fell in eight days between December 21st and December 28th, accounting for the sharp increase in cumulative snowfall and snow depth shown on both plots. There are five years in the record where the cumulative total for the entire winter was lower than these eight days in 1990.


Despite the lack of snow on the ground to this point in the year, the record shows that we are still likely to get enough snow to fill in the trails. We may need to wait until mid to late December, but it’s even possible we’ll eventually reach the long term average depth before spring.


Here’s the R code used to generate the statistics, tables and plots from this post:


noaa <- src_postgres(host="localhost", dbname="noaa")

snow <- tbl(noaa, build_sql(
   "WITH wdoy_data AS (
         SELECT dte, dte - interval '120 days' as wdte,
            tmin_c, tmax_c, (tmin_c+tmax_c)/2.0 AS tavg_c,
            prcp_mm, snow_mm, snwd_mm
         FROM ghcnd_pivot
         WHERE station_name = 'FAIRBANKS INTL AP'
         AND dte > '1950-09-01')
   SELECT dte, date_part('year', wdte) AS wyear, date_part('doy', wdte) AS wdoy,
         to_char(dte, 'Mon DD') AS mmdd,
         tmin_c, tmax_c, tavg_c, prcp_mm, snow_mm, snwd_mm
   FROM wdoy_data")) %>%
            snwd_mm=as.integer(snwd_mm)) %>%
   select(dte, wyear, wdoy, mmdd,
            tmin_c, tmax_c, tavg_c, prcp_mm, snow_mm, snwd_mm) %>% collect()

write_csv(snow, "pafa_data_with_wyear_post_1950.csv")
save(snow, file="pafa_data_with_wyear_post_1950.rdata")

cum_snow <- snow %>%
         snow_mm=ifelse(,0,snow_mm)) %>%
   group_by(wyear) %>%
         snow_na=cumsum(snow_na)) %>%
   ungroup() %>%
   mutate(snow_in_cum=round(snow_mm_cum/25.4, 1),
         snwd_in=round(snwd_mm/25.4, 0))

nov_18_snow <- cum_snow %>%
   filter(mmdd=='Nov 18') %>%
   select(wyear, snow_in_cum, snwd_in) %>%

decadal_avg <- nov_18_snow %>%
   mutate(decade=as.integer(wyear/10)*10) %>%
   group_by(decade) %>%
   summarize(`Snow depth`=mean(snwd_in),
            `Cumulative Snowfall`=mean(snow_in_cum),

decadal_averages <- ggplot(decadal_avg %>%
                              gather(variable, value, -decade) %>%
                              filter(variable %in% c("Cumulative Snowfall",
                                                      "Snow depth")),
                           aes(x=as.factor(decade), y=value, fill=variable)) +
            theme_bw() +
            geom_bar(stat="identity", position="dodge") +
            scale_x_discrete(name="Decade", breaks=c(1950, 1960, 1970, 1980,
                                                   1990, 2000, 2010)) +
            scale_y_continuous(name="Inches", breaks=pretty_breaks(n=10)) +


date_x_scale <- cum_snow %>%
   filter(grepl(' (01|15)', mmdd), wyear=='1994') %>%
   select(wdoy, mmdd)

cumulative_snowfall <-
   ggplot(cum_snow %>% filter(wyear %in% c(1953, 1954, 2014, 2006, 1990),
            aes(x=wdoy, y=snow_in_cum, colour=as.factor(wyear))) +
   theme_bw() +
   geom_smooth(data=cum_snow %>% filter(wdoy>183, wdoy<320),
               aes(x=wdoy, y=snow_in_cum),
               size=0.5, colour="darkcyan",
               se=FALSE) +
   geom_line(position="jitter") +
                     labels=date_x_scale$mmdd) +
   scale_y_continuous(name="Cumulative snowfall (in)",
                     breaks=pretty_breaks(n=10)) +
   scale_color_discrete(name="Winter year")


snow_depth <-
   ggplot(cum_snow %>% filter(wyear %in% c(1953, 1954, 1962, 2014, 1990),
            aes(x=wdoy, y=snwd_in, colour=as.factor(wyear))) +
   theme_bw() +
   geom_smooth(data=cum_snow %>% filter(wdoy>183, wdoy<320),
               aes(x=wdoy, y=snwd_in),
               size=0.5, colour="darkcyan",
               se=FALSE) +
   geom_line(position="jitter") +
                     labels=date_x_scale$mmdd) +
   scale_y_continuous(name="Snow Depth (in)",
                     breaks=pretty_breaks(n=10)) +
   scale_color_discrete(name="Winter year")

tags: snow depth  snowfall  weather  climate  R 
fri, 13-may-2016, 06:02

This morning’s weather forecast:


May 13th seems very early in the year to hit 80 degrees in Fairbanks, so I decided to check it out. What I’m doing here is selecting all the dates where the temperature is above 80°F, then ranking those dates by year and date, and extracting the “winner” for each year (where rank is 1).

WITH warm AS (
   SELECT extract(year from dte) AS year, dte,
      c_to_f(tmax_c) AS tmax_f
   FROM ghcnd_pivot
   WHERE station_name = 'FAIRBANKS INTL AP'
      AND c_to_f(tmax_c) >= 80.0),
ranked AS (
   SELECT year, dte, tmax_f,
      row_number() OVER (PARTITION BY year
                         ORDER BY dte) AS rank
   FROM warm)
   extract(doy from dte) AS doy,
   round(tmax_f, 1) as tmax_f
FROM ranked
WHERE rank = 1

And the results:

Earliest 80 degree dates, Fairbanks Airport
Date Day of year High temperature (°F)
1995-05-09 129 80.1
1975-05-11 131 80.1
1942-05-12 132 81.0
1915-05-14 134 80.1
1993-05-16 136 82.0
2002-05-20 140 80.1
2015-05-22 142 80.1
1963-05-22 142 84.0
1960-05-23 144 80.1
2009-05-24 144 80.1

If we hit 80°F today, it’ll be the fourth earliest day of year to hit that temperature since records started being kept in 1904.

Update: We didn’t reach 80°F on the 13th, but got to 82°F on May 14th, tied with that date in 1915 for the fourth earliest 80 degree temperature.

sat, 25-apr-2015, 10:21


One of the best sources of weather data in the United States comes from the National Weather Service's Cooperative Observer Network (COOP), which is available from NCDC. It's daily data, collected by volunteers at more than 10,000 locations. We participate in this program at our house (station id DW1454 / GHCND:USC00503368), collecting daily minimum and maximum temperature, liquid precipitation, snowfall and snow depth. We also collect river heights for Goldstream Creek as part of the Alaska Pacific River Forecast Center (station GSCA2). Traditionally, daily temperature measurements were collecting using a minimum maximum thermometer, which meant that the only way to calculate average daily temperature was by averaging the minimum and maximum temperature. Even though COOP observers typically have an electronic instrument that could calculate average daily temperature from continuous observations, the daily minimum and maximum data is still what is reported.

In an earlier post we looked at methods used to calculate average daily temperature, and if there are any biases present in the way the National Weather Service calculates this using the average of the minimum and maximum daily temperature. We looked at five years of data collected at my house every five minutes, comparing the average of these temperatures against the average of the daily minimum and maximum. Here, we will be repeating this analysis using data from the Climate Reference Network stations in the United States.

The US Climate Reference Network is a collection of 132 weather stations that are properly sited, maintained, and include multiple redundant measures of temperature and precipitation. Data is available from and includes monthly, daily, and hourly statistics, and sub-hourly (5-minute) observations. We’ll be focusing on the sub-hourly data, since it closely matches the data collected at my weather station.

A similar analysis using daily and hourly CRN data appears here.

Getting the raw data

I downloaded all the data using the following Unix commands:

$ wget
$ wget -np -m
$ find -type f -name 'CRN*.txt' -exec gzip {} \;

The code to insert all of this data into a database can be found here. Once inserted, I have a table named crn_stations that has the station data, and one named crn_subhourly with the five minute observation data.


Once again, we’ll use R to read the data, process it, and produce plots.


Load the libraries we need:


Connect to the database and load the data tables.

noaa_db <- src_postgres(dbname="noaa", host="mason")

crn_stations <- tbl(noaa_db, "crn_stations") %>%

crn_subhourly <- tbl(noaa_db, "crn_subhourly")

Remove observations without temperature data, group by station and date, calculate average daily temperature using the two methods, remove any daily data without a full set of data, and collect the results into an R data frame. This looks very similar to the code used to analyze the data from my weather station.

crn_daily <-
    crn_subhourly %>%
        filter(! %>%
        mutate(date=date(timestamp)) %>%
        group_by(wbanno, date) %>%
                  n=n()) %>%
        filter(n==24*12) %>%
        mutate(anomaly=t_minmax_avg-t_mean) %>%
        select(wbanno, date, t_mean, t_minmax_avg, anomaly) %>%

The two types of daily average temperatures are calculated in this step:


Where t_mean is the value calculated from all 288 five minute observations, and t_minmax_avg is the value from the daily minimum and maximum.

Now we join the observation data with the station data. This attaches station information such as the name and latitude of the station to each record.

crn_daily_stations <-
    crn_daily %>%
        inner_join(crn_stations, by="wbanno") %>%
        select(wbanno, date, state, location, latitude, longitude,
               t_mean, t_minmax_avg, anomaly)

Finally, save the data so we don’t have to do these steps again.

save(crn_daily_stations, file="crn_daily_averages.rdata")


Here are the overall results of the analysis.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
## -11.9000  -0.1028   0.4441   0.4641   1.0190  10.7900

The average anomaly across all stations and all dates is 0.44 degrees Celsius (0.79 degrees Farenheit). That’s a pretty significant error. Half the data is between −0.1 and 1.0°C (−0.23 and +1.8°F) and the full range is −11.9 to +10.8°C (−21.4 to +19.4°F).


Let’s look at some plots.

Raw data by latitude

To start, we’ll look at all the anomalies by station latitude. The plot only shows one percent of the actual anomalies because plotting 512,460 points would take a long time and the general pattern is clear from the reduced data set.

p <- ggplot(data=crn_daily_stations %>% sample_frac(0.01),
            aes(x=latitude, y=anomaly)) +
    geom_point(position="jitter", alpha="0.2") +
    geom_smooth(method="lm", se=FALSE) +
    theme_bw() +
    scale_x_continuous(name="Station latitude", breaks=pretty_breaks(n=10)) +
    scale_y_continuous(name="Temperature anomaly (degrees C)",


The clouds of points show the differences between the min/max daily average and the actual daily average temperature, where numbers above zero represent cases where the min/max calculation overestimates daily average temperature. The blue line is the fit of a linear model relating latitude with temperature anomaly. We can see that the anomaly is always positive, averaging around half a degree at lower latitudes and drops somewhat as we proceed northward. You also get a sense from the actual data of how variable the anomaly is, and at what latitudes most of the stations are found.

Here are the regression results:

summary(lm(anomaly ~ latitude, data=crn_daily_stations))
## Call:
## lm(formula = anomaly ~ latitude, data = crn_daily_stations)
## Residuals:
##      Min       1Q   Median       3Q      Max
## -12.3738  -0.5625  -0.0199   0.5499  10.3485
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.7403021  0.0070381  105.19   <2e-16 ***
## latitude    -0.0071276  0.0001783  -39.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 0.9632 on 512458 degrees of freedom
## Multiple R-squared:  0.00311,    Adjusted R-squared:  0.003108
## F-statistic:  1599 on 1 and 512458 DF,  p-value: < 2.2e-16

The overall model and coefficients are highly significant, and show a slight decrease in the positive anomaly as we move farther north. Perhaps this is part of the reason why the analysis of my station (at a latitude of 64.89) showed an average anomaly close to zero (−0.07°C / −0.13°F).

Anomalies by month and latitude

One of the results of our earlier analysis was a seasonal pattern in the anomalies at our station. Since we also know there is a latitudinal pattern, in the data, let’s combine the two, plotting anomaly by month, and faceting by latitude.

Station latitude are binned into groups for plotting, and the plots themselves show the range that cover half of all anomalies for that latitude category × month. Including the full range of anomalies in each group tends to obscure the overall pattern, and the plot of the raw data didn’t show an obvious skew to the rarer anomalies.

Here’s how we set up the data frames for the plot.

crn_daily_by_month <-
    crn_daily_stations %>%
               lat_bin=factor(ifelse(latitude<30, '<30',
                                     ifelse(latitude>60, '>60',
                              levels=c('<30', '30-40', '40-50',
                                       '50-60', '>60')))

summary_stats <- function(l) {
    s <- summary(l)
               first=s['1st Qu.'],
               third=s['3rd Qu.'],

crn_by_month_lat_bin <-
    crn_daily_by_month %>%
        group_by(month, lat_bin) %>%
        do(summary_stats(.$anomaly)) %>%

station_years <-
    crn_daily_by_month %>%
        mutate(year=year(date)) %>%
        group_by(wbanno, lat_bin) %>%
        summarize() %>%
        group_by(lat_bin) %>%

And the plot itself. At the end, we’re using a function called facet_adjust, which adds x-axis tick labels to the facet on the right that wouldn't ordinarily have them. The code comes from this stack overflow post.

p <- ggplot(data=crn_by_month_lat_bin,
            aes(x=month, ymin=first, ymax=third, y=mean)) +
    geom_hline(yintercept=0, alpha=0.2) +
    geom_hline(data=crn_by_month_lat_bin %>%
                        group_by(lat_bin) %>%
               aes(yintercept=mean), colour="darkorange", alpha=0.5) +
    geom_pointrange() +
    facet_wrap(~ lat_bin, ncol=3) +
    geom_text(data=station_years, size=4,
              aes(x=2.25, y=-0.5, ymin=0, ymax=0,
                  label=paste('n =', station_years))) +
    scale_y_continuous(name="Range including 50% of temperature anomalies") +
                     labels=c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                              'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')) +
    theme_bw() +
    theme(axis.text.x=element_text(angle=45, hjust=1, vjust=1.25),

Each plot shows the range of anomalies from the first to the third quartile (50% of the observed anomalies) by month, with the dot near the middle of the line at the mean anomaly. The orange horizontal line shows the overall mean anomaly for that latitude category, and the count at the bottom of the plot indicates the number of “station years” for that latitude category.

It’s clear that there are seasonal patterns in the differences between the mean daily temperature and the min/max estimate. But each plot looks so different from the next that it’s not clear if the patterns we are seeing in each latitude category are real or artificial. It is also problematic that three of our latitude categories have very little data compared with the other two. It may be worth performing this analysis in a few years when the lower and higher latitude stations have a bit more data.


This analysis shows that there is a clear bias in using the average of minimum and maximum daily temperature to estimate average daily temperature. Across all of the CRN stations, the min/max estimator overestimates daily average temperature by almost a half a degree Celsius (0.8°F).

We also found that this error is larger at lower latitudes, and that there are seasonal patterns to the anomalies, although the seasonal patterns don’t seem to have clear transitions moving from lower to higher latitudes.

The current length of the CRN record is quite short, especially for the sub-hourly data used here, so the patterns may not be representative of the true situation.

tags: climate  temperature  CRN  COOP  weather  R  ggplot 
tue, 21-apr-2015, 17:33


The following is a document-style version of a presentation I gave at work a couple weeks ago. It's a little less useful for a general audience because you don't have access to the same database I have, but I figured it might be useful for someone who is looking at using dplyr or in manipulating the GHCND data from NCDC.


Today we’re going to briefly take a look at the GHCND climate database and a couple new R packages (dplyr and tidyr) that make data import and manipulation a lot easier than using the standard library.

For further reading, consult the vignettes for dplyr and tidyr, and download the cheat sheet:

GHCND database

The GHCND database contains daily observation data from locations around the world. The README linked above describes the data set and the way the data is formatted. I have written scripts that process the station data and the yearly download files and insert it into a PostgreSQL database (noaa).

The script for inserting a yearly file (downloaded from is here:

“Tidy” data

Without going into too much detail on the subject (read Hadley Wickham’s paper) for more information, but the basic idea is that it is much easier to analyze data when it is in a particular, “tidy”, form. A Tidy dataset has a single table for each type of real world object or type of data, and each table has one column per variable measured and one row per observation.

For example, here’s a tidy table representing daily weather observations with station × date as rows and the various variables as columns.

Station Date tmin_c tmax_c prcp snow ...
PAFA 2014-01-01 12 24 0.00 0.0 ...
PAFA 2014-01-01 8 11 0.02 0.2 ...
... ... ... ... ... ... ...

Getting raw data into this format is what we’ll look at today.

R libraries & data import

First, let’s load the libraries we’ll need:

library(dplyr)      # data import
library(tidyr)      # column / row manipulation
library(knitr)      # tabular export
library(ggplot2)    # plotting
library(scales)     # “pretty” scaling
library(lubridate)  # date / time manipulations

dplyr and tidyr are the data import and manipulation libraries we will use, knitr is used to produce tabular data in report-quality forms, ggplot2 and scales are plotting libraries, and lubridate is a library that makes date and time manipulation easier.

Also note the warnings about how several R functions have been “masked” when we imported dplyr. This just means we'll be getting the dplyr versions instead of those we might be used to. In cases where we need both, you can preface the function with it's package: base::filter would us the normal filter function instead of the one from dplyr.

Next, connect to the database and the three tables we will need:

noaa_db <- src_postgres(host="mason",
ghcnd_obs <- tbl(noaa_db, "ghcnd_obs")
ghcnd_vars <- tbl(noaa_db, "ghcnd_variables")

The first statement connects us to the database and the next two create table links to the observation table and the variables table.

Here’s what those two tables look like:

## Observations: 29404870
## Variables:
## $ station_id  (chr) "USW00027502", "USW00027502", "USW00027502", "USW0...
## $ dte         (date) 2011-05-01, 2011-05-01, 2011-05-01, 2011-05-01, 2...
## $ variable    (chr) "AWND", "FMTM", "PRCP", "SNOW", "SNWD", "TMAX", "T...
## $ raw_value   (dbl) 32, 631, 0, 0, 229, -100, -156, 90, 90, 54, 67, 1,...
## $ meas_flag   (chr) "", "", "T", "T", "", "", "", "", "", "", "", "", ...
## $ qual_flag   (chr) "", "", "", "", "", "", "", "", "", "", "", "", ""...
## $ source_flag (chr) "X", "X", "X", "X", "X", "X", "X", "X", "X", "X", ...
## $ time_of_obs (int) NA, NA, 0, NA, NA, 0, 0, NA, NA, NA, NA, NA, NA, N...
## Observations: 82
## Variables:
## $ variable       (chr) "AWND", "EVAP", "MDEV", "MDPR", "MNPN", "MXPN",...
## $ description    (chr) "Average daily wind speed (tenths of meters per...
## $ raw_multiplier (dbl) 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0....

Each row in the observation table rows contain the station_id, date, a variable code, the raw value for that variable, and a series of flags indicating data quality, source, and special measurements such as the “trace” value used for precipitation under the minimum measurable value.

Each row in the variables table contains a variable code, description and the multiplier used to convert the raw value from the observation table into an actual value.

This is an example of completely “normalized” data, and it’s stored this way because not all weather stations record all possible variables, and rather than having a single row for each station × date with a whole bunch of empty columns for those variables not measured, each row contains the station × data × variable data.

We are also missing information about the stations, so let’s load that data:

fai_stations <-
    tbl(noaa_db, "ghcnd_stations") %>%
    filter(station_name %in% c("FAIRBANKS INTL AP",
                               "UNIVERSITY EXP STN",
                               "COLLEGE OBSY"))
## Observations: 3
## Variables:
## $ station_id   (chr) "USC00502107", "USW00026411", "USC00509641"
## $ station_name (chr) "COLLEGE OBSY", "FAIRBANKS INTL AP", "UNIVERSITY ...
## $ latitude     (dbl) 64.86030, 64.80389, 64.85690
## $ longitude    (dbl) -147.8484, -147.8761, -147.8610
## $ elevation    (dbl) 181.9656, 131.6736, 144.7800
## $ coverage     (dbl) 0.96, 1.00, 0.98
## $ start_date   (date) 1948-05-16, 1904-09-04, 1904-09-01
## $ end_date     (date) 2015-04-03, 2015-04-02, 2015-03-13
## $ variables    (chr) "TMIN TOBS WT11 SNWD SNOW WT04 WT14 TMAX WT05 DAP...
## $ the_geom     (chr) "0101000020E6100000A5BDC117267B62C0EC2FBB270F3750...

The first part is the same as before, loading the ghcnd_stations table, but we are filtering that data down to just the Fairbanks area stations with long term records. To do this, we use the pipe operator %>% which takes the data from the left side and passes it to the function on the right side, the filter function in this case.

filter requires one or more conditional statements with variable names on the left side and the condition on the right. Multiple conditions can be separated by commas if you want all the conditions to be required (AND) or separated by a logic operator (& for AND, | for OR). For example: filter(latitude > 70, longitude < -140).

When used on database tables, filter can also use conditionals that are built into the database which are passed directly as part of a WHERE clause. In our code above, we’re using the %in% operator here to select the stations from a list.

Now we have the station_ids we need to get just the data we want from the observation table and combine it with the other tables.

Combining data

Here’s how we do it:

fai_raw <-
    ghcnd_obs %>%
    inner_join(fai_stations, by="station_id") %>%
    inner_join(ghcnd_vars, by="variable") %>%
    mutate(value=raw_value*raw_multiplier) %>%
    filter(qual_flag=='') %>%
    select(station_name, dte, variable, value) %>%

In order, here’s what we’re doing:

  • Assign the result to fai_raw
  • Join the observation table with the filtered station data, using station_id as the variable to combine against. Because this is an “inner” join, we only get results where station_id matches in both the observation and the filtered station data. At this point we only have observation data from our long-term Fairbanks stations.
  • Join the variable table with the Fairbanks area observation data, using variable to link the tables.
  • Add a new variable called value which is calculated by multiplying raw_value (coming from the observation table) by raw_multiplier (coming from the variable table).
  • Remove rows where the quality flag is not an empty space.
  • Select only the station name, date, variable and actual value columns from the data. Before we did this, each row would contain every column from all three tables, and most of that information is not necessary.
  • Finally, we “collect” the results. dplyr doesn’t actually perform the full SQL until it absolutely has to. Instead it’s retrieving a small subset so that we can test our operations quickly. When we are happy with the results, we use collect() to grab the full data.

De-normalize it

The data is still in a format that makes it difficult to analyze, with each row in the result containing a single station × date × variable observation. A tidy version of this data requires each variable be a column in the table, each row being a single date at each station.

To “pivot” the data, we use the spread function, and we'll also calculate a new variable and reduce the number of columns in the result.

fai_pivot <-
    fai_raw %>%
    spread(variable, value) %>%
    mutate(TAVG=(TMIN+TMAX)/2.0) %>%
    select(station_name, dte, TAVG, TMIN, TMAX, TOBS, PRCP, SNOW, SNWD,
           WSF1, WDF1, WSF2, WDF2, WSF5, WDF5, WSFG, WDFG, TSUN)
## Source: local data frame [6 x 18]
##   station_name        dte  TAVG TMIN TMAX TOBS PRCP SNOW SNWD WSF1 WDF1
## 1 COLLEGE OBSY 1948-05-16 11.70  5.6 17.8 16.1   NA   NA   NA   NA   NA
## 2 COLLEGE OBSY 1948-05-17 15.55 12.2 18.9 17.8   NA   NA   NA   NA   NA
## 3 COLLEGE OBSY 1948-05-18 14.40  9.4 19.4 16.1   NA   NA   NA   NA   NA
## 4 COLLEGE OBSY 1948-05-19 14.15  9.4 18.9 12.2   NA   NA   NA   NA   NA
## 5 COLLEGE OBSY 1948-05-20 10.25  6.1 14.4 14.4   NA   NA   NA   NA   NA
## 6 COLLEGE OBSY 1948-05-21  9.75  1.7 17.8 17.8   NA   NA   NA   NA   NA
## Variables not shown: WSF2 (dbl), WDF2 (dbl), WSF5 (dbl), WDF5 (dbl), WSFG
##   (dbl), WDFG (dbl), TSUN (dbl)

spread takes two parameters, the variable we want to spread across the columns, and the variable we want to use as the data value for each row × column intersection.


Now that we've got the data in a format we can work with, let's look at a few examples.

Find the coldest temperatures by winter year

First, let’s find the coldest winter temperatures from each station, by winter year. “Winter year” is just a way of grouping winters into a single value. Instead of the 2014–2015 winter, it’s the 2014 winter year. We get this by subtracting 92 days (the days in January, February, March) from the date, then pulling off the year.

Here’s the code.

fai_winter_year_minimum <-
    fai_pivot %>%
        mutate(winter_year=year(dte - days(92))) %>%
        filter(winter_year < 2014) %>%
        group_by(station_name, winter_year) %>%
        select(station_name, winter_year, TMIN) %>%
        summarize(tmin=min(TMIN*9/5+32, na.rm=TRUE), n=n()) %>%
        filter(n>350) %>%
        select(station_name, winter_year, tmin) %>%
        spread(station_name, tmin)

last_twenty <-
    fai_winter_year_minimum %>%
        filter(winter_year > 1993)

## Source: local data frame [20 x 4]
## 1         1994       -43.96            -47.92             -47.92
## 2         1995       -45.04            -45.04             -47.92
## 3         1996       -50.98            -50.98             -54.04
## 4         1997       -43.96            -47.92             -47.92
## 5         1998       -52.06            -54.94             -54.04
## 6         1999       -50.08            -52.96             -50.98
## 7         2000       -27.94            -36.04             -27.04
## 8         2001       -40.00            -43.06             -36.04
## 9         2002       -34.96            -38.92             -34.06
## 10        2003       -45.94            -45.94                 NA
## 11        2004           NA            -47.02             -49.00
## 12        2005       -47.92            -50.98             -49.00
## 13        2006           NA            -43.96             -41.98
## 14        2007       -38.92            -47.92             -45.94
## 15        2008       -47.02            -47.02             -49.00
## 16        2009       -32.98            -41.08             -41.08
## 17        2010       -36.94            -43.96             -38.02
## 18        2011       -47.92            -50.98             -52.06
## 19        2012       -43.96            -47.92             -45.04
## 20        2013       -36.94            -40.90                 NA

See if you can follow the code above. The pipe operator makes is easy to see each operation performed along the way.

There are a couple new functions here, group_by and summarize. group_by indicates at what level we want to group the data, and summarize uses those groupings to perform summary calculations using aggregate functions. We group by station and winter year, then we use the minimum and n functions to get the minimum temperature and number of days in each year where temperature data was available. You can see we are using n to remove winter years where more than two weeks of data are missing.

Also notice that we’re using spread again in order to make a single column for each station containing the minimum temperature data.

Here’s how we can write out the table data as a restructuredText document, which can be converted into many document formats (PDF, ODF, HTML, etc.):

print(kable(last_twenty, format="rst"))
Minimum temperatures by winter year, station
1994 -43.96 -47.92 -47.92
1995 -45.04 -45.04 -47.92
1996 -50.98 -50.98 -54.04
1997 -43.96 -47.92 -47.92
1998 -52.06 -54.94 -54.04
1999 -50.08 -52.96 -50.98
2000 -27.94 -36.04 -27.04
2001 -40.00 -43.06 -36.04
2002 -34.96 -38.92 -34.06
2003 -45.94 -45.94 NA
2004 NA -47.02 -49.00
2005 -47.92 -50.98 -49.00
2006 NA -43.96 -41.98
2007 -38.92 -47.92 -45.94
2008 -47.02 -47.02 -49.00
2009 -32.98 -41.08 -41.08
2010 -36.94 -43.96 -38.02
2011 -47.92 -50.98 -52.06
2012 -43.96 -47.92 -45.04
2013 -36.94 -40.90 NA


Finally, let’s plot the minimum temperatures for all three stations.

q <-
    fai_winter_year_minimum %>%
        gather(station_name, tmin, -winter_year) %>%
        arrange(winter_year) %>%
        ggplot(aes(x=winter_year, y=tmin, colour=station_name)) +
            geom_point(size=1.5, position=position_jitter(w=0.5,h=0.0)) +
            geom_smooth(method="lm", se=FALSE) +
            scale_x_continuous(name="Winter Year", breaks=pretty_breaks(n=20)) +
            scale_y_continuous(name="Minimum temperature (degrees F)", breaks=pretty_breaks(n=10)) +
                               labels=c("College Observatory",
                                        "Fairbanks Airport",
                                        "University Exp. Station"),
                               values=c("darkorange", "blue", "darkcyan")) +
            theme_bw() +
            # theme(legend.position = c(0.150, 0.850)) +
            theme(axis.text.x = element_text(angle=45, hjust=1))


To plot the data, we need the data in a slightly different format with each row containing winter year, station name and the minimum temperature. We’re plotting minimum temperature against winter year, coloring the points and trendlines using the station name. That means all three of those variables need to be on the same row.

To do that we use gather. The first parameter is the name of variable the columns will be moved into (the station names, which are currently columns, will become values in a row named station_name). The second is the name of the column that stores the observations (tmin) and the parameters after that are the list of columns to gather together. In our case, rather than specifying the names of the columns, we're specifying the inverse: all the columns except winter_year.

The result of the gather looks like this:

fai_winter_year_minimum %>%
    gather(station_name, tmin, -winter_year)
## Source: local data frame [321 x 3]
##    winter_year station_name tmin
## 1         1905 COLLEGE OBSY   NA
## 2         1907 COLLEGE OBSY   NA
## 3         1908 COLLEGE OBSY   NA
## 4         1909 COLLEGE OBSY   NA
## 5         1910 COLLEGE OBSY   NA
## 6         1911 COLLEGE OBSY   NA
## 7         1912 COLLEGE OBSY   NA
## 8         1913 COLLEGE OBSY   NA
## 9         1915 COLLEGE OBSY   NA
## 10        1916 COLLEGE OBSY   NA
## ..         ...          ...  ...


The plot is produced using ggplot2. A full introduction would be a seminar by itself, but the basics of our plot can be summarized as follows.

ggplot(aes(x=winter_year, y=tmin, colour=station_name)) +

aes defines variables and grouping.

geom_point(size=1.5, position=position_jitter(w=0.5,h=0.0)) +
geom_smooth(method="lm", se=FALSE) +

geom_point draws points, geom_smooth draws fitted lines.

scale_x_continuous(name="Winter Year", breaks=pretty_breaks(n=20)) +
scale_y_continuous(name="Minimum temperature (degrees F)",
                    breaks=pretty_breaks(n=10)) +
                    labels=c("College Observatory", "Fairbanks Airport",
                            "University Exp. Station"),
                    values=c("darkorange", "blue", "darkcyan")) +

Scale functions define how the data is scaled into a plot and controls labelling.

theme_bw() +
theme(axis.text.x = element_text(angle=45, hjust=1))

Theme functions controls the style.

For more information:

Linear regression, winter year and minimum temperature

Finally let’s look at the significance of those regression lines:

summary(lm(data=fai_winter_year_minimum, `COLLEGE OBSY` ~ winter_year))
## Call:
## lm(formula = `COLLEGE OBSY` ~ winter_year, data = fai_winter_year_minimum)
## Residuals:
##      Min       1Q   Median       3Q      Max
## -19.0748  -5.8204   0.1907   3.8042  17.1599
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -275.01062  105.20884  -2.614   0.0114 *
## winter_year    0.11635    0.05311   2.191   0.0325 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 7.599 on 58 degrees of freedom
##   (47 observations deleted due to missingness)
## Multiple R-squared:  0.07643,    Adjusted R-squared:  0.06051
## F-statistic:   4.8 on 1 and 58 DF,  p-value: 0.03249
summary(lm(data=fai_winter_year_minimum, `FAIRBANKS INTL AP` ~ winter_year))
## Call:
## lm(formula = `FAIRBANKS INTL AP` ~ winter_year, data = fai_winter_year_minimum)
## Residuals:
##     Min      1Q  Median      3Q     Max
## -15.529  -4.605  -1.025   4.007  19.764
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -171.19553   43.55177  -3.931 0.000153 ***
## winter_year    0.06250    0.02221   2.813 0.005861 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 7.037 on 104 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.07073,    Adjusted R-squared:  0.06179
## F-statistic: 7.916 on 1 and 104 DF,  p-value: 0.005861
summary(lm(data=fai_winter_year_minimum, `UNIVERSITY EXP STN` ~ winter_year))
## Call:
## lm(formula = `UNIVERSITY EXP STN` ~ winter_year, data = fai_winter_year_minimum)
## Residuals:
##     Min      1Q  Median      3Q     Max
## -15.579  -5.818  -1.283   6.029  19.977
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -158.41837   51.03809  -3.104  0.00248 **
## winter_year    0.05638    0.02605   2.164  0.03283 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 8.119 on 100 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.04474,    Adjusted R-squared:  0.03519
## F-statistic: 4.684 on 1 and 100 DF,  p-value: 0.03283

Essentially, all the models show a significant increase in minimum temperature over time, but none of them explain very much of the variation in minimum temperature.


This presentation was produced with the RMarkdown package. Allows you to mix text and R code, which is then run through R to produce documents in Word, PDF, HTML, and presentation formats.

0 1 >>
Meta Photolog Archives