Following up on my previous post, I tried the regression approach for predicting future snow depth from current values. As you recall, I produced a plot that showed how much snow we’ve had on the ground on each date at the Fairbanks Airport between 1917 and 2013. These boxplots gave us an idea of what a normal snow depth looks like on each date, but couldn’t really tell us much about what we might expect for snow depth for the rest of the winter.

# Regression

I ran a linear regression analysis looking at how snow depth on November 8th relates to snow depth on November 27th and December 25th of the same year. Here’s the SQL:

```
-- One row per year, with the snow depth (in inches) on each of the three dates
SELECT * FROM (
    SELECT extract(year from dte) AS year,
        max(CASE WHEN to_char(dte, 'mm-dd') = '11-08'
                 THEN round(snwd_mm/25.4, 1)
                 ELSE NULL END) AS nov_8,
        max(CASE WHEN to_char(dte, 'mm-dd') = '11-27'
                 THEN round(snwd_mm/25.4, 1)
                 ELSE NULL END) AS nov_27,
        max(CASE WHEN to_char(dte, 'mm-dd') = '12-25'
                 THEN round(snwd_mm/25.4, 1)
                 ELSE NULL END) AS dec_25
    FROM ghcnd_pivot
    WHERE station_name = 'FAIRBANKS INTL AP'
        AND snwd_mm IS NOT NULL
    GROUP BY extract(year from dte)
    ORDER BY year
) AS sub
-- Keep only the years that have a measurement on all three dates
WHERE nov_8 IS NOT NULL
    AND nov_27 IS NOT NULL
    AND dec_25 IS NOT NULL;
```

I’m grouping on year, then grabbing the snow depth for the three dates of interest. I would have liked to include dates in January and February in order to see how the relationship weakens as the winter progresses, but that’s a lot more complicated because then we are comparing the dates from one year to the next and the grouping I used in the query above wouldn’t work.
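One way around the year-boundary problem would be to group on a "winter year" instead of the calendar year, so that January and February dates land in the same group as the preceding November. Here is a minimal sketch in R, assuming a hypothetical helper named `winter_year` and an arbitrary July cutoff; neither is part of the analysis above.

```r
# Hypothetical helper: label each date with the year its winter started in.
# The July cutoff is an assumption; any month with no snow on the ground works.
winter_year <- function(dte) {
  dte <- as.Date(dte)
  yr <- as.integer(format(dte, "%Y"))
  mon <- as.integer(format(dte, "%m"))
  # July through December belong to the winter starting that year;
  # January through June belong to the winter that started the year before.
  ifelse(mon >= 7, yr, yr - 1L)
}

# All four dates fall in the winter that started in 2013
dates <- as.Date(c("2013-11-08", "2013-12-25", "2014-01-15", "2014-02-15"))
winter_year(dates)
```

Grouping on a value like this rather than on `extract(year from dte)` would keep a whole winter in one row, which is what the January and February comparison needs.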

One note on this analysis: linear regression has a bunch of assumptions that need to be met before the analysis can be considered valid. One of these assumptions is that observations are independent of one another, which is problematic in this case because snow depth is a cumulative statistic; the depth tomorrow is necessarily related to the depth of the snow today (roughly, snow depth tomorrow = snow depth today + snowfall, minus any settling or melting). Whether it’s necessarily related to the depth of the snow a month from now is less certain, and I’m making the possibly dubious assumption that autocorrelation disappears when the time interval between observations is longer than a few weeks.

# Results

Here are the results comparing the snow depth on November 8th to November 27th:

```
> reg <- lm(data=results, nov_27 ~ nov_8)
> summary(reg)

Call:
lm(formula = nov_27 ~ nov_8, data = results)

Residuals:
    Min      1Q  Median      3Q     Max
-8.7132 -3.0490 -0.6063  1.7258 23.8403

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   3.1635     0.9707   3.259   0.0016 **
nov_8         1.1107     0.1420   7.820 1.15e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.775 on 87 degrees of freedom
Multiple R-squared:  0.4128,    Adjusted R-squared:  0.406
F-statistic: 61.16 on 1 and 87 DF,  p-value: 1.146e-11
```

And between November 8th and December 25th:

```
> reg <- lm(data=results, dec_25 ~ nov_8)
> summary(reg)

Call:
lm(formula = dec_25 ~ nov_8, data = results)

Residuals:
    Min      1Q  Median      3Q     Max
-10.209  -3.195  -1.195   2.781  10.791

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.2227     0.8723   7.133 2.75e-10 ***
nov_8         0.9965     0.1276   7.807 1.22e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.292 on 87 degrees of freedom
Multiple R-squared:  0.412,     Adjusted R-squared:  0.4052
F-statistic: 60.95 on 1 and 87 DF,  p-value: 1.219e-11
```

The two regressions are very similar. In both, the coefficients and the overall
model are highly significant, and the *R²* value indicates that in each case, the
snow depth on November 8th explains about 41% of the variation in the snow depth
on the later date. The amount of variation explained hardly changes at all,
even though the second target date is almost a month later than the first.
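As a quick consistency check on those *R²* values, they can be recovered from the F-statistics in the output, and their square roots give the correlation between the two dates. This is a small sketch using only numbers copied from the `summary()` output; the variable names are mine.

```r
# Values copied from the two summary() outputs above
f_stat <- c(nov_27 = 61.16, dec_25 = 60.95)
resid_df <- 87

# For a regression with a single predictor, R^2 = F / (F + residual df)
r_squared <- f_stat / (f_stat + resid_df)

# Both slopes are positive, so the correlation is the positive square root
correlation <- sqrt(r_squared)

round(r_squared, 3)
round(correlation, 2)
```

Seen as a correlation, the relationship is about 0.64 in both cases, which is another way of saying that November 8th is informative but far from decisive.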

Here’s a plot of the relationship between snow depth on today’s date (November 8th) and on Christmas (PDF version):

The blue line is the linear regression model.

# Conclusions

For 2014, we’ve got 2 inches of snow on the ground on November 8th. The models predict we’ll have 5.4 inches on November 27th and 8.2 inches on December 25th. That isn’t great, but keep in mind that even though the relationship is highly significant, it explains less than half of the variation in the data, which means it’s quite possible we will have a lot more, or a lot less. Looking back at the plot, you can see that for all the years where we had two inches of snow on November 8th, we had between five and fifteen inches of snow in that same year on December 25th. I’m certainly hoping we’re closer to fifteen.
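The arithmetic behind those predictions can be reproduced from the coefficients in the regression output. The sketch below also attaches a rough 95% band built from the residual standard error alone. That band is my own approximation, not something computed in the original analysis, and it ignores the uncertainty in the fitted coefficients, so it is slightly too narrow. The helper `rough_prediction` is hypothetical.

```r
# Coefficients and residual standard errors copied from the summary() outputs
models <- list(
  nov_27 = list(intercept = 3.1635, slope = 1.1107, sigma = 4.775),
  dec_25 = list(intercept = 6.2227, slope = 0.9965, sigma = 4.292)
)
resid_df <- 87    # residual degrees of freedom in both models
nov_8_2014 <- 2   # inches of snow on the ground on November 8th, 2014

# Hypothetical helper: point prediction plus an approximate 95% band.
# Only the residual standard error is used, so coefficient uncertainty
# is ignored and the band is an underestimate.
rough_prediction <- function(m, depth, df_resid) {
  fit <- m$intercept + m$slope * depth
  half_width <- qt(0.975, df_resid) * m$sigma
  c(fit = fit, lower = fit - half_width, upper = fit + half_width)
}

# One row per target date, with columns fit, lower and upper
predictions <- t(sapply(models, rough_prediction,
                        depth = nov_8_2014, df_resid = resid_df))
round(predictions, 1)
```

With the fitted model object in hand, `predict(reg, newdata = data.frame(nov_8 = 2), interval = "prediction")` would give the proper interval. The lower ends of these rough bands dip below zero, which is just the linear model not knowing that snow depth can't be negative.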