Setup and Motivation

  • We need mgcv for GAM models, mgcViz for added GAM visualization capabilities.

  • We are using the CO\(_2\) dataset that has been collected at the top of Mauna Loa, Hawaii since 1958. The data were dowloaded from here.

  • A linear model tries to fit the best straight line that passes through the data, so it doesn’t work well for all datasets.

  • In contrast, a GAM can capture complex relationships by fitting a non-linear smooth function through the data, while controlling how wiggly the smooth can get.

  • Create a year and month variable, rename the variable of interest to CO\(_2\).

  • Let’s look at the the full data set and the data since 2010.

  • Add a linear line lm and a gam-smooth line with gam in ggplot2.

co2<-read.csv("co2_mm_mlo.csv", skip=40)

co2 <- co2 |>
  mutate(month_year = make_date(co2$year, co2$month)) |>
  rename(co2 = average)

ggplot(co2, aes(x=decimal.date, y=co2))+
    geom_point()+
    geom_smooth(method='lm',col="red")+
    geom_smooth(method='gam',col="blue")+
    theme_bw()

co2_s<-co2[co2$year >= 2010,]
ggplot(co2_s, aes(x=decimal.date, y=co2))+
    geom_point()+
    geom_smooth(method='lm',col="red")+
    geom_smooth(method='gam',col="blue")+
    theme_bw()

  • We don’t see too much difference between the lm line and the gam line, although the smooth line fits a bit better on the whole dataset. We are missing the seasonal cycles with both trends.

  • To see if we can fit the seasonal cycles better, subset the data to look at monthly trends in the year 2024.

co2_2024<-co2[co2$year == 2024,]
ggplot(co2_2024, aes(x=decimal.date, y=co2))+
    geom_point()+
    geom_smooth(method='lm',col="red")+
    geom_smooth(method='gam',col="blue")+
    theme_bw()

  • The gam line is much better here, and the lm does not fit well at all.

  • Fit a lm to the 2024 subset.

lmod<-lm(co2~decimal.date, data=co2_2024)
summary(lmod)
## 
## Call:
## lm(formula = co2 ~ decimal.date, data = co2_2024)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.46169 -1.52754  0.07059  1.49176  2.24601 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3329.140   3651.894   0.912    0.383
## decimal.date   -1.435      1.804  -0.795    0.445
## 
## Residual standard error: 1.798 on 10 degrees of freedom
## Multiple R-squared:  0.05949,    Adjusted R-squared:  -0.03456 
## F-statistic: 0.6326 on 1 and 10 DF,  p-value: 0.4449
  • We see the date is not significant, and we have a very poor overall model with low R\(^2\).

  • We definitely need something better than a linear regression for these data.

Polynomial basis

  • Let’s look at a simple polynomial basis (covered in lecture).
lm_co2=lm(co2~month+I(month^2)+I(month^3),data=co2_2024)
lm_co2$coefficients
##  (Intercept)        month   I(month^2)   I(month^3) 
## 417.74242424   5.26447330  -0.92531302   0.04489899
# Fit polynomial result
month<-seq(1:12)
y <- 417.74242424 + 5.26447330*month -0.92531302*month^2 + 0.04489899*month^3

plot(co2_2024$co2, ylab="CO2, ppm",xlab="Month")
lines(y)

  • Use polynomial result to predict for different year
co2_2015<-co2[co2$year == 2015,]
poly_pred<-predict(lm_co2,newdat=co2_2015)

plot(co2_2015$co2,ylim=c(375,425),ylab="CO2, ppm",xlab="Month")
lines(poly_pred,col='red')

  • We see that this is not transferable from one year to the next because the overall trend is increasing.

GAM

  • Now let’s fit and look at the gam model.
  • To fit a gam model in the mgcv package, use s() with options bs="cr" for cubic regression and k sets the smoothness. k is the number of knots on the bases, however since the model fit is penalized, it estimates the smoothness parameter of the penalty so k gets updated by generalized cross validation.
  • Plot the model shows the fitted gam (smooth part only).
gmod_all<-gam(co2~s(decimal.date, bs="cr", k=20), data=co2)
summary(gmod_all)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## co2 ~ s(decimal.date, bs = "cr", k = 20)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 359.77219    0.07588    4741   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                   edf Ref.df     F p-value    
## s(decimal.date) 9.085  11.14 16058  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.996   Deviance explained = 99.6%
## GCV = 4.6826  Scale est. = 4.6238    n = 803
plot(gmod_all)

gampredco2<-predict(gmod_all, co2_s)
plot(gampredco2,type='l')

gmod_2024<-gam(co2~s(decimal.date, bs="cr"), data=co2_2024)
summary(gmod_2024)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## co2 ~ s(decimal.date, bs = "cr")
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 424.60417    0.05713    7433 1.93e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                   edf Ref.df   F  p-value    
## s(decimal.date) 6.997  8.031 107 0.000218 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.987   Deviance explained = 99.5%
## GCV = 0.11741  Scale est. = 0.039162  n = 12
plot(gmod_2024)

gampredco2_2024<-predict(gmod_2024, co2_2024)
plot(gampredco2_2024,type='l')

  • From the output: the edf is “estimated degrees of freedom”, which tells us approximately how many knots were fit (remember knots are related to the number of connected basis functions to create the overall curve). It is best to round edf to the nearest integer.

  • On the full dataset we see the edf is estimated to be 9 (we started with 20), this is pretty low for the amount of data, and we also see that it just captures a smooth overall trend over the years. Nevertheless, this smooth is significant (p-value<0.001).

  • On the 2024 dataset we see the edf is estimated to be 7. Recall, in one year there are only 12 data points, so 7 df is a lot! We see from the prediction plot that it fits the trend very well. The R\(^2\) of this model is very high.

  • In practice, beware of overfitting! A high edf may result in a good looking plot and high adjusted R\(^2\), but it may not translate well in validation.

  • We see that the smooth term is still not doing much. We need to capture the within year cycles in the data, so let’s add a smooth on month.

  • Now let’s try fitting to all data and to capture both the overall trend and seasonal (within year trends). For this we can split the time series into two parts and apply different smooths to each part (long term and shorter term trends).

  • We use a cubic regression spline cr for the overall trend and fit it specifying a larger number of knots like before. We use a cyclic cubic spline cc to capture the repeating seasonal trends from year to year.

  • Plot the gam model just with the plot() function of the gam model object.

gmod2<-gam(co2~s(decimal.date,bs="cr",k=20)+s(month,bs="cc"),data=co2)
summary(gmod2)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## co2 ~ s(decimal.date, bs = "cr", k = 20) + s(month, bs = "cc")
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 359.77219    0.01661   21666   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                   edf Ref.df      F p-value    
## s(decimal.date) 18.69  18.99 197068  <2e-16 ***
## s(month)         7.24   8.00   1943  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =      1   Deviance explained =  100%
## GCV = 0.22909  Scale est. = 0.22141   n = 803
plot(gmod2)

  • The model summary shows estimates for two smooths, both are significant and each has a different edf. The overall trend has 18 df now (like 18 knots over the whole range of data) and the month trend again has 7 knots to capture the seasonal, within-year trends.

  • The plots from the model output are a bit decieving, particularly for the second smooth over month. Let’s see how the overall model fits and looks by predicting the data and plotting the predictions.

pred_co2<-predict.gam(gmod2,co2)
plot(pred_co2,type='l')

  • Looks good! With the two smooths we capture the trends in the data very well!