We need mgcv
for GAM models, mgcViz
for
added GAM visualization capabilities.
We are using the CO\(_2\) dataset that has been collected at the top of Mauna Loa, Hawaii since 1958. The data were dowloaded from here.
A linear model tries to fit the best straight line that passes through the data, so it doesn’t work well for all datasets.
In contrast, a GAM can capture complex relationships by fitting a non-linear smooth function through the data, while controlling how wiggly the smooth can get.
Create a year and month variable, rename the variable of interest to CO\(_2\).
Let’s look at the the full data set and the data since 2010.
Add a linear line lm
and a gam-smooth line with
gam
in ggplot2.
co2<-read.csv("co2_mm_mlo.csv", skip=40)
co2 <- co2 |>
mutate(month_year = make_date(co2$year, co2$month)) |>
rename(co2 = average)
ggplot(co2, aes(x=decimal.date, y=co2))+
geom_point()+
geom_smooth(method='lm',col="red")+
geom_smooth(method='gam',col="blue")+
theme_bw()
co2_s<-co2[co2$year >= 2010,]
ggplot(co2_s, aes(x=decimal.date, y=co2))+
geom_point()+
geom_smooth(method='lm',col="red")+
geom_smooth(method='gam',col="blue")+
theme_bw()
We don’t see too much difference between the lm
line
and the gam
line, although the smooth line fits a bit
better on the whole dataset. We are missing the seasonal cycles with
both trends.
To see if we can fit the seasonal cycles better, subset the data to look at monthly trends in the year 2024.
co2_2024<-co2[co2$year == 2024,]
ggplot(co2_2024, aes(x=decimal.date, y=co2))+
geom_point()+
geom_smooth(method='lm',col="red")+
geom_smooth(method='gam',col="blue")+
theme_bw()
The gam line is much better here, and the lm
does
not fit well at all.
Fit a lm
to the 2024 subset.
lmod<-lm(co2~decimal.date, data=co2_2024)
summary(lmod)
##
## Call:
## lm(formula = co2 ~ decimal.date, data = co2_2024)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.46169 -1.52754 0.07059 1.49176 2.24601
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3329.140 3651.894 0.912 0.383
## decimal.date -1.435 1.804 -0.795 0.445
##
## Residual standard error: 1.798 on 10 degrees of freedom
## Multiple R-squared: 0.05949, Adjusted R-squared: -0.03456
## F-statistic: 0.6326 on 1 and 10 DF, p-value: 0.4449
We see the date is not significant, and we have a very poor overall model with low R\(^2\).
We definitely need something better than a linear regression for these data.
lm_co2=lm(co2~month+I(month^2)+I(month^3),data=co2_2024)
lm_co2$coefficients
## (Intercept) month I(month^2) I(month^3)
## 417.74242424 5.26447330 -0.92531302 0.04489899
# Fit polynomial result
month<-seq(1:12)
y <- 417.74242424 + 5.26447330*month -0.92531302*month^2 + 0.04489899*month^3
plot(co2_2024$co2, ylab="CO2, ppm",xlab="Month")
lines(y)
co2_2015<-co2[co2$year == 2015,]
poly_pred<-predict(lm_co2,newdat=co2_2015)
plot(co2_2015$co2,ylim=c(375,425),ylab="CO2, ppm",xlab="Month")
lines(poly_pred,col='red')
s()
with
options bs="cr"
for cubic regression and k
sets the smoothness. k
is the number of knots on the bases,
however since the model fit is penalized, it estimates the smoothness
parameter of the penalty so k
gets updated by generalized
cross validation.gmod_all<-gam(co2~s(decimal.date, bs="cr", k=20), data=co2)
summary(gmod_all)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## co2 ~ s(decimal.date, bs = "cr", k = 20)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 359.77219 0.07588 4741 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(decimal.date) 9.085 11.14 16058 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.996 Deviance explained = 99.6%
## GCV = 4.6826 Scale est. = 4.6238 n = 803
plot(gmod_all)
gampredco2<-predict(gmod_all, co2_s)
plot(gampredco2,type='l')
gmod_2024<-gam(co2~s(decimal.date, bs="cr"), data=co2_2024)
summary(gmod_2024)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## co2 ~ s(decimal.date, bs = "cr")
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 424.60417 0.05713 7433 1.93e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(decimal.date) 6.997 8.031 107 0.000218 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.987 Deviance explained = 99.5%
## GCV = 0.11741 Scale est. = 0.039162 n = 12
plot(gmod_2024)
gampredco2_2024<-predict(gmod_2024, co2_2024)
plot(gampredco2_2024,type='l')
From the output: the edf
is “estimated degrees of
freedom”, which tells us approximately how many knots were fit (remember
knots are related to the number of connected basis functions to create
the overall curve). It is best to round edf
to the nearest
integer.
On the full dataset we see the edf is estimated to be 9 (we started with 20), this is pretty low for the amount of data, and we also see that it just captures a smooth overall trend over the years. Nevertheless, this smooth is significant (p-value<0.001).
On the 2024 dataset we see the edf is estimated to be 7. Recall, in one year there are only 12 data points, so 7 df is a lot! We see from the prediction plot that it fits the trend very well. The R\(^2\) of this model is very high.
In practice, beware of overfitting! A high edf
may
result in a good looking plot and high adjusted R\(^2\), but it may not translate well in
validation.
We see that the smooth term is still not doing much. We need to capture the within year cycles in the data, so let’s add a smooth on month.
Now let’s try fitting to all data and to capture both the overall trend and seasonal (within year trends). For this we can split the time series into two parts and apply different smooths to each part (long term and shorter term trends).
We use a cubic regression spline cr
for the overall
trend and fit it specifying a larger number of knots like before. We use
a cyclic cubic spline cc
to capture the repeating seasonal
trends from year to year.
Plot the gam model just with the plot()
function of
the gam model object.
gmod2<-gam(co2~s(decimal.date,bs="cr",k=20)+s(month,bs="cc"),data=co2)
summary(gmod2)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## co2 ~ s(decimal.date, bs = "cr", k = 20) + s(month, bs = "cc")
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 359.77219 0.01661 21666 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(decimal.date) 18.69 18.99 197068 <2e-16 ***
## s(month) 7.24 8.00 1943 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 1 Deviance explained = 100%
## GCV = 0.22909 Scale est. = 0.22141 n = 803
plot(gmod2)
The model summary shows estimates for two smooths, both are significant and each has a different edf. The overall trend has 18 df now (like 18 knots over the whole range of data) and the month trend again has 7 knots to capture the seasonal, within-year trends.
The plots from the model output are a bit decieving, particularly for the second smooth over month. Let’s see how the overall model fits and looks by predicting the data and plotting the predictions.
pred_co2<-predict.gam(gmod2,co2)
plot(pred_co2,type='l')