Introduction

Introduction to Generalized Additive Models with R and mgcv

Scientists are increasingly faced with complex, high-dimensional data, and require flexible statistical models that can accommodate them. There is often a tension between simple models that are easy to interpret but don't fit the data well, and black-box machine learning methods that fit the data well but are hard to interpret. Generalized Additive Models (GAMs) fill the gap between these two extremes, using highly interpretable splines to model non-linear relationships between covariates and response that are learned from the data.

The mgcv package for R is one of the most popular packages for fitting smooth, non-linear relationships, providing a wide range of powerful tools for modelling complex data. However, many scientists are not familiar with GAMs, with how they learn non-linear relationships from data, nor with how to use the mgcv software to fit the models and interpret the results.

In this online webinar I will introduce participants to splines and how GAMs use splines to learn from the underlying data. I'll show you how splines work and describe the different types of splines available in mgcv and what they can be used for. In addition, I'll give examples of the range of statistical models and data types that can be handled and modelled within mgcv.

Participants are assumed to be familiar with the basics of R (such as loading and manipulating data, and plotting) and with regression in R (lm() and glm()).

The webinar is free to attend; this is a difficult time for everyone, money is tight, and options for training have been reduced because of Covid-19. However, if you are financially able, please consider making a donation to the University of Regina Student Emergency Fund, which helps provide urgently needed support to our hardworking students. You can donate to the U of R Student Emergency Fund at https://giving.uregina.ca/student-eme….

HadCRUT4 dataset

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.2
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.1.1     v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.1.2
## Warning: package 'tibble' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
# dplyr, readr, and ggplot2 are already attached by library(tidyverse),
# so we don't need to load them again
URL <- "https://bit.ly/hadcrutv4"
# Read the annual HadCRUT4 series, keep only the first two columns
# (year and temperature anomaly), and give them informative names
gtemp <- read_table(URL, col_types = 'nnnnnnnnnnnn',
                    col_names = FALSE) %>%
    select(num_range('X', 1:2)) %>%
    setNames(nm = c('Year', 'Temperature'))
ggplot(gtemp, aes(x=Year, y=Temperature)) +
  geom_line(size = 0.3) +
  geom_smooth(method = "loess")
## `geom_smooth()` using formula 'y ~ x'

Fitting a GAM

library(mgcv)
## Warning: package 'mgcv' was built under R version 4.1.2
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse
## This is mgcv 1.8-38. For overview type 'help("mgcv-package")'.
# Fit a GAM with a smooth function of Year, selecting the smoothness by REML
m <- gam(Temperature ~ s(Year), data = gtemp, method = 'REML')
summary(m)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## Temperature ~ s(Year)
## 
## Parametric coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -0.020523   0.009733  -2.109   0.0365 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##           edf Ref.df   F p-value    
## s(Year) 7.838   8.65 145  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =   0.88   Deviance explained = 88.5%
## -REML = -91.196  Scale est. = 0.016294  n = 172
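
We can visualise the fitted smooth with mgcv's built-in plot method; a minimal sketch:

# Plot the estimated smooth of Year, with a shaded confidence band and
# partial residuals overlaid
plot(m, shade = TRUE, residuals = TRUE, pch = 16, cex = 0.5)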

How is a GAM different?

In a linear model we model the mean of the data as a sum of linear terms:

\[\begin{equation} y_i = \beta_0 + \sum_j \beta_j x_{ji} + \epsilon_i \end{equation}\]

A GAM instead models the mean as a sum of smooth functions, or smooths:

\[\begin{equation} y_i = \beta_0 + \sum_j s_j(x_{ji}) + \epsilon_i \end{equation}\]

where \(\epsilon_i \sim N(0, \sigma^2)\) and \(y_i\) is Normal (for now).

In both cases we call the sum of terms on the right-hand side (excluding the error \(\epsilon_i\)) the linear predictor.

model <- gam(y ~ s(x1) + s(x2) + te(x3, x4),
             data = my_data_frame,
             method = 'REML',
             family = gaussian)

s() terms are smooths of one or more variables; te() terms are the smooth equivalent of main effects plus their interaction.
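
If you want those main effects and the interaction as separate model terms, mgcv's ti() smooths provide that decomposition; a sketch using the same hypothetical covariates as above:

# Hypothetical: decompose te(x3, x4) into main effects plus an
# interaction-only smooth
model2 <- gam(y ~ ti(x3) + ti(x4) + ti(x3, x4),
              data = my_data_frame, method = 'REML')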

Basis expansions

In the polynomial models we used a polynomial basis expansion of \(x\):

  • \(x^0 = 1\) - the model constant term
  • \(x^1 = x\) - the linear term
  • \(x^2\) - the quadratic term
  • \(x^3\) - the cubic term
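
A minimal sketch of generating this basis explicitly in R (the vector x here is arbitrary, purely for illustration):

x <- seq(0, 1, length.out = 100)
# Raw polynomial basis: the columns are x, x^2, x^3; the model's
# intercept supplies the x^0 term
B <- poly(x, degree = 3, raw = TRUE)
head(B)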

Splines

Splines are functions composed of simpler functions.

These simpler functions are basis functions, and the set of basis functions is a basis.

When we model using splines, each basis function \(b_k\) has a coefficient \(\beta_k\).

The resultant spline is a sum of these weighted basis functions, evaluated at the values of \(x\):

\[\begin{equation} s(x) = \sum_{k=1}^{K} \beta_k b_k(x) \end{equation}\]
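
A minimal sketch of this weighted sum, using mgcv's smoothCon() to generate the basis functions (the coefficients are random values chosen only for illustration):

library(mgcv)
x <- seq(0, 1, length.out = 200)
# Evaluate K = 10 cubic regression spline basis functions at x
sm <- smoothCon(s(x, k = 10, bs = 'cr'), data = data.frame(x = x))[[1]]
B <- sm$X                      # 200 x 10 matrix; column k holds b_k(x)
beta <- runif(ncol(B), -1, 1)  # arbitrary illustrative coefficients beta_k
spline <- B %*% beta           # s(x): the weighted sum of basis functions
plot(x, spline, type = 'l')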

Splines formed from basis functions

Weight basis functions -> spline

How do GAMs learn from data?

Maximise penalised log-likelihood for β

How wiggly?

\[\begin{equation} \int_{\mathbb{R}}\left[f^{\prime \prime}\right]^{2} d x=\boldsymbol{\beta}^{\top} \mathbf{S} \boldsymbol{\beta} \end{equation}\]

Penalised fit

\[\begin{equation} \mathcal{L}_{p}(\boldsymbol{\beta})=\mathcal{L}(\boldsymbol{\beta})-\frac{1}{2} \lambda \boldsymbol{\beta}^{\top} \mathbf{S} \boldsymbol{\beta} \end{equation}\]
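
In mgcv the smoothing parameter \(\lambda\) is estimated from the data; for the model fitted above, the REML estimate is stored in the returned object:

m$sp  # estimated smoothing parameter (lambda) for s(Year)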

Wiggliness

\[\begin{equation} \int_{\mathbb{R}}\left[f^{\prime \prime}\right]^{2} d x=\boldsymbol{\beta}^{\top} \mathbf{S} \boldsymbol{\beta}=W \end{equation}\]
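
As a sketch, we can compute \(W\) for the fitted model m by extracting the penalty matrix and the coefficients of s(Year) (first.para and last.para index the smooth's coefficients within coef(m)):

S <- m$smooth[[1]]$S[[1]]     # penalty matrix for s(Year)
idx <- m$smooth[[1]]$first.para:m$smooth[[1]]$last.para
beta <- coef(m)[idx]          # the spline coefficients
drop(t(beta) %*% S %*% beta)  # the wiggliness measure W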

(Wiggliness is 100% the right mathy word)

We penalize wiggliness to avoid overfitting
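
A quick sketch of what the penalty does in practice: refit the HadCRUT4 model with the smoothing parameter fixed at extreme values via gam()'s sp argument.

# Large lambda: wiggliness is heavily penalised, giving an almost linear fit
m_stiff  <- gam(Temperature ~ s(Year), data = gtemp, sp = 1e5)
# Tiny lambda: almost no penalty, giving a very wiggly, overfitted curve
m_wiggly <- gam(Temperature ~ s(Year), data = gtemp, sp = 1e-5)
plot(m_stiff)
plot(m_wiggly)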