2026-04-22

The goal with a GAM is to estimate an unknown function that relates the mean of the random variable to covariates.
\[\mu = f(x)\] \[y \sim Normal(\mu, \sigma^2)\]
If we want to use standard regression methods to estimate \(f\), then we want to represent it such that the equation above becomes a linear model.
We can do this by choosing a basis of which \(f\) (or an approximation of \(f\)) is an element.
That is, we should choose a basis that is made of basis functions that can approximate nearly any smooth function.
Each of those basis functions gets weighted with a coefficient (i.e., a \(\beta\)) and summed together to represent \(f\).
A basis you are probably already familiar with is polynomials.
The basis functions for a polynomial basis take the form:
\[x^k ; k \ge 0\]
A basis you are probably already familiar with is polynomials.
The basis functions for a polynomial basis take the form:
\[x^k ; k \ge 0\]
So a polynomial basis expansion with two basis functions would be:
\[\beta_0 x^0 + \beta_1 x^1 = \beta_0 +\beta_1 x\]
A basis you are probably already familiar with is polynomials.
The basis functions for a polynomial basis take the form:
\[x^k ; k \ge 0\]
So a polynomial basis expansion with two basis functions would be:
\[\beta_0 x^0 + \beta_1 x^1 = \beta_0 +\beta_1 x\]
And a polynomial basis expansion with four basis functions would be:
\[\beta_0 x^0 + \beta_1 x^1 + \beta_2 x^2 + \beta_3 x^3\]
poly6 <- lm(y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6))
plot(x, y, main = "Sixth Order Polynomial",
sub = expression(y == beta[0] + beta[1]*x + beta[2]*x^2 + beta[3]*x^3 + beta[4]*x^4 + beta[5]*x^5 + beta[6]*x^6))
pred6 <- data.frame(x = seq(-1.2, 1.2, length.out = 200))
pred6$y <- predict(poly6, newdata = pred6)
lines(pred6$x, pred6$y, col = 4, lwd = 2)poly10 <- lm(y ~ poly(x, 10))
plot(x, y, main = "Tenth Order Polynomial",
sub = expression(y == beta[0] + beta[1]*x + beta[2]*x^2 + beta[3]*x^3 + beta[4]*x^4 + beta[5]*x^5 + beta[6]*x^6 + beta[7]*x^7 + beta[8]*x^8 + beta[9]*x^9 + beta[10]*x^10))
pred10 <- data.frame(x = seq(-1.2, 1.2, length.out = 200))
pred10$y <- predict(poly10, newdata = pred10)
lines(pred10$x, pred10$y, col = 5, lwd = 2)poly25 <- lm(y ~ poly(x, 25))
plot(x, y, main = "Twenty-fifth Order Polynomial",
sub = expression(y == beta[0] + beta[1]*x + beta[2]*x^2 + ... + beta[24]*x^24 +beta[25]*x^25))
pred25 <- data.frame(x = seq(-1.2, 1.2, length.out = 200))
pred25$y <- predict(poly25, newdata = pred25)
lines(pred25$x, pred25$y, col = 6, lwd = 2)You may have noticed two things along the way here.
While we use them all the time, polynomial basis expansions aren’t actually all that great for our problem.
We need some method to obtain parsimony. Just from the eye test, it doesn’t seem like it’s worth it to use a 25th-degree polynomial over a 10th-degree polynomial.
You may have noticed two things along the way here:
Here are the 10 basis functions plotted separately:
We can give each one a weight (a \(\beta\) coefficient) and sum them up. For example, we can weight all 10 with \(\beta_i = 1\).
We can alternate \(\beta\)s between -1 and + 1.
We can try a bunch of other values, like this:
| \(\beta_1\) | \(\beta_2\) | \(\beta_3\) | \(\beta_4\) | \(\beta_5\) | \(\beta_6\) | \(\beta_7\) | \(\beta_8\) | \(\beta_9\) | \(\beta_{10}\) |
|---|---|---|---|---|---|---|---|---|---|
| 5 | -2 | -3 | - 5 | 1 | 2 | 4 | 2 | 0 | -3 |
We can try evenly spaced \(\beta\)s, for example, ranging from 0 to 2.
Now, back to the problem of parsimony. How can we decide how many basis functions (or knots) sufficiently represent our function?
The penalty should be applied to \(f\) based on how “wiggly” it is.
Wiggliness (the proper term) is defined by the second derivative of the function \(f\).
\[\int_{\mathbb{R}} [f^{\prime\prime}]^2 dx = \boldsymbol{\beta}^{\mathsf{T}}\mathbf{S}\boldsymbol{\beta} = \large{W}\]
It can be written as a function of the vector \(\boldsymbol{\beta}\) and a penalty matrix, \(\mathbf{S}\), which is specific to the basis chosen.
For simplicity, we’ll refer to the wiggliness as \(W\).
Rather than estimating the parameters of the model with maximum likelihood:
\[\hat{\theta} = \mathrm{arg\ max}\ \mathcal{L}(\theta | \mathbf{y})\]
We now use penalized maximum likelihood:
\[\hat{\theta} = \mathrm{arg\ max}\ [\log(\mathcal{L}(\theta | \mathbf{y})) - \lambda W]\]
But we’re still left with a parameter \((\lambda)\) that we need to choose.
We will refer to \(\lambda\) as the smoothing parameter.
We can choose the smoothing parameter manually.
# No penalty
sm_no <- gam(y ~ s(x, bs = "cr", k = 50), data = dat,
method = "REML", sp = 0)
# Low penalty
sm_lo <- gam(y ~ s(x, bs = "cr", k = 50), data = dat,
method = "REML", sp = 10)
# Medium penalty
sm_med <- gam(y ~ s(x, bs = "cr", k = 50), data = dat,
method = "REML", sp = 1000)
# High penalty
sm_hi <- gam(y ~ s(x, bs = "cr", k = 50), data = dat,
method = "REML", sp = 1000000)
# Extreme penalty
sm_ex <- gam(y ~ s(x, bs = "cr", k = 50), data = dat,
method = "REML", sp = 100000000)We can choose the smoothing parameter manually.
pred <- data.frame(x = seq(-1, 1, length.out = 500))
# No penalty
pred$none <- predict(sm_no, newdata = pred)
# Low penalty
pred$low <- predict(sm_lo, newdata = pred)
# Medium penalty
pred$medium <- predict(sm_med, newdata = pred)
# High penalty
pred$high <- predict(sm_hi, newdata = pred)
# Extreme penalty
pred$extreme <- predict(sm_ex, newdata = pred)
# Pivot longer
library(tidyr)
pred_long <- pivot_longer(pred, none:extreme,
names_to = "Penalty",
values_to = "y") %>%
mutate(Penalty = factor(Penalty, levels = c("none",
"low",
"medium",
"high",
"extreme")))
# Plot
ggplot() +
geom_point(data = dat, aes(x = x, y = y)) +
geom_line(data = pred_long, aes(x = x, y = y, color = Penalty),
linewidth = 2) +
theme_classic()But we can also estimate it from the data.
According to Gavin Simpson, for reasons of numerical stability, we should use restricted maximum likelihood (REML) estimation if we want to estimate \(\lambda\) as well as \(\boldsymbol{\beta}\).
We can do that in R with mgcv::gam(..., method = "REML").
So how did we do? The true generating function for these data was:
\[\mu = \sin(8x) + 2x^2 -2x + 2\] \[y \sim Normal(\mu, \sigma = 0.5)\]
pred$REML <- predict(sm_reml, newdata = pred)
pred$True <- my_fun(pred$x)
pred_long <- pred %>%
select(x, REML, True) %>%
pivot_longer(REML:True,
names_to = "Function",
values_to = "y") %>%
mutate(Function = factor(Function, levels = c("REML",
"True")))
# Plot
ggplot() +
geom_point(data = dat, aes(x = x, y = y)) +
geom_line(data = pred_long, aes(x = x, y = y, color = Function),
linewidth = 2) +
theme_classic()The GAM was a very good approximation for this function (and remember, there are confidence intervals we’re not showing here).
After estimating our parameters with REML, the fitted function \((\hat{f})\) has fewer effective degrees of freedom (EDF) than an unpenalized function would have.
Your choice of \(k\) (the number of knots/basis functions) sets the upper limit on complexity, such that \(\mathrm{EDF} < k\).
As long as we choose \(k\) to be large enough, GAM theory tells us that we will have a good approximation of \(f\).
However, increasing \(k\) increases computation time. A dataset with \(n\) data points takes \(\mathcal{O}(nk^2)\) operations to compute. So computation time is roughly proportional to the square of \(k\). Doubling \(k\) quadruples fitting time.
Choosing \(k\) is thus an important user choice. The mgcv::gam() default of k = 10 is completely arbitrary.
A good approach is to try a reasonably large number of knots and then evaluate your choice.
mgcv offers k.check() to evaluate whether your choice was sufficient.
See mgcv help files for further discussion of choosing \(k\). See ?k.check and ?choose.k.
s(x, bs = "cr")
s(x, bs = "cc")
s(x, bs = "tp")
?smooth.construct.tp.smooth.spec for some details.s(x).s(x, bs = "re")
glmmTMB or custom modelsglmmTMB for large numbers of individuals
mgcv can fit multiple types of spatial modelsHere’s a Gaussian process smooth I fit to real data.
s(x, y, bs = "gp")
See ?smooth.construct.gp.smooth.spec
Here’s an example with real data. Note that there is time varying selection for agriculture (using a cyclic cubic basis) and an individual random slope.
s(doy, by = agriculture, bs = "cc")
RRmgcv::gam() can fit Cox PH models with family = cox.phRSyntax:
In all models:
times is a vector of all 1sstratum is a factor variable uniquely identifying used steps and their matched available stepscase_ is a vector of 0s and 1s identifying the used (1) and available (0) stepsRSyntax:
For your specific model, you might want to change how you treat these:
s(h1) and s(h2) specify smooth selection for habitat variablessl_, log(sl_), and cos(ta_) are parametric terms for updating step-length and turn-angle distributions