Generalized Additive Models (GAMs) extend linear models by allowing non-linear smooth functions of predictors while maintaining interpretability:
g(E[Y]) = β₀ + f₁(X₁) + f₂(X₂) + … + fₚ(Xₚ)
where f_i are smooth spline functions and g is a link function.
When to use GAMs: Non-linear predictor-response relationships that still require interpretable, visualisable effects.
Simulated ecological dataset: species abundance (count) as a function of temperature, rainfall, and elevation with known non-linear true effects.
set.seed(42)
n <- 400
temperature <- runif(n, 5, 35)
rainfall <- runif(n, 100, 800)
elevation <- runif(n, 0, 3000)
log_abundance <-
3 +
0.8 * sin(temperature / 5) - 0.02 * (temperature - 20)^2 + # hump-shaped
0.003 * rainfall - 0.000004 * rainfall^2 + # concave
-0.001 * elevation + # linear decline
rnorm(n, 0, 0.5)
abundance <- round(exp(log_abundance))
df <- data.frame(abundance, temperature, rainfall, elevation)
summary(df)
## abundance temperature rainfall elevation
## Min. : 0.000 Min. : 5.007 Min. :100.3 Min. : 10.77
## 1st Qu.: 1.000 1st Qu.:12.165 1st Qu.:268.3 1st Qu.: 591.43
## Median : 2.000 Median :19.674 Median :434.2 Median :1413.04
## Mean : 3.672 Mean :19.913 Mean :441.6 Mean :1425.88
## 3rd Qu.: 4.000 3rd Qu.:27.432 3rd Qu.:610.7 3rd Qu.:2238.53
## Max. :40.000 Max. :34.897 Max. :797.2 Max. :2998.14
p1 <- ggplot(df, aes(temperature, abundance)) +
geom_point(alpha=0.25, color="steelblue") +
geom_smooth(method="loess", color="firebrick") +
labs(title="vs Temperature", x="Temperature (°C)", y="Abundance") + theme_minimal()
p2 <- ggplot(df, aes(rainfall, abundance)) +
geom_point(alpha=0.25, color="forestgreen") +
geom_smooth(method="loess", color="firebrick") +
labs(title="vs Rainfall", x="Rainfall (mm)", y=NULL) + theme_minimal()
p3 <- ggplot(df, aes(elevation, abundance)) +
geom_point(alpha=0.25, color="darkorange") +
geom_smooth(method="loess", color="firebrick") +
labs(title="vs Elevation", x="Elevation (m)", y=NULL) + theme_minimal()
p1 + p2 + p3
LOESS smoothers reveal non-linear relationships with abundance.
Interpretation: Non-linear LOESS curves for temperature and rainfall confirm that a GAM is more appropriate than a linear GLM.
Poisson GAM with thin-plate regression splines s() and
REML smoothness selection:
gam_model <- gam(
abundance ~ s(temperature) + s(rainfall) + s(elevation),
data = df, family = poisson(link = "log"), method = "REML"
)
summary(gam_model)
##
## Family: poisson
## Link function: log
##
## Formula:
## abundance ~ s(temperature) + s(rainfall) + s(elevation)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.41057 0.05293 7.757 8.68e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(temperature) 7.023 8.011 504.50 < 2e-16 ***
## s(rainfall) 3.876 4.780 27.48 4.83e-05 ***
## s(elevation) 1.373 1.654 716.46 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.762 Deviance explained = 84.4%
## -REML = 667.58 Scale est. = 1 n = 400
Key output guide:
| Output | Meaning |
|---|---|
| edf | edf=1 → linear; edf>1 → non-linear (higher = wigglier) |
| p-value (smooth) | Is the smooth significantly different from zero? |
| Deviance explained | R² analogue — higher is better |
par(mfrow = c(1, 3), mar = c(4, 4, 3, 1))
plot(gam_model, shade = TRUE, shade.col = "lightblue",
seWithMean = TRUE, scale = 0, residuals = TRUE,
pch = 16, cex = 0.3, col = "grey50")
Partial effects. Grey shading = 95% CI. Rug = data density.
par(mfrow = c(1, 1))
| Predictor | Shape | Interpretation |
|---|---|---|
| Temperature | Hump-shaped | Species thrives at moderate temperatures |
| Rainfall | Concave | Beneficial up to a point, then levels off |
| Elevation | Near-linear decline | Harsher high-altitude conditions reduce abundance |
k.check(gam_model)
## k' edf k-index p-value
## s(temperature) 9 7.023383 1.0807151 0.9725
## s(rainfall) 9 3.875901 0.9512605 0.2275
## s(elevation) 9 1.373497 1.0003260 0.5825
Rule: k-index < 1 AND
p < 0.05 → basis too restrictive; increase
k in s().
par(mfrow = c(2, 2))
gam.check(gam_model, pch = 16, cex = 0.5)
Four-panel diagnostic: QQ plot, residuals vs fitted, histogram, response vs fitted.
##
## Method: REML Optimizer: outer newton
## full convergence after 5 iterations.
## Gradient range [-1.657642e-06,0.000123604]
## (score 667.5772 & scale 1).
## Hessian positive definite, eigenvalue range [0.06248418,2.418291].
## Model rank = 28 / 28
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(temperature) 9.00 7.02 1.08 0.96
## s(rainfall) 9.00 3.88 0.95 0.27
## s(elevation) 9.00 1.37 1.00 0.59
par(mfrow = c(1, 1))
| Plot | Good Sign |
|---|---|
| QQ Plot | Points on the diagonal |
| Residuals vs Fitted | Random scatter around zero |
| Histogram | Bell-shaped, symmetric |
| Response vs Fitted | Points on the 1:1 line |
concurvity(gam_model, full = TRUE)
## para s(temperature) s(rainfall) s(elevation)
## worst 3.195316e-25 0.1129922 0.08931919 0.11580280
## observed 3.195316e-25 0.0311632 0.03501274 0.04211687
## estimate 3.195316e-25 0.0703711 0.04150996 0.04083935
Rule: Values near 1 → one smooth approximated by others → unstable estimates. Values < 0.8 are acceptable.
glm_model <- glm(abundance ~ temperature + rainfall + elevation,
data = df, family = poisson(link = "log"))
cat("GAM AIC:", round(AIC(gam_model), 1), "\n")
## GAM AIC: 1310.3
cat("GLM AIC:", round(AIC(glm_model), 1), "\n\n")
## GLM AIC: 2084.7
anova(glm_model, gam_model, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: abundance ~ temperature + rainfall + elevation
## Model 2: abundance ~ s(temperature) + s(rainfall) + s(elevation)
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 396.00 1166.57
## 2 386.73 369.28 9.2728 797.3 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Result: Lower AIC and significant LRT (p < 0.001) confirm the GAM fits substantially better than the linear GLM.
nd <- data.frame(temperature = seq(5, 35, length=200),
rainfall = median(df$rainfall),
elevation = median(df$elevation))
p <- predict(gam_model, newdata=nd, type="response", se.fit=TRUE)
nd$fit <- p$fit
nd$lower <- p$fit - 1.96*p$se.fit
nd$upper <- p$fit + 1.96*p$se.fit
ggplot(nd, aes(temperature, fit)) +
geom_ribbon(aes(ymin=lower, ymax=upper), alpha=0.2, fill="steelblue") +
geom_line(color="steelblue", linewidth=1.2) +
geom_rug(data=df, aes(x=temperature), inherit.aes=FALSE, alpha=0.15) +
labs(title="Marginal Effect: Temperature on Species Abundance",
subtitle="Rainfall & Elevation at medians | 95% CI shaded",
x="Temperature (°C)", y="Predicted Abundance") +
theme_minimal(base_size=12)
Predicted abundance across temperature (rainfall & elevation at median). 95% CI shaded.
s <- summary(gam_model)
cat("Deviance Explained:", round(s$dev.expl*100, 1), "%\n")
## Deviance Explained: 84.4 %
cat("Adjusted R-squared:", round(s$r.sq, 3), "\n")
## Adjusted R-squared: 0.762
cat("AIC: ", round(AIC(gam_model), 1), "\n\n")
## AIC: 1310.3
print(round(s$s.table, 3))
## edf Ref.df Chi.sq p-value
## s(temperature) 7.023 8.011 504.501 0
## s(rainfall) 3.876 4.780 27.476 0
## s(elevation) 1.373 1.654 716.461 0
Complete GAM pipeline demonstrated:
mgcvGAMs offer an excellent balance of flexibility and interpretability — ideal for ecology, epidemiology, environmental science, and beyond.