1 Introduction

Generalized Additive Models (GAMs) extend linear models by allowing non-linear smooth functions of predictors while maintaining interpretability:

g(E[Y]) = β₀ + f₁(X₁) + f₂(X₂) + … + fₚ(Xₚ)

where f_i are smooth spline functions and g is a link function.

When to use GAMs: Non-linear predictor-response relationships that still require interpretable, visualisable effects.


2 Data Generation

Simulated ecological dataset: species abundance (count) as a function of temperature, rainfall, and elevation with known non-linear true effects.

set.seed(42)
n <- 400
temperature <- runif(n, 5, 35)
rainfall    <- runif(n, 100, 800)
elevation   <- runif(n, 0, 3000)

log_abundance <-
  3 +
  0.8 * sin(temperature / 5) - 0.02 * (temperature - 20)^2 +  # hump-shaped
  0.003 * rainfall - 0.000004 * rainfall^2 +                   # concave
  -0.001 * elevation +                                          # linear decline
  rnorm(n, 0, 0.5)

abundance <- round(exp(log_abundance))
df <- data.frame(abundance, temperature, rainfall, elevation)

summary(df)
##    abundance       temperature        rainfall       elevation      
##  Min.   : 0.000   Min.   : 5.007   Min.   :100.3   Min.   :  10.77  
##  1st Qu.: 1.000   1st Qu.:12.165   1st Qu.:268.3   1st Qu.: 591.43  
##  Median : 2.000   Median :19.674   Median :434.2   Median :1413.04  
##  Mean   : 3.672   Mean   :19.913   Mean   :441.6   Mean   :1425.88  
##  3rd Qu.: 4.000   3rd Qu.:27.432   3rd Qu.:610.7   3rd Qu.:2238.53  
##  Max.   :40.000   Max.   :34.897   Max.   :797.2   Max.   :2998.14

3 Exploratory Data Analysis

p1 <- ggplot(df, aes(temperature, abundance)) +
  geom_point(alpha=0.25, color="steelblue") +
  geom_smooth(method="loess", color="firebrick") +
  labs(title="vs Temperature", x="Temperature (°C)", y="Abundance") + theme_minimal()

p2 <- ggplot(df, aes(rainfall, abundance)) +
  geom_point(alpha=0.25, color="forestgreen") +
  geom_smooth(method="loess", color="firebrick") +
  labs(title="vs Rainfall", x="Rainfall (mm)", y=NULL) + theme_minimal()

p3 <- ggplot(df, aes(elevation, abundance)) +
  geom_point(alpha=0.25, color="darkorange") +
  geom_smooth(method="loess", color="firebrick") +
  labs(title="vs Elevation", x="Elevation (m)", y=NULL) + theme_minimal()

p1 + p2 + p3
LOESS smoothers reveal non-linear relationships with abundance.

LOESS smoothers reveal non-linear relationships with abundance.

Interpretation: Non-linear LOESS curves for temperature and rainfall confirm that a GAM is more appropriate than a linear GLM.


4 Fitting the GAM

Poisson GAM with thin-plate regression splines s() and REML smoothness selection:

gam_model <- gam(
  abundance ~ s(temperature) + s(rainfall) + s(elevation),
  data = df, family = poisson(link = "log"), method = "REML"
)
summary(gam_model)
## 
## Family: poisson 
## Link function: log 
## 
## Formula:
## abundance ~ s(temperature) + s(rainfall) + s(elevation)
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.41057    0.05293   7.757 8.68e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                  edf Ref.df Chi.sq  p-value    
## s(temperature) 7.023  8.011 504.50  < 2e-16 ***
## s(rainfall)    3.876  4.780  27.48 4.83e-05 ***
## s(elevation)   1.373  1.654 716.46  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.762   Deviance explained = 84.4%
## -REML = 667.58  Scale est. = 1         n = 400

Key output guide:

Output Meaning
edf edf=1 → linear; edf>1 → non-linear (higher = wigglier)
p-value (smooth) Is the smooth significantly different from zero?
Deviance explained R² analogue — higher is better

5 Smooth Term Visualisation

par(mfrow = c(1, 3), mar = c(4, 4, 3, 1))
plot(gam_model, shade = TRUE, shade.col = "lightblue",
     seWithMean = TRUE, scale = 0, residuals = TRUE,
     pch = 16, cex = 0.3, col = "grey50")
Partial effects. Grey shading = 95% CI. Rug = data density.

Partial effects. Grey shading = 95% CI. Rug = data density.

par(mfrow = c(1, 1))
Predictor Shape Interpretation
Temperature Hump-shaped Species thrives at moderate temperatures
Rainfall Concave Beneficial up to a point, then levels off
Elevation Near-linear decline Harsher high-altitude conditions reduce abundance

6 Assumption Checking

6.1 Basis Dimension Adequacy

k.check(gam_model)
##                k'      edf   k-index p-value
## s(temperature)  9 7.023383 1.0807151  0.9725
## s(rainfall)     9 3.875901 0.9512605  0.2275
## s(elevation)    9 1.373497 1.0003260  0.5825

Rule: k-index < 1 AND p < 0.05 → basis too restrictive; increase k in s().

6.2 Residual Diagnostics

par(mfrow = c(2, 2))
gam.check(gam_model, pch = 16, cex = 0.5)
Four-panel diagnostic: QQ plot, residuals vs fitted, histogram, response vs fitted.

Four-panel diagnostic: QQ plot, residuals vs fitted, histogram, response vs fitted.

## 
## Method: REML   Optimizer: outer newton
## full convergence after 5 iterations.
## Gradient range [-1.657642e-06,0.000123604]
## (score 667.5772 & scale 1).
## Hessian positive definite, eigenvalue range [0.06248418,2.418291].
## Model rank =  28 / 28 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##                  k'  edf k-index p-value
## s(temperature) 9.00 7.02    1.08    0.96
## s(rainfall)    9.00 3.88    0.95    0.27
## s(elevation)   9.00 1.37    1.00    0.59
par(mfrow = c(1, 1))
Plot Good Sign
QQ Plot Points on the diagonal
Residuals vs Fitted Random scatter around zero
Histogram Bell-shaped, symmetric
Response vs Fitted Points on the 1:1 line

6.3 Concurvity

concurvity(gam_model, full = TRUE)
##                  para s(temperature) s(rainfall) s(elevation)
## worst    3.195316e-25      0.1129922  0.08931919   0.11580280
## observed 3.195316e-25      0.0311632  0.03501274   0.04211687
## estimate 3.195316e-25      0.0703711  0.04150996   0.04083935

Rule: Values near 1 → one smooth approximated by others → unstable estimates. Values < 0.8 are acceptable.


7 GAM vs GLM

glm_model <- glm(abundance ~ temperature + rainfall + elevation,
                 data = df, family = poisson(link = "log"))

cat("GAM AIC:", round(AIC(gam_model), 1), "\n")
## GAM AIC: 1310.3
cat("GLM AIC:", round(AIC(glm_model), 1), "\n\n")
## GLM AIC: 2084.7
anova(glm_model, gam_model, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: abundance ~ temperature + rainfall + elevation
## Model 2: abundance ~ s(temperature) + s(rainfall) + s(elevation)
##   Resid. Df Resid. Dev     Df Deviance  Pr(>Chi)    
## 1    396.00    1166.57                              
## 2    386.73     369.28 9.2728    797.3 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Result: Lower AIC and significant LRT (p < 0.001) confirm the GAM fits substantially better than the linear GLM.


8 Marginal Prediction

nd <- data.frame(temperature = seq(5, 35, length=200),
                 rainfall = median(df$rainfall),
                 elevation = median(df$elevation))
p  <- predict(gam_model, newdata=nd, type="response", se.fit=TRUE)
nd$fit   <- p$fit
nd$lower <- p$fit - 1.96*p$se.fit
nd$upper <- p$fit + 1.96*p$se.fit

ggplot(nd, aes(temperature, fit)) +
  geom_ribbon(aes(ymin=lower, ymax=upper), alpha=0.2, fill="steelblue") +
  geom_line(color="steelblue", linewidth=1.2) +
  geom_rug(data=df, aes(x=temperature), inherit.aes=FALSE, alpha=0.15) +
  labs(title="Marginal Effect: Temperature on Species Abundance",
       subtitle="Rainfall & Elevation at medians | 95% CI shaded",
       x="Temperature (°C)", y="Predicted Abundance") +
  theme_minimal(base_size=12)
Predicted abundance across temperature (rainfall & elevation at median). 95% CI shaded.

Predicted abundance across temperature (rainfall & elevation at median). 95% CI shaded.


9 Summary

s <- summary(gam_model)
cat("Deviance Explained:", round(s$dev.expl*100, 1), "%\n")
## Deviance Explained: 84.4 %
cat("Adjusted R-squared:", round(s$r.sq, 3), "\n")
## Adjusted R-squared: 0.762
cat("AIC:               ", round(AIC(gam_model), 1), "\n\n")
## AIC:                1310.3
print(round(s$s.table, 3))
##                  edf Ref.df  Chi.sq p-value
## s(temperature) 7.023  8.011 504.501       0
## s(rainfall)    3.876  4.780  27.476       0
## s(elevation)   1.373  1.654 716.461       0

10 Conclusion

Complete GAM pipeline demonstrated:

  1. Data Simulation — Count data with known non-linear effects
  2. EDA — LOESS plots motivated GAM over GLM
  3. Model Fitting — Poisson GAM with REML via mgcv
  4. Smooth Plots — Partial effects revealed effect shapes
  5. Diagnostics — k-check, gam.check, concurvity all assessed
  6. Comparison — GAM outperformed GLM (AIC + LRT)
  7. Prediction — Marginal curves with uncertainty quantification

GAMs offer an excellent balance of flexibility and interpretability — ideal for ecology, epidemiology, environmental science, and beyond.