Generalized linear models (GLMs) are a common framework for evaluating relationships between a response variable and a set of predictors. GLMs extend linear regression by allowing the response variable to follow a non-normal distribution (e.g., binomial or Poisson). However, they still assume a linear relationship between predictors and the response on the link scale, which may not capture more complex ecological patterns.
Generalized additive models (GAMs) offer greater flexibility by allowing non-linear relationships between predictors and the response through the use of smooth functions. This makes GAMs especially useful when ecological responses are expected to vary along environmental gradients in non-linear ways.
Here, I evaluate whether the linearity assumption improves model predictive performance. Specifically, I compare the fit and predictive accuracy of a GLM and a GAM applied to species occurrence data, using soil characteristics as environmental predictors. I also see how models perform individually and how predictors drive the response along the way.
Common juniper is a small tree/shrub of the cypress family. It is has the largest geographical range of any woody plant, practically covering the cool temperate ecosystems of the Northern Hemisphere.
With such a range, I’m excited to see which soil characteristics are influencing common juniper distribution. I’m also just a big fan of this plant.
I need some spatial and statistical libraries to handle the data prep and analysis. Here are my preferred:
library(sf) # vector data
library(terra) # raster data
library(tidyverse) # data manipulation and plotting
library(tmap) # I like tmap maps
library(mgcv) # for GAMs
library(ggeffects) # Response curves
library(gratia) # GAM response curves
library(car) # VIF
I first want to map the occurrence data, but my data is in .csv format from an iNaturalist download. I’ll take care of that and map!
wrld <- st_read("D:/Projects/Spatial-Ecology-Portfolio/data/glm/World_Continents.shp")
## Reading layer `World_Continents' from data source
## `D:\Projects\Spatial-Ecology-Portfolio\data\glm\World_Continents.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 8 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -20037510 ymin: -30240970 xmax: 20037510 ymax: 18418390
## Projected CRS: WGS 84 / Pseudo-Mercator
wrld <- wrld %>%
filter(CONTINENT != "Antarctica")
occ <- read.csv("D:/Projects/Spatial-Ecology-Portfolio/data/glm/presence.csv") %>%
na.omit() # get rid of NA's - sf cant handle missing values for vector creation
occ_sf <- st_as_sf(occ, coords = c("longitude", "latitude"), crs = 4326) # to match world
A quick but not so pretty map..
tm_shape(wrld) +
tm_polygons(col = "CONTINENT") +
tm_shape(occ_sf) +
tm_dots() +
tm_layout(frame.lwd = 3, legend.position = c("left", "bottom"))
a much prettier map and a bit more insightful, showing observations per large grid cell.
grid <- st_make_grid(wrld,
cellsize = c(1000000,1000000),
square = T) %>%
st_intersection(., wrld) %>%
st_transform(., crs = st_crs(occ_sf)) %>%
st_cast(., "MULTIPOLYGON") %>%
st_sf() %>%
st_make_valid()
sf_use_s2(FALSE) # so the grid doesn't give me any issues!
## Spherical geometry (s2) switched off
grid_occ <- grid %>%
mutate(counts = lengths(st_intersects(., occ_sf)))
## although coordinates are longitude/latitude, st_intersects assumes that they
## are planar
tm_shape(grid_occ) +
tm_polygons(col = "counts", palette = "viridis", style = "fixed",
breaks = c(0, 1, 10, 50, 100, 200, 500, 1000, 2000, 3000, Inf)) +
tm_layout(legend.position = c("left", "bottom"))
rast.list <- list.files(path = "D:/Projects/Spatial-Ecology-Portfolio/data/glm",
pattern = "\\.tif$",
full.names = T)
preds <- lapply(rast.list, rast)
soils <- rast(preds)
The response variable in this study is binary, indicating species presence (1) or absence (0). While presence data are commonly available through field surveys or databases, true absence data are often lacking or unreliable due to detectability issues and incomplete sampling. To address this, pseudo-absences were generated — randomly selected background points assumed to represent absences.
Pseudo-absences help overcome the imbalance in presence-only datasets and reduce the risk of Type II errors (false negatives). In accordance with standard practice, the number of pseudo-absences was set at approximately 2–4 times the number of presences to balance statistical power and model sensitivity.
set.seed(111)
abs <- st_sample(st_geometry(wrld), type = "random", size = (nrow(occ_sf)*2)) %>%
st_sf() %>%
mutate(type = 0) %>%
st_transform(., st_crs(occ_sf))
one <- occ_sf %>%
select(geometry) %>%
mutate(type = 1)
data <- rbind(one, abs)
Now we need to extract predictor variable values to each presence or absence.
data <- st_transform(data, crs = crs(soils)) # change crs for data to match soils
train <- terra::extract(soils, data) # extract soil information to points
train$type <- data$type # add in 0,1 info
train <- train[c(8, 2:7)] # reorder for preferece
train <- st_drop_geometry(train)
train <- na.omit(train) %>%
as.data.frame()
pander::pander(head(train)) # view
type | clay | nitrogen | ph | sand | silt | soc |
---|---|---|---|---|---|---|
1 | 310 | 635 | 52 | 311 | 379 | 1072 |
1 | 120 | 395 | 62 | 517 | 363 | 791 |
1 | 258 | 908 | 54 | 295 | 447 | 1653 |
1 | 224 | 507 | 66 | 353 | 423 | 896 |
1 | 233 | 560 | 58 | 370 | 397 | 1080 |
1 | 236 | 795 | 56 | 328 | 436 | 920 |
mod_glm <- glm(type ~ clay + nitrogen + ph + sand + silt + soc,
data = train,
family = binomial)
summary(mod_glm)
##
## Call:
## glm(formula = type ~ clay + nitrogen + ph + sand + silt + soc,
## family = binomial, data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.152e+00 1.268e+01 -0.643 0.520
## clay 2.154e-03 1.268e-02 0.170 0.865
## nitrogen 2.340e-03 3.172e-05 73.762 <2e-16 ***
## ph -6.048e-02 8.786e-04 -68.837 <2e-16 ***
## sand 1.051e-02 1.268e-02 0.829 0.407
## silt 1.725e-02 1.268e-02 1.361 0.174
## soc -1.692e-03 1.749e-05 -96.727 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 180236 on 142198 degrees of freedom
## Residual deviance: 149102 on 142192 degrees of freedom
## AIC: 149116
##
## Number of Fisher Scoring iterations: 4
The model summary provides estimates for the intercept and predictor coefficients, along with their standard errors, z-values, and p-values. Among the soil variables, nitrogen (positive effect), pH (negative effect), and SOC (negative effect) are all statistically significant predictors of presence. In contrast, clay, sand, and silt are non-significant, and may be contributing unnecessary noise to the model — which we can check for potential collinearity.
It is important to emphasize that the coefficient estimates reflect the change in log-odds of presence per unit increase in each predictor, not a direct change in probability. This is a fundamental feature of logistic regression models with binary response variables.
The model shows good explanatory power: the null deviance of 180,236 is reduced to a residual deviance of 149,102, indicating that the predictors substantially improve model fit. Additionally, the Akaike Information Criterion (AIC) value of 149,116 provides a useful metric for model comparison. A lower AIC indicates a better-fitting model, and this value will be used to compare against more flexible alternatives, such as a generalized additive model (GAM).
Log-odds can be a difficult thing to interpret. I’m applying the code below to examine it differently, making our interpretation a bit more intuitive.
pander::pander(exp(coef(mod_glm)))
(Intercept) | clay | nitrogen | ph | sand | silt | soc |
---|---|---|---|---|---|---|
0.0002881 | 1.002 | 1.002 | 0.9413 | 1.011 | 1.017 | 0.9983 |
Predictor | Odds Ratio | Interpretation |
---|---|---|
(Intercept) | 0.00029 | Baseline odds of presence when all predictors = 0 (not interpretable) |
Clay | 1.0022 | Each 1-unit increase in clay increases odds of presence by 0.2% |
Nitrogen | 1.0023 | Each 1-unit increase in nitrogen increases odds by 0.23% |
pH | 0.9413 | Each 1-unit increase in pH decreases odds by about 5.9% |
Sand | 1.0106 | Each 1-unit increase in sand increases odds by 1.1% |
Silt | 1.0174 | Each 1-unit increase in silt increases odds by 1.7% |
SOC | 0.9983 | Each 1-unit increase in SOC decreases odds by about 0.17% |
Checking collinearity is extremely important in modeling. Using the vif() function, we get value to describe multicollinearity in our model. For reference, 5-10 presents problems, 10 and up is pretty bad.
pander::pander(vif(mod_glm))
clay | nitrogen | ph | sand | silt | soc |
---|---|---|---|---|---|
17639 | 3.005 | 1.436 | 55880 | 30333 | 3.35 |
We see that our non-significant variables from the glm are extremely high. This might be due to each of these variables being reported in a 0-100% content. I’ll probably drop 2 of them in a model 2.0.
This is a fun part of modeling. How does each variable influence the predicted probability of common juniper presence?
par(mfrow = c(1,3))
plot(ggpredict(mod_glm, terms = "ph"))
## Data were 'prettified'. Consider using `terms="ph [all]"` to get smooth
## plots.
plot(ggpredict(mod_glm, terms = "nitrogen"))
## Data were 'prettified'. Consider using `terms="nitrogen [all]"` to get
## smooth plots.
plot(ggpredict(mod_glm, terms = "soc"))
## Data were 'prettified'. Consider using `terms="soc [all]"` to get smooth
## plots.
par(mfrow=c(1,1))
plot(ggpredict(mod_glm, terms = c("ph", "nitrogen", "soc")))
## Data were 'prettified'. Consider using `terms="ph [all]"` to get smooth
## plots.
mod_glm2 <- glm(type ~ ph + nitrogen + soc + clay, data = train, family = binomial)
summary(mod_glm2)
##
## Call:
## glm(formula = type ~ ph + nitrogen + soc + clay, family = binomial,
## data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.912e+00 5.978e-02 65.44 <2e-16 ***
## ph -5.557e-02 8.189e-04 -67.86 <2e-16 ***
## nitrogen 2.516e-03 3.084e-05 81.58 <2e-16 ***
## soc -1.586e-03 1.730e-05 -91.64 <2e-16 ***
## clay -6.733e-03 9.718e-05 -69.29 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 180236 on 142198 degrees of freedom
## Residual deviance: 158224 on 142194 degrees of freedom
## AIC: 158234
##
## Number of Fisher Scoring iterations: 4
pander::pander(exp(coef(mod_glm2)))
(Intercept) | ph | nitrogen | soc | clay |
---|---|---|---|---|
49.98 | 0.9459 | 1.003 | 0.9984 | 0.9933 |
Predictor | Odds Ratio | Interpretation |
---|---|---|
(Intercept) | 50.089 | Baseline odds of presence when all predictors = 0 |
pH | 0.9459 | Each 1-unit increase in pH decreases odds of presence by ~5.4% |
Nitrogen | 1.0025 | Each 1-unit increase in nitrogen increases odds of presence by ~0.25% |
SOC | 0.9984 | Each 1-unit increase in SOC decreases odds of presence by ~0.16% |
Clay | 0.9933 | Each 1-unit increase in clay decreases odds of presence by ~0.67% |
pander::pander(vif(mod_glm2))
ph | nitrogen | soc | clay |
---|---|---|---|
1.405 | 3.29 | 3.619 | 1.143 |
par(mfrow = c(1,3))
plot(ggpredict(mod_glm2, terms = "ph"))
## Data were 'prettified'. Consider using `terms="ph [all]"` to get smooth
## plots.
plot(ggpredict(mod_glm2, terms = "nitrogen"))
## Data were 'prettified'. Consider using `terms="nitrogen [all]"` to get
## smooth plots.
plot(ggpredict(mod_glm2, terms = "soc"))
## Data were 'prettified'. Consider using `terms="soc [all]"` to get smooth
## plots.
plot(ggpredict(mod_glm2, terms = "clay"))
## Data were 'prettified'. Consider using `terms="clay [all]"` to get
## smooth plots.
par(mfrow=c(1,1))
plot(ggpredict(mod_glm2, terms = c("ph", "nitrogen", "soc", "clay")))
## Data were 'prettified'. Consider using `terms="ph [all]"` to get smooth
## plots.
## Package {see} needed to plot multiple panels in one integrated figure.
## Please install it by typing `install.packages("see", dependencies =
## TRUE)` into the console.
## [[1]]
##
## [[2]]
##
## [[3]]
AIC(mod_glm, mod_glm2)
## df AIC
## mod_glm 7 149116
## mod_glm2 5 158234
library(pROC)
## Warning: package 'pROC' was built under R version 4.4.1
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
prob_glm <- predict(mod_glm, type = "response")
prob_glm2 <- predict(mod_glm2, type = "response")
roc_glm <- roc(train$type, prob_glm)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc_glm2 <- roc(train$type, prob_glm2)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_glm, col = "blue", lwd = 2, main = "ROC Curves: GLM vs GLM2")
plot(roc_glm2, col = "red", lwd = 2, add = TRUE)
legend("bottomright", legend = c("mod_glm", "mod_glm2"),
col = c("blue", "red"), lwd = 2)
roc.test(roc_glm, roc_glm2)
##
## DeLong's test for two correlated ROC curves
##
## data: roc_glm and roc_glm2
## Z = 52.547, p-value < 2.2e-16
## alternative hypothesis: true difference in AUC is not equal to 0
## 95 percent confidence interval:
## 0.04034700 0.04347342
## sample estimates:
## AUC of roc1 AUC of roc2
## 0.7809060 0.7389958
Modeling is most exciting when it requires weighing statistical rigor against predictive performance. In this case, two models offer competing advantages:
mod_glm provides higher predictive performance, as evidenced by a lower AIC, higher AUC, and improved classification accuracy. If the goal were solely predictive mapping, this model would clearly dominate.
mod_glm2, however, addresses important structural issues. It removes highly collinear predictors, includes only statistically significant variables, and results in a more interpretable and stable model. While it shows slightly reduced predictive power, it offers greater confidence in coefficient estimates and the ecological interpretation of soil variables.
Though mod_glm performs better numerically, we favor mod_glm2 for its stronger inferential validity. A responsible modeling approach should prioritize the integrity of the model structure and the interpretability of its predictors, especially when informing ecological understanding or conservation decisions. Predictive accuracy is valuable, but not at the expense of robustness and transparency.
Let’s see if a GAMs allowance of non-linear relationships improves our model.
Now I’ll fit a GAM using the same predictors, examine response smooths, compare AIC and AUC, and decide whether flexibility improves performance.
mod_gam <- gam(type ~ s(ph) + s(nitrogen) + s(soc) + s(clay),
data = train,
family = binomial)
summary(mod_gam)
##
## Family: binomial
## Link function: logit
##
## Formula:
## type ~ s(ph) + s(nitrogen) + s(soc) + s(clay)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.91343 0.05484 -34.89 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(ph) 8.121 8.267 1707 <2e-16 ***
## s(nitrogen) 8.914 8.996 2434 <2e-16 ***
## s(soc) 8.932 8.998 11097 <2e-16 ***
## s(clay) 7.809 8.210 5818 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.288 Deviance explained = 26.6%
## UBRE = -0.069221 Scale est. = 1 n = 142199
The GAM output differs slightly from the GLM summary in structure and interpretation. While the intercept is not of primary interest, it is significant and serves as the baseline from which smoothed relationships are built.
Each smooth term is reported with its estimated degrees of freedom (edf) and an associated p-value. The edf reflects the flexibility of the spline: values near 1 suggest a linear relationship, while values significantly above 1 indicate non-linear patterns. In this case, all predictors exhibit strongly non-linear responses:
ph (edf = 8.12),
nitrogen (edf = 8.91),
soc (edf = 8.93), and
clay (edf = 7.81)
These high edf values reveal that the relationship between each soil variable and the probability of presence is complex and curved, not simply linear. Moreover, all smooth terms are highly statistically significant (p < 2e-16), reinforcing the value of allowing non-linear structures in this model.
Additionally, the R-squared value of .288 explains ~29% in the binary response, which isn’t outstanding by any means. The deviance explained is 26.6%, which seems reasonable for this models response of presence and absence. Lastly, the UBRE value is -0.069221, which is negative and small, suggesting a good fit (this is the gam verison of glm AIC).
One of the strengths of using GAMs is the ability to visualize partial effect response curves. These plots display the estimated effect of each predictor on the response variable (in this case, presence probability) while holding all other variables constant.
The x-axis represents the range of observed values for a given predictor.
The y-axis represents the partial effect on the log-odds of presence — effectively how that variable alone shifts the predicted probability.
The black line is the estimated smooth function — a flexible, data-driven fit showing how the response changes.
The grey shaded area represents the 95% confidence interval, providing a sense of uncertainty around the estimate. Narrow bands indicate high certainty; wider bands suggest more variability or sparse data.
These plots are particularly useful for identifying non-linear thresholds, plateaus, or unimodal relationships that GLMs would miss due to their assumption of linearity.
draw(mod_gam)
The relationship between pH and presence probability is relatively flat across most of the gradient, with a noticeable dip beyond a value of ~80. This suggests that soil pH has little influence on common juniper presence up to a certain point, after which probability declines. This pattern may reflect physiological intolerance to higher pH levels, consistent with known ecological thresholds in many plant species.
There is a strong positive effect of nitrogen up to approximately 800 units, after which the curve levels off, followed by minor dips and rises as uncertainty increases. This indicates that common juniper presence increases sharply with nitrogen availability to a threshold, beyond which additional nitrogen no longer enhances probability of occurrence — a classic saturation effect.
SOC displays a highly non-linear and fluctuating relationship with presence probability. An initial positive peak around 500, followed by a dip and subtle oscillations, suggests that juniper favors low to moderate organic soils. High SOC values are associated with declines in presence, indicating avoidance of highly organic, possibly poorly drained soils.
The relationship shows a steady decline in presence probability as clay content increases. This aligns well with field observations: common juniper prefers well-drained, sandy or rocky soils. High clay content likely reflects heavier, wetter soils that may be unsuitable for establishment or survival.
I like ggeeffects and while the partial plots above are informative, these stand-along plots work great for visualizing each variable. I can also visualize them together as was done with the glm model above.
library(ggeffects)
par(mfrow = c(2, 2))
plot(ggpredict(mod_gam, terms = "ph"))
plot(ggpredict(mod_gam, terms = "nitrogen"))
plot(ggpredict(mod_gam, terms = "soc"))
plot(ggpredict(mod_gam, terms = "clay"))
par(mfrow = c(1, 1))
plot(ggpredict(mod_gam, terms = c("ph", "nitrogen", "soc", "clay")))
## Package {see} needed to plot multiple panels in one integrated figure.
## Please install it by typing `install.packages("see", dependencies =
## TRUE)` into the console.
## [[1]]
##
## [[2]]
##
## [[3]]
prob_gam <- predict(mod_gam, type = "response")
roc_gam <- roc(train$type, prob_gam)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_glm2, col = "blue", lwd = 2, main = "ROC Curves: GLM vs GAM")
plot(roc_gam, col = "darkgreen", lwd = 2, add = TRUE)
legend("bottomright", legend = c("mod_glm", "mod_gam"),
col = c("blue", "darkgreen"), lwd = 2)
roc.test(roc_glm2, roc_gam)
##
## DeLong's test for two correlated ROC curves
##
## data: roc_glm2 and roc_gam
## Z = -80.315, p-value < 2.2e-16
## alternative hypothesis: true difference in AUC is not equal to 0
## 95 percent confidence interval:
## -0.09133699 -0.08698529
## sample estimates:
## AUC of roc1 AUC of roc2
## 0.7389958 0.8281569
Just as was done with comparing GLMs, DeLong’s test for correlated ROC curves shows that that the test was significant that there is a difference between the model performance, but that GAM was much higher (73.9% vs. 82.8%). The improvement is real.
Modeling is rarely as straightforward as comparing numbers. While some models may perform statistically better (as with the GLM vs. GLM2), selecting the “best” model depends just as much on how well it aligns with the ecological questions, underlying assumptions, and the way we intend to interpret and communicate the results.
In this case, the GAM outperformed the GLMs in predictive accuracy. But more importantly, it better reflects how we understand ecological systems. The relationships between ecological responses such as species presence, abundance, or canopy cover and environmental covariates like climate, soil, or topography are rarely linear. They bend, peak, and dip. The GAM allows us to embrace that complexity, not ignore it.
This isn’t to say GLMs aren’t useful — they are. They are interpretable, robust, and efficient. But when ecological realism matters, and when the response surface may be non-linear, GAMs offer a way to model the world more like it really is: dynamic, flexible, and full of gradients.