GAMvsGLMTutorial

Install Required Packages

Simulate Data

set.seed(12345)
# Simulate data
n <- 500
data <- tibble(
  age = rnorm(n, 50, 10),
  systolic_bp = rnorm(n, 120, 15),
  cholesterol = rnorm(n, 200, 25),
  outcome = rbinom(n, 1, 0.3)
)

Fit a logistic GLM and GAM

glm_fit <- glm(outcome ~ age + systolic_bp + cholesterol,
               data = data,
               family = binomial())

# Fit a logistic GAM with smooth terms
gam_fit <- gam(outcome ~ s(age) + s(systolic_bp) + s(cholesterol),
               data = data,
               family = binomial())

# Compare model summaries
summary(glm_fit)

## 
## Call:
## glm(formula = outcome ~ age + systolic_bp + cholesterol, family = binomial(), 
##     data = data)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.8282026  1.2344843  -0.671    0.502
## age          0.0110854  0.0102482   1.082    0.279
## systolic_bp -0.0071779  0.0066821  -1.074    0.283
## cholesterol  0.0006896  0.0040320   0.171    0.864
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 585.24  on 499  degrees of freedom
## Residual deviance: 582.82  on 496  degrees of freedom
## AIC: 590.82
## 
## Number of Fisher Scoring iterations: 4

summary(gam_fit)

## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## outcome ~ s(age) + s(systolic_bp) + s(cholesterol)
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.9903     0.1010  -9.804   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                  edf Ref.df Chi.sq p-value
## s(age)         1.000  1.000  1.202   0.273
## s(systolic_bp) 1.328  1.595  1.698   0.429
## s(cholesterol) 1.000  1.000  0.026   0.873
## 
## R-sq.(adj) =  -0.000187   Deviance explained = 0.534%
## UBRE = 0.18154  Scale est. = 1         n = 500

tbl_regression(glm_fit, exponentiate = TRUE)  # exponentiate = TRUE gives ORs

Characteristic	OR	95% CI	p-value
age	1.01	0.99, 1.03	0.3
systolic_bp	0.99	0.98, 1.01	0.3
cholesterol	1.00	0.99, 1.01	0.9
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

#  tidy summary for GAM

tidy_gam <- tidy(gam_fit)



# Extract parametric terms (intercept and linear predictors)
tidy_gam <- tidy(gam_fit, parameter = "parametric")
tidy_gam

## # A tibble: 3 × 5
##   term             edf ref.df statistic p.value
##   <chr>          <dbl>  <dbl>     <dbl>   <dbl>
## 1 s(age)          1.00   1.00    1.20     0.273
## 2 s(systolic_bp)  1.33   1.59    1.70     0.429
## 3 s(cholesterol)  1.00   1.00    0.0258   0.873

Model Matrix

In a linear or generalized additive model (GAM), the model matrix (model.matrix(gam_fit)) is the matrix that contains all the predictors used to fit the model. The rank of this matrix (qr(X)$rank) is the number of linearly independent rows in it, which gives us an idea of how well-conditioned the matrix is.

Rank of the matrix: This is the number of independent predictors in your model. If the rank is equal to the number of predictors, this means that the matrix is full rank, and there is no exact linear dependence between the variables.
Lower rank than the number of predictors: This could indicate multicollinearity—meaning some predictors are highly correlated, making it difficult to estimate their individual effects reliably. Solution: Removing highly correlated predictors or using regularization methods, such as Ridge or Lasso regression.

The Model Matrix The design matrix (X), which we are referring to as the model matrix, has 500 rows (one for each data point) and 28 columns (which corresponds to the number of parameters). This matrix includes:

Intercept: 1 parameter.

Smooth Terms: Each smooth term (s(age), s(systolic_bp), s(cholesterol)) is approximated by several basis functions, contributing additional parameters. By default, mgcv::gam() uses 10 basis functions for each smooth term The model will use multiple basis functions to represent each smooth term (s(age), s(systolic_bp), s(cholesterol)).

s(age) will use 10 basis functions

s(systolic_bp) will use another set of 10 basis functions

s(cholesterol) will use another 10 basis functions.

Together, the total number of parameters= 1 intercept+10 basis functions for s(age)+10 basis functions for s(systolic_bp)+10 basis functions for s(cholesterol) = 28 parameters. The fact that your model matrix rank is 28, and the number of parameters is also 28, suggests that the model is well-specified and that the smooth terms are represented by the expected number of basis functions

# Check the model matrix to identify collinearity
X <- model.matrix(gam_fit)
rank_matrix <- qr(X)$rank

# If rank is lower than the number of parameters, there's likely multicollinearity.
print(rank_matrix)

## [1] 28

#Get the number of coefficients in the model
num_params <- length(coef(gam_fit))
print(num_params)

## [1] 28

# View the model matrix
X <- model.matrix(gam_fit)
print(dim(X))  # Prints the dimensions (rows and columns) of the model matrix

## [1] 500  28

Visualise Smooth terms in GAM

Interpretation

The y-axis is the smooth contribution to the log-odds.
A wiggly line suggests a non-linear effect.
The shaded area is the confidence interval.

# Plot smooth effects
par(mfrow = c(1, 3))  # Arrange plots in one row
plot(gam_fit, shade = TRUE, seWithMean = TRUE)

Model Performance Comparison

# Predicted probabilities
glm_probs <- predict(glm_fit, type = "response")
gam_probs <- predict(gam_fit, type = "response")

# Generate ROC objects
glm_roc <- roc(data$outcome, glm_probs, levels = c(0, 1), direction = "<")
gam_roc <- roc(data$outcome, gam_probs, levels = c(0, 1), direction = "<")

# Get AUC values
glm_auc <- auc(glm_roc)
gam_auc <- auc(gam_roc)

# Print them
print(glm_auc)

## Area under the curve: 0.551

print(gam_auc)

## Area under the curve: 0.5561

plot(glm_roc, col = "blue", legacy.axes = TRUE, print.auc = TRUE, main = "ROC Curves")
lines(gam_roc, col = "red")
legend("bottomright", legend = c("GLM", "GAM"), col = c("blue", "red"), lwd = 2)

# Pseudo R-squared
1 - logLik(glm_fit) / logLik(glm(null.model <- glm(outcome ~ 1, data = data, family = binomial())))

## 'log Lik.' 0.04335119 (df=4)

1 - logLik(gam_fit) / logLik(glm(null.model))

## 'log Lik.' 0.04451903 (df=4.328511)

Conclusion

If plot(gam_fit) shows clear non-linear shapes, that’s where GAM adds value.
If the AUC or pseudo R² improves for the GAM, it suggests better predictive performance.
Even in simulated data, you can observe this if some variables influence the outcome in a nonlinear way.

Acknowledgement “Portions of this work were developed with assistance from ChatGPT, an AI language model developed by OpenAI (https://openai.com/chatgpt).”