Generalized Linear Models

1. Theoretical Framework

Generalized Linear Models (GLMs; Nelder & Wedderburn, 1972) extend ordinary least squares regression to outcome variables whose conditional distribution belongs to the exponential family. A GLM is defined by three components:

Random component — the conditional distribution of \(Y | \mathbf{x}\) (e.g., Normal, Binomial, Poisson, Gamma)
Systematic component — a linear predictor \(\eta_i = \mathbf{x}_i^\top \boldsymbol{\beta}\)
Link function \(g(\cdot)\) — a monotone differentiable function relating the conditional mean \(\mu_i = E[Y_i|\mathbf{x}_i]\) to the linear predictor: \(g(\mu_i) = \eta_i\)

The model is estimated via maximum likelihood, and inference relies on asymptotic normality of the MLE. Parameters are tested with Wald statistics (\(z = \hat\beta / \text{SE}\)) or likelihood ratio tests (LRTs), both asymptotically \(\chi^2\).

Family	Canonical link	Use case
Gaussian	Identity	Continuous, symmetric outcome
Binomial	Logit	Binary / proportion outcome
Poisson	Log	Non-negative integer counts
Negative Binomial	Log	Overdispersed counts
Gamma	Inverse / Log	Positive skewed continuous

2. Gaussian GLM — Insurance Charges

2.1 Data and Rationale

Medical insurance charges are right-skewed and strictly positive. A log transformation induces approximate normality, making a Gaussian GLM with identity link on \(\log(\text{charges})\) appropriate. Alternatively, a Gamma GLM with log link directly models the original scale without requiring a transformation.

p1 <- ggplot(df_insurance, aes(x=charges)) +
  geom_histogram(bins=50, fill="#2166ac", colour="white") +
  theme_bw() + labs(title="Raw charges (right-skewed)", x="Charges (USD)")

p2 <- ggplot(df_insurance, aes(x=log_charges)) +
  geom_histogram(bins=50, fill="#4dac26", colour="white") +
  theme_bw() + labs(title="Log-transformed charges (approx. normal)", x="log(Charges)")

plot_multiplot(p1, p2, cols=2)

## [[1]]

2.2 Model Specification and Estimation

We fit a saturated model including all main effects and the theoretically motivated smoker × bmi interaction (smoking amplifies the effect of obesity on health costs).

m_gauss_null <- glm(log_charges ~ 1,
                    family=gaussian(link="identity"),
                    data=df_insurance)

m_gauss_main <- glm(log_charges ~ age + sex + bmi + children + smoker + region,
                    family=gaussian(link="identity"),
                    data=df_insurance)

m_gauss_full <- glm(log_charges ~ age + sex + bmi + children + smoker + region +
                      smoker:bmi,
                    family=gaussian(link="identity"),
                    data=df_insurance)

summary(m_gauss_full)

## 
## Call:
## glm(formula = log_charges ~ age + sex + bmi + children + smoker + 
##     region + smoker:bmi, family = gaussian(link = "identity"), 
##     data = df_insurance)
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      7.3373697  0.0766733  95.697  < 2e-16 ***
## age              0.0347953  0.0008429  41.281  < 2e-16 ***
## sexmale         -0.0870645  0.0236073  -3.688 0.000235 ***
## bmi              0.0034060  0.0022676   1.502 0.133336    
## children         0.1031486  0.0097594  10.569  < 2e-16 ***
## smokeryes        0.1564189  0.1459995   1.071 0.284199    
## regionnorthwest -0.0711306  0.0337354  -2.108 0.035176 *  
## regionsoutheast -0.1627269  0.0339029  -4.800 1.77e-06 ***
## regionsouthwest -0.1375125  0.0338557  -4.062 5.16e-05 ***
## bmi:smokeryes    0.0455744  0.0046633   9.773  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1842831)
## 
##     Null deviance: 1130.47  on 1337  degrees of freedom
## Residual deviance:  244.73  on 1328  degrees of freedom
## AIC: 1546.1
## 
## Number of Fisher Scoring iterations: 2

2.3 Type II Analysis of Deviance

Anova(m_gauss_full, type=2)

## Analysis of Deviance Table (Type II tests)
## 
## Response: log_charges
##            LR Chisq Df Pr(>Chisq)    
## age         1704.11  1  < 2.2e-16 ***
## sex           13.60  1   0.000226 ***
## bmi           43.61  1  4.001e-11 ***
## children     111.71  1  < 2.2e-16 ***
## smoker      2822.41  1  < 2.2e-16 ***
## region        27.42  3  4.805e-06 ***
## bmi:smoker    95.51  1  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

2.4 Likelihood Ratio Test — Interaction Term

lrt_interaction <- anova(m_gauss_main, m_gauss_full, test="LRT")
lrt_interaction

## Analysis of Deviance Table
## 
## Model 1: log_charges ~ age + sex + bmi + children + smoker + region
## Model 2: log_charges ~ age + sex + bmi + children + smoker + region + 
##     smoker:bmi
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1      1329     262.33                          
## 2      1328     244.73  1   17.601 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

cat(sprintf("\nLRT: χ²(%d) = %.3f, p = %.4f\n",
    lrt_interaction$Df[2],
    lrt_interaction$Deviance[2],
    lrt_interaction$`Pr(>Chi)`[2]))

## 
## LRT: χ²(1) = 17.601, p = 0.0000

2.5 Model Fit

cat("Null deviance     :", round(m_gauss_null$deviance, 2), "\n")

## Null deviance     : 1130.47

cat("Residual deviance :", round(m_gauss_full$deviance, 2), "\n")

## Residual deviance : 244.73

cat("McFadden pseudo-R²:", round(1 - m_gauss_full$deviance/m_gauss_null$deviance, 4), "\n")

## McFadden pseudo-R²: 0.7835

cat("AIC               :", round(AIC(m_gauss_full), 2), "\n")

## AIC               : 1546.11

2.6 Variance Inflation Factors

VIF > 5 indicates problematic collinearity; VIF > 10 is severe.

vif(m_gauss_full)

##                 GVIF Df GVIF^(1/(2*Df))
## age         1.017506  1        1.008715
## sex         1.011478  1        1.005723
## bmi         1.387358  1        1.177862
## children    1.004195  1        1.002095
## smoker     25.203045  1        5.020263
## region      1.099757  3        1.015975
## bmi:smoker 25.533081  1        5.053027

2.7 Diagnostics

par(mfrow=c(2,3))
plot(m_gauss_full, which=1:6)

par(mfrow=c(1,1))

2.8 Marginal Effects of Smoking by BMI

nd <- expand.grid(
  age=mean(df_insurance$age),
  sex="female",
  bmi=seq(18, 53, length.out=100),
  children=0,
  smoker=c("no","yes"),
  region="northeast"
)
nd$fit <- predict(m_gauss_full, newdata=nd, type="response")

ggplot(nd, aes(x=bmi, y=exp(fit), colour=smoker, fill=smoker)) +
  geom_line(linewidth=1) +
  scale_colour_manual(values=c("#2166ac","#d6604d")) +
  theme_bw() +
  labs(title="Predicted charges by BMI and smoking status",
       subtitle="Marginal prediction at mean age, female, 0 children, northeast",
       x="BMI", y="Predicted charges (USD)", colour="Smoker")

3. Binomial GLM — Graduate Admission

3.1 Data and Rationale

The df_admission dataset contains binary admission decisions (0/1) from a graduate programme alongside GRE score, GPA, and institutional rank. A binomial GLM with logit link — logistic regression — models the log-odds of admission as a linear function of the predictors.

\[\log\frac{P(\text{admit}=1|\mathbf{x})}{1-P(\text{admit}=1|\mathbf{x})} = \beta_0 + \beta_1\,\text{gre} + \beta_2\,\text{gpa} + \beta_3\,\text{rank}\]

3.2 Descriptive Overview

p1 <- ggplot(df_admission, aes(x=gre, fill=factor(admit))) +
  geom_density(alpha=0.5) +
  scale_fill_manual(values=c("#2166ac","#d6604d"), labels=c("Rejected","Admitted")) +
  theme_bw() + labs(title="GRE by admission", fill=NULL)

p2 <- ggplot(df_admission, aes(x=gpa, fill=factor(admit))) +
  geom_density(alpha=0.5) +
  scale_fill_manual(values=c("#2166ac","#d6604d"), labels=c("Rejected","Admitted")) +
  theme_bw() + labs(title="GPA by admission", fill=NULL)

plot_multiplot(p1, p2, cols=2)

## [[1]]

3.3 Model Estimation

m_admit_null <- glm(admit ~ 1,
                    family=binomial(link="logit"),
                    data=df_admission)

m_admit_main <- glm(admit ~ gre + gpa + rank,
                    family=binomial(link="logit"),
                    data=df_admission)

summary(m_admit_main)

## 
## Call:
## glm(formula = admit ~ gre + gpa + rank, family = binomial(link = "logit"), 
##     data = df_admission)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.989979   1.139951  -3.500 0.000465 ***
## gre          0.002264   0.001094   2.070 0.038465 *  
## gpa          0.804038   0.331819   2.423 0.015388 *  
## rankRank2   -0.675443   0.316490  -2.134 0.032829 *  
## rankRank3   -1.340204   0.345306  -3.881 0.000104 ***
## rankRank4   -1.551464   0.417832  -3.713 0.000205 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 458.52  on 394  degrees of freedom
## AIC: 470.52
## 
## Number of Fisher Scoring iterations: 4

3.4 Omnibus Likelihood Ratio Test

lrt_admit <- anova(m_admit_null, m_admit_main, test="LRT")
lrt_admit

## Analysis of Deviance Table
## 
## Model 1: admit ~ 1
## Model 2: admit ~ gre + gpa + rank
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       399     499.98                          
## 2       394     458.52  5   41.459 7.578e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

cat(sprintf("\nOverall model: χ²(%d) = %.3f, p = %.4f\n",
    lrt_admit$Df[2],
    lrt_admit$Deviance[2],
    lrt_admit$`Pr(>Chi)`[2]))

## 
## Overall model: χ²(5) = 41.459, p = 0.0000

3.5 Odds Ratios and 95% Profile-Likelihood CIs

or_table <- cbind(
  OR    = exp(coef(m_admit_main)),
  exp(confint(m_admit_main))
)
round(or_table, 3)

##                OR 2.5 % 97.5 %
## (Intercept) 0.019 0.002  0.167
## gre         1.002 1.000  1.004
## gpa         2.235 1.174  4.324
## rankRank2   0.509 0.272  0.945
## rankRank3   0.262 0.132  0.512
## rankRank4   0.212 0.091  0.471

3.6 Model Fit Indices

# McFadden pseudo-R²
pseudo_r2 <- 1 - m_admit_main$deviance / m_admit_null$deviance
# Nagelkerke pseudo-R²
n  <- nrow(df_admission)
r2_cox <- 1 - exp((m_admit_main$deviance - m_admit_null$deviance) / n)
r2_nag <- r2_cox / (1 - exp(-m_admit_null$deviance / n))

cat("McFadden pseudo-R²  :", round(pseudo_r2, 4), "\n")

## McFadden pseudo-R²  : 0.0829

cat("Nagelkerke pseudo-R²:", round(r2_nag,   4), "\n")

## Nagelkerke pseudo-R²: 0.138

cat("AIC                 :", round(AIC(m_admit_main), 2), "\n")

## AIC                 : 470.52

cat("BIC                 :", round(BIC(m_admit_main), 2), "\n")

## BIC                 : 494.47

3.7 Discrimination — ROC Curve and AUC

pred_prob <- predict(m_admit_main, type="response")
roc_obj   <- roc(df_admission$admit, pred_prob, quiet=TRUE)

plot(roc_obj, col="#2166ac", lwd=2,
     main=sprintf("ROC Curve — Graduate Admission  (AUC = %.3f)", auc(roc_obj)))
abline(a=0, b=1, lty=2, col="grey60")

3.8 Classification Table (threshold = 0.5)

pred_class <- ifelse(pred_prob >= 0.5, 1, 0)
conf_mat   <- table(Predicted=pred_class, Observed=df_admission$admit)
conf_mat

##          Observed
## Predicted   0   1
##         0 254  97
##         1  19  30

cat(sprintf("\nAccuracy   : %.3f\n", sum(diag(conf_mat))/sum(conf_mat)))

## 
## Accuracy   : 0.710

cat(sprintf("Sensitivity: %.3f\n",  conf_mat[2,2]/sum(conf_mat[,2])))

## Sensitivity: 0.236

cat(sprintf("Specificity: %.3f\n",  conf_mat[1,1]/sum(conf_mat[,1])))

## Specificity: 0.930

3.9 Predicted Probability Curves

nd2 <- expand.grid(
  gre  = seq(220, 800, length.out=200),
  gpa  = c(2.5, 3.0, 3.5, 4.0),
  rank = factor("Rank2", levels=levels(df_admission$rank))
)
nd2$prob <- predict(m_admit_main, newdata=nd2, type="response")

ggplot(nd2, aes(x=gre, y=prob, colour=factor(gpa), group=factor(gpa))) +
  geom_line(linewidth=1) +
  scale_colour_brewer(palette="RdYlBu", direction=-1) +
  theme_bw() +
  labs(title="Predicted admission probability by GRE and GPA",
       subtitle="Institutional rank fixed at Rank 2",
       x="GRE score", y="P(admit = 1)", colour="GPA")

4. Poisson and Negative Binomial GLM — Insurance Children

4.1 Rationale

The number of dependent children is a non-negative integer — a natural candidate for Poisson regression. A key assumption of the Poisson model is equidispersion (\(\text{Var}[Y] = \mu\)). When the variance exceeds the mean (overdispersion), the Negative Binomial GLM provides a robust alternative by introducing a dispersion parameter \(\alpha\) such that \(\text{Var}[Y] = \mu + \alpha\mu^2\).

4.2 Checking for Overdispersion

cat("Mean of children:", round(mean(df_insurance$children), 4), "\n")

## Mean of children: 1.0949

cat("Var  of children:", round(var(df_insurance$children),  4), "\n")

## Var  of children: 1.4532

cat("Dispersion ratio :", round(var(df_insurance$children)/mean(df_insurance$children), 4), "\n")

## Dispersion ratio : 1.3272

ggplot(df_insurance, aes(x=children)) +
  geom_bar(fill="#2166ac", colour="white") +
  theme_bw() +
  labs(title="Distribution of number of children", x="Children", y="Count")

4.3 Poisson Model

m_pois <- glm(children ~ age + sex + bmi + smoker + region,
              family=poisson(link="log"),
              data=df_insurance)
summary(m_pois)

## 
## Call:
## glm(formula = children ~ age + sex + bmi + smoker + region, family = poisson(link = "log"), 
##     data = df_insurance)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)  
## (Intercept)     -0.178742   0.156281  -1.144   0.2527  
## age              0.003231   0.001874   1.724   0.0846 .
## sexmale          0.037921   0.052505   0.722   0.4702  
## bmi              0.002430   0.004521   0.538   0.5909  
## smokeryes        0.027626   0.064748   0.427   0.6696  
## regionnorthwest  0.093801   0.075061   1.250   0.2114  
## regionsoutheast -0.007951   0.077049  -0.103   0.9178  
## regionsouthwest  0.083873   0.075436   1.112   0.2662  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2001.6  on 1337  degrees of freedom
## Residual deviance: 1994.5  on 1330  degrees of freedom
## AIC: 3899.8
## 
## Number of Fisher Scoring iterations: 5

Formal overdispersion test via ratio of residual deviance to degrees of freedom; values substantially above 1 indicate overdispersion.

dispersion_ratio <- m_pois$deviance / m_pois$df.residual
cat("Residual deviance / df:", round(dispersion_ratio, 4), "\n")

## Residual deviance / df: 1.4996

cat("Interpretation: ratio", ifelse(dispersion_ratio > 1.5, "> 1.5 — overdispersion likely", "≤ 1.5 — equidispersion plausible"), "\n")

## Interpretation: ratio ≤ 1.5 — equidispersion plausible

4.4 Negative Binomial Model

m_nb <- glm.nb(children ~ age + sex + bmi + smoker + region,
               data=df_insurance)
summary(m_nb)

## 
## Call:
## glm.nb(formula = children ~ age + sex + bmi + smoker + region, 
##     data = df_insurance, init.theta = 2.596859124, link = log)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)
## (Intercept)     -0.184715   0.185935  -0.993    0.320
## age              0.003570   0.002235   1.597    0.110
## sexmale          0.039297   0.062625   0.628    0.530
## bmi              0.002196   0.005389   0.407    0.684
## smokeryes        0.025660   0.077409   0.331    0.740
## regionnorthwest  0.092149   0.089517   1.029    0.303
## regionsoutheast -0.008422   0.091333  -0.092    0.927
## regionsouthwest  0.083844   0.089907   0.933    0.351
## 
## (Dispersion parameter for Negative Binomial(2.5969) family taken to be 1)
## 
##     Null deviance: 1507.2  on 1337  degrees of freedom
## Residual deviance: 1502.0  on 1330  degrees of freedom
## AIC: 3834.6
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  2.597 
##           Std. Err.:  0.412 
## 
##  2 x log-likelihood:  -3816.557

cat("\nEstimated dispersion parameter (theta):", round(m_nb$theta, 4), "\n")

## 
## Estimated dispersion parameter (theta): 2.5969

4.5 Poisson vs. Negative Binomial — Vuong Test via LRT

Because the Poisson model is nested in the NB when \(\alpha \to 0\), we compare them using a likelihood ratio test.

lrt_nb <- anova(m_pois, m_nb, test="LRT")
cat("AIC Poisson        :", round(AIC(m_pois), 2), "\n")

## AIC Poisson        : 3899.8

cat("AIC Negative Binom :", round(AIC(m_nb),   2), "\n")

## AIC Negative Binom : 3834.56

cat("Log-likelihood Pois:", round(logLik(m_pois), 2), "\n")

## Log-likelihood Pois: -1941.9

cat("Log-likelihood NB  :", round(logLik(m_nb),   2), "\n")

## Log-likelihood NB  : -1908.28

4.6 Incidence Rate Ratios (IRR)

irr_table <- cbind(
  IRR   = exp(coef(m_nb)),
  exp(confint(m_nb))
)
round(irr_table, 3)

##                   IRR 2.5 % 97.5 %
## (Intercept)     0.831 0.576  1.197
## age             1.004 0.999  1.008
## sexmale         1.040 0.920  1.176
## bmi             1.002 0.992  1.013
## smokeryes       1.026 0.881  1.193
## regionnorthwest 1.097 0.920  1.307
## regionsoutheast 0.992 0.829  1.186
## regionsouthwest 1.087 0.912  1.297

4.7 Predicted Counts by Age and Smoking Status

nd3 <- expand.grid(
  age    = seq(18, 64, length.out=100),
  sex    = "female",
  bmi    = mean(df_insurance$bmi),
  smoker = c("no","yes"),
  region = "northeast"
)
nd3$predicted <- predict(m_nb, newdata=nd3, type="response")

ggplot(nd3, aes(x=age, y=predicted, colour=smoker)) +
  geom_line(linewidth=1) +
  scale_colour_manual(values=c("#2166ac","#d6604d")) +
  theme_bw() +
  labs(title="Predicted number of children by age and smoking status",
       subtitle="Negative Binomial GLM — marginal at mean BMI, female, northeast",
       x="Age", y="Predicted count", colour="Smoker")

5. Logistic Regression — Titanic Survival

5.1 Model with Interaction

The Titanic evacuation followed a “women and children first” protocol, motivating a sex × pclass interaction: the survival advantage of female passengers may differ by class.

m_titanic_null <- glm(survived ~ 1,
                      family=binomial(link="logit"),
                      data=df_titanic)

m_titanic_main <- glm(survived ~ sex + pclass + age,
                      family=binomial(link="logit"),
                      data=df_titanic)

m_titanic_int  <- glm(survived ~ sex * pclass + age,
                      family=binomial(link="logit"),
                      data=df_titanic)

anova(m_titanic_main, m_titanic_int, test="LRT")

## Analysis of Deviance Table
## 
## Model 1: survived ~ sex + pclass + age
## Model 2: survived ~ sex * pclass + age
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1      1041     982.45                          
## 2      1039     931.99  2   50.464 1.101e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(m_titanic_int)

## 
## Call:
## glm(formula = survived ~ sex * pclass + age, family = binomial(link = "logit"), 
##     data = df_titanic)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        4.804345   0.546937   8.784  < 2e-16 ***
## sexmale           -3.886389   0.492375  -7.893 2.95e-15 ***
## pclass2nd         -1.529875   0.566481  -2.701  0.00692 ** 
## pclass3rd         -4.064965   0.510661  -7.960 1.72e-15 ***
## age               -0.038401   0.006743  -5.695 1.23e-08 ***
## sexmale:pclass2nd -0.070404   0.630978  -0.112  0.91116    
## sexmale:pclass3rd  2.488808   0.540042   4.609 4.05e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1414.62  on 1045  degrees of freedom
## Residual deviance:  931.99  on 1039  degrees of freedom
## AIC: 945.99
## 
## Number of Fisher Scoring iterations: 5

5.2 Odds Ratios

round(cbind(OR=exp(coef(m_titanic_int)), exp(confint(m_titanic_int))), 3)

##                        OR  2.5 %  97.5 %
## (Intercept)       122.040 45.331 399.049
## sexmale             0.021  0.007   0.049
## pclass2nd           0.217  0.065   0.630
## pclass3rd           0.017  0.006   0.043
## age                 0.962  0.949   0.975
## sexmale:pclass2nd   0.932  0.279   3.443
## sexmale:pclass3rd  12.047  4.498  38.727

5.3 Estimated Marginal Means (Survival Probability)

emm <- emmeans(m_titanic_int, ~ sex * pclass, type="response")
emm_df <- as.data.frame(emm)

ggplot(emm_df, aes(x=pclass, y=prob, colour=sex, group=sex)) +
  geom_point(size=3) +
  geom_line(linewidth=1) +
  geom_errorbar(aes(ymin=asymp.LCL, ymax=asymp.UCL), width=0.15) +
  scale_colour_manual(values=c("#d6604d","#2166ac")) +
  theme_bw() +
  labs(title="Estimated marginal survival probabilities",
       subtitle="Titanic — sex × passenger class interaction (95% CI)",
       x="Passenger class", y="P(survived = 1)", colour="Sex")

5.4 ROC and AUC

pred_titanic <- predict(m_titanic_int, type="response")
roc_titanic  <- roc(df_titanic$survived, pred_titanic, quiet=TRUE)
plot(roc_titanic, col="#d6604d", lwd=2,
     main=sprintf("ROC Curve — Titanic Survival  (AUC = %.3f)", auc(roc_titanic)))
abline(a=0, b=1, lty=2, col="grey60")

6. Model Comparison and Selection

6.1 Information Criteria

AIC penalises model complexity as \(-2\ell + 2k\); BIC uses \(-2\ell + k\ln n\) and imposes a heavier penalty for larger samples, favouring more parsimonious models.

models <- list(
  "Gaussian — main effects"    = m_gauss_main,
  "Gaussian — + smoker×bmi"    = m_gauss_full,
  "Logistic — admission"       = m_admit_main,
  "Poisson — children"         = m_pois,
  "Neg. Binom — children"      = m_nb,
  "Logistic — Titanic main"    = m_titanic_main,
  "Logistic — Titanic + int."  = m_titanic_int
)

aic_df <- data.frame(
  Model    = names(models),
  k        = sapply(models, function(m) length(coef(m))),
  LogLik   = sapply(models, function(m) round(as.numeric(logLik(m)), 2)),
  AIC      = sapply(models, function(m) round(AIC(m), 2)),
  BIC      = sapply(models, function(m) round(BIC(m), 2)),
  stringsAsFactors = FALSE
)
knitr::kable(aic_df, row.names=FALSE)


|Model                     |  k|   LogLik|     AIC|     BIC|
|:-------------------------|--:|--------:|-------:|-------:|
|Gaussian — main effects   |  9|  -808.52| 1637.03| 1689.02|
|Gaussian — + smoker×bmi   | 10|  -762.05| 1546.11| 1603.29|
|Logistic — admission      |  6|  -229.26|  470.52|  494.47|
|Poisson — children        |  8| -1941.90| 3899.80| 3941.39|
|Neg. Binom — children     |  8| -1908.28| 3834.56| 3881.35|
|Logistic — Titanic main   |  5|  -491.23|  992.45| 1017.22|
|Logistic — Titanic + int. |  7|  -465.99|  945.99|  980.66|

7. Assumptions and Diagnostics Summary

Assumption	Gaussian GLM	Logistic GLM	Poisson GLM
Correct distributional family	Residuals ~ Normal	Binary outcome	Non-negative integer
Linear relationship on link scale	Partial residual plots	Log-odds linearity	Log-mean linearity
Independence of observations	Design assumption	Design assumption	Design assumption
No influential outliers	Cook’s distance, leverage	Cook’s distance	Cook’s distance
No multicollinearity	VIF < 5	VIF < 5	VIF < 5
Equidispersion	—	—	Deviance / df ≈ 1

par(mfrow=c(1,3))
plot(m_gauss_full,   which=1, main="Gaussian GLM — residuals vs fitted")
plot(m_admit_main,   which=1, main="Logistic GLM — residuals vs fitted")
plot(m_nb,           which=1, main="Neg. Binom — residuals vs fitted")

par(mfrow=c(1,1))

8. References

Nelder, J. A., & Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the Royal Statistical Society: Series A, 135(3), 370–384.

McCullagh, P., & Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). Chapman & Hall.

Agresti, A. (2015). Foundations of Linear and Generalized Linear Models. Wiley.

Venables, W. N., & Ripley, B. D. (2002). Modern Applied Statistics with S (4th ed.). Springer.