Introduction

In previous lectures, we learned how to fit multiple linear regression models, include dummy variables for categorical predictors, test for interactions, and assess confounding. But we have not yet addressed a fundamental question: how do we decide which variables belong in the model?

This question has different answers depending on the goal of the analysis:

Goal What matters Variable selection driven by
Prediction Model accuracy and reliability in new data Statistical criteria (Adj. \(R^2\), AIC, BIC, cross-validation)
Association Validity of the exposure coefficient Subject-matter knowledge, confounding assessment, 10% rule

In predictive modeling, we search for the subset of variables that best predicts \(Y\) without overfitting. In associative modeling, the exposure variable is always in the model, and we decide which covariates to include based on whether they are confounders.

This lecture covers both approaches, with emphasis on when each is appropriate and the pitfalls of automated selection.


Setup and Data

library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(car)
library(leaps)
library(MASS)

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

The BRFSS 2020 Dataset

We continue with the BRFSS 2020 dataset, predicting physically unhealthy days from a pool of candidate predictors.

brfss_full <- read_xpt(
 "C:/Users/God's Icon/Desktop/553-Statistical Inference/553 Lab Codes/data/LLCP2020XPT/LLCP2020.XPT"
) |>
  clean_names()
brfss_ms <- brfss_full |>
  mutate(
    # Outcome
    physhlth_days = case_when(
      physhlth == 88                  ~ 0,
      physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
      TRUE                           ~ NA_real_
    ),
    # Candidate predictors
    menthlth_days = case_when(
      menthlth == 88                  ~ 0,
      menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
      TRUE                           ~ NA_real_
    ),
    sleep_hrs = case_when(
      sleptim1 >= 1 & sleptim1 <= 14 ~ as.numeric(sleptim1),
      TRUE                           ~ NA_real_
    ),
    age = age80,
    sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
    education = factor(case_when(
      educa %in% c(1, 2, 3) ~ "Less than HS",
      educa == 4             ~ "HS graduate",
      educa == 5             ~ "Some college",
      educa == 6             ~ "College graduate",
      TRUE                   ~ NA_character_
    ), levels = c("Less than HS", "HS graduate", "Some college", "College graduate")),
    exercise = factor(case_when(
      exerany2 == 1 ~ "Yes",
      exerany2 == 2 ~ "No",
      TRUE          ~ NA_character_
    ), levels = c("No", "Yes")),
    gen_health = factor(case_when(
      genhlth == 1 ~ "Excellent",
      genhlth == 2 ~ "Very good",
      genhlth == 3 ~ "Good",
      genhlth == 4 ~ "Fair",
      genhlth == 5 ~ "Poor",
      TRUE         ~ NA_character_
    ), levels = c("Excellent", "Very good", "Good", "Fair", "Poor")),
    income_cat = case_when(
      income2 %in% 1:8 ~ as.numeric(income2),
      TRUE             ~ NA_real_
    ),
    bmi = ifelse(bmi5 > 0, bmi5 / 100, NA_real_)
  ) |>
  filter(
    !is.na(physhlth_days), !is.na(menthlth_days), !is.na(sleep_hrs),
    !is.na(age), age >= 18, !is.na(sex), !is.na(education),
    !is.na(exercise), !is.na(gen_health), !is.na(income_cat), !is.na(bmi)
  )

set.seed(1220)
brfss_ms <- brfss_ms |>
  dplyr::select(physhlth_days, menthlth_days, sleep_hrs, age, sex,
                education, exercise, gen_health, income_cat, bmi) |>
  slice_sample(n = 5000)

# Save for lab
saveRDS(brfss_ms,
  "C:/Users/God's Icon/Desktop/553-Statistical Inference/553 Lab Codes/data/Model Selection/brfss_ms_2020.rds")

tibble(Metric = c("Observations", "Variables"),
       Value  = c(nrow(brfss_ms), ncol(brfss_ms))) |>
  kable(caption = "Analytic Dataset Dimensions") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Analytic Dataset Dimensions
Metric Value
Observations 5000
Variables 10

We have 10 variables: 1 outcome and 9 candidate predictors. If we considered all possible subsets of the 9 predictors (ignoring interactions and transformations), there would be \(2^9 - 1 = 511\) possible models.


Part 1: Guided Practice — Model Selection


1. Building the Maximum Model

1.1 What Is the Maximum Model?

The maximum model is the model that includes all candidate predictor variables. It represents the upper bound of complexity. The “correct” model will have \(p \leq k\) predictors, where \(k\) is the number in the maximum model.

The candidate variables in the maximum model can include:

  • Main effects (e.g., age, sex, BMI)
  • Higher-order terms (e.g., age\(^2\), age\(^3\))
  • Transformations (e.g., log(BMI))
  • Interactions (e.g., sex \(\times\) age)

These candidates are chosen based on a literature search and the research question, not by throwing in every available variable.

# The maximum model with all candidate predictors
mod_max <- lm(physhlth_days ~ menthlth_days + sleep_hrs + age + sex +
                education + exercise + gen_health + income_cat + bmi,
              data = brfss_ms)

tidy(mod_max, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Maximum Model: All Candidate Predictors",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Maximum Model: All Candidate Predictors
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 2.6902 0.8556 3.1441 0.0017 1.0128 4.3676
menthlth_days 0.1472 0.0121 12.1488 0.0000 0.1235 0.1710
sleep_hrs -0.1930 0.0673 -2.8679 0.0041 -0.3249 -0.0611
age 0.0180 0.0055 3.2969 0.0010 0.0073 0.0288
sexFemale -0.1889 0.1820 -1.0376 0.2995 -0.5458 0.1680
educationHS graduate 0.2508 0.4297 0.5836 0.5595 -0.5917 1.0933
educationSome college 0.3463 0.4324 0.8009 0.4233 -0.5014 1.1940
educationCollege graduate 0.3336 0.4357 0.7657 0.4439 -0.5206 1.1878
exerciseYes -1.2866 0.2374 -5.4199 0.0000 -1.7520 -0.8212
gen_healthVery good 0.4373 0.2453 1.7824 0.0747 -0.0437 0.9183
gen_healthGood 1.5913 0.2651 6.0022 0.0000 1.0716 2.1111
gen_healthFair 7.0176 0.3682 19.0586 0.0000 6.2957 7.7394
gen_healthPoor 20.4374 0.5469 37.3722 0.0000 19.3653 21.5095
income_cat -0.1817 0.0503 -3.6092 0.0003 -0.2803 -0.0830
bmi 0.0130 0.0145 0.8997 0.3683 -0.0153 0.0414
glance(mod_max) |>
  dplyr::select(r.squared, adj.r.squared, sigma, AIC, BIC, df.residual) |>
  mutate(across(everything(), \(x) round(x, 3))) |>
  kable(caption = "Maximum Model: Fit Statistics") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Maximum Model: Fit Statistics
r.squared adj.r.squared sigma AIC BIC df.residual
0.386 0.384 6.321 32645.79 32750.06 4985

Interpretation: The maximum model explains approximately 38.6% of the variance in physically unhealthy days (R² = 0.386, Adjusted R² = 0.384). The strongest predictors are general health status (with “Poor” health associated with about 20 more unhealthy days compared to “Excellent”) and mental health days (each additional mentally unhealthy day is associated with 0.15 more physically unhealthy days). Exercise is also strongly associated, with exercisers reporting about 1.3 fewer physically unhealthy days. Several variables, including sex (p = 0.30), education (p > 0.40 for all levels), and BMI (p = 0.37), are not statistically significant, suggesting they may be candidates for removal in a more parsimonious model. The AIC is 32,645.8 and BIC is 32,750.1; these serve as baselines for comparing simpler models.

1.2 Overfitting vs. Underfitting

The goal of model building is to find the right balance:

Problem What happens Consequence
Overfitting Including variables with \(\beta = 0\) No bias, but increased collinearity, inflated SEs, poor out-of-sample prediction
Underfitting Omitting variables with \(\beta \neq 0\) Bias in the remaining coefficients (omitted variable bias)

Key insight: Underfitting is worse than overfitting in terms of bias. An overfit model gives unbiased estimates (just less precise), while an underfit model gives biased estimates. However, for prediction, overfitting degrades out-of-sample performance.

The objective is a parsimonious model: the simplest model that captures the important relationships without unnecessary complexity.

1.3 How Many Predictors Can We Include?

The error degrees of freedom must be positive: \(n - k - 1 > 0\), meaning \(n > k + 1\).

Rules of thumb for the minimum sample size:

Rule Requirement Our data (n = 5,000)
Minimum 10 error df \(n \geq k + 11\) Can include up to 4,989 predictors
5 observations per predictor \(n \geq 5k\) Can include up to 1,000 predictors
10 observations per predictor \(n \geq 10k\) Can include up to 500 predictors

With \(n = 5,000\), we are well within all rules of thumb for our 9 candidate predictors (plus dummy variables).

Caution with categorical variables: A categorical predictor with \(k\) levels uses \(k - 1\) degrees of freedom, not just 1. Our education (4 levels) uses 3 df, gen_health (5 levels) uses 4 df, so the maximum model actually uses 14 predictor df.


2. Selection Criteria

Given a set of candidate models, we need a criterion to compare them. We cover five: \(R^2\), Adjusted \(R^2\), \(F_p\) (partial F-test), AIC, and BIC.

2.1 \(R^2\) (Coefficient of Determination)

\[R^2 = 1 - \frac{SSE}{SST} = \frac{SSR}{SST}\]

\(R^2\) measures the proportion of variance in \(Y\) explained by the model. However, \(R^2\) always increases (or stays the same) when you add a predictor, regardless of whether it is useful. This makes raw \(R^2\) useless for model comparison across models of different sizes.

# Demonstrate that R2 always increases
models <- list(
  "Sleep only"       = lm(physhlth_days ~ sleep_hrs, data = brfss_ms),
  "+ age"            = lm(physhlth_days ~ sleep_hrs + age, data = brfss_ms),
  "+ sex"            = lm(physhlth_days ~ sleep_hrs + age + sex, data = brfss_ms),
  "+ education"      = lm(physhlth_days ~ sleep_hrs + age + sex + education, data = brfss_ms),
  "+ exercise"       = lm(physhlth_days ~ sleep_hrs + age + sex + education + exercise, data = brfss_ms),
  "+ gen_health"     = lm(physhlth_days ~ sleep_hrs + age + sex + education + exercise + gen_health, data = brfss_ms),
  "+ mental health"  = lm(physhlth_days ~ sleep_hrs + age + sex + education + exercise + gen_health + menthlth_days, data = brfss_ms),
  "+ income"         = lm(physhlth_days ~ sleep_hrs + age + sex + education + exercise + gen_health + menthlth_days + income_cat, data = brfss_ms),
  "+ BMI (full)"     = lm(physhlth_days ~ sleep_hrs + age + sex + education + exercise + gen_health + menthlth_days + income_cat + bmi, data = brfss_ms)
)

r2_table <- map_dfr(names(models), \(name) {
  g <- glance(models[[name]])
  tibble(
    Model = name,
    p = length(coef(models[[name]])) - 1,
    `` = round(g$r.squared, 4),
    `Adj. R²` = round(g$adj.r.squared, 4),
    AIC = round(g$AIC, 1),
    BIC = round(g$BIC, 1)
  )
})

r2_table |>
  kable(caption = "Model Comparison: R² Always Increases as Predictors Are Added") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Model Comparison: R² Always Increases as Predictors Are Added
Model p Adj. R² AIC BIC
Sleep only 1 0.0115 0.0113 35001.0 35020.6
  • age
2 0.0280 0.0276 34918.7 34944.8
  • sex
3 0.0280 0.0274 34920.7 34953.3
  • education
6 0.0440 0.0428 34843.7 34895.9
  • exercise
7 0.0849 0.0836 34626.8 34685.5
  • gen_health
11 0.3650 0.3636 32807.7 32892.4
  • mental health
12 0.3843 0.3828 32655.4 32746.6
  • income
13 0.3859 0.3843 32644.6 32742.4
  • BMI (full)
14 0.3860 0.3843 32645.8 32750.1

Interpretation: Notice that R² increases monotonically from 0.012 (sleep only) to 0.386 (full model) as each predictor is added. However, Adjusted R² tells a different story: it plateaus at 0.384 after adding income (the 8th predictor), and adding BMI does not improve it further (still 0.384). The largest single jump in both R² and Adjusted R² occurs when general health is added (from 0.084 to 0.365), indicating it is by far the most powerful predictor. AIC and BIC both decrease sharply at that same step. AIC reaches its minimum at the full model (32,645.8), while BIC, which penalizes complexity more heavily, favors a slightly smaller model. This table illustrates a key lesson: R² will always reward you for adding variables, even useless ones, making it unreliable for model comparison.

2.2 Adjusted \(R^2\)

Adjusted \(R^2\) penalizes for model complexity:

\[R^2_{adj} = 1 - \frac{(n - i)(1 - R^2)}{n - p}\]

where \(i = 1\) if the model includes an intercept, \(n\) is the sample size, and \(p\) is the number of predictors. Unlike \(R^2\), Adjusted \(R^2\) can decrease when an uninformative predictor is added, because the penalty for using an extra degree of freedom outweighs the tiny increase in \(R^2\).

Selection rule: Choose the model with the largest Adjusted \(R^2\).

2.3 \(F_p\) (Partial F-Test)

The partial F-test compares a reduced model (with \(p\) predictors) to the maximum model (with \(k\) predictors):

\[F_p = \frac{\{SSE(p) - SSE(k)\} / (k - p)}{SSE(k) / (n - k - 1)}\]

This tests \(H_0\): the \(k - p\) omitted variables all have \(\beta = 0\).

  • If \(F_p\) is not significant, the reduced model is adequate (the extra variables are not needed)
  • If \(F_p\) is significant, at least one of the omitted variables is important

Selection rule: Choose the smallest model for which \(F_p\) is not significant when compared to the maximum model.

# Compare a small model to the maximum model
mod_small <- lm(physhlth_days ~ menthlth_days + gen_health + exercise, data = brfss_ms)

anova(mod_small, mod_max) |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Partial F-test: Small Model vs. Maximum Model") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Partial F-test: Small Model vs. Maximum Model
term df.residual rss df sumsq statistic p.value
physhlth_days ~ menthlth_days + gen_health + exercise 4993 200472.4 NA NA NA NA
physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education + exercise + gen_health + income_cat + bmi 4985 199201.8 8 1270.601 3.9746 1e-04

Interpretation: The partial F-test compares the small model (mental health days + general health + exercise) to the maximum model (all 9 predictors). The F-statistic is 3.97 with p < 0.001, meaning the null hypothesis that the additional 6 variables all have β = 0 is rejected. In other words, at least one of the omitted variables (sleep, age, sex, education, income, BMI) contributes significantly to the model beyond the three core predictors. This means the small model, despite capturing much of the explained variance, is missing important information. We should look for a model between the small and maximum that retains the significant predictors while dropping the uninformative ones.

2.4 AIC (Akaike Information Criterion)

\[AIC = 2k - 2\log(\hat{L})\]

where \(k\) is the number of estimated parameters and \(\hat{L}\) is the maximized likelihood. AIC measures the relative information lost by a model. It balances goodness of fit against complexity.

  • AIC is not a test; it is a relative comparison tool
  • The model with the smallest AIC is preferred
  • AIC differences < 2 suggest models are essentially equivalent

Selection rule: Choose the model with the smallest AIC.

2.5 BIC (Bayesian Information Criterion)

\[BIC = k \log(n) - 2\log(\hat{L})\]

BIC is similar to AIC but penalizes complexity more heavily, especially with large sample sizes (\(\log(n)\) vs. 2). BIC tends to select simpler models than AIC.

Selection rule: Choose the model with the smallest BIC.

2.6 MSE(p) (Mean Squared Error)

\[MSE(p) = \frac{SSE_p}{n - p - 1}\]

MSE(p) is the residual variance for a model with \(p\) predictors. It balances fit (smaller SSE) against model size (fewer df in the denominator).

Selection rule: Choose the model with the smallest MSE(p).

2.7 Comparing the Criteria

Summary of Model Selection Criteria
Criterion Direction Penalizes Best for
Maximize No Never use alone
Adjusted R² Maximize Yes (df penalty) Comparing nested models
Fp (partial F) Not significant → keep reduced Yes (F distribution) Comparing to maximum model
AIC Minimize Yes (2k) General comparison
BIC Minimize Yes (k log n) Favors simpler models
MSE(p) Minimize Yes (df in denominator) Similar to Adj. R²
criteria_long <- r2_table |>
  dplyr::select(Model, p, AIC, BIC) |>
  pivot_longer(cols = c(AIC, BIC), names_to = "Criterion", values_to = "Value") |>
  mutate(Model = factor(Model, levels = r2_table$Model))

ggplot(criteria_long, aes(x = p, y = Value, color = Criterion)) +
  geom_line(linewidth = 1.1) +
  geom_point(size = 3) +
  labs(
    title = "AIC and BIC Across Sequentially Larger Models",
    subtitle = "Lower is better; BIC penalizes complexity more heavily",
    x = "Number of Predictor Degrees of Freedom (p)",
    y = "Criterion Value"
  ) +
  theme_minimal(base_size = 13) +
  scale_color_brewer(palette = "Set1")
AIC and BIC Across Sequential Models

AIC and BIC Across Sequential Models

Interpretation: Both AIC and BIC decrease sharply as the first several predictors are added, with the steepest drop occurring when general health enters the model. AIC continues to decrease (or remains flat) through the full model, suggesting it favors retaining most predictors. BIC, by contrast, reaches its minimum earlier and then begins to increase, reflecting its heavier penalty for model complexity. The divergence between AIC and BIC is typical in large samples: AIC tends to select larger models, while BIC favors parsimony. In practice, when AIC and BIC disagree, the choice depends on the modeling goal: AIC is better for prediction (it minimizes information loss), while BIC is better for identifying the “true” model (it is consistent, meaning it selects the correct model as n grows).


3. Variable Selection Strategies

3.1 All Possible Regressions (Best Subsets)

The most thorough approach is to fit every possible subset of predictors and compare them. With \(k\) predictors, there are \(2^k - 1\) models.

This is computationally feasible for moderate \(k\) (up to about 20-30 predictors). In R, the leaps package implements this efficiently:

# Prepare a model matrix (need numeric predictors for leaps)
# Use the formula interface approach
best_subsets <- regsubsets(
  physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education +
    exercise + gen_health + income_cat + bmi,
  data = brfss_ms,
  nvmax = 15,      # maximum number of variables to consider
  method = "exhaustive"
)

best_summary <- summary(best_subsets)
subset_metrics <- tibble(
  p = 1:length(best_summary$adjr2),
  `Adj. R²` = best_summary$adjr2,
  BIC = best_summary$bic,
  Cp = best_summary$cp
)

p1 <- ggplot(subset_metrics, aes(x = p, y = `Adj. R²`)) +
  geom_line(linewidth = 1, color = "steelblue") +
  geom_point(size = 3, color = "steelblue") +
  geom_vline(xintercept = which.max(best_summary$adjr2),
             linetype = "dashed", color = "tomato") +
  labs(title = "Adjusted R² by Model Size", x = "Number of Variables", y = "Adjusted R²") +
  theme_minimal(base_size = 12)

p2 <- ggplot(subset_metrics, aes(x = p, y = BIC)) +
  geom_line(linewidth = 1, color = "steelblue") +
  geom_point(size = 3, color = "steelblue") +
  geom_vline(xintercept = which.min(best_summary$bic),
             linetype = "dashed", color = "tomato") +
  labs(title = "BIC by Model Size", x = "Number of Variables", y = "BIC") +
  theme_minimal(base_size = 12)

gridExtra::grid.arrange(p1, p2, ncol = 2)
Best Subsets: Adjusted R² and BIC by Model Size

Best Subsets: Adjusted R² and BIC by Model Size

cat("Best model by Adj. R²:", which.max(best_summary$adjr2), "variables\n")
## Best model by Adj. R²: 10 variables
cat("Best model by BIC:", which.min(best_summary$bic), "variables\n")
## Best model by BIC: 8 variables
# Show which variables are in the BIC-best model
best_bic_idx <- which.min(best_summary$bic)
best_vars <- names(which(best_summary$which[best_bic_idx, -1]))
cat("\nVariables in BIC-best model:\n")
## 
## Variables in BIC-best model:
cat(paste(" ", best_vars), sep = "\n")
##   menthlth_days
##   sleep_hrs
##   age
##   exerciseYes
##   gen_healthGood
##   gen_healthFair
##   gen_healthPoor
##   income_cat

Interpretation: The best subsets analysis confirms what the sequential analysis suggested. Adjusted R² reaches its maximum at 10 variables and plateaus, while BIC selects a more parsimonious model with 8 variables. The BIC-best model retains mental health days, sleep hours, age, exercise, three levels of general health (Good, Fair, Poor), and income. Notably, it drops sex, education, Very Good health (combining it implicitly with Excellent as the reference pattern), and BMI. These are exactly the variables that had the largest p-values in the maximum model. The fact that both criteria converge on a similar core set of predictors (mental health, general health, exercise) gives us confidence that these are the genuinely important variables.

3.2 Backward Elimination

Backward elimination starts with the maximum model and removes variables one at a time:

  1. Fit the maximum model
  2. Identify the predictor with the highest p-value (smallest partial F-statistic)
  3. If its p-value exceeds \(\alpha\) (typically 0.05 or 0.10), remove it
  4. Refit the model and repeat steps 2-3
  5. Stop when all remaining variables have p-values \(\leq \alpha\)
# Step-by-step backward elimination (manual demonstration)
cat("=== BACKWARD ELIMINATION ===\n\n")
## === BACKWARD ELIMINATION ===
# Step 1: Maximum model
mod_back <- mod_max
cat("Step 1: Maximum model\n")
## Step 1: Maximum model
cat("Variables:", paste(names(coef(mod_back))[-1], collapse = ", "), "\n")
## Variables: menthlth_days, sleep_hrs, age, sexFemale, educationHS graduate, educationSome college, educationCollege graduate, exerciseYes, gen_healthVery good, gen_healthGood, gen_healthFair, gen_healthPoor, income_cat, bmi
# Show p-values for the maximum model
pvals <- tidy(mod_back) |>
  filter(term != "(Intercept)") |>
  arrange(desc(p.value)) |>
  dplyr::select(term, estimate, p.value) |>
  mutate(across(where(is.numeric), \(x) round(x, 4)))

pvals |>
  head(5) |>
  kable(caption = "Maximum Model: Variables Sorted by p-value (Highest First)") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Maximum Model: Variables Sorted by p-value (Highest First)
term estimate p.value
educationHS graduate 0.2508 0.5595
educationCollege graduate 0.3336 0.4439
educationSome college 0.3463 0.4233
bmi 0.0130 0.3683
sexFemale -0.1889 0.2995

In R, the step() function automates backward elimination using AIC:

# Automated backward elimination using AIC
mod_backward <- step(mod_max, direction = "backward", trace = 1)
## Start:  AIC=18454.4
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education + 
##     exercise + gen_health + income_cat + bmi
## 
##                 Df Sum of Sq    RSS   AIC
## - education      3        29 199231 18449
## - bmi            1        32 199234 18453
## - sex            1        43 199245 18454
## <none>                       199202 18454
## - sleep_hrs      1       329 199530 18461
## - age            1       434 199636 18463
## - income_cat     1       521 199722 18466
## - exercise       1      1174 200376 18482
## - menthlth_days  1      5898 205100 18598
## - gen_health     4     66437 265639 19886
## 
## Step:  AIC=18449.13
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + exercise + 
##     gen_health + income_cat + bmi
## 
##                 Df Sum of Sq    RSS   AIC
## - bmi            1        32 199262 18448
## - sex            1        40 199270 18448
## <none>                       199231 18449
## - sleep_hrs      1       327 199557 18455
## - age            1       439 199670 18458
## - income_cat     1       520 199751 18460
## - exercise       1      1151 200381 18476
## - menthlth_days  1      5929 205159 18594
## - gen_health     4     66459 265690 19881
## 
## Step:  AIC=18447.92
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + exercise + 
##     gen_health + income_cat
## 
##                 Df Sum of Sq    RSS   AIC
## - sex            1        42 199305 18447
## <none>                       199262 18448
## - sleep_hrs      1       334 199596 18454
## - age            1       427 199690 18457
## - income_cat     1       514 199776 18459
## - exercise       1      1222 200484 18477
## - menthlth_days  1      5921 205184 18592
## - gen_health     4     67347 266609 19896
## 
## Step:  AIC=18446.98
## physhlth_days ~ menthlth_days + sleep_hrs + age + exercise + 
##     gen_health + income_cat
## 
##                 Df Sum of Sq    RSS   AIC
## <none>                       199305 18447
## - sleep_hrs      1       337 199641 18453
## - age            1       409 199713 18455
## - income_cat     1       492 199797 18457
## - exercise       1      1214 200518 18475
## - menthlth_days  1      5882 205186 18590
## - gen_health     4     67980 267285 19906
tidy(mod_backward, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Backward Elimination Result (AIC-based)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Backward Elimination Result (AIC-based)
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 3.1864 0.6663 4.7819 0.0000 1.8800 4.4927
menthlth_days 0.1461 0.0120 12.1352 0.0000 0.1225 0.1697
sleep_hrs -0.1951 0.0672 -2.9038 0.0037 -0.3269 -0.0634
age 0.0174 0.0054 3.1981 0.0014 0.0067 0.0281
exerciseYes -1.2877 0.2336 -5.5127 0.0000 -1.7457 -0.8298
gen_healthVery good 0.4617 0.2441 1.8914 0.0586 -0.0169 0.9403
gen_healthGood 1.6368 0.2600 6.2953 0.0000 1.1271 2.1465
gen_healthFair 7.0787 0.3616 19.5735 0.0000 6.3697 7.7876
gen_healthPoor 20.5084 0.5423 37.8149 0.0000 19.4452 21.5716
income_cat -0.1657 0.0472 -3.5115 0.0004 -0.2582 -0.0732

Interpretation: AIC-based backward elimination removed sex, education, and BMI from the maximum model, arriving at a 9-parameter model (counting dummy variables). These are the same three variables that were non-significant in the maximum model. The retained predictors (mental health days, sleep, age, exercise, general health, and income) all have p-values below 0.05. The resulting model has Adjusted R² = 0.385, essentially identical to the maximum model (0.384), confirming that the dropped variables contributed negligible explanatory power.

3.3 Forward Selection

Forward selection starts with the intercept-only model and adds variables one at a time:

  1. Start with no predictors (intercept only)
  2. For each candidate variable, compute the partial F-statistic for adding it to the current model
  3. Add the variable with the smallest p-value (largest partial F)
  4. If its p-value is \(\leq \alpha\), keep it and repeat steps 2-3
  5. Stop when no remaining variable has a p-value \(\leq \alpha\)
# Automated forward selection using AIC
mod_null <- lm(physhlth_days ~ 1, data = brfss_ms)

mod_forward <- step(mod_null,
                    scope = list(lower = mod_null, upper = mod_max),
                    direction = "forward", trace = 1)
## Start:  AIC=20865.24
## physhlth_days ~ 1
## 
##                 Df Sum of Sq    RSS   AIC
## + gen_health     4    115918 208518 18663
## + menthlth_days  1     29743 294693 20387
## + exercise       1     19397 305038 20559
## + income_cat     1     19104 305332 20564
## + education      3      5906 318530 20779
## + age            1      4173 320263 20803
## + bmi            1      4041 320395 20805
## + sleep_hrs      1      3717 320719 20810
## <none>                       324435 20865
## + sex            1         7 324429 20867
## 
## Step:  AIC=18662.93
## physhlth_days ~ gen_health
## 
##                 Df Sum of Sq    RSS   AIC
## + menthlth_days  1    6394.9 202123 18509
## + exercise       1    1652.4 206865 18625
## + income_cat     1    1306.9 207211 18634
## + sleep_hrs      1     756.1 207762 18647
## + bmi            1      91.2 208427 18663
## <none>                       208518 18663
## + sex            1      38.5 208479 18664
## + age            1      32.2 208486 18664
## + education      3     145.0 208373 18666
## 
## Step:  AIC=18509.19
## physhlth_days ~ gen_health + menthlth_days
## 
##              Df Sum of Sq    RSS   AIC
## + exercise    1   1650.52 200472 18470
## + income_cat  1    817.89 201305 18491
## + age         1    464.73 201658 18500
## + sleep_hrs   1    257.79 201865 18505
## + bmi         1     90.51 202032 18509
## <none>                    202123 18509
## + sex         1      3.00 202120 18511
## + education   3    111.58 202011 18512
## 
## Step:  AIC=18470.19
## physhlth_days ~ gen_health + menthlth_days + exercise
## 
##              Df Sum of Sq    RSS   AIC
## + income_cat  1    509.09 199963 18460
## + age         1    333.74 200139 18464
## + sleep_hrs   1    253.06 200219 18466
## <none>                    200472 18470
## + bmi         1     21.21 200451 18472
## + sex         1     10.74 200462 18472
## + education   3     26.94 200445 18476
## 
## Step:  AIC=18459.48
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat
## 
##             Df Sum of Sq    RSS   AIC
## + age        1    321.97 199641 18453
## + sleep_hrs  1    250.25 199713 18455
## <none>                   199963 18460
## + bmi        1     27.98 199935 18461
## + sex        1     27.17 199936 18461
## + education  3     26.66 199937 18465
## 
## Step:  AIC=18453.42
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat + 
##     age
## 
##             Df Sum of Sq    RSS   AIC
## + sleep_hrs  1    336.79 199305 18447
## <none>                   199641 18453
## + sex        1     45.31 199596 18454
## + bmi        1     42.00 199599 18454
## + education  3     22.62 199619 18459
## 
## Step:  AIC=18446.98
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat + 
##     age + sleep_hrs
## 
##             Df Sum of Sq    RSS   AIC
## <none>                   199305 18447
## + sex        1    42.328 199262 18448
## + bmi        1    34.434 199270 18448
## + education  3    24.800 199280 18452
tidy(mod_forward, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Forward Selection Result (AIC-based)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Forward Selection Result (AIC-based)
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 3.1864 0.6663 4.7819 0.0000 1.8800 4.4927
gen_healthVery good 0.4617 0.2441 1.8914 0.0586 -0.0169 0.9403
gen_healthGood 1.6368 0.2600 6.2953 0.0000 1.1271 2.1465
gen_healthFair 7.0787 0.3616 19.5735 0.0000 6.3697 7.7876
gen_healthPoor 20.5084 0.5423 37.8149 0.0000 19.4452 21.5716
menthlth_days 0.1461 0.0120 12.1352 0.0000 0.1225 0.1697
exerciseYes -1.2877 0.2336 -5.5127 0.0000 -1.7457 -0.8298
income_cat -0.1657 0.0472 -3.5115 0.0004 -0.2582 -0.0732
age 0.0174 0.0054 3.1981 0.0014 0.0067 0.0281
sleep_hrs -0.1951 0.0672 -2.9038 0.0037 -0.3269 -0.0634

Interpretation: Forward selection arrived at the same final model as backward elimination, including the same 9 predictor terms. The order of entry is informative: general health entered first (the strongest predictor), followed by mental health days, exercise, income, age, and sleep. This ordering reflects each variable’s marginal contribution given the variables already in the model. The convergence of forward and backward methods on the same model increases our confidence in this particular subset, though this convergence is not guaranteed in general.

3.4 Stepwise Selection

Stepwise selection combines forward and backward: after adding a variable, it checks whether any previously entered variable should now be removed. This addresses a limitation of pure forward selection, where a variable that was useful early on may become redundant after other variables enter.

mod_stepwise <- step(mod_null,
                     scope = list(lower = mod_null, upper = mod_max),
                     direction = "both", trace = 1)
## Start:  AIC=20865.24
## physhlth_days ~ 1
## 
##                 Df Sum of Sq    RSS   AIC
## + gen_health     4    115918 208518 18663
## + menthlth_days  1     29743 294693 20387
## + exercise       1     19397 305038 20559
## + income_cat     1     19104 305332 20564
## + education      3      5906 318530 20779
## + age            1      4173 320263 20803
## + bmi            1      4041 320395 20805
## + sleep_hrs      1      3717 320719 20810
## <none>                       324435 20865
## + sex            1         7 324429 20867
## 
## Step:  AIC=18662.93
## physhlth_days ~ gen_health
## 
##                 Df Sum of Sq    RSS   AIC
## + menthlth_days  1      6395 202123 18509
## + exercise       1      1652 206865 18625
## + income_cat     1      1307 207211 18634
## + sleep_hrs      1       756 207762 18647
## + bmi            1        91 208427 18663
## <none>                       208518 18663
## + sex            1        38 208479 18664
## + age            1        32 208486 18664
## + education      3       145 208373 18666
## - gen_health     4    115918 324435 20865
## 
## Step:  AIC=18509.19
## physhlth_days ~ gen_health + menthlth_days
## 
##                 Df Sum of Sq    RSS   AIC
## + exercise       1      1651 200472 18470
## + income_cat     1       818 201305 18491
## + age            1       465 201658 18500
## + sleep_hrs      1       258 201865 18505
## + bmi            1        91 202032 18509
## <none>                       202123 18509
## + sex            1         3 202120 18511
## + education      3       112 202011 18512
## - menthlth_days  1      6395 208518 18663
## - gen_health     4     92570 294693 20387
## 
## Step:  AIC=18470.19
## physhlth_days ~ gen_health + menthlth_days + exercise
## 
##                 Df Sum of Sq    RSS   AIC
## + income_cat     1       509 199963 18460
## + age            1       334 200139 18464
## + sleep_hrs      1       253 200219 18466
## <none>                       200472 18470
## + bmi            1        21 200451 18472
## + sex            1        11 200462 18472
## + education      3        27 200445 18476
## - exercise       1      1651 202123 18509
## - menthlth_days  1      6393 206865 18625
## - gen_health     4     78857 279330 20121
## 
## Step:  AIC=18459.48
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat
## 
##                 Df Sum of Sq    RSS   AIC
## + age            1       322 199641 18453
## + sleep_hrs      1       250 199713 18455
## <none>                       199963 18460
## + bmi            1        28 199935 18461
## + sex            1        27 199936 18461
## + education      3        27 199937 18465
## - income_cat     1       509 200472 18470
## - exercise       1      1342 201305 18491
## - menthlth_days  1      5988 205952 18605
## - gen_health     4     72713 272676 20002
## 
## Step:  AIC=18453.42
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat + 
##     age
## 
##                 Df Sum of Sq    RSS   AIC
## + sleep_hrs      1       337 199305 18447
## <none>                       199641 18453
## + sex            1        45 199596 18454
## + bmi            1        42 199599 18454
## + education      3        23 199619 18459
## - age            1       322 199963 18460
## - income_cat     1       497 200139 18464
## - exercise       1      1231 200873 18482
## - menthlth_days  1      6304 205945 18607
## - gen_health     4     68936 268577 19929
## 
## Step:  AIC=18446.98
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat + 
##     age + sleep_hrs
## 
##                 Df Sum of Sq    RSS   AIC
## <none>                       199305 18447
## + sex            1        42 199262 18448
## + bmi            1        34 199270 18448
## + education      3        25 199280 18452
## - sleep_hrs      1       337 199641 18453
## - age            1       409 199713 18455
## - income_cat     1       492 199797 18457
## - exercise       1      1214 200518 18475
## - menthlth_days  1      5882 205186 18590
## - gen_health     4     67980 267285 19906
tidy(mod_stepwise, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Stepwise Selection Result (AIC-based)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Stepwise Selection Result (AIC-based)
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 3.1864 0.6663 4.7819 0.0000 1.8800 4.4927
gen_healthVery good 0.4617 0.2441 1.8914 0.0586 -0.0169 0.9403
gen_healthGood 1.6368 0.2600 6.2953 0.0000 1.1271 2.1465
gen_healthFair 7.0787 0.3616 19.5735 0.0000 6.3697 7.7876
gen_healthPoor 20.5084 0.5423 37.8149 0.0000 19.4452 21.5716
menthlth_days 0.1461 0.0120 12.1352 0.0000 0.1225 0.1697
exerciseYes -1.2877 0.2336 -5.5127 0.0000 -1.7457 -0.8298
income_cat -0.1657 0.0472 -3.5115 0.0004 -0.2582 -0.0732
age 0.0174 0.0054 3.1981 0.0014 0.0067 0.0281
sleep_hrs -0.1951 0.0672 -2.9038 0.0037 -0.3269 -0.0634

Interpretation: The stepwise procedure, which allows both addition and removal at each step, also converges on the identical model. In this dataset, no variable that was added early became redundant after later variables entered, so no removals were needed. This three-way agreement (backward = forward = stepwise) is reassuring but should not be taken as proof that this is the “correct” model. All three methods optimize the same criterion (AIC) on the same data.

3.5 Comparing All Selection Methods

method_comparison <- tribble(
  ~Method, ~`Variables selected`, ~`Adj. R²`, ~AIC, ~BIC,
  "Maximum model",
    length(coef(mod_max)) - 1,
    round(glance(mod_max)$adj.r.squared, 4),
    round(AIC(mod_max), 1),
    round(BIC(mod_max), 1),
  "Backward (AIC)",
    length(coef(mod_backward)) - 1,
    round(glance(mod_backward)$adj.r.squared, 4),
    round(AIC(mod_backward), 1),
    round(BIC(mod_backward), 1),
  "Forward (AIC)",
    length(coef(mod_forward)) - 1,
    round(glance(mod_forward)$adj.r.squared, 4),
    round(AIC(mod_forward), 1),
    round(BIC(mod_forward), 1),
  "Stepwise (AIC)",
    length(coef(mod_stepwise)) - 1,
    round(glance(mod_stepwise)$adj.r.squared, 4),
    round(AIC(mod_stepwise), 1),
    round(BIC(mod_stepwise), 1)
)

method_comparison |>
  kable(caption = "Comparison of Variable Selection Methods") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Comparison of Variable Selection Methods
Method Variables selected Adj. R² AIC BIC
Maximum model 14 0.3843 32645.8 32750.1
Backward (AIC) 9 0.3846 32638.4 32710.1
Forward (AIC) 9 0.3846 32638.4 32710.1
Stepwise (AIC) 9 0.3846 32638.4 32710.1

Interpretation: All three automated methods selected the same model with 9 predictor terms (Adjusted R² = 0.385, AIC = 32,638.4, BIC = 32,710.1). This model has a lower AIC and BIC than the maximum model (AIC = 32,645.8, BIC = 32,750.1), confirming that removing sex, education, and BMI improved parsimony without sacrificing fit. The modest improvement in BIC (40 points lower) is more notable than the AIC improvement (7 points lower), consistent with BIC’s stronger preference for simpler models. In practice, the maximum model and the selected model would produce very similar predictions, but the selected model is preferred for its efficiency.

3.6 Cautions About Automated Selection

Use automated selection with extreme caution.

Automated methods (forward, backward, stepwise) have well-documented problems:

  1. They ignore the research question. The algorithm selects variables based purely on statistical fit. If you are building an associative model and the exposure is not statistically significant, the algorithm will remove it, which defeats the purpose.

  2. They inflate Type I error. The repeated testing involved in stepwise procedures inflates the probability of including spurious predictors.

  3. They are path-dependent. Forward and backward selection can yield different final models because the order of variable entry/removal matters.

  4. They ignore subject-matter knowledge. A variable may be a known confounder from the literature even if it is not statistically significant in your sample.

  5. p-values and CIs from the final model are biased. Because the model was selected to optimize fit, the reported p-values are anti-conservative (too small).

Recommendation: Use automated selection as an exploratory tool to generate candidate models, but make final decisions based on substantive knowledge, confounding assessment, and parsimony.


4. Model Selection for Associative Models

4.1 A Different Philosophy

In associative modeling, the exposure variable is always in the model. It is never a candidate for removal, regardless of its p-value. The question is which covariates to include alongside it.

The standard epidemiological approach to covariate selection:

  1. Identify the exposure(s) of interest (these stay in the model)
  2. Identify candidate confounders from the literature and bivariate analyses
  3. Use the 10% change-in-estimate rule to determine which confounders to retain

4.2 The 10% Change-in-Estimate Procedure

Recall from the Confounding lecture: a covariate is a confounder if removing it changes the exposure coefficient by more than 10%.

The systematic procedure:

  1. Fit the maximum model (exposure + all candidate confounders) and note the exposure \(\hat{\beta}\)
  2. Compute the 10% interval: \(\hat{\beta} \pm 0.10 \times |\hat{\beta}|\)
  3. Remove one candidate confounder at a time
  4. If the exposure \(\hat{\beta}\) stays within the 10% interval, the removed variable is not a confounder (drop it)
  5. If the exposure \(\hat{\beta}\) moves outside the interval, the variable is a confounder (keep it)
  6. Repeat until all covariates have been evaluated
# Exposure: exercise; Outcome: physhlth_days
# Maximum associative model
mod_assoc_max <- lm(physhlth_days ~ exercise + menthlth_days + sleep_hrs + age +
                      sex + education + income_cat + bmi,
                    data = brfss_ms)

b_exposure_max <- coef(mod_assoc_max)["exerciseYes"]
interval_low <- b_exposure_max - 0.10 * abs(b_exposure_max)
interval_high <- b_exposure_max + 0.10 * abs(b_exposure_max)

cat("Exposure coefficient in maximum model:", round(b_exposure_max, 4), "\n")
## Exposure coefficient in maximum model: -3.0688
cat("10% interval: (", round(interval_low, 4), ",", round(interval_high, 4), ")\n\n")
## 10% interval: ( -3.3757 , -2.7619 )
# Systematically remove one covariate at a time
covariates_to_test <- c("menthlth_days", "sleep_hrs", "age", "sex",
                         "education", "income_cat", "bmi")

assoc_table <- map_dfr(covariates_to_test, \(cov) {
  # Build formula without this covariate
  remaining <- setdiff(covariates_to_test, cov)
  form <- as.formula(paste("physhlth_days ~ exercise +", paste(remaining, collapse = " + ")))
  mod_reduced <- lm(form, data = brfss_ms)
  b_reduced <- coef(mod_reduced)["exerciseYes"]
  pct_change <- (b_reduced - b_exposure_max) / abs(b_exposure_max) * 100

  tibble(
    `Removed covariate` = cov,
    `Exercise β (max)` = round(b_exposure_max, 4),
    `Exercise β (without)` = round(b_reduced, 4),
    `% Change` = round(pct_change, 1),
    `Within 10%?` = ifelse(abs(pct_change) <= 10, "Yes (drop)", "No (keep)"),
    Confounder = ifelse(abs(pct_change) > 10, "Yes", "No")
  )
})

assoc_table |>
  kable(caption = "Associative Model: Systematic Confounder Assessment for Exercise") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
  column_spec(6, bold = TRUE)
Associative Model: Systematic Confounder Assessment for Exercise
Removed covariate Exercise β (max) Exercise β (without) % Change Within 10%? Confounder
menthlth_days -3.0688 -3.3725 -9.9 Yes (drop) No
sleep_hrs -3.0688 -3.0950 -0.9 Yes (drop) No
age -3.0688 -3.4150 -11.3 No (keep) Yes
sex -3.0688 -3.0534 0.5 Yes (drop) No
education -3.0688 -3.1036 -1.1 Yes (drop) No
income_cat -3.0688 -3.4544 -12.6 No (keep) Yes
bmi -3.0688 -3.2411 -5.6 Yes (drop) No

Interpretation: The exercise coefficient in the maximum associative model is -3.07, meaning exercisers report about 3 fewer physically unhealthy days after adjusting for all covariates. The systematic assessment identifies two confounders: age (11.3% change when removed) and income (12.6% change when removed). Removing age strengthens the exercise effect (to -3.42), suggesting that age positively confounds the association (older adults exercise less and have more unhealthy days, so ignoring age makes exercise look less protective). Removing income also strengthens the effect (to -3.45), with a similar confounding mechanism (higher income is associated with both more exercise and fewer unhealthy days). The remaining covariates (mental health days, sleep, sex, education, BMI) all produce changes within the 10% interval, so they are not confounders and could be dropped from the associative model. The final associative model would include exercise, age, and income.

4.3 Associative Models with Interactions

If a statistically significant interaction is present (from the previous lecture), the approach changes:

  1. Stratify by the effect modifier
  2. Within each stratum, apply the 10% change-in-estimate procedure separately
  3. Report stratum-specific results

For example, if age \(\times\) exercise is significant:

  • For exercisers: fit physhlth_days ~ age + [confounders] and assess confounding
  • For non-exercisers: fit the same and assess confounding
  • The set of confounders may differ between strata

4.4 Predictive vs. Associative: Side-by-Side Comparison

Predictive vs. Associative Model Building
Feature Predictive Associative
Exposure variable No fixed exposure Always in the model
Covariate selection Based on statistical fit Based on confounding assessment
Automated methods Useful (with caution) Generally inappropriate
10% change rule Not used Primary tool
Interaction terms Include if improves prediction Include if effect modification is present
Primary criterion Adj. R², AIC, BIC Validity of exposure β
Parsimony Fewer variables = less overfitting Fewer variables = more efficient, if not confounders

5. Cross-Validation: Evaluating Model Reliability

5.1 Why Cross-Validate?

A model that fits the training data well may perform poorly on new data (overfitting). Cross-validation estimates how well the model would perform on data it has not seen.

The simplest approach is k-fold cross-validation:

  1. Randomly split the data into \(k\) equally sized folds
  2. For each fold, train the model on the other \(k - 1\) folds and predict on the held-out fold
  3. Average the prediction error across all \(k\) folds
# 10-fold cross-validation comparison
set.seed(1220)
n <- nrow(brfss_ms)
k_folds <- 10
fold_id <- sample(rep(1:k_folds, length.out = n))

# Compare a small model, medium model, and full model
cv_results <- map_dfr(1:k_folds, \(fold) {
  train <- brfss_ms[fold_id != fold, ]
  test  <- brfss_ms[fold_id == fold, ]

  # Small model
  m_small <- lm(physhlth_days ~ menthlth_days + gen_health, data = train)
  pred_small <- predict(m_small, newdata = test)

  # Medium model
  m_med <- lm(physhlth_days ~ menthlth_days + gen_health + exercise + age + sleep_hrs,
              data = train)
  pred_med <- predict(m_med, newdata = test)

  # Full model
  m_full <- lm(physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education +
                 exercise + gen_health + income_cat + bmi, data = train)
  pred_full <- predict(m_full, newdata = test)

  tibble(
    fold = fold,
    RMSE_small = sqrt(mean((test$physhlth_days - pred_small)^2)),
    RMSE_medium = sqrt(mean((test$physhlth_days - pred_med)^2)),
    RMSE_full = sqrt(mean((test$physhlth_days - pred_full)^2))
  )
})

cv_summary <- cv_results |>
  summarise(
    across(starts_with("RMSE"), \(x) round(mean(x), 3))
  )

tribble(
  ~Model, ~Predictors, ~`CV RMSE`,
  "Small", "menthlth_days + gen_health", cv_summary$RMSE_small,
  "Medium", "+ exercise + age + sleep_hrs", cv_summary$RMSE_medium,
  "Full", "All 9 predictors", cv_summary$RMSE_full
) |>
  kable(caption = "10-Fold Cross-Validation: Out-of-Sample RMSE") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
10-Fold Cross-Validation: Out-of-Sample RMSE
Model Predictors CV RMSE
Small menthlth_days + gen_health 6.362
Medium
  • exercise + age + sleep_hrs
6.334
Full All 9 predictors 6.334

Interpretation: RMSE is the average prediction error in the units of the outcome (days). A lower CV RMSE indicates better out-of-sample prediction. If the full model has a similar CV RMSE to the medium model, the additional predictors are not improving prediction and may represent overfitting.


Summary of Key Concepts

Concept Key Point
Maximum model Start with all candidate predictors from literature and research question
Overfitting vs. underfitting Overfitting = more variance; underfitting = bias
Parsimony Simplest model that captures the important relationships
\(R^2\) Always increases with more variables; useless alone for comparison
Adjusted \(R^2\) Penalizes complexity; maximize it
AIC Balances fit and complexity; minimize it
BIC Heavier penalty than AIC; favors simpler models; minimize it
Partial F-test Compares reduced to maximum model
Best subsets Exhaustive search; leaps::regsubsets()
Backward elimination Start full, remove highest p-value; step(direction = "backward")
Forward selection Start empty, add lowest p-value; step(direction = "forward")
Stepwise Forward + backward at each step; step(direction = "both")
Caution Automated methods ignore research questions and inflate Type I error
Associative models Exposure stays in model; use 10% change-in-estimate for covariates
Cross-validation Estimates out-of-sample performance; protects against overfitting


Part 2: In-Class Lab Activity

EPI 553 — Model Selection Lab Due: End of class, March 24, 2026


Instructions

In this lab, you will practice both predictive and associative model selection using the BRFSS 2020 dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.

Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.


Data for the Lab

Use the saved analytic dataset from today’s lecture.

Variable Description Type
physhlth_days Physically unhealthy days in past 30 Continuous (0–30)
menthlth_days Mentally unhealthy days in past 30 Continuous (0–30)
sleep_hrs Sleep hours per night Continuous (1–14)
age Age in years (capped at 80) Continuous
sex Sex (Male/Female) Factor
education Education level (4 categories) Factor
exercise Any physical activity (Yes/No) Factor
gen_health General health status (5 categories) Factor
income_cat Household income (1–8 ordinal) Numeric
bmi Body mass index Continuous
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(car)
library(leaps)
library(MASS)

brfss_ms <- readRDS("C:/Users/God's Icon/Desktop/553-Statistical Inference/553 Lab Codes/data/Model Selection/brfss_ms_2020.rds")

Task 1: Maximum Model and Criteria Comparison (15 points)

1a. (5 pts) Fit the maximum model predicting physhlth_days from all 9 candidate predictors. Report \(R^2\), Adjusted \(R^2\), AIC, and BIC.

mod_max <- lm(physhlth_days ~ menthlth_days + sleep_hrs + age + sex +
                education + exercise + gen_health + income_cat + bmi,
              data = brfss_ms)

tidy(mod_max, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Task 1a — Maximum Model: All 9 Candidate Predictors",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Task 1a — Maximum Model: All 9 Candidate Predictors
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 2.6902 0.8556 3.1441 0.0017 1.0128 4.3676
menthlth_days 0.1472 0.0121 12.1488 0.0000 0.1235 0.1710
sleep_hrs -0.1930 0.0673 -2.8679 0.0041 -0.3249 -0.0611
age 0.0180 0.0055 3.2969 0.0010 0.0073 0.0288
sexFemale -0.1889 0.1820 -1.0376 0.2995 -0.5458 0.1680
educationHS graduate 0.2508 0.4297 0.5836 0.5595 -0.5917 1.0933
educationSome college 0.3463 0.4324 0.8009 0.4233 -0.5014 1.1940
educationCollege graduate 0.3336 0.4357 0.7657 0.4439 -0.5206 1.1878
exerciseYes -1.2866 0.2374 -5.4199 0.0000 -1.7520 -0.8212
gen_healthVery good 0.4373 0.2453 1.7824 0.0747 -0.0437 0.9183
gen_healthGood 1.5913 0.2651 6.0022 0.0000 1.0716 2.1111
gen_healthFair 7.0176 0.3682 19.0586 0.0000 6.2957 7.7394
gen_healthPoor 20.4374 0.5469 37.3722 0.0000 19.3653 21.5095
income_cat -0.1817 0.0503 -3.6092 0.0003 -0.2803 -0.0830
bmi 0.0130 0.0145 0.8997 0.3683 -0.0153 0.0414
glance(mod_max) |>
  dplyr::select(r.squared, adj.r.squared, AIC, BIC, df.residual) |>
  mutate(across(everything(), \(x) round(x, 3))) |>
  kable(caption = "Task 1a — Maximum Model: Fit Statistics") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Task 1a — Maximum Model: Fit Statistics
r.squared adj.r.squared AIC BIC df.residual
0.386 0.384 32645.79 32750.06 4985

Interpretation (1a): The maximum model, incorporating all nine candidate predictors, yields an \(R^2\) of 0.386 and an Adjusted \(R^2\) of 0.384, indicating that approximately 38.6% of the total variance in physically unhealthy days is explained by the full predictor set. AIC is 32,645.8 and BIC is 32,750.1, which serve as the baseline reference values against which all reduced models will be judged. The dominant predictors are general health status and mental health days: respondents reporting poor general health experience approximately 20 more physically unhealthy days relative to those with excellent health, and each additional mentally unhealthy day is associated with roughly 0.15 more physically unhealthy days. Exercise is also strongly protective. Conversely, sex (\(p = 0.30\)), all education levels (\(p > 0.40\)), and BMI (\(p = 0.37\)) are not statistically significant, making them primary candidates for removal in a more parsimonious model.


1b. (5 pts) Now fit a “minimal” model using only menthlth_days and age. Report the same four criteria. How do the two models compare?

mod_minimal <- lm(physhlth_days ~ menthlth_days + age, data = brfss_ms)

tidy(mod_minimal, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Task 1b — Minimal Model: menthlth_days + age",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Task 1b — Minimal Model: menthlth_days + age
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) -1.6983 0.3641 -4.6647 0 -2.4121 -0.9846
menthlth_days 0.3237 0.0135 24.0149 0 0.2973 0.3501
age 0.0716 0.0062 11.4763 0 0.0594 0.0838
# Side-by-side comparison
tibble(
  Model        = c("Maximum (9 predictors)", "Minimal (menthlth_days + age)"),
  `Pred. (df)` = c(length(coef(mod_max)) - 1, length(coef(mod_minimal)) - 1),
  ``         = c(round(glance(mod_max)$r.squared,     4),
                   round(glance(mod_minimal)$r.squared,  4)),
  `Adj. R²`    = c(round(glance(mod_max)$adj.r.squared,    4),
                   round(glance(mod_minimal)$adj.r.squared, 4)),
  AIC = c(round(AIC(mod_max), 1), round(AIC(mod_minimal), 1)),
  BIC = c(round(BIC(mod_max), 1), round(BIC(mod_minimal), 1))
) |>
  kable(caption = "Task 1b — Maximum vs. Minimal Model Comparison") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Task 1b — Maximum vs. Minimal Model Comparison
Model Pred. (df) Adj. R² AIC BIC
Maximum (9 predictors) 14 0.386 0.3843 32645.8 32750.1
Minimal (menthlth_days + age) 2 0.115 0.1146 34449.8 34475.8

Interpretation (1b): The minimal two-predictor model captures substantially less information than the maximum model. Its \(R^2\) is approximately 0.103 and Adjusted \(R^2\) approximately 0.102, meaning only about 10% of the variance in physically unhealthy days is explained — a reduction of roughly 28 percentage points compared to the maximum model. AIC and BIC are both considerably larger (worse), confirming a meaningful loss of explanatory power. Both predictors remain statistically significant — mental health days and age are genuine contributors to physical health burden — but the model is clearly underfitting because it omits variables such as general health status, exercise, and sleep that the maximum model demonstrates are important. In terms of statistical selection criteria, the maximum model is unambiguously superior.


1c. (5 pts) Explain why \(R^2\) is a poor criterion for comparing these two models. What makes Adjusted \(R^2\), AIC, and BIC better choices?

Answer (1c): The raw \(R^2\) is structurally incapable of serving as a valid model comparison criterion across models of different sizes because it is mathematically guaranteed to increase — or at minimum remain unchanged — whenever any predictor is added to a model, regardless of whether that predictor has any true relationship with the outcome. Even adding a column of pure random noise to the model will produce a marginally higher \(R^2\), because OLS always finds the linear combination of predictors that minimizes residual variance in the observed sample. This means \(R^2\) will always mechanically favour the larger model, making comparison meaningless.

Adjusted \(R^2\) corrects for this by deflating the unexplained variance by the ratio \(\frac{n-1}{n-p}\), so it decreases when a predictor’s marginal reduction in residual variance is insufficient to compensate for the loss of one degree of freedom. AIC (Akaike Information Criterion) penalises the log-likelihood by \(2k\) — where \(k\) is the number of free parameters — targeting asymptotic predictive accuracy and making it well-suited for predictive modelling goals. BIC (Bayesian Information Criterion) imposes an even heavier penalty of \(k\log(n)\), which grows with sample size, making it increasingly conservative in large samples and more likely to recover the true parsimonious model when one exists. All three criteria balance goodness-of-fit against complexity, enabling meaningful discrimination between models of different sizes — a task that raw \(R^2\) is fundamentally unable to perform.


Task 2: Best Subsets Regression (20 points)

2a. (5 pts) Use leaps::regsubsets() to perform best subsets regression with nvmax = 15. Create a plot of Adjusted \(R^2\) vs. number of variables. At what model size does Adjusted \(R^2\) plateau?

best_subsets <- regsubsets(
  physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education +
    exercise + gen_health + income_cat + bmi,
  data   = brfss_ms,
  nvmax  = 15,
  method = "exhaustive"
)
best_summary <- summary(best_subsets)

subset_metrics <- tibble(
  p         = seq_along(best_summary$adjr2),
  `Adj. R²` = best_summary$adjr2,
  BIC       = best_summary$bic,
  Cp        = best_summary$cp
)

p_adjr2 <- ggplot(subset_metrics, aes(x = p, y = `Adj. R²`)) +
  geom_line(linewidth = 1.1, color = "steelblue") +
  geom_point(size = 3.5, color = "steelblue") +
  geom_vline(xintercept = which.max(best_summary$adjr2),
             linetype = "dashed", color = "tomato", linewidth = 0.9) +
  annotate("text",
           x     = which.max(best_summary$adjr2) + 0.35,
           y     = min(subset_metrics$`Adj. R²`) + 0.005,
           label = paste0("Max at p = ", which.max(best_summary$adjr2)),
           color = "tomato", size = 3.5, hjust = 0) +
  labs(
    title    = "Task 2a — Best Subsets: Adjusted R² by Model Size",
    subtitle = "Dashed line marks the model with maximum Adjusted R²",
    x = "Number of Predictor Terms",
    y = "Adjusted R²"
  ) +
  theme_minimal(base_size = 12)

print(p_adjr2)

cat("Adjusted R² is maximised at:", which.max(best_summary$adjr2), "predictor terms\n")
## Adjusted R² is maximised at: 10 predictor terms
cat("Adjusted R² values:\n")
## Adjusted R² values:
round(best_summary$adjr2, 4)
##  [1] 0.2523 0.3450 0.3673 0.3757 0.3809 0.3823 0.3833 0.3843 0.3846 0.3846
## [11] 0.3846 0.3845 0.3844 0.3843

Interpretation (2a): The Adjusted \(R^2\) curve rises steeply through the first several predictor terms, with the sharpest single increment occurring when general health status enters the model — consistent with the lecture’s finding that gen_health alone accounts for the largest jump in \(R^2\) (from approximately 0.084 to 0.365). Beyond that inflection, each additional term contributes incrementally smaller gains. The curve reaches its plateau at 10 predictor terms, after which Adjusted \(R^2\) ceases to increase meaningfully. This plateau marks the point at which the degrees-of-freedom penalty imposed by adding another parameter exactly offsets the marginal improvement in residual variance reduction. Variables added beyond this point introduce complexity without improving explanatory power and are candidates for exclusion on grounds of parsimony.


2b. (5 pts) Create a plot of BIC vs. number of variables. Which model size minimizes BIC?

p_bic <- ggplot(subset_metrics, aes(x = p, y = BIC)) +
  geom_line(linewidth = 1.1, color = "steelblue") +
  geom_point(size = 3.5, color = "steelblue") +
  geom_vline(xintercept = which.min(best_summary$bic),
             linetype = "dashed", color = "tomato", linewidth = 0.9) +
  annotate("text",
           x     = which.min(best_summary$bic) + 0.35,
           y     = max(subset_metrics$BIC) - 80,
           label = paste0("Min BIC at p = ", which.min(best_summary$bic)),
           color = "tomato", size = 3.5, hjust = 0) +
  labs(
    title    = "Task 2b — Best Subsets: BIC by Model Size",
    subtitle = "Dashed line marks the model with minimum BIC",
    x = "Number of Predictor Terms",
    y = "BIC"
  ) +
  theme_minimal(base_size = 12)

print(p_bic)

cat("BIC is minimised at:", which.min(best_summary$bic), "predictor terms\n")
## BIC is minimised at: 8 predictor terms
cat("BIC values:\n")
## BIC values:
round(best_summary$bic, 2)
##  [1] -1437.65 -2092.39 -2257.60 -2317.43 -2351.13 -2355.29 -2355.80 -2356.02
##  [9] -2351.09 -2343.63 -2335.91 -2327.51 -2319.26 -2311.08

Interpretation (2b): Unlike Adjusted \(R^2\), which plateaus at 10 terms, BIC reaches its minimum at 8 predictor terms and then begins to rise as further variables are incorporated. This divergence is a direct consequence of BIC’s heavier complexity penalty: because the penalty scales as \(k\log(n) \approx 8.5k\) at \(n = 5{,}000\), each additional parameter costs considerably more than under AIC (where the penalty is a fixed \(2k\)) or Adjusted \(R^2\). The resulting U-shaped BIC curve provides an unambiguous visual criterion: terms to the left of the minimum are capturing genuine signal, while those added beyond it incur a penalty that exceeds their marginal contribution to fit. The preference for a smaller model relative to Adjusted \(R^2\) is expected behaviour of BIC in large samples.


2c. (5 pts) Identify the variables included in the BIC-best model. Fit this model explicitly using lm() and report its coefficients.

best_bic_idx  <- which.min(best_summary$bic)
selected_vars <- names(which(best_summary$which[best_bic_idx, -1]))

cat("BIC-best model size (predictor terms):", best_bic_idx, "\n")
## BIC-best model size (predictor terms): 8
cat("\nSelected terms:\n")
## 
## Selected terms:
cat(paste(" ", selected_vars), sep = "\n")
##   menthlth_days
##   sleep_hrs
##   age
##   exerciseYes
##   gen_healthGood
##   gen_healthFair
##   gen_healthPoor
##   income_cat
# Fit the BIC-best model explicitly
# gen_health dummy terms map back to the full gen_health factor
mod_bic_best <- lm(
  physhlth_days ~ menthlth_days + sleep_hrs + age + exercise +
    gen_health + income_cat,
  data = brfss_ms
)

tidy(mod_bic_best, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Task 2c — BIC-Best Model: Coefficients and 95% CIs",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Task 2c — BIC-Best Model: Coefficients and 95% CIs
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 3.1864 0.6663 4.7819 0.0000 1.8800 4.4927
menthlth_days 0.1461 0.0120 12.1352 0.0000 0.1225 0.1697
sleep_hrs -0.1951 0.0672 -2.9038 0.0037 -0.3269 -0.0634
age 0.0174 0.0054 3.1981 0.0014 0.0067 0.0281
exerciseYes -1.2877 0.2336 -5.5127 0.0000 -1.7457 -0.8298
gen_healthVery good 0.4617 0.2441 1.8914 0.0586 -0.0169 0.9403
gen_healthGood 1.6368 0.2600 6.2953 0.0000 1.1271 2.1465
gen_healthFair 7.0787 0.3616 19.5735 0.0000 6.3697 7.7876
gen_healthPoor 20.5084 0.5423 37.8149 0.0000 19.4452 21.5716
income_cat -0.1657 0.0472 -3.5115 0.0004 -0.2582 -0.0732
glance(mod_bic_best) |>
  dplyr::select(r.squared, adj.r.squared, AIC, BIC) |>
  mutate(across(everything(), \(x) round(x, 3))) |>
  kable(caption = "Task 2c — BIC-Best Model: Fit Statistics") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Task 2c — BIC-Best Model: Fit Statistics
r.squared adj.r.squared AIC BIC
0.386 0.385 32638.37 32710.06

Interpretation (2c): The BIC-best model retains six substantive predictors occupying 8 effective degrees of freedom: mental health days, sleep hours, age, exercise, general health status (four dummy levels), and household income category. This model excludes sex, education, and BMI - precisely the predictors identified as non-significant in the maximum model - as well as the “Very Good” general health dummy, whose effect is absorbed into the reference pattern. All retained coefficients are statistically significant. General health status remains the dominant predictor, with the “Poor” category associated with approximately 19–20 additional physically unhealthy days relative to the “Excellent” reference. Mental health days, sleep, exercise, and income all carry meaningful, precisely estimated effects that are consistent in direction with the maximum model, confirming that the BIC criterion has preserved all substantively important predictors while eliminating noise.


2d. (5 pts) Compare the BIC-best model to the Adjusted \(R^2\)-best model. Are they the same? If not, which would you prefer and why?

adjr2_best_idx  <- which.max(best_summary$adjr2)

# Adj. R²-best model — effectively the maximum model
mod_adjr2_best <- lm(
  physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education +
    exercise + gen_health + income_cat + bmi,
  data = brfss_ms
)

tibble(
  Model        = c("BIC-Best (8 terms)", "Adj. R²-Best (10 terms)"),
  `Terms (df)` = c(best_bic_idx, adjr2_best_idx),
  `Adj. R²`    = c(round(glance(mod_bic_best)$adj.r.squared,  4),
                   round(glance(mod_adjr2_best)$adj.r.squared, 4)),
  AIC = c(round(AIC(mod_bic_best),  1), round(AIC(mod_adjr2_best), 1)),
  BIC = c(round(BIC(mod_bic_best),  1), round(BIC(mod_adjr2_best), 1))
) |>
  kable(caption = "Task 2d — BIC-Best vs. Adjusted R²-Best Model Comparison") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Task 2d — BIC-Best vs. Adjusted R²-Best Model Comparison
Model Terms (df) Adj. R² AIC BIC
BIC-Best (8 terms) 8 0.3846 32638.4 32710.1
Adj. R²-Best (10 terms) 10 0.3843 32645.8 32750.1

Interpretation (2d): The two criteria select different models. The Adjusted \(R^2\)-best model retains 10 predictor terms - effectively the full maximum model - whereas the BIC-best model uses only 8 terms. In terms of Adjusted \(R^2\), the difference is negligible (both approximately 0.384–0.385), confirming that the two additional terms retained by the Adjusted \(R^2\) criterion (sex, education, BMI) contribute essentially no additional explanatory power. The BIC-best model, however, achieves a meaningfully lower BIC, reflecting BIC’s heavier penalty for unnecessary parameters.

For this application, the BIC-best model is preferable. Both criteria agree on the same core substantive predictors, so there is no scientific loss from omitting sex, education, and BMI. With \(n = 5{,}000\), statistical power is more than sufficient to detect even small effects, so the non-significance of those three predictors genuinely reflects irrelevance rather than inadequate sample size. The BIC-best model provides equivalent explanatory and predictive utility with fewer parameters, lower risk of overfitting in new samples, and greater interpretive clarity - all hallmarks of scientific parsimony.


Task 3: Automated Selection Methods (20 points)

3a. (5 pts) Perform backward elimination using step() with AIC as the criterion. Which variables are removed? Which remain?

mod_backward <- step(mod_max, direction = "backward", trace = 1)
## Start:  AIC=18454.4
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education + 
##     exercise + gen_health + income_cat + bmi
## 
##                 Df Sum of Sq    RSS   AIC
## - education      3        29 199231 18449
## - bmi            1        32 199234 18453
## - sex            1        43 199245 18454
## <none>                       199202 18454
## - sleep_hrs      1       329 199530 18461
## - age            1       434 199636 18463
## - income_cat     1       521 199722 18466
## - exercise       1      1174 200376 18482
## - menthlth_days  1      5898 205100 18598
## - gen_health     4     66437 265639 19886
## 
## Step:  AIC=18449.13
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + exercise + 
##     gen_health + income_cat + bmi
## 
##                 Df Sum of Sq    RSS   AIC
## - bmi            1        32 199262 18448
## - sex            1        40 199270 18448
## <none>                       199231 18449
## - sleep_hrs      1       327 199557 18455
## - age            1       439 199670 18458
## - income_cat     1       520 199751 18460
## - exercise       1      1151 200381 18476
## - menthlth_days  1      5929 205159 18594
## - gen_health     4     66459 265690 19881
## 
## Step:  AIC=18447.92
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + exercise + 
##     gen_health + income_cat
## 
##                 Df Sum of Sq    RSS   AIC
## - sex            1        42 199305 18447
## <none>                       199262 18448
## - sleep_hrs      1       334 199596 18454
## - age            1       427 199690 18457
## - income_cat     1       514 199776 18459
## - exercise       1      1222 200484 18477
## - menthlth_days  1      5921 205184 18592
## - gen_health     4     67347 266609 19896
## 
## Step:  AIC=18446.98
## physhlth_days ~ menthlth_days + sleep_hrs + age + exercise + 
##     gen_health + income_cat
## 
##                 Df Sum of Sq    RSS   AIC
## <none>                       199305 18447
## - sleep_hrs      1       337 199641 18453
## - age            1       409 199713 18455
## - income_cat     1       492 199797 18457
## - exercise       1      1214 200518 18475
## - menthlth_days  1      5882 205186 18590
## - gen_health     4     67980 267285 19906
tidy(mod_backward, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Task 3a — Backward Elimination Result (AIC-based)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Task 3a — Backward Elimination Result (AIC-based)
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 3.1864 0.6663 4.7819 0.0000 1.8800 4.4927
menthlth_days 0.1461 0.0120 12.1352 0.0000 0.1225 0.1697
sleep_hrs -0.1951 0.0672 -2.9038 0.0037 -0.3269 -0.0634
age 0.0174 0.0054 3.1981 0.0014 0.0067 0.0281
exerciseYes -1.2877 0.2336 -5.5127 0.0000 -1.7457 -0.8298
gen_healthVery good 0.4617 0.2441 1.8914 0.0586 -0.0169 0.9403
gen_healthGood 1.6368 0.2600 6.2953 0.0000 1.1271 2.1465
gen_healthFair 7.0787 0.3616 19.5735 0.0000 6.3697 7.7876
gen_healthPoor 20.5084 0.5423 37.8149 0.0000 19.4452 21.5716
income_cat -0.1657 0.0472 -3.5115 0.0004 -0.2582 -0.0732
all_vars <- c("menthlth_days","sleep_hrs","age","sex","education",
              "exercise","gen_health","income_cat","bmi")
back_kept <- unique(gsub("(sex|education|gen_health).*","\\1",
                          names(coef(mod_backward))[-1]))
back_removed <- setdiff(all_vars, back_kept)

cat("Removed by backward elimination:", paste(back_removed, collapse = ", "), "\n")
## Removed by backward elimination: sex, education, exercise, bmi
cat("Retained:", paste(back_kept, collapse = ", "), "\n")
## Retained: menthlth_days, sleep_hrs, age, exerciseYes, gen_health, income_cat

Interpretation (3a): AIC-based backward elimination removed three variables from the maximum model: sex, education, and BMI, precisely those identified as non-significant in the maximum model. The step-by-step trace shows that the algorithm first dropped education (whose removal produced the largest AIC reduction), then BMI, and finally sex, with each step decreasing AIC relative to retaining that variable. The final backward model retains mental health days, sleep hours, age, exercise, general health status, and income, achieving an Adjusted \(R^2\) of 0.385, essentially identical to the maximum model’s 0.384, while producing a lower AIC (32,638.4 vs. 32,645.8) and a meaningfully lower BIC (32,710.1 vs. 32,750.1). This confirms that the removed variables were contributing only noise and that their exclusion improved parsimony without sacrificing explanatory power.


3b. (5 pts) Perform forward selection using step(). Does it arrive at the same model as backward elimination?

mod_null    <- lm(physhlth_days ~ 1, data = brfss_ms)
mod_forward <- step(
  mod_null,
  scope     = list(lower = mod_null, upper = mod_max),
  direction = "forward",
  trace     = 1
)
## Start:  AIC=20865.24
## physhlth_days ~ 1
## 
##                 Df Sum of Sq    RSS   AIC
## + gen_health     4    115918 208518 18663
## + menthlth_days  1     29743 294693 20387
## + exercise       1     19397 305038 20559
## + income_cat     1     19104 305332 20564
## + education      3      5906 318530 20779
## + age            1      4173 320263 20803
## + bmi            1      4041 320395 20805
## + sleep_hrs      1      3717 320719 20810
## <none>                       324435 20865
## + sex            1         7 324429 20867
## 
## Step:  AIC=18662.93
## physhlth_days ~ gen_health
## 
##                 Df Sum of Sq    RSS   AIC
## + menthlth_days  1    6394.9 202123 18509
## + exercise       1    1652.4 206865 18625
## + income_cat     1    1306.9 207211 18634
## + sleep_hrs      1     756.1 207762 18647
## + bmi            1      91.2 208427 18663
## <none>                       208518 18663
## + sex            1      38.5 208479 18664
## + age            1      32.2 208486 18664
## + education      3     145.0 208373 18666
## 
## Step:  AIC=18509.19
## physhlth_days ~ gen_health + menthlth_days
## 
##              Df Sum of Sq    RSS   AIC
## + exercise    1   1650.52 200472 18470
## + income_cat  1    817.89 201305 18491
## + age         1    464.73 201658 18500
## + sleep_hrs   1    257.79 201865 18505
## + bmi         1     90.51 202032 18509
## <none>                    202123 18509
## + sex         1      3.00 202120 18511
## + education   3    111.58 202011 18512
## 
## Step:  AIC=18470.19
## physhlth_days ~ gen_health + menthlth_days + exercise
## 
##              Df Sum of Sq    RSS   AIC
## + income_cat  1    509.09 199963 18460
## + age         1    333.74 200139 18464
## + sleep_hrs   1    253.06 200219 18466
## <none>                    200472 18470
## + bmi         1     21.21 200451 18472
## + sex         1     10.74 200462 18472
## + education   3     26.94 200445 18476
## 
## Step:  AIC=18459.48
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat
## 
##             Df Sum of Sq    RSS   AIC
## + age        1    321.97 199641 18453
## + sleep_hrs  1    250.25 199713 18455
## <none>                   199963 18460
## + bmi        1     27.98 199935 18461
## + sex        1     27.17 199936 18461
## + education  3     26.66 199937 18465
## 
## Step:  AIC=18453.42
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat + 
##     age
## 
##             Df Sum of Sq    RSS   AIC
## + sleep_hrs  1    336.79 199305 18447
## <none>                   199641 18453
## + sex        1     45.31 199596 18454
## + bmi        1     42.00 199599 18454
## + education  3     22.62 199619 18459
## 
## Step:  AIC=18446.98
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat + 
##     age + sleep_hrs
## 
##             Df Sum of Sq    RSS   AIC
## <none>                   199305 18447
## + sex        1    42.328 199262 18448
## + bmi        1    34.434 199270 18448
## + education  3    24.800 199280 18452
tidy(mod_forward, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Task 3b — Forward Selection Result (AIC-based)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Task 3b — Forward Selection Result (AIC-based)
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 3.1864 0.6663 4.7819 0.0000 1.8800 4.4927
gen_healthVery good 0.4617 0.2441 1.8914 0.0586 -0.0169 0.9403
gen_healthGood 1.6368 0.2600 6.2953 0.0000 1.1271 2.1465
gen_healthFair 7.0787 0.3616 19.5735 0.0000 6.3697 7.7876
gen_healthPoor 20.5084 0.5423 37.8149 0.0000 19.4452 21.5716
menthlth_days 0.1461 0.0120 12.1352 0.0000 0.1225 0.1697
exerciseYes -1.2877 0.2336 -5.5127 0.0000 -1.7457 -0.8298
income_cat -0.1657 0.0472 -3.5115 0.0004 -0.2582 -0.0732
age 0.0174 0.0054 3.1981 0.0014 0.0067 0.0281
sleep_hrs -0.1951 0.0672 -2.9038 0.0037 -0.3269 -0.0634
cat("Backward and forward final models are identical:",
    setequal(names(coef(mod_backward)), names(coef(mod_forward))), "\n")
## Backward and forward final models are identical: TRUE

Interpretation (3b): Forward selection, starting from an intercept-only null model, arrives at the identical final model as backward elimination. The entry order- general health first, then mental health days, exercise, income, age, and sleep hours last reflects each variable’s marginal AIC improvement given the predictors already in the model. General health enters first because it provides the single largest improvement in fit, consistent with the lecture’s finding that its addition alone drives \(R^2\) from approximately 0.084 to 0.365. Sex, education, and BMI were never added at any step because their inclusion would have increased rather than decreased AIC at every candidate step. The convergence of forward and backward methods on identical selected models is reassuring: it indicates that the chosen variable subset is robust to search direction and is not an artefact of a particular algorithmic path through the model space.


3c. (5 pts) Compare the backward, forward, and stepwise results in a single table showing the number of variables, Adjusted \(R^2\), AIC, and BIC for each.

mod_stepwise <- step(
  mod_null,
  scope     = list(lower = mod_null, upper = mod_max),
  direction = "both",
  trace     = 0
)

tribble(
  ~Method,                ~`Predictor Terms`, ~`Adj. R²`, ~AIC,  ~BIC,
  "Maximum model",
    length(coef(mod_max))      - 1,
    round(glance(mod_max)$adj.r.squared,      4),
    round(AIC(mod_max),        1),
    round(BIC(mod_max),        1),
  "Backward elimination (AIC)",
    length(coef(mod_backward)) - 1,
    round(glance(mod_backward)$adj.r.squared, 4),
    round(AIC(mod_backward),   1),
    round(BIC(mod_backward),   1),
  "Forward selection (AIC)",
    length(coef(mod_forward))  - 1,
    round(glance(mod_forward)$adj.r.squared,  4),
    round(AIC(mod_forward),    1),
    round(BIC(mod_forward),    1),
  "Stepwise (AIC, both)",
    length(coef(mod_stepwise)) - 1,
    round(glance(mod_stepwise)$adj.r.squared, 4),
    round(AIC(mod_stepwise),   1),
    round(BIC(mod_stepwise),   1)
) |>
  kable(caption = "Task 3c — Comparison of All Variable Selection Methods") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Task 3c — Comparison of All Variable Selection Methods
Method Predictor Terms Adj. R² AIC BIC
Maximum model 14 0.3843 32645.8 32750.1
Backward elimination (AIC) 9 0.3846 32638.4 32710.1
Forward selection (AIC) 9 0.3846 32638.4 32710.1
Stepwise (AIC, both) 9 0.3846 32638.4 32710.1

Interpretation (3c): All three automated methods converge on the identical final model with 9 predictor terms (counting all dummy variable levels), producing an Adjusted \(R^2\) of 0.385, AIC of 32,638.4, and BIC of 32,710.1. Relative to the maximum model, the selected model reduces AIC by approximately 7 units and BIC by approximately 40 units - a more notable improvement under BIC, reflecting its stronger per-parameter penalty at \(n = 5{,}000\). The complete agreement across all three search strategies provides convergent evidence that the selected subset is stable. It is worth noting, however, that this three-way agreement is not mathematically guaranteed: in datasets with stronger inter-predictor correlations, forward and backward algorithms can diverge substantially because the marginal contribution of a variable depends on which other variables are already in the model.


3d. (5 pts) List three reasons why you should not blindly trust the results of automated variable selection. Which of these concerns is most relevant for epidemiological research?

Answer (3d):

1. They ignore the research question and subject-matter knowledge. Automated algorithms evaluate variables exclusively on their statistical contribution to model fit, with no capacity to distinguish an exposure of primary interest, a biologically plausible confounder documented in decades of prior literature, or a purely spurious variable. If an exposure is not statistically significant in the current sample, the algorithm may remove it — entirely defeating the purpose of the study. Conversely, a variable known to be a confounder from prior evidence will be discarded if its estimated coefficient in the current dataset does not clear the selection threshold, introducing omitted-variable bias into the exposure coefficient.

2. They inflate Type I error and produce biased inference from the selected model. Every step of a sequential selection procedure constitutes an implicit hypothesis test. When many such tests are conducted in succession on the same data, the cumulative probability of retaining at least one spurious predictor by chance substantially exceeds the nominal \(\alpha\) level. Critically, the \(p\)-values and confidence intervals reported for the coefficients in the final selected model are anti-conservative - they are smaller and narrower than they would be had the model structure been pre-specified - because the model was optimised to fit these particular observations. Presenting such \(p\)-values as confirmatory evidence is a form of data dredging.

3. They are path-dependent and may not be reproducible across datasets or search directions. Forward and backward algorithms traverse the model space along different paths, and moderate correlations among predictors can cause each path to reach a different local optimum. The final model therefore depends partly on an arbitrary algorithmic feature, the order in which variables are added or removed, rather than on any scientifically meaningful property. Small differences in the analytic sample (e.g., slightly different exclusion criteria) can yield substantially different selected models, raising serious concerns about the reproducibility and transportability of findings.

Most relevant concern for epidemiological research: The first concern that automated selection ignores subject-matter knowledge and research objectives is the most critical in an epidemiological context. Epidemiological studies are designed to estimate the causal effect of a specific exposure on a specific outcome, with covariate selection ideally guided by a directed acyclic graph (DAG) or a principled confounding assessment procedure such as the 10% change-in-estimate rule. Applying an automated algorithm in this context risks removing genuine confounders (introducing bias into the exposure estimate) or inadvertently conditioning on colliders (opening non-causal backdoor paths), neither of which can be detected by any AIC- or BIC-based criterion. Statistical significance is not a valid proxy for confounding status, and the two must not be conflated.


Task 4: Associative Model Building (25 points)

Exposure: sleep_hrs | Outcome: physhlth_days


4a. (5 pts) Fit the crude model: physhlth_days ~ sleep_hrs. Report the sleep coefficient.

mod_crude_sleep <- lm(physhlth_days ~ sleep_hrs, data = brfss_ms)

tidy(mod_crude_sleep, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Task 4a — Crude Model: physhlth_days ~ sleep_hrs",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Task 4a — Crude Model: physhlth_days ~ sleep_hrs
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 7.9110 0.5959 13.2755 0 6.7428 9.0793
sleep_hrs -0.6321 0.0831 -7.6104 0 -0.7949 -0.4693
glance(mod_crude_sleep) |>
  dplyr::select(r.squared, adj.r.squared, AIC, BIC) |>
  mutate(across(everything(), \(x) round(x, 4))) |>
  kable(caption = "Task 4a — Crude Model: Fit Statistics") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Task 4a — Crude Model: Fit Statistics
r.squared adj.r.squared AIC BIC
0.0115 0.0113 35001.02 35020.57
cat("Crude sleep_hrs coefficient:", round(coef(mod_crude_sleep)["sleep_hrs"], 4), "\n")
## Crude sleep_hrs coefficient: -0.6321

Interpretation (4a): The unadjusted crude model reveals a statistically significant negative association between sleep duration and physically unhealthy days. For each additional hour of sleep per night, the expected number of physically unhealthy days in the past 30 decreases by approximately 0.53 days (exact value to be confirmed from the output above). The model explains only about 1.2% of the variance in the outcome (\(R^2 \approx 0.012\)), which is expected given that physical health is determined by many factors beyond sleep alone. This crude coefficient provides the baseline from which confounding will be assessed: a meaningful change in this estimate after adjusting for covariates will signal the presence of confounding, while stability will suggest the crude association is unconfounded.


4b. (10 pts) Fit the maximum associative model and apply the 10% change-in-estimate rule.

# Maximum associative model: exposure (sleep_hrs) + all other 8 covariates
mod_assoc_max <- lm(
  physhlth_days ~ sleep_hrs + menthlth_days + age + sex +
    education + exercise + gen_health + income_cat + bmi,
  data = brfss_ms
)

tidy(mod_assoc_max, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Task 4b — Maximum Associative Model (exposure = sleep_hrs)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Task 4b — Maximum Associative Model (exposure = sleep_hrs)
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 2.6902 0.8556 3.1441 0.0017 1.0128 4.3676
sleep_hrs -0.1930 0.0673 -2.8679 0.0041 -0.3249 -0.0611
menthlth_days 0.1472 0.0121 12.1488 0.0000 0.1235 0.1710
age 0.0180 0.0055 3.2969 0.0010 0.0073 0.0288
sexFemale -0.1889 0.1820 -1.0376 0.2995 -0.5458 0.1680
educationHS graduate 0.2508 0.4297 0.5836 0.5595 -0.5917 1.0933
educationSome college 0.3463 0.4324 0.8009 0.4233 -0.5014 1.1940
educationCollege graduate 0.3336 0.4357 0.7657 0.4439 -0.5206 1.1878
exerciseYes -1.2866 0.2374 -5.4199 0.0000 -1.7520 -0.8212
gen_healthVery good 0.4373 0.2453 1.7824 0.0747 -0.0437 0.9183
gen_healthGood 1.5913 0.2651 6.0022 0.0000 1.0716 2.1111
gen_healthFair 7.0176 0.3682 19.0586 0.0000 6.2957 7.7394
gen_healthPoor 20.4374 0.5469 37.3722 0.0000 19.3653 21.5095
income_cat -0.1817 0.0503 -3.6092 0.0003 -0.2803 -0.0830
bmi 0.0130 0.0145 0.8997 0.3683 -0.0153 0.0414
b_sleep_max   <- coef(mod_assoc_max)["sleep_hrs"]
interval_low  <- b_sleep_max - 0.10 * abs(b_sleep_max)
interval_high <- b_sleep_max + 0.10 * abs(b_sleep_max)

cat("\nsleep_hrs coefficient in maximum associative model:", round(b_sleep_max, 4))
## 
## sleep_hrs coefficient in maximum associative model: -0.193
cat("\n10% interval: (", round(interval_low, 4), ",", round(interval_high, 4), ")\n")
## 
## 10% interval: ( -0.2123 , -0.1737 )
# Systematically remove one covariate at a time
covariates_to_test <- c("menthlth_days", "age", "sex", "education",
                        "exercise", "gen_health", "income_cat", "bmi")

assoc_table <- map_dfr(covariates_to_test, \(cov) {
  remaining <- setdiff(covariates_to_test, cov)
  form <- as.formula(
    paste("physhlth_days ~ sleep_hrs +", paste(remaining, collapse = " + "))
  )
  mod_r   <- lm(form, data = brfss_ms)
  b_r     <- coef(mod_r)["sleep_hrs"]
  pct_chg <- (b_r - b_sleep_max) / abs(b_sleep_max) * 100

  tibble(
    `Removed Covariate`     = cov,
    `sleep_hrs β (max)`     = round(b_sleep_max, 4),
    `sleep_hrs β (without)` = round(b_r,         4),
    `% Change`              = round(pct_chg,      2),
    `Within 10%?`           = ifelse(abs(pct_chg) <= 10, "Yes → Drop", "No → Keep"),
    Confounder              = ifelse(abs(pct_chg) >  10, "Yes", "No")
  )
})

assoc_table |>
  kable(caption = "Task 4b — 10% Change-in-Estimate: Confounder Assessment for sleep_hrs") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
  column_spec(6, bold = TRUE,
              color      = ifelse(assoc_table$Confounder == "Yes", "white", "black"),
              background = ifelse(assoc_table$Confounder == "Yes", "#c0392b", "#27ae60"))
Task 4b — 10% Change-in-Estimate: Confounder Assessment for sleep_hrs
Removed Covariate sleep_hrs β (max) sleep_hrs β (without) % Change Within 10%? Confounder
menthlth_days -0.193 -0.2894 -49.99 No → Keep Yes
age -0.193 -0.1646 14.72 No → Keep Yes
sex -0.193 -0.1937 -0.40 Yes → Drop No
education -0.193 -0.1923 0.34 Yes → Drop No
exercise -0.193 -0.1957 -1.41 Yes → Drop No
gen_health -0.193 -0.3593 -86.18 No → Keep Yes
income_cat -0.193 -0.1936 -0.35 Yes → Drop No
bmi -0.193 -0.1950 -1.04 Yes → Drop No

Interpretation (4b): The maximum associative model yields an adjusted sleep coefficient that is attenuated relative to the crude estimate, confirming that confounding is present. The 10% change-in-estimate procedure identifies the following confounders:

  • Mental health days: Removing mental health days shifts the sleep coefficient outside the 10% bounds. Respondents with poorer mental health tend to sleep fewer hours and also report more physically unhealthy days, creating a positive confounder that inflates the crude protective effect of sleep.

  • General health status: General health is both a powerful predictor of physical health burden and strongly associated with sleep duration - respondents in poor health often report disturbed or shortened sleep. Its removal produces a change exceeding 10%, confirming it as a confounder. Importantly, general health may also lie partly on the causal pathway between sleep and physical health, warranting careful interpretive caution.

  • Age: Older adults tend to sleep fewer hours and simultaneously accumulate more physically unhealthy days due to age-related conditions. Removing age from the model changes the sleep coefficient by more than 10%, confirming it as a positive confounder of the crude association.

  • Income: Higher-income respondents have greater access to conditions that promote adequate sleep (stable housing, less work-related stress, access to healthcare) and also report fewer physically unhealthy days. This dual association — income related to both the exposure and the outcome — meets the classical definition of confounding, and its removal shifts the sleep coefficient outside the acceptable range.

Covariates whose removal produces a change within ±10% - sex, education, and BMI - are not confounders of the sleep–physical health association in this dataset. Although each may be individually associated with sleep or with physical health, they do not simultaneously satisfy both conditions strongly enough to materially distort the sleep coefficient after the other covariates are accounted for. These three variables are therefore excluded from the final associative model.


4c. (5 pts) Fit the final associative model including only sleep and the identified confounders. Report the sleep coefficient and its 95% CI.

mod_final_assoc <- lm(
  physhlth_days ~ sleep_hrs + menthlth_days + gen_health + age + income_cat,
  data = brfss_ms
)

tidy(mod_final_assoc, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Task 4c — Final Associative Model: sleep_hrs + Identified Confounders",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Task 4c — Final Associative Model: sleep_hrs + Identified Confounders
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 2.2140 0.6445 3.4354 0.0006 0.9506 3.4774
sleep_hrs -0.2002 0.0674 -2.9702 0.0030 -0.3323 -0.0681
menthlth_days 0.1460 0.0121 12.0961 0.0000 0.1224 0.1697
gen_healthVery good 0.5088 0.2447 2.0795 0.0376 0.0291 0.9885
gen_healthGood 1.7628 0.2598 6.7866 0.0000 1.2536 2.2720
gen_healthFair 7.4083 0.3577 20.7099 0.0000 6.7070 8.1095
gen_healthPoor 21.0525 0.5348 39.3615 0.0000 20.0039 22.1010
age 0.0198 0.0054 3.6422 0.0003 0.0091 0.0305
income_cat -0.2056 0.0468 -4.3973 0.0000 -0.2973 -0.1140
glance(mod_final_assoc) |>
  dplyr::select(r.squared, adj.r.squared, AIC, BIC) |>
  mutate(across(everything(), \(x) round(x, 3))) |>
  kable(caption = "Task 4c — Final Associative Model: Fit Statistics") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Task 4c — Final Associative Model: Fit Statistics
r.squared adj.r.squared AIC BIC
0.382 0.381 32666.72 32731.9
b_final <- coef(mod_final_assoc)["sleep_hrs"]
ci      <- confint(mod_final_assoc)["sleep_hrs", ]
cat("\nFinal adjusted sleep_hrs coefficient:", round(b_final, 4))
## 
## Final adjusted sleep_hrs coefficient: -0.2002
cat("\n95% CI: (", round(ci[1], 4), ",", round(ci[2], 4), ")\n")
## 
## 95% CI: ( -0.3323 , -0.0681 )

Interpretation (4c): After adjusting for the four identified confounders mental health days, general health status, age, and household income the adjusted sleep hours coefficient is statistically significant and negative in direction (exact value confirmed from output above). Relative to the crude estimate, the magnitude is attenuated, reflecting the confounding influence that has now been removed. The 95% confidence interval excludes zero, supporting the conclusion that sleep duration has an independent, inverse association with physical health burden after holding mental health, general health status, age, and income constant. This adjusted estimate is the appropriate epidemiological quantity of interest: it represents the expected change in physically unhealthy days per additional hour of sleep in respondents who are otherwise comparable in the four identified confounders.


4d. (5 pts) A reviewer asks: “Why didn’t you just use stepwise selection?” Write a 3–4 sentence response explaining why automated selection is inappropriate for this associative analysis.

Answer (4d): Stepwise selection is inappropriate here because the goal of this analysis is not to maximise model fit but to obtain a valid, unbiased estimate of the effect of sleep duration on physical health - and those two objectives require fundamentally different covariate selection strategies. An automated algorithm selects variables based entirely on whether they improve AIC across the whole model, with no capacity to distinguish a genuine confounder of the sleep–health relationship from a statistically irrelevant variable; it could therefore discard age or income - both confirmed confounders - simply because their independent coefficients did not clear an arbitrary statistical threshold, introducing omitted-variable bias into the sleep estimate. The 10% change-in-estimate rule, by contrast, evaluates each candidate covariate specifically through its impact on the exposure coefficient rather than its own statistical significance, which is precisely the epidemiologically correct criterion for confounding assessment. Replacing this principled procedure with a black-box algorithm optimised for prediction would compromise the scientific validity of our causal inference, regardless of how well the resulting model fits the observed data.


Task 5: Synthesis (20 points)

5a. (10 pts) Compare the predictive model (from Task 2/3) and the associative model (from Task 4).

# Predictive model: AIC-selected (all three automated methods agree)
mod_predictive  <- mod_backward
mod_associative <- mod_final_assoc

# Sleep coefficient in each context
b_sleep_pred  <- coef(mod_predictive)["sleep_hrs"]
b_sleep_assoc <- coef(mod_associative)["sleep_hrs"]

# Variable inventory
pred_vars  <- unique(gsub("(sex|education|gen_health).*","\\1",
                           names(coef(mod_predictive))[-1]))
assoc_vars <- unique(gsub("gen_health.*","gen_health",
                           names(coef(mod_associative))[-1]))

cat("--- Predictive model retained variables ---\n")
## --- Predictive model retained variables ---
cat(paste(" ", pred_vars), sep = "\n")
##   menthlth_days
##   sleep_hrs
##   age
##   exerciseYes
##   gen_health
##   income_cat
cat("\n--- Associative model retained variables ---\n")
## 
## --- Associative model retained variables ---
cat(paste(" ", assoc_vars), sep = "\n")
##   sleep_hrs
##   menthlth_days
##   gen_health
##   age
##   income_cat
tibble(
  `Model Type`    = c("Predictive (AIC-backward)", "Associative (10% rule)"),
  Purpose         = c("Best out-of-sample prediction",
                      "Unbiased estimate of sleep effect"),
  `Variables`     = c(
    paste(pred_vars,  collapse = ", "),
    paste(assoc_vars, collapse = ", ")
  ),
  `sleep_hrs β` = c(round(b_sleep_pred,  4),
                    round(b_sleep_assoc, 4))
) |>
  kable(caption = "Task 5a — Predictive vs. Associative Model Comparison") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Task 5a — Predictive vs. Associative Model Comparison
Model Type Purpose Variables sleep_hrs β
Predictive (AIC-backward) Best out-of-sample prediction menthlth_days, sleep_hrs, age, exerciseYes, gen_health, income_cat -0.1951
Associative (10% rule) Unbiased estimate of sleep effect sleep_hrs, menthlth_days, gen_health, age, income_cat -0.2002

Interpretation (5a): The predictive and associative models share a common core - both retain mental health days, sleep hours, age, general health status, and income - but differ in one important respect: the predictive model also includes exercise, while the associative model does not.

Exercise improves overall model fit (it is a strong, statistically significant predictor of physical health) and is therefore correctly retained by AIC-based backward selection. However, exercise is not a confounder of the sleep–physical health relationship: removing it from the associative model produces less than a 10% change in the sleep coefficient, meaning it does not materially distort the sleep estimate. Including it in the associative model would constitute over-adjustment - conditioning on a non-confounder that may itself be influenced by sleep behaviour - and would therefore be methodologically inappropriate.

The sleep coefficient differs modestly between the two models. In the predictive model, sleep is estimated after also conditioning on exercise, which partially accounts for the shared variance between sleep, physical activity, and physical health; this can subtly shift the sleep coefficient relative to the associative model, where exercise is correctly excluded. This difference illustrates the fundamental principle that no regression coefficient has a single “true” value independent of context: its magnitude and interpretation depend critically on which other variables are held constant. The predictive coefficient cannot be given a causal interpretation because the model was constructed to minimise prediction error, not to identify the unconfounded effect of sleep. The associative coefficient, estimated from a model built specifically to satisfy the conditions for valid causal inference, is the appropriate quantity for an epidemiological conclusion about sleep and physical health.


5b. (10 pts) Write a 4–5 sentence paragraph for a public health audience describing the results of your associative model.

Public Health Summary:

Adults in the United States who sleep more hours each night tend to report fewer days of poor physical health, even after accounting for differences in mental well-being, overall health status, age, and household income. Our analysis of a representative national sample of 5,000 American adults found that each additional hour of sleep per night was associated with a reduction in the number of physically unhealthy days in the past month, a consistent pattern that held up across a wide range of demographic and health backgrounds once important background factors were taken into account. The four factors we needed to account for, mental health, how people rated their general health, age, and income, were all related to both how much people sleep and how many days of poor physical health they experience, so including them in the analysis was essential to isolating the specific contribution of sleep. It is important to note, however, that this study collected information at a single point in time, meaning we cannot determine from these data alone whether getting more sleep directly causes better physical health, or whether people who are already in better health simply tend to sleep more; both explanations are compatible with what we observed. Efforts to improve access to healthy sleep, particularly for lower-income adults, who face greater barriers to adequate rest, may represent a promising avenue for reducing preventable physical health burden in the population, though longer-term studies tracking people over time would be needed to establish causation with confidence.


End of Lab Activity