Introduction

In previous lectures, we learned how to fit multiple linear regression models, include dummy variables for categorical predictors, test for interactions, and assess confounding. But we have not yet addressed a fundamental question: how do we decide which variables belong in the model?

This question has different answers depending on the goal of the analysis:

Goal	What matters	Variable selection driven by
Prediction	Model accuracy and reliability in new data	Statistical criteria (Adj. \(R^2\), AIC, BIC, cross-validation)
Association	Validity of the exposure coefficient	Subject-matter knowledge, confounding assessment, 10% rule

In predictive modeling, we search for the subset of variables that best predicts \(Y\) without overfitting. In associative modeling, the exposure variable is always in the model, and we decide which covariates to include based on whether they are confounders.
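As a rough illustration of the 10% rule listed in the table above, the sketch below compares a crude and an adjusted exposure coefficient on simulated data. The variable names (`conf_var`, `exposure`, `outcome`) and effect sizes are hypothetical, and dividing by the adjusted estimate is one common convention for the percent change, assumed here rather than taken from this lecture.

```r
# Simulated data (hypothetical variables): conf_var affects both the
# exposure and the outcome, so it confounds their association.
set.seed(42)
n        <- 2000
conf_var <- rnorm(n)
exposure <- 0.8 * conf_var + rnorm(n)
outcome  <- 1.0 * exposure + 2.0 * conf_var + rnorm(n)

crude    <- lm(outcome ~ exposure)             # exposure only
adjusted <- lm(outcome ~ exposure + conf_var)  # exposure + candidate confounder

b_crude <- unname(coef(crude)["exposure"])
b_adj   <- unname(coef(adjusted)["exposure"])

# Percent change in the exposure coefficient when the covariate is added
# (assumption: the adjusted estimate is used as the denominator)
pct_change <- 100 * abs(b_crude - b_adj) / abs(b_adj)

# Flag the covariate as a confounder if the estimate moves by more than 10%
is_confounder <- pct_change > 10
c(crude = b_crude, adjusted = b_adj, pct_change = pct_change)
```

Note that the exposure stays in both models; only the covariate is being evaluated.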

This lecture covers both approaches, with emphasis on when each is appropriate and the pitfalls of automated selection.

Setup and Data

library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(car)
library(leaps)
library(MASS)        # attached after tidyverse, so it masks dplyr::select(); code below calls dplyr::select() explicitly
library(gridExtra)   # for grid.arrange()

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

The BRFSS 2020 Dataset

We continue with the BRFSS 2020 dataset, predicting physically unhealthy days from a pool of candidate predictors.

brfss_full <- read_xpt(
  "LLCP2020.XPT"
) |>
  clean_names()

brfss_ms <- brfss_full |>
  mutate(
    # Outcome
    physhlth_days = case_when(
      physhlth == 88                  ~ 0,
      physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
      TRUE                           ~ NA_real_
    ),
    # Candidate predictors
    menthlth_days = case_when(
      menthlth == 88                  ~ 0,
      menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
      TRUE                           ~ NA_real_
    ),
    sleep_hrs = case_when(
      sleptim1 >= 1 & sleptim1 <= 14 ~ as.numeric(sleptim1),
      TRUE                           ~ NA_real_
    ),
    age = age80,
    sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
    education = factor(case_when(
      educa %in% c(1, 2, 3) ~ "Less than HS",
      educa == 4             ~ "HS graduate",
      educa == 5             ~ "Some college",
      educa == 6             ~ "College graduate",
      TRUE                   ~ NA_character_
    ), levels = c("Less than HS", "HS graduate", "Some college", "College graduate")),
    exercise = factor(case_when(
      exerany2 == 1 ~ "Yes",
      exerany2 == 2 ~ "No",
      TRUE          ~ NA_character_
    ), levels = c("No", "Yes")),
    gen_health = factor(case_when(
      genhlth == 1 ~ "Excellent",
      genhlth == 2 ~ "Very good",
      genhlth == 3 ~ "Good",
      genhlth == 4 ~ "Fair",
      genhlth == 5 ~ "Poor",
      TRUE         ~ NA_character_
    ), levels = c("Excellent", "Very good", "Good", "Fair", "Poor")),
    income_cat = case_when(
      income2 %in% 1:8 ~ as.numeric(income2),
      TRUE             ~ NA_real_
    ),
    bmi = ifelse(bmi5 > 0, bmi5 / 100, NA_real_)
  ) |>
  filter(
    !is.na(physhlth_days), !is.na(menthlth_days), !is.na(sleep_hrs),
    !is.na(age), age >= 18, !is.na(sex), !is.na(education),
    !is.na(exercise), !is.na(gen_health), !is.na(income_cat), !is.na(bmi)
  )

set.seed(1220)
brfss_ms <- brfss_ms |>
  dplyr::select(physhlth_days, menthlth_days, sleep_hrs, age, sex,
                education, exercise, gen_health, income_cat, bmi) |>
  slice_sample(n = 5000)

# Save for lab
saveRDS(brfss_ms,
  "brfss_ms_2020.rds")

tibble(Metric = c("Observations", "Variables"),
       Value  = c(nrow(brfss_ms), ncol(brfss_ms))) |>
  kable(caption = "Analytic Dataset Dimensions") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Analytic Dataset Dimensions
Metric	Value
Observations	5000
Variables	10

We have 10 variables: 1 outcome and 9 candidate predictors. If we considered all possible subsets of the 9 predictors (ignoring interactions and transformations), there would be \(2^9 - 1 = 511\) possible models.
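The count of candidate models can be checked directly. This short sketch sums the number of subsets of each size; it only illustrates the arithmetic and does not fit any models.

```r
k <- 9                        # number of candidate predictors
n_by_size <- choose(k, 1:k)   # number of models with exactly 1, 2, ..., k predictors
total <- sum(n_by_size)       # all non-empty subsets

n_by_size
total
```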

Part 1: Guided Practice — Model Selection

1. Building the Maximum Model

1.1 What Is the Maximum Model?

The maximum model is the model that includes all candidate predictor variables. It represents the upper bound of complexity. The “correct” model will have \(p \leq k\) predictors, where \(k\) is the number in the maximum model.

The candidate variables in the maximum model can include:

Main effects (e.g., age, sex, BMI)
Higher-order terms (e.g., age\(^2\), age\(^3\))
Transformations (e.g., log(BMI))
Interactions (e.g., sex \(\times\) age)

These candidates are chosen based on a literature search and the research question, not by throwing in every available variable.

# The maximum model with all candidate predictors
mod_max <- lm(physhlth_days ~ menthlth_days + sleep_hrs + age + sex +
                education + exercise + gen_health + income_cat + bmi,
              data = brfss_ms)

tidy(mod_max, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Maximum Model: All Candidate Predictors",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Maximum Model: All Candidate Predictors
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	2.6902	0.8556	3.1441	0.0017	1.0128	4.3676
menthlth_days	0.1472	0.0121	12.1488	0.0000	0.1235	0.1710
sleep_hrs	-0.1930	0.0673	-2.8679	0.0041	-0.3249	-0.0611
age	0.0180	0.0055	3.2969	0.0010	0.0073	0.0288
sexFemale	-0.1889	0.1820	-1.0376	0.2995	-0.5458	0.1680
educationHS graduate	0.2508	0.4297	0.5836	0.5595	-0.5917	1.0933
educationSome college	0.3463	0.4324	0.8009	0.4233	-0.5014	1.1940
educationCollege graduate	0.3336	0.4357	0.7657	0.4439	-0.5206	1.1878
exerciseYes	-1.2866	0.2374	-5.4199	0.0000	-1.7520	-0.8212
gen_healthVery good	0.4373	0.2453	1.7824	0.0747	-0.0437	0.9183
gen_healthGood	1.5913	0.2651	6.0022	0.0000	1.0716	2.1111
gen_healthFair	7.0176	0.3682	19.0586	0.0000	6.2957	7.7394
gen_healthPoor	20.4374	0.5469	37.3722	0.0000	19.3653	21.5095
income_cat	-0.1817	0.0503	-3.6092	0.0003	-0.2803	-0.0830
bmi	0.0130	0.0145	0.8997	0.3683	-0.0153	0.0414

glance(mod_max) |>
  dplyr::select(r.squared, adj.r.squared, sigma, AIC, BIC, df.residual) |>
  mutate(across(everything(), \(x) round(x, 3))) |>
  kable(caption = "Maximum Model: Fit Statistics") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Maximum Model: Fit Statistics
r.squared	adj.r.squared	sigma	AIC	BIC	df.residual
0.386	0.384	6.321	32645.79	32750.06	4985

Interpretation: The maximum model explains approximately 38.6% of the variance in physically unhealthy days (R² = 0.386, Adjusted R² = 0.384). The strongest predictors are general health status (with “Poor” health associated with about 20 more unhealthy days compared to “Excellent”) and mental health days (each additional mentally unhealthy day is associated with 0.15 more physically unhealthy days). Exercise is also strongly associated, with exercisers reporting about 1.3 fewer physically unhealthy days. Several variables, including sex (p = 0.30), education (p > 0.40 for all levels), and BMI (p = 0.37), are not statistically significant, suggesting they may be candidates for removal in a more parsimonious model. The AIC is 32,645.8 and BIC is 32,750.1; these serve as baselines for comparing simpler models.

1.2 Overfitting vs. Underfitting

The goal of model building is to find the right balance:

Problem	What happens	Consequence
Overfitting	Including variables with \(\beta = 0\)	No bias, but increased collinearity, inflated SEs, poor out-of-sample prediction
Underfitting	Omitting variables with \(\beta \neq 0\)	Bias in the remaining coefficients (omitted variable bias)

Key insight: Underfitting is worse than overfitting in terms of bias. An overfit model gives unbiased estimates (just less precise), while an underfit model gives biased estimates. However, for prediction, overfitting degrades out-of-sample performance.
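The bias and precision trade-off can be illustrated with a small simulation. This is a minimal sketch on made-up data, not the BRFSS sample: `x2` has a true nonzero coefficient and `x3` has a true coefficient of zero, and both are correlated with `x1`. All names and effect sizes are assumptions chosen for illustration.

```r
set.seed(2020)
n_sims <- 500   # number of simulated datasets
n      <- 200   # observations per dataset

sim_once <- function() {
  x1 <- rnorm(n)
  x2 <- 0.6 * x1 + rnorm(n)   # correlated with x1, true beta = 1.5
  x3 <- 0.6 * x1 + rnorm(n)   # correlated with x1, true beta = 0
  y  <- 2 * x1 + 1.5 * x2 + rnorm(n)
  c(
    correct  = unname(coef(lm(y ~ x1 + x2))["x1"]),       # right variables
    overfit  = unname(coef(lm(y ~ x1 + x2 + x3))["x1"]),  # adds useless x3
    underfit = unname(coef(lm(y ~ x1))["x1"])             # omits needed x2
  )
}

est <- t(replicate(n_sims, sim_once()))   # one row per simulated dataset

colMeans(est)       # average estimate of the x1 coefficient (true value = 2)
apply(est, 2, sd)   # empirical standard error of each estimator
```

The overfit estimates should center on the true value of 2 with a somewhat larger spread, while the underfit estimates center near 2.9, because the omitted `x2` effect is absorbed by `x1`.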

The objective is a parsimonious model: the simplest model that captures the important relationships without unnecessary complexity.

1.3 How Many Predictors Can We Include?

The error degrees of freedom must be positive: \(n - k - 1 > 0\), meaning \(n > k + 1\).

Rules of thumb for the minimum sample size:

Rule	Requirement	Our data (n = 5,000)
Minimum 10 error df	\(n \geq k + 11\)	Can include up to 4,989 predictors
5 observations per predictor	\(n \geq 5k\)	Can include up to 1,000 predictors
10 observations per predictor	\(n \geq 10k\)	Can include up to 500 predictors

With \(n = 5,000\), we are well within all rules of thumb for our 9 candidate predictors (plus dummy variables).

Caution with categorical variables: A categorical predictor with \(m\) levels uses \(m - 1\) degrees of freedom, not just 1. Our education (4 levels) uses 3 df and gen_health (5 levels) uses 4 df, so the maximum model actually uses 14 predictor df (7 single-df predictors + 3 + 4).
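One way to count predictor degrees of freedom is to inspect the design matrix that R builds. The sketch below uses a small made-up data frame (`toy`, hypothetical) with one numeric predictor and two factors that have the same number of levels as education and gen_health.

```r
# Hypothetical toy data: 1 numeric predictor, a 4-level and a 5-level factor
toy <- data.frame(
  age        = seq(20, 79),
  education  = factor(rep(c("Less than HS", "HS graduate",
                            "Some college", "College graduate"), times = 15)),
  gen_health = factor(rep(c("Excellent", "Very good", "Good",
                            "Fair", "Poor"), times = 12))
)

X <- model.matrix(~ age + education + gen_health, data = toy)

predictor_df <- ncol(X) - 1               # all columns except the intercept
assign_idx   <- attr(X, "assign")[-1]     # which term each column belongs to
df_by_term   <- table(factor(assign_idx,
                             labels = c("age", "education", "gen_health")))

predictor_df
df_by_term
```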

2. Selection Criteria

Given a set of candidate models, we need a criterion to compare them. We cover six: \(R^2\), Adjusted \(R^2\), \(F_p\) (partial F-test), AIC, BIC, and MSE(p).

2.1 \(R^2\) (Coefficient of Determination)

\[R^2 = 1 - \frac{SSE}{SST} = \frac{SSR}{SST}\]

\(R^2\) measures the proportion of variance in \(Y\) explained by the model. However, \(R^2\) always increases (or stays the same) when you add a predictor, regardless of whether it is useful. This makes raw \(R^2\) useless for model comparison across models of different sizes.

# Demonstrate that R2 always increases
models <- list(
  "Sleep only"      = lm(physhlth_days ~ sleep_hrs, data = brfss_ms),
  "+ age"           = lm(physhlth_days ~ sleep_hrs + age, data = brfss_ms),
  "+ sex"           = lm(physhlth_days ~ sleep_hrs + age + sex, data = brfss_ms),
  "+ education"     = lm(physhlth_days ~ sleep_hrs + age + sex + education, data = brfss_ms),
  "+ exercise"      = lm(physhlth_days ~ sleep_hrs + age + sex + education + exercise, data = brfss_ms),
  "+ gen_health"    = lm(physhlth_days ~ sleep_hrs + age + sex + education + exercise + gen_health, data = brfss_ms),
  "+ mental health" = lm(physhlth_days ~ sleep_hrs + age + sex + education + exercise + gen_health + menthlth_days, data = brfss_ms),
  "+ income"        = lm(physhlth_days ~ sleep_hrs + age + sex + education + exercise + gen_health + menthlth_days + income_cat, data = brfss_ms),
  "+ BMI (full)"    = lm(physhlth_days ~ sleep_hrs + age + sex + education + exercise + gen_health + menthlth_days + income_cat + bmi, data = brfss_ms)
)

# Create comparison table
r2_table <- map_dfr(names(models), function(name) {
  g <- glance(models[[name]])
  tibble(
    Model = name,
    p = length(coef(models[[name]])) - 1,
    R2 = round(g$r.squared, 4),
    Adj_R2 = round(g$adj.r.squared, 4),
    AIC = round(g$AIC, 1),
    BIC = round(g$BIC, 1)
  )
})

# Display table
r2_table |>
  kable(caption = "Model Comparison: R2 Always Increases as Predictors Are Added") |>
  kable_styling(bootstrap_options = c("striped", "hover"))

Model Comparison: R2 Always Increases as Predictors Are Added
Model	p	R2	Adj_R2	AIC	BIC
Sleep only	1	0.0115	0.0113	35001.0	35020.6
+ age	2	0.0280	0.0276	34918.7	34944.8
+ sex	3	0.0280	0.0274	34920.7	34953.3
+ education	6	0.0440	0.0428	34843.7	34895.9
+ exercise	7	0.0849	0.0836	34626.8	34685.5
+ gen_health	11	0.3650	0.3636	32807.7	32892.4
+ mental health	12	0.3843	0.3828	32655.4	32746.6
+ income	13	0.3859	0.3843	32644.6	32742.4
+ BMI (full)	14	0.3860	0.3843	32645.8	32750.1

Interpretation: R² never decreases as predictors are added: it rises from 0.012 (sleep only) to 0.386 (full model) and stays flat at 0.0280 when sex is added. Adjusted R² tells a different story: it dips when sex is added (0.0276 to 0.0274), plateaus at 0.384 after adding income (the 8th predictor), and does not improve when BMI is added (still 0.384). The largest single jump in both R² and Adjusted R² occurs when general health is added (R² goes from 0.085 to 0.365), indicating it is by far the most powerful predictor. AIC and BIC both decrease sharply at that same step. In this sequence, AIC and BIC both reach their minimum at the model ending with income (AIC = 32,644.6, BIC = 32,742.4), and both rise when BMI is added; the rise is larger for BIC (7.7 points) than for AIC (1.2 points) because BIC penalizes each extra parameter more heavily. This table illustrates a key lesson: R² will always reward you for adding variables, even useless ones, making it unreliable for model comparison.

2.2 Adjusted \(R^2\)

Adjusted \(R^2\) penalizes for model complexity:

\[R^2_{adj} = 1 - \frac{(n - i)(1 - R^2)}{n - p - i}\]

where \(i = 1\) if the model includes an intercept, \(n\) is the sample size, and \(p\) is the number of predictors. Unlike \(R^2\), Adjusted \(R^2\) can decrease when an uninformative predictor is added, because the penalty for using an extra degree of freedom outweighs the tiny increase in \(R^2\).

Selection rule: Choose the model with the largest Adjusted \(R^2\).
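As a quick check of the definition, the sketch below computes Adjusted \(R^2\) by hand and compares it with the value R reports. It uses the built-in `mtcars` data rather than the BRFSS sample, and assumes a model with an intercept, so the denominator is the residual degrees of freedom \(n - p - 1\).

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # built-in data, for illustration only

n  <- nobs(fit)                 # sample size
p  <- length(coef(fit)) - 1     # number of predictors (intercept excluded)
r2 <- summary(fit)$r.squared

# Adjusted R2 with an intercept: 1 - (n - 1)(1 - R2) / (n - p - 1)
adj_manual <- 1 - (n - 1) * (1 - r2) / (n - p - 1)

all.equal(adj_manual, summary(fit)$adj.r.squared)
```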

2.3 \(F_p\) (Partial F-Test)

The partial F-test compares a reduced model (with \(p\) predictors) to the maximum model (with \(k\) predictors):

\[F_p = \frac{\{SSE(p) - SSE(k)\} / (k - p)}{SSE(k) / (n - k - 1)}\]

This tests \(H_0\): the \(k - p\) omitted variables all have \(\beta = 0\).

If \(F_p\) is not significant, the reduced model is adequate (the extra variables are not needed)
If \(F_p\) is significant, at least one of the omitted variables is important

Selection rule: Choose the smallest model for which \(F_p\) is not significant when compared to the maximum model.

# Compare a small model to the maximum model
mod_small <- lm(physhlth_days ~ menthlth_days + gen_health + exercise, data = brfss_ms)

anova(mod_small, mod_max) |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Partial F-test: Small Model vs. Maximum Model") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Partial F-test: Small Model vs. Maximum Model
term	df.residual	rss	df	sumsq	statistic	p.value
physhlth_days ~ menthlth_days + gen_health + exercise	4993	200472.4	NA	NA	NA	NA
physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education + exercise + gen_health + income_cat + bmi	4985	199201.8	8	1270.601	3.9746	1e-04

Interpretation: The partial F-test compares the small model (mental health days + general health + exercise) to the maximum model (all 9 predictors). The F-statistic is 3.97 on 8 and 4,985 df with p < 0.001, so we reject the null hypothesis that the 8 coefficients belonging to the 6 omitted variables all equal zero (education alone contributes 3 dummy coefficients). In other words, at least one of the omitted variables (sleep, age, sex, education, income, BMI) contributes significantly to the model beyond the three core predictors. This means the small model, despite capturing much of the explained variance, is missing important information. We should look for a model between the small and maximum that retains the significant predictors while dropping the uninformative ones.
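The \(F_p\) formula can be verified against `anova()`. The following sketch uses the built-in `mtcars` data (not the BRFSS sample) with an arbitrary reduced and full model chosen only for illustration.

```r
reduced <- lm(mpg ~ wt, data = mtcars)               # p predictors
full    <- lm(mpg ~ wt + hp + qsec, data = mtcars)   # k predictors

sse_p <- sum(resid(reduced)^2)
sse_k <- sum(resid(full)^2)
n <- nobs(full)
k <- length(coef(full)) - 1
p <- length(coef(reduced)) - 1

# Partial F: extra sum of squares per omitted df, over the full-model MSE
f_manual <- ((sse_p - sse_k) / (k - p)) / (sse_k / (n - k - 1))
p_manual <- pf(f_manual, df1 = k - p, df2 = n - k - 1, lower.tail = FALSE)

tab <- anova(reduced, full)
all.equal(f_manual, tab$F[2])
all.equal(p_manual, tab$`Pr(>F)`[2])
```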

2.4 AIC (Akaike Information Criterion)

\[AIC = 2k - 2\log(\hat{L})\]

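where \(k\) is the number of estimated parameters and \(\hat{L}\) is the maximized likelihood. Note that \(k\) here is not the number of candidate predictors used in Section 1: for a linear model, R's AIC() counts the regression coefficients (including the intercept) plus the residual variance. AIC measures the relative information lost by a model. It balances goodness of fit against complexity.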

AIC is not a test; it is a relative comparison tool
The model with the smallest AIC is preferred
AIC differences < 2 suggest models are essentially equivalent

Selection rule: Choose the model with the smallest AIC.

2.5 BIC (Bayesian Information Criterion)

\[BIC = k \log(n) - 2\log(\hat{L})\]

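BIC is similar to AIC but penalizes complexity more heavily: each parameter costs \(\log(n)\) instead of 2, and \(\log(n) > 2\) whenever \(n \geq 8\). With our \(n = 5,000\), \(\log(n) \approx 8.5\), so each extra parameter costs more than four times as much under BIC. BIC therefore tends to select simpler models than AIC.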

Selection rule: Choose the model with the smallest BIC.
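Both formulas can be reproduced from the log-likelihood. The sketch below uses the built-in `mtcars` data for illustration; note that for a linear model R counts the residual variance as an estimated parameter, so the parameter count is the number of coefficients plus one.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # built-in data, for illustration only

ll <- logLik(fit)
k  <- attr(ll, "df")    # estimated parameters: 3 coefficients + residual variance
n  <- nobs(fit)

aic_manual <- 2 * k - 2 * as.numeric(ll)
bic_manual <- k * log(n) - 2 * as.numeric(ll)

all.equal(aic_manual, AIC(fit))
all.equal(bic_manual, BIC(fit))
```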

2.6 MSE(p) (Mean Squared Error)

\[MSE(p) = \frac{SSE_p}{n - p - 1}\]

MSE(p) is the residual variance for a model with \(p\) predictors. It balances fit (smaller SSE) against model size (fewer df in the denominator).

Selection rule: Choose the model with the smallest MSE(p).
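MSE(p) is the square of the residual standard error that R reports, and Adjusted \(R^2\) is a rescaling of it, which is why the two criteria rank models identically for a fixed outcome. A minimal check on the built-in `mtcars` data (used only for illustration):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

n <- nobs(fit)
p <- length(coef(fit)) - 1

mse_p <- sum(resid(fit)^2) / (n - p - 1)   # SSE_p / (n - p - 1)
all.equal(mse_p, sigma(fit)^2)             # equals the squared residual SE

# Adjusted R2 = 1 - MSE(p) / (SST / (n - 1)); SST does not depend on the model
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
adj_from_mse <- 1 - mse_p / (sst / (n - 1))
all.equal(adj_from_mse, summary(fit)$adj.r.squared)
```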

2.7 Comparing the Criteria

Summary of Model Selection Criteria
Criterion	Direction	Penalizes	Best for
R2	Maximize	No	Never use alone
Adjusted R2	Maximize	Yes (df penalty)	Comparing nested models
Fp (partial F)	Not significant - keep reduced	Yes (F distribution)	Comparing to maximum model
AIC	Minimize	Yes (2k)	General comparison
BIC	Minimize	Yes (k log n)	Favors simpler models
MSE(p)	Minimize	Yes (df in denominator)	Similar to Adj. R2

criteria_long <- r2_table |>
  dplyr::select(Model, p, AIC, BIC) |>
  pivot_longer(cols = c(AIC, BIC), names_to = "Criterion", values_to = "Value") |>
  mutate(Model = factor(Model, levels = r2_table$Model))

ggplot(criteria_long, aes(x = p, y = Value, color = Criterion)) +
  geom_line(linewidth = 1.1) +
  geom_point(size = 3) +
  labs(
    title = "AIC and BIC Across Sequentially Larger Models",
    subtitle = "Lower is better; BIC penalizes complexity more heavily",
    x = "Number of Predictor Degrees of Freedom (p)",
    y = "Criterion Value"
  ) +
  theme_minimal(base_size = 13) +
  scale_color_brewer(palette = "Set1")

AIC and BIC Across Sequential Models

Interpretation: Both AIC and BIC decrease sharply as the first several predictors are added, with the steepest drop occurring when general health enters the model. Both criteria tick upward when sex is added, fall again through income, and reach their minimum at the 13-df model; adding BMI then raises both, by about 1.2 points for AIC and 7.7 points for BIC. The larger BIC increase reflects its heavier penalty for model complexity. In this particular sequence the two criteria pick the same model, but they often diverge in large samples: AIC tends to select larger models, while BIC favors parsimony. In practice, when AIC and BIC disagree, the choice depends on the modeling goal: AIC is better suited to prediction (it targets minimal information loss), while BIC is better suited to identifying the “true” model (it is consistent, meaning it selects the correct model with probability approaching 1 as n grows, provided that model is among the candidates).

3. Variable Selection Strategies

3.1 All Possible Regressions (Best Subsets)

The most thorough approach is to fit every possible subset of predictors and compare them. With \(k\) predictors, there are \(2^k - 1\) models.

This is computationally feasible for moderate \(k\) (up to about 20-30 predictors). In R, the leaps package implements this efficiently:

# regsubsets() accepts a formula and expands each factor into dummy columns,
# so every dummy (e.g., gen_healthFair) is treated as a separate variable
best_subsets <- regsubsets(
  physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education +
    exercise + gen_health + income_cat + bmi,
  data = brfss_ms,
  nvmax = 15,      # maximum number of variables to consider
  method = "exhaustive"
)

best_summary <- summary(best_subsets)

# tibble and ggplot2 (via tidyverse) and gridExtra are already attached in the setup chunk

subset_metrics <- tibble(
  p = 1:length(best_summary$adjr2),
  Adj_R2 = best_summary$adjr2,
  BIC = best_summary$bic,
  Cp = best_summary$cp
)

p1 <- ggplot(subset_metrics, aes(x = p, y = Adj_R2)) +
  geom_line(linewidth = 1, color = "steelblue") +
  geom_point(size = 3, color = "steelblue") +
  geom_vline(
    xintercept = which.max(best_summary$adjr2),
    linetype = "dashed",
    color = "tomato"
  ) +
  labs(
    title = "Adjusted R2 by Model Size",
    x = "Number of Variables",
    y = "Adjusted R2"
  ) +
  theme_minimal(base_size = 12)

p2 <- ggplot(subset_metrics, aes(x = p, y = BIC)) +
  geom_line(linewidth = 1, color = "steelblue") +
  geom_point(size = 3, color = "steelblue") +
  geom_vline(
    xintercept = which.min(best_summary$bic),
    linetype = "dashed",
    color = "tomato"
  ) +
  labs(
    title = "BIC by Model Size",
    x = "Number of Variables",
    y = "BIC"
  ) +
  theme_minimal(base_size = 12)

grid.arrange(p1, p2, ncol = 2)

Best Subsets: Adjusted R2 and BIC by Model Size

cat("Best model by Adj. R2:", which.max(best_summary$adjr2), "variables\n")

## Best model by Adj. R2: 10 variables

cat("Best model by BIC:", which.min(best_summary$bic), "variables\n")

## Best model by BIC: 8 variables

# Show which variables are in the BIC-best model
best_bic_idx <- which.min(best_summary$bic)
best_vars <- names(which(best_summary$which[best_bic_idx, -1]))
cat("\nVariables in BIC-best model:\n")

## 
## Variables in BIC-best model:

cat(paste(" ", best_vars), sep = "\n")

##   menthlth_days
##   sleep_hrs
##   age
##   exerciseYes
##   gen_healthGood
##   gen_healthFair
##   gen_healthPoor
##   income_cat

Interpretation: The best subsets analysis is broadly consistent with the sequential analysis. Adjusted R² reaches its maximum at 10 variables and plateaus, while BIC selects a more parsimonious model with 8 variables. The BIC-best model retains mental health days, sleep hours, age, exercise, three levels of general health (Good, Fair, Poor), and income. Notably, it drops sex, education, the Very good health dummy (implicitly combining it with Excellent in the reference group), and BMI. These are exactly the terms that had the largest p-values in the maximum model. Because regsubsets() treats each dummy column as a separate variable, it can keep some levels of a factor and drop others; in a final model, a categorical predictor is normally kept or dropped as a whole. The fact that both criteria converge on a similar core set of predictors (mental health, general health, exercise) gives us confidence that these are the genuinely important variables.
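When factors should enter or leave as a block, the all-possible-regressions idea can be sketched in base R by enumerating subsets of model terms rather than dummy columns. This is a brute-force illustration on the built-in `mtcars` data with four hypothetical candidate terms; it is not how leaps works internally and will not scale to many predictors.

```r
dat <- transform(mtcars, cyl = factor(cyl))   # cyl: 3-level factor (2 df)
candidates <- c("wt", "hp", "qsec", "cyl")

# Every non-empty subset of the candidate terms: 2^4 - 1 = 15 models
subsets <- unlist(
  lapply(seq_along(candidates),
         function(m) combn(candidates, m, simplify = FALSE)),
  recursive = FALSE
)

# Fit each model and collect its selection criteria
results <- do.call(rbind, lapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response = "mpg"), data = dat)
  data.frame(
    terms  = paste(vars, collapse = " + "),
    df     = length(coef(fit)) - 1,          # predictor df, dummies counted
    adj_r2 = summary(fit)$adj.r.squared,
    AIC    = AIC(fit),
    BIC    = BIC(fit)
  )
}))

head(results[order(results$BIC), ], 5)       # five best models by BIC
```

Sorting by `adj_r2` (descending) or `AIC` instead shows whether the criteria agree on the top model.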

3.2 Backward Elimination

Backward elimination starts with the maximum model and removes variables one at a time:

1. Fit the maximum model
2. Identify the predictor with the highest p-value (smallest partial F-statistic)
3. If its p-value exceeds \(\alpha\) (typically 0.05 or 0.10), remove it
4. Refit the model and repeat steps 2-3
5. Stop when all remaining variables have p-values \(\leq \alpha\)

# Step-by-step backward elimination (manual demonstration)
cat("=== BACKWARD ELIMINATION ===\n\n")

## === BACKWARD ELIMINATION ===

# Step 1: Maximum model
mod_back <- mod_max
cat("Step 1: Maximum model\n")

## Step 1: Maximum model

cat("Variables:", paste(names(coef(mod_back))[-1], collapse = ", "), "\n")

## Variables: menthlth_days, sleep_hrs, age, sexFemale, educationHS graduate, educationSome college, educationCollege graduate, exerciseYes, gen_healthVery good, gen_healthGood, gen_healthFair, gen_healthPoor, income_cat, bmi

# Show p-values for the maximum model
pvals <- tidy(mod_back) |>
  filter(term != "(Intercept)") |>
  arrange(desc(p.value)) |>
  dplyr::select(term, estimate, p.value) |>
  mutate(across(where(is.numeric), \(x) round(x, 4)))

pvals |>
  head(5) |>
  kable(caption = "Maximum Model: Variables Sorted by p-value (Highest First)") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Maximum Model: Variables Sorted by p-value (Highest First)
term	estimate	p.value
educationHS graduate	0.2508	0.5595
educationCollege graduate	0.3336	0.4439
educationSome college	0.3463	0.4233
bmi	0.0130	0.3683
sexFemale	-0.1889	0.2995
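The table above ranks individual dummy coefficients, but a multi-level factor such as education is removed as a block, so the relevant test is a partial F-test for the whole term. The sketch below shows one way to run p-value-based backward elimination with `drop1(..., test = "F")`, which tests each term jointly. It uses the built-in `mtcars` data and an assumed threshold of \(\alpha = 0.10\); it illustrates the procedure and is not the analysis of the BRFSS sample.

```r
dat   <- transform(mtcars, cyl = factor(cyl))   # cyl: 3-level factor
alpha <- 0.10                                   # assumed removal threshold

current <- lm(mpg ~ wt + hp + qsec + drat + cyl, data = dat)

repeat {
  # Stop if nothing is left to drop
  if (length(attr(terms(current), "term.labels")) == 0) break

  tests <- drop1(current, test = "F")     # joint F-test for each term
  tests <- tests[-1, , drop = FALSE]      # discard the "<none>" row

  worst <- which.max(tests[["Pr(>F)"]])   # term with the largest p-value
  if (tests[["Pr(>F)"]][worst] <= alpha) break

  # Remove that term (all of its dummies at once) and refit
  current <- update(current,
                    as.formula(paste(". ~ . -", rownames(tests)[worst])))
}

formula(current)
```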

In R, the step() function automates backward elimination using AIC:

# Automated backward elimination using AIC
mod_backward <- step(mod_max, direction = "backward", trace = 1)

## Start:  AIC=18454.4
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education + 
##     exercise + gen_health + income_cat + bmi
## 
##                 Df Sum of Sq    RSS   AIC
## - education      3        29 199231 18449
## - bmi            1        32 199234 18453
## - sex            1        43 199245 18454
## <none>                       199202 18454
## - sleep_hrs      1       329 199530 18461
## - age            1       434 199636 18463
## - income_cat     1       521 199722 18466
## - exercise       1      1174 200376 18482
## - menthlth_days  1      5898 205100 18598
## - gen_health     4     66437 265639 19886
## 
## Step:  AIC=18449.13
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + exercise + 
##     gen_health + income_cat + bmi
## 
##                 Df Sum of Sq    RSS   AIC
## - bmi            1        32 199262 18448
## - sex            1        40 199270 18448
## <none>                       199231 18449
## - sleep_hrs      1       327 199557 18455
## - age            1       439 199670 18458
## - income_cat     1       520 199751 18460
## - exercise       1      1151 200381 18476
## - menthlth_days  1      5929 205159 18594
## - gen_health     4     66459 265690 19881
## 
## Step:  AIC=18447.92
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + exercise + 
##     gen_health + income_cat
## 
##                 Df Sum of Sq    RSS   AIC
## - sex            1        42 199305 18447
## <none>                       199262 18448
## - sleep_hrs      1       334 199596 18454
## - age            1       427 199690 18457
## - income_cat     1       514 199776 18459
## - exercise       1      1222 200484 18477
## - menthlth_days  1      5921 205184 18592
## - gen_health     4     67347 266609 19896
## 
## Step:  AIC=18446.98
## physhlth_days ~ menthlth_days + sleep_hrs + age + exercise + 
##     gen_health + income_cat
## 
##                 Df Sum of Sq    RSS   AIC
## <none>                       199305 18447
## - sleep_hrs      1       337 199641 18453
## - age            1       409 199713 18455
## - income_cat     1       492 199797 18457
## - exercise       1      1214 200518 18475
## - menthlth_days  1      5882 205186 18590
## - gen_health     4     67980 267285 19906

tidy(mod_backward, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Backward Elimination Result (AIC-based)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Backward Elimination Result (AIC-based)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	3.1864	0.6663	4.7819	0.0000	1.8800	4.4927
menthlth_days	0.1461	0.0120	12.1352	0.0000	0.1225	0.1697
sleep_hrs	-0.1951	0.0672	-2.9038	0.0037	-0.3269	-0.0634
age	0.0174	0.0054	3.1981	0.0014	0.0067	0.0281
exerciseYes	-1.2877	0.2336	-5.5127	0.0000	-1.7457	-0.8298
gen_healthVery good	0.4617	0.2441	1.8914	0.0586	-0.0169	0.9403
gen_healthGood	1.6368	0.2600	6.2953	0.0000	1.1271	2.1465
gen_healthFair	7.0787	0.3616	19.5735	0.0000	6.3697	7.7876
gen_healthPoor	20.5084	0.5423	37.8149	0.0000	19.4452	21.5716
income_cat	-0.1657	0.0472	-3.5115	0.0004	-0.2582	-0.0732

Interpretation: AIC-based backward elimination removed sex, education, and BMI from the maximum model, arriving at a model with 9 predictor terms (counting dummy variables) plus the intercept. These are the same three variables that were non-significant in the maximum model. The retained predictors (mental health days, sleep, age, exercise, general health, and income) all have p-values below 0.05, with one exception: the Very good dummy (p = 0.059) stays because gen_health is kept or dropped as a single 4-df term. The resulting model has Adjusted R² = 0.385, essentially identical to the maximum model (0.384), confirming that the dropped variables contributed negligible explanatory power.
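The AIC values in the step() trace (for example 18454.4 for the maximum model) are on a different scale from the 32645.79 that glance() reported for the same model. step() relies on extractAIC(), which for a linear model drops terms that are constant across models fitted to the same data. The sketch below reproduces both versions on the built-in `mtcars` data and shows that they differ by a constant depending only on \(n\), so model rankings are unaffected.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # built-in data, for illustration only

n   <- nobs(fit)
rss <- sum(resid(fit)^2)
edf <- length(coef(fit))                  # coefficients, including the intercept

aic_step <- n * log(rss / n) + 2 * edf    # scale used in the step() trace
offset   <- n * (1 + log(2 * pi)) + 2     # constant omitted by extractAIC()

all.equal(aic_step, extractAIC(fit)[2])
all.equal(AIC(fit), aic_step + offset)
```

With \(n = 5,000\) the offset is about 14,191.4, which accounts for the gap between 18,454.4 and 32,645.8.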

3.3 Forward Selection

Forward selection starts with the intercept-only model and adds variables one at a time:

1. Start with no predictors (intercept only)
2. For each candidate variable not yet in the model, compute the partial F-statistic for adding it to the current model
3. Identify the candidate with the smallest p-value (largest partial F)
4. If its p-value is \(\leq \alpha\), add it to the model and repeat steps 2-3
5. Stop when no remaining candidate has a p-value \(\leq \alpha\)
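The p-value-based steps above can be sketched with `add1(..., test = "F")`, which computes the partial F-test for adding each candidate term to the current model. This illustration uses the built-in `mtcars` data and an assumed entry threshold of \(\alpha = 0.05\); the candidate set is hypothetical.

```r
dat   <- transform(mtcars, cyl = factor(cyl))   # cyl: 3-level factor
alpha <- 0.05                                   # assumed entry threshold

upper      <- mpg ~ wt + hp + qsec + drat + cyl   # maximum model
candidates <- attr(terms(upper), "term.labels")
current    <- lm(mpg ~ 1, data = dat)             # intercept-only start
entered    <- character(0)                        # order of entry

repeat {
  # add1() errors when nothing is left to add, so check first
  if (length(setdiff(candidates, entered)) == 0) break

  tests <- add1(current, scope = upper, test = "F")
  tests <- tests[-1, , drop = FALSE]       # discard the "<none>" row

  best <- which.min(tests[["Pr(>F)"]])     # candidate with the smallest p-value
  if (tests[["Pr(>F)"]][best] > alpha) break

  term    <- rownames(tests)[best]
  current <- update(current, as.formula(paste(". ~ . +", term)))
  entered <- c(entered, term)
}

entered            # terms in order of entry
formula(current)
```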

# Automated forward selection using AIC
mod_null <- lm(physhlth_days ~ 1, data = brfss_ms)

mod_forward <- step(mod_null,
                    scope = list(lower = mod_null, upper = mod_max),
                    direction = "forward", trace = 1)

## Start:  AIC=20865.24
## physhlth_days ~ 1
## 
##                 Df Sum of Sq    RSS   AIC
## + gen_health     4    115918 208518 18663
## + menthlth_days  1     29743 294693 20387
## + exercise       1     19397 305038 20559
## + income_cat     1     19104 305332 20564
## + education      3      5906 318530 20779
## + age            1      4173 320263 20803
## + bmi            1      4041 320395 20805
## + sleep_hrs      1      3717 320719 20810
## <none>                       324435 20865
## + sex            1         7 324429 20867
## 
## Step:  AIC=18662.93
## physhlth_days ~ gen_health
## 
##                 Df Sum of Sq    RSS   AIC
## + menthlth_days  1    6394.9 202123 18509
## + exercise       1    1652.4 206865 18625
## + income_cat     1    1306.9 207211 18634
## + sleep_hrs      1     756.1 207762 18647
## + bmi            1      91.2 208427 18663
## <none>                       208518 18663
## + sex            1      38.5 208479 18664
## + age            1      32.2 208486 18664
## + education      3     145.0 208373 18666
## 
## Step:  AIC=18509.19
## physhlth_days ~ gen_health + menthlth_days
## 
##              Df Sum of Sq    RSS   AIC
## + exercise    1   1650.52 200472 18470
## + income_cat  1    817.89 201305 18491
## + age         1    464.73 201658 18500
## + sleep_hrs   1    257.79 201865 18505
## + bmi         1     90.51 202032 18509
## <none>                    202123 18509
## + sex         1      3.00 202120 18511
## + education   3    111.58 202011 18512
## 
## Step:  AIC=18470.19
## physhlth_days ~ gen_health + menthlth_days + exercise
## 
##              Df Sum of Sq    RSS   AIC
## + income_cat  1    509.09 199963 18460
## + age         1    333.74 200139 18464
## + sleep_hrs   1    253.06 200219 18466
## <none>                    200472 18470
## + bmi         1     21.21 200451 18472
## + sex         1     10.74 200462 18472
## + education   3     26.94 200445 18476
## 
## Step:  AIC=18459.48
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat
## 
##             Df Sum of Sq    RSS   AIC
## + age        1    321.97 199641 18453
## + sleep_hrs  1    250.25 199713 18455
## <none>                   199963 18460
## + bmi        1     27.98 199935 18461
## + sex        1     27.17 199936 18461
## + education  3     26.66 199937 18465
## 
## Step:  AIC=18453.42
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat + 
##     age
## 
##             Df Sum of Sq    RSS   AIC
## + sleep_hrs  1    336.79 199305 18447
## <none>                   199641 18453
## + sex        1     45.31 199596 18454
## + bmi        1     42.00 199599 18454
## + education  3     22.62 199619 18459
## 
## Step:  AIC=18446.98
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat + 
##     age + sleep_hrs
## 
##             Df Sum of Sq    RSS   AIC
## <none>                   199305 18447
## + sex        1    42.328 199262 18448
## + bmi        1    34.434 199270 18448
## + education  3    24.800 199280 18452

tidy(mod_forward, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Forward Selection Result (AIC-based)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Forward Selection Result (AIC-based)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	3.1864	0.6663	4.7819	0.0000	1.8800	4.4927
gen_healthVery good	0.4617	0.2441	1.8914	0.0586	-0.0169	0.9403
gen_healthGood	1.6368	0.2600	6.2953	0.0000	1.1271	2.1465
gen_healthFair	7.0787	0.3616	19.5735	0.0000	6.3697	7.7876
gen_healthPoor	20.5084	0.5423	37.8149	0.0000	19.4452	21.5716
menthlth_days	0.1461	0.0120	12.1352	0.0000	0.1225	0.1697
exerciseYes	-1.2877	0.2336	-5.5127	0.0000	-1.7457	-0.8298
income_cat	-0.1657	0.0472	-3.5115	0.0004	-0.2582	-0.0732
age	0.0174	0.0054	3.1981	0.0014	0.0067	0.0281
sleep_hrs	-0.1951	0.0672	-2.9038	0.0037	-0.3269	-0.0634

Interpretation: Forward selection arrived at the same final model as backward elimination, including the same 9 predictor terms. The order of entry is informative: general health entered first (the strongest predictor), followed by mental health days, exercise, income, age, and sleep. This ordering reflects each variable’s marginal contribution given the variables already in the model. The convergence of forward and backward methods on the same model increases our confidence in this particular subset, though this convergence is not guaranteed in general.

3.4 Stepwise Selection

Stepwise selection combines forward and backward: after adding a variable, it checks whether any previously entered variable should now be removed. This addresses a limitation of pure forward selection, where a variable that was useful early on may become redundant after other variables enter.

mod_stepwise <- step(mod_null,
                     scope = list(lower = mod_null, upper = mod_max),
                     direction = "both", trace = 1)

## Start:  AIC=20865.24
## physhlth_days ~ 1
## 
##                 Df Sum of Sq    RSS   AIC
## + gen_health     4    115918 208518 18663
## + menthlth_days  1     29743 294693 20387
## + exercise       1     19397 305038 20559
## + income_cat     1     19104 305332 20564
## + education      3      5906 318530 20779
## + age            1      4173 320263 20803
## + bmi            1      4041 320395 20805
## + sleep_hrs      1      3717 320719 20810
## <none>                       324435 20865
## + sex            1         7 324429 20867
## 
## Step:  AIC=18662.93
## physhlth_days ~ gen_health
## 
##                 Df Sum of Sq    RSS   AIC
## + menthlth_days  1      6395 202123 18509
## + exercise       1      1652 206865 18625
## + income_cat     1      1307 207211 18634
## + sleep_hrs      1       756 207762 18647
## + bmi            1        91 208427 18663
## <none>                       208518 18663
## + sex            1        38 208479 18664
## + age            1        32 208486 18664
## + education      3       145 208373 18666
## - gen_health     4    115918 324435 20865
## 
## Step:  AIC=18509.19
## physhlth_days ~ gen_health + menthlth_days
## 
##                 Df Sum of Sq    RSS   AIC
## + exercise       1      1651 200472 18470
## + income_cat     1       818 201305 18491
## + age            1       465 201658 18500
## + sleep_hrs      1       258 201865 18505
## + bmi            1        91 202032 18509
## <none>                       202123 18509
## + sex            1         3 202120 18511
## + education      3       112 202011 18512
## - menthlth_days  1      6395 208518 18663
## - gen_health     4     92570 294693 20387
## 
## Step:  AIC=18470.19
## physhlth_days ~ gen_health + menthlth_days + exercise
## 
##                 Df Sum of Sq    RSS   AIC
## + income_cat     1       509 199963 18460
## + age            1       334 200139 18464
## + sleep_hrs      1       253 200219 18466
## <none>                       200472 18470
## + bmi            1        21 200451 18472
## + sex            1        11 200462 18472
## + education      3        27 200445 18476
## - exercise       1      1651 202123 18509
## - menthlth_days  1      6393 206865 18625
## - gen_health     4     78857 279330 20121
## 
## Step:  AIC=18459.48
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat
## 
##                 Df Sum of Sq    RSS   AIC
## + age            1       322 199641 18453
## + sleep_hrs      1       250 199713 18455
## <none>                       199963 18460
## + bmi            1        28 199935 18461
## + sex            1        27 199936 18461
## + education      3        27 199937 18465
## - income_cat     1       509 200472 18470
## - exercise       1      1342 201305 18491
## - menthlth_days  1      5988 205952 18605
## - gen_health     4     72713 272676 20002
## 
## Step:  AIC=18453.42
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat + 
##     age
## 
##                 Df Sum of Sq    RSS   AIC
## + sleep_hrs      1       337 199305 18447
## <none>                       199641 18453
## + sex            1        45 199596 18454
## + bmi            1        42 199599 18454
## + education      3        23 199619 18459
## - age            1       322 199963 18460
## - income_cat     1       497 200139 18464
## - exercise       1      1231 200873 18482
## - menthlth_days  1      6304 205945 18607
## - gen_health     4     68936 268577 19929
## 
## Step:  AIC=18446.98
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat + 
##     age + sleep_hrs
## 
##                 Df Sum of Sq    RSS   AIC
## <none>                       199305 18447
## + sex            1        42 199262 18448
## + bmi            1        34 199270 18448
## + education      3        25 199280 18452
## - sleep_hrs      1       337 199641 18453
## - age            1       409 199713 18455
## - income_cat     1       492 199797 18457
## - exercise       1      1214 200518 18475
## - menthlth_days  1      5882 205186 18590
## - gen_health     4     67980 267285 19906

tidy(mod_stepwise, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Stepwise Selection Result (AIC-based)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Stepwise Selection Result (AIC-based)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	3.1864	0.6663	4.7819	0.0000	1.8800	4.4927
gen_healthVery good	0.4617	0.2441	1.8914	0.0586	-0.0169	0.9403
gen_healthGood	1.6368	0.2600	6.2953	0.0000	1.1271	2.1465
gen_healthFair	7.0787	0.3616	19.5735	0.0000	6.3697	7.7876
gen_healthPoor	20.5084	0.5423	37.8149	0.0000	19.4452	21.5716
menthlth_days	0.1461	0.0120	12.1352	0.0000	0.1225	0.1697
exerciseYes	-1.2877	0.2336	-5.5127	0.0000	-1.7457	-0.8298
income_cat	-0.1657	0.0472	-3.5115	0.0004	-0.2582	-0.0732
age	0.0174	0.0054	3.1981	0.0014	0.0067	0.0281
sleep_hrs	-0.1951	0.0672	-2.9038	0.0037	-0.3269	-0.0634

Interpretation: The stepwise procedure, which allows both addition and removal at each step, also converges on the identical model. In this dataset, no variable that was added early became redundant after later variables entered, so no removals were needed. This three-way agreement (backward = forward = stepwise) is reassuring but should not be taken as proof that this is the “correct” model. All three methods optimize the same criterion (AIC) on the same data.

3.5 Comparing All Selection Methods

method_comparison <- tribble(
  ~Method, ~Variables_selected, ~Adj_R2, ~AIC, ~BIC,
  "Maximum model",
    length(coef(mod_max)) - 1,
    round(glance(mod_max)$adj.r.squared, 4),
    round(AIC(mod_max), 1),
    round(BIC(mod_max), 1),
  "Backward (AIC)",
    length(coef(mod_backward)) - 1,
    round(glance(mod_backward)$adj.r.squared, 4),
    round(AIC(mod_backward), 1),
    round(BIC(mod_backward), 1),
  "Forward (AIC)",
    length(coef(mod_forward)) - 1,
    round(glance(mod_forward)$adj.r.squared, 4),
    round(AIC(mod_forward), 1),
    round(BIC(mod_forward), 1),
  "Stepwise (AIC)",
    length(coef(mod_stepwise)) - 1,
    round(glance(mod_stepwise)$adj.r.squared, 4),
    round(AIC(mod_stepwise), 1),
    round(BIC(mod_stepwise), 1)
)

method_comparison |>
  kable(
    col.names = c("Method", "Variables selected", "Adj. R2", "AIC", "BIC"),
    caption = "Comparison of Variable Selection Methods"
  ) |>
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  )

Comparison of Variable Selection Methods
Method	Variables selected	Adj. R2	AIC	BIC
Maximum model	14	0.3843	32645.8	32750.1
Backward (AIC)	9	0.3846	32638.4	32710.1
Forward (AIC)	9	0.3846	32638.4	32710.1
Stepwise (AIC)	9	0.3846	32638.4	32710.1

Interpretation: All three automated methods selected the same model, with 9 coefficients from 6 variables (Adjusted R² = 0.385, AIC = 32,638.4, BIC = 32,710.1). The "Variables selected" column counts coefficients, so each factor contributes one per non-reference level. This model has a lower AIC and BIC than the maximum model (AIC = 32,645.8, BIC = 32,750.1), confirming that removing sex, education, and BMI improved parsimony without sacrificing fit. The BIC improvement (40.0 points) is much larger than the AIC improvement (7.4 points), consistent with BIC’s stronger preference for simpler models: with n = 5,000, BIC charges log(n) ≈ 8.5 per coefficient where AIC charges 2, and five coefficients were dropped. In practice, the maximum model and the selected model would produce very similar predictions, but the selected model is preferred for its efficiency.
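A note on scale: the step() traces report AIC = 18,446.98 for the selected model, while AIC() in the table reports 32,638.4. Both are valid. step() relies on extractAIC(), which for linear models drops additive constants that depend only on n, whereas AIC() uses the full Gaussian log-likelihood and also counts the error variance as a parameter. The sketch below (on the built-in mtcars data, as a stand-in) checks that the gap is the same for every model fitted to the same data, so the ranking of models is unaffected:

```r
# Stand-in data: mtcars (not the BRFSS data used in this lecture)
fit_small <- lm(mpg ~ wt, data = mtcars)
fit_large <- lm(mpg ~ wt + hp + qsec, data = mtcars)
n <- nobs(fit_small)

# extractAIC() returns c(edf, AIC); the second element is what step() prints
offset_small <- AIC(fit_small) - extractAIC(fit_small)[2]
offset_large <- AIC(fit_large) - extractAIC(fit_large)[2]

# Constant implied by the Gaussian log-likelihood, plus 2 for the variance parameter
expected_offset <- n * (1 + log(2 * pi)) + 2

c(small = offset_small, large = offset_large, expected = expected_offset)
```

With n = 5,000 the same constant is about 14,191.4, which matches the gap between the two sets of AIC values reported in this lecture.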

3.6 Cautions About Automated Selection

Use automated selection with extreme caution.

Automated methods (forward, backward, stepwise) have well-documented problems:

They ignore the research question. The algorithm selects variables based purely on statistical fit. If you are building an associative model and the exposure is not statistically significant, the algorithm will remove it, which defeats the purpose.
They inflate Type I error. The repeated testing involved in stepwise procedures inflates the probability of including spurious predictors.
They are path-dependent. Forward and backward selection can yield different final models because the order of variable entry/removal matters.
They ignore subject-matter knowledge. A variable may be a known confounder from the literature even if it is not statistically significant in your sample.
p-values and CIs from the final model are biased. Because the model was selected to optimize fit on the same data, the reported p-values are anti-conservative (too small) and the confidence intervals are too narrow.

Recommendation: Use automated selection as an exploratory tool to generate candidate models, but make final decisions based on substantive knowledge, confounding assessment, and parsimony.
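The Type I error problem can be seen in a small simulation. The sketch below is illustrative only: the sample size, number of noise predictors, number of replications, and variable names are arbitrary choices, not values from the BRFSS analysis. The outcome is generated independently of every candidate predictor, so any variable that forward selection keeps is spurious. Under AIC, a single pure-noise predictor has roughly a 16% chance of lowering AIC (the chance that a chi-square variable with 1 df exceeds 2), and the algorithm gets to pick the best of several.

```r
set.seed(553)

n_obs   <- 200   # observations per simulated dataset (arbitrary)
n_noise <- 10    # candidate predictors, all unrelated to the outcome
n_sims  <- 30    # simulated datasets

noise_names <- paste0("x", seq_len(n_noise))
upper_scope <- reformulate(noise_names)          # ~ x1 + x2 + ... + x10

n_selected <- integer(n_sims)
for (s in seq_len(n_sims)) {
  sim <- as.data.frame(
    matrix(rnorm(n_obs * n_noise), ncol = n_noise,
           dimnames = list(NULL, noise_names))
  )
  sim$y <- rnorm(n_obs)                          # outcome is pure noise

  null_fit <- lm(y ~ 1, data = sim)
  picked <- step(null_fit,
                 scope = list(lower = ~ 1, upper = upper_scope),
                 direction = "forward", trace = 0)

  n_selected[s] <- length(coef(picked)) - 1      # spurious predictors retained
}

mean(n_selected)       # average number of noise variables selected
mean(n_selected > 0)   # share of datasets with at least one false inclusion
```

Every retained variable here is a false positive, yet each would typically appear "significant" in the summary of the selected model, which is the anti-conservative bias described above.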

4. Model Selection for Associative Models

4.1 A Different Philosophy

In associative modeling, the exposure variable is always in the model. It is never a candidate for removal, regardless of its p-value. The question is which covariates to include alongside it.

The standard epidemiological approach to covariate selection:

Identify the exposure(s) of interest (these stay in the model)
Identify candidate confounders from the literature and bivariate analyses
Use the 10% change-in-estimate rule to determine which confounders to retain

4.2 The 10% Change-in-Estimate Procedure

Recall from the Confounding lecture: a covariate is a confounder if removing it changes the exposure coefficient by more than 10%.

The systematic procedure:

Fit the maximum model (exposure + all candidate confounders) and note the exposure \(\hat{\beta}\)
Compute the 10% interval: \(\hat{\beta} \pm 0.10 \times |\hat{\beta}|\)
Remove one candidate confounder at a time
If the exposure \(\hat{\beta}\) stays within the 10% interval, the removed variable is not a confounder (drop it)
If the exposure \(\hat{\beta}\) moves outside the interval, the variable is a confounder (keep it)
Repeat until all covariates have been evaluated

# Exposure: exercise; Outcome: physhlth_days

# Maximum associative model
mod_assoc_max <- lm(
  physhlth_days ~ exercise + menthlth_days + sleep_hrs + age +
    sex + education + income_cat + bmi,
  data = brfss_ms
)

b_exposure_max <- coef(mod_assoc_max)["exerciseYes"]
interval_low <- b_exposure_max - 0.10 * abs(b_exposure_max)
interval_high <- b_exposure_max + 0.10 * abs(b_exposure_max)

cat("Exposure coefficient in maximum model:", round(b_exposure_max, 4), "\n")

## Exposure coefficient in maximum model: -3.0688

cat("10 percent interval: (", round(interval_low, 4), ",", round(interval_high, 4), ")\n\n")

## 10 percent interval: ( -3.3757 , -2.7619 )

# Systematically remove one covariate at a time
covariates_to_test <- c(
  "menthlth_days", "sleep_hrs", "age", "sex",
  "education", "income_cat", "bmi"
)

assoc_table <- map_dfr(covariates_to_test, function(cov) {
  remaining <- setdiff(covariates_to_test, cov)
  form <- as.formula(
    paste("physhlth_days ~ exercise +", paste(remaining, collapse = " + "))
  )
  mod_reduced <- lm(form, data = brfss_ms)
  b_reduced <- coef(mod_reduced)["exerciseYes"]
  pct_change <- (b_reduced - b_exposure_max) / abs(b_exposure_max) * 100

  tibble(
    Removed_covariate = cov,
    Exercise_b_max = round(b_exposure_max, 4),
    Exercise_b_without = round(b_reduced, 4),
    Percent_change = round(pct_change, 1),
    Within_10_percent = ifelse(abs(pct_change) <= 10, "Yes (drop)", "No (keep)"),
    Confounder = ifelse(abs(pct_change) > 10, "Yes", "No")
  )
})

assoc_table |>
  kable(
    col.names = c(
      "Removed covariate",
      "Exercise b (max)",
      "Exercise b (without)",
      "Percent change",
      "Within 10 percent?",
      "Confounder"
    ),
    caption = "Associative Model: Systematic Confounder Assessment for Exercise"
  ) |>
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  ) |>
  column_spec(6, bold = TRUE)

Associative Model: Systematic Confounder Assessment for Exercise
Removed covariate	Exercise b (max)	Exercise b (without)	Percent change	Within 10 percent?	Confounder
menthlth_days	-3.0688	-3.3725	-9.9	Yes (drop)	No
sleep_hrs	-3.0688	-3.0950	-0.9	Yes (drop)	No
age	-3.0688	-3.4150	-11.3	No (keep)	Yes
sex	-3.0688	-3.0534	0.5	Yes (drop)	No
education	-3.0688	-3.1036	-1.1	Yes (drop)	No
income_cat	-3.0688	-3.4544	-12.6	No (keep)	Yes
bmi	-3.0688	-3.2411	-5.6	Yes (drop)	No

Interpretation: The exercise coefficient in the maximum associative model is -3.07, meaning exercisers report about 3 fewer physically unhealthy days than non-exercisers after adjusting for all covariates. The systematic assessment identifies two confounders: age (11.3% change when removed) and income (12.6% change when removed). Removing age moves the exercise coefficient further from zero (to -3.42), so age confounds the association away from the null: older adults exercise less and have more unhealthy days, so ignoring age makes exercise look more protective than it is. Removing income has the same effect (to -3.45) through a similar mechanism (higher income is associated with both more exercise and fewer unhealthy days). The remaining covariates (mental health days, sleep, sex, education, BMI) all produce changes within the 10% interval, so they are not classified as confounders and could be dropped from the associative model, although mental health days (9.9%) sits right at the threshold and deserves a second look. The final associative model would include exercise, age, and income.
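For reference, the percent change reported in the table follows the definition used in the code (the change is scaled by the absolute value of the maximum-model coefficient, so a negative sign means the coefficient moved further below zero). Worked out for the two covariates flagged as confounders:

```latex
\[
\%\Delta
  = 100 \times
    \frac{\hat{\beta}_{\text{reduced}} - \hat{\beta}_{\text{max}}}
         {\lvert \hat{\beta}_{\text{max}} \rvert}
\]

\[
\text{age: } 100 \times \frac{-3.4150 - (-3.0688)}{3.0688} \approx -11.3\%
\qquad
\text{income: } 100 \times \frac{-3.4544 - (-3.0688)}{3.0688} \approx -12.6\%
\]
```

Both exceed 10% in absolute value, which is why the table marks them "No (keep)".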

4.3 Associative Models with Interactions

If a statistically significant interaction is present (from the previous lecture), the approach changes:

Stratify by the effect modifier
Within each stratum, apply the 10% change-in-estimate procedure separately
Report stratum-specific results

For example, if age \(\times\) exercise is significant and we treat age as the exposure with exercise as the effect modifier:

For exercisers: fit physhlth_days ~ age + [confounders] and assess confounding
For non-exercisers: fit the same model and assess confounding
The set of confounders may differ between strata
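The stratified procedure can be sketched as follows. This is a minimal illustration on simulated data, not the BRFSS analysis: the data-generating values are invented, and only the variable names are borrowed from the lecture. Age is treated as the exposure, exercise as the effect modifier, and income_cat as a single candidate confounder.

```r
set.seed(553)
n <- 400

# Simulated stand-in data (hypothetical values)
sim <- data.frame(
  exercise = rep(c("No", "Yes"), each = n / 2),
  age      = runif(n, 18, 80)
)
# Candidate confounder, built to be correlated with age
sim$income_cat <- pmin(8, pmax(1, round(sim$age / 10 + rnorm(n))))
# The age slope differs by exercise status (effect modification)
true_slope <- ifelse(sim$exercise == "Yes", 0.02, 0.10)
sim$physhlth_days <- 2 + true_slope * sim$age - 0.3 * sim$income_cat + rnorm(n)

# Within each stratum: compare the age coefficient with and without the covariate
stratum_check <- lapply(split(sim, sim$exercise), function(d) {
  b_adj   <- coef(lm(physhlth_days ~ age + income_cat, data = d))[["age"]]
  b_crude <- coef(lm(physhlth_days ~ age, data = d))[["age"]]
  data.frame(
    b_adjusted       = b_adj,
    b_without_income = b_crude,
    pct_change       = 100 * (b_crude - b_adj) / abs(b_adj)
  )
})

stratum_results <- do.call(rbind, stratum_check)
stratum_results
```

Each row is one stratum. Because the 10% rule is relative to the stratum-specific coefficient, the same covariate can cross the threshold in one stratum and not in the other.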

4.4 Predictive vs. Associative: Side-by-Side Comparison

Predictive vs. Associative Model Building
Feature	Predictive	Associative
Exposure variable	No fixed exposure	Always in the model
Covariate selection	Based on statistical fit	Based on confounding assessment
Automated methods	Useful (with caution)	Generally inappropriate
10 percent change rule	Not used	Primary tool
Interaction terms	Include if improves prediction	Include if effect modification is present
Primary criterion	Adj. R2, AIC, BIC	Validity of exposure b
Parsimony	Fewer variables = less overfitting	Fewer variables = more efficient, if not confounders

5. Cross-Validation: Evaluating Model Reliability

5.1 Why Cross-Validate?

A model that fits the training data well may perform poorly on new data (overfitting). Cross-validation estimates how well the model would perform on data it has not seen.

The simplest approach is k-fold cross-validation:

Randomly split the data into \(k\) folds of (approximately) equal size
For each fold, train the model on the other \(k - 1\) folds and predict on the held-out fold
Average the prediction error across all \(k\) folds

# 10-fold cross-validation comparison
set.seed(1220)
n <- nrow(brfss_ms)
k_folds <- 10
fold_id <- sample(rep(1:k_folds, length.out = n))

# Compare a small model, medium model, and full model
cv_results <- map_dfr(1:k_folds, \(fold) {
  train <- brfss_ms[fold_id != fold, ]
  test  <- brfss_ms[fold_id == fold, ]

  # Small model
  m_small <- lm(physhlth_days ~ menthlth_days + gen_health, data = train)
  pred_small <- predict(m_small, newdata = test)

  # Medium model
  m_med <- lm(physhlth_days ~ menthlth_days + gen_health + exercise + age + sleep_hrs,
              data = train)
  pred_med <- predict(m_med, newdata = test)

  # Full model
  m_full <- lm(physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education +
                 exercise + gen_health + income_cat + bmi, data = train)
  pred_full <- predict(m_full, newdata = test)

  tibble(
    fold = fold,
    RMSE_small = sqrt(mean((test$physhlth_days - pred_small)^2)),
    RMSE_medium = sqrt(mean((test$physhlth_days - pred_med)^2)),
    RMSE_full = sqrt(mean((test$physhlth_days - pred_full)^2))
  )
})

cv_summary <- cv_results |>
  summarise(
    across(starts_with("RMSE"), \(x) round(mean(x), 3))
  )

tribble(
  ~Model, ~Predictors, ~`CV RMSE`,
  "Small", "menthlth_days + gen_health", cv_summary$RMSE_small,
  "Medium", "+ exercise + age + sleep_hrs", cv_summary$RMSE_medium,
  "Full", "All 9 predictors", cv_summary$RMSE_full
) |>
  kable(caption = "10-Fold Cross-Validation: Out-of-Sample RMSE") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

10-Fold Cross-Validation: Out-of-Sample RMSE
Model	Predictors	CV RMSE
Small	menthlth_days + gen_health	6.362
Medium	+ exercise + age + sleep_hrs	6.334
Full	All 9 predictors	6.334

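Interpretation: RMSE (root mean squared error) is the typical size of a prediction error, in the units of the outcome (days). A lower CV RMSE indicates better out-of-sample prediction. Here the medium model improves on the small model only slightly (6.334 vs. 6.362 days), and the full model matches the medium model to three decimals (6.334), so the additional predictors (sex, education, income_cat, bmi) do not improve out-of-sample prediction. When a larger model does no better under cross-validation, the more parsimonious model is preferred; a larger model that did worse would be a sign of overfitting.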
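Written out, the quantity computed in the cross-validation code is the following, where \(F_j\) is the set of observations in fold \(j\), \(n_j\) is its size, and \(\hat{y}_i^{(-j)}\) is the prediction from the model trained without fold \(j\) (this notation is introduced here for illustration):

```latex
\[
\text{RMSE}_j
  = \sqrt{\frac{1}{n_j} \sum_{i \in F_j} \left( y_i - \hat{y}_i^{(-j)} \right)^2},
\qquad
\text{CV RMSE} = \frac{1}{k} \sum_{j=1}^{k} \text{RMSE}_j
\]
```

Note that the code averages the fold-level RMSEs; pooling all squared errors before taking the square root is a common alternative and gives a slightly different number.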

Summary of Key Concepts

Concept	Key Point
Maximum model	Start with all candidate predictors from literature and research question
Overfitting vs. underfitting	Overfitting = more variance; underfitting = bias
Parsimony	Simplest model that captures the important relationships
\(R^2\)	Never decreases when variables are added; useless alone for comparing models of different sizes
Adjusted \(R^2\)	Penalizes complexity; maximize it
AIC	Balances fit and complexity; minimize it
BIC	Heavier penalty than AIC; favors simpler models; minimize it
Partial F-test	Compares reduced to maximum model
Best subsets	Exhaustive search; `leaps::regsubsets()`
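Backward elimination	Start full, drop the least useful term (highest p-value, or lowest resulting AIC with `step()`); `step(direction = "backward")`
Forward selection	Start empty, add the most useful term (lowest p-value, or lowest resulting AIC with `step()`); `step(direction = "forward")`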
Stepwise	Forward + backward at each step; `step(direction = "both")`
Caution	Automated methods ignore research questions and inflate Type I error
Associative models	Exposure stays in model; use 10% change-in-estimate for covariates
Cross-validation	Estimates out-of-sample performance; protects against overfitting

Part 2: In-Class Lab Activity

EPI 553 — Model Selection Lab Due: End of class, March 24, 2026

Instructions

In this lab, you will practice both predictive and associative model selection using the BRFSS 2020 dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.

Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.

Data for the Lab

Use the saved analytic dataset from today’s lecture.

Variable	Description	Type
`physhlth_days`	Physically unhealthy days in past 30	Continuous (0–30)
`menthlth_days`	Mentally unhealthy days in past 30	Continuous (0–30)
`sleep_hrs`	Sleep hours per night	Continuous (1–14)
`age`	Age in years (capped at 80)	Continuous
`sex`	Sex (Male/Female)	Factor
`education`	Education level (4 categories)	Factor
`exercise`	Any physical activity (Yes/No)	Factor
`gen_health`	General health status (5 categories)	Factor
`income_cat`	Household income (1–8 ordinal)	Numeric
`bmi`	Body mass index	Continuous

library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(car)
library(leaps)
library(MASS)

brfss_ms <- readRDS(
  "brfss_ms_2020.rds"
)
#Task 1a
# Fit the maximum model
mod_max <- lm(
  physhlth_days ~ menthlth_days + sleep_hrs + age + sex +
    education + exercise + gen_health + income_cat + bmi,
  data = brfss_ms
)

# Extract model fit statistics
summary(mod_max)$r.squared

## [1] 0.3860047

summary(mod_max)$adj.r.squared

## [1] 0.3842803

AIC(mod_max)

## [1] 32645.79

BIC(mod_max)

## [1] 32750.06

#Task 1b
# Fit minimal model
mod_min <- lm(physhlth_days ~ menthlth_days + age, data = brfss_ms)

# Extract fit statistics
c(
  R2 = summary(mod_min)$r.squared,
  Adjusted_R2 = summary(mod_min)$adj.r.squared,
  AIC = AIC(mod_min),
  BIC = BIC(mod_min)
)

##           R2  Adjusted_R2          AIC          BIC 
## 1.150016e-01 1.146474e-01 3.444978e+04 3.447585e+04

#Task 2a
best_fit <- regsubsets(
  physhlth_days ~ menthlth_days + sleep_hrs + age + sex +
    education + exercise + gen_health + income_cat + bmi,
  data = brfss_ms,
  nvmax = 15
)

best_summary <- summary(best_fit)

subset_metrics <- tibble(
  p = seq_along(best_summary$adjr2),
  Adj_R2 = best_summary$adjr2
)

ggplot(subset_metrics, aes(x = p, y = Adj_R2)) +
  geom_line(linewidth = 1) +
  geom_point(size = 3) +
  geom_vline(
    xintercept = which.max(best_summary$adjr2),
    linetype = "dashed"
  ) +
  labs(
    title = "Adjusted R2 by Model Size",
    x = "Number of Variables",
    y = "Adjusted R2"
  ) +
  theme_minimal()

#Task 2b
subset_metrics <- tibble(
  p = seq_along(best_summary$bic),
  BIC = best_summary$bic
)

ggplot(subset_metrics, aes(x = p, y = BIC)) +
  geom_line(linewidth = 1) +
  geom_point(size = 3) +
  geom_vline(
    xintercept = which.min(best_summary$bic),
    linetype = "dashed"
  ) +
  labs(
    title = "BIC by Model Size",
    x = "Number of Variables",
    y = "BIC"
  ) +
  theme_minimal()

#Task 2c
# Step 1: Best BIC model size
best_bic <- which.min(best_summary$bic)

# Step 2: Extract coefficients
bic_coefs <- coef(best_fit, best_bic)

# Step 3: Get variable names (remove intercept)
vars <- names(bic_coefs)[-1]

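# Step 4: Map dummy-coded terms back to their variables by stripping level suffixes.
# Note: this pattern only covers the exercise and gen_health levels; levels of other
# factors (e.g., sex, education) would need to be added if those dummies were selected.
vars_clean <- unique(gsub("(Yes|No|Good|Very good|Fair|Poor)$", "", vars))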

vars_clean  # check variables

## [1] "menthlth_days" "sleep_hrs"     "age"           "exercise"     
## [5] "gen_health"    "income_cat"
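Stripping level names with a regular expression works here but is easy to break when a different set of dummies is selected. A less fragile sketch, assuming the list of candidate variable names is known in advance, is to keep every candidate that is a prefix of at least one selected coefficient name. The helper name and the example inputs below are hypothetical:

```r
# Hypothetical helper: map dummy-coded coefficient names back to model terms
terms_from_coef_names <- function(coef_names, candidates) {
  is_selected <- vapply(
    candidates,
    function(v) any(startsWith(coef_names, v)),
    logical(1)
  )
  candidates[is_selected]
}

# Illustrative inputs (not the actual regsubsets() output)
candidates <- c("menthlth_days", "sleep_hrs", "age", "sex", "education",
                "exercise", "gen_health", "income_cat", "bmi")
selected_coefs <- c("menthlth_days", "exerciseYes",
                    "gen_healthFair", "gen_healthPoor")

terms_from_coef_names(selected_coefs, candidates)
```

Two caveats. Prefix matching assumes no candidate name is itself a prefix of another (for example, age and age_group would collide). And because regsubsets() treats each dummy as a separate variable, it may select only some levels of a factor; refitting with the whole factor, as in Step 6, can then give a model with more coefficients than the subset that BIC actually chose.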

# Step 5: Build correct formula
form_bic <- as.formula(
  paste("physhlth_days ~", paste(vars_clean, collapse = " + "))
)

form_bic

## physhlth_days ~ menthlth_days + sleep_hrs + age + exercise + 
##     gen_health + income_cat

# Step 6: Fit model
mod_bic <- lm(form_bic, data = brfss_ms)

# Step 7: Show results
summary(mod_bic)

## 
## Call:
## lm(formula = form_bic, data = brfss_ms)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.5956  -2.3238  -0.9004   0.0081  30.3580 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.18636    0.66634   4.782 1.79e-06 ***
## menthlth_days        0.14608    0.01204  12.135  < 2e-16 ***
## sleep_hrs           -0.19515    0.06720  -2.904  0.00370 ** 
## age                  0.01740    0.00544   3.198  0.00139 ** 
## exerciseYes         -1.28774    0.23360  -5.513 3.71e-08 ***
## gen_healthVery good  0.46171    0.24411   1.891  0.05863 .  
## gen_healthGood       1.63676    0.26000   6.295 3.33e-10 ***
## gen_healthFair       7.07865    0.36164  19.573  < 2e-16 ***
## gen_healthPoor      20.50841    0.54234  37.815  < 2e-16 ***
## income_cat          -0.16570    0.04719  -3.511  0.00045 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.32 on 4990 degrees of freedom
## Multiple R-squared:  0.3857, Adjusted R-squared:  0.3846 
## F-statistic: 348.1 on 9 and 4990 DF,  p-value: < 2.2e-16

#Task 3a
# Fit maximum model
mod_max <- lm(
  physhlth_days ~ menthlth_days + sleep_hrs + age + sex +
    education + exercise + gen_health + income_cat + bmi,
  data = brfss_ms
)

# Perform backward elimination using AIC
mod_backward <- step(
  mod_max,
  direction = "backward",
  trace = 1
)

## Start:  AIC=18454.4
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education + 
##     exercise + gen_health + income_cat + bmi
## 
##                 Df Sum of Sq    RSS   AIC
## - education      3        29 199231 18449
## - bmi            1        32 199234 18453
## - sex            1        43 199245 18454
## <none>                       199202 18454
## - sleep_hrs      1       329 199530 18461
## - age            1       434 199636 18463
## - income_cat     1       521 199722 18466
## - exercise       1      1174 200376 18482
## - menthlth_days  1      5898 205100 18598
## - gen_health     4     66437 265639 19886
## 
## Step:  AIC=18449.13
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + exercise + 
##     gen_health + income_cat + bmi
## 
##                 Df Sum of Sq    RSS   AIC
## - bmi            1        32 199262 18448
## - sex            1        40 199270 18448
## <none>                       199231 18449
## - sleep_hrs      1       327 199557 18455
## - age            1       439 199670 18458
## - income_cat     1       520 199751 18460
## - exercise       1      1151 200381 18476
## - menthlth_days  1      5929 205159 18594
## - gen_health     4     66459 265690 19881
## 
## Step:  AIC=18447.92
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + exercise + 
##     gen_health + income_cat
## 
##                 Df Sum of Sq    RSS   AIC
## - sex            1        42 199305 18447
## <none>                       199262 18448
## - sleep_hrs      1       334 199596 18454
## - age            1       427 199690 18457
## - income_cat     1       514 199776 18459
## - exercise       1      1222 200484 18477
## - menthlth_days  1      5921 205184 18592
## - gen_health     4     67347 266609 19896
## 
## Step:  AIC=18446.98
## physhlth_days ~ menthlth_days + sleep_hrs + age + exercise + 
##     gen_health + income_cat
## 
##                 Df Sum of Sq    RSS   AIC
## <none>                       199305 18447
## - sleep_hrs      1       337 199641 18453
## - age            1       409 199713 18455
## - income_cat     1       492 199797 18457
## - exercise       1      1214 200518 18475
## - menthlth_days  1      5882 205186 18590
## - gen_health     4     67980 267285 19906

# View final model
summary(mod_backward)

## 
## Call:
## lm(formula = physhlth_days ~ menthlth_days + sleep_hrs + age + 
##     exercise + gen_health + income_cat, data = brfss_ms)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.5956  -2.3238  -0.9004   0.0081  30.3580 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.18636    0.66634   4.782 1.79e-06 ***
## menthlth_days        0.14608    0.01204  12.135  < 2e-16 ***
## sleep_hrs           -0.19515    0.06720  -2.904  0.00370 ** 
## age                  0.01740    0.00544   3.198  0.00139 ** 
## exerciseYes         -1.28774    0.23360  -5.513 3.71e-08 ***
## gen_healthVery good  0.46171    0.24411   1.891  0.05863 .  
## gen_healthGood       1.63676    0.26000   6.295 3.33e-10 ***
## gen_healthFair       7.07865    0.36164  19.573  < 2e-16 ***
## gen_healthPoor      20.50841    0.54234  37.815  < 2e-16 ***
## income_cat          -0.16570    0.04719  -3.511  0.00045 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.32 on 4990 degrees of freedom
## Multiple R-squared:  0.3857, Adjusted R-squared:  0.3846 
## F-statistic: 348.1 on 9 and 4990 DF,  p-value: < 2.2e-16

#Task 3b
# Null (intercept-only) model
mod_null <- lm(physhlth_days ~ 1, data = brfss_ms)

# Maximum model (same as before)
mod_max <- lm(
  physhlth_days ~ menthlth_days + sleep_hrs + age + sex +
    education + exercise + gen_health + income_cat + bmi,
  data = brfss_ms
)

# Forward selection using AIC
mod_forward <- step(
  mod_null,
  scope = list(lower = mod_null, upper = mod_max),
  direction = "forward",
  trace = 1
)

## Start:  AIC=20865.24
## physhlth_days ~ 1
## 
##                 Df Sum of Sq    RSS   AIC
## + gen_health     4    115918 208518 18663
## + menthlth_days  1     29743 294693 20387
## + exercise       1     19397 305038 20559
## + income_cat     1     19104 305332 20564
## + education      3      5906 318530 20779
## + age            1      4173 320263 20803
## + bmi            1      4041 320395 20805
## + sleep_hrs      1      3717 320719 20810
## <none>                       324435 20865
## + sex            1         7 324429 20867
## 
## Step:  AIC=18662.93
## physhlth_days ~ gen_health
## 
##                 Df Sum of Sq    RSS   AIC
## + menthlth_days  1    6394.9 202123 18509
## + exercise       1    1652.4 206865 18625
## + income_cat     1    1306.9 207211 18634
## + sleep_hrs      1     756.1 207762 18647
## + bmi            1      91.2 208427 18663
## <none>                       208518 18663
## + sex            1      38.5 208479 18664
## + age            1      32.2 208486 18664
## + education      3     145.0 208373 18666
## 
## Step:  AIC=18509.19
## physhlth_days ~ gen_health + menthlth_days
## 
##              Df Sum of Sq    RSS   AIC
## + exercise    1   1650.52 200472 18470
## + income_cat  1    817.89 201305 18491
## + age         1    464.73 201658 18500
## + sleep_hrs   1    257.79 201865 18505
## + bmi         1     90.51 202032 18509
## <none>                    202123 18509
## + sex         1      3.00 202120 18511
## + education   3    111.58 202011 18512
## 
## Step:  AIC=18470.19
## physhlth_days ~ gen_health + menthlth_days + exercise
## 
##              Df Sum of Sq    RSS   AIC
## + income_cat  1    509.09 199963 18460
## + age         1    333.74 200139 18464
## + sleep_hrs   1    253.06 200219 18466
## <none>                    200472 18470
## + bmi         1     21.21 200451 18472
## + sex         1     10.74 200462 18472
## + education   3     26.94 200445 18476
## 
## Step:  AIC=18459.48
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat
## 
##             Df Sum of Sq    RSS   AIC
## + age        1    321.97 199641 18453
## + sleep_hrs  1    250.25 199713 18455
## <none>                   199963 18460
## + bmi        1     27.98 199935 18461
## + sex        1     27.17 199936 18461
## + education  3     26.66 199937 18465
## 
## Step:  AIC=18453.42
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat + 
##     age
## 
##             Df Sum of Sq    RSS   AIC
## + sleep_hrs  1    336.79 199305 18447
## <none>                   199641 18453
## + sex        1     45.31 199596 18454
## + bmi        1     42.00 199599 18454
## + education  3     22.62 199619 18459
## 
## Step:  AIC=18446.98
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat + 
##     age + sleep_hrs
## 
##             Df Sum of Sq    RSS   AIC
## <none>                   199305 18447
## + sex        1    42.328 199262 18448
## + bmi        1    34.434 199270 18448
## + education  3    24.800 199280 18452

# View final model
summary(mod_forward)

## 
## Call:
## lm(formula = physhlth_days ~ gen_health + menthlth_days + exercise + 
##     income_cat + age + sleep_hrs, data = brfss_ms)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.5956  -2.3238  -0.9004   0.0081  30.3580 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.18636    0.66634   4.782 1.79e-06 ***
## gen_healthVery good  0.46171    0.24411   1.891  0.05863 .  
## gen_healthGood       1.63676    0.26000   6.295 3.33e-10 ***
## gen_healthFair       7.07865    0.36164  19.573  < 2e-16 ***
## gen_healthPoor      20.50841    0.54234  37.815  < 2e-16 ***
## menthlth_days        0.14608    0.01204  12.135  < 2e-16 ***
## exerciseYes         -1.28774    0.23360  -5.513 3.71e-08 ***
## income_cat          -0.16570    0.04719  -3.511  0.00045 ***
## age                  0.01740    0.00544   3.198  0.00139 ** 
## sleep_hrs           -0.19515    0.06720  -2.904  0.00370 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.32 on 4990 degrees of freedom
## Multiple R-squared:  0.3857, Adjusted R-squared:  0.3846 
## F-statistic: 348.1 on 9 and 4990 DF,  p-value: < 2.2e-16

#Task 3c
method_comparison <- tibble(
  Method = c("Backward (AIC)", "Forward (AIC)", "Stepwise (AIC)"),
  Variables = c(
    length(coef(mod_backward)) - 1,
    length(coef(mod_forward)) - 1,
    length(coef(mod_stepwise)) - 1
  ),
  Adj_R2 = c(
    round(summary(mod_backward)$adj.r.squared, 4),
    round(summary(mod_forward)$adj.r.squared, 4),
    round(summary(mod_stepwise)$adj.r.squared, 4)
  ),
  AIC = c(
    round(AIC(mod_backward), 1),
    round(AIC(mod_forward), 1),
    round(AIC(mod_stepwise), 1)
  ),
  BIC = c(
    round(BIC(mod_backward), 1),
    round(BIC(mod_forward), 1),
    round(BIC(mod_stepwise), 1)
  )
)

method_comparison |>
  kable(
    col.names = c("Method", "Variables", "Adjusted R2", "AIC", "BIC"),
    caption = "Comparison of Variable Selection Methods"
  ) |>
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  )

Comparison of Variable Selection Methods
Method	Variables	Adjusted R2	AIC	BIC
Backward (AIC)	9	0.3846	32638.4	32710.1
Forward (AIC)	9	0.3846	32638.4	32710.1
Stepwise (AIC)	9	0.3846	32638.4	32710.1

#Task 4a 
# Fit crude model
mod_crude <- lm(
  physhlth_days ~ sleep_hrs,
  data = brfss_ms
)

# View full model output
summary(mod_crude)

## 
## Call:
## lm(formula = physhlth_days ~ sleep_hrs, data = brfss_ms)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.279 -3.486 -2.854 -1.590 30.938 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.91103    0.59591   13.28  < 2e-16 ***
## sleep_hrs   -0.63209    0.08306   -7.61 3.25e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.011 on 4998 degrees of freedom
## Multiple R-squared:  0.01146,    Adjusted R-squared:  0.01126 
## F-statistic: 57.92 on 1 and 4998 DF,  p-value: 3.245e-14

# Extract sleep coefficient only
sleep_coef <- coef(mod_crude)["sleep_hrs"]
sleep_coef

##  sleep_hrs 
## -0.6320888

#Task 4b
# Step 1: Fit maximum associative model
mod_assoc_max <- lm(
  physhlth_days ~ sleep_hrs + menthlth_days + age + sex +
    education + exercise + gen_health + income_cat + bmi,
  data = brfss_ms
)

# Step 2: Extract sleep coefficient
b_sleep_max <- coef(mod_assoc_max)["sleep_hrs"]

# Step 3: Compute 10 percent interval
lower <- b_sleep_max - 0.10 * abs(b_sleep_max)
upper <- b_sleep_max + 0.10 * abs(b_sleep_max)

b_sleep_max

##  sleep_hrs 
## -0.1929713

lower

##  sleep_hrs 
## -0.2122684

upper

##  sleep_hrs 
## -0.1736742

# Step 4: List covariates to test
covariates <- c(
  "menthlth_days", "age", "sex", "education",
  "exercise", "gen_health", "income_cat", "bmi"
)

# Step 5: Systematically remove one at a time
assoc_table <- map_dfr(covariates, function(cov) {

  remaining <- setdiff(covariates, cov)

  form <- as.formula(
    paste("physhlth_days ~ sleep_hrs +", paste(remaining, collapse = " + "))
  )

  mod_reduced <- lm(form, data = brfss_ms)

  b_reduced <- coef(mod_reduced)["sleep_hrs"]

  pct_change <- (b_reduced - b_sleep_max) / abs(b_sleep_max) * 100

  tibble(
    Removed = cov,
    Sleep_max = round(b_sleep_max, 4),
    Sleep_reduced = round(b_reduced, 4),
    Percent_change = round(pct_change, 1),
    Confounder = ifelse(abs(pct_change) > 10, "Yes", "No")
  )
})

# Step 6: Display table
assoc_table |>
  kable(
    col.names = c(
      "Removed variable",
      "Sleep (max)",
      "Sleep (reduced)",
      "Percent change",
      "Confounder"
    ),
    caption = "Confounder Assessment Using 10 Percent Change Rule"
  ) |>
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  )

Confounder Assessment Using 10 Percent Change Rule
Removed variable	Sleep (max)	Sleep (reduced)	Percent change	Confounder
menthlth_days	-0.193	-0.2894	-50.0	Yes
age	-0.193	-0.1646	14.7	Yes
sex	-0.193	-0.1937	-0.4	No
education	-0.193	-0.1923	0.3	No
exercise	-0.193	-0.1957	-1.4	No
gen_health	-0.193	-0.3593	-86.2	Yes
income_cat	-0.193	-0.1936	-0.3	No
bmi	-0.193	-0.1950	-1.0	No

#Task 4c
# Final model with confounders
mod_final <- lm(
  physhlth_days ~ sleep_hrs + menthlth_days + age + gen_health,
  data = brfss_ms
)

# View model
summary(mod_final)

## 
## Call:
## lm(formula = physhlth_days ~ sleep_hrs + menthlth_days + age + 
##     gen_health, data = brfss_ms)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.4035  -2.3768  -1.0143  -0.1358  30.0911 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          0.815054   0.561490   1.452 0.146678    
## sleep_hrs           -0.202568   0.067516  -3.000 0.002710 ** 
## menthlth_days        0.151230   0.012037  12.564  < 2e-16 ***
## age                  0.020474   0.005446   3.760 0.000172 ***
## gen_healthVery good  0.511346   0.245127   2.086 0.037025 *  
## gen_healthGood       1.915069   0.257905   7.425 1.31e-13 ***
## gen_healthFair       7.768554   0.348846  22.269  < 2e-16 ***
## gen_healthPoor      21.486815   0.526614  40.802  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.35 on 4992 degrees of freedom
## Multiple R-squared:  0.3796, Adjusted R-squared:  0.3787 
## F-statistic: 436.3 on 7 and 4992 DF,  p-value: < 2.2e-16

# Sleep coefficient
coef(mod_final)["sleep_hrs"]

##  sleep_hrs 
## -0.2025685

# 95% CI for sleep
confint(mod_final)["sleep_hrs", ]

##       2.5 %      97.5 % 
## -0.33492945 -0.07020752

Task 1: Maximum Model and Criteria Comparison (15 points)

1a. (5 pts) Fit the maximum model predicting physhlth_days from all 9 candidate predictors. Report \(R^2\), Adjusted \(R^2\), AIC, and BIC.

R2 = 0.3860
Adjusted R2 = 0.3843
AIC = 32645.79
BIC = 32750.06

The maximum model including all nine predictors produced an R2 of 0.3860 and an adjusted R2 of 0.3843, indicating that approximately 38.6% of the variability in physical health days is explained by the model. The AIC and BIC values were 32645.79 and 32750.06, respectively.

1b. (5 pts) Now fit a “minimal” model using only menthlth_days and age. Report the same four criteria. How do the two models compare?

R2 = 0.1150
Adjusted R2 = 0.1146
AIC = 34449.78
BIC = 34475.85

The maximum model (R2 = 0.3860, Adjusted R2 = 0.3843, AIC = 32645.79, BIC = 32750.06) performs substantially better than the minimal model (R2 = 0.1150, Adjusted R2 = 0.1146, AIC = 34449.78, BIC = 34475.85). The maximum model explains a much larger proportion of the variability in physical health days and has considerably lower AIC and BIC values, indicating a better overall fit. While the minimal model is simpler and more parsimonious, it provides a much poorer fit to the data.

1c. (5 pts) Explain why \(R^2\) is a poor criterion for comparing these two models. What makes Adjusted \(R^2\), AIC, and BIC better choices?

R2 is a poor criterion for comparing these models because it always increases (or remains unchanged) when additional predictors are added, regardless of whether those predictors are meaningful. This leads to a preference for overly complex models and increases the risk of overfitting.

Adjusted R2 improves upon this by incorporating a penalty for the number of predictors, so it increases only when an added variable lowers the residual mean square (for a single added coefficient, when its |t| statistic exceeds 1).

AIC and BIC are preferable because they explicitly balance model fit and complexity through penalty terms. BIC applies a stronger penalty than AIC, favoring more parsimonious models. These criteria are therefore better suited for selecting models that generalize well to new data.
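
The penalties described above can be made concrete by computing each criterion by hand. The following is a minimal sketch on R's built-in mtcars data (an illustrative stand-in, not the BRFSS sample). It assumes the Gaussian log-likelihood that AIC() and BIC() use for lm objects, and the object names are illustrative.

```r
# Illustrative model on built-in data (not the BRFSS analysis)
fit <- lm(mpg ~ wt + hp, data = mtcars)

n   <- nobs(fit)                 # sample size
k   <- length(coef(fit))         # regression coefficients, including intercept
rss <- sum(residuals(fit)^2)     # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

r2     <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k)   # penalizes each added coefficient

# Gaussian log-likelihood at the MLE; sigma^2 counts as one extra parameter
loglik <- -n / 2 * (log(2 * pi) + log(rss / n) + 1)
aic <- -2 * loglik + 2 * (k + 1)
bic <- -2 * loglik + log(n) * (k + 1)

c(R2 = r2, Adj_R2 = adj_r2, AIC = aic, BIC = bic)
```

BIC's per-parameter penalty, log(n), exceeds AIC's penalty of 2 whenever n is larger than about 7.4. With n = 5000, as in this lab, log(n) is about 8.5, which is why BIC favors smaller models here.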

Task 2: Best Subsets Regression (20 points)

2a. (5 pts) Use leaps::regsubsets() to perform best subsets regression with nvmax = 15. Create a plot of Adjusted \(R^2\) vs. number of variables. At what model size does Adjusted \(R^2\) plateau?

Adjusted R2 increases rapidly with the addition of the first several variables and begins to level off around 8–9 variables. Beyond this point, additional predictors provide minimal improvement in model fit, indicating that Adjusted R2 has effectively plateaued.

2b. (5 pts) Create a plot of BIC vs. number of variables. Which model size minimizes BIC?

The BIC is minimized at approximately 7 variables. After this point, BIC begins to increase, indicating that adding more predictors no longer improves the balance between model fit and complexity.

2c. (5 pts) Identify the variables included in the BIC-best model. Fit this model explicitly using lm() and report its coefficients.

The BIC-best model includes the following variables: menthlth_days, sleep_hrs, age, exercise, gen_health, and income_cat.

The fitted model is:

physhlth_days = 3.1864 + 0.1461(menthlth_days) - 0.1952(sleep_hrs) + 0.0174(age) - 1.2877(exerciseYes) + 0.4617(gen_healthVery good) + 1.6368(gen_healthGood) + 7.0787(gen_healthFair) + 20.5084(gen_healthPoor) - 0.1657(income_cat)
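
The selection step behind this answer can be sketched as follows. The example uses the built-in mtcars data rather than the BRFSS sample, and the object names (`subsets`, `best_size`, `best_vars`) are illustrative. Note that regsubsets() works on model-matrix columns, so each dummy variable of a factor such as gen_health counts as a separate variable on its size axis.

```r
library(leaps)

# Best subsets on built-in data (illustrative; not the BRFSS analysis)
subsets <- regsubsets(mpg ~ ., data = mtcars, nvmax = 10)
subsets_summary <- summary(subsets)

# Model size (number of model-matrix columns) with the lowest BIC
best_size <- which.min(subsets_summary$bic)
best_size

# Names of the columns in the BIC-best model, excluding the intercept
best_vars <- names(coef(subsets, id = best_size))[-1]
best_vars

# Refit explicitly with lm() to get standard errors and tests
best_fit <- lm(reformulate(best_vars, response = "mpg"), data = mtcars)
coef(best_fit)
```

Because every mtcars predictor is numeric, the selected column names can be passed straight to reformulate(). With factor predictors, the selected dummy columns would first have to be mapped back to their parent factor before refitting.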

2d. (5 pts) Compare the BIC-best model to the Adjusted \(R^2\)-best model. Are they the same? If not, which would you prefer and why?

The BIC-best model and the Adjusted R2-best model are not the same. The Adjusted R2-best model includes more variables (approximately 9), while the BIC-best model is more parsimonious (approximately 7 variables).

I would prefer the BIC-best model because it imposes a stronger penalty for model complexity, reducing the risk of overfitting. Although the Adjusted R2-best model has a slightly higher value, the improvement in fit is negligible. Therefore, the BIC-best model achieves a better balance between explanatory power and parsimony.

Task 3: Automated Selection Methods (20 points)

3a. (5 pts) Perform backward elimination using step() with AIC as the criterion. Which variables are removed? Which remain?

Using backward elimination with AIC, the variables removed were education, bmi, and sex.

The variables that remained in the final model were menthlth_days, sleep_hrs, age, exercise, gen_health, and income_cat.

3b. (5 pts) Perform forward selection using step(). Does it arrive at the same model as backward elimination?

Forward selection using AIC arrives at the same final model as backward elimination. The selected variables are menthlth_days, sleep_hrs, age, exercise, gen_health, and income_cat.

3c. (5 pts) Compare the backward, forward, and stepwise results in a single table showing the number of variables, Adjusted \(R^2\), AIC, and BIC for each.

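The backward, forward, and stepwise AIC methods all produced the same final model. It contains six predictors (menthlth_days, sleep_hrs, age, exercise, gen_health, and income_cat); the count of 9 in the table is the number of estimated slope coefficients, because the five-level gen_health factor contributes four dummy variables. The models have identical performance metrics, with Adjusted R2 = 0.3846, AIC = 32638.4, and BIC = 32710.1.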

This indicates that the three search strategies agree for this dataset. Agreement across methods on one sample does not by itself show that the same predictors would be selected in a different sample.
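
Two details in the output above are easy to misread: step() prints AIC values near 18447 while AIC() returns 32638.4, and the count of 9 in the table comes from length(coef()) - 1. A minimal sketch on the built-in iris data (illustrative, not the BRFSS sample) shows both. For lm fits, step() relies on extractAIC(), which drops additive constants that AIC() keeps, so for a fixed sample size the two scales differ by a constant and rank models identically.

```r
# Species is a 3-level factor, so it contributes 2 dummy coefficients
fit <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)

# Predictors (model terms) versus estimated slope coefficients
n_terms <- length(attr(terms(fit), "term.labels"))  # Sepal.Width, Species
n_coefs <- length(coef(fit)) - 1                    # excludes the intercept
c(terms = n_terms, coefficients = n_coefs)

# AIC as printed by step() versus AIC()
n <- nobs(fit)
aic_step <- extractAIC(fit)[2]   # n * log(RSS / n) + 2 * edf
aic_full <- AIC(fit)             # -2 * logLik + 2 * (edf + 1)
aic_full - aic_step              # equals n * (1 + log(2 * pi)) + 2
```

Applied to the lab output, the constant is 5000 × (1 + log(2π)) + 2 ≈ 14191.4, and 18446.98 + 14191.4 ≈ 32638.4, which matches the AIC column of the comparison table.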

3d. (5 pts) List three reasons why you should not blindly trust the results of automated variable selection. Which of these concerns is most relevant for epidemiological research?

Three reasons not to blindly trust automated variable selection methods are:

They can lead to overfitting, especially when many predictors are considered, resulting in models that perform well on the sample data but poorly on new data.
They ignore subject-matter knowledge and may exclude important confounders or include variables that are not scientifically meaningful.
They produce unstable results, meaning small changes in the data can lead to different selected models.

The concern most relevant to epidemiological research is the second. Automated methods select variables for statistical fit, so they can drop a true confounder whose contribution to fit is small, leaving the exposure coefficient biased by uncontrolled confounding. Because these methods optimize prediction rather than the validity of the exposure estimate, they may yield incorrect conclusions about exposure–outcome relationships.
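
The instability concern can be checked empirically by rerunning the selection on bootstrap resamples and tabulating which model is chosen each time. The following is a minimal sketch on the built-in mtcars data with a hypothetical set of six candidate predictors; the seed and the number of resamples are arbitrary choices, not part of the lab.

```r
set.seed(553)  # arbitrary seed, used only for reproducibility

# Run backward elimination on B bootstrap resamples of built-in data
B <- 50
selected <- character(B)

for (b in seq_len(B)) {
  boot_rows <- sample(nrow(mtcars), replace = TRUE)
  boot_data <- mtcars[boot_rows, ]

  full_fit <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec, data = boot_data)
  step_fit <- step(full_fit, direction = "backward", trace = 0)

  # Record the selected predictors as a single sorted label
  selected[b] <- paste(sort(attr(terms(step_fit), "term.labels")),
                       collapse = " + ")
}

# How often each distinct model was chosen across resamples
head(sort(table(selected), decreasing = TRUE))
```

If one model dominates the table, the selection is relatively stable for that dataset; if the counts are spread across many different models, the single model reported by step() on the original sample deserves less trust.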

Task 4: Associative Model Building (25 points)

For this task, the exposure is sleep_hrs and the outcome is physhlth_days. You are building an associative model to estimate the effect of sleep on physical health.

4a. (5 pts) Fit the crude model: physhlth_days ~ sleep_hrs. Report the sleep coefficient.

The crude model coefficient for sleep_hrs is -0.6321, indicating that each additional hour of sleep is associated with approximately 0.63 fewer physically unhealthy days.

4b. (10 pts) Fit the maximum associative model: physhlth_days ~ sleep_hrs + [all other covariates]. Note the adjusted sleep coefficient and compute the 10% interval. Then systematically remove each covariate one at a time and determine which are confounders using the 10% rule. Present your results in a summary table.

The adjusted sleep coefficient from the maximum associative model is -0.193, with a 10 percent interval of approximately (-0.212, -0.174).

Based on the 10 percent change-in-estimate criterion, menthlth_days, age, and gen_health were identified as confounders: removing each one changed the sleep coefficient by more than 10 percent (-50.0%, 14.7%, and -86.2%, respectively). These three variables were retained regardless of statistical significance because they meaningfully altered the exposure estimate.

The remaining variables (sex, education, exercise, income_cat, and bmi) each changed the coefficient by less than 2 percent and are therefore not considered confounders.
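
The Percent change column in the table above follows the change-in-estimate formula used in the code, with the maximum-model coefficient as the reference. Written out, with gen_health as a worked case (values rounded to four decimals):

```latex
\[
\Delta\% = \frac{\hat\beta_{\text{reduced}} - \hat\beta_{\text{max}}}{\lvert \hat\beta_{\text{max}} \rvert} \times 100
\]
\[
\Delta\%_{\text{gen\_health}} = \frac{-0.3593 - (-0.1930)}{0.1930} \times 100 \approx -86.2
\]
```

A negative value means the sleep coefficient becomes more negative when the covariate is removed. The 10% rule uses only the absolute value of the change.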

4c. (5 pts) Fit the final associative model including only sleep and the identified confounders. Report the sleep coefficient and its 95% CI.

In the final associative model including sleep_hrs, menthlth_days, age, and gen_health, the estimated coefficient for sleep_hrs is -0.203.

The 95% confidence interval for sleep_hrs is (-0.335, -0.070), which excludes zero, indicating that greater sleep is associated with fewer physically unhealthy days after adjusting for confounding.

The gen_health coefficients in this model are interpreted relative to the reference category (Excellent).
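
The interval returned by confint(mod_final) can be reproduced from the estimate and standard error in the summary output. With 4992 residual degrees of freedom, the critical value is approximately 1.960:

```latex
\[
\hat\beta_{\text{sleep}} \pm t_{0.975,\,4992} \times \mathrm{SE}(\hat\beta_{\text{sleep}})
= -0.2026 \pm 1.960 \times 0.0675
\approx -0.2026 \pm 0.1323
\]
\[
\text{95\% CI} \approx (-0.335,\; -0.070)
\]
```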

4d. (5 pts) A reviewer asks: “Why didn’t you just use stepwise selection?” Write a 3–4 sentence response explaining why automated selection is inappropriate for this associative analysis.

Stepwise selection is not appropriate for this associative analysis because it prioritizes statistical fit rather than causal validity. It may exclude important confounders if they do not improve model fit, leading to biased estimates of the exposure effect.

In contrast, associative modeling aims to obtain a valid estimate of the exposure–outcome relationship and therefore requires careful control of confounding based on subject-matter knowledge and changes in the exposure coefficient. The 10 percent change-in-estimate approach ensures that variables are retained based on their impact on the exposure coefficient rather than on predictive performance.

Task 5: Synthesis (20 points)

5a. (10 pts) You have now built two models for the same data:

A predictive model (from Task 2 or 3, the best model by AIC/BIC)
An associative model (from Task 4, focused on sleep)

Compare these two models: Do they include the same variables? Is the sleep coefficient similar? Why might they differ?

The predictive and associative models do not include the same variables. The predictive model (menthlth_days, sleep_hrs, age, exercise, gen_health, and income_cat) was selected to optimize overall fit, whereas the associative model includes only sleep_hrs and the identified confounders (menthlth_days, age, and gen_health). Exercise and income_cat appear only in the predictive model.

The sleep coefficient is similar in the two models: -0.195 in the predictive model and -0.203 in the associative model. Both are far smaller in magnitude than the crude estimate of -0.632, indicating substantial confounding of the crude association.

These differences arise because predictive models are designed to maximize accuracy and may include variables that improve prediction but are not causally relevant. In contrast, associative models aim to estimate an unbiased exposure effect and therefore include only variables necessary to control for confounding. As a result, the associative model provides a more valid estimate of the relationship between sleep and physical health.
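
The gap between a crude and an adjusted coefficient can be illustrated with simulated data in which the true exposure coefficient is known. This is a hypothetical sketch, not the BRFSS analysis: the variable names, effect sizes, and seed are all assumptions chosen for illustration. Unlike the lab data, where adjustment shrinks the estimate, the confounding here is strong enough to flip the sign of the crude estimate.

```r
set.seed(553)  # arbitrary seed, used only for reproducibility

# Simulated data (hypothetical): z confounds the x-y association
n <- 5000
z <- rnorm(n)                        # confounder
x <- 0.8 * z + rnorm(n)              # exposure depends on z
y <- -0.2 * x + 1.5 * z + rnorm(n)   # true exposure coefficient is -0.2

crude    <- lm(y ~ x)
adjusted <- lm(y ~ x + z)

b_crude <- coef(crude)["x"]
b_adj   <- coef(adjusted)["x"]

# Percent change in the exposure coefficient when z is dropped
pct_change <- (b_crude - b_adj) / abs(b_adj) * 100

c(crude = unname(b_crude), adjusted = unname(b_adj),
  pct_change = unname(pct_change))
```

The adjusted coefficient lands near the true value of -0.2, while the crude coefficient is pulled in the positive direction by z. A purely predictive criterion would also keep z in this example, but it gives no guarantee of doing so when a confounder adds little to overall fit.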

5b. (10 pts) Write a 4–5 sentence paragraph for a public health audience describing the results of your associative model. Include:

The adjusted effect of sleep on physical health days
Which variables needed to be accounted for (confounders)
The direction and approximate magnitude of the association
A caveat about cross-sectional data

Do not use statistical jargon.

After accounting for differences in mental health, age, and overall general health, getting more sleep was associated with fewer days of poor physical health. On average, each additional hour of sleep was linked to about 0.2 fewer unhealthy days per month. This suggests that better sleep may be related to improved physical well-being.

However, because these data are cross-sectional, we cannot determine whether more sleep leads to better health or if healthier individuals simply tend to sleep more. There may also be unmeasured factors influencing both sleep and health, so some residual confounding may remain. These results may not generalize beyond the population represented in this dataset.

End of Lab Activity

Model Selection

EPI 553 — Principles of Statistical Inference II (Spring 2026)

Jessica Okoroji

March 24, 2026