Introduction

In previous lectures, we learned how to fit multiple linear regression models, include dummy variables for categorical predictors, test for interactions, and assess confounding. But we have not yet addressed a fundamental question: how do we decide which variables belong in the model?

This question has different answers depending on the goal of the analysis:

Goal	What matters	Variable selection driven by
Prediction	Model accuracy and reliability in new data	Statistical criteria (Adj. \(R^2\), AIC, BIC, cross-validation)
Association	Validity of the exposure coefficient	Subject-matter knowledge, confounding assessment, 10% rule

In predictive modeling, we search for the subset of variables that best predicts \(Y\) without overfitting. In associative modeling, the exposure variable is always in the model, and we decide which covariates to include based on whether they are confounders.

This lecture covers both approaches, with emphasis on when each is appropriate and the pitfalls of automated selection.

Setup and Data

library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(car)
library(leaps)
library(MASS)

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

The BRFSS 2020 Dataset

We continue with the BRFSS 2020 dataset, predicting physically unhealthy days from a pool of candidate predictors.

brfss_full <- read_xpt(
  "~/Desktop/EPI553/data/LLCP2020.XPT "
) |>
  clean_names()

brfss_ms <- brfss_full |>
  mutate(
    # Outcome
    physhlth_days = case_when(
      physhlth == 88                  ~ 0,
      physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
      TRUE                           ~ NA_real_
    ),
    # Candidate predictors
    menthlth_days = case_when(
      menthlth == 88                  ~ 0,
      menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
      TRUE                           ~ NA_real_
    ),
    sleep_hrs = case_when(
      sleptim1 >= 1 & sleptim1 <= 14 ~ as.numeric(sleptim1),
      TRUE                           ~ NA_real_
    ),
    age = age80,
    sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
    education = factor(case_when(
      educa %in% c(1, 2, 3) ~ "Less than HS",
      educa == 4             ~ "HS graduate",
      educa == 5             ~ "Some college",
      educa == 6             ~ "College graduate",
      TRUE                   ~ NA_character_
    ), levels = c("Less than HS", "HS graduate", "Some college", "College graduate")),
    exercise = factor(case_when(
      exerany2 == 1 ~ "Yes",
      exerany2 == 2 ~ "No",
      TRUE          ~ NA_character_
    ), levels = c("No", "Yes")),
    gen_health = factor(case_when(
      genhlth == 1 ~ "Excellent",
      genhlth == 2 ~ "Very good",
      genhlth == 3 ~ "Good",
      genhlth == 4 ~ "Fair",
      genhlth == 5 ~ "Poor",
      TRUE         ~ NA_character_
    ), levels = c("Excellent", "Very good", "Good", "Fair", "Poor")),
    income_cat = case_when(
      income2 %in% 1:8 ~ as.numeric(income2),
      TRUE             ~ NA_real_
    ),
    bmi = ifelse(bmi5 > 0, bmi5 / 100, NA_real_)
  ) |>
  filter(
    !is.na(physhlth_days), !is.na(menthlth_days), !is.na(sleep_hrs),
    !is.na(age), age >= 18, !is.na(sex), !is.na(education),
    !is.na(exercise), !is.na(gen_health), !is.na(income_cat), !is.na(bmi)
  )

set.seed(1220)
brfss_ms <- brfss_ms |>
  dplyr::select(physhlth_days, menthlth_days, sleep_hrs, age, sex,
                education, exercise, gen_health, income_cat, bmi) |>
  slice_sample(n = 5000)

# Save for lab
saveRDS(brfss_ms,
  "~/Desktop/EPI553/data/brfss_ms2_2020.rds")

tibble(Metric = c("Observations", "Variables"),
       Value  = c(nrow(brfss_ms), ncol(brfss_ms))) |>
  kable(caption = "Analytic Dataset Dimensions") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Analytic Dataset Dimensions
Metric	Value
Observations	5000
Variables	10

We have 10 variables: 1 outcome and 9 candidate predictors. If we considered all possible subsets of the 9 predictors (ignoring interactions and transformations), there would be \(2^9 - 1 = 511\) possible models.

Part 2: In-Class Lab Activity

EPI 553 — Model Selection Lab Due: End of class, March 24, 2026

Instructions

In this lab, you will practice both predictive and associative model selection using the BRFSS 2020 dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.

Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.

Data for the Lab

Use the saved analytic dataset from today’s lecture.

Variable	Description	Type
`physhlth_days`	Physically unhealthy days in past 30	Continuous (0–30)
`menthlth_days`	Mentally unhealthy days in past 30	Continuous (0–30)
`sleep_hrs`	Sleep hours per night	Continuous (1–14)
`age`	Age in years (capped at 80)	Continuous
`sex`	Sex (Male/Female)	Factor
`education`	Education level (4 categories)	Factor
`exercise`	Any physical activity (Yes/No)	Factor
`gen_health`	General health status (5 categories)	Factor
`income_cat`	Household income (1–8 ordinal)	Numeric
`bmi`	Body mass index	Continuous

library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(car)
library(leaps)
library(MASS)

brfss_ms <- readRDS(
  "~/Desktop/EPI553/data/brfss_ms2_2020.rds"
)

glimpse(brfss_ms)

Task 1: Maximum Model and Criteria Comparison (15 points)

1a. (5 pts) Fit the maximum model predicting physhlth_days from all 9 candidate predictors. Report \(R^2\), Adjusted \(R^2\), AIC, and BIC.

R^2= 0.386 Adjusted R^2= 0.384 AIC= 32645.79 BIC= 32750.06

# The maximum model predicting 'physhlth_days with all 9 candidate predictors
mod_max <- lm(physhlth_days ~ menthlth_days + sleep_hrs + age + sex +
                education + exercise + gen_health + income_cat + bmi,
              data = brfss_ms)
# table
tidy(mod_max, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Maximum Model: All Candidate Predictors",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Maximum Model: All Candidate Predictors
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	2.6902	0.8556	3.1441	0.0017	1.0128	4.3676
menthlth_days	0.1472	0.0121	12.1488	0.0000	0.1235	0.1710
sleep_hrs	-0.1930	0.0673	-2.8679	0.0041	-0.3249	-0.0611
age	0.0180	0.0055	3.2969	0.0010	0.0073	0.0288
sexFemale	-0.1889	0.1820	-1.0376	0.2995	-0.5458	0.1680
educationHS graduate	0.2508	0.4297	0.5836	0.5595	-0.5917	1.0933
educationSome college	0.3463	0.4324	0.8009	0.4233	-0.5014	1.1940
educationCollege graduate	0.3336	0.4357	0.7657	0.4439	-0.5206	1.1878
exerciseYes	-1.2866	0.2374	-5.4199	0.0000	-1.7520	-0.8212
gen_healthVery good	0.4373	0.2453	1.7824	0.0747	-0.0437	0.9183
gen_healthGood	1.5913	0.2651	6.0022	0.0000	1.0716	2.1111
gen_healthFair	7.0176	0.3682	19.0586	0.0000	6.2957	7.7394
gen_healthPoor	20.4374	0.5469	37.3722	0.0000	19.3653	21.5095
income_cat	-0.1817	0.0503	-3.6092	0.0003	-0.2803	-0.0830
bmi	0.0130	0.0145	0.8997	0.3683	-0.0153	0.0414

# Report statistics
glance(mod_max) |>
  dplyr::select(r.squared, adj.r.squared, sigma, AIC, BIC, df.residual) |>
  mutate(across(everything(), \(x) round(x, 3))) |>
  kable(caption = "Maximum Model: Fit Statistics") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Maximum Model: Fit Statistics
r.squared	adj.r.squared	sigma	AIC	BIC	df.residual
0.386	0.384	6.321	32645.79	32750.06	4985

1b. (5 pts) Now fit a “minimal” model using only menthlth_days and age. Report the same four criteria.

Statistics for minimum model R^2= 0.115 Adjusted R^2= 0.115 AIC= 34449.78 BIC= 34475.85

# The minimum model with all candidate predictors
mod_min <- lm(physhlth_days ~ menthlth_days + age,
              data = brfss_ms)

tidy(mod_min, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Minimum Model: Mental Health Days and Age as Predictors",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Minimum Model: Mental Health Days and Age as Predictors
Term	Estimate	SE	t	CI Lower	CI Upper
(Intercept)	-1.6983	0.3641	-4.6647	-2.4121	-0.9846
menthlth_days	0.3237	0.0135	24.0149	0.2973	0.3501
age	0.0716	0.0062	11.4763	0.0594	0.0838

glance(mod_min) |>
  dplyr::select(r.squared, adj.r.squared, sigma, AIC, BIC, df.residual) |>
  mutate(across(everything(), \(x) round(x, 3))) |>
  kable(caption = "Minimum Model: Fit Statistics") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Minimum Model: Fit Statistics
r.squared	adj.r.squared	sigma	AIC	BIC	df.residual
0.115	0.115	7.58	34449.78	34475.85	4997

# Comparing a minimal model to the maximum model
mod_min <- lm(physhlth_days ~ menthlth_days + age, data = brfss_ms)

anova(mod_min, mod_max) |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Partial F-test: Minimum Model vs. Maximum Model") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Partial F-test: Minimum Model vs. Maximum Model
term	df.residual	rss	df	sumsq	statistic	p.value
physhlth_days ~ menthlth_days + age	4997	287124.8	NA	NA	NA	NA
physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education + exercise + gen_health + income_cat + bmi	4985	199201.8	12	87923	183.3551	0

How do the two models compare? The R^2 and adjusted R^2 values are larger for the maximum model and the AIC and BIC are smaller in comparison to the minimal model. Based on the larger adjusted R^2 in the maximum model, that model would be chosen instead of the minimum model. BIC and AIC also indicate that the maximum model is preferred since it has the smaller values. The partial F-test compares the minimal model to the maximum model. The F-statistic is 183.3551 with p<0.001, meaning the additional predictors in the maximum model significantly improve the model. If the p-value was greater than 0.001 we would favor the minimal model but this value indicates we are missing important variables.

1c. (5 pts) Explain why \(R^2\) is a poor criterion for comparing these two models. What makes Adjusted \(R^2\), AIC, and BIC better choices?

R^2 is a poor criterion for comparing these two models because R^2 will always increase as predictors are added. Since the maximum model has 9 predictors the R^2 will be larger in comparison to the minimum model since it has only 2 predictors even though the maximum model includes predictors that are useless.Therefore, R^2 is unreliable for comparing models. The adjusted R^2 on the other hand does not increase for every predictor added, it will decrease when a useless/uninformative predictor is added to the model. AIC and BIC are also better choices for comparison because they balance goodness of fit against complexity (BIC penalizes complexity more heavily). AIC and BIC in the maximum model are serving as baseline for comparison and since they are smaller in the maximum model they are the preferred choice.

Task 2: Best Subsets Regression (20 points)

2a. (5 pts) Use leaps::regsubsets() to perform best subsets regression with nvmax = 15. Create a plot of Adjusted \(R^2\) vs. number of variables. At what model size does Adjusted \(R^2\) plateau?

The adjusted R^2 plateaus at model size 10 variables.

# Prepare a model matrix (need numeric predictors for leaps)
# Use the formula interface approach
best_subsets <- regsubsets(
  physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education +
    exercise + gen_health + income_cat + bmi,
  data = brfss_ms,
  nvmax = 15,      # maximum number of variables to consider
  method = "exhaustive"
)

best_summary <- summary(best_subsets)

# create a plot of adjusted R^2 vs number of variables 
subset_metrics <- tibble(
  p = 1:length(best_summary$adjr2),
  `Adj. R²` = best_summary$adjr2,
  BIC = best_summary$bic,
  Cp = best_summary$cp
)

p1 <- ggplot(subset_metrics, aes(x = p, y = `Adj. R²`)) +
  geom_line(linewidth = 1, color = "steelblue") +
  geom_point(size = 3, color = "steelblue") +
  geom_vline(xintercept = which.max(best_summary$adjr2),
             linetype = "dashed", color = "tomato") +
  labs(title = "Adjusted R² by Model Size", x = "Number of Variables", y = "Adjusted R²") +
  theme_minimal(base_size = 12)

p1

# at what model size does adjusted R^2 plateau
cat("Best model by Adj. R²:", which.max(best_summary$adjr2), "variables\n")

## Best model by Adj. R²: 10 variables

2b. (5 pts) Create a plot of BIC vs. number of variables. Which model size minimizes BIC?

The model size of 8 variables minimizes BIC.

# create plot of BIC vs number of variables 
p2 <- ggplot(subset_metrics, aes(x = p, y = BIC)) +
  geom_line(linewidth = 1, color = "steelblue") +
  geom_point(size = 3, color = "steelblue") +
  geom_vline(xintercept = which.min(best_summary$bic),
             linetype = "dashed", color = "tomato") +
  labs(title = "BIC by Model Size", x = "Number of Variables", y = "BIC") +
  theme_minimal(base_size = 12)

p2

# which model size minimizes BIC
cat("Best model by BIC:", which.min(best_summary$bic), "variables\n")

## Best model by BIC: 8 variables

2c. (5 pts) Identify the variables included in the BIC-best model. Fit this model explicitly using lm() and report its coefficients.

The variables included in the BIC-best model are menthlth_days, sleep_hrs, age, exerciseYes, gen_healthgood, gen_healthfair, gen_healthpoor, and income_cat.

Model coefficients menthlth_days 0.1461 sleep_hrs -0.1951 age 0.0174 exerciseYes -1.2877 gen_healthgood 1.6368 gen_healthfair 7.0787 gen_healthpoor 20.5084 income_cat -0.1657

# Show which variables are in the BIC-best model
best_bic_idx <- which.min(best_summary$bic)
best_vars <- names(which(best_summary$which[best_bic_idx, -1]))
cat("\nVariables in BIC-best model:\n")

## 
## Variables in BIC-best model:

cat(paste(" ", best_vars), sep = "\n")

##   menthlth_days
##   sleep_hrs
##   age
##   exerciseYes
##   gen_healthGood
##   gen_healthFair
##   gen_healthPoor
##   income_cat

# fit using lm()
mod_bic <- lm(physhlth_days ~ menthlth_days + sleep_hrs + age +
    exercise + gen_health + income_cat,
  data = brfss_ms
)

tidy(mod_bic, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "BIC-Best Model Coefficients",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

BIC-Best Model Coefficients
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	3.1864	0.6663	4.7819	0.0000	1.8800	4.4927
menthlth_days	0.1461	0.0120	12.1352	0.0000	0.1225	0.1697
sleep_hrs	-0.1951	0.0672	-2.9038	0.0037	-0.3269	-0.0634
age	0.0174	0.0054	3.1981	0.0014	0.0067	0.0281
exerciseYes	-1.2877	0.2336	-5.5127	0.0000	-1.7457	-0.8298
gen_healthVery good	0.4617	0.2441	1.8914	0.0586	-0.0169	0.9403
gen_healthGood	1.6368	0.2600	6.2953	0.0000	1.1271	2.1465
gen_healthFair	7.0787	0.3616	19.5735	0.0000	6.3697	7.7876
gen_healthPoor	20.5084	0.5423	37.8149	0.0000	19.4452	21.5716
income_cat	-0.1657	0.0472	-3.5115	0.0004	-0.2582	-0.0732

2d. (5 pts) Compare the BIC-best model to the Adjusted \(R^2\)-best model. Are they the same? If not, which would you prefer and why?

The BIC-best model includes 8 variables and the adjusted R^2-best model includes about 10 variables so they are not the same. I would prefer the BIC-best model because it uses fewer predictors. The BIC model also penalizes complexity more heavily which helps fix the problem of overfitting.

Task 3: Automated Selection Methods (20 points)

3a. (5 pts) Perform backward elimination using step() with AIC as the criterion. Which variables are removed? Which remain?

The backward elimination removed sex, bmi, and education. The variables that remained were sleep, mental health days, income, age, exercise, and general health. That left 9 predictors including the dummy variables for general health.

# backward elimination step-by-step
cat("=== BACKWARD ELIMINATION ===\n\n")

## === BACKWARD ELIMINATION ===

# Step 1: Maximum model
mod_back <- mod_max
cat("Step 1: Maximum model\n")

## Step 1: Maximum model

cat("Variables:", paste(names(coef(mod_back))[-1], collapse = ", "), "\n")

## Variables: menthlth_days, sleep_hrs, age, sexFemale, educationHS graduate, educationSome college, educationCollege graduate, exerciseYes, gen_healthVery good, gen_healthGood, gen_healthFair, gen_healthPoor, income_cat, bmi

# Show p-values for the maximum model
pvals <- tidy(mod_back) |>
  filter(term != "(Intercept)") |>
  arrange(desc(p.value)) |>
  dplyr::select(term, estimate, p.value) |>
  mutate(across(where(is.numeric), \(x) round(x, 4)))

pvals |>
  head(5) |>
  kable(caption = "Maximum Model: Variables Sorted by p-value (Highest First)") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Maximum Model: Variables Sorted by p-value (Highest First)
term	estimate	p.value
educationHS graduate	0.2508	0.5595
educationCollege graduate	0.3336	0.4439
educationSome college	0.3463	0.4233
bmi	0.0130	0.3683
sexFemale	-0.1889	0.2995

# Automated backward elimination using AIC
mod_backward <- step(mod_max, direction = "backward", trace = 1)

## Start:  AIC=18454.4
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education + 
##     exercise + gen_health + income_cat + bmi
## 
##                 Df Sum of Sq    RSS   AIC
## - education      3        29 199231 18449
## - bmi            1        32 199234 18453
## - sex            1        43 199245 18454
## <none>                       199202 18454
## - sleep_hrs      1       329 199530 18461
## - age            1       434 199636 18463
## - income_cat     1       521 199722 18466
## - exercise       1      1174 200376 18482
## - menthlth_days  1      5898 205100 18598
## - gen_health     4     66437 265639 19886
## 
## Step:  AIC=18449.13
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + exercise + 
##     gen_health + income_cat + bmi
## 
##                 Df Sum of Sq    RSS   AIC
## - bmi            1        32 199262 18448
## - sex            1        40 199270 18448
## <none>                       199231 18449
## - sleep_hrs      1       327 199557 18455
## - age            1       439 199670 18458
## - income_cat     1       520 199751 18460
## - exercise       1      1151 200381 18476
## - menthlth_days  1      5929 205159 18594
## - gen_health     4     66459 265690 19880
## 
## Step:  AIC=18447.92
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + exercise + 
##     gen_health + income_cat
## 
##                 Df Sum of Sq    RSS   AIC
## - sex            1        42 199305 18447
## <none>                       199262 18448
## - sleep_hrs      1       334 199596 18454
## - age            1       427 199690 18457
## - income_cat     1       514 199776 18459
## - exercise       1      1222 200484 18476
## - menthlth_days  1      5921 205184 18592
## - gen_health     4     67347 266609 19896
## 
## Step:  AIC=18446.98
## physhlth_days ~ menthlth_days + sleep_hrs + age + exercise + 
##     gen_health + income_cat
## 
##                 Df Sum of Sq    RSS   AIC
## <none>                       199305 18447
## - sleep_hrs      1       337 199641 18453
## - age            1       409 199713 18455
## - income_cat     1       492 199797 18457
## - exercise       1      1214 200518 18475
## - menthlth_days  1      5882 205186 18590
## - gen_health     4     67980 267285 19906

# create table of backward elimination using AIC results
tidy(mod_backward, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Backward Elimination Result (AIC-based)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Backward Elimination Result (AIC-based)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	3.1864	0.6663	4.7819	0.0000	1.8800	4.4927
menthlth_days	0.1461	0.0120	12.1352	0.0000	0.1225	0.1697
sleep_hrs	-0.1951	0.0672	-2.9038	0.0037	-0.3269	-0.0634
age	0.0174	0.0054	3.1981	0.0014	0.0067	0.0281
exerciseYes	-1.2877	0.2336	-5.5127	0.0000	-1.7457	-0.8298
gen_healthVery good	0.4617	0.2441	1.8914	0.0586	-0.0169	0.9403
gen_healthGood	1.6368	0.2600	6.2953	0.0000	1.1271	2.1465
gen_healthFair	7.0787	0.3616	19.5735	0.0000	6.3697	7.7876
gen_healthPoor	20.5084	0.5423	37.8149	0.0000	19.4452	21.5716
income_cat	-0.1657	0.0472	-3.5115	0.0004	-0.2582	-0.0732

3b. (5 pts) Perform forward selection using step(). Does it arrive at the same model as backward elimination?

The forward elimination arrived at the same model as backward elimination.

# forward selection using AIC
mod_null <- lm(physhlth_days ~ 1, data = brfss_ms)

mod_forward <- step(mod_null,
                    scope = list(lower = mod_null, upper = mod_max),
                    direction = "forward", trace = 1)

## Start:  AIC=20865.24
## physhlth_days ~ 1
## 
##                 Df Sum of Sq    RSS   AIC
## + gen_health     4    115918 208518 18663
## + menthlth_days  1     29743 294693 20386
## + exercise       1     19397 305038 20559
## + income_cat     1     19104 305332 20564
## + education      3      5906 318530 20779
## + age            1      4173 320263 20802
## + bmi            1      4041 320395 20805
## + sleep_hrs      1      3717 320719 20810
## <none>                       324435 20865
## + sex            1         7 324429 20867
## 
## Step:  AIC=18662.93
## physhlth_days ~ gen_health
## 
##                 Df Sum of Sq    RSS   AIC
## + menthlth_days  1    6394.9 202123 18509
## + exercise       1    1652.4 206865 18625
## + income_cat     1    1306.9 207211 18634
## + sleep_hrs      1     756.1 207762 18647
## + bmi            1      91.2 208427 18663
## <none>                       208518 18663
## + sex            1      38.5 208479 18664
## + age            1      32.2 208486 18664
## + education      3     145.0 208373 18666
## 
## Step:  AIC=18509.19
## physhlth_days ~ gen_health + menthlth_days
## 
##              Df Sum of Sq    RSS   AIC
## + exercise    1   1650.52 200472 18470
## + income_cat  1    817.89 201305 18491
## + age         1    464.73 201658 18500
## + sleep_hrs   1    257.79 201865 18505
## + bmi         1     90.51 202032 18509
## <none>                    202123 18509
## + sex         1      3.00 202120 18511
## + education   3    111.58 202011 18512
## 
## Step:  AIC=18470.19
## physhlth_days ~ gen_health + menthlth_days + exercise
## 
##              Df Sum of Sq    RSS   AIC
## + income_cat  1    509.09 199963 18460
## + age         1    333.74 200139 18464
## + sleep_hrs   1    253.06 200219 18466
## <none>                    200472 18470
## + bmi         1     21.21 200451 18472
## + sex         1     10.74 200462 18472
## + education   3     26.94 200445 18476
## 
## Step:  AIC=18459.48
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat
## 
##             Df Sum of Sq    RSS   AIC
## + age        1    321.97 199641 18453
## + sleep_hrs  1    250.25 199713 18455
## <none>                   199963 18460
## + bmi        1     27.98 199935 18461
## + sex        1     27.17 199936 18461
## + education  3     26.66 199937 18465
## 
## Step:  AIC=18453.42
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat + 
##     age
## 
##             Df Sum of Sq    RSS   AIC
## + sleep_hrs  1    336.79 199305 18447
## <none>                   199641 18453
## + sex        1     45.31 199596 18454
## + bmi        1     42.00 199599 18454
## + education  3     22.62 199619 18459
## 
## Step:  AIC=18446.98
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat + 
##     age + sleep_hrs
## 
##             Df Sum of Sq    RSS   AIC
## <none>                   199305 18447
## + sex        1    42.328 199262 18448
## + bmi        1    34.434 199270 18448
## + education  3    24.800 199280 18452

tidy(mod_forward, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Forward Selection Result (AIC-based)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Forward Selection Result (AIC-based)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	3.1864	0.6663	4.7819	0.0000	1.8800	4.4927
gen_healthVery good	0.4617	0.2441	1.8914	0.0586	-0.0169	0.9403
gen_healthGood	1.6368	0.2600	6.2953	0.0000	1.1271	2.1465
gen_healthFair	7.0787	0.3616	19.5735	0.0000	6.3697	7.7876
gen_healthPoor	20.5084	0.5423	37.8149	0.0000	19.4452	21.5716
menthlth_days	0.1461	0.0120	12.1352	0.0000	0.1225	0.1697
exerciseYes	-1.2877	0.2336	-5.5127	0.0000	-1.7457	-0.8298
income_cat	-0.1657	0.0472	-3.5115	0.0004	-0.2582	-0.0732
age	0.0174	0.0054	3.1981	0.0014	0.0067	0.0281
sleep_hrs	-0.1951	0.0672	-2.9038	0.0037	-0.3269	-0.0634

3c. (5 pts) Compare the backward, forward, and stepwise results in a single table showing the number of variables, Adjusted \(R^2\), AIC, and BIC for each.

# stepwise selection
mod_stepwise <- step(mod_null,
                     scope = list(lower = mod_null, upper = mod_max),
                     direction = "both", trace = 1)

## Start:  AIC=20865.24
## physhlth_days ~ 1
## 
##                 Df Sum of Sq    RSS   AIC
## + gen_health     4    115918 208518 18663
## + menthlth_days  1     29743 294693 20386
## + exercise       1     19397 305038 20559
## + income_cat     1     19104 305332 20564
## + education      3      5906 318530 20779
## + age            1      4173 320263 20802
## + bmi            1      4041 320395 20805
## + sleep_hrs      1      3717 320719 20810
## <none>                       324435 20865
## + sex            1         7 324429 20867
## 
## Step:  AIC=18662.93
## physhlth_days ~ gen_health
## 
##                 Df Sum of Sq    RSS   AIC
## + menthlth_days  1      6395 202123 18509
## + exercise       1      1652 206865 18625
## + income_cat     1      1307 207211 18634
## + sleep_hrs      1       756 207762 18647
## + bmi            1        91 208427 18663
## <none>                       208518 18663
## + sex            1        38 208479 18664
## + age            1        32 208486 18664
## + education      3       145 208373 18666
## - gen_health     4    115918 324435 20865
## 
## Step:  AIC=18509.19
## physhlth_days ~ gen_health + menthlth_days
## 
##                 Df Sum of Sq    RSS   AIC
## + exercise       1      1651 200472 18470
## + income_cat     1       818 201305 18491
## + age            1       465 201658 18500
## + sleep_hrs      1       258 201865 18505
## + bmi            1        91 202032 18509
## <none>                       202123 18509
## + sex            1         3 202120 18511
## + education      3       112 202011 18512
## - menthlth_days  1      6395 208518 18663
## - gen_health     4     92570 294693 20386
## 
## Step:  AIC=18470.19
## physhlth_days ~ gen_health + menthlth_days + exercise
## 
##                 Df Sum of Sq    RSS   AIC
## + income_cat     1       509 199963 18460
## + age            1       334 200139 18464
## + sleep_hrs      1       253 200219 18466
## <none>                       200472 18470
## + bmi            1        21 200451 18472
## + sex            1        11 200462 18472
## + education      3        27 200445 18476
## - exercise       1      1651 202123 18509
## - menthlth_days  1      6393 206865 18625
## - gen_health     4     78857 279330 20121
## 
## Step:  AIC=18459.48
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat
## 
##                 Df Sum of Sq    RSS   AIC
## + age            1       322 199641 18453
## + sleep_hrs      1       250 199713 18455
## <none>                       199963 18460
## + bmi            1        28 199935 18461
## + sex            1        27 199936 18461
## + education      3        27 199937 18465
## - income_cat     1       509 200472 18470
## - exercise       1      1342 201305 18491
## - menthlth_days  1      5988 205952 18605
## - gen_health     4     72713 272676 20002
## 
## Step:  AIC=18453.42
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat + 
##     age
## 
##                 Df Sum of Sq    RSS   AIC
## + sleep_hrs      1       337 199305 18447
## <none>                       199641 18453
## + sex            1        45 199596 18454
## + bmi            1        42 199599 18454
## + education      3        23 199619 18459
## - age            1       322 199963 18460
## - income_cat     1       497 200139 18464
## - exercise       1      1231 200873 18482
## - menthlth_days  1      6304 205945 18607
## - gen_health     4     68936 268577 19928
## 
## Step:  AIC=18446.98
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat + 
##     age + sleep_hrs
## 
##                 Df Sum of Sq    RSS   AIC
## <none>                       199305 18447
## + sex            1        42 199262 18448
## + bmi            1        34 199270 18448
## + education      3        25 199280 18452
## - sleep_hrs      1       337 199641 18453
## - age            1       409 199713 18455
## - income_cat     1       492 199797 18457
## - exercise       1      1214 200518 18475
## - menthlth_days  1      5882 205186 18590
## - gen_health     4     67980 267285 19906

# comparing all 3 models 
method_comparison <- tribble(
  ~Method, ~`Variables selected`, ~`Adj. R²`, ~AIC, ~BIC,
  "Maximum model",
    length(coef(mod_max)) - 1,
    round(glance(mod_max)$adj.r.squared, 4),
    round(AIC(mod_max), 1),
    round(BIC(mod_max), 1),
  "Backward (AIC)",
    length(coef(mod_backward)) - 1,
    round(glance(mod_backward)$adj.r.squared, 4),
    round(AIC(mod_backward), 1),
    round(BIC(mod_backward), 1),
  "Forward (AIC)",
    length(coef(mod_forward)) - 1,
    round(glance(mod_forward)$adj.r.squared, 4),
    round(AIC(mod_forward), 1),
    round(BIC(mod_forward), 1),
  "Stepwise (AIC)",
    length(coef(mod_stepwise)) - 1,
    round(glance(mod_stepwise)$adj.r.squared, 4),
    round(AIC(mod_stepwise), 1),
    round(BIC(mod_stepwise), 1)
)

method_comparison |>
  kable(caption = "Comparison of Variable Selection Methods") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Comparison of Variable Selection Methods
Method	Variables selected	Adj. R²	AIC	BIC
Maximum model	14	0.3843	32645.8	32750.1
Backward (AIC)	9	0.3846	32638.4	32710.1
Forward (AIC)	9	0.3846	32638.4	32710.1
Stepwise (AIC)	9	0.3846	32638.4	32710.1

3d. (5 pts) List three reasons why you should not blindly trust the results of automated variable selection. Which of these concerns is most relevant for epidemiological research?

Using automated selection can inflate type I error because repeat testing in stepwise procedures inflates the probability of including spurious (false) predictors. It can also ignore the research question because if the exposure is not statistically significant then the algorithm will remove it. Subject-matter knowledge can also be ignored meaning known confounders from past research can be ignored in the model if they are not statistically significant in your sample. The most relevant concern for epidemiological research is ignoring a known confounder and research question because they can lead to biased and incorrect conclusions.

Task 4: Associative Model Building (25 points)

For this task, the exposure is sleep_hrs and the outcome is physhlth_days. You are building an associative model to estimate the effect of sleep on physical health.

4a. (5 pts) Fit the crude model: physhlth_days ~ sleep_hrs. Report the sleep coefficient.

sleep coefficient= -0.63209

mod_crude <- lm(physhlth_days ~ sleep_hrs, data= brfss_ms)
summary(mod_crude)

## 
## Call:
## lm(formula = physhlth_days ~ sleep_hrs, data = brfss_ms)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.279 -3.486 -2.854 -1.590 30.938 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.91103    0.59591   13.28  < 2e-16 ***
## sleep_hrs   -0.63209    0.08306   -7.61 3.25e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.011 on 4998 degrees of freedom
## Multiple R-squared:  0.01146,    Adjusted R-squared:  0.01126 
## F-statistic: 57.92 on 1 and 4998 DF,  p-value: 3.245e-14

4b. (10 pts) Fit the maximum associative model: physhlth_days ~ sleep_hrs + [all other covariates]. Note the adjusted sleep coefficient and compute the 10% interval. Then systematically remove each covariate one at a time and determine which are confounders using the 10% rule. Present your results in a summary table.

# Exposure: sleep_hrs; Outcome: physhlth_days
# Maximum associative model
mod_assoc_max <- lm(physhlth_days ~ sleep_hrs + menthlth_days + exercise + age + sex + education + income_cat + bmi,
                    data = brfss_ms)

b_exposure_max <- coef(mod_assoc_max)["sleep_hrs"]
interval_low <- b_exposure_max - 0.10 * abs(b_exposure_max)
interval_high <- b_exposure_max + 0.10 * abs(b_exposure_max)

cat("Exposure coefficient in maximum model:", round(b_exposure_max, 4), "\n")

## Exposure coefficient in maximum model: -0.3593

cat("10% interval: (", round(interval_low, 4), ",", round(interval_high, 4), ")\n\n")

## 10% interval: ( -0.3952 , -0.3233 )

# Systematically remove one covariate at a time
covariates_to_test <- c("menthlth_days", "exercise", "age", "sex",
                         "education", "income_cat", "bmi")

assoc_table <- map_dfr(covariates_to_test, \(cov) {
  # Build formula without this covariate
  remaining <- setdiff(covariates_to_test, cov)
  form <- as.formula(paste("physhlth_days ~ sleep_hrs +", paste(remaining, collapse = " + ")))
  mod_reduced <- lm(form, data = brfss_ms)
  b_reduced <- coef(mod_reduced)["sleep_hrs"]
  pct_change <- (b_reduced - b_exposure_max) / abs(b_exposure_max) * 100

  tibble(
    `Removed covariate` = cov,
    `Sleep_hrs β (max)` = round(b_exposure_max, 4),
    `Sleep_hrs β (without)` = round(b_reduced, 4),
    `% Change` = round(pct_change, 1),
    `Within 10%?` = ifelse(abs(pct_change) <= 10, "Yes (drop)", "No (keep)"),
    Confounder = ifelse(abs(pct_change) > 10, "Yes", "No")
  )
})

assoc_table |>
  kable(caption = "Associative Model: Systematic Confounder Assessment for Sleep Hours") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
  column_spec(6, bold = TRUE)

Associative Model: Systematic Confounder Assessment for Sleep Hours
Removed covariate	Sleep_hrs β (max)	Sleep_hrs β (without)	% Change	Within 10%?	Confounder
menthlth_days	-0.3593	-0.5804	-61.6	No (keep)	Yes
exercise	-0.3593	-0.3779	-5.2	Yes (drop)	No
age	-0.3593	-0.2733	23.9	No (keep)	Yes
sex	-0.3593	-0.3633	-1.1	Yes (drop)	No
education	-0.3593	-0.3611	-0.5	Yes (drop)	No
income_cat	-0.3593	-0.3723	-3.6	Yes (drop)	No
bmi	-0.3593	-0.3738	-4.0	Yes (drop)	No

4c. (5 pts) Fit the final associative model including only sleep and the identified confounders. Report the sleep coefficient and its 95% CI.

sleep coefficient= -0.4473 95% CI= [-0.6042 , -0.2904]

mod_final <- lm(physhlth_days ~ sleep_hrs + menthlth_days + age,
                    data = brfss_ms)
summary(mod_final)

## 
## Call:
## lm(formula = physhlth_days ~ sleep_hrs + menthlth_days + age, 
##     data = brfss_ms)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.2384  -3.3681  -2.0898  -0.2769  29.9556 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.28702    0.64571   1.993   0.0463 *  
## sleep_hrs     -0.44730    0.08001  -5.590 2.39e-08 ***
## menthlth_days  0.31185    0.01360  22.925  < 2e-16 ***
## age            0.07554    0.00626  12.066  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.557 on 4996 degrees of freedom
## Multiple R-squared:  0.1205, Adjusted R-squared:   0.12 
## F-statistic: 228.2 on 3 and 4996 DF,  p-value: < 2.2e-16

conf_int <- confint(mod_final)["sleep_hrs", ]

cat("95% CI: (", round(conf_int[1], 4), ",", round(conf_int[2], 4), ")\n")

## 95% CI: ( -0.6042 , -0.2904 )

4d. (5 pts) A reviewer asks: “Why didn’t you just use stepwise selection?” Write a 3–4 sentence response explaining why automated selection is inappropriate for this associative analysis.

Automated selection, such as stepwise selection, is inappropriate for this associative analysis because they are more of an exploratory tool as opposed to analyzing association. This associative analysis looks at confounders and lets the me decide to include them rather than automated removal which can remove important known confounders which can lead to biases. This method allows a better estimate of association to be drawn between sleep and physical health.

Task 5: Synthesis (20 points)

5a. (10 pts) You have now built two models for the same data:

A predictive model (from Task 2 or 3, the best model by AIC/BIC)
An associative model (from Task 4, focused on sleep)

Compare these two models: Do they include the same variables? Is the sleep coefficient similar? Why might they differ?

The two models do not include the same variables. The predictive model used more predictors (8 for the BIC best model and 9 for the automated selection models) while the associative model used sleep hours and the identified confounders.The sleep coefficient differs between the two models, based on the best BIC model the sleep coefficient is -0.1951 while the associative models is -0.4473. One reason they might differ is because the predictive model has more has additional variables influencing it.

5b. (10 pts) Write a 4–5 sentence paragraph for a public health audience describing the results of your associative model. Include:

The adjusted effect of sleep on physical health days
Which variables needed to be accounted for (confounders)
The direction and approximate magnitude of the association
A caveat about cross-sectional data

Individuals who reported more hours of sleep each night reported less physically unhealthy days, meaning there is a negative association between the two variables. The magnitude of association is -0.45. In other words, each additional hour of sleep is associated with about .45 less physically unhealthy days each month. Mental health days and age were confounders. While there is an association between the two variables, causality cannot be determined since data was collected at the same point in time.

End of Lab Activity

Model Selection

EPI 553 — Principles of Statistical Inference II (Spring 2026)

Muntasir Masum

March 24, 2026