EPI 553 — Model Selection Lab Due: End of class, March 24, 2026
In this lab, you will practice both predictive and associative model selection using the BRFSS 2020 dataset. Work through each task systematically.
Use the saved analytic dataset from today’s lecture.
| Variable | Description | Type |
|---|---|---|
physhlth_days |
Physically unhealthy days in past 30 | Continuous (0–30) |
menthlth_days |
Mentally unhealthy days in past 30 | Continuous (0–30) |
sleep_hrs |
Sleep hours per night | Continuous (1–14) |
age |
Age in years (capped at 80) | Continuous |
sex |
Sex (Male/Female) | Factor |
education |
Education level (4 categories) | Factor |
exercise |
Any physical activity (Yes/No) | Factor |
gen_health |
General health status (5 categories) | Factor |
income_cat |
Household income (1–8 ordinal) | Numeric |
bmi |
Body mass index | Continuous |
library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(car)
library(leaps)
library(MASS)
library(plotly)
options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))We continue with the BRFSS 2020 dataset, predicting physically unhealthy days from a pool of candidate predictors.
brfss_ms <- brfss_full |>
mutate(
# Outcome
physhlth_days = case_when(
physhlth == 88 ~ 0,
physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
TRUE ~ NA_real_
),
# Candidate predictors
menthlth_days = case_when(
menthlth == 88 ~ 0,
menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
TRUE ~ NA_real_
),
sleep_hrs = case_when(
sleptim1 >= 1 & sleptim1 <= 14 ~ as.numeric(sleptim1),
TRUE ~ NA_real_
),
age = age80,
sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
education = factor(case_when(
educa %in% c(1, 2, 3) ~ "Less than HS",
educa == 4 ~ "HS graduate",
educa == 5 ~ "Some college",
educa == 6 ~ "College graduate",
TRUE ~ NA_character_
), levels = c("Less than HS", "HS graduate", "Some college", "College graduate")),
exercise = factor(case_when(
exerany2 == 1 ~ "Yes",
exerany2 == 2 ~ "No",
TRUE ~ NA_character_
), levels = c("No", "Yes")),
gen_health = factor(case_when(
genhlth == 1 ~ "Excellent",
genhlth == 2 ~ "Very good",
genhlth == 3 ~ "Good",
genhlth == 4 ~ "Fair",
genhlth == 5 ~ "Poor",
TRUE ~ NA_character_
), levels = c("Excellent", "Very good", "Good", "Fair", "Poor")),
income_cat = case_when(
income2 %in% 1:8 ~ as.numeric(income2),
TRUE ~ NA_real_
),
bmi = ifelse(bmi5 > 0, bmi5 / 100, NA_real_)
) |>
filter(
!is.na(physhlth_days), !is.na(menthlth_days), !is.na(sleep_hrs),
!is.na(age), age >= 18, !is.na(sex), !is.na(education),
!is.na(exercise), !is.na(gen_health), !is.na(income_cat), !is.na(bmi)
)
set.seed(1220)
brfss_ms <- brfss_ms |>
dplyr::select(physhlth_days, menthlth_days, sleep_hrs, age, sex,
education, exercise, gen_health, income_cat, bmi) |>
slice_sample(n = 5000)
# Save for lab
saveRDS(brfss_ms,
"/Users/nataliasmall/Downloads/EPI 553/brfss_ms_2020.rds")
tibble(Metric = c("Observations", "Variables"),
Value = c(nrow(brfss_ms), ncol(brfss_ms))) |>
kable(caption = "Analytic Dataset Dimensions") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Metric | Value |
|---|---|
| Observations | 5000 |
| Variables | 10 |
1a. (5 pts) Fit the maximum model predicting
physhlth_days from all 9 candidate predictors. Report \(R^2\), Adjusted \(R^2\), AIC, and BIC.
# The maximum model with all 9 candidate predictors
mod_max <- lm(physhlth_days ~ menthlth_days + sleep_hrs + age + sex +
education + exercise + gen_health + income_cat + bmi,
data = brfss_ms)
tidy(mod_max, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Maximum Model: All Candidate Predictors",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 2.6902 | 0.8556 | 3.1441 | 0.0017 | 1.0128 | 4.3676 |
| menthlth_days | 0.1472 | 0.0121 | 12.1488 | 0.0000 | 0.1235 | 0.1710 |
| sleep_hrs | -0.1930 | 0.0673 | -2.8679 | 0.0041 | -0.3249 | -0.0611 |
| age | 0.0180 | 0.0055 | 3.2969 | 0.0010 | 0.0073 | 0.0288 |
| sexFemale | -0.1889 | 0.1820 | -1.0376 | 0.2995 | -0.5458 | 0.1680 |
| educationHS graduate | 0.2508 | 0.4297 | 0.5836 | 0.5595 | -0.5917 | 1.0933 |
| educationSome college | 0.3463 | 0.4324 | 0.8009 | 0.4233 | -0.5014 | 1.1940 |
| educationCollege graduate | 0.3336 | 0.4357 | 0.7657 | 0.4439 | -0.5206 | 1.1878 |
| exerciseYes | -1.2866 | 0.2374 | -5.4199 | 0.0000 | -1.7520 | -0.8212 |
| gen_healthVery good | 0.4373 | 0.2453 | 1.7824 | 0.0747 | -0.0437 | 0.9183 |
| gen_healthGood | 1.5913 | 0.2651 | 6.0022 | 0.0000 | 1.0716 | 2.1111 |
| gen_healthFair | 7.0176 | 0.3682 | 19.0586 | 0.0000 | 6.2957 | 7.7394 |
| gen_healthPoor | 20.4374 | 0.5469 | 37.3722 | 0.0000 | 19.3653 | 21.5095 |
| income_cat | -0.1817 | 0.0503 | -3.6092 | 0.0003 | -0.2803 | -0.0830 |
| bmi | 0.0130 | 0.0145 | 0.8997 | 0.3683 | -0.0153 | 0.0414 |
glance(mod_max) |>
dplyr::select(r.squared, adj.r.squared, AIC, BIC) |>
mutate(across(everything(), \(x) round(x, 3))) |>
kable(caption = "Maximum Model: Fit Statistics") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| r.squared | adj.r.squared | AIC | BIC |
|---|---|---|---|
| 0.386 | 0.384 | 32645.79 | 32750.06 |
𝑅2: 0.386 Adjusted𝑅2: 0.384 AIC: 32645.79 BIC: 32750.06
1b. (5 pts) Now fit a “minimal” model using only
menthlth_days and age. Report the same four
criteria. How do the two models compare?
# Compare a minimal model to the maximum model
mod_min <- lm(physhlth_days ~ menthlth_days + age, data = brfss_ms)
glance(mod_min) |>
dplyr::select(r.squared, adj.r.squared, AIC, BIC) |>
mutate(across(everything(), \(x) round(x, 4))) |>
kable(caption = "Minimal Model: Fit Statistics") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| r.squared | adj.r.squared | AIC | BIC |
|---|---|---|---|
| 0.115 | 0.1146 | 34449.78 | 34475.85 |
𝑅2: 0.115 Adjusted𝑅2: 0.1146 AIC: 34449.78 BIC: 34475.85 The minimal model has smaller𝑅2 and adjusted𝑅2 values, but greater AIC and BIC values compared to the maximum model. Considering Adjusted𝑅2 is generally preferred over standard 𝑅2 since it accounts for the number of predictors, as per the selection rule for 𝑅2, we would choose the model with the largest Adjusted 𝑅2, which is the maximum model. Similarly, regarding selection rules for AIC and BIC, the model with the smaller values are preferred, which in this case is the maximum model.
1c. (5 pts) Explain why \(R^2\) is a poor criterion for comparing these two models. What makes Adjusted \(R^2\), AIC, and BIC better choices? 𝑅2 is a poor criterion for comparing these two models because 𝑅2 always increases (or stays the same) when a predictors added, regardless of whether it is useful, rendering it useless for model comparison across models of different sizes. Adjusted𝑅2 is a better choice because, it has the ability to decrease when an uninformative predictor is added; the penalty for using an extra degree of freedom outweighs the tiny increase in𝑅2. AIC is also a better choice because it measures the relative information lost by a model, making for an effective relative comparison tool. Additionally, BIC is similar to AIC but penalizes complexity more heavily, especially with large sample sizes making it another better choice than𝑅2.
2a. (5 pts) Use leaps::regsubsets() to
perform best subsets regression with nvmax = 15. Create a
plot of Adjusted \(R^2\) vs. number of
variables. At what model size does Adjusted \(R^2\) plateau?
# Perform best subsets regression
best_subsets <- regsubsets(
physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education +
exercise + gen_health + income_cat + bmi,
data = brfss_ms,
nvmax = 15,
method = "exhaustive"
)
best_summary <- summary(best_subsets)
# Plot of adjusted R2 vs number of variables
subset_metrics <- tibble(
p = 1:length(best_summary$adjr2),
`Adj. R²` = best_summary$adjr2,
BIC = best_summary$bic,
Cp = best_summary$cp
)
p1 <- ggplot(subset_metrics, aes(x = p, y = `Adj. R²`)) +
geom_line(linewidth = 1, color = "steelblue") +
geom_point(size = 3, color = "steelblue") +
geom_vline(xintercept = which.max(best_summary$adjr2),
linetype = "dashed", color = "tomato") +
labs(title = "Adjusted R² by Model Size", x = "Number of Variables", y = "Adjusted R²") +
theme_minimal(base_size = 12)
gridExtra::grid.arrange(p1, ncol = 1)Best Subsets: Adjusted R² by Model Size
Answer: Adjusted R² reaches its maximum at 10 variables and plateaus. However, the model begins to plateau around 5 variables; after this point, adding more predictors only yields very small improvements in the Adjusted R² value.
2b. (5 pts) Create a plot of BIC vs. number of variables. Which model size minimizes BIC?
# Plot of BIC vs number of variables
p2 <- ggplot(subset_metrics, aes(x = p, y = BIC)) +
geom_line(linewidth = 1, color = "steelblue") +
geom_point(size = 3, color = "steelblue") +
geom_vline(xintercept = which.min(best_summary$bic),
linetype = "dashed", color = "tomato") +
labs(title = "BIC by Model Size", x = "Number of Variables", y = "BIC") +
theme_minimal(base_size = 12)
gridExtra::grid.arrange(p2, ncol = 1)Best Subsets: BIC by Model Size
Answer: The model size that minimizes BIC is 8.
2c. (5 pts) Identify the variables included in the
BIC-best model. Fit this model explicitly using lm() and
report its coefficients.
## Best model by BIC: 8 variables
# Show which variables are in the BIC-best model
best_bic_idx <- which.min(best_summary$bic)
best_vars <- names(which(best_summary$which[best_bic_idx, -1]))
cat("\nVariables in BIC-best model:\n")##
## Variables in BIC-best model:
## menthlth_days
## sleep_hrs
## age
## exerciseYes
## gen_healthGood
## gen_healthFair
## gen_healthPoor
## income_cat
# Fit BIC-best model
mod_best_bic <- lm(physhlth_days ~ menthlth_days + sleep_hrs + age + exercise +
gen_health + income_cat,
data = brfss_ms)
tidy(mod_best_bic, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "BIC-Best Model",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 3.1864 | 0.6663 | 4.7819 | 0.0000 | 1.8800 | 4.4927 |
| menthlth_days | 0.1461 | 0.0120 | 12.1352 | 0.0000 | 0.1225 | 0.1697 |
| sleep_hrs | -0.1951 | 0.0672 | -2.9038 | 0.0037 | -0.3269 | -0.0634 |
| age | 0.0174 | 0.0054 | 3.1981 | 0.0014 | 0.0067 | 0.0281 |
| exerciseYes | -1.2877 | 0.2336 | -5.5127 | 0.0000 | -1.7457 | -0.8298 |
| gen_healthVery good | 0.4617 | 0.2441 | 1.8914 | 0.0586 | -0.0169 | 0.9403 |
| gen_healthGood | 1.6368 | 0.2600 | 6.2953 | 0.0000 | 1.1271 | 2.1465 |
| gen_healthFair | 7.0787 | 0.3616 | 19.5735 | 0.0000 | 6.3697 | 7.7876 |
| gen_healthPoor | 20.5084 | 0.5423 | 37.8149 | 0.0000 | 19.4452 | 21.5716 |
| income_cat | -0.1657 | 0.0472 | -3.5115 | 0.0004 | -0.2582 | -0.0732 |
DisclaimerAlthough the BIC best model identified 8 specific variables, the full gen_health factor is included in this model to maintain the integrity of the categorical variable. This resulted in the inclusion of the “Very good” category alongside the levels selected by the best-subsets procedure.
2d. (5 pts) Compare the BIC-best model to the Adjusted \(R^2\)-best model. Are they the same? If not, which would you prefer and why? The Adjusted R²-best model is larger than the BIC-best model.
## Best model by Adj. R²: 10 variables
# Show which variables are in the Adj. R²-best model
best_adjr2_idx <- which.max(best_summary$adjr2)
best_vars_2 <- names(which(best_summary$which[best_adjr2_idx, -1]))
cat("\nVariables in Adj. R²-best model:\n")##
## Variables in Adj. R²-best model:
## menthlth_days
## sleep_hrs
## age
## sexFemale
## exerciseYes
## gen_healthVery good
## gen_healthGood
## gen_healthFair
## gen_healthPoor
## income_cat
# Fit Adj. R²-best model
mod_best_adjr2 <- lm(physhlth_days ~ menthlth_days + sleep_hrs + age + sex + exercise +
gen_health + income_cat,
data = brfss_ms)
best_comparison <- tribble(
~`Best model`, ~`Variables selected`, ~`Adj. R²`, ~AIC, ~BIC,
"BIC-best model",
length(coef(mod_best_bic)) - 2,
round(glance(mod_best_bic)$adj.r.squared, 4),
round(AIC(mod_best_bic), 1),
round(BIC(mod_best_bic), 1),
"Adj. R²-best model",
length(coef(mod_best_adjr2)) - 1,
round(glance(mod_best_adjr2)$adj.r.squared, 4),
round(AIC(mod_best_adjr2), 1),
round(BIC(mod_best_adjr2), 1)
)
best_comparison |>
kable(caption = "Comparison:BIC-best model vs Adj. R²-best model") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Best model | Variables selected | Adj. R² | AIC | BIC |
|---|---|---|---|---|
| BIC-best model | 8 | 0.3846 | 32638.4 | 32710.1 |
| Adj. R²-best model | 10 | 0.3846 | 32639.3 | 32717.5 |
Interpretation:The BIC-best model and the Adjusted R²-best models are not the same. The Adjusted R²-best model includes 10 variables, while the BIC-best model includes only 8. The BIC-best model is preferred because while it has the exact same Adjusted R² value (0.3846) as the larger model, it sustains lower AIC and BIC values. Thus improving parsimony without sacrificing fit.
3a. (5 pts) Perform backward elimination using
step() with AIC as the criterion. Which variables are
removed? Which remain?
# Automated backward elimination using AIC
mod_backward <- step(mod_max, direction = "backward", trace = 1)## Start: AIC=18454.4
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education +
## exercise + gen_health + income_cat + bmi
##
## Df Sum of Sq RSS AIC
## - education 3 29 199231 18449
## - bmi 1 32 199234 18453
## - sex 1 43 199245 18454
## <none> 199202 18454
## - sleep_hrs 1 329 199530 18461
## - age 1 434 199636 18463
## - income_cat 1 521 199722 18466
## - exercise 1 1174 200376 18482
## - menthlth_days 1 5898 205100 18598
## - gen_health 4 66437 265639 19886
##
## Step: AIC=18449.13
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + exercise +
## gen_health + income_cat + bmi
##
## Df Sum of Sq RSS AIC
## - bmi 1 32 199262 18448
## - sex 1 40 199270 18448
## <none> 199231 18449
## - sleep_hrs 1 327 199557 18455
## - age 1 439 199670 18458
## - income_cat 1 520 199751 18460
## - exercise 1 1151 200381 18476
## - menthlth_days 1 5929 205159 18594
## - gen_health 4 66459 265690 19880
##
## Step: AIC=18447.92
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + exercise +
## gen_health + income_cat
##
## Df Sum of Sq RSS AIC
## - sex 1 42 199305 18447
## <none> 199262 18448
## - sleep_hrs 1 334 199596 18454
## - age 1 427 199690 18457
## - income_cat 1 514 199776 18459
## - exercise 1 1222 200484 18476
## - menthlth_days 1 5921 205184 18592
## - gen_health 4 67347 266609 19896
##
## Step: AIC=18446.98
## physhlth_days ~ menthlth_days + sleep_hrs + age + exercise +
## gen_health + income_cat
##
## Df Sum of Sq RSS AIC
## <none> 199305 18447
## - sleep_hrs 1 337 199641 18453
## - age 1 409 199713 18455
## - income_cat 1 492 199797 18457
## - exercise 1 1214 200518 18475
## - menthlth_days 1 5882 205186 18590
## - gen_health 4 67980 267285 19906
tidy(mod_backward, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Backward Elimination Result (AIC-based)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 3.1864 | 0.6663 | 4.7819 | 0.0000 | 1.8800 | 4.4927 |
| menthlth_days | 0.1461 | 0.0120 | 12.1352 | 0.0000 | 0.1225 | 0.1697 |
| sleep_hrs | -0.1951 | 0.0672 | -2.9038 | 0.0037 | -0.3269 | -0.0634 |
| age | 0.0174 | 0.0054 | 3.1981 | 0.0014 | 0.0067 | 0.0281 |
| exerciseYes | -1.2877 | 0.2336 | -5.5127 | 0.0000 | -1.7457 | -0.8298 |
| gen_healthVery good | 0.4617 | 0.2441 | 1.8914 | 0.0586 | -0.0169 | 0.9403 |
| gen_healthGood | 1.6368 | 0.2600 | 6.2953 | 0.0000 | 1.1271 | 2.1465 |
| gen_healthFair | 7.0787 | 0.3616 | 19.5735 | 0.0000 | 6.3697 | 7.7876 |
| gen_healthPoor | 20.5084 | 0.5423 | 37.8149 | 0.0000 | 19.4452 | 21.5716 |
| income_cat | -0.1657 | 0.0472 | -3.5115 | 0.0004 | -0.2582 | -0.0732 |
Interpretation: Six variables remain: menthlth_days, sleep_hrs, age, exercise, gen_health, income_cat. Three variables were removed: sex, education, and bmi.
3b. (5 pts) Perform forward selection using
step(). Does it arrive at the same model as backward
elimination?
# Automated forward selection using AIC
mod_null <- lm(physhlth_days ~ 1, data = brfss_ms)
mod_forward <- step(mod_null,
scope = list(lower = mod_null, upper = mod_max),
direction = "forward", trace = 1)## Start: AIC=20865.24
## physhlth_days ~ 1
##
## Df Sum of Sq RSS AIC
## + gen_health 4 115918 208518 18663
## + menthlth_days 1 29743 294693 20386
## + exercise 1 19397 305038 20559
## + income_cat 1 19104 305332 20564
## + education 3 5906 318530 20779
## + age 1 4173 320263 20802
## + bmi 1 4041 320395 20805
## + sleep_hrs 1 3717 320719 20810
## <none> 324435 20865
## + sex 1 7 324429 20867
##
## Step: AIC=18662.93
## physhlth_days ~ gen_health
##
## Df Sum of Sq RSS AIC
## + menthlth_days 1 6394.9 202123 18509
## + exercise 1 1652.4 206865 18625
## + income_cat 1 1306.9 207211 18634
## + sleep_hrs 1 756.1 207762 18647
## + bmi 1 91.2 208427 18663
## <none> 208518 18663
## + sex 1 38.5 208479 18664
## + age 1 32.2 208486 18664
## + education 3 145.0 208373 18666
##
## Step: AIC=18509.19
## physhlth_days ~ gen_health + menthlth_days
##
## Df Sum of Sq RSS AIC
## + exercise 1 1650.52 200472 18470
## + income_cat 1 817.89 201305 18491
## + age 1 464.73 201658 18500
## + sleep_hrs 1 257.79 201865 18505
## + bmi 1 90.51 202032 18509
## <none> 202123 18509
## + sex 1 3.00 202120 18511
## + education 3 111.58 202011 18512
##
## Step: AIC=18470.19
## physhlth_days ~ gen_health + menthlth_days + exercise
##
## Df Sum of Sq RSS AIC
## + income_cat 1 509.09 199963 18460
## + age 1 333.74 200139 18464
## + sleep_hrs 1 253.06 200219 18466
## <none> 200472 18470
## + bmi 1 21.21 200451 18472
## + sex 1 10.74 200462 18472
## + education 3 26.94 200445 18476
##
## Step: AIC=18459.48
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat
##
## Df Sum of Sq RSS AIC
## + age 1 321.97 199641 18453
## + sleep_hrs 1 250.25 199713 18455
## <none> 199963 18460
## + bmi 1 27.98 199935 18461
## + sex 1 27.17 199936 18461
## + education 3 26.66 199937 18465
##
## Step: AIC=18453.42
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat +
## age
##
## Df Sum of Sq RSS AIC
## + sleep_hrs 1 336.79 199305 18447
## <none> 199641 18453
## + sex 1 45.31 199596 18454
## + bmi 1 42.00 199599 18454
## + education 3 22.62 199619 18459
##
## Step: AIC=18446.98
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat +
## age + sleep_hrs
##
## Df Sum of Sq RSS AIC
## <none> 199305 18447
## + sex 1 42.328 199262 18448
## + bmi 1 34.434 199270 18448
## + education 3 24.800 199280 18452
tidy(mod_forward, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Forward Selection Result (AIC-based)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 3.1864 | 0.6663 | 4.7819 | 0.0000 | 1.8800 | 4.4927 |
| gen_healthVery good | 0.4617 | 0.2441 | 1.8914 | 0.0586 | -0.0169 | 0.9403 |
| gen_healthGood | 1.6368 | 0.2600 | 6.2953 | 0.0000 | 1.1271 | 2.1465 |
| gen_healthFair | 7.0787 | 0.3616 | 19.5735 | 0.0000 | 6.3697 | 7.7876 |
| gen_healthPoor | 20.5084 | 0.5423 | 37.8149 | 0.0000 | 19.4452 | 21.5716 |
| menthlth_days | 0.1461 | 0.0120 | 12.1352 | 0.0000 | 0.1225 | 0.1697 |
| exerciseYes | -1.2877 | 0.2336 | -5.5127 | 0.0000 | -1.7457 | -0.8298 |
| income_cat | -0.1657 | 0.0472 | -3.5115 | 0.0004 | -0.2582 | -0.0732 |
| age | 0.0174 | 0.0054 | 3.1981 | 0.0014 | 0.0067 | 0.0281 |
| sleep_hrs | -0.1951 | 0.0672 | -2.9038 | 0.0037 | -0.3269 | -0.0634 |
Answer:Yes, forward selection arrives at the same final model as backward elimination. While they started from different points, both processes arrived at a final model with an AIC of 18446.98, retaining the same predictors: gen_health, menthlth_days, exercise, income_cat, age, and sleep_hrs. Thus, providing greater evidence that this specific subset of variables is the most stable choice when predicting physical health days.
3c. (5 pts) Compare the backward, forward, and stepwise results in a single table showing the number of variables, Adjusted \(R^2\), AIC, and BIC for each.
# Stepwise Selection
mod_stepwise <- step(mod_null,
scope = list(lower = mod_null, upper = mod_max),
direction = "both", trace = 1)## Start: AIC=20865.24
## physhlth_days ~ 1
##
## Df Sum of Sq RSS AIC
## + gen_health 4 115918 208518 18663
## + menthlth_days 1 29743 294693 20386
## + exercise 1 19397 305038 20559
## + income_cat 1 19104 305332 20564
## + education 3 5906 318530 20779
## + age 1 4173 320263 20802
## + bmi 1 4041 320395 20805
## + sleep_hrs 1 3717 320719 20810
## <none> 324435 20865
## + sex 1 7 324429 20867
##
## Step: AIC=18662.93
## physhlth_days ~ gen_health
##
## Df Sum of Sq RSS AIC
## + menthlth_days 1 6395 202123 18509
## + exercise 1 1652 206865 18625
## + income_cat 1 1307 207211 18634
## + sleep_hrs 1 756 207762 18647
## + bmi 1 91 208427 18663
## <none> 208518 18663
## + sex 1 38 208479 18664
## + age 1 32 208486 18664
## + education 3 145 208373 18666
## - gen_health 4 115918 324435 20865
##
## Step: AIC=18509.19
## physhlth_days ~ gen_health + menthlth_days
##
## Df Sum of Sq RSS AIC
## + exercise 1 1651 200472 18470
## + income_cat 1 818 201305 18491
## + age 1 465 201658 18500
## + sleep_hrs 1 258 201865 18505
## + bmi 1 91 202032 18509
## <none> 202123 18509
## + sex 1 3 202120 18511
## + education 3 112 202011 18512
## - menthlth_days 1 6395 208518 18663
## - gen_health 4 92570 294693 20386
##
## Step: AIC=18470.19
## physhlth_days ~ gen_health + menthlth_days + exercise
##
## Df Sum of Sq RSS AIC
## + income_cat 1 509 199963 18460
## + age 1 334 200139 18464
## + sleep_hrs 1 253 200219 18466
## <none> 200472 18470
## + bmi 1 21 200451 18472
## + sex 1 11 200462 18472
## + education 3 27 200445 18476
## - exercise 1 1651 202123 18509
## - menthlth_days 1 6393 206865 18625
## - gen_health 4 78857 279330 20121
##
## Step: AIC=18459.48
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat
##
## Df Sum of Sq RSS AIC
## + age 1 322 199641 18453
## + sleep_hrs 1 250 199713 18455
## <none> 199963 18460
## + bmi 1 28 199935 18461
## + sex 1 27 199936 18461
## + education 3 27 199937 18465
## - income_cat 1 509 200472 18470
## - exercise 1 1342 201305 18491
## - menthlth_days 1 5988 205952 18605
## - gen_health 4 72713 272676 20002
##
## Step: AIC=18453.42
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat +
## age
##
## Df Sum of Sq RSS AIC
## + sleep_hrs 1 337 199305 18447
## <none> 199641 18453
## + sex 1 45 199596 18454
## + bmi 1 42 199599 18454
## + education 3 23 199619 18459
## - age 1 322 199963 18460
## - income_cat 1 497 200139 18464
## - exercise 1 1231 200873 18482
## - menthlth_days 1 6304 205945 18607
## - gen_health 4 68936 268577 19928
##
## Step: AIC=18446.98
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat +
## age + sleep_hrs
##
## Df Sum of Sq RSS AIC
## <none> 199305 18447
## + sex 1 42 199262 18448
## + bmi 1 34 199270 18448
## + education 3 25 199280 18452
## - sleep_hrs 1 337 199641 18453
## - age 1 409 199713 18455
## - income_cat 1 492 199797 18457
## - exercise 1 1214 200518 18475
## - menthlth_days 1 5882 205186 18590
## - gen_health 4 67980 267285 19906
tidy(mod_stepwise, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Stepwise Selection Result (AIC-based)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 3.1864 | 0.6663 | 4.7819 | 0.0000 | 1.8800 | 4.4927 |
| gen_healthVery good | 0.4617 | 0.2441 | 1.8914 | 0.0586 | -0.0169 | 0.9403 |
| gen_healthGood | 1.6368 | 0.2600 | 6.2953 | 0.0000 | 1.1271 | 2.1465 |
| gen_healthFair | 7.0787 | 0.3616 | 19.5735 | 0.0000 | 6.3697 | 7.7876 |
| gen_healthPoor | 20.5084 | 0.5423 | 37.8149 | 0.0000 | 19.4452 | 21.5716 |
| menthlth_days | 0.1461 | 0.0120 | 12.1352 | 0.0000 | 0.1225 | 0.1697 |
| exerciseYes | -1.2877 | 0.2336 | -5.5127 | 0.0000 | -1.7457 | -0.8298 |
| income_cat | -0.1657 | 0.0472 | -3.5115 | 0.0004 | -0.2582 | -0.0732 |
| age | 0.0174 | 0.0054 | 3.1981 | 0.0014 | 0.0067 | 0.0281 |
| sleep_hrs | -0.1951 | 0.0672 | -2.9038 | 0.0037 | -0.3269 | -0.0634 |
# Compare backward, forward, and stepwise results
method_comparison <- tribble(
~Method, ~`Variables selected`, ~`Adj. R²`, ~AIC, ~BIC,
"Maximum model",
length(coef(mod_max)) - 1,
round(glance(mod_max)$adj.r.squared, 4),
round(AIC(mod_max), 1),
round(BIC(mod_max), 1),
"Backward (AIC)",
length(coef(mod_backward)) - 1,
round(glance(mod_backward)$adj.r.squared, 4),
round(AIC(mod_backward), 1),
round(BIC(mod_backward), 1),
"Forward (AIC)",
length(coef(mod_forward)) - 1,
round(glance(mod_forward)$adj.r.squared, 4),
round(AIC(mod_forward), 1),
round(BIC(mod_forward), 1),
"Stepwise (AIC)",
length(coef(mod_stepwise)) - 1,
round(glance(mod_stepwise)$adj.r.squared, 4),
round(AIC(mod_stepwise), 1),
round(BIC(mod_stepwise), 1)
)
method_comparison |>
kable(caption = "Comparison of Variable Selection Methods") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Method | Variables selected | Adj. R² | AIC | BIC |
|---|---|---|---|---|
| Maximum model | 14 | 0.3843 | 32645.8 | 32750.1 |
| Backward (AIC) | 9 | 0.3846 | 32638.4 | 32710.1 |
| Forward (AIC) | 9 | 0.3846 | 32638.4 | 32710.1 |
| Stepwise (AIC) | 9 | 0.3846 | 32638.4 | 32710.1 |
Interpretation: All three automated methods arrive at the same final model (AIC = 18662.93) retaining the same adjusted R², AIC, and BIC values. Compared to the maximum model, the selected model has a lower AIC and BIC and a relatively similar Adjusted R².
3d. (5 pts) List three reasons why you should not blindly trust the results of automated variable selection. Which of these concerns is most relevant for epidemiological research?
Three reasons why you should not blindly trust the results of automated variable selection is: One:They ignore the research question. The algorithm selects variables based purely on statistical fit. If you are building an associative model and the exposure is not statistically significant, the algorithm will remove it, which defeats the purpose. Two:They inflate Type I error. The repeated testing involved in stepwise procedures inflates the probability of including spurious predictors. Three:They ignore subject-matter knowledge. A variable may be a known confounder from the literature even if it is not statistically significant in your sample.
In epidemiological research, ignoring the research question and subject-matter knowledge is the most relevant concern, especially regarding confounding. An automated selection algorithm’s tendency to remove predictors because they are not statistically significant contradicts the researcher’s ability to identify confounders based on a causal framework sustained by extensive literature review. Thus resulting in biased estimates for the primary exposure.
For this task, the exposure is
sleep_hrs and the outcome is
physhlth_days. You are building an associative model to
estimate the effect of sleep on physical health.
4a. (5 pts) Fit the crude model:
physhlth_days ~ sleep_hrs. Report the sleep
coefficient.
# Exposure: sleep_hrs; Outcome: physhlth_days
# Fit the crude model
mod_crude <- lm(physhlth_days ~ sleep_hrs, data = brfss_ms)
tidy(mod_crude, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Crude Model: physhlth_days ~ sleep_hrs",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 7.9110 | 0.5959 | 13.2755 | 0 | 6.7428 | 9.0793 |
| sleep_hrs | -0.6321 | 0.0831 | -7.6104 | 0 | -0.7949 | -0.4693 |
b_exposure_crude <- coef(mod_crude)["sleep_hrs"]
cat("Exposure coefficient in crude model:", round(b_exposure_crude, 4), "\n")## Exposure coefficient in crude model: -0.6321
Sleep Coefficient (unadjusted):-0.6321 (Each additional hour of sleep is associated with 0.6321 less physically unhealthy days, in the past 30 days)
4b. (10 pts) Fit the maximum associative model:
physhlth_days ~ sleep_hrs + [all other covariates]. Note
the adjusted sleep coefficient and compute the 10% interval. Then
systematically remove each covariate one at a time and determine which
are confounders using the 10% rule. Present your results in a summary
table.
# Maximum associative model
mod_assoc_max <- lm(physhlth_days ~ sleep_hrs + menthlth_days + age + sex +
education + exercise + gen_health + income_cat + bmi,
data = brfss_ms)
b_exposure_max <- coef(mod_assoc_max)["sleep_hrs"]
interval_low <- b_exposure_max - 0.10 * abs(b_exposure_max)
interval_high <- b_exposure_max + 0.10 * abs(b_exposure_max)
cat("Exposure coefficient in maximum model:", round(b_exposure_max, 4), "\n")## Exposure coefficient in maximum model: -0.193
## 10% interval: ( -0.2123 , -0.1737 )
# Systematically remove one covariate at a time
covariates_to_test <- c("menthlth_days", "age", "sex", "education",
"exercise", "gen_health", "income_cat", "bmi")
assoc_table <- map_dfr(covariates_to_test, \(cov) {
# Build formula without this covariate
remaining <- setdiff(covariates_to_test, cov)
form <- as.formula(paste("physhlth_days ~ sleep_hrs +", paste(remaining, collapse = " + ")))
mod_reduced <- lm(form, data = brfss_ms)
b_reduced <- coef(mod_reduced)["sleep_hrs"]
pct_change <- (b_reduced - b_exposure_max) / abs(b_exposure_max) * 100
tibble(
`Removed covariate` = cov,
`sleep_hrs β (max)` = round(b_exposure_max, 4),
`sleep_hrs β (without)` = round(b_reduced, 4),
`% Change` = round(pct_change, 1),
`Within 10%?` = ifelse(abs(pct_change) <= 10, "Yes (drop)", "No (keep)"),
Confounder = ifelse(abs(pct_change) > 10, "Yes", "No")
)
})
assoc_table |>
kable(caption = "Associative Model: Systematic Confounder Assessment for Sleep Hours") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
column_spec(6, bold = TRUE)| Removed covariate | sleep_hrs β (max) | sleep_hrs β (without) | % Change | Within 10%? | Confounder |
|---|---|---|---|---|---|
| menthlth_days | -0.193 | -0.2894 | -50.0 | No (keep) | Yes |
| age | -0.193 | -0.1646 | 14.7 | No (keep) | Yes |
| sex | -0.193 | -0.1937 | -0.4 | Yes (drop) | No |
| education | -0.193 | -0.1923 | 0.3 | Yes (drop) | No |
| exercise | -0.193 | -0.1957 | -1.4 | Yes (drop) | No |
| gen_health | -0.193 | -0.3593 | -86.2 | No (keep) | Yes |
| income_cat | -0.193 | -0.1936 | -0.3 | Yes (drop) | No |
| bmi | -0.193 | -0.1950 | -1.0 | Yes (drop) | No |
Sleep Coefficient (adjusted):-0.193 10% interval:( -0.2123 , -0.1737 )
4c. (5 pts) Fit the final associative model including only sleep and the identified confounders. Report the sleep coefficient and its 95% CI.
# Fit final associative model
mod_assoc_fin <- lm(physhlth_days ~ sleep_hrs + menthlth_days + age + gen_health, data = brfss_ms)
tidy(mod_assoc_fin, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Final Associative Model",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 0.8151 | 0.5615 | 1.4516 | 0.1467 | -0.2857 | 1.9158 |
| sleep_hrs | -0.2026 | 0.0675 | -3.0003 | 0.0027 | -0.3349 | -0.0702 |
| menthlth_days | 0.1512 | 0.0120 | 12.5637 | 0.0000 | 0.1276 | 0.1748 |
| age | 0.0205 | 0.0054 | 3.7595 | 0.0002 | 0.0098 | 0.0312 |
| gen_healthVery good | 0.5113 | 0.2451 | 2.0860 | 0.0370 | 0.0308 | 0.9919 |
| gen_healthGood | 1.9151 | 0.2579 | 7.4255 | 0.0000 | 1.4095 | 2.4207 |
| gen_healthFair | 7.7686 | 0.3488 | 22.2693 | 0.0000 | 7.0847 | 8.4524 |
| gen_healthPoor | 21.4868 | 0.5266 | 40.8018 | 0.0000 | 20.4544 | 22.5192 |
# Report sleep coefficient
b_exposure_fin <- coef(mod_assoc_fin)["sleep_hrs"]
cat("Final Sleep Coefficient:", round(b_exposure_fin, 4), "\n")## Final Sleep Coefficient: -0.2026
# Report 95% CI
sleep_fin <-
tidy(mod_assoc_fin, conf.int = TRUE) |>
filter(term == "sleep_hrs")
cat("95% CI: (", round(sleep_fin$conf.low, 4), ",", round(sleep_fin$conf.high, 4),") \n")## 95% CI: ( -0.3349 , -0.0702 )
Final Sleep Coefficient:-0.2026 95% CI:(-0.3349 , -0.0702)
4d. (5 pts) A reviewer asks: “Why didn’t you just use stepwise selection?” Write a 3–4 sentence response explaining why automated selection is inappropriate for this associative analysis.
Answer:Automated selection is inappropriate for this associative analysis because the primary exposure variable (sleep hours) must remain in the model to address the research question, regardless of its statistical significance (p-value). Since stepwise selection prioritizes statistical fit over causal theory, if we are building an associative model and the exposure is not statistically significant, the algorithm will remove it, which defeats the purpose. Automated selection should be used as an exploratory tool to generate candidate models, but final decisions, especially for a associative analysis, should be based on substantive knowledge, confounding assessment, and parsimony.
5a. (10 pts) You have now built two models for the same data:
Compare these two models: Do they include the same variables? Is the sleep coefficient similar? Why might they differ?
# Comparison: predictive model (Task 2c, BIC-best) vs associative model (Task 4c, sleep + confounders)
bind_rows(
glance(mod_best_bic) |>
dplyr::select(r.squared, adj.r.squared, AIC, BIC) |>
mutate(Model = "Predictive (BIC-best)"),
glance(mod_assoc_fin) |>
dplyr::select(r.squared, adj.r.squared, AIC, BIC) |>
mutate(Model = "Associative (sleep + confounders)")
) |>
dplyr::select(Model, everything()) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
rename(`R²` = r.squared, `Adj. R²` = adj.r.squared) |>
kable(caption = "Comparison: Predictive vs. Associative Model") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Model | R² | Adj. R² | AIC | BIC |
|---|---|---|---|---|
| Predictive (BIC-best) | 0.3857 | 0.3846 | 32638.37 | 32710.06 |
| Associative (sleep + confounders) | 0.3796 | 0.3787 | 32684.06 | 32742.71 |
Answer:The predictive and associative models do not include the exact same variables. The predictive model includes mental health days, sleep hours, age, exercise, general health, and income variables. Variables selected maximize predictive accuracy as per the AIC/BIC criterion. The associative model includes sleep_hrs as a fixed exposure plus only those covariates confirmed as confounders by the 10% change-in-estimate rule. sleep_hrs was never removed regardless of its p-value. In the associative model, the final adjusted sleep coefficient is -0.2026, while in the predictive model the sleep coefficient in the BIC-best model is -0.1951. The sleep coefficient is relatively similar across both models because both adjust for the most important confounders, but they are not identical. The predictive model chooses variables based on statistical fit, while the associative model adjusts only for confirmed confounders.
5b. (10 pts) Write a 4–5 sentence paragraph for a public health audience describing the results of your associative model. Include:
Do not use statistical jargon.
Answer: Among U.S. adults, our study found that getting more sleep is linked to better physical health. While accounting for other contributing factors such as mentally unhealthy days, age, and general health status, each additional hour of sleep is associated with a decrease of about 0.2 physically unhealthy days per month. Since the BRFSS 2020 dataset used for this analysis is cross-sectional, we cannot determine causality. We cannot say for certain if more sleep causes better physical health or if being healthier physically simply makes it easier to sleep.
End of Lab Activity