Introduction

In previous lectures, we learned how to fit multiple linear regression models, include dummy variables for categorical predictors, test for interactions, and assess confounding. But we have not yet addressed a fundamental question: how do we decide which variables belong in the model?

This question has different answers depending on the goal of the analysis:

Goal | What matters | Variable selection driven by |

|******|************-|******************************| | Prediction | Model accuracy and reliability in new data | Statistical criteria (Adj. \(R^2\), AIC, BIC, cross-validation) | | Association | Validity of the exposure coefficient | Subject-matter knowledge, confounding assessment, 10% rule |

In predictive modeling, we search for the subset of variables that best predicts \(Y\) without overfitting. In associative modeling, the exposure variable is always in the model, and we decide which covariates to include based on whether they are confounders.

This lecture covers both approaches, with emphasis on when each is appropriate and the pitfalls of automated selection.


Setup and Data

library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(car)
library(leaps)
library(MASS)

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

The BRFSS 2020 Dataset

We continue with the BRFSS 2020 dataset, predicting physically unhealthy days from a pool of candidate predictors.

brfss_full <- read_xpt(
  "C:/Users/AA843241/OneDrive - University at Albany - SUNY/Desktop/EPI553/lab11/LLCP2020.XPT"
) |>
  clean_names()
brfss_ms <- brfss_full |>
  mutate(
    # Outcome
    physhlth_days = case_when(
      physhlth == 88                  ~ 0,
      physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
      TRUE                           ~ NA_real_
    ),
    # Candidate predictors
    menthlth_days = case_when(
      menthlth == 88                  ~ 0,
      menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
      TRUE                           ~ NA_real_
    ),
    sleep_hrs = case_when(
      sleptim1 >= 1 & sleptim1 <= 14 ~ as.numeric(sleptim1),
      TRUE                           ~ NA_real_
    ),
    age = age80,
    sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
    education = factor(case_when(
      educa %in% c(1, 2, 3) ~ "Less than HS",
      educa == 4             ~ "HS graduate",
      educa == 5             ~ "Some college",
      educa == 6             ~ "College graduate",
      TRUE                   ~ NA_character_
    ), levels = c("Less than HS", "HS graduate", "Some college", "College graduate")),
    exercise = factor(case_when(
      exerany2 == 1 ~ "Yes",
      exerany2 == 2 ~ "No",
      TRUE          ~ NA_character_
    ), levels = c("No", "Yes")),
    gen_health = factor(case_when(
      genhlth == 1 ~ "Excellent",
      genhlth == 2 ~ "Very good",
      genhlth == 3 ~ "Good",
      genhlth == 4 ~ "Fair",
      genhlth == 5 ~ "Poor",
      TRUE         ~ NA_character_
    ), levels = c("Excellent", "Very good", "Good", "Fair", "Poor")),
    income_cat = case_when(
      income2 %in% 1:8 ~ as.numeric(income2),
      TRUE             ~ NA_real_
    ),
    bmi = ifelse(bmi5 > 0, bmi5 / 100, NA_real_)
  ) |>
  filter(
    !is.na(physhlth_days), !is.na(menthlth_days), !is.na(sleep_hrs),
    !is.na(age), age >= 18, !is.na(sex), !is.na(education),
    !is.na(exercise), !is.na(gen_health), !is.na(income_cat), !is.na(bmi)
  )

set.seed(1220)
brfss_ms <- brfss_ms |>
  dplyr::select(physhlth_days, menthlth_days, sleep_hrs, age, sex,
                education, exercise, gen_health, income_cat, bmi) |>
  slice_sample(n = 5000)

# Save for lab
saveRDS(brfss_ms,
  "C:/Users/AA843241/OneDrive - University at Albany - SUNY/Desktop/EPI553/lab11/brfss_ms_2020.rds")

tibble(Metric = c("Observations", "Variables"),
       Value  = c(nrow(brfss_ms), ncol(brfss_ms))) |>
  kable(caption = "Analytic Dataset Dimensions") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Analytic Dataset Dimensions
Metric Value
Observations 5000
Variables 10

We have 10 variables: 1 outcome and 9 candidate predictors. If we considered all possible subsets of the 9 predictors (ignoring interactions and transformations), there would be \(2^9 - 1 = 511\) possible models.


Summary of Key Concepts

Concept | Key Point |

|*********|*********–| | Maximum model | Start with all candidate predictors from literature and research question | | Overfitting vs. underfitting | Overfitting = more variance; underfitting = bias | | Parsimony | Simplest model that captures the important relationships | | \(R^2\) | Always increases with more variables; useless alone for comparison | | Adjusted \(R^2\) | Penalizes complexity; maximize it | | AIC | Balances fit and complexity; minimize it | | BIC | Heavier penalty than AIC; favors simpler models; minimize it | | Partial F-test | Compares reduced to maximum model | | Best subsets | Exhaustive search; leaps::regsubsets() | | Backward elimination | Start full, remove highest p-value; step(direction = "backward") | | Forward selection | Start empty, add lowest p-value; step(direction = "forward") | | Stepwise | Forward + backward at each step; step(direction = "both") | | Caution | Automated methods ignore research questions and inflate Type I error | | Associative models | Exposure stays in model; use 10% change-in-estimate for covariates | | Cross-validation | Estimates out-of-sample performance; protects against overfitting |



Part 2: In-Class Lab Activity

EPI 553 — Model Selection Lab Due: End of class, March 24, 2026


Instructions

In this lab, you will practice both predictive and associative model selection using the BRFSS 2020 dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.

Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.


Data for the Lab

Use the saved analytic dataset from today’s lecture.

Variable | Description | Type |

|*********-|************-|******| | physhlth_days | Physically unhealthy days in past 30 | Continuous (0–30) | | menthlth_days | Mentally unhealthy days in past 30 | Continuous (0–30) | | sleep_hrs | Sleep hours per night | Continuous (1–14) | | age | Age in years (capped at 80) | Continuous | | sex | Sex (Male/Female) | Factor | | education | Education level (4 categories) | Factor | | exercise | Any physical activity (Yes/No) | Factor | | gen_health | General health status (5 categories) | Factor | | income_cat | Household income (1–8 ordinal) | Numeric | | bmi | Body mass index | Continuous |

library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(car)
library(leaps)
library(MASS)

brfss_ms <- readRDS(
  "C:/Users/AA843241/OneDrive - University at Albany - SUNY/Desktop/EPI553/lab12/brfss_ms_2020.rds"
)

Task 1: Maximum Model and Criteria Comparison (15 points)

1a. (5 pts) Fit the maximum model predicting physhlth_days from all 9 candidate predictors. Report \(R^2\), Adjusted \(R^2\), AIC, and BIC.

# The maximum model with all candidate predictors
mod_max <- lm(physhlth_days ~ menthlth_days + sleep_hrs + age + sex +
                education + exercise + gen_health + income_cat + bmi,
              data = brfss_ms)

tidy(mod_max, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Maximum Model: All Candidate Predictors",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Maximum Model: All Candidate Predictors
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 2.6902 0.8556 3.1441 0.0017 1.0128 4.3676
menthlth_days 0.1472 0.0121 12.1488 0.0000 0.1235 0.1710
sleep_hrs -0.1930 0.0673 -2.8679 0.0041 -0.3249 -0.0611
age 0.0180 0.0055 3.2969 0.0010 0.0073 0.0288
sexFemale -0.1889 0.1820 -1.0376 0.2995 -0.5458 0.1680
educationHS graduate 0.2508 0.4297 0.5836 0.5595 -0.5917 1.0933
educationSome college 0.3463 0.4324 0.8009 0.4233 -0.5014 1.1940
educationCollege graduate 0.3336 0.4357 0.7657 0.4439 -0.5206 1.1878
exerciseYes -1.2866 0.2374 -5.4199 0.0000 -1.7520 -0.8212
gen_healthVery good 0.4373 0.2453 1.7824 0.0747 -0.0437 0.9183
gen_healthGood 1.5913 0.2651 6.0022 0.0000 1.0716 2.1111
gen_healthFair 7.0176 0.3682 19.0586 0.0000 6.2957 7.7394
gen_healthPoor 20.4374 0.5469 37.3722 0.0000 19.3653 21.5095
income_cat -0.1817 0.0503 -3.6092 0.0003 -0.2803 -0.0830
bmi 0.0130 0.0145 0.8997 0.3683 -0.0153 0.0414
glance(mod_max) |>
  dplyr::select(r.squared, adj.r.squared, sigma, AIC, BIC, df.residual) |>
  mutate(across(everything(), \(x) round(x, 3))) |>
  kable(caption = "Maximum Model: Fit Statistics") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Maximum Model: Fit Statistics
r.squared adj.r.squared sigma AIC BIC df.residual
0.386 0.384 6.321 32645.79 32750.06 4985

Interpretation: The maximum model explains approximately 38.6% of the variance in physically unhealthy days (R² = 0.386, Adjusted R² = 0.384). The AIC is 32,645.8 and BIC is 32,750.1; these serve as baselines for comparing simpler models.

1b. (5 pts) Now fit a “minimal” model using only menthlth_days and age. Report the same four criteria. How do the two models compare?

mod_min <- lm (physhlth_days ~ menthlth_days + age, data = brfss_ms)

tidy(mod_min, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Minimum Model: physhlth_days ~Mental health days and  Age ",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Minimum Model: physhlth_days ~Mental health days and Age
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) -1.6983 0.3641 -4.6647 0 -2.4121 -0.9846
menthlth_days 0.3237 0.0135 24.0149 0 0.2973 0.3501
age 0.0716 0.0062 11.4763 0 0.0594 0.0838
glance(mod_min) |>
  dplyr::select(r.squared, adj.r.squared, sigma, AIC, BIC, df.residual) |>
  mutate(across(everything(), \(x) round(x, 3))) |>
  kable(caption = "Minimum Model: Fit Statistics") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Minimum Model: Fit Statistics
r.squared adj.r.squared sigma AIC BIC df.residual
0.115 0.115 7.58 34449.78 34475.85 4997

Interpretation: The minimum model explains approximately 11.5% of the variance in meantally unhealthy days (R² = 0.115, Adjusted R² = 0.115). The AIC is 34449.78 and BIC is 34475.85.

1c. (5 pts) Explain why \(R^2\) is a poor criterion for comparing these two models. What makes Adjusted \(R^2\), AIC, and BIC better choices?

anova(mod_min, mod_max) |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Partial F-test: Minimum Model vs. Maximum Model") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Partial F-test: Minimum Model vs. Maximum Model
term df.residual rss df sumsq statistic p.value
physhlth_days ~ menthlth_days + age 4997 287124.8 NA NA NA NA
physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education + exercise + gen_health + income_cat + bmi 4985 199201.8 12 87923 183.3551 0

In the first model R² is higher because it has maximum number of predictors, R² here is a bad criterion because it always goes up when you add predictors — even if they’re useless. The full model has 9 predictors vs 2 in the minimum model, so of course R² is higher. The drop in RSS (287,124 → 199,202) looks big, but R² alone can’t tell if that improvement is real or just from adding more variables. -Better options: Adjusted R² of max model is greater than the min model (Adj R² max = 0.384 vs Adj R² max= 0.115)
As Lower AIC and BIC = better model- AIC max 32645.79, BIC max - 32750.06 vs. AIC min 34449.78, BIC min - 34475.85 -In conclusion the maximum model has higher R² and Adjusted R² and lower AIC and BIC, so the additional variables are contributing meaningful improvement to the model, not just noise.


Task 2: Best Subsets Regression (20 points)

2a. (5 pts) Use leaps::regsubsets() to perform best subsets regression with nvmax = 15. Create a plot of Adjusted \(R^2\) vs. number of variables. At what model size does Adjusted \(R^2\) plateau?

# Prepare a model matrix (need numeric predictors for leaps)
# Use the formula interface approach
best_subsets <- regsubsets(
  physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education +
    exercise + gen_health + income_cat + bmi,
  data = brfss_ms,
  nvmax = 15,      # maximum number of variables to consider
  method = "exhaustive"
)

best_summary <- summary(best_subsets)

2b. (5 pts) Create a plot of BIC vs. number of variables. Which model size minimizes BIC?

subset_metrics <- tibble(
  p = 1:length(best_summary$adjr2),
  `Adj. R²` = best_summary$adjr2,
  BIC = best_summary$bic,
  Cp = best_summary$cp
)

p1 <- ggplot(subset_metrics, aes(x = p, y = `Adj. R²`)) +
  geom_line(linewidth = 1, color = "steelblue") +
  geom_point(size = 3, color = "steelblue") +
  geom_vline(xintercept = which.max(best_summary$adjr2),
             linetype = "dashed", color = "tomato") +
  labs(title = "Adjusted R² by Model Size", x = "Number of Variables", y = "Adjusted R²") +
  theme_minimal(base_size = 12)

p2 <- ggplot(subset_metrics, aes(x = p, y = BIC)) +
  geom_line(linewidth = 1, color = "steelblue") +
  geom_point(size = 3, color = "steelblue") +
  geom_vline(xintercept = which.min(best_summary$bic),
             linetype = "dashed", color = "tomato") +
  labs(title = "BIC by Model Size", x = "Number of Variables", y = "BIC") +
  theme_minimal(base_size = 12)

gridExtra::grid.arrange(p1, p2, ncol = 2)
Best Subsets: Adjusted R² and BIC by Model Size

Best Subsets: Adjusted R² and BIC by Model Size

2c. (5 pts) Identify the variables included in the BIC-best model. Fit this model explicitly using lm() and report its coefficients.

cat("Best model by BIC:", which.min(best_summary$bic), "variables\n")
## Best model by BIC: 8 variables
# Show which variables are in the BIC-best model
best_bic_idx <- which.min(best_summary$bic)
best_vars <- names(which(best_summary$which[best_bic_idx, -1]))
cat("\nVariables in BIC-best model:\n")
## 
## Variables in BIC-best model:
cat(paste(" ", best_vars), sep = "\n")
##   menthlth_days
##   sleep_hrs
##   age
##   exerciseYes
##   gen_healthGood
##   gen_healthFair
##   gen_healthPoor
##   income_cat
brfss_ms <- brfss_ms |>
  mutate(
    gh_verygood = as.integer(gen_health == "Very good"), #not in the BIC best model
    gh_good     = as.integer(gen_health == "Good"),
    gh_fair     = as.integer(gen_health == "Fair"),
    gh_poor     = as.integer(gen_health == "Poor")
  )

mod_bic<- lm(physhlth_days ~ menthlth_days + sleep_hrs + age +
                      exercise + gh_good + gh_fair + gh_poor + income_cat,
                    data = brfss_ms)

tidy(mod_bic, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Best model by BIC",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Best model by BIC
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 3.4563 0.6510 5.3090 0.0000 2.1800 4.7327
menthlth_days 0.1472 0.0120 12.2442 0.0000 0.1237 0.1708
sleep_hrs -0.1986 0.0672 -2.9551 0.0031 -0.3303 -0.0668
age 0.0182 0.0054 3.3652 0.0008 0.0076 0.0289
exerciseYes -1.3032 0.2335 -5.5808 0.0000 -1.7610 -0.8454
gh_good 1.3489 0.2108 6.3975 0.0000 0.9355 1.7622
gh_fair 6.7793 0.3253 20.8432 0.0000 6.1416 7.4169
gh_poor 20.1975 0.5170 39.0699 0.0000 19.1840 21.2110
income_cat -0.1654 0.0472 -3.5049 0.0005 -0.2580 -0.0729

Interpretation: - For each additional day of poor mental health, physically unhealthy day increase on average 0.146, holding all other variables in constant. - For each additional hour of sleep, there is 0.195 physically unhealthy days less, holding all other factors in constant. - For each +1 year of age, there is 0.02 increase in physically unhealthy days, holding all other factors in constant. - For any exercise in the past 30 days, there is 1.29 less days of physically unhealthy days on average, holding all other factors in constant. - For reported general health Good there are on average 1.34 days more of Physically unhealthy days in comparicon for reported general health status “Excellent”, holding all other factors in constant. - For reported general health Fair there are on average 6.78 days more of Physically unhealthy days in comparicon for reported general health status “Excellent”, holding all other factors in constant. - For reported general health status Poor there are on average 20.5 days more of Physically unhealthy days in comparicon for reported general health status “Excellent”, holding all other factors in constant.

cat("Best model by Adj. R²:", which.max(best_summary$adjr2), "variables\n")
## Best model by Adj. R²: 10 variables
# Show which variables are in the Adj.R² -best model
best_adj_r <- which.max (best_summary$adjr2)
best_vars1 <-names(which(best_summary$which[best_adj_r, -1]))
cat(paste (" ", best_vars1), sep = "\n")
##   menthlth_days
##   sleep_hrs
##   age
##   sexFemale
##   exerciseYes
##   gen_healthVery good
##   gen_healthGood
##   gen_healthFair
##   gen_healthPoor
##   income_cat
mod_adjr2 <-lm(physhlth_days ~ menthlth_days + sleep_hrs + age + sex 
                    + exercise + gen_health + income_cat,
                    data = brfss_ms)

tidy(mod_adjr2, conf.int = TRUE) |>
       mutate(across(where(is.numeric), \(x) round (x,4))) |>
  kable(
    caption = " Best model by Adj. R²",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling (bootstrap_options = c("stripped", "hover"), full_width = FALSE)
Best model by Adj. R²
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 3.2833 0.6730 4.8789 0.0000 1.9640 4.6026
menthlth_days 0.1474 0.0121 12.1760 0.0000 0.1237 0.1711
sleep_hrs -0.1943 0.0672 -2.8909 0.0039 -0.3261 -0.0625
age 0.0179 0.0055 3.2710 0.0011 0.0072 0.0286
sexFemale -0.1865 0.1812 -1.0295 0.3033 -0.5418 0.1687
exerciseYes -1.2921 0.2336 -5.5303 0.0000 -1.7501 -0.8340
gen_healthVery good 0.4591 0.2441 1.8807 0.0601 -0.0195 0.9377
gen_healthGood 1.6259 0.2602 6.2485 0.0000 1.1158 2.1360
gen_healthFair 7.0576 0.3622 19.4844 0.0000 6.3475 7.7677
gen_healthPoor 20.4704 0.5436 37.6579 0.0000 19.4047 21.5361
income_cat -0.1698 0.0474 -3.5856 0.0003 -0.2626 -0.0770

2d. (5 pts) Compare the BIC-best model to the Adjusted \(R^2\)-best model. Are they the same? If not, which would you prefer and why?

method_comparison <- tribble(
  ~Method, ~`Variables selected`, ~`Adj. R²`, ~AIC, ~BIC,
  "Best model by Adj. R²",
    length(coef(mod_adjr2)) - 1,
    round(glance(mod_adjr2)$adj.r.squared, 4),
    round(AIC(mod_adjr2), 1),
    round(BIC(mod_adjr2), 1),
  "Best model by BIC",
    length(coef(mod_bic)) - 1,
    round(glance(mod_bic)$adj.r.squared, 4),
    round(AIC(mod_bic), 1),
    round(BIC(mod_bic), 1),
)

method_comparison |>
  kable(caption = "Comparison of Variable Selection Methods") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Comparison of Variable Selection Methods
Method Variables selected Adj. R² AIC BIC
Best model by Adj. R² 10 0.3846 32639.3 32717.5
Best model by BIC 8 0.3843 32640.0 32705.1

Interpretation: Comparing the BIC-best model to the Adjusted \(R^2\)-best model, we observe that R² and Adj R², AIC and BIC are basically similar. But best model by BIC has slightly lesser BIc, as we that gen_health: Very good has p-value >0.05 as not significantly different from excellent reference group. ***

Task 3: Automated Selection Methods (20 points)

3a. (5 pts) Perform backward elimination using step() with AIC as the criterion. Which variables are removed? Which remain?

# Automated backward elimination using AIC
mod_backward1 <- step(mod_max, direction = "backward", trace = 1)
## Start:  AIC=18454.4
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education + 
##     exercise + gen_health + income_cat + bmi
## 
##                 Df Sum of Sq    RSS   AIC
## - education      3        29 199231 18449
## - bmi            1        32 199234 18453
## - sex            1        43 199245 18454
## <none>                       199202 18454
## - sleep_hrs      1       329 199530 18461
## - age            1       434 199636 18463
## - income_cat     1       521 199722 18466
## - exercise       1      1174 200376 18482
## - menthlth_days  1      5898 205100 18598
## - gen_health     4     66437 265639 19886
## 
## Step:  AIC=18449.13
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + exercise + 
##     gen_health + income_cat + bmi
## 
##                 Df Sum of Sq    RSS   AIC
## - bmi            1        32 199262 18448
## - sex            1        40 199270 18448
## <none>                       199231 18449
## - sleep_hrs      1       327 199557 18455
## - age            1       439 199670 18458
## - income_cat     1       520 199751 18460
## - exercise       1      1151 200381 18476
## - menthlth_days  1      5929 205159 18594
## - gen_health     4     66459 265690 19881
## 
## Step:  AIC=18447.92
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + exercise + 
##     gen_health + income_cat
## 
##                 Df Sum of Sq    RSS   AIC
## - sex            1        42 199305 18447
## <none>                       199262 18448
## - sleep_hrs      1       334 199596 18454
## - age            1       427 199690 18457
## - income_cat     1       514 199776 18459
## - exercise       1      1222 200484 18477
## - menthlth_days  1      5921 205184 18592
## - gen_health     4     67347 266609 19896
## 
## Step:  AIC=18446.98
## physhlth_days ~ menthlth_days + sleep_hrs + age + exercise + 
##     gen_health + income_cat
## 
##                 Df Sum of Sq    RSS   AIC
## <none>                       199305 18447
## - sleep_hrs      1       337 199641 18453
## - age            1       409 199713 18455
## - income_cat     1       492 199797 18457
## - exercise       1      1214 200518 18475
## - menthlth_days  1      5882 205186 18590
## - gen_health     4     67980 267285 19906

Interpretation: Backward elimination using AIC was performed starting from the full model. The stepwise procedure iteratively removed variables that improved (reduced) the AIC. The variables education, bmi, and sex were removed during the selection process. The final model retained menthlth_days, sleep_hrs, age, exercise, gen_health, and income_cat as predictors.

3b. (5 pts) Perform forward selection using step(). Does it arrive at the same model as backward elimination?

# Automated forward selection using AIC

mod_null1 <- lm(physhlth_days ~ 1, data = brfss_ms)

mod_forward1 <- step(mod_null1,

                    scope = list(lower = mod_null1, upper = mod_max),

                    direction = "forward", trace = 1)
## Start:  AIC=20865.24
## physhlth_days ~ 1
## 
##                 Df Sum of Sq    RSS   AIC
## + gen_health     4    115918 208518 18663
## + menthlth_days  1     29743 294693 20387
## + exercise       1     19397 305038 20559
## + income_cat     1     19104 305332 20564
## + education      3      5906 318530 20779
## + age            1      4173 320263 20803
## + bmi            1      4041 320395 20805
## + sleep_hrs      1      3717 320719 20810
## <none>                       324435 20865
## + sex            1         7 324429 20867
## 
## Step:  AIC=18662.93
## physhlth_days ~ gen_health
## 
##                 Df Sum of Sq    RSS   AIC
## + menthlth_days  1    6394.9 202123 18509
## + exercise       1    1652.4 206865 18625
## + income_cat     1    1306.9 207211 18634
## + sleep_hrs      1     756.1 207762 18647
## + bmi            1      91.2 208427 18663
## <none>                       208518 18663
## + sex            1      38.5 208479 18664
## + age            1      32.2 208486 18664
## + education      3     145.0 208373 18666
## 
## Step:  AIC=18509.19
## physhlth_days ~ gen_health + menthlth_days
## 
##              Df Sum of Sq    RSS   AIC
## + exercise    1   1650.52 200472 18470
## + income_cat  1    817.89 201305 18491
## + age         1    464.73 201658 18500
## + sleep_hrs   1    257.79 201865 18505
## + bmi         1     90.51 202032 18509
## <none>                    202123 18509
## + sex         1      3.00 202120 18511
## + education   3    111.58 202011 18512
## 
## Step:  AIC=18470.19
## physhlth_days ~ gen_health + menthlth_days + exercise
## 
##              Df Sum of Sq    RSS   AIC
## + income_cat  1    509.09 199963 18460
## + age         1    333.74 200139 18464
## + sleep_hrs   1    253.06 200219 18466
## <none>                    200472 18470
## + bmi         1     21.21 200451 18472
## + sex         1     10.74 200462 18472
## + education   3     26.94 200445 18476
## 
## Step:  AIC=18459.48
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat
## 
##             Df Sum of Sq    RSS   AIC
## + age        1    321.97 199641 18453
## + sleep_hrs  1    250.25 199713 18455
## <none>                   199963 18460
## + bmi        1     27.98 199935 18461
## + sex        1     27.17 199936 18461
## + education  3     26.66 199937 18465
## 
## Step:  AIC=18453.42
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat + 
##     age
## 
##             Df Sum of Sq    RSS   AIC
## + sleep_hrs  1    336.79 199305 18447
## <none>                   199641 18453
## + sex        1     45.31 199596 18454
## + bmi        1     42.00 199599 18454
## + education  3     22.62 199619 18459
## 
## Step:  AIC=18446.98
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat + 
##     age + sleep_hrs
## 
##             Df Sum of Sq    RSS   AIC
## <none>                   199305 18447
## + sex        1    42.328 199262 18448
## + bmi        1    34.434 199270 18448
## + education  3    24.800 199280 18452
tidy(mod_forward1, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Forward Selection Result (AIC-based)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Forward Selection Result (AIC-based)
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 3.1864 0.6663 4.7819 0.0000 1.8800 4.4927
gen_healthVery good 0.4617 0.2441 1.8914 0.0586 -0.0169 0.9403
gen_healthGood 1.6368 0.2600 6.2953 0.0000 1.1271 2.1465
gen_healthFair 7.0787 0.3616 19.5735 0.0000 6.3697 7.7876
gen_healthPoor 20.5084 0.5423 37.8149 0.0000 19.4452 21.5716
menthlth_days 0.1461 0.0120 12.1352 0.0000 0.1225 0.1697
exerciseYes -1.2877 0.2336 -5.5127 0.0000 -1.7457 -0.8298
income_cat -0.1657 0.0472 -3.5115 0.0004 -0.2582 -0.0732
age 0.0174 0.0054 3.1981 0.0014 0.0067 0.0281
sleep_hrs -0.1951 0.0672 -2.9038 0.0037 -0.3269 -0.0634

Interpretation: Forward selection using AIC was performed starting from the null model. Variables were added sequentially based on the greatest reduction in AIC at each step. The final model included gen_health, menthlth_days, exercise, income_cat, age, and sleep_hrs. This final set of predictors is identical to the model obtained from backward elimination, indicating that both selection methods converged to the same model under the AIC criterion.

3c. (5 pts) Compare the backward, forward, and stepwise results in a single table showing the number of variables, Adjusted \(R^2\), AIC, and BIC for each.

mod_stepwise1 <- step(mod_null1,
                     scope = list(lower = mod_null1, upper = mod_max),
                     direction = "both", trace = 1)
## Start:  AIC=20865.24
## physhlth_days ~ 1
## 
##                 Df Sum of Sq    RSS   AIC
## + gen_health     4    115918 208518 18663
## + menthlth_days  1     29743 294693 20387
## + exercise       1     19397 305038 20559
## + income_cat     1     19104 305332 20564
## + education      3      5906 318530 20779
## + age            1      4173 320263 20803
## + bmi            1      4041 320395 20805
## + sleep_hrs      1      3717 320719 20810
## <none>                       324435 20865
## + sex            1         7 324429 20867
## 
## Step:  AIC=18662.93
## physhlth_days ~ gen_health
## 
##                 Df Sum of Sq    RSS   AIC
## + menthlth_days  1      6395 202123 18509
## + exercise       1      1652 206865 18625
## + income_cat     1      1307 207211 18634
## + sleep_hrs      1       756 207762 18647
## + bmi            1        91 208427 18663
## <none>                       208518 18663
## + sex            1        38 208479 18664
## + age            1        32 208486 18664
## + education      3       145 208373 18666
## - gen_health     4    115918 324435 20865
## 
## Step:  AIC=18509.19
## physhlth_days ~ gen_health + menthlth_days
## 
##                 Df Sum of Sq    RSS   AIC
## + exercise       1      1651 200472 18470
## + income_cat     1       818 201305 18491
## + age            1       465 201658 18500
## + sleep_hrs      1       258 201865 18505
## + bmi            1        91 202032 18509
## <none>                       202123 18509
## + sex            1         3 202120 18511
## + education      3       112 202011 18512
## - menthlth_days  1      6395 208518 18663
## - gen_health     4     92570 294693 20387
## 
## Step:  AIC=18470.19
## physhlth_days ~ gen_health + menthlth_days + exercise
## 
##                 Df Sum of Sq    RSS   AIC
## + income_cat     1       509 199963 18460
## + age            1       334 200139 18464
## + sleep_hrs      1       253 200219 18466
## <none>                       200472 18470
## + bmi            1        21 200451 18472
## + sex            1        11 200462 18472
## + education      3        27 200445 18476
## - exercise       1      1651 202123 18509
## - menthlth_days  1      6393 206865 18625
## - gen_health     4     78857 279330 20121
## 
## Step:  AIC=18459.48
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat
## 
##                 Df Sum of Sq    RSS   AIC
## + age            1       322 199641 18453
## + sleep_hrs      1       250 199713 18455
## <none>                       199963 18460
## + bmi            1        28 199935 18461
## + sex            1        27 199936 18461
## + education      3        27 199937 18465
## - income_cat     1       509 200472 18470
## - exercise       1      1342 201305 18491
## - menthlth_days  1      5988 205952 18605
## - gen_health     4     72713 272676 20002
## 
## Step:  AIC=18453.42
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat + 
##     age
## 
##                 Df Sum of Sq    RSS   AIC
## + sleep_hrs      1       337 199305 18447
## <none>                       199641 18453
## + sex            1        45 199596 18454
## + bmi            1        42 199599 18454
## + education      3        23 199619 18459
## - age            1       322 199963 18460
## - income_cat     1       497 200139 18464
## - exercise       1      1231 200873 18482
## - menthlth_days  1      6304 205945 18607
## - gen_health     4     68936 268577 19929
## 
## Step:  AIC=18446.98
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat + 
##     age + sleep_hrs
## 
##                 Df Sum of Sq    RSS   AIC
## <none>                       199305 18447
## + sex            1        42 199262 18448
## + bmi            1        34 199270 18448
## + education      3        25 199280 18452
## - sleep_hrs      1       337 199641 18453
## - age            1       409 199713 18455
## - income_cat     1       492 199797 18457
## - exercise       1      1214 200518 18475
## - menthlth_days  1      5882 205186 18590
## - gen_health     4     67980 267285 19906
method_comparison <- tribble(
  ~Method, ~`Variables selected`, ~`Adj. R²`, ~AIC, ~BIC,
  "Maximum model",
    length(coef(mod_max)) - 1,
    round(glance(mod_max)$adj.r.squared, 4),
    round(AIC(mod_max), 1),
    round(BIC(mod_max), 1),
  "Backward (AIC)",
    length(coef(mod_backward1)) - 1,
    round(glance(mod_backward1)$adj.r.squared, 4),
    round(AIC(mod_backward1), 1),
    round(BIC(mod_backward1), 1),
  "Forward (AIC)",
    length(coef(mod_forward1)) - 1,
    round(glance(mod_forward1)$adj.r.squared, 4),
    round(AIC(mod_forward1), 1),
    round(BIC(mod_forward1), 1),
  "Stepwise (AIC)",
    length(coef(mod_stepwise1)) - 1,
    round(glance(mod_stepwise1)$adj.r.squared, 4),
    round(AIC(mod_stepwise1), 1),
    round(BIC(mod_stepwise1), 1)
)

method_comparison |>
  kable(caption = "Comparison of Variable Selection Methods") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Comparison of Variable Selection Methods
Method Variables selected Adj. R² AIC BIC
Maximum model 14 0.3843 32645.8 32750.1
Backward (AIC) 9 0.3846 32638.4 32710.1
Forward (AIC) 9 0.3846 32638.4 32710.1
Stepwise (AIC) 9 0.3846 32638.4 32710.1

3d. (5 pts) List three reasons why you should not blindly trust the results of automated variable selection. Which of these concerns is most relevant for epidemiological research? > -It ignores the research question and may remove variables that are theoretically important but not statistically significant in the sample. -It inflates Type I error because multiple comparisons are implicitly performed during the stepwise procedure, increasing the chance of including spurious predictors. -It is path-dependent, meaning forward and backward procedures (or different starting models) can lead to different final models, which reduces reproducibility and stability of the results. ***

Task 4: Associative Model Building (25 points)

For this task, the exposure is sleep_hrs and the outcome is physhlth_days. You are building an associative model to estimate the effect of sleep on physical health.

4a. (5 pts) Fit the crude model: physhlth_days ~ sleep_hrs. Report the sleep coefficient.

mod_sleep <- lm(physhlth_days ~ sleep_hrs, data = brfss_ms)

tidy(mod_sleep, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Associative Model: Sleep hours and Physical health day",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Associative Model: Sleep hours and Physical health day
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 7.9110 0.5959 13.2755 0 6.7428 9.0793
sleep_hrs -0.6321 0.0831 -7.6104 0 -0.7949 -0.4693
glance(mod_sleep) |>
  dplyr::select(r.squared, adj.r.squared, sigma, AIC, BIC, df.residual) |>
  mutate(across(everything(), \(x) round(x, 3))) |>
  kable(caption = "Sleep Model: Fit Statistics") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Sleep Model: Fit Statistics
r.squared adj.r.squared sigma AIC BIC df.residual
0.011 0.011 8.011 35001.02 35020.57 4998

4b. (10 pts) Fit the maximum associative model: physhlth_days ~ sleep_hrs + [all other covariates]. Note the adjusted sleep coefficient and compute the 10% interval. Then systematically remove each covariate one at a time and determine which are confounders using the 10% rule. Present your results in a summary table.

mod_sleep_max <- lm( physhlth_days ~ sleep_hrs + menthlth_days + age + sex +
                education + exercise + gen_health + income_cat + bmi,
              data = brfss_ms)
tidy(mod_sleep_max, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Associative Model: Sleep hours and Physical health day",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Associative Model: Sleep hours and Physical health day
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 2.6902 0.8556 3.1441 0.0017 1.0128 4.3676
sleep_hrs -0.1930 0.0673 -2.8679 0.0041 -0.3249 -0.0611
menthlth_days 0.1472 0.0121 12.1488 0.0000 0.1235 0.1710
age 0.0180 0.0055 3.2969 0.0010 0.0073 0.0288
sexFemale -0.1889 0.1820 -1.0376 0.2995 -0.5458 0.1680
educationHS graduate 0.2508 0.4297 0.5836 0.5595 -0.5917 1.0933
educationSome college 0.3463 0.4324 0.8009 0.4233 -0.5014 1.1940
educationCollege graduate 0.3336 0.4357 0.7657 0.4439 -0.5206 1.1878
exerciseYes -1.2866 0.2374 -5.4199 0.0000 -1.7520 -0.8212
gen_healthVery good 0.4373 0.2453 1.7824 0.0747 -0.0437 0.9183
gen_healthGood 1.5913 0.2651 6.0022 0.0000 1.0716 2.1111
gen_healthFair 7.0176 0.3682 19.0586 0.0000 6.2957 7.7394
gen_healthPoor 20.4374 0.5469 37.3722 0.0000 19.3653 21.5095
income_cat -0.1817 0.0503 -3.6092 0.0003 -0.2803 -0.0830
bmi 0.0130 0.0145 0.8997 0.3683 -0.0153 0.0414
glance(mod_sleep_max) |>
  dplyr::select(r.squared, adj.r.squared, sigma, AIC, BIC, df.residual) |>
  mutate(across(everything(), \(x) round(x, 3))) |>
  kable(caption = "Maximum  Sleep Model: Fit Statistics") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Maximum Sleep Model: Fit Statistics
r.squared adj.r.squared sigma AIC BIC df.residual
0.386 0.384 6.321 32645.79 32750.06 4985

4c. (5 pts) Fit the final associative model including only sleep and the identified confounders. Report the sleep coefficient and its 95% CI.

mod_sleep_fin <- lm(physhlth_days ~ sleep_hrs + menthlth_days + age + exercise + gen_health +
income_cat, data=brfss_ms)

tidy(mod_sleep_fin, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Final Associative Model: Sleep hours and Physical health day",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Final Associative Model: Sleep hours and Physical health day
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 3.1864 0.6663 4.7819 0.0000 1.8800 4.4927
sleep_hrs -0.1951 0.0672 -2.9038 0.0037 -0.3269 -0.0634
menthlth_days 0.1461 0.0120 12.1352 0.0000 0.1225 0.1697
age 0.0174 0.0054 3.1981 0.0014 0.0067 0.0281
exerciseYes -1.2877 0.2336 -5.5127 0.0000 -1.7457 -0.8298
gen_healthVery good 0.4617 0.2441 1.8914 0.0586 -0.0169 0.9403
gen_healthGood 1.6368 0.2600 6.2953 0.0000 1.1271 2.1465
gen_healthFair 7.0787 0.3616 19.5735 0.0000 6.3697 7.7876
gen_healthPoor 20.5084 0.5423 37.8149 0.0000 19.4452 21.5716
income_cat -0.1657 0.0472 -3.5115 0.0004 -0.2582 -0.0732
glance(mod_sleep_fin) |>
  dplyr::select(r.squared, adj.r.squared, sigma, AIC, BIC, df.residual) |>
  mutate(across(everything(), \(x) round(x, 3))) |>
  kable(caption = "Final Model: Fit Statistics") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Final Model: Fit Statistics
r.squared adj.r.squared sigma AIC BIC df.residual
0.386 0.385 6.32 32638.37 32710.06 4990

4d. (5 pts) A reviewer asks: “Why didn’t you just use stepwise selection?” Write a 3–4 sentence response explaining why automated selection is inappropriate for this associative analysis. >Interpetation : Stepwise selection is not appropriate here because our goal is association, not prediction. These methods rely only on statistical criteria and can drop important confounders, which would bias the sleep coefficient. They also tend to produce unstable models and inflate Type I error. Instead, we selected variables based on subject-matter knowledge to properly control for confounding and get a valid estimate of the association.


Task 5: Synthesis (20 points)

5a. (10 pts) You have now built two models for the same data:

  • A predictive model (from Task 2 or 3, the best model by AIC/BIC)
  • An associative model (from Task 4, focused on sleep)

Compare these two models: Do they include the same variables? Is the sleep coefficient similar? Why might they differ

anova(mod_sleep_fin, mod_bic)
## Analysis of Variance Table
## 
## Model 1: physhlth_days ~ sleep_hrs + menthlth_days + age + exercise + 
##     gen_health + income_cat
## Model 2: physhlth_days ~ menthlth_days + sleep_hrs + age + exercise + 
##     gh_good + gh_fair + gh_poor + income_cat
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1   4990 199305                              
## 2   4991 199447 -1   -142.89 3.5774 0.05863 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation:The two models include different sets of variables, with the predictive model being more parsimonious and the associative model using more detailed health indicators. The ANOVA test shows that the more complex model does not significantly improve model fit (p = 0.0586). This suggests that adding the extra variables does not meaningfully change the model performance. The sleep coefficient may still be similar across models, but any differences are likely due to how the models adjust for other variables rather than a true change in the relationship.

5b. (10 pts) Write a 4–5 sentence paragraph for a public health audience describing the results of your associative model. Include: > After adjusting for mental health, age, exercise, general health, and income, sleep is still associated with physical health. Each additional hour of sleep is linked to about 0.2 fewer physically unhealthy days, indicating a negative relationship between sleep and poor health. These other variables were included because they may influence both sleep and physical health and could confound the association. Overall, the results suggest that better sleep is related to better physical health, but since the data are cross-sectional, we cannot conclude that sleep causes this effect.


End of Lab Activity