Introduction

In previous lectures, we learned how to fit multiple linear regression models, include dummy variables for categorical predictors, test for interactions, and assess confounding. But we have not yet addressed a fundamental question: how do we decide which variables belong in the model?

This question has different answers depending on the goal of the analysis:

Goal What matters Variable selection driven by
Prediction Model accuracy and reliability in new data Statistical criteria (Adj. \(R^2\), AIC, BIC, cross-validation)
Association Validity of the exposure coefficient Subject-matter knowledge, confounding assessment, 10% rule

In predictive modeling, we search for the subset of variables that best predicts \(Y\) without overfitting. In associative modeling, the exposure variable is always in the model, and we decide which covariates to include based on whether they are confounders.

This lecture covers both approaches, with emphasis on when each is appropriate and the pitfalls of automated selection.


Setup and Data

library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(car)
library(leaps)
library(MASS)

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

The BRFSS 2020 Dataset

We continue with the BRFSS 2020 dataset, predicting physically unhealthy days from a pool of candidate predictors.

brfss_full <- read_xpt(
  "C:/Users/graci/OneDrive/Documents/UA GRAD SCHOOL/2nd Semester/EPI553/BRFSS_2020.XPT"
) |>
  clean_names()
brfss_ms <- brfss_full |>
  mutate(
    # Outcome
    physhlth_days = case_when(
      physhlth == 88                  ~ 0,
      physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
      TRUE                           ~ NA_real_
    ),
    # Candidate predictors
    menthlth_days = case_when(
      menthlth == 88                  ~ 0,
      menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
      TRUE                           ~ NA_real_
    ),
    sleep_hrs = case_when(
      sleptim1 >= 1 & sleptim1 <= 14 ~ as.numeric(sleptim1),
      TRUE                           ~ NA_real_
    ),
    age = age80,
    sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
    education = factor(case_when(
      educa %in% c(1, 2, 3) ~ "Less than HS",
      educa == 4             ~ "HS graduate",
      educa == 5             ~ "Some college",
      educa == 6             ~ "College graduate",
      TRUE                   ~ NA_character_
    ), levels = c("Less than HS", "HS graduate", "Some college", "College graduate")),
    exercise = factor(case_when(
      exerany2 == 1 ~ "Yes",
      exerany2 == 2 ~ "No",
      TRUE          ~ NA_character_
    ), levels = c("No", "Yes")),
    gen_health = factor(case_when(
      genhlth == 1 ~ "Excellent",
      genhlth == 2 ~ "Very good",
      genhlth == 3 ~ "Good",
      genhlth == 4 ~ "Fair",
      genhlth == 5 ~ "Poor",
      TRUE         ~ NA_character_
    ), levels = c("Excellent", "Very good", "Good", "Fair", "Poor")),
    income_cat = case_when(
      income2 %in% 1:8 ~ as.numeric(income2),
      TRUE             ~ NA_real_
    ),
    bmi = ifelse(bmi5 > 0, bmi5 / 100, NA_real_)
  ) |>
  filter(
    !is.na(physhlth_days), !is.na(menthlth_days), !is.na(sleep_hrs),
    !is.na(age), age >= 18, !is.na(sex), !is.na(education),
    !is.na(exercise), !is.na(gen_health), !is.na(income_cat), !is.na(bmi)
  )

set.seed(1220)
brfss_ms <- brfss_ms |>
  dplyr::select(physhlth_days, menthlth_days, sleep_hrs, age, sex,
                education, exercise, gen_health, income_cat, bmi) |>
  slice_sample(n = 5000)

# Save for lab
saveRDS(brfss_ms,
  "C:/Users/graci/OneDrive/Documents/UA GRAD SCHOOL/2nd Semester/EPI553/BRFSS_2020.rds")

tibble(Metric = c("Observations", "Variables"),
       Value  = c(nrow(brfss_ms), ncol(brfss_ms))) |>
  kable(caption = "Analytic Dataset Dimensions") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Analytic Dataset Dimensions
Metric Value
Observations 5000
Variables 10

In-Class Lab Activity


Data for the Lab

Use the saved analytic dataset from today’s lecture.

Variable Description Type
physhlth_days Physically unhealthy days in past 30 Continuous (0–30)
menthlth_days Mentally unhealthy days in past 30 Continuous (0–30)
sleep_hrs Sleep hours per night Continuous (1–14)
age Age in years (capped at 80) Continuous
sex Sex (Male/Female) Factor
education Education level (4 categories) Factor
exercise Any physical activity (Yes/No) Factor
gen_health General health status (5 categories) Factor
income_cat Household income (1–8 ordinal) Numeric
bmi Body mass index Continuous
brfss_ms <- readRDS(
  "C:/Users/graci/OneDrive/Documents/UA GRAD SCHOOL/2nd Semester/EPI553/BRFSS_2020.rds"
)

Task 1: Maximum Model and Criteria Comparison (15 points)

1a. (5 pts) Fit the maximum model predicting physhlth_days from all 9 candidate predictors. Report \(R^2\), Adjusted \(R^2\), AIC, and BIC.

mod_max <- lm(physhlth_days ~ menthlth_days + sleep_hrs + age + sex +
                education + exercise + gen_health + income_cat + bmi,
              data = brfss_ms)

tidy(mod_max, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Maximum Model: All Candidate Predictors",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Maximum Model: All Candidate Predictors
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 2.6902 0.8556 3.1441 0.0017 1.0128 4.3676
menthlth_days 0.1472 0.0121 12.1488 0.0000 0.1235 0.1710
sleep_hrs -0.1930 0.0673 -2.8679 0.0041 -0.3249 -0.0611
age 0.0180 0.0055 3.2969 0.0010 0.0073 0.0288
sexFemale -0.1889 0.1820 -1.0376 0.2995 -0.5458 0.1680
educationHS graduate 0.2508 0.4297 0.5836 0.5595 -0.5917 1.0933
educationSome college 0.3463 0.4324 0.8009 0.4233 -0.5014 1.1940
educationCollege graduate 0.3336 0.4357 0.7657 0.4439 -0.5206 1.1878
exerciseYes -1.2866 0.2374 -5.4199 0.0000 -1.7520 -0.8212
gen_healthVery good 0.4373 0.2453 1.7824 0.0747 -0.0437 0.9183
gen_healthGood 1.5913 0.2651 6.0022 0.0000 1.0716 2.1111
gen_healthFair 7.0176 0.3682 19.0586 0.0000 6.2957 7.7394
gen_healthPoor 20.4374 0.5469 37.3722 0.0000 19.3653 21.5095
income_cat -0.1817 0.0503 -3.6092 0.0003 -0.2803 -0.0830
bmi 0.0130 0.0145 0.8997 0.3683 -0.0153 0.0414
glance(mod_max) |>
  dplyr::select(r.squared, adj.r.squared, sigma, AIC, BIC, df.residual) |>
  mutate(across(everything(), \(x) round(x, 3))) |>
  kable(caption = "Maximum Model: Fit Statistics") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Maximum Model: Fit Statistics
r.squared adj.r.squared sigma AIC BIC df.residual
0.386 0.384 6.321 32645.79 32750.06 4985

1b. (5 pts) Now fit a “minimal” model using only menthlth_days and age. Report the same four criteria. How do the two models compare?

The maximum model has a higher R^2 (0.386) and adjusted R^2 (0.384), showing a better fit. The AIC and BIC are also lower showing the improved model performance is greater. The minimal model a lower R^2 (0.115) and adjusted R^2 (0.115), and this model explains less variability in physically unhealthy days. The comparison between the two models shows that additional predictors improve model fit.

mod_min <- lm(physhlth_days ~ menthlth_days + age, data = brfss_ms)

tidy(mod_min, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Minimal Model: menthlth_days + age",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Minimal Model: menthlth_days + age
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) -1.6983 0.3641 -4.6647 0 -2.4121 -0.9846
menthlth_days 0.3237 0.0135 24.0149 0 0.2973 0.3501
age 0.0716 0.0062 11.4763 0 0.0594 0.0838
glance(mod_min) |>
  dplyr::select(r.squared, adj.r.squared, sigma, AIC, BIC, df.residual) |>
  mutate(across(everything(), \(x) round(x, 3))) |>
  kable(caption = "Minimal Model: Fit Statistics") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Minimal Model: Fit Statistics
r.squared adj.r.squared sigma AIC BIC df.residual
0.115 0.115 7.58 34449.78 34475.85 4997

1c. (5 pts) Explain why \(R^2\) is a poor criterion for comparing these two models. What makes Adjusted \(R^2\), AIC, and BIC better choices?

R^2 is a poor criterion for comparing these models because it always increases when you add predictors even when they are not meaningful and improve the model. Adjusted R^2, AIC, and BIC are better choices because they penalize model complexity. Adjusted R^2 adjusts for the number of predictors while AIC and BIC balance the model fit and penalize additional predictors.

Task 2: Best Subsets Regression (20 points)

2a. (5 pts) Use leaps::regsubsets() to perform best subsets regression with nvmax = 15. Create a plot of Adjusted \(R^2\) vs. number of variables. At what model size does Adjusted \(R^2\) plateau?

Adjusted R^2 increases steadily at first and then begins to plateau around 9-10 variables. This shows us that adding more predictors beyond 9-10 variables provides minimal improvement in the model fit.

2b. (5 pts) Create a plot of BIC vs. number of variables. Which model size minimizes BIC?

BIC begins to plateau around 8 variables.

best_subsets <- regsubsets(
  physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education +
    exercise + gen_health + income_cat + bmi,
  data = brfss_ms,
  nvmax = 15,      # maximum number of variables to consider
  method = "exhaustive"
)

best_summary <- summary(best_subsets)

subset_metrics <- tibble(
  p = 1:length(best_summary$adjr2),
  `Adj. R²` = best_summary$adjr2,
  BIC = best_summary$bic,
  Cp = best_summary$cp
)

p1 <- ggplot(subset_metrics, aes(x = p, y = `Adj. R²`)) +
  geom_line(linewidth = 1, color = "steelblue") +
  geom_point(size = 3, color = "steelblue") +
  geom_vline(xintercept = which.max(best_summary$adjr2),
             linetype = "dashed", color = "tomato") +
  labs(title = "Adjusted R² by Model Size", x = "Number of Variables", y = "Adjusted R²") +
  theme_minimal(base_size = 12)

p2 <- ggplot(subset_metrics, aes(x = p, y = BIC)) +
  geom_line(linewidth = 1, color = "steelblue") +
  geom_point(size = 3, color = "steelblue") +
  geom_vline(xintercept = which.min(best_summary$bic),
             linetype = "dashed", color = "tomato") +
  labs(title = "BIC by Model Size", x = "Number of Variables", y = "BIC") +
  theme_minimal(base_size = 12)

gridExtra::grid.arrange(p1, p2, ncol = 2)

2c. (5 pts) Identify the variables included in the BIC-best model. Fit this model explicitly using lm() and report its coefficients.

best_bic_idx <- which.min(best_summary$bic)
best_vars <- names(which(best_summary$which[best_bic_idx, -1]))
cat("\nVariables in BIC-best model:\n")
## 
## Variables in BIC-best model:
cat(paste(" ", best_vars), sep = "\n")
##   menthlth_days
##   sleep_hrs
##   age
##   exerciseYes
##   gen_healthGood
##   gen_healthFair
##   gen_healthPoor
##   income_cat
bic_model <- lm(physhlth_days ~ menthlth_days + sleep_hrs + age +   exercise + gen_health + income_cat, data = brfss_ms)

summary(bic_model)
## 
## Call:
## lm(formula = physhlth_days ~ menthlth_days + sleep_hrs + age + 
##     exercise + gen_health + income_cat, data = brfss_ms)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.5956  -2.3238  -0.9004   0.0081  30.3580 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.18636    0.66634   4.782 1.79e-06 ***
## menthlth_days        0.14608    0.01204  12.135  < 2e-16 ***
## sleep_hrs           -0.19515    0.06720  -2.904  0.00370 ** 
## age                  0.01740    0.00544   3.198  0.00139 ** 
## exerciseYes         -1.28774    0.23360  -5.513 3.71e-08 ***
## gen_healthVery good  0.46171    0.24411   1.891  0.05863 .  
## gen_healthGood       1.63676    0.26000   6.295 3.33e-10 ***
## gen_healthFair       7.07865    0.36164  19.573  < 2e-16 ***
## gen_healthPoor      20.50841    0.54234  37.815  < 2e-16 ***
## income_cat          -0.16570    0.04719  -3.511  0.00045 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.32 on 4990 degrees of freedom
## Multiple R-squared:  0.3857, Adjusted R-squared:  0.3846 
## F-statistic: 348.1 on 9 and 4990 DF,  p-value: < 2.2e-16

2d. (5 pts) Compare the BIC-best model to the Adjusted \(R^2\)-best model. Are they the same? If not, which would you prefer and why?

The BIC model and adjusted R^2 model are not the same. The adjusted R^2 model includes more variables and penalizes complexity, while BIC favors simpler models. I would prefer the BIC model because it reduces the risk of over fitting.


Task 3: Automated Selection Methods (20 points)

3a. (5 pts) Perform backward elimination using step() with AIC as the criterion. Which variables are removed? Which remain?

The variables that remain are menthlth_days, sleep_hrs, age, exercise, income and gen_health. The variables that were removed are physical health, education, sex and bmi.

# Step-by-step backward elimination (manual demonstration)

cat("=== BACKWARD ELIMINATION ===\n\n")
## === BACKWARD ELIMINATION ===
# Step 1: Maximum model
mod_back <- mod_max
cat("Step 1: Maximum model\n")
## Step 1: Maximum model
cat("Variables:", paste(names(coef(mod_back))[-1], collapse = ", "), "\n")
## Variables: menthlth_days, sleep_hrs, age, sexFemale, educationHS graduate, educationSome college, educationCollege graduate, exerciseYes, gen_healthVery good, gen_healthGood, gen_healthFair, gen_healthPoor, income_cat, bmi
# Show p-values for the maximum model
pvals <- tidy(mod_back) |>
  filter(term != "(Intercept)") |>
  arrange(desc(p.value)) |>
  dplyr::select(term, estimate, p.value) |>
  mutate(across(where(is.numeric), \(x) round(x, 4)))

pvals |>
  head(5) |>
  kable(caption = "Maximum Model: Variables Sorted by p-value (Highest First)") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Maximum Model: Variables Sorted by p-value (Highest First)
term estimate p.value
educationHS graduate 0.2508 0.5595
educationCollege graduate 0.3336 0.4439
educationSome college 0.3463 0.4233
bmi 0.0130 0.3683
sexFemale -0.1889 0.2995
mod_backward <- step(mod_max, direction = "backward", trace = 1)
## Start:  AIC=18454.4
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education + 
##     exercise + gen_health + income_cat + bmi
## 
##                 Df Sum of Sq    RSS   AIC
## - education      3        29 199231 18449
## - bmi            1        32 199234 18453
## - sex            1        43 199245 18454
## <none>                       199202 18454
## - sleep_hrs      1       329 199530 18461
## - age            1       434 199636 18463
## - income_cat     1       521 199722 18466
## - exercise       1      1174 200376 18482
## - menthlth_days  1      5898 205100 18598
## - gen_health     4     66437 265639 19886
## 
## Step:  AIC=18449.13
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + exercise + 
##     gen_health + income_cat + bmi
## 
##                 Df Sum of Sq    RSS   AIC
## - bmi            1        32 199262 18448
## - sex            1        40 199270 18448
## <none>                       199231 18449
## - sleep_hrs      1       327 199557 18455
## - age            1       439 199670 18458
## - income_cat     1       520 199751 18460
## - exercise       1      1151 200381 18476
## - menthlth_days  1      5929 205159 18594
## - gen_health     4     66459 265690 19881
## 
## Step:  AIC=18447.92
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + exercise + 
##     gen_health + income_cat
## 
##                 Df Sum of Sq    RSS   AIC
## - sex            1        42 199305 18447
## <none>                       199262 18448
## - sleep_hrs      1       334 199596 18454
## - age            1       427 199690 18457
## - income_cat     1       514 199776 18459
## - exercise       1      1222 200484 18477
## - menthlth_days  1      5921 205184 18592
## - gen_health     4     67347 266609 19896
## 
## Step:  AIC=18446.98
## physhlth_days ~ menthlth_days + sleep_hrs + age + exercise + 
##     gen_health + income_cat
## 
##                 Df Sum of Sq    RSS   AIC
## <none>                       199305 18447
## - sleep_hrs      1       337 199641 18453
## - age            1       409 199713 18455
## - income_cat     1       492 199797 18457
## - exercise       1      1214 200518 18475
## - menthlth_days  1      5882 205186 18590
## - gen_health     4     67980 267285 19906
tidy(mod_backward, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Backward Elimination Result (AIC-based)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Backward Elimination Result (AIC-based)
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 3.1864 0.6663 4.7819 0.0000 1.8800 4.4927
menthlth_days 0.1461 0.0120 12.1352 0.0000 0.1225 0.1697
sleep_hrs -0.1951 0.0672 -2.9038 0.0037 -0.3269 -0.0634
age 0.0174 0.0054 3.1981 0.0014 0.0067 0.0281
exerciseYes -1.2877 0.2336 -5.5127 0.0000 -1.7457 -0.8298
gen_healthVery good 0.4617 0.2441 1.8914 0.0586 -0.0169 0.9403
gen_healthGood 1.6368 0.2600 6.2953 0.0000 1.1271 2.1465
gen_healthFair 7.0787 0.3616 19.5735 0.0000 6.3697 7.7876
gen_healthPoor 20.5084 0.5423 37.8149 0.0000 19.4452 21.5716
income_cat -0.1657 0.0472 -3.5115 0.0004 -0.2582 -0.0732

3b. (5 pts) Perform forward selection using step(). Does it arrive at the same model as backward elimination?

Yes the model does arrive to the same model as backward elimination.

# Automated forward selection using AIC
mod_null <- lm(physhlth_days ~ 1, data = brfss_ms)

mod_forward <- step(mod_null,
                    scope = list(lower = mod_null, upper = mod_max),
                    direction = "forward", trace = 1)
## Start:  AIC=20865.24
## physhlth_days ~ 1
## 
##                 Df Sum of Sq    RSS   AIC
## + gen_health     4    115918 208518 18663
## + menthlth_days  1     29743 294693 20387
## + exercise       1     19397 305038 20559
## + income_cat     1     19104 305332 20564
## + education      3      5906 318530 20779
## + age            1      4173 320263 20803
## + bmi            1      4041 320395 20805
## + sleep_hrs      1      3717 320719 20810
## <none>                       324435 20865
## + sex            1         7 324429 20867
## 
## Step:  AIC=18662.93
## physhlth_days ~ gen_health
## 
##                 Df Sum of Sq    RSS   AIC
## + menthlth_days  1    6394.9 202123 18509
## + exercise       1    1652.4 206865 18625
## + income_cat     1    1306.9 207211 18634
## + sleep_hrs      1     756.1 207762 18647
## + bmi            1      91.2 208427 18663
## <none>                       208518 18663
## + sex            1      38.5 208479 18664
## + age            1      32.2 208486 18664
## + education      3     145.0 208373 18666
## 
## Step:  AIC=18509.19
## physhlth_days ~ gen_health + menthlth_days
## 
##              Df Sum of Sq    RSS   AIC
## + exercise    1   1650.52 200472 18470
## + income_cat  1    817.89 201305 18491
## + age         1    464.73 201658 18500
## + sleep_hrs   1    257.79 201865 18505
## + bmi         1     90.51 202032 18509
## <none>                    202123 18509
## + sex         1      3.00 202120 18511
## + education   3    111.58 202011 18512
## 
## Step:  AIC=18470.19
## physhlth_days ~ gen_health + menthlth_days + exercise
## 
##              Df Sum of Sq    RSS   AIC
## + income_cat  1    509.09 199963 18460
## + age         1    333.74 200139 18464
## + sleep_hrs   1    253.06 200219 18466
## <none>                    200472 18470
## + bmi         1     21.21 200451 18472
## + sex         1     10.74 200462 18472
## + education   3     26.94 200445 18476
## 
## Step:  AIC=18459.48
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat
## 
##             Df Sum of Sq    RSS   AIC
## + age        1    321.97 199641 18453
## + sleep_hrs  1    250.25 199713 18455
## <none>                   199963 18460
## + bmi        1     27.98 199935 18461
## + sex        1     27.17 199936 18461
## + education  3     26.66 199937 18465
## 
## Step:  AIC=18453.42
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat + 
##     age
## 
##             Df Sum of Sq    RSS   AIC
## + sleep_hrs  1    336.79 199305 18447
## <none>                   199641 18453
## + sex        1     45.31 199596 18454
## + bmi        1     42.00 199599 18454
## + education  3     22.62 199619 18459
## 
## Step:  AIC=18446.98
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat + 
##     age + sleep_hrs
## 
##             Df Sum of Sq    RSS   AIC
## <none>                   199305 18447
## + sex        1    42.328 199262 18448
## + bmi        1    34.434 199270 18448
## + education  3    24.800 199280 18452
tidy(mod_forward, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Forward Selection Result (AIC-based)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Forward Selection Result (AIC-based)
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 3.1864 0.6663 4.7819 0.0000 1.8800 4.4927
gen_healthVery good 0.4617 0.2441 1.8914 0.0586 -0.0169 0.9403
gen_healthGood 1.6368 0.2600 6.2953 0.0000 1.1271 2.1465
gen_healthFair 7.0787 0.3616 19.5735 0.0000 6.3697 7.7876
gen_healthPoor 20.5084 0.5423 37.8149 0.0000 19.4452 21.5716
menthlth_days 0.1461 0.0120 12.1352 0.0000 0.1225 0.1697
exerciseYes -1.2877 0.2336 -5.5127 0.0000 -1.7457 -0.8298
income_cat -0.1657 0.0472 -3.5115 0.0004 -0.2582 -0.0732
age 0.0174 0.0054 3.1981 0.0014 0.0067 0.0281
sleep_hrs -0.1951 0.0672 -2.9038 0.0037 -0.3269 -0.0634

3c. (5 pts) Compare the backward, forward, and stepwise results in a single table showing the number of variables, Adjusted \(R^2\), AIC, and BIC for each.

mod_stepwise <- step(mod_null,
                     scope = list(lower = mod_null, upper = mod_max),
                     direction = "both", trace = 1)
## Start:  AIC=20865.24
## physhlth_days ~ 1
## 
##                 Df Sum of Sq    RSS   AIC
## + gen_health     4    115918 208518 18663
## + menthlth_days  1     29743 294693 20387
## + exercise       1     19397 305038 20559
## + income_cat     1     19104 305332 20564
## + education      3      5906 318530 20779
## + age            1      4173 320263 20803
## + bmi            1      4041 320395 20805
## + sleep_hrs      1      3717 320719 20810
## <none>                       324435 20865
## + sex            1         7 324429 20867
## 
## Step:  AIC=18662.93
## physhlth_days ~ gen_health
## 
##                 Df Sum of Sq    RSS   AIC
## + menthlth_days  1      6395 202123 18509
## + exercise       1      1652 206865 18625
## + income_cat     1      1307 207211 18634
## + sleep_hrs      1       756 207762 18647
## + bmi            1        91 208427 18663
## <none>                       208518 18663
## + sex            1        38 208479 18664
## + age            1        32 208486 18664
## + education      3       145 208373 18666
## - gen_health     4    115918 324435 20865
## 
## Step:  AIC=18509.19
## physhlth_days ~ gen_health + menthlth_days
## 
##                 Df Sum of Sq    RSS   AIC
## + exercise       1      1651 200472 18470
## + income_cat     1       818 201305 18491
## + age            1       465 201658 18500
## + sleep_hrs      1       258 201865 18505
## + bmi            1        91 202032 18509
## <none>                       202123 18509
## + sex            1         3 202120 18511
## + education      3       112 202011 18512
## - menthlth_days  1      6395 208518 18663
## - gen_health     4     92570 294693 20387
## 
## Step:  AIC=18470.19
## physhlth_days ~ gen_health + menthlth_days + exercise
## 
##                 Df Sum of Sq    RSS   AIC
## + income_cat     1       509 199963 18460
## + age            1       334 200139 18464
## + sleep_hrs      1       253 200219 18466
## <none>                       200472 18470
## + bmi            1        21 200451 18472
## + sex            1        11 200462 18472
## + education      3        27 200445 18476
## - exercise       1      1651 202123 18509
## - menthlth_days  1      6393 206865 18625
## - gen_health     4     78857 279330 20121
## 
## Step:  AIC=18459.48
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat
## 
##                 Df Sum of Sq    RSS   AIC
## + age            1       322 199641 18453
## + sleep_hrs      1       250 199713 18455
## <none>                       199963 18460
## + bmi            1        28 199935 18461
## + sex            1        27 199936 18461
## + education      3        27 199937 18465
## - income_cat     1       509 200472 18470
## - exercise       1      1342 201305 18491
## - menthlth_days  1      5988 205952 18605
## - gen_health     4     72713 272676 20002
## 
## Step:  AIC=18453.42
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat + 
##     age
## 
##                 Df Sum of Sq    RSS   AIC
## + sleep_hrs      1       337 199305 18447
## <none>                       199641 18453
## + sex            1        45 199596 18454
## + bmi            1        42 199599 18454
## + education      3        23 199619 18459
## - age            1       322 199963 18460
## - income_cat     1       497 200139 18464
## - exercise       1      1231 200873 18482
## - menthlth_days  1      6304 205945 18607
## - gen_health     4     68936 268577 19929
## 
## Step:  AIC=18446.98
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat + 
##     age + sleep_hrs
## 
##                 Df Sum of Sq    RSS   AIC
## <none>                       199305 18447
## + sex            1        42 199262 18448
## + bmi            1        34 199270 18448
## + education      3        25 199280 18452
## - sleep_hrs      1       337 199641 18453
## - age            1       409 199713 18455
## - income_cat     1       492 199797 18457
## - exercise       1      1214 200518 18475
## - menthlth_days  1      5882 205186 18590
## - gen_health     4     67980 267285 19906
tidy(mod_stepwise, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Stepwise Selection Result (AIC-based)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Stepwise Selection Result (AIC-based)
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 3.1864 0.6663 4.7819 0.0000 1.8800 4.4927
gen_healthVery good 0.4617 0.2441 1.8914 0.0586 -0.0169 0.9403
gen_healthGood 1.6368 0.2600 6.2953 0.0000 1.1271 2.1465
gen_healthFair 7.0787 0.3616 19.5735 0.0000 6.3697 7.7876
gen_healthPoor 20.5084 0.5423 37.8149 0.0000 19.4452 21.5716
menthlth_days 0.1461 0.0120 12.1352 0.0000 0.1225 0.1697
exerciseYes -1.2877 0.2336 -5.5127 0.0000 -1.7457 -0.8298
income_cat -0.1657 0.0472 -3.5115 0.0004 -0.2582 -0.0732
age 0.0174 0.0054 3.1981 0.0014 0.0067 0.0281
sleep_hrs -0.1951 0.0672 -2.9038 0.0037 -0.3269 -0.0634
method_comparison <- tribble(
  ~Method, ~`Variables selected`, ~`Adj. R²`, ~AIC, ~BIC,
  "Maximum model",
    length(coef(mod_max)) - 1,
    round(glance(mod_max)$adj.r.squared, 4),
    round(AIC(mod_max), 1),
    round(BIC(mod_max), 1),
  "Backward (AIC)",
    length(coef(mod_backward)) - 1,
    round(glance(mod_backward)$adj.r.squared, 4),
    round(AIC(mod_backward), 1),
    round(BIC(mod_backward), 1),
  "Forward (AIC)",
    length(coef(mod_forward)) - 1,
    round(glance(mod_forward)$adj.r.squared, 4),
    round(AIC(mod_forward), 1),
    round(BIC(mod_forward), 1),
  "Stepwise (AIC)",
    length(coef(mod_stepwise)) - 1,
    round(glance(mod_stepwise)$adj.r.squared, 4),
    round(AIC(mod_stepwise), 1),
    round(BIC(mod_stepwise), 1)
)

method_comparison |>
  kable(caption = "Comparison of Variable Selection Methods") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Comparison of Variable Selection Methods
Method Variables selected Adj. R² AIC BIC
Maximum model 14 0.3843 32645.8 32750.1
Backward (AIC) 9 0.3846 32638.4 32710.1
Forward (AIC) 9 0.3846 32638.4 32710.1
Stepwise (AIC) 9 0.3846 32638.4 32710.1

3d. (5 pts) List three reasons why you should not blindly trust the results of automated variable selection. Which of these concerns is most relevant for epidemiological research?

Three reasons why you should not blindly trust the results of automated variale selection:

  1. They ignore the research question.
  2. They ignore subject matter knowledge.
  3. p-values and CIs from the final model are biased.

The most relevant for epidemiological research ignoring subject matter knowledge. Subject matter knowledge is important for identifying confounders and for deciding what should or should not be adjusted for. When relying on automated methods, you may exclude true confounders or adjust for inappropriate variables.


Task 4: Associative Model Building (25 points)

For this task, the exposure is sleep_hrs and the outcome is physhlth_days. You are building an associative model to estimate the effect of sleep on physical health.

4a. (5 pts) Fit the crude model: physhlth_days ~ sleep_hrs. Report the sleep coefficient.

Sleep coefficient is -0.63209.

crude_mod <- lm(physhlth_days ~ sleep_hrs, data = brfss_ms)
summary(crude_mod)
## 
## Call:
## lm(formula = physhlth_days ~ sleep_hrs, data = brfss_ms)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.279 -3.486 -2.854 -1.590 30.938 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.91103    0.59591   13.28  < 2e-16 ***
## sleep_hrs   -0.63209    0.08306   -7.61 3.25e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.011 on 4998 degrees of freedom
## Multiple R-squared:  0.01146,    Adjusted R-squared:  0.01126 
## F-statistic: 57.92 on 1 and 4998 DF,  p-value: 3.245e-14

4b. (10 pts) Fit the maximum associative model: physhlth_days ~ sleep_hrs + [all other covariates]. Note the adjusted sleep coefficient and compute the 10% interval. Then systematically remove each covariate one at a time and determine which are confounders using the 10% rule. Present your results in a summary table.

mod_assoc_max <- lm(
  physhlth_days ~ sleep_hrs + menthlth_days + age + sex +
    education + exercise + income_cat + bmi + gen_health,
  data = brfss_ms
)

b_exposure_max <- coef(mod_assoc_max)["sleep_hrs"]

interval_low <- b_exposure_max - 0.10 * abs(b_exposure_max)
interval_high <- b_exposure_max + 0.10 * abs(b_exposure_max)

cat("Sleep coefficient in maximum model:", round(b_exposure_max, 4), "\n")
## Sleep coefficient in maximum model: -0.193
cat("10% interval: (", round(interval_low, 4), ",", round(interval_high, 4), ")\n\n")
## 10% interval: ( -0.2123 , -0.1737 )
covariates_to_test <- c("menthlth_days", "age", "sex",
                        "education", "exercise", "income_cat", "bmi", "gen_health")

assoc_table <- map_dfr(covariates_to_test, \(cov) {
  
  remaining <- setdiff(covariates_to_test, cov)
  
  form <- as.formula(
    paste("physhlth_days ~ sleep_hrs +", paste(remaining, collapse = " + "))
  )
  
  mod_reduced <- lm(form, data = brfss_ms)
  
  b_reduced <- coef(mod_reduced)["sleep_hrs"]
  
  pct_change <- (b_reduced - b_exposure_max) / abs(b_exposure_max) * 100
  
  tibble(
    `Removed covariate` = cov,
    `Sleep β (max)` = round(b_exposure_max, 4),
    `Sleep β (without)` = round(b_reduced, 4),
    `% Change` = round(pct_change, 1),
    `Within 10%?` = ifelse(abs(pct_change) <= 10, "Yes (drop)", "No (keep)"),
    Confounder = ifelse(abs(pct_change) > 10, "Yes", "No")
  )
})

assoc_table |>
  kable(caption = "Associative Model: Confounder Assessment for Sleep") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
  column_spec(6, bold = TRUE)
Associative Model: Confounder Assessment for Sleep
Removed covariate Sleep β (max) Sleep β (without) % Change Within 10%? Confounder
menthlth_days -0.193 -0.2894 -50.0 No (keep) Yes
age -0.193 -0.1646 14.7 No (keep) Yes
sex -0.193 -0.1937 -0.4 Yes (drop) No
education -0.193 -0.1923 0.3 Yes (drop) No
exercise -0.193 -0.1957 -1.4 Yes (drop) No
income_cat -0.193 -0.1936 -0.3 Yes (drop) No
bmi -0.193 -0.1950 -1.0 Yes (drop) No
gen_health -0.193 -0.3593 -86.2 No (keep) Yes

4c. (5 pts) Fit the final associative model including only sleep and the identified confounders. Report the sleep coefficient and its 95% CI.

The sleep coefficient is -0.2026 ad the CI is [-0.3349, -0.0702].

mod_assoc_max <- lm(
  physhlth_days ~ sleep_hrs + menthlth_days + age + gen_health,
  data = brfss_ms
)
tidy(mod_assoc_max, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Final Associative Model: Effect of Sleep on Physical Health",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Final Associative Model: Effect of Sleep on Physical Health
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 0.8151 0.5615 1.4516 0.1467 -0.2857 1.9158
sleep_hrs -0.2026 0.0675 -3.0003 0.0027 -0.3349 -0.0702
menthlth_days 0.1512 0.0120 12.5637 0.0000 0.1276 0.1748
age 0.0205 0.0054 3.7595 0.0002 0.0098 0.0312
gen_healthVery good 0.5113 0.2451 2.0860 0.0370 0.0308 0.9919
gen_healthGood 1.9151 0.2579 7.4255 0.0000 1.4095 2.4207
gen_healthFair 7.7686 0.3488 22.2693 0.0000 7.0847 8.4524
gen_healthPoor 21.4868 0.5266 40.8018 0.0000 20.4544 22.5192

4d. (5 pts) A reviewer asks: “Why didn’t you just use stepwise selection?” Write a 3–4 sentence response explaining why automated selection is inappropriate for this associative analysis.

I didnt use step wise selection because combines forward and backward procedures by adding variables and then re-evaluating whether previously included variables should be removed While this can improve the model fit, it is not appropriate for this analysis because it selects variables based on statistical criteria rather than their role as confounders. This means that important confounders could be excluded if they are not statistically significant and this could lead to bias.


Task 5: Synthesis (20 points)

5a. (10 pts) You have now built two models for the same data:

  • A predictive model (from Task 2 or 3, the best model by AIC/BIC)
  • An associative model (from Task 4, focused on sleep)

Compare these two models: Do they include the same variables? Is the sleep coefficient similar? Why might they differ?

The predictive and associative model do have different variables because they are trying to achieve different goals. The predictive model focuses on maximizing accuracy and includes as many variables as possible to improve the fit, while the associative model includes only confounders needed. The sleep coefficient differs between the models because the predictive model adjusts for variables that are not true confounders.

5b. (10 pts) Write a 4–5 sentence paragraph for a public health audience describing the results of your associative model. Include:

  • The adjusted effect of sleep on physical health days
  • Which variables needed to be accounted for (confounders)
  • The direction and approximate magnitude of the association
  • A caveat about cross-sectional data

Do not use statistical jargon.

After adjusting for confounding factors, increased sleep was associated with fewer physically unhealthy days. Each additional hour of sleep was linked to about 0.2026 fewer unhealthy days per month. Mental health and age needed to be accounted for because they influenced both sleep and physical health, while other factors did not. This shows that more sleep is related to better physical health. However, because this data is cross-sectional, we cannot determine if more sleep improves health or if healthier people sleep better.