Part 2: In-Class Lab Activity

EPI 553 — Model Selection Lab Due: End of class, March 24, 2026


Instructions

In this lab, you will practice both predictive and associative model selection using the BRFSS 2020 dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.

Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.


Data for the Lab

Use the saved analytic dataset from today’s lecture.

Variable Description Type
physhlth_days Physically unhealthy days in past 30 Continuous (0–30)
menthlth_days Mentally unhealthy days in past 30 Continuous (0–30)
sleep_hrs Sleep hours per night Continuous (1–14)
age Age in years (capped at 80) Continuous
sex Sex (Male/Female) Factor
education Education level (4 categories) Factor
exercise Any physical activity (Yes/No) Factor
gen_health General health status (5 categories) Factor
income_cat Household income (1–8 ordinal) Numeric
bmi Body mass index Continuous

Task 1: Maximum Model and Criteria Comparison (15 points)

1a. (5 pts) Fit the maximum model predicting physhlth_days from all 9 candidate predictors. Report \(R^2\), Adjusted \(R^2\), AIC, and BIC.

mod_1a <- lm(physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education + exercise + gen_health + income_cat + bmi, data = brfss_ms)

glance(mod_1a) |>
  dplyr::select(r.squared, adj.r.squared, sigma, AIC, BIC, df.residual) |>
  mutate(across(everything(), \(x) round(x, 3))) |>
  kable(caption = "Maximum Model") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Maximum Model
r.squared adj.r.squared sigma AIC BIC df.residual
0.386 0.384 6.321 32645.79 32750.06 4985

\(R^2\) = 0.386 Adjusted \(R^2\) = 0.384 AIC = 32645.79 BIC = 32750.06

1b. (5 pts) Now fit a “minimal” model using only menthlth_days and age. Report the same four criteria. How do the two models compare?

mod_1b <- lm(physhlth_days ~ menthlth_days + age, data = brfss_ms)

glance(mod_1b) |>
  dplyr::select(r.squared, adj.r.squared, sigma, AIC, BIC, df.residual) |>
  mutate(across(everything(), \(x) round(x, 3))) |>
  kable(caption = "Minimum Model with Mental Unhealthy Days and Age") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Minimum Model with Mental Unhealthy Days and Age
r.squared adj.r.squared sigma AIC BIC df.residual
0.115 0.115 7.58 34449.78 34475.85 4997

\(R^2\) = 0.115 Adjusted \(R^2\) = 0.115 AIC = 34449.78 BIC = 34475.85

The maximum model explains more variation in physically unhealthy days than the minimum model. This shows that adding the additional predictors improves explanatory power.

1c. (5 pts) Explain why \(R^2\) is a poor criterion for comparing these two models. What makes Adjusted \(R^2\), AIC, and BIC better choices?

R-squared is not useful for comparing the maximum and minimum models since it always increases, or stays the same when predictors are added. The larger model will always look better when looking at R-squared. Adjusted R-squared corrects this by penalizing unnecessary predictors, only increasing when the variable improves the model. AIC balances the model fit with model complexity. BIC does this as well but more strongly.


Task 2: Best Subsets Regression (20 points)

2a. (5 pts) Use leaps::regsubsets() to perform best subsets regression with nvmax = 15. Create a plot of Adjusted \(R^2\) vs. number of variables. At what model size does Adjusted \(R^2\) plateau?

subset_2a <- regsubsets(physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education + exercise + gen_health + income_cat + bmi, 
                        data = brfss_ms,
                        nvmax = 15,
                        method = "exhaustive")
summary_2a <- summary(subset_2a)

subset_2a_metrics <- tibble(
  p = 1:length(summary_2a$adjr2),
  `Adj. R²` = summary_2a$adjr2,
  BIC = summary_2a$bic,
  Cp = summary_2a$cp
)

ggplot(subset_2a_metrics, aes(x = p, y = `Adj. R²`)) +
  geom_line(linewidth = 1, color = "steelblue") +
  geom_point(size = 3, color = "steelblue") +
  geom_vline(xintercept = which.max(summary_2a$adjr2), linetype = "dashed", color = "tomato") +
  labs(title =  "Adjusted R² by Model Size", x = "Number of Variables") +
  theme_minimal(base_size = 12)

Adjusted \(R^2\) plateaus around 10 variables. the Adjusted \(R^2\) does not get any better when adding more than 10 variables.

2b. (5 pts) Create a plot of BIC vs. number of variables. Which model size minimizes BIC?

ggplot(subset_2a_metrics, aes(x = p, y = BIC)) +
  geom_line(linewidth = 1, color = "steelblue") +
  geom_point(size = 3, color = "steelblue") +
  geom_vline(xintercept = which.min(summary_2a$bic), linetype = "dashed", color = "tomato") +
  labs(title = "BIC by Model Size", x = "Number of Variables", y = "BIC") +
  theme_minimal(base_size = 12)

The model size that minimizes BIC is 8 variables.

2c. (5 pts) Identify the variables included in the BIC-best model. Fit this model explicitly using lm() and report its coefficients.

best_bic <- which.min(summary_2a$bic)
vars_2c <- names(which(summary_2a$which[best_bic, -1]))
cat("\nVariables in BIC-best model:\n")
## 
## Variables in BIC-best model:
cat(paste(" ", vars_2c), sep = "\n")
##   menthlth_days
##   sleep_hrs
##   age
##   exerciseYes
##   gen_healthGood
##   gen_healthFair
##   gen_healthPoor
##   income_cat
vars_clean <- unique(sapply(vars_2c, function(v) {
  base <- names(brfss_ms)[startsWith(v, names(brfss_ms))]
  if (length(base) == 0) v else base
}))
vars_clean
## [1] "menthlth_days" "sleep_hrs"     "age"           "exercise"     
## [5] "gen_health"    "income_cat"
bic_formula <- as.formula(
  paste("physhlth_days ~", paste(vars_clean, collapse = " + "))
)

mod_2c <- lm(bic_formula, data = brfss_ms)
summary(mod_2c)
## 
## Call:
## lm(formula = bic_formula, data = brfss_ms)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.5956  -2.3238  -0.9004   0.0081  30.3580 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.18636    0.66634   4.782 1.79e-06 ***
## menthlth_days        0.14608    0.01204  12.135  < 2e-16 ***
## sleep_hrs           -0.19515    0.06720  -2.904  0.00370 ** 
## age                  0.01740    0.00544   3.198  0.00139 ** 
## exerciseYes         -1.28774    0.23360  -5.513 3.71e-08 ***
## gen_healthVery good  0.46171    0.24411   1.891  0.05863 .  
## gen_healthGood       1.63676    0.26000   6.295 3.33e-10 ***
## gen_healthFair       7.07865    0.36164  19.573  < 2e-16 ***
## gen_healthPoor      20.50841    0.54234  37.815  < 2e-16 ***
## income_cat          -0.16570    0.04719  -3.511  0.00045 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.32 on 4990 degrees of freedom
## Multiple R-squared:  0.3857, Adjusted R-squared:  0.3846 
## F-statistic: 348.1 on 9 and 4990 DF,  p-value: < 2.2e-16

2d. (5 pts) Compare the BIC-best model to the Adjusted \(R^2\)-best model. Are they the same? If not, which would you prefer and why?

the BIC-best model and the Adjusted \(R^2\)-best model are not the same. BIC applies a stronger penalty for adding variables. It tends to favor more concise model. I would prefer the BIC-best model as it is more effective.


Task 3: Automated Selection Methods (20 points)

3a. (5 pts) Perform backward elimination using step() with AIC as the criterion. Which variables are removed? Which remain?

backward_3a <- step(mod_1a, direction = "backward", trace =1)
## Start:  AIC=18454.4
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education + 
##     exercise + gen_health + income_cat + bmi
## 
##                 Df Sum of Sq    RSS   AIC
## - education      3        29 199231 18449
## - bmi            1        32 199234 18453
## - sex            1        43 199245 18454
## <none>                       199202 18454
## - sleep_hrs      1       329 199530 18461
## - age            1       434 199636 18463
## - income_cat     1       521 199722 18466
## - exercise       1      1174 200376 18482
## - menthlth_days  1      5898 205100 18598
## - gen_health     4     66437 265639 19886
## 
## Step:  AIC=18449.13
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + exercise + 
##     gen_health + income_cat + bmi
## 
##                 Df Sum of Sq    RSS   AIC
## - bmi            1        32 199262 18448
## - sex            1        40 199270 18448
## <none>                       199231 18449
## - sleep_hrs      1       327 199557 18455
## - age            1       439 199670 18458
## - income_cat     1       520 199751 18460
## - exercise       1      1151 200381 18476
## - menthlth_days  1      5929 205159 18594
## - gen_health     4     66459 265690 19881
## 
## Step:  AIC=18447.92
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + exercise + 
##     gen_health + income_cat
## 
##                 Df Sum of Sq    RSS   AIC
## - sex            1        42 199305 18447
## <none>                       199262 18448
## - sleep_hrs      1       334 199596 18454
## - age            1       427 199690 18457
## - income_cat     1       514 199776 18459
## - exercise       1      1222 200484 18477
## - menthlth_days  1      5921 205184 18592
## - gen_health     4     67347 266609 19896
## 
## Step:  AIC=18446.98
## physhlth_days ~ menthlth_days + sleep_hrs + age + exercise + 
##     gen_health + income_cat
## 
##                 Df Sum of Sq    RSS   AIC
## <none>                       199305 18447
## - sleep_hrs      1       337 199641 18453
## - age            1       409 199713 18455
## - income_cat     1       492 199797 18457
## - exercise       1      1214 200518 18475
## - menthlth_days  1      5882 205186 18590
## - gen_health     4     67980 267285 19906

Education, BMI, and sex were removed. Variables that remained were menthlth_days, sleep_hrs, age, exercise, gen_health and income_cat.

3b. (5 pts) Perform forward selection using step(). Does it arrive at the same model as backward elimination?

mod_3b <- lm(physhlth_days ~ 1, data  = brfss_ms)

forward_mod_3b <- step(mod_3b, scope = list(lower = mod_3b, upper = mod_1b), direction = "forward", trace =1)
## Start:  AIC=20865.24
## physhlth_days ~ 1
## 
##                 Df Sum of Sq    RSS   AIC
## + menthlth_days  1   29742.9 294693 20387
## + age            1    4172.9 320263 20803
## <none>                       324435 20865
## 
## Step:  AIC=20386.47
## physhlth_days ~ menthlth_days
## 
##        Df Sum of Sq    RSS   AIC
## + age   1    7567.7 287125 20258
## <none>              294693 20387
## 
## Step:  AIC=20258.4
## physhlth_days ~ menthlth_days + age

Yes. Forward selection arrives at the same final model as backward elimination. Both procedures select menthlth_days, sleep_hrs, age, exercise, gen_health, and income_cat.

3c. (5 pts) Compare the backward, forward, and stepwise results in a single table showing the number of variables, Adjusted \(R^2\), AIC, and BIC for each.

stepwise_3c <- step(mod_3b, scope = list(lower = mod_3b, upper = mod_1a), direction = "both",trace =1)
## Start:  AIC=20865.24
## physhlth_days ~ 1
## 
##                 Df Sum of Sq    RSS   AIC
## + gen_health     4    115918 208518 18663
## + menthlth_days  1     29743 294693 20387
## + exercise       1     19397 305038 20559
## + income_cat     1     19104 305332 20564
## + education      3      5906 318530 20779
## + age            1      4173 320263 20803
## + bmi            1      4041 320395 20805
## + sleep_hrs      1      3717 320719 20810
## <none>                       324435 20865
## + sex            1         7 324429 20867
## 
## Step:  AIC=18662.93
## physhlth_days ~ gen_health
## 
##                 Df Sum of Sq    RSS   AIC
## + menthlth_days  1      6395 202123 18509
## + exercise       1      1652 206865 18625
## + income_cat     1      1307 207211 18634
## + sleep_hrs      1       756 207762 18647
## + bmi            1        91 208427 18663
## <none>                       208518 18663
## + sex            1        38 208479 18664
## + age            1        32 208486 18664
## + education      3       145 208373 18666
## - gen_health     4    115918 324435 20865
## 
## Step:  AIC=18509.19
## physhlth_days ~ gen_health + menthlth_days
## 
##                 Df Sum of Sq    RSS   AIC
## + exercise       1      1651 200472 18470
## + income_cat     1       818 201305 18491
## + age            1       465 201658 18500
## + sleep_hrs      1       258 201865 18505
## + bmi            1        91 202032 18509
## <none>                       202123 18509
## + sex            1         3 202120 18511
## + education      3       112 202011 18512
## - menthlth_days  1      6395 208518 18663
## - gen_health     4     92570 294693 20387
## 
## Step:  AIC=18470.19
## physhlth_days ~ gen_health + menthlth_days + exercise
## 
##                 Df Sum of Sq    RSS   AIC
## + income_cat     1       509 199963 18460
## + age            1       334 200139 18464
## + sleep_hrs      1       253 200219 18466
## <none>                       200472 18470
## + bmi            1        21 200451 18472
## + sex            1        11 200462 18472
## + education      3        27 200445 18476
## - exercise       1      1651 202123 18509
## - menthlth_days  1      6393 206865 18625
## - gen_health     4     78857 279330 20121
## 
## Step:  AIC=18459.48
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat
## 
##                 Df Sum of Sq    RSS   AIC
## + age            1       322 199641 18453
## + sleep_hrs      1       250 199713 18455
## <none>                       199963 18460
## + bmi            1        28 199935 18461
## + sex            1        27 199936 18461
## + education      3        27 199937 18465
## - income_cat     1       509 200472 18470
## - exercise       1      1342 201305 18491
## - menthlth_days  1      5988 205952 18605
## - gen_health     4     72713 272676 20002
## 
## Step:  AIC=18453.42
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat + 
##     age
## 
##                 Df Sum of Sq    RSS   AIC
## + sleep_hrs      1       337 199305 18447
## <none>                       199641 18453
## + sex            1        45 199596 18454
## + bmi            1        42 199599 18454
## + education      3        23 199619 18459
## - age            1       322 199963 18460
## - income_cat     1       497 200139 18464
## - exercise       1      1231 200873 18482
## - menthlth_days  1      6304 205945 18607
## - gen_health     4     68936 268577 19929
## 
## Step:  AIC=18446.98
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat + 
##     age + sleep_hrs
## 
##                 Df Sum of Sq    RSS   AIC
## <none>                       199305 18447
## + sex            1        42 199262 18448
## + bmi            1        34 199270 18448
## + education      3        25 199280 18452
## - sleep_hrs      1       337 199641 18453
## - age            1       409 199713 18455
## - income_cat     1       492 199797 18457
## - exercise       1      1214 200518 18475
## - menthlth_days  1      5882 205186 18590
## - gen_health     4     67980 267285 19906
comparison_3c <- tribble(
  ~Method, ~`Variables selected`, ~`Adj. R²`, ~AIC, ~BIC,
  "Maximum model",
    length(coef(mod_1a)) - 1,
    round(glance(mod_1a)$adj.r.squared, 4),
    round(AIC(mod_1a), 1),
    round(BIC(mod_1a), 1),
  "Backward (AIC)",
    length(coef(backward_3a)) - 1,
    round(glance(backward_3a)$adj.r.squared, 4),
    round(AIC(backward_3a), 1),
    round(BIC(backward_3a), 1),
  "Forward (AIC)",
    length(coef(forward_mod_3b)) - 1,
    round(glance(forward_mod_3b)$adj.r.squared, 4),
    round(AIC(forward_mod_3b), 1),
    round(BIC(forward_mod_3b), 1),
  "Stepwise (AIC)",
    length(coef(stepwise_3c)) - 1,
    round(glance(stepwise_3c)$adj.r.squared, 4),
    round(AIC(stepwise_3c), 1),
    round(BIC(stepwise_3c), 1)
)

comparison_3c |>
  kable(caption = "Comparison of Variable Selection Methods") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Comparison of Variable Selection Methods
Method Variables selected Adj. R² AIC BIC
Maximum model 14 0.3843 32645.8 32750.1
Backward (AIC) 9 0.3846 32638.4 32710.1
Forward (AIC) 2 0.1146 34449.8 34475.8
Stepwise (AIC) 9 0.3846 32638.4 32710.1

3d. (5 pts) List three reasons why you should not blindly trust the results of automated variable selection. Which of these concerns is most relevant for epidemiological research?

Automated variable selection ignore the research question, inflate Type I error, and p-values and CIs in the final model are biased. Ignore the research question is the biggest concern.

Task 4: Associative Model Building (25 points)

For this task, the exposure is sleep_hrs and the outcome is physhlth_days. You are building an associative model to estimate the effect of sleep on physical health.

4a. (5 pts) Fit the crude model: physhlth_days ~ sleep_hrs. Report the sleep coefficient.

mod_4a <- lm(physhlth_days ~ sleep_hrs, data=brfss_ms)
summary(mod_4a)
## 
## Call:
## lm(formula = physhlth_days ~ sleep_hrs, data = brfss_ms)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.279 -3.486 -2.854 -1.590 30.938 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.91103    0.59591   13.28  < 2e-16 ***
## sleep_hrs   -0.63209    0.08306   -7.61 3.25e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.011 on 4998 degrees of freedom
## Multiple R-squared:  0.01146,    Adjusted R-squared:  0.01126 
## F-statistic: 57.92 on 1 and 4998 DF,  p-value: 3.245e-14

The sleep coefficient is -0.63209.

4b. (10 pts) Fit the maximum associative model: physhlth_days ~ sleep_hrs + [all other covariates]. Note the adjusted sleep coefficient and compute the 10% interval. Then systematically remove each covariate one at a time and determine which are confounders using the 10% rule. Present your results in a summary table.

mod_4b <- lm(physhlth_days ~ sleep_hrs + menthlth_days + age + sex + education + exercise + income_cat + bmi, data = brfss_ms)

b_sleep_max <- coef(mod_4b) ["sleep_hrs"]
low_interval <- b_sleep_max - 0.10 * abs(b_sleep_max)
high_interval <- b_sleep_max + 0.10 * abs(b_sleep_max)

cat("Adjusted sleep coefficient:", round(b_sleep_max, 4), "\n")
## Adjusted sleep coefficient: -0.3593
cat("10% interval:", round(low_interval, 4), "to", round(high_interval, 4), "\n")
## 10% interval: -0.3952 to -0.3233
covariates_4b <- c("menthlth_days", "age", "sex",
                        "education","income_cat", "bmi", "exercise")

table_4b <- map_dfr(covariates_4b, \(cov) {
  remaining_4b <- setdiff(covariates_4b, cov)
  formula <- as.formula(paste("physhlth_days~ sleep_hrs +", paste(remaining_4b, collapse = "+")))
  reduced_mod <- lm(formula, data = brfss_ms)
  reduced_b <- coef(reduced_mod) ["sleep_hrs"]
  percent_change <- (reduced_b - b_sleep_max) / abs(b_sleep_max) * 100
  
  tibble(
    `Removed covariate` = cov,
    `Sleep β (max)` = round(b_sleep_max, 4),
    `Sleep β (without)` = round(reduced_b, 4),
    `% Change` = round(percent_change, 1),
    `Within 10%?` = ifelse(abs(percent_change) <= 10, "Yes (drop)", "No (keep)"),
    Confounder = ifelse(abs(percent_change) > 10, "Yes", "No")
  )
})

table_4b |>
  kable(caption = "Associative Model") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
  column_spec(6, bold = TRUE)
Associative Model
Removed covariate Sleep β (max) Sleep β (without) % Change Within 10%? Confounder
menthlth_days -0.3593 -0.5804 -61.6 No (keep) Yes
age -0.3593 -0.2733 23.9 No (keep) Yes
sex -0.3593 -0.3633 -1.1 Yes (drop) No
education -0.3593 -0.3611 -0.5 Yes (drop) No
income_cat -0.3593 -0.3723 -3.6 Yes (drop) No
bmi -0.3593 -0.3738 -4.0 Yes (drop) No
exercise -0.3593 -0.3779 -5.2 Yes (drop) No

4c. (5 pts) Fit the final associative model including only sleep and the identified confounders. Report the sleep coefficient and its 95% CI.

mod_4c <- lm(physhlth_days ~ sleep_hrs + menthlth_days + age, data = brfss_ms)

summary(mod_4c)
## 
## Call:
## lm(formula = physhlth_days ~ sleep_hrs + menthlth_days + age, 
##     data = brfss_ms)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.2384  -3.3681  -2.0898  -0.2769  29.9556 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.28702    0.64571   1.993   0.0463 *  
## sleep_hrs     -0.44730    0.08001  -5.590 2.39e-08 ***
## menthlth_days  0.31185    0.01360  22.925  < 2e-16 ***
## age            0.07554    0.00626  12.066  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.557 on 4996 degrees of freedom
## Multiple R-squared:  0.1205, Adjusted R-squared:   0.12 
## F-statistic: 228.2 on 3 and 4996 DF,  p-value: < 2.2e-16
confint(mod_4c)["sleep_hrs", ]
##      2.5 %     97.5 % 
## -0.6041587 -0.2904371

The sleep coefficient is -0.44730. The 95% Ci is -0.604 to -0.290.

4d. (5 pts) A reviewer asks: “Why didn’t you just use stepwise selection?” Write a 3–4 sentence response explaining why automated selection is inappropriate for this associative analysis.

Automated selection is inappropriate for this analysis because it chooses variables based on statistical criteria rather than if they are truly confounders. This can lead to dropping important confounders from the analysis. Automated analysis also inflates Type I error and produces biased p-values.

Task 5: Synthesis (20 points)

5a. (10 pts) You have now built two models for the same data:

  • A predictive model (from Task 2 or 3, the best model by AIC/BIC)
  • An associative model (from Task 4, focused on sleep)

Compare these two models: Do they include the same variables? Is the sleep coefficient similar? Why might they differ?

The two models do not include the same variables. The associative model only includes the exposure and true confounders to obtain an unbiased estimate of sleep. The predictive model selects variables improving overall prediction accuracy. The sleep coefficient may differ since the associative model isolates the relationship between sleep and physical health.

5b. (10 pts) Write a 4–5 sentence paragraph for a public health audience describing the results of your associative model. Include:

  • The adjusted effect of sleep on physical health days
  • Which variables needed to be accounted for (confounders)
  • The direction and approximate magnitude of the association
  • A caveat about cross-sectional data

Do not use statistical jargon.

In this analysis, people who slept more hours per night tended to report fewer days of poor physical health. After accounting for differences in the confounders age and mentally unhealthy days. Each additional hour of sleep was linked to roughly half a day fewer of poor physical health per month. The association was negative, meaning more sleep was related to better physical health. Since the data was collected at one point in time, we cannot say whether more sleep improves health.


End of Lab Activity