Part 2: In-Class Lab Activity

EPI 553 — Model Selection Lab Due: End of class, March 24, 2026

Instructions

In this lab, you will practice both predictive and associative model selection using the BRFSS 2020 dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.

Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.

Data for the Lab

Use the saved analytic dataset from today’s lecture.

Variable	Description	Type
`physhlth_days`	Physically unhealthy days in past 30	Continuous (0–30)
`menthlth_days`	Mentally unhealthy days in past 30	Continuous (0–30)
`sleep_hrs`	Sleep hours per night	Continuous (1–14)
`age`	Age in years (capped at 80)	Continuous
`sex`	Sex (Male/Female)	Factor
`education`	Education level (4 categories)	Factor
`exercise`	Any physical activity (Yes/No)	Factor
`gen_health`	General health status (5 categories)	Factor
`income_cat`	Household income (1–8 ordinal)	Numeric
`bmi`	Body mass index	Continuous

library(tidyverse)
library(haven)
library(janitor)
library(broom)
library(knitr)
library(kableExtra)
library(gtsummary)
library(car)
library(leaps)
library(MASS)

brfss_ms <- readRDS("C:/Users/joshm/Documents/UAlbany/Spring 2026/EPI 553/Labs/brfss_ms_2020.rds")

Task 1: Maximum Model and Criteria Comparison (15 points)

1a. (5 pts) Fit the maximum model predicting physhlth_days from all 9 candidate predictors. Report \(R^2\), Adjusted \(R^2\), AIC, and BIC.

mod_max <- lm(physhlth_days ~ menthlth_days + sleep_hrs + age + sex +
                education + exercise + gen_health + income_cat + bmi,
              data = brfss_ms)

tidy(mod_max, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Maximum Model: All Candidate Predictors",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Maximum Model: All Candidate Predictors
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	2.6902	0.8556	3.1441	0.0017	1.0128	4.3676
menthlth_days	0.1472	0.0121	12.1488	0.0000	0.1235	0.1710
sleep_hrs	-0.1930	0.0673	-2.8679	0.0041	-0.3249	-0.0611
age	0.0180	0.0055	3.2969	0.0010	0.0073	0.0288
sexFemale	-0.1889	0.1820	-1.0376	0.2995	-0.5458	0.1680
educationHS graduate	0.2508	0.4297	0.5836	0.5595	-0.5917	1.0933
educationSome college	0.3463	0.4324	0.8009	0.4233	-0.5014	1.1940
educationCollege graduate	0.3336	0.4357	0.7657	0.4439	-0.5206	1.1878
exerciseYes	-1.2866	0.2374	-5.4199	0.0000	-1.7520	-0.8212
gen_healthVery good	0.4373	0.2453	1.7824	0.0747	-0.0437	0.9183
gen_healthGood	1.5913	0.2651	6.0022	0.0000	1.0716	2.1111
gen_healthFair	7.0176	0.3682	19.0586	0.0000	6.2957	7.7394
gen_healthPoor	20.4374	0.5469	37.3722	0.0000	19.3653	21.5095
income_cat	-0.1817	0.0503	-3.6092	0.0003	-0.2803	-0.0830
bmi	0.0130	0.0145	0.8997	0.3683	-0.0153	0.0414

glance(mod_max) |>
  dplyr::select(r.squared, adj.r.squared, sigma, AIC, BIC, df.residual) |>
  mutate(across(everything(), \(x) round(x, 3))) |>
  kable(caption = "Maximum Model: Fit Statistics") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Maximum Model: Fit Statistics
r.squared	adj.r.squared	sigma	AIC	BIC	df.residual
0.386	0.384	6.321	32645.79	32750.06	4985

1b. (5 pts) Now fit a “minimal” model using only menthlth_days and age. Report the same four criteria. How do the two models compare?

mod_min <- lm(physhlth_days ~ menthlth_days + age, data = brfss_ms)

tidy(mod_min, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Minimum Model: Menthlth_days & Age",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Minimum Model: Menthlth_days & Age
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	-1.6983	0.3641	-4.6647	0	-2.4121	-0.9846
menthlth_days	0.3237	0.0135	24.0149	0	0.2973	0.3501
age	0.0716	0.0062	11.4763	0	0.0594	0.0838

glance(mod_min) |>
  dplyr::select(r.squared, adj.r.squared, sigma, AIC, BIC, df.residual) |>
  mutate(across(everything(), \(x) round(x, 3))) |>
  kable(caption = "Minimum Model: Fit Statistics") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Minimum Model: Fit Statistics
r.squared	adj.r.squared	sigma	AIC	BIC	df.residual
0.115	0.115	7.58	34449.78	34475.85	4997

anova(mod_min, mod_max) |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Partial F-test: Minimal Model vs. Maximum Model") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Partial F-test: Minimal Model vs. Maximum Model
term	df.residual	rss	df	sumsq	statistic	p.value
physhlth_days ~ menthlth_days + age	4997	287124.8	NA	NA	NA	NA
physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education + exercise + gen_health + income_cat + bmi	4985	199201.8	12	87923	183.3551	0

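The maximum model fits better on every criterion: it has a higher \(R^2\) (0.386 vs. 0.115) and Adjusted \(R^2\) (0.384 vs. 0.115), and a lower sigma (6.321 vs. 7.58), AIC (32645.79 vs. 34449.78), and BIC (32750.06 vs. 34475.85). The partial F-test (F = 183.36 on 12 and 4985 df, p < 0.001) indicates that the 12 additional coefficients jointly improve the fit.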
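The F statistic in the partial F-test table can be reproduced by hand from the two residual sums of squares. The sketch below plugs in the rounded values printed in that table, so it agrees with the reported statistic only up to rounding; the variable names are illustrative and not part of the lab code.

```r
# Values copied from the partial F-test table (rounded as printed)
rss_min <- 287124.8   # residual SS, minimal model
rss_max <- 199201.8   # residual SS, maximum model
df_diff <- 12         # extra coefficients in the maximum model
df_res  <- 4985       # residual df of the maximum model

# F = [(RSS_reduced - RSS_full) / df_diff] / [RSS_full / df_res]
f_stat <- ((rss_min - rss_max) / df_diff) / (rss_max / df_res)
p_val  <- pf(f_stat, df_diff, df_res, lower.tail = FALSE)

round(f_stat, 2)  # about 183.36, in line with the table's 183.3551
```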

1c. (5 pts) Explain why \(R^2\) is a poor criterion for comparing these two models. What makes Adjusted \(R^2\), AIC, and BIC better choices?

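\(R^2\) is a poor criterion because it never decreases when a predictor is added, even a useless one, so it always favors the larger model. Adjusted \(R^2\), AIC, and BIC each include a penalty for the number of parameters, so they improve only when an added variable contributes enough to outweigh that penalty. This makes them fairer for comparing models of different sizes, such as the 2-predictor and 9-predictor models here.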
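A small simulation can illustrate the point. This is a sketch on the built-in mtcars data rather than the BRFSS file, and the `noise` column is a made-up predictor with no real relationship to the outcome.

```r
set.seed(553)
dat <- mtcars
dat$noise <- rnorm(nrow(dat))   # pure noise, unrelated to mpg by construction

fit_small <- lm(mpg ~ wt + hp, data = dat)
fit_big   <- lm(mpg ~ wt + hp + noise, data = dat)

# R-squared cannot go down when a predictor is added ...
r2 <- c(small = summary(fit_small)$r.squared,
        big   = summary(fit_big)$r.squared)

# ... while adjusted R-squared, AIC, and BIC each charge a price per parameter
penalized <- rbind(
  adj_r2 = c(summary(fit_small)$adj.r.squared, summary(fit_big)$adj.r.squared),
  AIC    = c(AIC(fit_small), AIC(fit_big)),
  BIC    = c(BIC(fit_small), BIC(fit_big))
)
colnames(penalized) <- c("small", "big")

r2
penalized
```

Comparing the two columns of `penalized` shows whether the tiny gain in \(R^2\) from the noise variable was enough to pay for the extra parameter.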

Task 2: Best Subsets Regression (20 points)

2a. (5 pts) Use leaps::regsubsets() to perform best subsets regression with nvmax = 15. Create a plot of Adjusted \(R^2\) vs. number of variables. At what model size does Adjusted \(R^2\) plateau?

best_subsets <- regsubsets(
  physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education +
    exercise + gen_health + income_cat + bmi,
  data = brfss_ms,
  nvmax = 15,      # maximum number of variables to consider
  method = "exhaustive"
)

best_summary <- summary(best_subsets)

subset_metrics <- tibble(
  p = seq_along(best_summary$adjr2),
  `Adj. R²` = best_summary$adjr2,
  BIC = best_summary$bic,
  Cp = best_summary$cp
)

ggplot(subset_metrics, aes(x = p, y = `Adj. R²`)) +
  geom_line(linewidth = 1, color = "steelblue") +
  geom_point(size = 3, color = "steelblue") +
  geom_vline(xintercept = which.max(best_summary$adjr2),
             linetype = "dashed", color = "tomato") +
  labs(title = "Adjusted R² by Model Size", x = "Number of Variables", y = "Adjusted R²") +
  theme_minimal(base_size = 12)

Adjusted \(R^2\) plateaus at 10 variables.

2b. (5 pts) Create a plot of BIC vs. number of variables. Which model size minimizes BIC?

ggplot(subset_metrics, aes(x = p, y = BIC)) +
  geom_line(linewidth = 1, color = "steelblue") +
  geom_point(size = 3, color = "steelblue") +
  geom_vline(xintercept = which.min(best_summary$bic),
             linetype = "dashed", color = "tomato") +
  labs(title = "BIC by Model Size", x = "Number of Variables", y = "BIC") +
  theme_minimal(base_size = 12)

BIC is minimized at a model size of 8 variables.

2c. (5 pts) Identify the variables included in the BIC-best model. Fit this model explicitly using lm() and report its coefficients.

best_bic_idx <- which.min(best_summary$bic)
best_vars <- names(which(best_summary$which[best_bic_idx, -1]))
cat("\nVariables in BIC-best model:\n")

## 
## Variables in BIC-best model:

cat(paste(" ", best_vars), sep = "\n")

##   menthlth_days
##   sleep_hrs
##   age
##   exerciseYes
##   gen_healthGood
##   gen_healthFair
##   gen_healthPoor
##   income_cat

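# regsubsets() scores each dummy column separately, but lm() enters gen_health
# as a whole factor, so the "Very good" level is estimated here as well.
BIC_best <- lm(physhlth_days ~ menthlth_days + sleep_hrs + age + exercise +
                 gen_health + income_cat,
               data = brfss_ms)
BIC_best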

## 
## Call:
## lm(formula = physhlth_days ~ menthlth_days + sleep_hrs + age + 
##     exercise + gen_health + income_cat, data = brfss_ms)
## 
## Coefficients:
##         (Intercept)        menthlth_days            sleep_hrs  
##              3.1864               0.1461              -0.1951  
##                 age          exerciseYes  gen_healthVery good  
##              0.0174              -1.2877               0.4617  
##      gen_healthGood       gen_healthFair       gen_healthPoor  
##              1.6368               7.0787              20.5084  
##          income_cat  
##             -0.1657

2d. (5 pts) Compare the BIC-best model to the Adjusted \(R^2\)-best model. Are they the same? If not, which would you prefer and why?

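They are not the same: the BIC-best model has 8 variables, while the Adjusted \(R^2\)-best model has 10. BIC penalizes each added parameter more heavily, so it favors the smaller model. I would prefer the BIC-best model because it is more parsimonious and leaves out variables that add little predictive value.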
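BIC tends to favor smaller models than AIC because its per-parameter penalty grows with the sample size. The sketch below works this out with the numbers from this lab. It assumes n = 5000, inferred from the residual degrees of freedom (4985) plus the 15 coefficients of the maximum model. R's AIC() and BIC() for lm fits also count the residual variance as a parameter, which gives k = 16.

```r
n <- 4985 + 15        # residual df + number of coefficients (assumed n)
k <- 15 + 1           # coefficients + residual variance

aic_penalty_per_param <- 2
bic_penalty_per_param <- log(n)      # about 8.52 when n = 5000

# BIC - AIC for the same model is k * (log(n) - 2)
bic_minus_aic <- k * (bic_penalty_per_param - aic_penalty_per_param)
bic_minus_aic   # about 104.3, in line with 32750.06 - 32645.79 from Task 1a
```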

Task 3: Automated Selection Methods (20 points)

3a. (5 pts) Perform backward elimination using step() with AIC as the criterion. Which variables are removed? Which remain?

mod_backward <- step(mod_max, direction = "backward", trace = 1)

## Start:  AIC=18454.4
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education + 
##     exercise + gen_health + income_cat + bmi
## 
##                 Df Sum of Sq    RSS   AIC
## - education      3        29 199231 18449
## - bmi            1        32 199234 18453
## - sex            1        43 199245 18454
## <none>                       199202 18454
## - sleep_hrs      1       329 199530 18461
## - age            1       434 199636 18463
## - income_cat     1       521 199722 18466
## - exercise       1      1174 200376 18482
## - menthlth_days  1      5898 205100 18598
## - gen_health     4     66437 265639 19886
## 
## Step:  AIC=18449.13
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + exercise + 
##     gen_health + income_cat + bmi
## 
##                 Df Sum of Sq    RSS   AIC
## - bmi            1        32 199262 18448
## - sex            1        40 199270 18448
## <none>                       199231 18449
## - sleep_hrs      1       327 199557 18455
## - age            1       439 199670 18458
## - income_cat     1       520 199751 18460
## - exercise       1      1151 200381 18476
## - menthlth_days  1      5929 205159 18594
## - gen_health     4     66459 265690 19881
## 
## Step:  AIC=18447.92
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + exercise + 
##     gen_health + income_cat
## 
##                 Df Sum of Sq    RSS   AIC
## - sex            1        42 199305 18447
## <none>                       199262 18448
## - sleep_hrs      1       334 199596 18454
## - age            1       427 199690 18457
## - income_cat     1       514 199776 18459
## - exercise       1      1222 200484 18477
## - menthlth_days  1      5921 205184 18592
## - gen_health     4     67347 266609 19896
## 
## Step:  AIC=18446.98
## physhlth_days ~ menthlth_days + sleep_hrs + age + exercise + 
##     gen_health + income_cat
## 
##                 Df Sum of Sq    RSS   AIC
## <none>                       199305 18447
## - sleep_hrs      1       337 199641 18453
## - age            1       409 199713 18455
## - income_cat     1       492 199797 18457
## - exercise       1      1214 200518 18475
## - menthlth_days  1      5882 205186 18590
## - gen_health     4     67980 267285 19906

tidy(mod_backward, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Backward Elimination Result (AIC-based)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Backward Elimination Result (AIC-based)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	3.1864	0.6663	4.7819	0.0000	1.8800	4.4927
menthlth_days	0.1461	0.0120	12.1352	0.0000	0.1225	0.1697
sleep_hrs	-0.1951	0.0672	-2.9038	0.0037	-0.3269	-0.0634
age	0.0174	0.0054	3.1981	0.0014	0.0067	0.0281
exerciseYes	-1.2877	0.2336	-5.5127	0.0000	-1.7457	-0.8298
gen_healthVery good	0.4617	0.2441	1.8914	0.0586	-0.0169	0.9403
gen_healthGood	1.6368	0.2600	6.2953	0.0000	1.1271	2.1465
gen_healthFair	7.0787	0.3616	19.5735	0.0000	6.3697	7.7876
gen_healthPoor	20.5084	0.5423	37.8149	0.0000	19.4452	21.5716
income_cat	-0.1657	0.0472	-3.5115	0.0004	-0.2582	-0.0732

Education, BMI, and sex are removed (in that order), leaving menthlth_days, sleep_hrs, age, exercise, gen_health, and income_cat.

3b. (5 pts) Perform forward selection using step(). Does it arrive at the same model as backward elimination?

mod_null <- lm(physhlth_days ~ 1, data = brfss_ms)

mod_forward <- step(mod_null,
                    scope = list(lower = mod_null, upper = mod_max),
                    direction = "forward", trace = 1)

## Start:  AIC=20865.24
## physhlth_days ~ 1
## 
##                 Df Sum of Sq    RSS   AIC
## + gen_health     4    115918 208518 18663
## + menthlth_days  1     29743 294693 20387
## + exercise       1     19397 305038 20559
## + income_cat     1     19104 305332 20564
## + education      3      5906 318530 20779
## + age            1      4173 320263 20803
## + bmi            1      4041 320395 20805
## + sleep_hrs      1      3717 320719 20810
## <none>                       324435 20865
## + sex            1         7 324429 20867
## 
## Step:  AIC=18662.93
## physhlth_days ~ gen_health
## 
##                 Df Sum of Sq    RSS   AIC
## + menthlth_days  1    6394.9 202123 18509
## + exercise       1    1652.4 206865 18625
## + income_cat     1    1306.9 207211 18634
## + sleep_hrs      1     756.1 207762 18647
## + bmi            1      91.2 208427 18663
## <none>                       208518 18663
## + sex            1      38.5 208479 18664
## + age            1      32.2 208486 18664
## + education      3     145.0 208373 18666
## 
## Step:  AIC=18509.19
## physhlth_days ~ gen_health + menthlth_days
## 
##              Df Sum of Sq    RSS   AIC
## + exercise    1   1650.52 200472 18470
## + income_cat  1    817.89 201305 18491
## + age         1    464.73 201658 18500
## + sleep_hrs   1    257.79 201865 18505
## + bmi         1     90.51 202032 18509
## <none>                    202123 18509
## + sex         1      3.00 202120 18511
## + education   3    111.58 202011 18512
## 
## Step:  AIC=18470.19
## physhlth_days ~ gen_health + menthlth_days + exercise
## 
##              Df Sum of Sq    RSS   AIC
## + income_cat  1    509.09 199963 18460
## + age         1    333.74 200139 18464
## + sleep_hrs   1    253.06 200219 18466
## <none>                    200472 18470
## + bmi         1     21.21 200451 18472
## + sex         1     10.74 200462 18472
## + education   3     26.94 200445 18476
## 
## Step:  AIC=18459.48
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat
## 
##             Df Sum of Sq    RSS   AIC
## + age        1    321.97 199641 18453
## + sleep_hrs  1    250.25 199713 18455
## <none>                   199963 18460
## + bmi        1     27.98 199935 18461
## + sex        1     27.17 199936 18461
## + education  3     26.66 199937 18465
## 
## Step:  AIC=18453.42
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat + 
##     age
## 
##             Df Sum of Sq    RSS   AIC
## + sleep_hrs  1    336.79 199305 18447
## <none>                   199641 18453
## + sex        1     45.31 199596 18454
## + bmi        1     42.00 199599 18454
## + education  3     22.62 199619 18459
## 
## Step:  AIC=18446.98
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat + 
##     age + sleep_hrs
## 
##             Df Sum of Sq    RSS   AIC
## <none>                   199305 18447
## + sex        1    42.328 199262 18448
## + bmi        1    34.434 199270 18448
## + education  3    24.800 199280 18452

tidy(mod_forward, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Forward Selection Result (AIC-based)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Forward Selection Result (AIC-based)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	3.1864	0.6663	4.7819	0.0000	1.8800	4.4927
gen_healthVery good	0.4617	0.2441	1.8914	0.0586	-0.0169	0.9403
gen_healthGood	1.6368	0.2600	6.2953	0.0000	1.1271	2.1465
gen_healthFair	7.0787	0.3616	19.5735	0.0000	6.3697	7.7876
gen_healthPoor	20.5084	0.5423	37.8149	0.0000	19.4452	21.5716
menthlth_days	0.1461	0.0120	12.1352	0.0000	0.1225	0.1697
exerciseYes	-1.2877	0.2336	-5.5127	0.0000	-1.7457	-0.8298
income_cat	-0.1657	0.0472	-3.5115	0.0004	-0.2582	-0.0732
age	0.0174	0.0054	3.1981	0.0014	0.0067	0.0281
sleep_hrs	-0.1951	0.0672	-2.9038	0.0037	-0.3269	-0.0634

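Yes. Forward selection adds gen_health, menthlth_days, exercise, income_cat, age, and sleep_hrs, which is the same set of variables retained by backward elimination, and the coefficients are identical.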

3c. (5 pts) Compare the backward, forward, and stepwise results in a single table showing the number of variables, Adjusted \(R^2\), AIC, and BIC for each.

method_comparison <- tribble(
  ~Method, ~`Variables selected`, ~`Adj. R²`, ~AIC, ~BIC,
  "Maximum model",
    length(coef(mod_max)) - 1,
    round(glance(mod_max)$adj.r.squared, 4),
    round(AIC(mod_max), 1),
    round(BIC(mod_max), 1),
  "Backward (AIC)",
    length(coef(mod_backward)) - 1,
    round(glance(mod_backward)$adj.r.squared, 4),
    round(AIC(mod_backward), 1),
    round(BIC(mod_backward), 1),
  "Forward (AIC)",
    length(coef(mod_forward)) - 1,
    round(glance(mod_forward)$adj.r.squared, 4),
    round(AIC(mod_forward), 1),
    round(BIC(mod_forward), 1)
)

method_comparison |>
  kable(caption = "Comparison of Variable Selection Methods") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Comparison of Variable Selection Methods
Method	Variables selected	Adj. R²	AIC	BIC
Maximum model	14	0.3843	32645.8	32750.1
Backward (AIC)	9	0.3846	32638.4	32710.1
Forward (AIC)	9	0.3846	32638.4	32710.1
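The AIC values printed by step() (for example 18446.98 for the final model) are not on the same scale as those from AIC() in the table above (32638.4). step() relies on extractAIC(), which for lm fits drops an additive constant that depends only on n, so the ranking of models fitted to the same data is unaffected. The sketch below checks this on the built-in mtcars data, not the BRFSS file.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
n   <- nobs(fit)

aic_full <- AIC(fit)              # based on the full Gaussian log-likelihood
aic_step <- extractAIC(fit)[2]    # n * log(RSS / n) + 2 * (number of coefficients)

# The gap is a constant for a given n: n * (1 + log(2 * pi)) + 2,
# where the "+ 2" comes from AIC() also counting the residual variance.
gap      <- aic_full - aic_step
expected <- n * (1 + log(2 * pi)) + 2

all.equal(gap, expected)
```

Assuming n = 5000 for the lab data, the constant is about 14191.4, which agrees with 32638.4 - 18446.98 up to rounding.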

3d. (5 pts) List three reasons why you should not blindly trust the results of automated variable selection. Which of these concerns is most relevant for epidemiological research?

1. Selection involves repeated testing, which inflates the risk of Type I error (false positives), so variables that are not truly predictive may be included.
2. The p-values from the final model are biased toward significance (too small), because the model was chosen to optimize statistical fit on the same data.
3. Selection may remove confounders that are well established in the literature but not significantly associated with the outcome in this sample.

The third concern is most relevant for epidemiological research. Observational studies depend on controlling for confounding, and dropping a known confounder because it lacks statistical significance in one sample can bias the estimated exposure-outcome association.
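The first concern can be seen in a small simulation. The sketch below uses entirely made-up data in which the outcome is unrelated to all 20 candidate predictors, then runs forward selection with step(); any variable that gets picked is a false positive. The sample size, seed, and variable names are arbitrary choices.

```r
set.seed(553)
n <- 200
sim <- as.data.frame(matrix(rnorm(n * 20), nrow = n))  # columns V1 ... V20
sim$y <- rnorm(n)                                      # outcome is pure noise

null_fit <- lm(y ~ 1, data = sim)
upper    <- reformulate(paste0("V", 1:20), response = "y")

selected <- step(null_fit,
                 scope = list(lower = ~ 1, upper = upper),
                 direction = "forward", trace = 0)

# Number of noise predictors that AIC-based forward selection let in
n_false_pos <- length(coef(selected)) - 1
n_false_pos
```

Rerunning with different seeds gives a sense of how often noise variables slip in. AIC admits a single-df term whenever it lowers the deviance criterion by more than 2, which corresponds to roughly a 16% significance level rather than 5%.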

Task 4: Associative Model Building (25 points)

For this task, the exposure is sleep_hrs and the outcome is physhlth_days. You are building an associative model to estimate the effect of sleep on physical health.

4a. (5 pts) Fit the crude model: physhlth_days ~ sleep_hrs. Report the sleep coefficient.

crude <- lm(physhlth_days ~ sleep_hrs,
                    data = brfss_ms)
tidy(crude, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Crude model -- Sleep and Physical health",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Crude model – Sleep and Physical health
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	7.9110	0.5959	13.2755	0	6.7428	9.0793
sleep_hrs	-0.6321	0.0831	-7.6104	0	-0.7949	-0.4693

coef(crude)["sleep_hrs"]

##  sleep_hrs 
## -0.6320888

4b. (10 pts) Fit the maximum associative model: physhlth_days ~ sleep_hrs + [all other covariates]. Note the adjusted sleep coefficient and compute the 10% interval. Then systematically remove each covariate one at a time and determine which are confounders using the 10% rule. Present your results in a summary table.

mod_assoc_max <- lm(physhlth_days ~ sleep_hrs + exercise + menthlth_days + age +
                      sex + education + income_cat + bmi,
                    data = brfss_ms)

b_exposure_max <- coef(mod_assoc_max)["sleep_hrs"]
interval_low <- b_exposure_max - 0.10 * abs(b_exposure_max)
interval_high <- b_exposure_max + 0.10 * abs(b_exposure_max)

cat("Exposure coefficient in maximum model:", round(b_exposure_max, 4), "\n")

## Exposure coefficient in maximum model: -0.3593

cat("10% interval: (", round(interval_low, 4), ",", round(interval_high, 4), ")\n\n")

## 10% interval: ( -0.3952 , -0.3233 )

covariates_to_test <- c("exercise", "menthlth_days", "age", "sex",
                         "education", "income_cat", "bmi")

assoc_table <- map_dfr(covariates_to_test, \(cov) {
  # Build formula without this covariate
  remaining <- setdiff(covariates_to_test, cov)
  form <- as.formula(paste("physhlth_days ~ sleep_hrs +", paste(remaining, collapse = " + ")))
  mod_reduced <- lm(form, data = brfss_ms)
  b_reduced <- coef(mod_reduced)["sleep_hrs"]
  pct_change <- (b_reduced - b_exposure_max) / abs(b_exposure_max) * 100

  tibble(
    `Removed covariate` = cov,
    `Sleep β (max)` = round(b_exposure_max, 4),
    `Sleep β (without)` = round(b_reduced, 4),
    `% Change` = round(pct_change, 1),
    `Within 10%?` = ifelse(abs(pct_change) <= 10, "Yes (drop)", "No (keep)"),
    Confounder = ifelse(abs(pct_change) > 10, "Yes", "No")
  )
})

assoc_table |>
  kable(caption = "Associative Model: Systematic Confounder Assessment for Sleep") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
  column_spec(6, bold = TRUE)

Associative Model: Systematic Confounder Assessment for Sleep
Removed covariate	Sleep β (max)	Sleep β (without)	% Change	Within 10%?	Confounder
exercise	-0.3593	-0.3779	-5.2	Yes (drop)	No
menthlth_days	-0.3593	-0.5804	-61.6	No (keep)	Yes
age	-0.3593	-0.2733	23.9	No (keep)	Yes
sex	-0.3593	-0.3633	-1.1	Yes (drop)	No
education	-0.3593	-0.3611	-0.5	Yes (drop)	No
income_cat	-0.3593	-0.3723	-3.6	Yes (drop)	No
bmi	-0.3593	-0.3738	-4.0	Yes (drop)	No

4c. (5 pts) Fit the final associative model including only sleep and the identified confounders. Report the sleep coefficient and its 95% CI.

final <- lm(physhlth_days ~ sleep_hrs + menthlth_days + age,
                    data = brfss_ms)

tidy(final, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Final associative model -- Sleep and Physical health",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Final associative model – Sleep and Physical health
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	1.2870	0.6457	1.9932	0.0463	0.0211	2.5529
sleep_hrs	-0.4473	0.0800	-5.5903	0.0000	-0.6042	-0.2904
menthlth_days	0.3119	0.0136	22.9250	0.0000	0.2852	0.3385
age	0.0755	0.0063	12.0663	0.0000	0.0633	0.0878

coef(final)["sleep_hrs"]

##  sleep_hrs 
## -0.4472979

The adjusted sleep coefficient is -0.4473 (95% CI: -0.6042, -0.2904).

4d. (5 pts) A reviewer asks: “Why didn’t you just use stepwise selection?” Write a 3–4 sentence response explaining why automated selection is inappropriate for this associative analysis.

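Stepwise selection is inappropriate because the goal of this analysis is to estimate the sleep coefficient free of confounding, not to maximize fit. An associative analysis selects covariates by whether they confound the exposure-outcome relationship, whereas stepwise selection is a predictive tool that selects covariates by statistical fit. Stepwise selection can therefore drop a true confounder that is only weakly related to the outcome, or keep a strong predictor that has no bearing on the sleep estimate. It is not the right tool for keeping only the variables that confound the sleep-physhlth_days relationship.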

Task 5: Synthesis (20 points)

5a. (10 pts) You have now built two models for the same data:

A predictive model (from Task 2 or 3, the best model by AIC/BIC)
An associative model (from Task 4, focused on sleep)

Compare these two models: Do they include the same variables? Is the sleep coefficient similar? Why might they differ?

They do not include the same variables. The associative model keeps sleep_hrs, menthlth_days, and age, while the predictive model keeps menthlth_days, sleep_hrs, age, exercise, gen_health, and income_cat. The sleep coefficient is -0.4473 in the associative model and -0.1951 in the predictive model. The coefficients differ because the models adjust for different variables: the predictive model also holds exercise, gen_health, and income_cat constant, and these likely account for part of the variation that the associative model attributes to sleep. More fundamentally, the predictive model chooses variables for statistical fit, while the associative model keeps only those that confound the sleep-physhlth_days relationship.

5b. (10 pts) Write a 4–5 sentence paragraph for a public health audience describing the results of your associative model. Include:

The adjusted effect of sleep on physical health days
Which variables needed to be accounted for (confounders)
The direction and approximate magnitude of the association
A caveat about cross-sectional data

Do not use statistical jargon.

Among BRFSS 2020 respondents, each additional hour of sleep per night was linked to about 0.45 fewer physically unhealthy days in the past 30 days (roughly half a day), when comparing people of the same age with the same number of mentally unhealthy days. Age and mentally unhealthy days had to be accounted for because both are related to how much people sleep and to their physical health, and leaving them out would distort the comparison. The relationship is modest: people who sleep more tend to report slightly fewer physically unhealthy days. Because BRFSS surveys people at a single point in time, these results cannot show cause and effect. We cannot tell whether more sleep leads to better physical health or whether poor physical health leads to less sleep.

End of Lab Activity

Model Selection

EPI 553 — Principles of Statistical Inference II (Spring 2026)

Josh Macera

April 2, 2026