Part 2: In-Class Lab Activity

EPI 553: Model Selection Lab

Due: End of class, March 24, 2026

Instructions

In this lab, you will practice both predictive and associative model selection using the BRFSS 2020 dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.

Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.

Data for the Lab

Use the saved analytic dataset from today’s lecture.

Variable	Description	Type
`physhlth_days`	Physically unhealthy days in past 30	Continuous (0–30)
`menthlth_days`	Mentally unhealthy days in past 30	Continuous (0–30)
`sleep_hrs`	Sleep hours per night	Continuous (1–14)
`age`	Age in years (capped at 80)	Continuous
`sex`	Sex (Male/Female)	Factor
`education`	Education level (4 categories)	Factor
`exercise`	Any physical activity (Yes/No)	Factor
`gen_health`	General health status (5 categories)	Factor
`income_cat`	Household income (1–8 ordinal)	Numeric
`bmi`	Body mass index	Continuous

library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(car)
library(leaps)
library(MASS)

brfss_ms <- readRDS(
  "C:/Users/suruc/OneDrive/Desktop/R/EPI553_Rclass/brfss_ms_2020.rds"
)

Task 1: Maximum Model and Criteria Comparison (15 points)

1a. (5 pts) Fit the maximum model predicting physhlth_days from all 9 candidate predictors. Report \(R^2\), Adjusted \(R^2\), AIC, and BIC.

# The maximum model with all candidate predictors

mod_max <- lm(physhlth_days ~ menthlth_days + sleep_hrs + age + sex +
                education + exercise + gen_health + income_cat + bmi,
              data = brfss_ms)

tidy(mod_max, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Maximum Model: All Candidate Predictors",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Maximum Model: All Candidate Predictors
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	2.6902	0.8556	3.1441	0.0017	1.0128	4.3676
menthlth_days	0.1472	0.0121	12.1488	0.0000	0.1235	0.1710
sleep_hrs	-0.1930	0.0673	-2.8679	0.0041	-0.3249	-0.0611
age	0.0180	0.0055	3.2969	0.0010	0.0073	0.0288
sexFemale	-0.1889	0.1820	-1.0376	0.2995	-0.5458	0.1680
educationHS graduate	0.2508	0.4297	0.5836	0.5595	-0.5917	1.0933
educationSome college	0.3463	0.4324	0.8009	0.4233	-0.5014	1.1940
educationCollege graduate	0.3336	0.4357	0.7657	0.4439	-0.5206	1.1878
exerciseYes	-1.2866	0.2374	-5.4199	0.0000	-1.7520	-0.8212
gen_healthVery good	0.4373	0.2453	1.7824	0.0747	-0.0437	0.9183
gen_healthGood	1.5913	0.2651	6.0022	0.0000	1.0716	2.1111
gen_healthFair	7.0176	0.3682	19.0586	0.0000	6.2957	7.7394
gen_healthPoor	20.4374	0.5469	37.3722	0.0000	19.3653	21.5095
income_cat	-0.1817	0.0503	-3.6092	0.0003	-0.2803	-0.0830
bmi	0.0130	0.0145	0.8997	0.3683	-0.0153	0.0414

glance(mod_max) |>
  dplyr::select(r.squared, adj.r.squared, sigma, AIC, BIC, df.residual) |>
  mutate(across(everything(), \(x) round(x, 3))) |>
  kable(caption = "Maximum Model: Fit Statistics") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Maximum Model: Fit Statistics
r.squared	adj.r.squared	sigma	AIC	BIC	df.residual
0.386	0.384	6.321	32645.79	32750.06	4985

\(R^2\) = 0.386; Adjusted \(R^2\) = 0.384; AIC = 32645.79; BIC = 32750.06

Interpretation: The maximum model explains approximately 38.6% of the variance in physically unhealthy days (R² = 0.386, Adjusted R² = 0.384). The strongest predictors are general health status (with “Poor” health associated with about 20 more unhealthy days compared to “Excellent”) and mental health days (each additional mentally unhealthy day is associated with 0.15 more physically unhealthy days). Exercise is also strongly associated, with exercisers reporting about 1.3 fewer physically unhealthy days. Several variables, including sex (p = 0.30), education (p > 0.40 for all levels), and BMI (p = 0.37), are not statistically significant, suggesting they may be candidates for removal in a more parsimonious model. The AIC is 32,645.8 and BIC is 32,750.1; these serve as baselines for comparing simpler models.

1b. (5 pts) Now fit a “minimal” model using only menthlth_days and age. Report the same four criteria. How do the two models compare?

# The minimum model with two candidate predictors

mod_min <- lm(physhlth_days ~ menthlth_days + age,
              data = brfss_ms)

tidy(mod_min, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Minimum Model: Two Candidate Predictors",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Minimum Model: Two Candidate Predictors
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	-1.6983	0.3641	-4.6647	0	-2.4121	-0.9846
menthlth_days	0.3237	0.0135	24.0149	0	0.2973	0.3501
age	0.0716	0.0062	11.4763	0	0.0594	0.0838

glance(mod_min) |>
  dplyr::select(r.squared, adj.r.squared, sigma, AIC, BIC, df.residual) |>
  mutate(across(everything(), \(x) round(x, 3))) |>
  kable(caption = "Minimum Model: Fit Statistics") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Minimum Model: Fit Statistics
r.squared	adj.r.squared	sigma	AIC	BIC	df.residual
0.115	0.115	7.58	34449.78	34475.85	4997

\(R^2\) = 0.115; Adjusted \(R^2\) = 0.115; AIC = 34449.78; BIC = 34475.85

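Interpretation: The minimum model explains only about 11.5% of the variance in physically unhealthy days (R² = 0.115, Adjusted R² = 0.115), with AIC = 34,449.78 and BIC = 34,475.85. The maximum model fits far better on every criterion: R² rises from 0.115 to 0.386, while AIC falls by about 1,804 and BIC by about 1,726, so the additional predictors more than pay for their complexity penalty.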
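
Because the minimal model is nested in the maximum model, the comparison can also be made formally with a partial F-test. The following is a minimal sketch on simulated data; the data frame `sim` and the variables `x1`, `x2`, `x3` are hypothetical stand-ins, not the BRFSS analytic file. It lines up AIC and BIC for a small and a large model and then runs `anova()` on the nested pair.

```r
# Simulated stand-in data (hypothetical; not the BRFSS analytic file)
set.seed(553)
n <- 500
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 1 + 0.8 * sim$x1 + 0.5 * sim$x2 + rnorm(n)

fit_small <- lm(y ~ x1, data = sim)            # nested "minimal" model
fit_large <- lm(y ~ x1 + x2 + x3, data = sim)  # larger "maximum" model

# Information criteria side by side: lower is better
ic_table <- data.frame(
  model = c("small", "large"),
  AIC   = c(AIC(fit_small), AIC(fit_large)),
  BIC   = c(BIC(fit_small), BIC(fit_large))
)
print(ic_table)

# Partial F-test: do the added predictors jointly improve fit?
f_test <- anova(fit_small, fit_large)
print(f_test)
```

The F-test only applies to nested models, whereas AIC and BIC can also rank models that are not nested.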

1c. (5 pts) Explain why \(R^2\) is a poor criterion for comparing these two models. What makes Adjusted \(R^2\), AIC, and BIC better choices?

Given a set of candidate models, we need a criterion to compare them. We cover five: \(R^2\), Adjusted \(R^2\), \(F_p\) (partial F-test), AIC, and BIC.

\(R^2\) (Coefficient of Determination)

\[R^2 = 1 - \frac{SSE}{SST} = \frac{SSR}{SST}\]

\(R^2\) measures the proportion of variance in \(Y\) explained by the model. However, \(R^2\) always increases (or stays the same) when you add a predictor, regardless of whether it is useful. This makes raw \(R^2\) useless for model comparison across models of different sizes.

# Demonstrate that R2 always increases
models <- list(
  "Sleep only"       = lm(physhlth_days ~ sleep_hrs, data = brfss_ms),
  "+ age"            = lm(physhlth_days ~ sleep_hrs + age, data = brfss_ms),
  "+ sex"            = lm(physhlth_days ~ sleep_hrs + age + sex, data = brfss_ms),
  "+ education"      = lm(physhlth_days ~ sleep_hrs + age + sex + education, data = brfss_ms),
  "+ exercise"       = lm(physhlth_days ~ sleep_hrs + age + sex + education + exercise, data = brfss_ms),
  "+ gen_health"     = lm(physhlth_days ~ sleep_hrs + age + sex + education + exercise + gen_health, data = brfss_ms),
  "+ mental health"  = lm(physhlth_days ~ sleep_hrs + age + sex + education + exercise + gen_health + menthlth_days, data = brfss_ms),
  "+ income"         = lm(physhlth_days ~ sleep_hrs + age + sex + education + exercise + gen_health + menthlth_days + income_cat, data = brfss_ms),
  "+ BMI (full)"     = lm(physhlth_days ~ sleep_hrs + age + sex + education + exercise + gen_health + menthlth_days + income_cat + bmi, data = brfss_ms)
)

r2_table <- map_dfr(names(models), \(name) {
  g <- glance(models[[name]])
  tibble(
    Model = name,
    p = length(coef(models[[name]])) - 1,
    `R²` = round(g$r.squared, 4),
    `Adj. R²` = round(g$adj.r.squared, 4),
    AIC = round(g$AIC, 1),
    BIC = round(g$BIC, 1)
  )
})

r2_table |>
  kable(caption = "Model Comparison: R² Always Increases as Predictors Are Added") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Model Comparison: R² Always Increases as Predictors Are Added
Model	p	R²	Adj. R²	AIC	BIC
Sleep only	1	0.0115	0.0113	35001.0	35020.6
+ age	2	0.0280	0.0276	34918.7	34944.8
+ sex	3	0.0280	0.0274	34920.7	34953.3
+ education	6	0.0440	0.0428	34843.7	34895.9
+ exercise	7	0.0849	0.0836	34626.8	34685.5
+ gen_health	11	0.3650	0.3636	32807.7	32892.4
+ mental health	12	0.3843	0.3828	32655.4	32746.6
+ income	13	0.3859	0.3843	32644.6	32742.4
+ BMI (full)	14	0.3860	0.3843	32645.8	32750.1

Interpretation: R² never decreases as predictors are added, rising from 0.012 (sleep only) to 0.386 (full model). Adjusted R² tells a different story: it dips slightly when sex is added (0.0276 to 0.0274), plateaus at 0.384 after adding income (the 8th predictor), and does not improve when BMI is added (still 0.384). The largest single jump in both R² and Adjusted R² occurs when general health is added (R² from 0.085 to 0.365), indicating it is by far the most powerful predictor. AIC and BIC both decrease sharply at that same step. In this sequence both AIC (32,644.6) and BIC (32,742.4) reach their minimum at the model with income and rise when BMI is added; BIC, which penalizes complexity more heavily, rises by more (7.7 vs. 1.2). This table illustrates a key lesson: R² will always reward you for adding variables, even useless ones, making it unreliable for comparing models of different sizes, whereas Adjusted R², AIC, and BIC each subtract a penalty for model size.
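
To see where the penalties come from, the three criteria can be recomputed by hand and checked against R's built-in functions. This is a minimal sketch on simulated data (the data frame `sim` and its columns are hypothetical). It assumes the standard definitions for a linear model with \(p\) predictors: Adjusted \(R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1)\), AIC \(= -2\ell + 2k\), and BIC \(= -2\ell + k\log n\), where \(\ell\) is the maximized Gaussian log-likelihood and \(k = p + 2\) counts the coefficients plus the error variance.

```r
# Simulated stand-in data (hypothetical): one real predictor, one pure noise
set.seed(553)
n <- 200
sim <- data.frame(x1 = rnorm(n), noise = rnorm(n))
sim$y <- 2 + 1.5 * sim$x1 + rnorm(n)

fit <- lm(y ~ x1 + noise, data = sim)

rss <- sum(resid(fit)^2)               # SSE
sst <- sum((sim$y - mean(sim$y))^2)    # SST
p   <- length(coef(fit)) - 1           # predictors, excluding the intercept
k   <- p + 2                           # intercept + slopes + error variance

r2     <- 1 - rss / sst
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Maximized Gaussian log-likelihood of the linear model
loglik <- -n / 2 * (log(2 * pi) + log(rss / n) + 1)

aic_hand <- -2 * loglik + 2 * k
bic_hand <- -2 * loglik + log(n) * k

# Each hand-computed value should agree with the built-in one
print(c(adj_r2 = adj_r2, builtin = summary(fit)$adj.r.squared))
print(c(aic = aic_hand, builtin = AIC(fit)))
print(c(bic = bic_hand, builtin = BIC(fit)))
```

The only difference between AIC and BIC is the multiplier on \(k\): 2 versus \(\log n\). With \(n = 5000\), \(\log n \approx 8.5\), which is why BIC tends to favor smaller models in this lab.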

Task 2: Best Subsets Regression (20 points)

2a. (5 pts) Use leaps::regsubsets() to perform best subsets regression with nvmax = 15. Create a plot of Adjusted \(R^2\) vs. number of variables. At what model size does Adjusted \(R^2\) plateau?

# Prepare a model matrix (need numeric predictors for leaps)
# Use the formula interface approach

best_subsets <- regsubsets(
  physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education +
    exercise + gen_health + income_cat + bmi,
  data = brfss_ms,
  nvmax = 15,      # maximum number of variables to consider
  method = "exhaustive"
)

best_summary <- summary(best_subsets)

subset_metrics <- tibble(
  p = 1:length(best_summary$adjr2),
  `Adj. R²` = best_summary$adjr2,
  Cp = best_summary$cp
)

p1 <- ggplot(subset_metrics, aes(x = p, y = `Adj. R²`)) +
  geom_line(linewidth = 1, color = "steelblue") +
  geom_point(size = 3, color = "steelblue") +
  geom_vline(xintercept = which.max(best_summary$adjr2),
             linetype = "dashed", color = "tomato") +
  labs(title = "Adjusted R² by Model Size", x = "Number of Variables", y = "Adjusted R²") +
  theme_minimal(base_size = 12)

print(p1)

Best Subsets: Adjusted R²

2b. (5 pts) Create a plot of BIC vs. number of variables. Which model size minimizes BIC?

subset_metrics <- tibble(
  p = 1:length(best_summary$adjr2),
  BIC = best_summary$bic,
  Cp = best_summary$cp
)

p2 <- ggplot(subset_metrics, aes(x = p, y = BIC)) +
  geom_line(linewidth = 1, color = "steelblue") +
  geom_point(size = 3, color = "steelblue") +
  geom_vline(xintercept = which.min(best_summary$bic),
             linetype = "dashed", color = "tomato") +
  labs(title = "BIC by Model Size", x = "Number of Variables", y = "BIC") +
  theme_minimal(base_size = 12)

print(p2)

BIC by Model Size

2c. (5 pts) Identify the variables included in the BIC-best model. Fit this model explicitly using lm() and report its coefficients.

cat("Best model by Adj. R²:", which.max(best_summary$adjr2), "variables\n")

## Best model by Adj. R²: 10 variables

cat("Best model by BIC:", which.min(best_summary$bic), "variables\n")

## Best model by BIC: 8 variables

# Show which variables are in the BIC-best model
best_bic_idx <- which.min(best_summary$bic)
best_vars <- names(which(best_summary$which[best_bic_idx, -1]))
cat("\nVariables in BIC-best model:\n")

## 
## Variables in BIC-best model:

cat(paste(" ", best_vars), sep = "\n")

##   menthlth_days
##   sleep_hrs
##   age
##   exerciseYes
##   gen_healthGood
##   gen_healthFair
##   gen_healthPoor
##   income_cat
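
Note that `regsubsets()` treats each dummy column as a separate candidate, which is why `gen_healthVery good` is absent from the list even though three other general-health levels were chosen. Before refitting with `lm()`, the selected columns need to be mapped back to the variables they belong to. The following is a minimal sketch of that mapping using base R's `model.matrix()`; the data frame `toy` and the vector `selected_cols` are made up for illustration.

```r
# Toy data (hypothetical) with one numeric and one 5-level factor predictor
health_levels <- c("Excellent", "Very good", "Good", "Fair", "Poor")
toy <- data.frame(
  y          = seq_len(20),
  age        = rep(c(30, 45, 60, 75), times = 5),
  gen_health = factor(rep(health_levels, each = 4), levels = health_levels)
)

form <- y ~ age + gen_health
mm   <- model.matrix(form, data = toy)

# The "assign" attribute records which term each design-matrix column
# came from (0 = intercept, 1 = first term, 2 = second term, ...)
term_labels <- attr(terms(form), "term.labels")
col_to_term <- setNames(term_labels[attr(mm, "assign")[-1]], colnames(mm)[-1])

# Columns a best-subsets search might return (hypothetical selection)
selected_cols  <- c("age", "gen_healthGood", "gen_healthFair", "gen_healthPoor")
selected_terms <- unique(unname(col_to_term[selected_cols]))
print(selected_terms)
```

Entering the whole factor in `lm()` keeps every level of a categorical variable together, so the refitted model can contain a level that best subsets did not pick on its own.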

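# Fit the BIC-best model.
# regsubsets() selected individual dummy columns; gen_health is entered here
# as a whole factor, so the "Very good" level is included as well.

BIC_mod <- lm(physhlth_days ~ menthlth_days + sleep_hrs + age + exercise + gen_health + income_cat, data = brfss_ms)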

summary(BIC_mod)

## 
## Call:
## lm(formula = physhlth_days ~ menthlth_days + sleep_hrs + age + 
##     exercise + gen_health + income_cat, data = brfss_ms)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.5956  -2.3238  -0.9004   0.0081  30.3580 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.18636    0.66634   4.782 1.79e-06 ***
## menthlth_days        0.14608    0.01204  12.135  < 2e-16 ***
## sleep_hrs           -0.19515    0.06720  -2.904  0.00370 ** 
## age                  0.01740    0.00544   3.198  0.00139 ** 
## exerciseYes         -1.28774    0.23360  -5.513 3.71e-08 ***
## gen_healthVery good  0.46171    0.24411   1.891  0.05863 .  
## gen_healthGood       1.63676    0.26000   6.295 3.33e-10 ***
## gen_healthFair       7.07865    0.36164  19.573  < 2e-16 ***
## gen_healthPoor      20.50841    0.54234  37.815  < 2e-16 ***
## income_cat          -0.16570    0.04719  -3.511  0.00045 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.32 on 4990 degrees of freedom
## Multiple R-squared:  0.3857, Adjusted R-squared:  0.3846 
## F-statistic: 348.1 on 9 and 4990 DF,  p-value: < 2.2e-16

tidy(BIC_mod, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "BIC-Best Model Coefficients",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

BIC-Best Model Coefficients
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	3.1864	0.6663	4.7819	0.0000	1.8800	4.4927
menthlth_days	0.1461	0.0120	12.1352	0.0000	0.1225	0.1697
sleep_hrs	-0.1951	0.0672	-2.9038	0.0037	-0.3269	-0.0634
age	0.0174	0.0054	3.1981	0.0014	0.0067	0.0281
exerciseYes	-1.2877	0.2336	-5.5127	0.0000	-1.7457	-0.8298
gen_healthVery good	0.4617	0.2441	1.8914	0.0586	-0.0169	0.9403
gen_healthGood	1.6368	0.2600	6.2953	0.0000	1.1271	2.1465
gen_healthFair	7.0787	0.3616	19.5735	0.0000	6.3697	7.7876
gen_healthPoor	20.5084	0.5423	37.8149	0.0000	19.4452	21.5716
income_cat	-0.1657	0.0472	-3.5115	0.0004	-0.2582	-0.0732

2d. (5 pts) Compare the BIC-best model to the Adjusted \(R^2\)-best model. Are they the same? If not, which would you prefer and why?

The BIC-best model and the Adjusted R²-best model are not the same: BIC selects 8 variables, while Adjusted R² selects 10. Both plots show fit improving quickly over the first few predictors and then levelling off, so the two models explain almost the same amount of variation in physically unhealthy days. BIC penalizes extra variables more strongly and prefers the simpler model when additional predictors do not meaningfully improve fit. Because the BIC-best model achieves nearly the same explanatory power with fewer predictors, I would prefer it.

Task 3: Automated Selection Methods (20 points)

3a. (5 pts) Perform backward elimination using step() with AIC as the criterion. Which variables are removed? Which remain?

# Automated backward elimination using AIC
mod_backward <- step(mod_max, direction = "backward", trace = 1)

## Start:  AIC=18454.4
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education + 
##     exercise + gen_health + income_cat + bmi
## 
##                 Df Sum of Sq    RSS   AIC
## - education      3        29 199231 18449
## - bmi            1        32 199234 18453
## - sex            1        43 199245 18454
## <none>                       199202 18454
## - sleep_hrs      1       329 199530 18461
## - age            1       434 199636 18463
## - income_cat     1       521 199722 18466
## - exercise       1      1174 200376 18482
## - menthlth_days  1      5898 205100 18598
## - gen_health     4     66437 265639 19886
## 
## Step:  AIC=18449.13
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + exercise + 
##     gen_health + income_cat + bmi
## 
##                 Df Sum of Sq    RSS   AIC
## - bmi            1        32 199262 18448
## - sex            1        40 199270 18448
## <none>                       199231 18449
## - sleep_hrs      1       327 199557 18455
## - age            1       439 199670 18458
## - income_cat     1       520 199751 18460
## - exercise       1      1151 200381 18476
## - menthlth_days  1      5929 205159 18594
## - gen_health     4     66459 265690 19881
## 
## Step:  AIC=18447.92
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + exercise + 
##     gen_health + income_cat
## 
##                 Df Sum of Sq    RSS   AIC
## - sex            1        42 199305 18447
## <none>                       199262 18448
## - sleep_hrs      1       334 199596 18454
## - age            1       427 199690 18457
## - income_cat     1       514 199776 18459
## - exercise       1      1222 200484 18477
## - menthlth_days  1      5921 205184 18592
## - gen_health     4     67347 266609 19896
## 
## Step:  AIC=18446.98
## physhlth_days ~ menthlth_days + sleep_hrs + age + exercise + 
##     gen_health + income_cat
## 
##                 Df Sum of Sq    RSS   AIC
## <none>                       199305 18447
## - sleep_hrs      1       337 199641 18453
## - age            1       409 199713 18455
## - income_cat     1       492 199797 18457
## - exercise       1      1214 200518 18475
## - menthlth_days  1      5882 205186 18590
## - gen_health     4     67980 267285 19906

tidy(mod_backward, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Backward Elimination Result (AIC-based)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Backward Elimination Result (AIC-based)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	3.1864	0.6663	4.7819	0.0000	1.8800	4.4927
menthlth_days	0.1461	0.0120	12.1352	0.0000	0.1225	0.1697
sleep_hrs	-0.1951	0.0672	-2.9038	0.0037	-0.3269	-0.0634
age	0.0174	0.0054	3.1981	0.0014	0.0067	0.0281
exerciseYes	-1.2877	0.2336	-5.5127	0.0000	-1.7457	-0.8298
gen_healthVery good	0.4617	0.2441	1.8914	0.0586	-0.0169	0.9403
gen_healthGood	1.6368	0.2600	6.2953	0.0000	1.1271	2.1465
gen_healthFair	7.0787	0.3616	19.5735	0.0000	6.3697	7.7876
gen_healthPoor	20.5084	0.5423	37.8149	0.0000	19.4452	21.5716
income_cat	-0.1657	0.0472	-3.5115	0.0004	-0.2582	-0.0732

Interpretation: AIC-based backward elimination removed education, BMI, and sex (in that order) from the maximum model, arriving at a model with 9 coefficients besides the intercept (counting dummy variables). These are the same three variables that were non-significant in the maximum model. The retained predictors (mental health days, sleep, age, exercise, general health, and income) are all significant at the 0.05 level; the only retained coefficient with p > 0.05 is the “Very good” level of general health (p = 0.059), which stays in because general health is kept or dropped as a whole factor. The resulting model has Adjusted R² = 0.385, essentially identical to the maximum model (0.384), confirming that the dropped variables contributed negligible explanatory power.

3b. (5 pts) Perform forward selection using step(). Does it arrive at the same model as backward elimination?

# Automated forward selection using AIC
mod_null <- lm(physhlth_days ~ 1, data = brfss_ms)

mod_forward <- step(mod_null,
                    scope = list(lower = mod_null, upper = mod_max),
                    direction = "forward", trace = 1)

## Start:  AIC=20865.24
## physhlth_days ~ 1
## 
##                 Df Sum of Sq    RSS   AIC
## + gen_health     4    115918 208518 18663
## + menthlth_days  1     29743 294693 20387
## + exercise       1     19397 305038 20559
## + income_cat     1     19104 305332 20564
## + education      3      5906 318530 20779
## + age            1      4173 320263 20803
## + bmi            1      4041 320395 20805
## + sleep_hrs      1      3717 320719 20810
## <none>                       324435 20865
## + sex            1         7 324429 20867
## 
## Step:  AIC=18662.93
## physhlth_days ~ gen_health
## 
##                 Df Sum of Sq    RSS   AIC
## + menthlth_days  1    6394.9 202123 18509
## + exercise       1    1652.4 206865 18625
## + income_cat     1    1306.9 207211 18634
## + sleep_hrs      1     756.1 207762 18647
## + bmi            1      91.2 208427 18663
## <none>                       208518 18663
## + sex            1      38.5 208479 18664
## + age            1      32.2 208486 18664
## + education      3     145.0 208373 18666
## 
## Step:  AIC=18509.19
## physhlth_days ~ gen_health + menthlth_days
## 
##              Df Sum of Sq    RSS   AIC
## + exercise    1   1650.52 200472 18470
## + income_cat  1    817.89 201305 18491
## + age         1    464.73 201658 18500
## + sleep_hrs   1    257.79 201865 18505
## + bmi         1     90.51 202032 18509
## <none>                    202123 18509
## + sex         1      3.00 202120 18511
## + education   3    111.58 202011 18512
## 
## Step:  AIC=18470.19
## physhlth_days ~ gen_health + menthlth_days + exercise
## 
##              Df Sum of Sq    RSS   AIC
## + income_cat  1    509.09 199963 18460
## + age         1    333.74 200139 18464
## + sleep_hrs   1    253.06 200219 18466
## <none>                    200472 18470
## + bmi         1     21.21 200451 18472
## + sex         1     10.74 200462 18472
## + education   3     26.94 200445 18476
## 
## Step:  AIC=18459.48
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat
## 
##             Df Sum of Sq    RSS   AIC
## + age        1    321.97 199641 18453
## + sleep_hrs  1    250.25 199713 18455
## <none>                   199963 18460
## + bmi        1     27.98 199935 18461
## + sex        1     27.17 199936 18461
## + education  3     26.66 199937 18465
## 
## Step:  AIC=18453.42
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat + 
##     age
## 
##             Df Sum of Sq    RSS   AIC
## + sleep_hrs  1    336.79 199305 18447
## <none>                   199641 18453
## + sex        1     45.31 199596 18454
## + bmi        1     42.00 199599 18454
## + education  3     22.62 199619 18459
## 
## Step:  AIC=18446.98
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat + 
##     age + sleep_hrs
## 
##             Df Sum of Sq    RSS   AIC
## <none>                   199305 18447
## + sex        1    42.328 199262 18448
## + bmi        1    34.434 199270 18448
## + education  3    24.800 199280 18452

tidy(mod_forward, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Forward Selection Result (AIC-based)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Forward Selection Result (AIC-based)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	3.1864	0.6663	4.7819	0.0000	1.8800	4.4927
gen_healthVery good	0.4617	0.2441	1.8914	0.0586	-0.0169	0.9403
gen_healthGood	1.6368	0.2600	6.2953	0.0000	1.1271	2.1465
gen_healthFair	7.0787	0.3616	19.5735	0.0000	6.3697	7.7876
gen_healthPoor	20.5084	0.5423	37.8149	0.0000	19.4452	21.5716
menthlth_days	0.1461	0.0120	12.1352	0.0000	0.1225	0.1697
exerciseYes	-1.2877	0.2336	-5.5127	0.0000	-1.7457	-0.8298
income_cat	-0.1657	0.0472	-3.5115	0.0004	-0.2582	-0.0732
age	0.0174	0.0054	3.1981	0.0014	0.0067	0.0281
sleep_hrs	-0.1951	0.0672	-2.9038	0.0037	-0.3269	-0.0634

Interpretation: Forward selection arrived at the same final model as backward elimination, including the same 9 predictor terms. The order of entry is informative: general health entered first (the strongest predictor), followed by mental health days, exercise, income, age, and sleep. This ordering reflects each variable’s marginal contribution given the variables already in the model. The convergence of forward and backward methods on the same model increases our confidence in this particular subset, though this convergence is not guaranteed in general.

3c. (5 pts) Compare the backward, forward, and stepwise results in a single table showing the number of variables, Adjusted \(R^2\), AIC, and BIC for each.

mod_stepwise <- step(mod_null,
                     scope = list(lower = mod_null, upper = mod_max),
                     direction = "both", trace = 1)

## Start:  AIC=20865.24
## physhlth_days ~ 1
## 
##                 Df Sum of Sq    RSS   AIC
## + gen_health     4    115918 208518 18663
## + menthlth_days  1     29743 294693 20387
## + exercise       1     19397 305038 20559
## + income_cat     1     19104 305332 20564
## + education      3      5906 318530 20779
## + age            1      4173 320263 20803
## + bmi            1      4041 320395 20805
## + sleep_hrs      1      3717 320719 20810
## <none>                       324435 20865
## + sex            1         7 324429 20867
## 
## Step:  AIC=18662.93
## physhlth_days ~ gen_health
## 
##                 Df Sum of Sq    RSS   AIC
## + menthlth_days  1      6395 202123 18509
## + exercise       1      1652 206865 18625
## + income_cat     1      1307 207211 18634
## + sleep_hrs      1       756 207762 18647
## + bmi            1        91 208427 18663
## <none>                       208518 18663
## + sex            1        38 208479 18664
## + age            1        32 208486 18664
## + education      3       145 208373 18666
## - gen_health     4    115918 324435 20865
## 
## Step:  AIC=18509.19
## physhlth_days ~ gen_health + menthlth_days
## 
##                 Df Sum of Sq    RSS   AIC
## + exercise       1      1651 200472 18470
## + income_cat     1       818 201305 18491
## + age            1       465 201658 18500
## + sleep_hrs      1       258 201865 18505
## + bmi            1        91 202032 18509
## <none>                       202123 18509
## + sex            1         3 202120 18511
## + education      3       112 202011 18512
## - menthlth_days  1      6395 208518 18663
## - gen_health     4     92570 294693 20387
## 
## Step:  AIC=18470.19
## physhlth_days ~ gen_health + menthlth_days + exercise
## 
##                 Df Sum of Sq    RSS   AIC
## + income_cat     1       509 199963 18460
## + age            1       334 200139 18464
## + sleep_hrs      1       253 200219 18466
## <none>                       200472 18470
## + bmi            1        21 200451 18472
## + sex            1        11 200462 18472
## + education      3        27 200445 18476
## - exercise       1      1651 202123 18509
## - menthlth_days  1      6393 206865 18625
## - gen_health     4     78857 279330 20121
## 
## Step:  AIC=18459.48
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat
## 
##                 Df Sum of Sq    RSS   AIC
## + age            1       322 199641 18453
## + sleep_hrs      1       250 199713 18455
## <none>                       199963 18460
## + bmi            1        28 199935 18461
## + sex            1        27 199936 18461
## + education      3        27 199937 18465
## - income_cat     1       509 200472 18470
## - exercise       1      1342 201305 18491
## - menthlth_days  1      5988 205952 18605
## - gen_health     4     72713 272676 20002
## 
## Step:  AIC=18453.42
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat + 
##     age
## 
##                 Df Sum of Sq    RSS   AIC
## + sleep_hrs      1       337 199305 18447
## <none>                       199641 18453
## + sex            1        45 199596 18454
## + bmi            1        42 199599 18454
## + education      3        23 199619 18459
## - age            1       322 199963 18460
## - income_cat     1       497 200139 18464
## - exercise       1      1231 200873 18482
## - menthlth_days  1      6304 205945 18607
## - gen_health     4     68936 268577 19929
## 
## Step:  AIC=18446.98
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat + 
##     age + sleep_hrs
## 
##                 Df Sum of Sq    RSS   AIC
## <none>                       199305 18447
## + sex            1        42 199262 18448
## + bmi            1        34 199270 18448
## + education      3        25 199280 18452
## - sleep_hrs      1       337 199641 18453
## - age            1       409 199713 18455
## - income_cat     1       492 199797 18457
## - exercise       1      1214 200518 18475
## - menthlth_days  1      5882 205186 18590
## - gen_health     4     67980 267285 19906

tidy(mod_stepwise, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Stepwise Selection Result (AIC-based)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Stepwise Selection Result (AIC-based)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	3.1864	0.6663	4.7819	0.0000	1.8800	4.4927
gen_healthVery good	0.4617	0.2441	1.8914	0.0586	-0.0169	0.9403
gen_healthGood	1.6368	0.2600	6.2953	0.0000	1.1271	2.1465
gen_healthFair	7.0787	0.3616	19.5735	0.0000	6.3697	7.7876
gen_healthPoor	20.5084	0.5423	37.8149	0.0000	19.4452	21.5716
menthlth_days	0.1461	0.0120	12.1352	0.0000	0.1225	0.1697
exerciseYes	-1.2877	0.2336	-5.5127	0.0000	-1.7457	-0.8298
income_cat	-0.1657	0.0472	-3.5115	0.0004	-0.2582	-0.0732
age	0.0174	0.0054	3.1981	0.0014	0.0067	0.0281
sleep_hrs	-0.1951	0.0672	-2.9038	0.0037	-0.3269	-0.0634

Interpretation: The stepwise procedure, which allows both addition and removal at each step, also converges on the identical model. In this dataset, no variable that was added early became redundant after later variables entered, so no removals were needed. This three-way agreement (backward = forward = stepwise) is reassuring but should not be taken as proof that this is the “correct” model. All three methods optimize the same criterion (AIC) on the same data.
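
Since all three runs optimize AIC, one simple check is to repeat the search under a different penalty. `step()` takes a `k` argument for the penalty per parameter (default 2), and setting `k = log(n)` gives BIC-based selection. The following is a minimal sketch on simulated data; the variables `x1`, `x2`, `x3` are hypothetical.

```r
# Simulated stand-in data (hypothetical): x1 strong, x2 weak, x3 pure noise
set.seed(2026)
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 2 + 1.0 * sim$x1 + 0.3 * sim$x2 + rnorm(n)

fit_full <- lm(y ~ x1 + x2 + x3, data = sim)

# Backward elimination under the AIC penalty (k = 2, the default)
fit_aic <- step(fit_full, direction = "backward", trace = 0)

# Same search under the BIC penalty (k = log(n))
fit_bic <- step(fit_full, direction = "backward", k = log(n), trace = 0)

terms_aic <- attr(terms(fit_aic), "term.labels")
terms_bic <- attr(terms(fit_bic), "term.labels")
print(list(AIC = terms_aic, BIC = terms_bic))
```

If both penalties lead to the same set of terms, the selection is less likely to be an artifact of one criterion, though both searches still use the same data.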

Comparing All Selection Methods

method_comparison <- tribble(
  ~Method, ~`Variables selected`, ~`Adj. R²`, ~AIC, ~BIC,
  "Maximum model",
    length(coef(mod_max)) - 1,
    round(glance(mod_max)$adj.r.squared, 4),
    round(AIC(mod_max), 1),
    round(BIC(mod_max), 1),
  "Backward (AIC)",
    length(coef(mod_backward)) - 1,
    round(glance(mod_backward)$adj.r.squared, 4),
    round(AIC(mod_backward), 1),
    round(BIC(mod_backward), 1),
  "Forward (AIC)",
    length(coef(mod_forward)) - 1,
    round(glance(mod_forward)$adj.r.squared, 4),
    round(AIC(mod_forward), 1),
    round(BIC(mod_forward), 1),
  "Stepwise (AIC)",
    length(coef(mod_stepwise)) - 1,
    round(glance(mod_stepwise)$adj.r.squared, 4),
    round(AIC(mod_stepwise), 1),
    round(BIC(mod_stepwise), 1)
)

method_comparison |>
  kable(caption = "Comparison of Variable Selection Methods") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Comparison of Variable Selection Methods
Method	Variables selected	Adj. R²	AIC	BIC
Maximum model	14	0.3843	32645.8	32750.1
Backward (AIC)	9	0.3846	32638.4	32710.1
Forward (AIC)	9	0.3846	32638.4	32710.1
Stepwise (AIC)	9	0.3846	32638.4	32710.1

Interpretation: All three automated selection methods (backward, forward, and stepwise) arrived at the same model, which has 9 coefficient terms from 6 variables (general health contributes four dummy variables), an Adjusted R² of 0.3846, AIC of 32,638.4, and BIC of 32,710.1. This selected model performs slightly better than the maximum model, which had a higher AIC (32,645.8) and a noticeably higher BIC (32,750.1). The improvement in BIC is larger because BIC penalizes unnecessary variables more strongly, indicating that removing sex, education, and BMI made the model more parsimonious without reducing explanatory power. In practice, the maximum model and the selected model would produce very similar predictions, but the selected model is preferred for its parsimony.
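
The AIC values printed in the `step()` traces (for example 18446.98 for the final model) are on a different scale from those in the table (32,638.4). `step()` relies on `extractAIC()`, which for linear models omits an additive constant that `AIC()` includes. The following is a minimal sketch on simulated data (the data frame `sim` is hypothetical) showing that the two differ by \(n(1 + \log 2\pi) + 2\):

```r
# Simulated stand-in data (hypothetical)
set.seed(1)
n <- 100
sim <- data.frame(x = rnorm(n))
sim$y <- 1 + 0.5 * sim$x + rnorm(n)

fit <- lm(y ~ x, data = sim)

step_scale <- extractAIC(fit)[2]   # the scale shown in step() traces
full_scale <- AIC(fit)             # the scale reported by AIC() and glance()

# Constant dropped by extractAIC() for lm fits: the likelihood constant
# n * (1 + log(2 * pi)) plus 2 for the error-variance parameter
offset <- n * (1 + log(2 * pi)) + 2

print(c(step_scale = step_scale, full_scale = full_scale, offset = offset))
print(all.equal(full_scale, step_scale + offset))
```

With n = 5000 in this lab, the offset is about 14,191.39, and 18,446.98 + 14,191.39 = 32,638.37, which matches the table. Because the constant is the same for every model fitted to the same data, the ranking of models is unaffected.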

3d. (5 pts) List three reasons why you should not blindly trust the results of automated variable selection. Which of these concerns is most relevant for epidemiological research?

Three reasons why we should not blindly trust automated variable selection:
1. It ignores the causal structure. Automated procedures choose variables based on statistical criteria, not on whether a variable is a confounder or an effect modifier (EM). This can lead to adjusting for the wrong variables and producing biased estimates.
2. Small, random sample fluctuations can change the selected model. Different samples from the same population can lead to different “best” models, especially when predictors are correlated. This makes automated selection unstable and overly sensitive to noise.
3. It prioritizes prediction, not interpretation. Criteria like AIC and BIC aim to improve model fit, not to answer epidemiologic questions about exposure effects. They may drop important confounders simply because they don’t improve prediction.

Epidemiology is fundamentally about estimating causal effects, which requires thoughtful decisions about which variables to adjust for. The first concern, ignoring the causal structure, is therefore the most relevant for epidemiological research.
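The instability concern can be illustrated with a small resampling experiment. This is a sketch only: it uses the built-in mtcars data rather than the BRFSS data, and the seed, the number of resamples, and the five candidate predictors are all arbitrary choices made for illustration.

```r
# Illustration only: refit backward selection on bootstrap resamples of mtcars
set.seed(553)  # arbitrary seed so the sketch is reproducible

n_boot     <- 50
predictors <- c("wt", "hp", "disp", "qsec", "drat")

selected <- sapply(seq_len(n_boot), function(i) {
  boot_data <- mtcars[sample(nrow(mtcars), replace = TRUE), ]
  full_fit  <- lm(mpg ~ wt + hp + disp + qsec + drat, data = boot_data)
  step_fit  <- step(full_fit, direction = "backward", trace = 0)
  # TRUE for each predictor that survived selection in this resample
  predictors %in% attr(terms(step_fit), "term.labels")
})
rownames(selected) <- predictors

# Proportion of resamples in which each predictor was retained
round(rowMeans(selected), 2)
```

If selection were stable, every proportion would be close to 0 or 1. Intermediate values mean the "best" model depends on which observations happened to be sampled.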

Task 4: Associative Model Building (25 points)

For this task, the exposure is sleep_hrs and the outcome is physhlth_days. You are building an associative model to estimate the effect of sleep on physical health.

4a. (5 pts) Fit the crude model: physhlth_days ~ sleep_hrs. Report the sleep coefficient.

# Exposure: sleep hrs; Outcome: physhlth_days

# Crude (unadjusted) model: exposure only
mod_crude <- lm(physhlth_days ~ sleep_hrs,
                data = brfss_ms)

# Extract the key information
crude_coef <- tidy(mod_crude, conf.int = TRUE)
crude_coef |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Task 4a: Crude Model - Sleep Effect on Physical Health",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Task 4a: Crude Model - Sleep Effect on Physical Health
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	7.9110	0.5959	13.2755	0	6.7428	9.0793
sleep_hrs	-0.6321	0.0831	-7.6104	0	-0.7949	-0.4693

#extract the coefficient for sleep_hrs
b_crude <- coef(mod_crude)["sleep_hrs"]

#report the results
cat("Exposure coefficient in crude model:",  round(b_crude, 3), "days per hour of sleep ***\n")

## Exposure coefficient in crude model: -0.632 days per hour of sleep ***

Exposure coefficient in crude model: -0.632

The crude estimate tells us that each additional hour of sleep is associated with about 0.63 fewer physically unhealthy days in the past 30 days. It is “crude” because it does not account for other variables that might confound the relationship. Confounding matters here because healthier people may both sleep more and report fewer unhealthy days; if so, the crude estimate overstates sleep’s benefit.
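The way a common cause can inflate a crude estimate can be sketched with simulated data. Everything here is hypothetical: the variable `health`, the effect sizes, and the sample size are invented for illustration and are not estimates from BRFSS.

```r
# Simulated illustration: a common cause inflates the crude estimate
set.seed(1)  # arbitrary seed
n <- 5000

health <- rnorm(n)                                  # hypothetical common cause
sleep  <- 7 + 0.5 * health + rnorm(n)               # healthier people sleep more
days   <- 5 - 0.3 * sleep - 2 * health + rnorm(n)   # true sleep effect is -0.3

b_crude    <- coef(lm(days ~ sleep))["sleep"]
b_adjusted <- coef(lm(days ~ sleep + health))["sleep"]

round(c(crude = unname(b_crude), adjusted = unname(b_adjusted)), 3)
```

In this setup the crude coefficient is far more negative than the true value of -0.3, while the model that adjusts for `health` recovers it. The crude estimate absorbs part of the effect of the common cause.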

4b. (10 pts) Fit the maximum associative model: physhlth_days ~ sleep_hrs + [all other covariates]. Note the adjusted sleep coefficient and compute the 10% interval. Then systematically remove each covariate one at a time and determine which are confounders using the 10% rule. Present your results in a summary table.

# Exposure: sleep hrs; Outcome: physhlth_days

# Maximum associative model
mod_assoc_max <- lm(physhlth_days ~ sleep_hrs + menthlth_days + exercise + age +
                      sex + education + income_cat + bmi,
                    data = brfss_ms)

# Display full model results
tidy(mod_assoc_max, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Maximum Associative Model: All Covariates Included",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Maximum Associative Model: All Covariates Included
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	6.2589	0.9741	6.4250	0.0000	4.3491	8.1687
sleep_hrs	-0.3593	0.0775	-4.6374	0.0000	-0.5112	-0.2074
menthlth_days	0.2697	0.0135	20.0226	0.0000	0.2433	0.2961
exerciseYes	-3.0688	0.2689	-11.4119	0.0000	-3.5960	-2.5416
age	0.0589	0.0061	9.5856	0.0000	0.0469	0.0710
sexFemale	-0.6847	0.2096	-3.2665	0.0011	-1.0956	-0.2738
educationHS graduate	-0.2873	0.4950	-0.5804	0.5617	-1.2577	0.6831
educationSome college	-0.3355	0.4982	-0.6735	0.5007	-1.3121	0.6411
educationCollege graduate	-0.4557	0.5014	-0.9089	0.3635	-1.4386	0.5272
income_cat	-0.5385	0.0566	-9.5074	0.0000	-0.6495	-0.4274
bmi	0.0670	0.0164	4.0964	0.0000	0.0349	0.0991

# Get the adjusted sleep coefficient
b_exposure_max <- coef(mod_assoc_max)["sleep_hrs"]
cat("\n*** Adjusted sleep effect (all covariates): ", round(b_exposure_max, 4), " days per hour ***\n")

## 
## *** Adjusted sleep effect (all covariates):  -0.3593  days per hour ***

The adjusted sleep coefficient from the maximum associative model is -0.36 days per hour, compared with -0.632 in the crude model. Holding all other covariates constant, each extra hour of sleep is associated with 0.36 fewer physically unhealthy days. The adjusted estimate is smaller in magnitude than the crude one because part of the crude association reflected confounding rather than sleep itself.

# The 10% rule says: if removing a covariate changes the sleep coefficient by >10%,
# that covariate is a confounder and should be kept


interval_low <- b_exposure_max - 0.10 * abs(b_exposure_max)
interval_high <- b_exposure_max + 0.10 * abs(b_exposure_max)

cat("Exposure coefficient in maximum model:", round(b_exposure_max, 4), "\n")

## Exposure coefficient in maximum model: -0.3593

cat("10% interval: (", round(interval_low, 4), ",", round(interval_high, 4), ")\n\n")

## 10% interval: ( -0.3952 , -0.3233 )

# Systematically remove one covariate at a time
covariates_to_test <- c("menthlth_days", "exercise", "age", "sex",
                         "education", "income_cat", "bmi")


# Create the assessment table
assoc_table <- map_dfr(covariates_to_test, \(cov) {
  
  # Build formula without this covariate
  remaining <- setdiff(covariates_to_test, cov)
  form <- as.formula(paste("physhlth_days ~ sleep_hrs +", paste(remaining, collapse = " + ")))
  mod_reduced <- lm(form, data = brfss_ms)
  
  # Get the sleep coefficient in this reduced model
  b_reduced <- coef(mod_reduced)["sleep_hrs"]
  
  # Calculate the % change in the sleep coefficient
  pct_change <- (b_reduced - b_exposure_max) / abs(b_exposure_max) * 100

  tibble(
    `Removed covariate` = cov,
    `Sleep hours (max)` = round(b_exposure_max, 4),
    `Sleep hours (without)` = round(b_reduced, 4),
    `% Change` = round(pct_change, 1),
    `Within 10%?` = ifelse(abs(pct_change) <= 10, "Yes (drop)", "No (keep)"),
    Confounder = ifelse(abs(pct_change) > 10, "Yes", "No")
  )
})

assoc_table |>
  kable(caption = "Associative Model: Systematic Confounder Assessment for sleep Hours") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
  column_spec(6, bold = TRUE)

Associative Model: Systematic Confounder Assessment for sleep Hours
Removed covariate	Sleep hours (max)	Sleep hours (without)	% Change	Within 10%?	Confounder
menthlth_days	-0.3593	-0.5804	-61.6	No (keep)	Yes
exercise	-0.3593	-0.3779	-5.2	Yes (drop)	No
age	-0.3593	-0.2733	23.9	No (keep)	Yes
sex	-0.3593	-0.3633	-1.1	Yes (drop)	No
education	-0.3593	-0.3611	-0.5	Yes (drop)	No
income_cat	-0.3593	-0.3723	-3.6	Yes (drop)	No
bmi	-0.3593	-0.3738	-4.0	Yes (drop)	No

Interpretation: The confounder assessment shows that most covariates have only a small influence on the sleep–physical health association, but two, mental health days and age, substantially change the sleep coefficient when removed. Removing mental health days shifts the sleep coefficient from –0.36 to –0.58 (a 62% change), and removing age shifts it to –0.27 (a 24% change). Both changes exceed the 10% threshold, so by the 10% rule mental health days and age are treated as confounders of the sleep–health relationship and remain in the model. In contrast, removing exercise, sex, education, income, or BMI changes the sleep coefficient by less than 10%, so these variables do not meaningfully confound the association and can be dropped from the associative model. The final associative model therefore includes sleep hours, mental health days, and age.
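As an arithmetic check, the 10% rule can be reapplied directly to the rounded coefficients in the table above. This is a sketch with the values typed in by hand. Because it starts from rounded inputs, the percentages can differ from the table in the last digit.

```r
# Coefficients copied from the rounded confounder assessment table
b_max <- -0.3593
b_without <- c(menthlth_days = -0.5804, exercise = -0.3779, age = -0.2733,
               sex = -0.3633, education = -0.3611, income_cat = -0.3723,
               bmi = -0.3738)

# Percent change in the sleep coefficient when each covariate is removed
pct_change <- (b_without - b_max) / abs(b_max) * 100

# 10% rule: a change larger than 10% in either direction flags a confounder
is_confounder <- abs(pct_change) > 10

round(pct_change, 1)
names(b_without)[is_confounder]
```

Only `menthlth_days` and `age` exceed the threshold, which matches the table.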

4c. (5 pts) Fit the final associative model including only sleep and the identified confounders. Report the sleep coefficient and its 95% CI.

# Exposure: sleep hrs; Outcome: physhlth_days

# Confounders identified in 4b by the 10% rule: menthlth_days, age
confounders_identified <- c("menthlth_days", "age")

# Build the formula with sleep + only the confounders
final_formula <- paste(
  "physhlth_days ~ sleep_hrs + ",
  paste(confounders_identified, collapse = " + ")
)

# Selected associative model
mod_assoc_final <- lm(as.formula(final_formula),
                      data = brfss_ms)

# Display full model results
tidy(mod_assoc_final, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Final Associative Model: Identified Covariates Included",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Final Associative Model: Identified Covariates Included
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	1.2870	0.6457	1.9932	0.0463	0.0211	2.5529
sleep_hrs	-0.4473	0.0800	-5.5903	0.0000	-0.6042	-0.2904
menthlth_days	0.3119	0.0136	22.9250	0.0000	0.2852	0.3385
age	0.0755	0.0063	12.0663	0.0000	0.0633	0.0878

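# Get the adjusted sleep coefficient from the final model
# (stored under its own name so the maximum-model coefficient is not overwritten)
b_exposure_final <- coef(mod_assoc_final)["sleep_hrs"]
cat("\n*** Adjusted sleep effect (two covariates): ", round(b_exposure_final, 4), " days per hour ***\n")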

## 
## *** Adjusted sleep effect (two covariates):  -0.4473  days per hour ***

Interpretation: In the final model, which includes only sleep hours and the two identified confounders (mental health days and age), sleep remained associated with physically unhealthy days. Each additional hour of sleep was associated with 0.45 fewer physically unhealthy days per month (95% CI: –0.60 to –0.29). Even after accounting for differences in mental health and age, people who sleep more tend to report better physical health. The confidence interval lies entirely below zero, so the data are not compatible with a null association at the 5% level, and its width of roughly 0.3 days indicates a reasonably precise estimate.
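The reported interval can be reproduced approximately from the estimate and standard error in the table. This is a sketch that assumes a large-sample normal critical value, because the residual degrees of freedom are not shown in this output. `tidy()` uses the exact t quantile, so small differences in the fourth decimal are expected.

```r
# Values copied from the final associative model table
estimate <- -0.4473  # sleep_hrs coefficient
std_err  <- 0.0800   # its standard error

# Assumption: the normal quantile stands in for the exact t quantile
crit <- qnorm(0.975)

ci <- estimate + c(-1, 1) * crit * std_err
round(ci, 3)
```

The result agrees with the reported bounds of -0.6042 and -0.2904 to about three decimal places.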

4d. (5 pts) A reviewer asks: “Why didn’t you just use stepwise selection?” Write a 3–4 sentence response explaining why automated selection is inappropriate for this associative analysis.

Automated stepwise selection is not appropriate for this associative analysis because it chooses variables based purely on statistical criteria rather than on whether they are true confounders. This can lead to adjusting for mediators, which would distort the exposure–outcome relationship rather than clarify it. Stepwise methods are also unstable: small changes in the data can produce different “best” models, which makes them unreliable for causal interpretation. Because our goal was to estimate the association between sleep and physical health, not to optimize prediction, we relied on a confounder‑based approach grounded in the causal structure instead of automated selection.

Task 5: Synthesis (20 points)

5a. (10 pts) You have now built two models for the same data:

A predictive model (from Task 2 or 3, the best model by AIC/BIC)
An associative model (from Task 4, focused on sleep)

Predictive vs. Associative Model Building
Feature	Predictive	Associative
Exposure variable	No fixed exposure	Always in the model
Covariate selection	Based on statistical fit	Based on confounding assessment
Automated methods	Useful (with caution)	Generally inappropriate
10% change rule	Not used	Primary tool
Interaction terms	Include if improves prediction	Include if effect modification is present
Primary criterion	Adj. R², AIC, BIC	Validity of exposure β
Parsimony	Fewer variables = less overfitting	Fewer variables = more efficient, if not confounders

Compare these two models: Do they include the same variables? Is the sleep coefficient similar? Why might they differ?

# Compare sleep coefficients
# Get sleep coefficient from predictive model
mod_pred <- lm(physhlth_days ~ menthlth_days + sleep_hrs + age + exercise +
                 gen_health + income_cat,
               data = brfss_ms)
sleep_pred <- coef(mod_pred)["sleep_hrs"]

# Get sleep coefficient from associative model
sleep_assoc <- coef(mod_assoc_final)["sleep_hrs"]

cat("\n=== SLEEP COEFFICIENT COMPARISON ===\n")

## 
## === SLEEP COEFFICIENT COMPARISON ===

cat("Predictive model:    ", round(sleep_pred, 4), "\n")

## Predictive model:     -0.1951

cat("Associative model:   ", round(sleep_assoc, 4), "\n")

## Associative model:    -0.4473

cat("Difference:          ", round(abs(sleep_pred - sleep_assoc), 4), "\n")

## Difference:           0.2522

The predictive model and the associative model do not include the same variables. The predictive model includes sleep, mental health days, age, exercise, general health, and income, while the associative model includes only sleep and the two identified confounders (mental health days and age). Because the associative model leaves out non‑confounders and variables that may lie on the pathway between sleep and physical health (such as general health), the sleep coefficient is much larger in magnitude (–0.447) than in the predictive model (–0.195). These differences occur because the predictive model is optimized for statistical fit, while the associative model is designed to isolate the relationship between sleep and physical health by adjusting only for the variables identified as confounders.
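One mechanism that could produce a gap like this is adjustment for a variable on the pathway from exposure to outcome. The lab does not establish that general health or exercise are mediators, so the following is a simulated sketch under that assumption only. The variable `mediator` and all effect sizes are hypothetical.

```r
# Simulated illustration: adjusting for a mediator attenuates the exposure coefficient
set.seed(2)  # arbitrary seed
n <- 5000

sleep    <- rnorm(n, mean = 7, sd = 1)
mediator <- 0.5 * sleep + rnorm(n)                        # hypothetical pathway variable
days     <- 10 - 0.2 * sleep - 0.6 * mediator + rnorm(n)

# Total effect of sleep: -0.2 + (-0.6 * 0.5) = -0.5
b_total  <- coef(lm(days ~ sleep))["sleep"]

# Direct effect only, once the mediator is held fixed: -0.2
b_direct <- coef(lm(days ~ sleep + mediator))["sleep"]

round(c(total = unname(b_total), direct = unname(b_direct)), 3)
```

Including the pathway variable shrinks the sleep coefficient toward its direct component. That is harmless for prediction but changes what the coefficient means in an associative analysis.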

5b. (10 pts) Write a 4–5 sentence paragraph for a public health audience describing the results of your associative model. Include:

The adjusted effect of sleep on physical health days
Which variables needed to be accounted for (confounders)
The direction and approximate magnitude of the association
A caveat about cross-sectional data

Do not use statistical jargon.

Answer: More sleep was linked with better physical health: people who slept longer reported fewer days each month when their physical health was poor. After accounting for differences in age and mental health, two factors related to both sleep and physical health, the association remained. On average, each extra hour of sleep was tied to roughly half a day fewer of poor physical health per month. This suggests that sleep may play an important role in overall well‑being, but because the data were collected at a single point in time (cross-sectional), we cannot say whether more sleep leads to better health or whether people in better health simply sleep more.

End of Lab Activity

Model Selection

EPI 553 — Principles of Statistical Inference II (Spring 2026)

Muntasir Masum

March 24, 2026