In previous lectures, we learned how to fit multiple linear regression models, include dummy variables for categorical predictors, test for interactions, and assess confounding. But we have not yet addressed a fundamental question: how do we decide which variables belong in the model?
This question has different answers depending on the goal of the analysis:
| Goal | What matters | Variable selection driven by |
|---|---|---|
| Prediction | Model accuracy and reliability in new data | Statistical criteria (Adj. \(R^2\), AIC, BIC, cross-validation) |
| Association | Validity of the exposure coefficient | Subject-matter knowledge, confounding assessment, 10% rule |
In predictive modeling, we search for the subset of variables that best predicts \(Y\) without overfitting. In associative modeling, the exposure variable is always in the model, and we decide which covariates to include based on whether they are confounders.
This lecture covers both approaches, with emphasis on when each is appropriate and the pitfalls of automated selection.
library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(car)
library(leaps)
library(MASS)
options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))
We continue with the BRFSS 2020 dataset, predicting physically unhealthy days from a pool of candidate predictors.
brfss_full <- read_xpt(
"C:/Users/tahia/OneDrive/Desktop/UAlbany PhD/Epi 553/LLCP2020XPT/LLCP2020.XPT"
) |>
clean_names()
brfss_ms <- brfss_full |>
mutate(
# Outcome
physhlth_days = case_when(
physhlth == 88 ~ 0,
physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
TRUE ~ NA_real_
),
# Candidate predictors
menthlth_days = case_when(
menthlth == 88 ~ 0,
menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
TRUE ~ NA_real_
),
sleep_hrs = case_when(
sleptim1 >= 1 & sleptim1 <= 14 ~ as.numeric(sleptim1),
TRUE ~ NA_real_
),
age = age80,
sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
education = factor(case_when(
educa %in% c(1, 2, 3) ~ "Less than HS",
educa == 4 ~ "HS graduate",
educa == 5 ~ "Some college",
educa == 6 ~ "College graduate",
TRUE ~ NA_character_
), levels = c("Less than HS", "HS graduate", "Some college", "College graduate")),
exercise = factor(case_when(
exerany2 == 1 ~ "Yes",
exerany2 == 2 ~ "No",
TRUE ~ NA_character_
), levels = c("No", "Yes")),
gen_health = factor(case_when(
genhlth == 1 ~ "Excellent",
genhlth == 2 ~ "Very good",
genhlth == 3 ~ "Good",
genhlth == 4 ~ "Fair",
genhlth == 5 ~ "Poor",
TRUE ~ NA_character_
), levels = c("Excellent", "Very good", "Good", "Fair", "Poor")),
income_cat = case_when(
income2 %in% 1:8 ~ as.numeric(income2),
TRUE ~ NA_real_
),
bmi = ifelse(bmi5 > 0, bmi5 / 100, NA_real_)
) |>
filter(
!is.na(physhlth_days), !is.na(menthlth_days), !is.na(sleep_hrs),
!is.na(age), age >= 18, !is.na(sex), !is.na(education),
!is.na(exercise), !is.na(gen_health), !is.na(income_cat), !is.na(bmi)
)
set.seed(1220)
brfss_ms <- brfss_ms |>
dplyr::select(physhlth_days, menthlth_days, sleep_hrs, age, sex,
education, exercise, gen_health, income_cat, bmi) |>
slice_sample(n = 5000)
# Save for lab
saveRDS(brfss_ms,
"C:/Users/tahia/OneDrive/Desktop/UAlbany PhD/Epi 553/Lab 11/brfss_ms_2020.rds")
tibble(Metric = c("Observations", "Variables"),
Value = c(nrow(brfss_ms), ncol(brfss_ms))) |>
kable(caption = "Analytic Dataset Dimensions") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| Metric | Value |
|---|---|
| Observations | 5000 |
| Variables | 10 |
We have 10 variables: 1 outcome and 9 candidate predictors. If we considered all possible subsets of the 9 predictors (ignoring interactions and transformations), there would be \(2^9 - 1 = 511\) possible models.
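The count of candidate models can be checked directly. A minimal sketch in base R; the names `k`, `subsets_by_size`, and `n_models` are illustrative and do not appear elsewhere in the lecture:

```r
# Number of candidate predictors in the maximum model
k <- 9

# Count the non-empty subsets by size: choose(k, 1) + ... + choose(k, k)
subsets_by_size <- choose(k, 1:k)
n_models <- sum(subsets_by_size)

n_models             # 511
n_models == 2^k - 1  # TRUE
```

The count grows exponentially in the number of predictors, which is why exhaustive search becomes impractical for large candidate pools.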
The maximum model is the model that includes all candidate predictor variables. It represents the upper bound of complexity. The “correct” model will have \(p \leq k\) predictors, where \(k\) is the number in the maximum model.
The candidate variables in the maximum model can include:
These candidates are chosen based on a literature search and the research question, not by throwing in every available variable.
# The maximum model with all candidate predictors
mod_max <- lm(physhlth_days ~ menthlth_days + sleep_hrs + age + sex +
education + exercise + gen_health + income_cat + bmi,
data = brfss_ms)
tidy(mod_max, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Maximum Model: All Candidate Predictors",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 2.6902 | 0.8556 | 3.1441 | 0.0017 | 1.0128 | 4.3676 |
| menthlth_days | 0.1472 | 0.0121 | 12.1488 | 0.0000 | 0.1235 | 0.1710 |
| sleep_hrs | -0.1930 | 0.0673 | -2.8679 | 0.0041 | -0.3249 | -0.0611 |
| age | 0.0180 | 0.0055 | 3.2969 | 0.0010 | 0.0073 | 0.0288 |
| sexFemale | -0.1889 | 0.1820 | -1.0376 | 0.2995 | -0.5458 | 0.1680 |
| educationHS graduate | 0.2508 | 0.4297 | 0.5836 | 0.5595 | -0.5917 | 1.0933 |
| educationSome college | 0.3463 | 0.4324 | 0.8009 | 0.4233 | -0.5014 | 1.1940 |
| educationCollege graduate | 0.3336 | 0.4357 | 0.7657 | 0.4439 | -0.5206 | 1.1878 |
| exerciseYes | -1.2866 | 0.2374 | -5.4199 | 0.0000 | -1.7520 | -0.8212 |
| gen_healthVery good | 0.4373 | 0.2453 | 1.7824 | 0.0747 | -0.0437 | 0.9183 |
| gen_healthGood | 1.5913 | 0.2651 | 6.0022 | 0.0000 | 1.0716 | 2.1111 |
| gen_healthFair | 7.0176 | 0.3682 | 19.0586 | 0.0000 | 6.2957 | 7.7394 |
| gen_healthPoor | 20.4374 | 0.5469 | 37.3722 | 0.0000 | 19.3653 | 21.5095 |
| income_cat | -0.1817 | 0.0503 | -3.6092 | 0.0003 | -0.2803 | -0.0830 |
| bmi | 0.0130 | 0.0145 | 0.8997 | 0.3683 | -0.0153 | 0.0414 |
glance(mod_max) |>
dplyr::select(r.squared, adj.r.squared, sigma, AIC, BIC, df.residual) |>
mutate(across(everything(), \(x) round(x, 3))) |>
kable(caption = "Maximum Model: Fit Statistics") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| r.squared | adj.r.squared | sigma | AIC | BIC | df.residual |
|---|---|---|---|---|---|
| 0.386 | 0.384 | 6.321 | 32645.79 | 32750.06 | 4985 |
Interpretation: The maximum model explains approximately 38.6% of the variance in physically unhealthy days (R² = 0.386, Adjusted R² = 0.384). The strongest predictors are general health status (with “Poor” health associated with about 20 more unhealthy days compared to “Excellent”) and mental health days (each additional mentally unhealthy day is associated with 0.15 more physically unhealthy days). Exercise is also strongly associated, with exercisers reporting about 1.3 fewer physically unhealthy days. Several variables, including sex (p = 0.30), education (p > 0.40 for all levels), and BMI (p = 0.37), are not statistically significant, suggesting they may be candidates for removal in a more parsimonious model. The AIC is 32,645.8 and BIC is 32,750.1; these serve as baselines for comparing simpler models.
The goal of model building is to find the right balance:
| Problem | What happens | Consequence |
|---|---|---|
| Overfitting | Including variables with \(\beta = 0\) | No bias, but increased collinearity, inflated SEs, poor out-of-sample prediction |
| Underfitting | Omitting variables with \(\beta \neq 0\) | Bias in the remaining coefficients (omitted variable bias) |
Key insight: Underfitting is worse than overfitting in terms of bias. An overfit model gives unbiased estimates (just less precise), while an underfit model gives biased estimates. However, for prediction, overfitting degrades out-of-sample performance.
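The contrast between the two problems can be illustrated with a small simulation. This is a sketch on made-up data, not the BRFSS sample: the true coefficients (2 for `x1`, 3 for `x2`), the correlation between `x1` and `x2`, and the seed are all assumptions chosen for illustration.

```r
set.seed(553)  # arbitrary seed so the sketch is reproducible

n  <- 10000
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)   # relevant predictor, correlated with x1
x3 <- rnorm(n)              # irrelevant predictor (true beta = 0)
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)

fit_true  <- lm(y ~ x1 + x2)        # correctly specified
fit_over  <- lm(y ~ x1 + x2 + x3)   # overfit: includes a useless variable
fit_under <- lm(y ~ x1)             # underfit: omits x2

# Estimated coefficient on x1 under each specification (true value is 2)
b_x1 <- c(
  true  = unname(coef(fit_true)["x1"]),
  over  = unname(coef(fit_over)["x1"]),
  under = unname(coef(fit_under)["x1"])
)
round(b_x1, 2)
```

Under these assumptions the overfit model should recover a coefficient close to 2, while the underfit model should drift toward 2 + 3 × 0.5 = 3.5, because `x1` absorbs part of the effect of the omitted `x2`.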
The objective is a parsimonious model: the simplest model that captures the important relationships without unnecessary complexity.
The error degrees of freedom must be positive: \(n - k - 1 > 0\), meaning \(n > k + 1\).
Rules of thumb for the minimum sample size:
| Rule | Requirement | Our data (n = 5,000) |
|---|---|---|
| Minimum 10 error df | \(n \geq k + 11\) | Can include up to 4,989 predictors |
| 5 observations per predictor | \(n \geq 5k\) | Can include up to 1,000 predictors |
| 10 observations per predictor | \(n \geq 10k\) | Can include up to 500 predictors |
With \(n = 5,000\), we are well within all rules of thumb for our 9 candidate predictors (plus dummy variables).
Caution with categorical variables: A categorical predictor with \(m\) levels uses \(m - 1\) degrees of freedom, not just 1. Our education (4 levels) uses 3 df and gen_health (5 levels) uses 4 df, so the 9 candidate variables in the maximum model actually use 14 predictor df.
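To see where the extra degrees of freedom come from, it helps to look at the design matrix R builds for a factor. The sketch below uses a toy data frame with made-up outcome values; only the four education labels are taken from the lecture.

```r
# Toy data: a 4-level factor standing in for education (y values are made up)
edu_levels <- c("Less than HS", "HS graduate", "Some college", "College graduate")
toy <- data.frame(
  y   = c(3, 5, 2, 8, 6, 4, 7, 1),
  edu = factor(rep(edu_levels, times = 2), levels = edu_levels)
)

# model.matrix() expands the factor into dummy columns (reference level omitted)
X <- model.matrix(y ~ edu, data = toy)
colnames(X)

n_dummies <- ncol(X) - 1   # subtract the intercept column
n_dummies                  # 3
```

One variable in the formula becomes three columns in the design matrix, which is what counts against the error degrees of freedom.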
Given a set of candidate models, we need a criterion to compare them. We cover six: \(R^2\), Adjusted \(R^2\), \(F_p\) (partial F-test), AIC, BIC, and MSE(p).
\[R^2 = 1 - \frac{SSE}{SST} = \frac{SSR}{SST}\]
\(R^2\) measures the proportion of variance in \(Y\) explained by the model. However, \(R^2\) always increases (or stays the same) when you add a predictor, regardless of whether it is useful. This makes raw \(R^2\) useless for model comparison across models of different sizes.
# Demonstrate that R2 always increases
models <- list(
"Sleep only" = lm(physhlth_days ~ sleep_hrs, data = brfss_ms),
"+ age" = lm(physhlth_days ~ sleep_hrs + age, data = brfss_ms),
"+ sex" = lm(physhlth_days ~ sleep_hrs + age + sex, data = brfss_ms),
"+ education" = lm(physhlth_days ~ sleep_hrs + age + sex + education, data = brfss_ms),
"+ exercise" = lm(physhlth_days ~ sleep_hrs + age + sex + education + exercise, data = brfss_ms),
"+ gen_health" = lm(physhlth_days ~ sleep_hrs + age + sex + education + exercise + gen_health, data = brfss_ms),
"+ mental health" = lm(physhlth_days ~ sleep_hrs + age + sex + education + exercise + gen_health + menthlth_days, data = brfss_ms),
"+ income" = lm(physhlth_days ~ sleep_hrs + age + sex + education + exercise + gen_health + menthlth_days + income_cat, data = brfss_ms),
"+ BMI (full)" = lm(physhlth_days ~ sleep_hrs + age + sex + education + exercise + gen_health + menthlth_days + income_cat + bmi, data = brfss_ms)
)
r2_table <- map_dfr(names(models), \(name) {
g <- glance(models[[name]])
tibble(
Model = name,
p = length(coef(models[[name]])) - 1,
`R²` = round(g$r.squared, 4),
`Adj. R²` = round(g$adj.r.squared, 4),
AIC = round(g$AIC, 1),
BIC = round(g$BIC, 1)
)
})
r2_table |>
kable(caption = "Model Comparison: R² Always Increases as Predictors Are Added") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Model | p | R² | Adj. R² | AIC | BIC |
|---|---|---|---|---|---|
| Sleep only | 1 | 0.0115 | 0.0113 | 35001.0 | 35020.6 |
| + age | 2 | 0.0280 | 0.0276 | 34918.7 | 34944.8 |
| + sex | 3 | 0.0280 | 0.0274 | 34920.7 | 34953.3 |
| + education | 6 | 0.0440 | 0.0428 | 34843.7 | 34895.9 |
| + exercise | 7 | 0.0849 | 0.0836 | 34626.8 | 34685.5 |
| + gen_health | 11 | 0.3650 | 0.3636 | 32807.7 | 32892.4 |
| + mental health | 12 | 0.3843 | 0.3828 | 32655.4 | 32746.6 |
| + income | 13 | 0.3859 | 0.3843 | 32644.6 | 32742.4 |
| + BMI (full) | 14 | 0.3860 | 0.3843 | 32645.8 | 32750.1 |
Interpretation: R² increases (or holds steady) at every step, from 0.012 (sleep only) to 0.386 (full model). Adjusted R² tells a different story: it dips slightly when sex is added (0.0276 to 0.0274), plateaus at 0.384 once income enters (13 predictor df), and does not improve when BMI is added. The largest single jump occurs when general health is added (R² rises from 0.085 to 0.365), indicating it is by far the most powerful predictor; AIC and BIC both drop sharply at that same step. In this sequence, both AIC (32,644.6) and BIC (32,742.4) reach their minimum at the model that stops at income, and both rise when BMI is added. The rise is larger for BIC (7.7 points versus 1.2 for AIC), which penalizes complexity more heavily. This table illustrates a key lesson: R² never penalizes an added variable, even a useless one, making it unreliable for comparing models of different sizes.
Adjusted \(R^2\) penalizes for model complexity:
\[R^2_{adj} = 1 - \frac{(n - 1)(1 - R^2)}{n - p - 1}\]
where \(n\) is the sample size and \(p\) is the number of predictor degrees of freedom (excluding the intercept), so \(n - p - 1\) is the error df. Unlike \(R^2\), Adjusted \(R^2\) can decrease when an uninformative predictor is added, because the penalty for using an extra degree of freedom outweighs the tiny increase in \(R^2\).
Selection rule: Choose the model with the largest Adjusted \(R^2\).
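The Adjusted \(R^2\) calculation can be reproduced by hand. The sketch below uses the built-in mtcars data rather than the BRFSS sample, with \(p\) counting predictors excluding the intercept so that the error df is \(n - p - 1\); the model and object names are illustrative.

```r
# Illustrative model on a built-in dataset (not the lecture data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

n  <- nrow(mtcars)
p  <- length(coef(fit)) - 1        # predictors, excluding the intercept
r2 <- summary(fit)$r.squared

# Adjusted R-squared from its definition
adj_manual  <- 1 - (n - 1) * (1 - r2) / (n - p - 1)
adj_builtin <- summary(fit)$adj.r.squared

all.equal(adj_manual, adj_builtin)  # TRUE
```

Because \((n - 1) / (n - p - 1) > 1\) whenever \(p \geq 1\), the adjusted value is always below the raw \(R^2\) for a model with at least one predictor and an imperfect fit.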
The partial F-test compares a reduced model (with \(p\) predictors) to the maximum model (with \(k\) predictors):
\[F_p = \frac{\{SSE(p) - SSE(k)\} / (k - p)}{SSE(k) / (n - k - 1)}\]
This tests \(H_0\): the \(k - p\) omitted variables all have \(\beta = 0\).
Selection rule: Choose the smallest model for which \(F_p\) is not significant when compared to the maximum model.
# Compare a small model to the maximum model
mod_small <- lm(physhlth_days ~ menthlth_days + gen_health + exercise, data = brfss_ms)
anova(mod_small, mod_max) |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Partial F-test: Small Model vs. Maximum Model") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| term | df.residual | rss | df | sumsq | statistic | p.value |
|---|---|---|---|---|---|---|
| physhlth_days ~ menthlth_days + gen_health + exercise | 4993 | 200472.4 | NA | NA | NA | NA |
| physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education + exercise + gen_health + income_cat + bmi | 4985 | 199201.8 | 8 | 1270.601 | 3.9746 | 1e-04 |
Interpretation: The partial F-test compares the small model (mental health days + general health + exercise) to the maximum model (all 9 predictors). The F-statistic is 3.97 on 8 and 4,985 degrees of freedom (p ≈ 0.0001), so we reject the null hypothesis that the 8 additional coefficients are all zero. These 8 coefficients come from the 6 omitted variables (education alone contributes 3). In other words, at least one of the omitted variables (sleep, age, sex, education, income, BMI) contributes significantly beyond the three core predictors. The small model, despite capturing much of the explained variance, is missing important information. We should look for a model between the small and maximum models that retains the significant predictors while dropping the uninformative ones.
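The \(F_p\) statistic can also be computed directly from the two error sums of squares and checked against anova(). This is a sketch on the built-in mtcars data, with an arbitrary choice of reduced and full models; none of the object names below come from the lecture.

```r
# Reduced model with p = 1 predictor, "maximum" model with k = 3 predictors
reduced <- lm(mpg ~ wt, data = mtcars)
full    <- lm(mpg ~ wt + hp + qsec, data = mtcars)

n <- nrow(mtcars)
p <- length(coef(reduced)) - 1
k <- length(coef(full)) - 1

sse_p <- sum(resid(reduced)^2)
sse_k <- sum(resid(full)^2)

# Partial F statistic and its p-value from the F(k - p, n - k - 1) distribution
F_manual <- ((sse_p - sse_k) / (k - p)) / (sse_k / (n - k - 1))
p_manual <- pf(F_manual, df1 = k - p, df2 = n - k - 1, lower.tail = FALSE)

# Same quantities from anova()
tab     <- anova(reduced, full)
F_anova <- tab$F[2]
p_anova <- tab$`Pr(>F)`[2]

all.equal(F_manual, F_anova)  # TRUE
all.equal(p_manual, p_anova)  # TRUE
```

The numerator df is the number of coefficients dropped, not the number of variables dropped, which matters whenever a factor with more than two levels is among the omitted terms.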
\[AIC = 2k - 2\log(\hat{L})\]
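where \(k\) here denotes the number of estimated parameters (not the number of predictors in the maximum model) and \(\hat{L}\) is the maximized likelihood. AIC estimates the relative information lost when a model is used to approximate the data-generating process. It balances goodness of fit against complexity.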
Selection rule: Choose the model with the smallest AIC.
\[BIC = k \log(n) - 2\log(\hat{L})\]
BIC is similar to AIC but replaces the per-parameter penalty of 2 with \(\log(n)\), so it penalizes complexity more heavily whenever \(n > e^2 \approx 7.4\). With \(n = 5,000\), \(\log(n) \approx 8.5\), so BIC tends to select simpler models than AIC.
Selection rule: Choose the model with the smallest BIC.
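Both criteria can be rebuilt from the log-likelihood. The sketch below uses the built-in mtcars data, not the BRFSS sample, and the object names are illustrative. Note that for a linear model R counts the residual variance as an estimated parameter in addition to the regression coefficients.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

ll    <- logLik(fit)
k_par <- attr(ll, "df")   # 4 here: intercept, two slopes, residual variance
n     <- nobs(fit)

# AIC and BIC from their definitions
aic_manual <- 2 * k_par - 2 * as.numeric(ll)
bic_manual <- log(n) * k_par - 2 * as.numeric(ll)

all.equal(aic_manual, AIC(fit))  # TRUE
all.equal(bic_manual, BIC(fit))  # TRUE
```

The two criteria share the same fit term and differ only in the penalty multiplier, so for a fixed dataset BIC exceeds AIC by \((\log(n) - 2)\) times the parameter count.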
\[MSE(p) = \frac{SSE_p}{n - p - 1}\]
MSE(p) is the residual variance for a model with \(p\) predictors. It balances fit (smaller SSE) against model size (fewer df in the denominator).
Selection rule: Choose the model with the smallest MSE(p).
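MSE(p) is the square of the residual standard error that R reports for a linear model. A brief sketch on the built-in mtcars data (illustrative model and names, not the lecture data):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

n   <- nrow(mtcars)
p   <- length(coef(fit)) - 1    # predictors, excluding the intercept
sse <- sum(resid(fit)^2)

# MSE(p) from its definition versus the squared residual standard error
mse_manual <- sse / (n - p - 1)
all.equal(mse_manual, sigma(fit)^2)  # TRUE
```

This is the same quantity whose square root appears as `sigma` in the glance() output.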
| Criterion | Direction | Penalizes | Best for |
|---|---|---|---|
| R² | Maximize | No | Never use alone |
| Adjusted R² | Maximize | Yes (df penalty) | Comparing nested models |
| Fp (partial F) | Not significant → keep reduced | Yes (F distribution) | Comparing to maximum model |
| AIC | Minimize | Yes (2k) | General comparison |
| BIC | Minimize | Yes (k log n) | Favors simpler models |
| MSE(p) | Minimize | Yes (df in denominator) | Similar to Adj. R² |
criteria_long <- r2_table |>
dplyr::select(Model, p, AIC, BIC) |>
pivot_longer(cols = c(AIC, BIC), names_to = "Criterion", values_to = "Value") |>
mutate(Model = factor(Model, levels = r2_table$Model))
ggplot(criteria_long, aes(x = p, y = Value, color = Criterion)) +
geom_line(linewidth = 1.1) +
geom_point(size = 3) +
labs(
title = "AIC and BIC Across Sequentially Larger Models",
subtitle = "Lower is better; BIC penalizes complexity more heavily",
x = "Number of Predictor Degrees of Freedom (p)",
y = "Criterion Value"
) +
theme_minimal(base_size = 13) +
scale_color_brewer(palette = "Set1")
[Figure: AIC and BIC Across Sequential Models]
Interpretation: Both AIC and BIC fall sharply as predictors are added, with the steepest drop occurring when general health enters the model. In this particular sequence the two criteria reach their minimum at the same point (13 predictor df, after income is added) and both rise when BMI is added, but BIC rises more (7.7 points versus 1.2), reflecting its heavier penalty for model complexity. More generally, AIC tends to select larger models while BIC favors parsimony, and the gap widens as n grows. In practice, when AIC and BIC disagree, the choice depends on the modeling goal: AIC is generally preferred for prediction (it targets minimal information loss), while BIC is preferred for identifying the “true” model (it is consistent, meaning it selects the correct model with probability approaching 1 as n grows, provided that model is among the candidates).
The most thorough approach is to fit every possible subset of predictors and compare them. With \(k\) predictors, there are \(2^k - 1\) models.
This is computationally feasible for moderate \(k\) (up to about 20-30 predictors). In R, the leaps package implements this efficiently:
# regsubsets() accepts a formula; factors are expanded into dummy variables,
# and each dummy is treated as a separate candidate
best_subsets <- regsubsets(
physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education +
exercise + gen_health + income_cat + bmi,
data = brfss_ms,
nvmax = 15, # maximum number of variables to consider
method = "exhaustive"
)
best_summary <- summary(best_subsets)
subset_metrics <- tibble(
p = seq_along(best_summary$adjr2),
`Adj. R²` = best_summary$adjr2,
BIC = best_summary$bic,
Cp = best_summary$cp
)
p1 <- ggplot(subset_metrics, aes(x = p, y = `Adj. R²`)) +
geom_line(linewidth = 1, color = "steelblue") +
geom_point(size = 3, color = "steelblue") +
geom_vline(xintercept = which.max(best_summary$adjr2),
linetype = "dashed", color = "tomato") +
labs(title = "Adjusted R² by Model Size", x = "Number of Variables", y = "Adjusted R²") +
theme_minimal(base_size = 12)
p2 <- ggplot(subset_metrics, aes(x = p, y = BIC)) +
geom_line(linewidth = 1, color = "steelblue") +
geom_point(size = 3, color = "steelblue") +
geom_vline(xintercept = which.min(best_summary$bic),
linetype = "dashed", color = "tomato") +
labs(title = "BIC by Model Size", x = "Number of Variables", y = "BIC") +
theme_minimal(base_size = 12)
gridExtra::grid.arrange(p1, p2, ncol = 2)
[Figure: Best Subsets: Adjusted R² and BIC by Model Size]
## Best model by Adj. R²: 10 variables
## Best model by BIC: 8 variables
# Show which variables are in the BIC-best model
best_bic_idx <- which.min(best_summary$bic)
best_vars <- names(which(best_summary$which[best_bic_idx, -1]))
cat("\nVariables in BIC-best model:\n")
##
## Variables in BIC-best model:
## menthlth_days
## sleep_hrs
## age
## exerciseYes
## gen_healthGood
## gen_healthFair
## gen_healthPoor
## income_cat
Interpretation: The best subsets analysis is consistent with the sequential analysis. Adjusted R² reaches its maximum at 10 variables and plateaus, while BIC selects a more parsimonious model with 8 variables. The BIC-best model retains mental health days, sleep hours, age, exercise, three levels of general health (Good, Fair, Poor), and income. It drops sex, education, BMI, and the “Very good” general-health dummy. Dropping that single dummy implicitly merges “Very good” with the reference category “Excellent”; this happens because regsubsets() treats each dummy variable as a separate candidate rather than keeping a factor together. These are exactly the terms that had the largest p-values in the maximum model. The fact that both criteria agree on a core set of predictors (mental health, general health, exercise) gives us confidence that these are the genuinely important variables.
Backward elimination starts with the maximum model and removes variables one at a time:
## === BACKWARD ELIMINATION ===
## Step 1: Maximum model
## Variables: menthlth_days, sleep_hrs, age, sexFemale, educationHS graduate, educationSome college, educationCollege graduate, exerciseYes, gen_healthVery good, gen_healthGood, gen_healthFair, gen_healthPoor, income_cat, bmi
# Show p-values for the maximum model
pvals <- tidy(mod_max) |>
filter(term != "(Intercept)") |>
arrange(desc(p.value)) |>
dplyr::select(term, estimate, p.value) |>
mutate(across(where(is.numeric), \(x) round(x, 4)))
pvals |>
head(5) |>
kable(caption = "Maximum Model: Variables Sorted by p-value (Highest First)") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| term | estimate | p.value |
|---|---|---|
| educationHS graduate | 0.2508 | 0.5595 |
| educationCollege graduate | 0.3336 | 0.4439 |
| educationSome college | 0.3463 | 0.4233 |
| bmi | 0.0130 | 0.3683 |
| sexFemale | -0.1889 | 0.2995 |
In R, the step() function automates backward elimination using AIC:
# Automated backward elimination using AIC
mod_backward <- step(mod_max, direction = "backward", trace = 1)
## Start: AIC=18454.4
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education +
## exercise + gen_health + income_cat + bmi
##
## Df Sum of Sq RSS AIC
## - education 3 29 199231 18449
## - bmi 1 32 199234 18453
## - sex 1 43 199245 18454
## <none> 199202 18454
## - sleep_hrs 1 329 199530 18461
## - age 1 434 199636 18463
## - income_cat 1 521 199722 18466
## - exercise 1 1174 200376 18482
## - menthlth_days 1 5898 205100 18598
## - gen_health 4 66437 265639 19886
##
## Step: AIC=18449.13
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + exercise +
## gen_health + income_cat + bmi
##
## Df Sum of Sq RSS AIC
## - bmi 1 32 199262 18448
## - sex 1 40 199270 18448
## <none> 199231 18449
## - sleep_hrs 1 327 199557 18455
## - age 1 439 199670 18458
## - income_cat 1 520 199751 18460
## - exercise 1 1151 200381 18476
## - menthlth_days 1 5929 205159 18594
## - gen_health 4 66459 265690 19881
##
## Step: AIC=18447.92
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + exercise +
## gen_health + income_cat
##
## Df Sum of Sq RSS AIC
## - sex 1 42 199305 18447
## <none> 199262 18448
## - sleep_hrs 1 334 199596 18454
## - age 1 427 199690 18457
## - income_cat 1 514 199776 18459
## - exercise 1 1222 200484 18477
## - menthlth_days 1 5921 205184 18592
## - gen_health 4 67347 266609 19896
##
## Step: AIC=18446.98
## physhlth_days ~ menthlth_days + sleep_hrs + age + exercise +
## gen_health + income_cat
##
## Df Sum of Sq RSS AIC
## <none> 199305 18447
## - sleep_hrs 1 337 199641 18453
## - age 1 409 199713 18455
## - income_cat 1 492 199797 18457
## - exercise 1 1214 200518 18475
## - menthlth_days 1 5882 205186 18590
## - gen_health 4 67980 267285 19906
tidy(mod_backward, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Backward Elimination Result (AIC-based)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 3.1864 | 0.6663 | 4.7819 | 0.0000 | 1.8800 | 4.4927 |
| menthlth_days | 0.1461 | 0.0120 | 12.1352 | 0.0000 | 0.1225 | 0.1697 |
| sleep_hrs | -0.1951 | 0.0672 | -2.9038 | 0.0037 | -0.3269 | -0.0634 |
| age | 0.0174 | 0.0054 | 3.1981 | 0.0014 | 0.0067 | 0.0281 |
| exerciseYes | -1.2877 | 0.2336 | -5.5127 | 0.0000 | -1.7457 | -0.8298 |
| gen_healthVery good | 0.4617 | 0.2441 | 1.8914 | 0.0586 | -0.0169 | 0.9403 |
| gen_healthGood | 1.6368 | 0.2600 | 6.2953 | 0.0000 | 1.1271 | 2.1465 |
| gen_healthFair | 7.0787 | 0.3616 | 19.5735 | 0.0000 | 6.3697 | 7.7876 |
| gen_healthPoor | 20.5084 | 0.5423 | 37.8149 | 0.0000 | 19.4452 | 21.5716 |
| income_cat | -0.1657 | 0.0472 | -3.5115 | 0.0004 | -0.2582 | -0.0732 |
Interpretation: AIC-based backward elimination removed education, BMI, and sex (in that order) from the maximum model, arriving at a model with 9 predictor terms (counting dummy variables). These are the same three variables that were non-significant in the maximum model. Every retained variable (mental health days, sleep, age, exercise, general health, and income) is significant at the 0.05 level. The one individual coefficient above that threshold is the “Very good” general-health dummy (p = 0.059), which stays in because step() keeps or drops a factor as a whole. The resulting model has Adjusted R² = 0.385, essentially identical to the maximum model (0.384), confirming that the dropped variables contributed negligible explanatory power.
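A note on scale: the AIC values printed by step() (18,454.4 for the maximum model) differ from the AIC reported by glance() for the same model (32,645.8). The gap arises because step() relies on extractAIC(), which for linear models computes \(n \log(RSS/n) + 2 \cdot edf\) and omits additive constants that AIC() keeps. For a fixed \(n\) the offset is identical for every model, so model rankings are unaffected. The sketch below checks the offset on the built-in mtcars data (not the BRFSS sample); object names are illustrative.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
n   <- nobs(fit)

aic_full <- AIC(fit)             # full log-likelihood version
aic_step <- extractAIC(fit)[2]   # version used inside step()

# Constant dropped by extractAIC() for an unweighted lm:
# n * (1 + log(2 * pi)) from the Gaussian likelihood, plus 2 for the
# residual variance parameter that AIC() counts and extractAIC() does not
offset <- n * (1 + log(2 * pi)) + 2

all.equal(aic_full - aic_step, offset)  # TRUE
```

With \(n = 5,000\) the same formula gives an offset of about 14,191.4, which matches the difference between the two values reported for the maximum model.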
Forward selection starts with the intercept-only model and adds variables one at a time:
# Automated forward selection using AIC
mod_null <- lm(physhlth_days ~ 1, data = brfss_ms)
mod_forward <- step(mod_null,
scope = list(lower = mod_null, upper = mod_max),
direction = "forward", trace = 1)
## Start: AIC=20865.24
## physhlth_days ~ 1
##
## Df Sum of Sq RSS AIC
## + gen_health 4 115918 208518 18663
## + menthlth_days 1 29743 294693 20387
## + exercise 1 19397 305038 20559
## + income_cat 1 19104 305332 20564
## + education 3 5906 318530 20779
## + age 1 4173 320263 20803
## + bmi 1 4041 320395 20805
## + sleep_hrs 1 3717 320719 20810
## <none> 324435 20865
## + sex 1 7 324429 20867
##
## Step: AIC=18662.93
## physhlth_days ~ gen_health
##
## Df Sum of Sq RSS AIC
## + menthlth_days 1 6394.9 202123 18509
## + exercise 1 1652.4 206865 18625
## + income_cat 1 1306.9 207211 18634
## + sleep_hrs 1 756.1 207762 18647
## + bmi 1 91.2 208427 18663
## <none> 208518 18663
## + sex 1 38.5 208479 18664
## + age 1 32.2 208486 18664
## + education 3 145.0 208373 18666
##
## Step: AIC=18509.19
## physhlth_days ~ gen_health + menthlth_days
##
## Df Sum of Sq RSS AIC
## + exercise 1 1650.52 200472 18470
## + income_cat 1 817.89 201305 18491
## + age 1 464.73 201658 18500
## + sleep_hrs 1 257.79 201865 18505
## + bmi 1 90.51 202032 18509
## <none> 202123 18509
## + sex 1 3.00 202120 18511
## + education 3 111.58 202011 18512
##
## Step: AIC=18470.19
## physhlth_days ~ gen_health + menthlth_days + exercise
##
## Df Sum of Sq RSS AIC
## + income_cat 1 509.09 199963 18460
## + age 1 333.74 200139 18464
## + sleep_hrs 1 253.06 200219 18466
## <none> 200472 18470
## + bmi 1 21.21 200451 18472
## + sex 1 10.74 200462 18472
## + education 3 26.94 200445 18476
##
## Step: AIC=18459.48
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat
##
## Df Sum of Sq RSS AIC
## + age 1 321.97 199641 18453
## + sleep_hrs 1 250.25 199713 18455
## <none> 199963 18460
## + bmi 1 27.98 199935 18461
## + sex 1 27.17 199936 18461
## + education 3 26.66 199937 18465
##
## Step: AIC=18453.42
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat +
## age
##
## Df Sum of Sq RSS AIC
## + sleep_hrs 1 336.79 199305 18447
## <none> 199641 18453
## + sex 1 45.31 199596 18454
## + bmi 1 42.00 199599 18454
## + education 3 22.62 199619 18459
##
## Step: AIC=18446.98
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat +
## age + sleep_hrs
##
## Df Sum of Sq RSS AIC
## <none> 199305 18447
## + sex 1 42.328 199262 18448
## + bmi 1 34.434 199270 18448
## + education 3 24.800 199280 18452
tidy(mod_forward, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Forward Selection Result (AIC-based)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 3.1864 | 0.6663 | 4.7819 | 0.0000 | 1.8800 | 4.4927 |
| gen_healthVery good | 0.4617 | 0.2441 | 1.8914 | 0.0586 | -0.0169 | 0.9403 |
| gen_healthGood | 1.6368 | 0.2600 | 6.2953 | 0.0000 | 1.1271 | 2.1465 |
| gen_healthFair | 7.0787 | 0.3616 | 19.5735 | 0.0000 | 6.3697 | 7.7876 |
| gen_healthPoor | 20.5084 | 0.5423 | 37.8149 | 0.0000 | 19.4452 | 21.5716 |
| menthlth_days | 0.1461 | 0.0120 | 12.1352 | 0.0000 | 0.1225 | 0.1697 |
| exerciseYes | -1.2877 | 0.2336 | -5.5127 | 0.0000 | -1.7457 | -0.8298 |
| income_cat | -0.1657 | 0.0472 | -3.5115 | 0.0004 | -0.2582 | -0.0732 |
| age | 0.0174 | 0.0054 | 3.1981 | 0.0014 | 0.0067 | 0.0281 |
| sleep_hrs | -0.1951 | 0.0672 | -2.9038 | 0.0037 | -0.3269 | -0.0634 |
Interpretation: Forward selection arrived at the same final model as backward elimination, including the same 9 predictor terms. The order of entry is informative: general health entered first (the strongest predictor), followed by mental health days, exercise, income, age, and sleep. This ordering reflects each variable’s marginal contribution given the variables already in the model. The convergence of forward and backward methods on the same model increases our confidence in this particular subset, though this convergence is not guaranteed in general.
Stepwise selection combines forward and backward: after adding a variable, it checks whether any previously entered variable should now be removed. This addresses a limitation of pure forward selection, where a variable that was useful early on may become redundant after other variables enter.
mod_stepwise <- step(mod_null,
scope = list(lower = mod_null, upper = mod_max),
direction = "both", trace = 1)
## Start: AIC=20865.24
## physhlth_days ~ 1
##
## Df Sum of Sq RSS AIC
## + gen_health 4 115918 208518 18663
## + menthlth_days 1 29743 294693 20387
## + exercise 1 19397 305038 20559
## + income_cat 1 19104 305332 20564
## + education 3 5906 318530 20779
## + age 1 4173 320263 20803
## + bmi 1 4041 320395 20805
## + sleep_hrs 1 3717 320719 20810
## <none> 324435 20865
## + sex 1 7 324429 20867
##
## Step: AIC=18662.93
## physhlth_days ~ gen_health
##
## Df Sum of Sq RSS AIC
## + menthlth_days 1 6395 202123 18509
## + exercise 1 1652 206865 18625
## + income_cat 1 1307 207211 18634
## + sleep_hrs 1 756 207762 18647
## + bmi 1 91 208427 18663
## <none> 208518 18663
## + sex 1 38 208479 18664
## + age 1 32 208486 18664
## + education 3 145 208373 18666
## - gen_health 4 115918 324435 20865
##
## Step: AIC=18509.19
## physhlth_days ~ gen_health + menthlth_days
##
## Df Sum of Sq RSS AIC
## + exercise 1 1651 200472 18470
## + income_cat 1 818 201305 18491
## + age 1 465 201658 18500
## + sleep_hrs 1 258 201865 18505
## + bmi 1 91 202032 18509
## <none> 202123 18509
## + sex 1 3 202120 18511
## + education 3 112 202011 18512
## - menthlth_days 1 6395 208518 18663
## - gen_health 4 92570 294693 20387
##
## Step: AIC=18470.19
## physhlth_days ~ gen_health + menthlth_days + exercise
##
## Df Sum of Sq RSS AIC
## + income_cat 1 509 199963 18460
## + age 1 334 200139 18464
## + sleep_hrs 1 253 200219 18466
## <none> 200472 18470
## + bmi 1 21 200451 18472
## + sex 1 11 200462 18472
## + education 3 27 200445 18476
## - exercise 1 1651 202123 18509
## - menthlth_days 1 6393 206865 18625
## - gen_health 4 78857 279330 20121
##
## Step: AIC=18459.48
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat
##
## Df Sum of Sq RSS AIC
## + age 1 322 199641 18453
## + sleep_hrs 1 250 199713 18455
## <none> 199963 18460
## + bmi 1 28 199935 18461
## + sex 1 27 199936 18461
## + education 3 27 199937 18465
## - income_cat 1 509 200472 18470
## - exercise 1 1342 201305 18491
## - menthlth_days 1 5988 205952 18605
## - gen_health 4 72713 272676 20002
##
## Step: AIC=18453.42
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat +
## age
##
## Df Sum of Sq RSS AIC
## + sleep_hrs 1 337 199305 18447
## <none> 199641 18453
## + sex 1 45 199596 18454
## + bmi 1 42 199599 18454
## + education 3 23 199619 18459
## - age 1 322 199963 18460
## - income_cat 1 497 200139 18464
## - exercise 1 1231 200873 18482
## - menthlth_days 1 6304 205945 18607
## - gen_health 4 68936 268577 19929
##
## Step: AIC=18446.98
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat +
## age + sleep_hrs
##
## Df Sum of Sq RSS AIC
## <none> 199305 18447
## + sex 1 42 199262 18448
## + bmi 1 34 199270 18448
## + education 3 25 199280 18452
## - sleep_hrs 1 337 199641 18453
## - age 1 409 199713 18455
## - income_cat 1 492 199797 18457
## - exercise 1 1214 200518 18475
## - menthlth_days 1 5882 205186 18590
## - gen_health 4 67980 267285 19906
tidy(mod_stepwise, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Stepwise Selection Result (AIC-based)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 3.1864 | 0.6663 | 4.7819 | 0.0000 | 1.8800 | 4.4927 |
| gen_healthVery good | 0.4617 | 0.2441 | 1.8914 | 0.0586 | -0.0169 | 0.9403 |
| gen_healthGood | 1.6368 | 0.2600 | 6.2953 | 0.0000 | 1.1271 | 2.1465 |
| gen_healthFair | 7.0787 | 0.3616 | 19.5735 | 0.0000 | 6.3697 | 7.7876 |
| gen_healthPoor | 20.5084 | 0.5423 | 37.8149 | 0.0000 | 19.4452 | 21.5716 |
| menthlth_days | 0.1461 | 0.0120 | 12.1352 | 0.0000 | 0.1225 | 0.1697 |
| exerciseYes | -1.2877 | 0.2336 | -5.5127 | 0.0000 | -1.7457 | -0.8298 |
| income_cat | -0.1657 | 0.0472 | -3.5115 | 0.0004 | -0.2582 | -0.0732 |
| age | 0.0174 | 0.0054 | 3.1981 | 0.0014 | 0.0067 | 0.0281 |
| sleep_hrs | -0.1951 | 0.0672 | -2.9038 | 0.0037 | -0.3269 | -0.0634 |
Interpretation: The stepwise procedure, which allows both addition and removal at each step, also converges on the identical model. In this dataset, no variable that was added early became redundant after later variables entered, so no removals were needed. This three-way agreement (backward = forward = stepwise) is reassuring but should not be taken as proof that this is the “correct” model. All three methods optimize the same criterion (AIC) on the same data.
method_comparison <- tribble(
~Method, ~`Variables selected`, ~`Adj. R²`, ~AIC, ~BIC,
"Maximum model",
length(coef(mod_max)) - 1,
round(glance(mod_max)$adj.r.squared, 4),
round(AIC(mod_max), 1),
round(BIC(mod_max), 1),
"Backward (AIC)",
length(coef(mod_backward)) - 1,
round(glance(mod_backward)$adj.r.squared, 4),
round(AIC(mod_backward), 1),
round(BIC(mod_backward), 1),
"Forward (AIC)",
length(coef(mod_forward)) - 1,
round(glance(mod_forward)$adj.r.squared, 4),
round(AIC(mod_forward), 1),
round(BIC(mod_forward), 1),
"Stepwise (AIC)",
length(coef(mod_stepwise)) - 1,
round(glance(mod_stepwise)$adj.r.squared, 4),
round(AIC(mod_stepwise), 1),
round(BIC(mod_stepwise), 1)
)
method_comparison |>
kable(caption = "Comparison of Variable Selection Methods") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| Method | Variables selected | Adj. R² | AIC | BIC |
|---|---|---|---|---|
| Maximum model | 14 | 0.3843 | 32645.8 | 32750.1 |
| Backward (AIC) | 9 | 0.3846 | 32638.4 | 32710.1 |
| Forward (AIC) | 9 | 0.3846 | 32638.4 | 32710.1 |
| Stepwise (AIC) | 9 | 0.3846 | 32638.4 | 32710.1 |
Interpretation: All three automated methods selected the same model with 9 predictor terms (Adjusted R² = 0.385, AIC = 32,638.4, BIC = 32,710.1). This model has a lower AIC and BIC than the maximum model (AIC = 32,645.8, BIC = 32,750.1), indicating that removing sex, education, and BMI improved parsimony without sacrificing fit. The BIC improvement (40 points) is larger than the AIC improvement (about 7 points), consistent with BIC's heavier penalty on the five extra terms in the maximum model. In practice, the two models would produce very similar predictions, but the selected model is preferred because it is simpler.
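A reader may notice that the step() trace reports AIC = 18446.98 for the selected model, while the comparison table reports 32,638.4 for the same model. The sketch below, on simulated data (the variable names and sample size are illustrative assumptions, not BRFSS values), shows the usual reason: step() relies on extractAIC(), which drops an additive constant that depends only on the sample size, whereas AIC() uses the full log-likelihood.

```r
# Simulated data (hypothetical, not the BRFSS analytic file)
set.seed(553)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

# AIC(): -2 * logLik + 2 * (number of coefficients + 1 for the residual variance)
aic_full <- AIC(fit)

# extractAIC(): n * log(RSS / n) + 2 * (number of coefficients); used by step()
aic_step <- extractAIC(fit)[2]

# The gap depends only on n, so it is the same for every model fit to the same rows
gap          <- aic_full - aic_step
expected_gap <- n * (1 + log(2 * pi)) + 2

c(AIC = aic_full, step_AIC = aic_step, gap = gap, expected_gap = expected_gap)
```

Because the constant is the same for every model fit to the same observations, rankings and differences between models are identical on either scale; only the absolute values differ.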
Use automated selection with extreme caution.
Automated methods (forward, backward, stepwise) have well-documented problems:
1. They ignore the research question. The algorithm selects variables based purely on statistical fit. If you are building an associative model and the exposure is not statistically significant, the algorithm will remove it, which defeats the purpose.
2. They inflate Type I error. The repeated testing involved in stepwise procedures inflates the probability of including spurious predictors.
3. They are path-dependent. Forward and backward selection can yield different final models because the order of variable entry/removal matters.
4. They ignore subject-matter knowledge. A variable may be a known confounder from the literature even if it is not statistically significant in your sample.
5. p-values and CIs from the final model are biased. Because the model was selected to optimize fit, the reported p-values are anti-conservative (too small) and the confidence intervals are too narrow.
Recommendation: Use automated selection as an exploratory tool to generate candidate models, but make final decisions based on substantive knowledge, confounding assessment, and parsimony.
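The Type I error problem can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: the outcome and all ten candidate predictors are generated independently (so none is truly related to the outcome), and the names `noise1` to `noise10` and the sample sizes are hypothetical choices for illustration.

```r
set.seed(2020)
n_obs   <- 200   # observations per simulated dataset
n_noise <- 10    # candidate predictors, all unrelated to y by construction
n_reps  <- 30    # number of simulated datasets

n_selected <- numeric(n_reps)
for (r in seq_len(n_reps)) {
  dat <- as.data.frame(matrix(rnorm(n_obs * n_noise), nrow = n_obs))
  names(dat) <- paste0("noise", seq_len(n_noise))
  dat$y <- rnorm(n_obs)

  full     <- lm(y ~ ., data = dat)
  selected <- step(full, direction = "backward", trace = 0)

  # Count how many pure-noise predictors survived selection
  n_selected[r] <- length(coef(selected)) - 1
}

mean(n_selected)    # average number of spurious predictors retained per dataset
table(n_selected)   # distribution across simulated datasets
```

Any predictor retained here is spurious by construction, so a non-zero average shows how AIC-based selection can admit noise variables even when nothing is going on.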
In associative modeling, the exposure variable is always in the model. It is never a candidate for removal, regardless of its p-value. The question is which covariates to include alongside it.
The standard epidemiological approach to covariate selection:
Recall from the Confounding lecture: a covariate is a confounder if removing it changes the exposure coefficient by more than 10%.
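The systematic procedure:
1. Fit the maximum associative model containing the exposure and all candidate covariates, and record the exposure coefficient.
2. Remove one covariate at a time and refit the model.
3. Compute the percent change in the exposure coefficient relative to the maximum model.
4. Flag a covariate as a confounder (keep it) if the change exceeds 10%; otherwise it is a candidate for removal.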
# Exposure: exercise; Outcome: physhlth_days
# Maximum associative model
mod_assoc_max <- lm(physhlth_days ~ exercise + menthlth_days + sleep_hrs + age +
sex + education + income_cat + bmi,
data = brfss_ms)
b_exposure_max <- coef(mod_assoc_max)["exerciseYes"]
interval_low <- b_exposure_max - 0.10 * abs(b_exposure_max)
interval_high <- b_exposure_max + 0.10 * abs(b_exposure_max)
cat("Exposure coefficient in maximum model:", round(b_exposure_max, 4), "\n")
cat("10% interval: (", round(interval_low, 4), ",", round(interval_high, 4), ")\n")
## Exposure coefficient in maximum model: -3.0688
## 10% interval: ( -3.3757 , -2.7619 )
# Systematically remove one covariate at a time
covariates_to_test <- c("menthlth_days", "sleep_hrs", "age", "sex",
"education", "income_cat", "bmi")
assoc_table <- map_dfr(covariates_to_test, \(cov) {
# Build formula without this covariate
remaining <- setdiff(covariates_to_test, cov)
form <- as.formula(paste("physhlth_days ~ exercise +", paste(remaining, collapse = " + ")))
mod_reduced <- lm(form, data = brfss_ms)
b_reduced <- coef(mod_reduced)["exerciseYes"]
pct_change <- (b_reduced - b_exposure_max) / abs(b_exposure_max) * 100
tibble(
`Removed covariate` = cov,
`Exercise β (max)` = round(b_exposure_max, 4),
`Exercise β (without)` = round(b_reduced, 4),
`% Change` = round(pct_change, 1),
`Within 10%?` = ifelse(abs(pct_change) <= 10, "Yes (drop)", "No (keep)"),
Confounder = ifelse(abs(pct_change) > 10, "Yes", "No")
)
})
assoc_table |>
kable(caption = "Associative Model: Systematic Confounder Assessment for Exercise") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
column_spec(6, bold = TRUE)

| Removed covariate | Exercise β (max) | Exercise β (without) | % Change | Within 10%? | Confounder |
|---|---|---|---|---|---|
| menthlth_days | -3.0688 | -3.3725 | -9.9 | Yes (drop) | No |
| sleep_hrs | -3.0688 | -3.0950 | -0.9 | Yes (drop) | No |
| age | -3.0688 | -3.4150 | -11.3 | No (keep) | Yes |
| sex | -3.0688 | -3.0534 | 0.5 | Yes (drop) | No |
| education | -3.0688 | -3.1036 | -1.1 | Yes (drop) | No |
| income_cat | -3.0688 | -3.4544 | -12.6 | No (keep) | Yes |
| bmi | -3.0688 | -3.2411 | -5.6 | Yes (drop) | No |
Interpretation: The exercise coefficient in the maximum associative model is -3.07, meaning exercisers report about 3 fewer physically unhealthy days than non-exercisers after adjusting for all covariates. The systematic assessment identifies two confounders: age (11.3% change when removed) and income (12.6% change when removed). Removing age moves the exercise coefficient further from zero (to -3.42): older adults exercise less and have more unhealthy days, so ignoring age makes exercise look more protective than it is. Removing income has the same effect (to -3.45), with a similar mechanism (higher income is associated with both more exercise and fewer unhealthy days). The remaining covariates (mental health days, sleep, sex, education, BMI) all produce changes within the 10% interval, so by this rule they are not confounders and could be dropped, although mental health days (9.9%) sits right at the threshold. The final associative model would include exercise, age, and income.
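The direction of the change for age in the table above (from -3.07 to -3.42 when age is removed) can be checked on simulated data. The sketch below assumes, purely for illustration, that older people are less likely to exercise, that each year of age adds 0.08 unhealthy days, and that the true exercise coefficient is -3; none of these values are estimates from BRFSS.

```r
set.seed(553)
n   <- 5000
age <- runif(n, 18, 80)

# Assumption: the probability of exercising declines with age
exercise <- rbinom(n, 1, plogis(2 - 0.04 * age))

# Assumption: true exercise effect is -3 days; age adds 0.08 days per year
days <- 2 + 0.08 * age - 3 * exercise + rnorm(n, sd = 5)

b_adj   <- unname(coef(lm(days ~ exercise + age))["exercise"])  # adjusted for age
b_crude <- unname(coef(lm(days ~ exercise))["exercise"])        # age omitted

# Same percent-change formula as in the confounder assessment table
pct_change <- (b_crude - b_adj) / abs(b_adj) * 100

round(c(adjusted = b_adj, unadjusted = b_crude, pct_change = pct_change), 2)
```

When age is omitted, the exercise coefficient absorbs part of the age effect and moves further from zero, which is the same pattern seen when age is removed from the maximum associative model.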
If a statistically significant interaction is present (from the previous lecture), the approach changes:
For example, if age \(\times\) exercise is significant:
physhlth_days ~ age + [confounders]
and assess confounding

| Feature | Predictive | Associative |
|---|---|---|
| Exposure variable | No fixed exposure | Always in the model |
| Covariate selection | Based on statistical fit | Based on confounding assessment |
| Automated methods | Useful (with caution) | Generally inappropriate |
| 10% change rule | Not used | Primary tool |
| Interaction terms | Include if improves prediction | Include if effect modification is present |
| Primary criterion | Adj. R², AIC, BIC | Validity of exposure β |
| Parsimony | Fewer variables = less overfitting | Fewer variables = more efficient, if not confounders |
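
If automated selection is used for exploration in an associative analysis, one way to respect the "exposure always stays in the model" rule is the `lower` component of the `scope` argument of step(). The following is a minimal sketch on simulated data; the variable names (`exposure`, `conf1`, `noise1`, `noise2`) are hypothetical, and the exposure is given no true effect so that unconstrained selection would tend to drop it.

```r
set.seed(553)
n <- 1000
dat <- data.frame(
  exposure = rbinom(n, 1, 0.5),
  conf1    = rnorm(n),
  noise1   = rnorm(n),
  noise2   = rnorm(n)
)
# Assumption: the exposure has no true effect on y in this simulation
dat$y <- 1 + 2 * dat$conf1 + rnorm(n)

full <- lm(y ~ exposure + conf1 + noise1 + noise2, data = dat)

# Terms in `lower` are never candidates for removal
constrained <- step(
  full,
  scope     = list(lower = ~ exposure, upper = formula(full)),
  direction = "backward",
  trace     = 0
)

formula(constrained)
```

This only protects the exposure term; the covariates are still chosen by AIC rather than by confounding assessment, so the result remains exploratory.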
A model that fits the training data well may perform poorly on new data (overfitting). Cross-validation estimates how well the model would perform on data it has not seen.
The simplest approach is k-fold cross-validation:
# 10-fold cross-validation comparison
set.seed(1220)
n <- nrow(brfss_ms)
k_folds <- 10
fold_id <- sample(rep(1:k_folds, length.out = n))
# Compare a small model, medium model, and full model
cv_results <- map_dfr(1:k_folds, \(fold) {
train <- brfss_ms[fold_id != fold, ]
test <- brfss_ms[fold_id == fold, ]
# Small model
m_small <- lm(physhlth_days ~ menthlth_days + gen_health, data = train)
pred_small <- predict(m_small, newdata = test)
# Medium model
m_med <- lm(physhlth_days ~ menthlth_days + gen_health + exercise + age + sleep_hrs,
data = train)
pred_med <- predict(m_med, newdata = test)
# Full model
m_full <- lm(physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education +
exercise + gen_health + income_cat + bmi, data = train)
pred_full <- predict(m_full, newdata = test)
tibble(
fold = fold,
RMSE_small = sqrt(mean((test$physhlth_days - pred_small)^2)),
RMSE_medium = sqrt(mean((test$physhlth_days - pred_med)^2)),
RMSE_full = sqrt(mean((test$physhlth_days - pred_full)^2))
)
})
cv_summary <- cv_results |>
summarise(
across(starts_with("RMSE"), \(x) round(mean(x), 3))
)
tribble(
~Model, ~Predictors, ~`CV RMSE`,
"Small", "menthlth_days + gen_health", cv_summary$RMSE_small,
"Medium", "+ exercise + age + sleep_hrs", cv_summary$RMSE_medium,
"Full", "All 9 predictors", cv_summary$RMSE_full
) |>
kable(caption = "10-Fold Cross-Validation: Out-of-Sample RMSE") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| Model | Predictors | CV RMSE |
|---|---|---|
| Small | menthlth_days + gen_health | 6.362 |
| Medium | + exercise + age + sleep_hrs | 6.334 |
| Full | All 9 predictors | 6.334 |
Interpretation: RMSE summarizes the typical size of the prediction error in the units of the outcome (days); a lower CV RMSE indicates better out-of-sample prediction. Here the medium and full models have the same CV RMSE (6.334) and the small model is only slightly worse (6.362), so the extra predictors in the full model do not improve prediction and add complexity without benefit.
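The fold-by-fold comparison above can be wrapped in a small reusable helper. This is a sketch, not part of the lecture code: `cv_rmse()` is a hypothetical function name, and the data are simulated so the block runs on its own.

```r
# Hypothetical helper: k-fold cross-validated RMSE for a linear model formula
cv_rmse <- function(formula, data, k = 10, seed = 1220) {
  set.seed(seed)  # same seed gives the same folds for every model compared
  fold_id <- sample(rep(seq_len(k), length.out = nrow(data)))
  outcome <- all.vars(formula)[1]

  fold_rmse <- vapply(seq_len(k), function(fold) {
    train <- data[fold_id != fold, , drop = FALSE]
    test  <- data[fold_id == fold, , drop = FALSE]
    fit   <- lm(formula, data = train)
    pred  <- predict(fit, newdata = test)
    sqrt(mean((test[[outcome]] - pred)^2))
  }, numeric(1))

  mean(fold_rmse)
}

# Simulated data: x1 and x2 matter, junk does not (illustrative assumption)
set.seed(1)
sim <- data.frame(x1 = rnorm(400), x2 = rnorm(400), junk = rnorm(400))
sim$y <- 1 + 2 * sim$x1 - sim$x2 + rnorm(400)

rmse_null   <- cv_rmse(y ~ 1, sim)
rmse_signal <- cv_rmse(y ~ x1 + x2, sim)
rmse_junk   <- cv_rmse(y ~ x1 + x2 + junk, sim)

c(null = rmse_null, signal = rmse_signal, with_junk = rmse_junk)
```

Fixing the seed inside the helper means every candidate model is evaluated on identical folds, so differences in CV RMSE reflect the models rather than the random split.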
| Concept | Key Point |
|---|---|
| Maximum model | Start with all candidate predictors from literature and research question |
| Overfitting vs. underfitting | Overfitting = more variance; underfitting = bias |
| Parsimony | Simplest model that captures the important relationships |
| \(R^2\) | Always increases with more variables; useless alone for comparison |
| Adjusted \(R^2\) | Penalizes complexity; maximize it |
| AIC | Balances fit and complexity; minimize it |
| BIC | Heavier penalty than AIC; favors simpler models; minimize it |
| Partial F-test | Compares reduced to maximum model |
| Best subsets | Exhaustive search; leaps::regsubsets() |
| Backward elimination | Start full, remove highest p-value; step(direction = "backward") |
| Forward selection | Start empty, add lowest p-value; step(direction = "forward") |
| Stepwise | Forward + backward at each step; step(direction = "both") |
| Caution | Automated methods ignore research questions and inflate Type I error |
| Associative models | Exposure stays in model; use 10% change-in-estimate for covariates |
| Cross-validation | Estimates out-of-sample performance; protects against overfitting |
EPI 553: Model Selection Lab
Due: End of class, March 24, 2026
In this lab, you will practice both predictive and associative model selection using the BRFSS 2020 dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.
Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.
Use the saved analytic dataset from today’s lecture.
| Variable | Description | Type |
|---|---|---|
| physhlth_days | Physically unhealthy days in past 30 | Continuous (0–30) |
| menthlth_days | Mentally unhealthy days in past 30 | Continuous (0–30) |
| sleep_hrs | Sleep hours per night | Continuous (1–14) |
| age | Age in years (capped at 80) | Continuous |
| sex | Sex (Male/Female) | Factor |
| education | Education level (4 categories) | Factor |
| exercise | Any physical activity (Yes/No) | Factor |
| gen_health | General health status (5 categories) | Factor |
| income_cat | Household income (1–8 ordinal) | Numeric |
| bmi | Body mass index | Continuous |
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(car)
library(leaps)
library(MASS)
library(purrr)
brfss_ms <- readRDS(
  "C:/Users/tahia/OneDrive/Desktop/UAlbany PhD/Epi 553/Lab 11/brfss_ms_2020.rds")

1a. (5 pts) Fit the maximum model predicting physhlth_days from all 9 candidate predictors. Report \(R^2\), Adjusted \(R^2\), AIC, and BIC.
1a Ans The maximum model included all nine candidate predictors: menthlth_days, sleep_hrs, age, sex, education, exercise, gen_health, income_cat, and bmi. This model produced an R² of 0.386 and an Adjusted R² of 0.3843, indicating that approximately 38.6% of the variability in physically unhealthy days is explained by the included predictors. The model had an AIC of 32645.79 and a BIC of 32750.06, which serve as baseline values for comparison with simpler models.
1b. (5 pts) Now fit a “minimal” model using only menthlth_days and age. Report the same four criteria. How do the two models compare?
1b Ans The minimal model included only menthlth_days and age. This model produced an R² of 0.115 and an Adjusted R² of 0.1146, indicating that it explains only about 11.5% of the variability in physically unhealthy days. The AIC (34449.78) and BIC (34475.85) are substantially higher than those of the maximum model.
Comparing the two models, the maximum model clearly performs better. It explains a much larger proportion of variance (higher R² and Adjusted R²) and has substantially lower AIC and BIC values, indicating a better balance between model fit and complexity.
1c. (5 pts) Explain why \(R^2\) is a poor criterion for comparing these two models. What makes Adjusted \(R^2\), AIC, and BIC better choices?
1c Ans R² is a poor criterion for comparing these two models because it always increases, or at least does not decrease, when additional predictors are added, regardless of whether those predictors are truly informative. This means that R² will always favor more complex models, even if the added variables do not improve the model in a meaningful way.
In contrast, Adjusted R², AIC, and BIC are better choices because they account for model complexity. Adjusted R² includes a penalty for adding unnecessary predictors, so it only increases when a new variable improves the model sufficiently. Similarly, AIC and BIC balance model fit against a penalty for the number of parameters: lower values indicate better models, and the penalty discourages overfitting. BIC applies a stronger penalty than AIC, which often leads to selecting more parsimonious models.
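The behavior of R² described above can be seen in a short simulation. This is a minimal sketch on made-up data: `noise` is generated independently of the outcome, so any gain in R² from adding it is an artifact.

```r
set.seed(553)
n <- 300
x     <- rnorm(n)
y     <- 1 + 0.5 * x + rnorm(n)
noise <- rnorm(n)   # unrelated to y by construction

small <- summary(lm(y ~ x))
big   <- summary(lm(y ~ x + noise))

# R-squared cannot decrease when a predictor is added; adjusted R-squared can
rbind(
  small = c(r2 = small$r.squared, adj_r2 = small$adj.r.squared),
  big   = c(r2 = big$r.squared,   adj_r2 = big$adj.r.squared)
)
```

Comparing the two rows shows R² creeping upward for the larger model regardless of whether the added variable carries information, while Adjusted R² moves only if the gain outweighs the lost degree of freedom.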
# Task 1: Maximum Model and Criteria Comparison
# 1a. Maximum model
mod_max <- lm(
physhlth_days ~ menthlth_days + sleep_hrs + age + sex +
education + exercise + gen_health + income_cat + bmi,
data = brfss_ms
)
# 1b. Minimal model
mod_min <- lm(
physhlth_days ~ menthlth_days + age,
data = brfss_ms
)
# Extract fit criteria for both models
criteria_table <- bind_rows(
glance(mod_max) %>%
mutate(Model = "Maximum model") %>%
dplyr::select(Model, r.squared, adj.r.squared, AIC, BIC),
glance(mod_min) %>%
mutate(Model = "Minimal model") %>%
dplyr::select(Model, r.squared, adj.r.squared, AIC, BIC)
) %>%
mutate(across(where(is.numeric), ~ round(.x, 4)))
criteria_table %>%
kable(
caption = "Comparison of Maximum and Minimal Models",
col.names = c("Model", "R²", "Adjusted R²", "AIC", "BIC")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| Model | R² | Adjusted R² | AIC | BIC |
|---|---|---|---|---|
| Maximum model | 0.386 | 0.3843 | 32645.79 | 32750.06 |
| Minimal model | 0.115 | 0.1146 | 34449.78 | 34475.85 |
2a. (5 pts) Use leaps::regsubsets() to perform best subsets regression with nvmax = 15. Create a plot of Adjusted \(R^2\) vs. number of variables. At what model size does Adjusted \(R^2\) plateau?
2a Ans The plot of Adjusted R² against the number of variables shows that Adjusted R² increases rapidly as predictors are added initially, but then begins to level off. The Adjusted R² reaches its maximum and plateaus around 10 variables, indicating that adding more variables beyond this point does not meaningfully improve the model.
2b. (5 pts) Create a plot of BIC vs. number of variables. Which model size minimizes BIC?
2b Ans The BIC plot shows a clear minimum at a smaller model size compared to Adjusted R². The BIC is minimized at 8 variables, indicating that this model provides the best balance between model fit and complexity. Because BIC applies a stronger penalty for additional predictors, it favors a more parsimonious model.
2c. (5 pts) Identify the variables included in the BIC-best model. Fit this model explicitly using lm() and report its coefficients.
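2c Ans The BIC-best subset contains 8 terms: menthlth_days, sleep_hrs, age, exerciseYes, income_cat, and three of the four gen_health indicators (Good, Fair, Poor). Because regsubsets() treats each dummy variable as a separate candidate, it omitted the Very good indicator. A categorical predictor should enter or leave the model as a whole, so the model was refit with lm() using the full gen_health factor, giving the predictors menthlth_days, sleep_hrs, age, exercise, gen_health, and income_cat (9 coefficients plus the intercept).
The fitted model shows that several predictors are strongly associated with physically unhealthy days. Each additional mentally unhealthy day is associated with an increase of 0.146 physically unhealthy days (p < 0.001). Sleep hours have a negative association, where each additional hour of sleep is associated with approximately 0.20 fewer unhealthy days (p = 0.004). Age is positively associated, though the effect is small (about 0.017 additional days per year of age, p = 0.001).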
Exercise is associated with significantly fewer unhealthy days, with individuals who exercise reporting about 1.29 fewer physically unhealthy days compared to those who do not (p < 0.001). General health status is the strongest predictor: compared to individuals in excellent health, those reporting fair and poor health have dramatically higher numbers of physically unhealthy days (e.g., poor health is associated with about 20.5 additional days, p < 0.001). Income is negatively associated, suggesting that higher income levels are linked to fewer unhealthy days.
2d. (5 pts) Compare the BIC-best model to the Adjusted \(R^2\)-best model. Are they the same? If not, which would you prefer and why?
2d Ans The Adjusted R²-best model and the BIC-best model are not the same. Adjusted R² selects a larger model with 10 terms, adding sexFemale and the gen_healthVery good indicator to the 8 terms selected by BIC.
I would prefer the BIC-best model because it achieves nearly the same explanatory power while using fewer predictors. BIC penalizes model complexity more heavily, which helps avoid overfitting and leads to a more interpretable and efficient model. Since the additional variables in the larger model do not substantially improve Adjusted R², retaining them would add unnecessary complexity without meaningful benefit.
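The heavier BIC penalty mentioned above can be made concrete by rebuilding both criteria from the log-likelihood. This is a sketch on simulated data (sample size and variables are illustrative assumptions): AIC charges 2 per estimated parameter, while BIC charges log(n).

```r
set.seed(553)
n  <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + x1 + rnorm(n)
fit <- lm(y ~ x1 + x2)

ll <- logLik(fit)
k  <- attr(ll, "df")   # regression coefficients plus the residual variance

aic_manual <- -2 * as.numeric(ll) + 2 * k
bic_manual <- -2 * as.numeric(ll) + log(n) * k

# Manual values should match the built-in functions
c(AIC = AIC(fit), aic_manual = aic_manual, BIC = BIC(fit), bic_manual = bic_manual)

# Penalty per parameter under each criterion for this sample size
c(aic_penalty = 2, bic_penalty = log(n))
```

Since log(n) exceeds 2 once n is 8 or more, BIC penalizes each extra term more than AIC in any realistically sized dataset, which is why it tends to stop at a smaller model.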
# Task 2: Best Subsets Regression
# Best subsets regression
best_subsets <- regsubsets(
physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education +
exercise + gen_health + income_cat + bmi,
data = brfss_ms,
nvmax = 15,
method = "exhaustive"
)
best_summary <- summary(best_subsets)
# Create dataset of subset metrics
subset_metrics <- tibble(
p = 1:length(best_summary$adjr2),
`Adj. R²` = best_summary$adjr2,
BIC = best_summary$bic,
Cp = best_summary$cp
)
# Plot of Adjusted R² by model size
ggplot(subset_metrics, aes(x = p, y = `Adj. R²`)) +
geom_line(linewidth = 1, color = "steelblue") +
geom_point(size = 3, color = "steelblue") +
geom_vline(xintercept = which.max(best_summary$adjr2),
linetype = "dashed", color = "tomato") +
labs(
title = "Adjusted R² by Model Size",
x = "Number of Variables",
y = "Adjusted R²"
) +
theme_minimal(base_size = 12)

# Plot of BIC by model size
ggplot(subset_metrics, aes(x = p, y = BIC)) +
geom_line(linewidth = 1, color = "steelblue") +
geom_point(size = 3, color = "steelblue") +
geom_vline(xintercept = which.min(best_summary$bic),
linetype = "dashed", color = "tomato") +
labs(
title = "BIC by Model Size",
x = "Number of Variables",
y = "BIC"
) +
theme_minimal(base_size = 12)

# Identify best model sizes
cat("Best model by Adj. R²:", which.max(best_summary$adjr2), "variables\n")
cat("Best model by BIC:", which.min(best_summary$bic), "variables\n")
## Best model by Adj. R²: 10 variables
## Best model by BIC: 8 variables
# Show which variables are in the BIC-best model
best_bic_idx <- which.min(best_summary$bic)
best_bic_vars <- names(which(best_summary$which[best_bic_idx, -1]))
cat("\nVariables in BIC-best model:\n")
cat(best_bic_vars, sep = "\n")
##
## Variables in BIC-best model:
## menthlth_days
## sleep_hrs
## age
## exerciseYes
## gen_healthGood
## gen_healthFair
## gen_healthPoor
## income_cat
# Fit the BIC-best model explicitly
mod_bic_best <- lm(
physhlth_days ~ menthlth_days + sleep_hrs + age + exercise +
gen_health + income_cat,
data = brfss_ms
)
# Report coefficients for the BIC-best model
tidy(mod_bic_best, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Coefficients for the BIC-Best Model",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 3.1864 | 0.6663 | 4.7819 | 0.0000 | 1.8800 | 4.4927 |
| menthlth_days | 0.1461 | 0.0120 | 12.1352 | 0.0000 | 0.1225 | 0.1697 |
| sleep_hrs | -0.1951 | 0.0672 | -2.9038 | 0.0037 | -0.3269 | -0.0634 |
| age | 0.0174 | 0.0054 | 3.1981 | 0.0014 | 0.0067 | 0.0281 |
| exerciseYes | -1.2877 | 0.2336 | -5.5127 | 0.0000 | -1.7457 | -0.8298 |
| gen_healthVery good | 0.4617 | 0.2441 | 1.8914 | 0.0586 | -0.0169 | 0.9403 |
| gen_healthGood | 1.6368 | 0.2600 | 6.2953 | 0.0000 | 1.1271 | 2.1465 |
| gen_healthFair | 7.0787 | 0.3616 | 19.5735 | 0.0000 | 6.3697 | 7.7876 |
| gen_healthPoor | 20.5084 | 0.5423 | 37.8149 | 0.0000 | 19.4452 | 21.5716 |
| income_cat | -0.1657 | 0.0472 | -3.5115 | 0.0004 | -0.2582 | -0.0732 |
# Identify variables in Adjusted R²-best model
best_adjr2_idx <- which.max(best_summary$adjr2)
best_adjr2_vars <- names(which(best_summary$which[best_adjr2_idx, -1]))
cat("\nVariables in Adjusted R²-best model:\n")
cat(best_adjr2_vars, sep = "\n")
##
## Variables in Adjusted R²-best model:
## menthlth_days
## sleep_hrs
## age
## sexFemale
## exerciseYes
## gen_healthVery good
## gen_healthGood
## gen_healthFair
## gen_healthPoor
## income_cat
3a. (5 pts) Perform backward elimination using step() with AIC as the criterion. Which variables are removed? Which remain?
3a Ans Backward elimination using AIC removed sex, education, and BMI from the maximum model. The final model retained the following predictors: menthlth_days, sleep_hrs, age, exercise, gen_health, and income_cat. This indicates that the removed variables did not contribute meaningfully to improving model fit when evaluated using AIC.
3b. (5 pts) Perform forward selection using step(). Does it arrive at the same model as backward elimination?
3b Ans Forward selection using AIC arrived at the same final model as backward elimination. It selected the same set of predictors: menthlth_days, sleep_hrs, age, exercise, gen_health, and income_cat. This agreement between forward and backward methods suggests that these variables are consistently identified as the most important predictors under the AIC criterion.
3c. (5 pts) Compare the backward, forward, and stepwise results in a single table showing the number of variables, Adjusted \(R^2\), AIC, and BIC for each.
3c Ans All three automated methods (backward, forward, and stepwise) produced identical results. Each selected a model with 9 predictor terms (including dummy variables), with an Adjusted R² of 0.3846, AIC of 32638.4, and BIC of 32710.1. Compared with the maximum model, removing sex, education, and BMI improved parsimony without reducing explanatory power. The agreement across methods is reassuring, but because all three optimize the same criterion (AIC) on the same data, it does not prove that this is the correct model.
3d. (5 pts) List three reasons why you should not blindly trust the results of automated variable selection. Which of these concerns is most relevant for epidemiological research?
3d Ans Automated variable selection methods should not be blindly trusted for several reasons. First, they ignore the research question, selecting variables purely based on statistical criteria rather than subject-matter relevance. Second, they inflate Type I error, because repeated testing increases the likelihood of including variables that are not truly associated with the outcome. Third, they are data-dependent and unstable, meaning small changes in the dataset can lead to different selected models.
Among these concerns, the most important for epidemiological research is that automated methods ignore confounding and subject-matter knowledge. In epidemiology, the goal is often to estimate valid associations, not just optimize prediction. Removing an important confounder simply because it is not statistically significant can lead to biased estimates, which undermines the validity of the study.
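The instability concern can be explored by rerunning selection on bootstrap resamples and recording how often each variable survives. The sketch below uses simulated data with hypothetical predictors `x1` to `x4`; the assumed effects (strong for `x1`, weak for `x2`, none for `x3` and `x4`) are illustrative only.

```r
set.seed(553)
n <- 300
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
# Assumption: x1 has a strong effect, x2 a weak one, x3 and x4 none
dat$y <- 1 + 1.0 * dat$x1 + 0.15 * dat$x2 + rnorm(n)

candidates <- c("x1", "x2", "x3", "x4")
n_boot <- 50
kept <- matrix(FALSE, nrow = n_boot, ncol = length(candidates),
               dimnames = list(NULL, candidates))

for (b in seq_len(n_boot)) {
  # Resample rows with replacement and rerun backward elimination
  boot_dat <- dat[sample(n, replace = TRUE), ]
  full     <- lm(y ~ x1 + x2 + x3 + x4, data = boot_dat)
  selected <- step(full, direction = "backward", trace = 0)
  kept[b, ] <- candidates %in% names(coef(selected))
}

selection_freq <- colMeans(kept)  # share of resamples in which each variable was retained
selection_freq
```

Variables with selection frequencies well below 1 are the ones whose inclusion depends on the particular sample, which is a reason to treat a single stepwise result as provisional.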
# Task 3: Automated Selection Methods
# Backward elimination using AIC
mod_max <- lm(
physhlth_days ~ menthlth_days + sleep_hrs + age + sex +
education + exercise + gen_health + income_cat + bmi,
data = brfss_ms
)
mod_backward <- step(mod_max, direction = "backward", trace = 1)
## Start: AIC=18454.4
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education +
## exercise + gen_health + income_cat + bmi
##
## Df Sum of Sq RSS AIC
## - education 3 29 199231 18449
## - bmi 1 32 199234 18453
## - sex 1 43 199245 18454
## <none> 199202 18454
## - sleep_hrs 1 329 199530 18461
## - age 1 434 199636 18463
## - income_cat 1 521 199722 18466
## - exercise 1 1174 200376 18482
## - menthlth_days 1 5898 205100 18598
## - gen_health 4 66437 265639 19886
##
## Step: AIC=18449.13
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + exercise +
## gen_health + income_cat + bmi
##
## Df Sum of Sq RSS AIC
## - bmi 1 32 199262 18448
## - sex 1 40 199270 18448
## <none> 199231 18449
## - sleep_hrs 1 327 199557 18455
## - age 1 439 199670 18458
## - income_cat 1 520 199751 18460
## - exercise 1 1151 200381 18476
## - menthlth_days 1 5929 205159 18594
## - gen_health 4 66459 265690 19881
##
## Step: AIC=18447.92
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + exercise +
## gen_health + income_cat
##
## Df Sum of Sq RSS AIC
## - sex 1 42 199305 18447
## <none> 199262 18448
## - sleep_hrs 1 334 199596 18454
## - age 1 427 199690 18457
## - income_cat 1 514 199776 18459
## - exercise 1 1222 200484 18477
## - menthlth_days 1 5921 205184 18592
## - gen_health 4 67347 266609 19896
##
## Step: AIC=18446.98
## physhlth_days ~ menthlth_days + sleep_hrs + age + exercise +
## gen_health + income_cat
##
## Df Sum of Sq RSS AIC
## <none> 199305 18447
## - sleep_hrs 1 337 199641 18453
## - age 1 409 199713 18455
## - income_cat 1 492 199797 18457
## - exercise 1 1214 200518 18475
## - menthlth_days 1 5882 205186 18590
## - gen_health 4 67980 267285 19906
# Forward selection using AIC
mod_null <- lm(physhlth_days ~ 1, data = brfss_ms)
mod_forward <- step(
mod_null,
scope = list(lower = mod_null, upper = mod_max),
direction = "forward",
trace = 1
)
## Start: AIC=20865.24
## physhlth_days ~ 1
##
## Df Sum of Sq RSS AIC
## + gen_health 4 115918 208518 18663
## + menthlth_days 1 29743 294693 20387
## + exercise 1 19397 305038 20559
## + income_cat 1 19104 305332 20564
## + education 3 5906 318530 20779
## + age 1 4173 320263 20803
## + bmi 1 4041 320395 20805
## + sleep_hrs 1 3717 320719 20810
## <none> 324435 20865
## + sex 1 7 324429 20867
##
## Step: AIC=18662.93
## physhlth_days ~ gen_health
##
## Df Sum of Sq RSS AIC
## + menthlth_days 1 6394.9 202123 18509
## + exercise 1 1652.4 206865 18625
## + income_cat 1 1306.9 207211 18634
## + sleep_hrs 1 756.1 207762 18647
## + bmi 1 91.2 208427 18663
## <none> 208518 18663
## + sex 1 38.5 208479 18664
## + age 1 32.2 208486 18664
## + education 3 145.0 208373 18666
##
## Step: AIC=18509.19
## physhlth_days ~ gen_health + menthlth_days
##
## Df Sum of Sq RSS AIC
## + exercise 1 1650.52 200472 18470
## + income_cat 1 817.89 201305 18491
## + age 1 464.73 201658 18500
## + sleep_hrs 1 257.79 201865 18505
## + bmi 1 90.51 202032 18509
## <none> 202123 18509
## + sex 1 3.00 202120 18511
## + education 3 111.58 202011 18512
##
## Step: AIC=18470.19
## physhlth_days ~ gen_health + menthlth_days + exercise
##
## Df Sum of Sq RSS AIC
## + income_cat 1 509.09 199963 18460
## + age 1 333.74 200139 18464
## + sleep_hrs 1 253.06 200219 18466
## <none> 200472 18470
## + bmi 1 21.21 200451 18472
## + sex 1 10.74 200462 18472
## + education 3 26.94 200445 18476
##
## Step: AIC=18459.48
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat
##
## Df Sum of Sq RSS AIC
## + age 1 321.97 199641 18453
## + sleep_hrs 1 250.25 199713 18455
## <none> 199963 18460
## + bmi 1 27.98 199935 18461
## + sex 1 27.17 199936 18461
## + education 3 26.66 199937 18465
##
## Step: AIC=18453.42
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat +
## age
##
## Df Sum of Sq RSS AIC
## + sleep_hrs 1 336.79 199305 18447
## <none> 199641 18453
## + sex 1 45.31 199596 18454
## + bmi 1 42.00 199599 18454
## + education 3 22.62 199619 18459
##
## Step: AIC=18446.98
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat +
## age + sleep_hrs
##
## Df Sum of Sq RSS AIC
## <none> 199305 18447
## + sex 1 42.328 199262 18448
## + bmi 1 34.434 199270 18448
## + education 3 24.800 199280 18452
# Stepwise selection using AIC
mod_stepwise <- step(
mod_null,
scope = list(lower = mod_null, upper = mod_max),
direction = "both",
trace = 1
)
## Start: AIC=20865.24
## physhlth_days ~ 1
##
## Df Sum of Sq RSS AIC
## + gen_health 4 115918 208518 18663
## + menthlth_days 1 29743 294693 20387
## + exercise 1 19397 305038 20559
## + income_cat 1 19104 305332 20564
## + education 3 5906 318530 20779
## + age 1 4173 320263 20803
## + bmi 1 4041 320395 20805
## + sleep_hrs 1 3717 320719 20810
## <none> 324435 20865
## + sex 1 7 324429 20867
##
## Step: AIC=18662.93
## physhlth_days ~ gen_health
##
## Df Sum of Sq RSS AIC
## + menthlth_days 1 6395 202123 18509
## + exercise 1 1652 206865 18625
## + income_cat 1 1307 207211 18634
## + sleep_hrs 1 756 207762 18647
## + bmi 1 91 208427 18663
## <none> 208518 18663
## + sex 1 38 208479 18664
## + age 1 32 208486 18664
## + education 3 145 208373 18666
## - gen_health 4 115918 324435 20865
##
## Step: AIC=18509.19
## physhlth_days ~ gen_health + menthlth_days
##
## Df Sum of Sq RSS AIC
## + exercise 1 1651 200472 18470
## + income_cat 1 818 201305 18491
## + age 1 465 201658 18500
## + sleep_hrs 1 258 201865 18505
## + bmi 1 91 202032 18509
## <none> 202123 18509
## + sex 1 3 202120 18511
## + education 3 112 202011 18512
## - menthlth_days 1 6395 208518 18663
## - gen_health 4 92570 294693 20387
##
## Step: AIC=18470.19
## physhlth_days ~ gen_health + menthlth_days + exercise
##
## Df Sum of Sq RSS AIC
## + income_cat 1 509 199963 18460
## + age 1 334 200139 18464
## + sleep_hrs 1 253 200219 18466
## <none> 200472 18470
## + bmi 1 21 200451 18472
## + sex 1 11 200462 18472
## + education 3 27 200445 18476
## - exercise 1 1651 202123 18509
## - menthlth_days 1 6393 206865 18625
## - gen_health 4 78857 279330 20121
##
## Step: AIC=18459.48
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat
##
## Df Sum of Sq RSS AIC
## + age 1 322 199641 18453
## + sleep_hrs 1 250 199713 18455
## <none> 199963 18460
## + bmi 1 28 199935 18461
## + sex 1 27 199936 18461
## + education 3 27 199937 18465
## - income_cat 1 509 200472 18470
## - exercise 1 1342 201305 18491
## - menthlth_days 1 5988 205952 18605
## - gen_health 4 72713 272676 20002
##
## Step: AIC=18453.42
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat +
## age
##
## Df Sum of Sq RSS AIC
## + sleep_hrs 1 337 199305 18447
## <none> 199641 18453
## + sex 1 45 199596 18454
## + bmi 1 42 199599 18454
## + education 3 23 199619 18459
## - age 1 322 199963 18460
## - income_cat 1 497 200139 18464
## - exercise 1 1231 200873 18482
## - menthlth_days 1 6304 205945 18607
## - gen_health 4 68936 268577 19929
##
## Step: AIC=18446.98
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat +
## age + sleep_hrs
##
## Df Sum of Sq RSS AIC
## <none> 199305 18447
## + sex 1 42 199262 18448
## + bmi 1 34 199270 18448
## + education 3 25 199280 18452
## - sleep_hrs 1 337 199641 18453
## - age 1 409 199713 18455
## - income_cat 1 492 199797 18457
## - exercise 1 1214 200518 18475
## - menthlth_days 1 5882 205186 18590
## - gen_health 4 67980 267285 19906
# Show final coefficients for backward model
tidy(mod_backward, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Backward Elimination Result (AIC-based)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 3.1864 | 0.6663 | 4.7819 | 0.0000 | 1.8800 | 4.4927 |
| menthlth_days | 0.1461 | 0.0120 | 12.1352 | 0.0000 | 0.1225 | 0.1697 |
| sleep_hrs | -0.1951 | 0.0672 | -2.9038 | 0.0037 | -0.3269 | -0.0634 |
| age | 0.0174 | 0.0054 | 3.1981 | 0.0014 | 0.0067 | 0.0281 |
| exerciseYes | -1.2877 | 0.2336 | -5.5127 | 0.0000 | -1.7457 | -0.8298 |
| gen_healthVery good | 0.4617 | 0.2441 | 1.8914 | 0.0586 | -0.0169 | 0.9403 |
| gen_healthGood | 1.6368 | 0.2600 | 6.2953 | 0.0000 | 1.1271 | 2.1465 |
| gen_healthFair | 7.0787 | 0.3616 | 19.5735 | 0.0000 | 6.3697 | 7.7876 |
| gen_healthPoor | 20.5084 | 0.5423 | 37.8149 | 0.0000 | 19.4452 | 21.5716 |
| income_cat | -0.1657 | 0.0472 | -3.5115 | 0.0004 | -0.2582 | -0.0732 |
# Show final coefficients for forward model
tidy(mod_forward, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Forward Selection Result (AIC-based)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 3.1864 | 0.6663 | 4.7819 | 0.0000 | 1.8800 | 4.4927 |
| gen_healthVery good | 0.4617 | 0.2441 | 1.8914 | 0.0586 | -0.0169 | 0.9403 |
| gen_healthGood | 1.6368 | 0.2600 | 6.2953 | 0.0000 | 1.1271 | 2.1465 |
| gen_healthFair | 7.0787 | 0.3616 | 19.5735 | 0.0000 | 6.3697 | 7.7876 |
| gen_healthPoor | 20.5084 | 0.5423 | 37.8149 | 0.0000 | 19.4452 | 21.5716 |
| menthlth_days | 0.1461 | 0.0120 | 12.1352 | 0.0000 | 0.1225 | 0.1697 |
| exerciseYes | -1.2877 | 0.2336 | -5.5127 | 0.0000 | -1.7457 | -0.8298 |
| income_cat | -0.1657 | 0.0472 | -3.5115 | 0.0004 | -0.2582 | -0.0732 |
| age | 0.0174 | 0.0054 | 3.1981 | 0.0014 | 0.0067 | 0.0281 |
| sleep_hrs | -0.1951 | 0.0672 | -2.9038 | 0.0037 | -0.3269 | -0.0634 |
# Show final coefficients for stepwise model
tidy(mod_stepwise, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Stepwise Selection Result (AIC-based)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 3.1864 | 0.6663 | 4.7819 | 0.0000 | 1.8800 | 4.4927 |
| gen_healthVery good | 0.4617 | 0.2441 | 1.8914 | 0.0586 | -0.0169 | 0.9403 |
| gen_healthGood | 1.6368 | 0.2600 | 6.2953 | 0.0000 | 1.1271 | 2.1465 |
| gen_healthFair | 7.0787 | 0.3616 | 19.5735 | 0.0000 | 6.3697 | 7.7876 |
| gen_healthPoor | 20.5084 | 0.5423 | 37.8149 | 0.0000 | 19.4452 | 21.5716 |
| menthlth_days | 0.1461 | 0.0120 | 12.1352 | 0.0000 | 0.1225 | 0.1697 |
| exerciseYes | -1.2877 | 0.2336 | -5.5127 | 0.0000 | -1.7457 | -0.8298 |
| income_cat | -0.1657 | 0.0472 | -3.5115 | 0.0004 | -0.2582 | -0.0732 |
| age | 0.0174 | 0.0054 | 3.1981 | 0.0014 | 0.0067 | 0.0281 |
| sleep_hrs | -0.1951 | 0.0672 | -2.9038 | 0.0037 | -0.3269 | -0.0634 |
## Variables remaining in backward model:
## menthlth_days
## sleep_hrs
## age
## exerciseYes
## gen_healthVery good
## gen_healthGood
## gen_healthFair
## gen_healthPoor
## income_cat
##
##
## Variables remaining in forward model:
## gen_healthVery good
## gen_healthGood
## gen_healthFair
## gen_healthPoor
## menthlth_days
## exerciseYes
## income_cat
## age
## sleep_hrs
##
##
## Variables remaining in stepwise model:
## gen_healthVery good
## gen_healthGood
## gen_healthFair
## gen_healthPoor
## menthlth_days
## exerciseYes
## income_cat
## age
## sleep_hrs
# Compare all methods in one table
method_comparison <- tribble(
~Method, ~`Variables selected`, ~`Adj. R²`, ~AIC, ~BIC,
"Backward (AIC)",
length(coef(mod_backward)) - 1,
round(glance(mod_backward)$adj.r.squared, 4),
round(AIC(mod_backward), 1),
round(BIC(mod_backward), 1),
"Forward (AIC)",
length(coef(mod_forward)) - 1,
round(glance(mod_forward)$adj.r.squared, 4),
round(AIC(mod_forward), 1),
round(BIC(mod_forward), 1),
"Stepwise (AIC)",
length(coef(mod_stepwise)) - 1,
round(glance(mod_stepwise)$adj.r.squared, 4),
round(AIC(mod_stepwise), 1),
round(BIC(mod_stepwise), 1)
)
method_comparison |>
kable(caption = "Comparison of Automated Variable Selection Methods") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| Method | Variables selected | Adj. R² | AIC | BIC |
|---|---|---|---|---|
| Backward (AIC) | 9 | 0.3846 | 32638.4 | 32710.1 |
| Forward (AIC) | 9 | 0.3846 | 32638.4 | 32710.1 |
| Stepwise (AIC) | 9 | 0.3846 | 32638.4 | 32710.1 |
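The AIC reported in this table (32638.4) does not match the AIC in the final `step()` trace (18446.98), although both describe the same model. A minimal sketch of why, using the built-in `mtcars` data as a stand-in (an assumption for illustration, not the BRFSS sample): `step()` ranks models with `extractAIC()`, which for `lm` fits drops additive constants that `AIC()` keeps. The gap depends only on the sample size, so model rankings are the same under either version.

```r
# Illustration on mtcars (stand-in data, not the lab's BRFSS sample)
fit_small <- lm(mpg ~ wt, data = mtcars)
fit_large <- lm(mpg ~ wt + hp + qsec, data = mtcars)

n <- nrow(mtcars)

# AIC():        uses the full Gaussian log-likelihood and counts sigma as a parameter
# extractAIC(): n * log(RSS / n) + 2 * (number of coefficients); this is what step() prints
aic_gap <- function(fit) AIC(fit) - extractAIC(fit)[2]

gap_small <- aic_gap(fit_small)
gap_large <- aic_gap(fit_large)

# For lm fits the gap is a constant that depends only on n
expected_gap <- n * (log(2 * pi) + 1) + 2

print(c(small = gap_small, large = gap_large, expected = expected_gap))
```

Because the gap is identical for every model fitted to the same rows, comparing candidates by `AIC()` or by the `step()` trace leads to the same choice; only the absolute numbers differ.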
For this task, the exposure is sleep_hrs and the outcome is physhlth_days. You are building an associative model to estimate the effect of sleep on physical health.
4a. (5 pts) Fit the crude model: physhlth_days ~ sleep_hrs. Report the sleep coefficient.
4a Ans In the crude model, the coefficient for sleep_hrs is −0.6321 (95% CI: −0.7949 to −0.4693), so each additional hour of sleep is associated with about 0.63 fewer physically unhealthy days. This unadjusted estimate suggests a protective association between sleep and physical health.
4b. (10 pts) Fit the maximum associative model: physhlth_days ~ sleep_hrs + [all other covariates]. Note the adjusted sleep coefficient and compute the 10% interval. Then systematically remove each covariate one at a time and determine which are confounders using the 10% rule. Present your results in a summary table.
4b Ans In the maximum associative model, the adjusted coefficient for sleep_hrs was −0.1930, which gives a 10% interval of −0.2123 to −0.1737. Each covariate was then removed one at a time, and the resulting sleep coefficient was compared with this interval. Removing menthlth_days (about a 50% change), gen_health (about 86%), or age (about 15%) moved the sleep coefficient outside the interval, so these three variables were classified as confounders. Removing sex, education, exercise, income_cat, or bmi changed the estimate by less than 2% each, so these variables were left out of the final model.
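The 10% check in 4b reduces to one ratio per covariate. As a worked example for menthlth_days, using the rounded coefficients from the confounding assessment table in this lab (so the result differs slightly from the tabled 49.99%, which is computed from unrounded values):

\[
\%\Delta
= \frac{\hat\beta_{\text{reduced}} - \hat\beta_{\text{max}}}{\hat\beta_{\text{max}}} \times 100
= \frac{-0.2894 - (-0.1930)}{-0.1930} \times 100
\approx 49.9\%
\]

Since \(|49.9\%| > 10\%\), menthlth_days is classified as a confounder of the sleep and physical health association.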
4c. (5 pts) Fit the final associative model including only sleep and the identified confounders. Report the sleep coefficient and its 95% CI.
4c Ans In the final model, the coefficient for sleep_hrs is −0.2026 (95% CI: −0.3349 to −0.0702, p = 0.0027). This means that each additional hour of sleep is associated with about 0.20 fewer physically unhealthy days, after adjusting for confounders. Since the confidence interval does not include zero, this association is statistically significant.
4d. (5 pts) A reviewer asks: “Why didn’t you just use stepwise selection?” Write a 3–4 sentence response explaining why automated selection is inappropriate for this associative analysis.
4d Ans Stepwise selection is not appropriate here because it selects variables based only on statistical fit, not on whether they are confounders. In epidemiology, our goal is to estimate the true effect of sleep, not just build the best predictive model. Stepwise methods might remove important confounders if they are not statistically significant, which can bias the results. The 10% rule is better because it focuses on keeping variables that affect the exposure–outcome relationship.
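The point in 4d can be made concrete with a small simulation. This is a sketch on made-up data (the variables `x`, `y`, `z` and the effect sizes are hypothetical, not taken from BRFSS): the change-in-estimate check looks directly at how the exposure coefficient moves when a covariate is dropped, which is something fit-based criteria such as AIC never examine.

```r
# Simulated data (hypothetical; not BRFSS): z confounds the x -> y association
set.seed(553)
n <- 2000
z <- rnorm(n)                       # confounder
x <- 0.8 * z + rnorm(n)             # exposure depends on z
y <- 1.0 * x + 2.0 * z + rnorm(n)   # true exposure coefficient is 1

b_crude <- unname(coef(lm(y ~ x))["x"])       # z omitted
b_adj   <- unname(coef(lm(y ~ x + z))["x"])   # z retained

# Change in the exposure coefficient relative to the adjusted model
pct_change <- (b_crude - b_adj) / b_adj * 100

print(round(c(crude = b_crude, adjusted = b_adj, pct_change = pct_change), 2))
```

In this setup the crude coefficient is far from the true value of 1 while the adjusted coefficient is close to it, so the percent change is well above 10% and `z` would be kept under the 10% rule regardless of what a selection algorithm decided.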
# Task 4: Associative Model Building
# Exposure = sleep_hrs
# Outcome = physhlth_days
# 4a. Crude model
mod_crude <- lm(physhlth_days ~ sleep_hrs, data = brfss_ms)
tidy(mod_crude, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Crude Model: physhlth_days ~ sleep_hrs",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 7.9110 | 0.5959 | 13.2755 | 0 | 6.7428 | 9.0793 |
| sleep_hrs | -0.6321 | 0.0831 | -7.6104 | 0 | -0.7949 | -0.4693 |
# 4b. Maximum associative model
mod_assoc_max <- lm(
physhlth_days ~ sleep_hrs + menthlth_days + age + sex +
education + exercise + gen_health + income_cat + bmi,
data = brfss_ms
)
# Extract adjusted sleep coefficient from the maximum associative model
b_sleep_max <- coef(mod_assoc_max)["sleep_hrs"]
interval_low <- b_sleep_max - 0.10 * abs(b_sleep_max)
interval_high <- b_sleep_max + 0.10 * abs(b_sleep_max)
cat("Adjusted sleep coefficient from maximum associative model:", round(b_sleep_max, 4), "\n")
## Adjusted sleep coefficient from maximum associative model: -0.193
## 10% interval: -0.2123 to -0.1737
# Systematically remove each covariate one at a time
covariates <- c("menthlth_days", "age", "sex", "education",
"exercise", "gen_health", "income_cat", "bmi")
confounding_results <- map_dfr(covariates, \(var) {
reduced_formula <- as.formula(
paste(
"physhlth_days ~ sleep_hrs +",
paste(setdiff(covariates, var), collapse = " + ")
)
)
mod_reduced <- lm(reduced_formula, data = brfss_ms)
b_reduced <- coef(mod_reduced)["sleep_hrs"]
pct_change <- ((b_reduced - b_sleep_max) / b_sleep_max) * 100
tibble(
Removed_Variable = var,
Sleep_Coefficient = b_reduced,
Percent_Change = pct_change,
Within_10pct_Interval = ifelse(b_reduced >= interval_low & b_reduced <= interval_high, "Yes", "No"),
Confounder = ifelse(b_reduced >= interval_low & b_reduced <= interval_high, "No", "Yes")
)
})
confounding_results |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Confounding Assessment Using the 10% Change-in-Estimate Rule"
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| Removed_Variable | Sleep_Coefficient | Percent_Change | Within_10pct_Interval | Confounder |
|---|---|---|---|---|
| menthlth_days | -0.2894 | 49.9895 | No | Yes |
| age | -0.1646 | -14.7190 | No | Yes |
| sex | -0.1937 | 0.3961 | Yes | No |
| education | -0.1923 | -0.3407 | Yes | No |
| exercise | -0.1957 | 1.4096 | Yes | No |
| gen_health | -0.3593 | 86.1810 | No | Yes |
| income_cat | -0.1936 | 0.3498 | Yes | No |
| bmi | -0.1950 | 1.0432 | Yes | No |
# Identify confounders
confounders <- confounding_results |>
filter(Confounder == "Yes") |>
pull(Removed_Variable)
cat("\nIdentified confounders:\n")
##
## Identified confounders:
## menthlth_days
## age
## gen_health
# 4c. Final associative model with sleep + identified confounders
if (length(confounders) == 0) {
final_formula <- as.formula("physhlth_days ~ sleep_hrs")
} else {
final_formula <- as.formula(
paste("physhlth_days ~ sleep_hrs +", paste(confounders, collapse = " + "))
)
}
mod_assoc_final <- lm(final_formula, data = brfss_ms)
tidy(mod_assoc_final, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Final Associative Model",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 0.8151 | 0.5615 | 1.4516 | 0.1467 | -0.2857 | 1.9158 |
| sleep_hrs | -0.2026 | 0.0675 | -3.0003 | 0.0027 | -0.3349 | -0.0702 |
| menthlth_days | 0.1512 | 0.0120 | 12.5637 | 0.0000 | 0.1276 | 0.1748 |
| age | 0.0205 | 0.0054 | 3.7595 | 0.0002 | 0.0098 | 0.0312 |
| gen_healthVery good | 0.5113 | 0.2451 | 2.0860 | 0.0370 | 0.0308 | 0.9919 |
| gen_healthGood | 1.9151 | 0.2579 | 7.4255 | 0.0000 | 1.4095 | 2.4207 |
| gen_healthFair | 7.7686 | 0.3488 | 22.2693 | 0.0000 | 7.0847 | 8.4524 |
| gen_healthPoor | 21.4868 | 0.5266 | 40.8018 | 0.0000 | 20.4544 | 22.5192 |
# Extract sleep coefficient and 95% CI from final model
sleep_final <- tidy(mod_assoc_final, conf.int = TRUE) |>
filter(term == "sleep_hrs") |>
mutate(across(where(is.numeric), \(x) round(x, 4)))
sleep_final |>
kable(
caption = "Sleep Coefficient and 95% CI from the Final Associative Model",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| sleep_hrs | -0.2026 | 0.0675 | -3.0003 | 0.0027 | -0.3349 | -0.0702 |
5a. (10 pts) You have now built two models for the same data:
Compare these two models: Do they include the same variables? Is the sleep coefficient similar? Why might they differ?
5a Ans The predictive model and the associative model do not include the same variables. The predictive model retains the variables that improve overall fit: gen_health, menthlth_days, exercise, income_cat, age, and sleep_hrs, and it leaves out sex, education, and bmi. The associative model includes only sleep_hrs and the covariates identified as confounders by the 10% rule: menthlth_days, age, and gen_health. As a result, exercise and income_cat appear in the predictive model but not in the associative model.
The sleep coefficient is similar in both models (−0.1951 in the AIC-selected model and −0.2026 in the associative model), which suggests that the estimate is stable across these two adjustment sets. The models still differ in purpose. The predictive model aims to explain as much variation in the outcome as possible, while the associative model aims to give a valid estimate of the sleep association. Because the goals differ, the variables kept in each model differ as well.
5b. (10 pts) Write a 4–5 sentence paragraph for a public health audience describing the results of your associative model. Include:
Do not use statistical jargon.
5b Ans Our results show that sleep is linked to physical health. After accounting for mental health, age, and general health status, each additional hour of sleep is associated with about 0.20 fewer physically unhealthy days. In other words, people who sleep more tend to report fewer days of poor physical health. The largest differences were tied to overall health: people who rated their health as poor reported about 21 more unhealthy days than people in the healthiest group. However, because the survey measured sleep and health at the same point in time, we cannot be sure that more sleep directly causes better health, only that the two are related.
End of Lab Activity