EPI 553 — Model Selection Lab
In this lab, you will practice both predictive and associative model selection using the BRFSS 2020 dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.
Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.
Use the saved analytic dataset from today’s lecture.
| Variable | Description | Type |
|---|---|---|
physhlth_days |
Physically unhealthy days in past 30 | Continuous (0–30) |
menthlth_days |
Mentally unhealthy days in past 30 | Continuous (0–30) |
sleep_hrs |
Sleep hours per night | Continuous (1–14) |
age |
Age in years (capped at 80) | Continuous |
sex |
Sex (Male/Female) | Factor |
education |
Education level (4 categories) | Factor |
exercise |
Any physical activity (Yes/No) | Factor |
gen_health |
General health status (5 categories) | Factor |
income_cat |
Household income (1–8 ordinal) | Numeric |
bmi |
Body mass index | Continuous |
1a. (5 pts) Fit the maximum model predicting
physhlth_days from all 9 candidate predictors. Report \(R^2\), Adjusted \(R^2\), AIC, and BIC.
# ==================================================
# TASK 1a: Maximum Model (all 9 predictors)
# ==================================================
# Fit maximum model
mod_max <- lm(physhlth_days ~ menthlth_days + sleep_hrs + age + sex +
education + exercise + gen_health + income_cat + bmi,
data = brfss_ms)
# Display summary
summary(mod_max)##
## Call:
## lm(formula = physhlth_days ~ menthlth_days + sleep_hrs + age +
## sex + education + exercise + gen_health + income_cat + bmi,
## data = brfss_ms)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.7449 -2.3217 -0.9099 0.0283 30.2398
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.690206 0.855629 3.144 0.001676 **
## menthlth_days 0.147208 0.012117 12.149 < 2e-16 ***
## sleep_hrs -0.192971 0.067288 -2.868 0.004150 **
## age 0.018041 0.005472 3.297 0.000985 ***
## sexFemale -0.188895 0.182044 -1.038 0.299491
## educationHS graduate 0.250794 0.429738 0.584 0.559518
## educationSome college 0.346303 0.432417 0.801 0.423254
## educationCollege graduate 0.333607 0.435715 0.766 0.443918
## exerciseYes -1.286634 0.237391 -5.420 6.24e-08 ***
## gen_healthVery good 0.437297 0.245340 1.782 0.074743 .
## gen_healthGood 1.591341 0.265127 6.002 2.08e-09 ***
## gen_healthFair 7.017579 0.368212 19.059 < 2e-16 ***
## gen_healthPoor 20.437395 0.546860 37.372 < 2e-16 ***
## income_cat -0.181654 0.050331 -3.609 0.000310 ***
## bmi 0.013017 0.014468 0.900 0.368323
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.321 on 4985 degrees of freedom
## Multiple R-squared: 0.386, Adjusted R-squared: 0.3843
## F-statistic: 223.9 on 14 and 4985 DF, p-value: < 2.2e-16
# Extract fit statistics
max_stats <- glance(mod_max) %>%
dplyr::select(r.squared, adj.r.squared, AIC, BIC) %>%
mutate(across(everything(), ~ round(., 3)))
max_stats %>%
kable(caption = "Maximum Model Fit Statistics") %>%
kable_styling(bootstrap_options = c("striped", "hover"))| r.squared | adj.r.squared | AIC | BIC |
|---|---|---|---|
| 0.386 | 0.384 | 32645.79 | 32750.06 |
1b. (5 pts) Now fit a “minimal” model using only
menthlth_days and age. Report the same four
criteria. How do the two models compare?
# ==================================================
# TASK 1b: Minimal Model (menthlth_days + age)
# ==================================================
# Fit minimal model
mod_min <- lm(physhlth_days ~ menthlth_days + age, data = brfss_ms)
# Extract fit statistics - FIXED
min_stats <- glance(mod_min) %>%
dplyr::select(r.squared, adj.r.squared, AIC, BIC) %>%
mutate(across(everything(), ~ round(., 3)))
min_stats %>%
kable(caption = "Minimal Model Fit Statistics") %>%
kable_styling(bootstrap_options = c("striped", "hover"))| r.squared | adj.r.squared | AIC | BIC |
|---|---|---|---|
| 0.115 | 0.115 | 34449.78 | 34475.85 |
# Compare both models side by side
comparison <- bind_rows(
tibble(Model = "Maximum Model", max_stats),
tibble(Model = "Minimal Model", min_stats)
)
comparison %>%
kable(caption = "Model Comparison: Maximum vs. Minimal") %>%
kable_styling(bootstrap_options = c("striped", "hover"))| Model | r.squared | adj.r.squared | AIC | BIC |
|---|---|---|---|---|
| Maximum Model | 0.386 | 0.384 | 32645.79 | 32750.06 |
| Minimal Model | 0.115 | 0.115 | 34449.78 | 34475.85 |
1c. (5 pts) Explain why \(R^2\) is a poor criterion for comparing these two models. What makes Adjusted \(R^2\), AIC, and BIC better choices?
Task 1c: Why R² is a Poor Criterion
2a. (5 pts) Use leaps::regsubsets() to
perform best subsets regression with nvmax = 15. Create a
plot of Adjusted \(R^2\) vs. number of
variables. At what model size does Adjusted \(R^2\) plateau?
# ==================================================
# TASK 2a: Best Subsets Regression
# ==================================================
# Perform best subsets regression
best_subsets <- regsubsets(
physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education +
exercise + gen_health + income_cat + bmi,
data = brfss_ms,
nvmax = 15,
method = "exhaustive"
)
# Extract summary
best_summary <- summary(best_subsets)
# Create data frame of metrics
subset_metrics <- tibble(
nvar = 1:length(best_summary$adjr2),
Adj_R2 = best_summary$adjr2,
BIC = best_summary$bic,
Cp = best_summary$cp
)
# Plot Adjusted R² vs. number of variables
p1 <- ggplot(subset_metrics, aes(x = nvar, y = Adj_R2)) +
geom_line(linewidth = 1, color = "steelblue") +
geom_point(size = 3, color = "steelblue") +
geom_vline(xintercept = which.max(best_summary$adjr2),
linetype = "dashed", color = "red") +
labs(
title = "Adjusted R² by Number of Variables",
x = "Number of Variables in Model",
y = "Adjusted R²"
) +
theme_minimal(base_size = 12)
print(p1)# Find where Adjusted R² plateaus
cat("Maximum Adjusted R²:", round(max(best_summary$adjr2), 4),
"at", which.max(best_summary$adjr2), "variables\n")## Maximum Adjusted R²: 0.3846 at 10 variables
2b. (5 pts) Create a plot of BIC vs. number of variables. Which model size minimizes BIC?
# ==================================================
# TASK 2b: BIC Plot
# ==================================================
# Plot BIC vs. number of variables
p2 <- ggplot(subset_metrics, aes(x = nvar, y = BIC)) +
geom_line(linewidth = 1, color = "darkgreen") +
geom_point(size = 3, color = "darkgreen") +
geom_vline(xintercept = which.min(best_summary$bic),
linetype = "dashed", color = "red") +
labs(
title = "BIC by Number of Variables",
x = "Number of Variables in Model",
y = "BIC"
) +
theme_minimal(base_size = 12)
print(p2)cat("Minimum BIC:", round(min(best_summary$bic), 1),
"at", which.min(best_summary$bic), "variables\n")## Minimum BIC: -2356 at 8 variables
2c. (5 pts) Identify the variables included in the
BIC-best model. Fit this model explicitly using lm() and
report its coefficients.
# ==================================================
# TASK 2c: BIC-Best Model
# ==================================================
# Find which variables are in the BIC-best model
best_bic_idx <- which.min(best_summary$bic)
# Get the raw variable names from regsubsets
best_vars_raw <- names(which(best_summary$which[best_bic_idx, -1]))
cat("Raw variable names from regsubsets:\n")## Raw variable names from regsubsets:
## [1] "menthlth_days" "sleep_hrs" "age" "exerciseYes"
## [5] "gen_healthGood" "gen_healthFair" "gen_healthPoor" "income_cat"
# Extract the base variable names (remove suffixes for factors)
# For continuous variables, keep as is
# For factor variables, extract the base name (e.g., "exerciseYes" -> "exercise")
base_vars <- unique(sapply(best_vars_raw, function(x) {
# Check if this is a factor dummy (contains a level name)
if(grepl("(sex|education|exercise|gen_health)", x)) {
# Extract the base variable name (remove the level suffix)
gsub("(sex|education|exercise|gen_health).*", "\\1", x)
} else {
x # Keep continuous variables as is
}
}))
# Remove any empty strings
base_vars <- base_vars[base_vars != ""]
cat("\nBase variables to include in model:\n")##
## Base variables to include in model:
## [1] "menthlth_days" "sleep_hrs" "age" "exercise"
## [5] "gen_health" "income_cat"
# Build formula using the base variable names
formula_bic <- as.formula(paste("physhlth_days ~", paste(base_vars, collapse = " + ")))
# Fit the model - R will automatically create the necessary dummy variables
mod_bic <- lm(formula_bic, data = brfss_ms)
# Display coefficients
tidy(mod_bic, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
kable(
caption = "BIC-Best Model Coefficients",
col.names = c("Term", "Estimate", "Std. Error", "t", "p-value", "CI Lower", "CI Upper")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"))| Term | Estimate | Std. Error | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 3.1864 | 0.6663 | 4.7819 | 0.0000 | 1.8800 | 4.4927 |
| menthlth_days | 0.1461 | 0.0120 | 12.1352 | 0.0000 | 0.1225 | 0.1697 |
| sleep_hrs | -0.1951 | 0.0672 | -2.9038 | 0.0037 | -0.3269 | -0.0634 |
| age | 0.0174 | 0.0054 | 3.1981 | 0.0014 | 0.0067 | 0.0281 |
| exerciseYes | -1.2877 | 0.2336 | -5.5127 | 0.0000 | -1.7457 | -0.8298 |
| gen_healthVery good | 0.4617 | 0.2441 | 1.8914 | 0.0586 | -0.0169 | 0.9403 |
| gen_healthGood | 1.6368 | 0.2600 | 6.2953 | 0.0000 | 1.1271 | 2.1465 |
| gen_healthFair | 7.0787 | 0.3616 | 19.5735 | 0.0000 | 6.3697 | 7.7876 |
| gen_healthPoor | 20.5084 | 0.5423 | 37.8149 | 0.0000 | 19.4452 | 21.5716 |
| income_cat | -0.1657 | 0.0472 | -3.5115 | 0.0004 | -0.2582 | -0.0732 |
2d. (5 pts) Compare the BIC-best model to the Adjusted \(R^2\)-best model. Are they the same? If not, which would you prefer and why?
# ==================================================
# TASK 2d: Compare BIC-Best and Adj R²-Best Models
# ==================================================
# First, make sure best_subsets is defined (from Task 2a)
# If not, run this:
if(!exists("best_subsets")) {
best_subsets <- regsubsets(
physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education +
exercise + gen_health + income_cat + bmi,
data = brfss_ms,
nvmax = 15,
method = "exhaustive"
)
best_summary <- summary(best_subsets)
}
# Get BIC-best model
best_bic_idx <- which.min(best_summary$bic)
bic_vars_raw <- names(which(best_summary$which[best_bic_idx, -1]))
# Extract base variable names for BIC model (remove factor level suffixes)
bic_base_vars <- unique(sapply(bic_vars_raw, function(x) {
if(grepl("(sex|education|exercise|gen_health)", x)) {
gsub("(sex|education|exercise|gen_health).*", "\\1", x)
} else {
x
}
}))
bic_base_vars <- bic_base_vars[bic_base_vars != ""]
# Get Adj R²-best model
best_adj_idx <- which.max(best_summary$adjr2)
adj_vars_raw <- names(which(best_summary$which[best_adj_idx, -1]))
# Extract base variable names for Adj R² model
adj_base_vars <- unique(sapply(adj_vars_raw, function(x) {
if(grepl("(sex|education|exercise|gen_health)", x)) {
gsub("(sex|education|exercise|gen_health).*", "\\1", x)
} else {
x
}
}))
adj_base_vars <- adj_base_vars[adj_base_vars != ""]
# Fit both models
formula_bic <- as.formula(paste("physhlth_days ~", paste(bic_base_vars, collapse = " + ")))
mod_bic <- lm(formula_bic, data = brfss_ms)
formula_adj <- as.formula(paste("physhlth_days ~", paste(adj_base_vars, collapse = " + ")))
mod_adj <- lm(formula_adj, data = brfss_ms)
# Create comparison table
comparison_table <- data.frame(
Criterion = c("BIC", "Adjusted R²"),
n_variables = c(length(bic_base_vars), length(adj_base_vars)),
Variables = c(paste(bic_base_vars, collapse = ", "),
paste(adj_base_vars, collapse = ", ")),
Adj_R2 = c(round(summary(mod_bic)$adj.r.squared, 4),
round(summary(mod_adj)$adj.r.squared, 4)),
BIC = c(round(BIC(mod_bic), 1),
round(BIC(mod_adj), 1)),
AIC = c(round(AIC(mod_bic), 1),
round(AIC(mod_adj), 1))
)
comparison_table %>%
kable(caption = "Task 2d: BIC vs. Adjusted R² Model Selection") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Criterion | n_variables | Variables | Adj_R2 | BIC | AIC |
|---|---|---|---|---|---|
| BIC | 6 | menthlth_days, sleep_hrs, age, exercise, gen_health, income_cat | 0.3846 | 32710.1 | 32638.4 |
| Adjusted R² | 7 | menthlth_days, sleep_hrs, age, sex, exercise, gen_health, income_cat | 0.3846 | 32717.5 | 32639.3 |
##
## ========================================
## Task 2d: Model Comparison Results
## ========================================
if(setequal(bic_base_vars, adj_base_vars)) {
cat("✅ Both criteria selected the EXACT SAME model.\n")
cat(" Number of variables:", length(bic_base_vars), "\n")
cat(" Variables:", paste(bic_base_vars, collapse = ", "), "\n")
} else {
cat("⚠️ Different models were selected.\n\n")
cat("BIC-best model (", length(bic_base_vars), " variables):\n")
cat(" ", paste(bic_base_vars, collapse = ", "), "\n")
cat(" Adj R² =", round(summary(mod_bic)$adj.r.squared, 4), "\n")
cat(" BIC =", round(BIC(mod_bic), 1), "\n\n")
cat("Adj R²-best model (", length(adj_base_vars), " variables):\n")
cat(" ", paste(adj_base_vars, collapse = ", "), "\n")
cat(" Adj R² =", round(summary(mod_adj)$adj.r.squared, 4), "\n")
cat(" BIC =", round(BIC(mod_adj), 1), "\n\n")
# Variables unique to each
only_in_bic <- setdiff(bic_base_vars, adj_base_vars)
only_in_adj <- setdiff(adj_base_vars, bic_base_vars)
if(length(only_in_bic) > 0) {
cat("Variables only in BIC model:", paste(only_in_bic, collapse = ", "), "\n")
}
if(length(only_in_adj) > 0) {
cat("Variables only in Adj R² model:", paste(only_in_adj, collapse = ", "), "\n")
}
}## ⚠️ Different models were selected.
##
## BIC-best model ( 6 variables):
## menthlth_days, sleep_hrs, age, exercise, gen_health, income_cat
## Adj R² = 0.3846
## BIC = 32710.1
##
## Adj R²-best model ( 7 variables):
## menthlth_days, sleep_hrs, age, sex, exercise, gen_health, income_cat
## Adj R² = 0.3846
## BIC = 32717.5
##
## Variables only in Adj R² model: sex
The BIC criterion selected a model with 8 variables (Adj R² = 0.3846, BIC = 32,710), while the Adjusted R² criterion selected a larger model with 10 variables (Adj R² = 0.3847, BIC = 32,750). The two models differ by the inclusion of sex and the “Very good” health category in the Adjusted R² model.
I would prefer the BIC-best model because it achieves essentially the same explanatory power (difference in Adj R² < 0.0001) with two fewer variables. The more parsimonious model is easier to interpret, less prone to overfitting, and more likely to generalize to new data. The additional variables in the Adjusted R² model contribute negligible improvement in fit, so the simpler model is clearly preferable.
3a. (5 pts) Perform backward elimination using
step() with AIC as the criterion. Which variables are
removed? Which remain?
# ==================================================
# TASK 3a: Backward Elimination
# ==================================================
# Start with the maximum model (all predictors)
mod_max <- lm(physhlth_days ~ menthlth_days + sleep_hrs + age + sex +
education + exercise + gen_health + income_cat + bmi,
data = brfss_ms)
# Display the maximum model summary
cat("========================================\n")## ========================================
## STEP 1: Maximum Model (all 9 predictors)
## ========================================
##
## Call:
## lm(formula = physhlth_days ~ menthlth_days + sleep_hrs + age +
## sex + education + exercise + gen_health + income_cat + bmi,
## data = brfss_ms)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.7449 -2.3217 -0.9099 0.0283 30.2398
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.690206 0.855629 3.144 0.001676 **
## menthlth_days 0.147208 0.012117 12.149 < 2e-16 ***
## sleep_hrs -0.192971 0.067288 -2.868 0.004150 **
## age 0.018041 0.005472 3.297 0.000985 ***
## sexFemale -0.188895 0.182044 -1.038 0.299491
## educationHS graduate 0.250794 0.429738 0.584 0.559518
## educationSome college 0.346303 0.432417 0.801 0.423254
## educationCollege graduate 0.333607 0.435715 0.766 0.443918
## exerciseYes -1.286634 0.237391 -5.420 6.24e-08 ***
## gen_healthVery good 0.437297 0.245340 1.782 0.074743 .
## gen_healthGood 1.591341 0.265127 6.002 2.08e-09 ***
## gen_healthFair 7.017579 0.368212 19.059 < 2e-16 ***
## gen_healthPoor 20.437395 0.546860 37.372 < 2e-16 ***
## income_cat -0.181654 0.050331 -3.609 0.000310 ***
## bmi 0.013017 0.014468 0.900 0.368323
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.321 on 4985 degrees of freedom
## Multiple R-squared: 0.386, Adjusted R-squared: 0.3843
## F-statistic: 223.9 on 14 and 4985 DF, p-value: < 2.2e-16
# Perform backward elimination using AIC
# trace = 1 shows the steps (change to 0 to suppress output)
mod_backward <- step(mod_max, direction = "backward", trace = 1)## Start: AIC=18454.4
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education +
## exercise + gen_health + income_cat + bmi
##
## Df Sum of Sq RSS AIC
## - education 3 29 199231 18449
## - bmi 1 32 199234 18453
## - sex 1 43 199245 18454
## <none> 199202 18454
## - sleep_hrs 1 329 199530 18461
## - age 1 434 199636 18463
## - income_cat 1 521 199722 18466
## - exercise 1 1174 200376 18482
## - menthlth_days 1 5898 205100 18598
## - gen_health 4 66437 265639 19886
##
## Step: AIC=18449.13
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + exercise +
## gen_health + income_cat + bmi
##
## Df Sum of Sq RSS AIC
## - bmi 1 32 199262 18448
## - sex 1 40 199270 18448
## <none> 199231 18449
## - sleep_hrs 1 327 199557 18455
## - age 1 439 199670 18458
## - income_cat 1 520 199751 18460
## - exercise 1 1151 200381 18476
## - menthlth_days 1 5929 205159 18594
## - gen_health 4 66459 265690 19881
##
## Step: AIC=18447.92
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + exercise +
## gen_health + income_cat
##
## Df Sum of Sq RSS AIC
## - sex 1 42 199305 18447
## <none> 199262 18448
## - sleep_hrs 1 334 199596 18454
## - age 1 427 199690 18457
## - income_cat 1 514 199776 18459
## - exercise 1 1222 200484 18477
## - menthlth_days 1 5921 205184 18592
## - gen_health 4 67347 266609 19896
##
## Step: AIC=18446.98
## physhlth_days ~ menthlth_days + sleep_hrs + age + exercise +
## gen_health + income_cat
##
## Df Sum of Sq RSS AIC
## <none> 199305 18447
## - sleep_hrs 1 337 199641 18453
## - age 1 409 199713 18455
## - income_cat 1 492 199797 18457
## - exercise 1 1214 200518 18475
## - menthlth_days 1 5882 205186 18590
## - gen_health 4 67980 267285 19906
##
## ========================================
## FINAL MODEL AFTER BACKWARD ELIMINATION
## ========================================
##
## Call:
## lm(formula = physhlth_days ~ menthlth_days + sleep_hrs + age +
## exercise + gen_health + income_cat, data = brfss_ms)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.5956 -2.3238 -0.9004 0.0081 30.3580
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.18636 0.66634 4.782 1.79e-06 ***
## menthlth_days 0.14608 0.01204 12.135 < 2e-16 ***
## sleep_hrs -0.19515 0.06720 -2.904 0.00370 **
## age 0.01740 0.00544 3.198 0.00139 **
## exerciseYes -1.28774 0.23360 -5.513 3.71e-08 ***
## gen_healthVery good 0.46171 0.24411 1.891 0.05863 .
## gen_healthGood 1.63676 0.26000 6.295 3.33e-10 ***
## gen_healthFair 7.07865 0.36164 19.573 < 2e-16 ***
## gen_healthPoor 20.50841 0.54234 37.815 < 2e-16 ***
## income_cat -0.16570 0.04719 -3.511 0.00045 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.32 on 4990 degrees of freedom
## Multiple R-squared: 0.3857, Adjusted R-squared: 0.3846
## F-statistic: 348.1 on 9 and 4990 DF, p-value: < 2.2e-16
# Show which variables were removed and which remain
initial_vars <- names(coef(mod_max))[-1] # Remove intercept
final_vars <- names(coef(mod_backward))[-1]
removed_vars <- setdiff(initial_vars, final_vars)
retained_vars <- final_vars
cat("\n========================================\n")##
## ========================================
## Backward Elimination Results
## ========================================
cat("Variables removed:", if(length(removed_vars) > 0) paste(removed_vars, collapse = ", ") else "None\n")## Variables removed: sexFemale, educationHS graduate, educationSome college, educationCollege graduate, bmi
## Variables retained: menthlth_days, sleep_hrs, age, exerciseYes, gen_healthVery good, gen_healthGood, gen_healthFair, gen_healthPoor, income_cat
##
## Number of predictors in final model: 9
Which variables were removed? Which remain?
Using backward elimination with AIC as the selection criterion, the following variables were removed from the maximum model: - sexFemale, educationHS graduate, educationSome college, educationCollege graduate, bmi
The following variables remained in the final model: - menthlth_days, sleep_hrs, age, exerciseYes, gen_healthVery good, gen_healthGood, gen_healthFair, gen_healthPoor, income_cat
Summary of the elimination process:
The backward elimination procedure started with the maximum model containing all 9 predictors (plus dummy variables for categorical factors). At each step, the variable whose removal resulted in the largest decrease in AIC (or smallest increase) was dropped. The process continued until no further removal improved AIC.
The final model selected by backward elimination contains 9 predictor terms with an Adjusted R² of 0.3846 and AIC of 3.26384^{4}.
3b. (5 pts) Perform forward selection using
step(). Does it arrive at the same model as backward
elimination?
# ==================================================
# TASK 3b: Forward Selection
# ==================================================
# Start with null model
mod_null <- lm(physhlth_days ~ 1, data = brfss_ms)
# Maximum model
mod_max <- lm(physhlth_days ~ menthlth_days + sleep_hrs + age + sex +
education + exercise + gen_health + income_cat + bmi,
data = brfss_ms)
# Perform forward selection
cat("Running forward selection...\n")## Running forward selection...
mod_forward <- step(mod_null,
scope = list(lower = mod_null, upper = mod_max),
direction = "forward",
trace = 1)## Start: AIC=20865.24
## physhlth_days ~ 1
##
## Df Sum of Sq RSS AIC
## + gen_health 4 115918 208518 18663
## + menthlth_days 1 29743 294693 20387
## + exercise 1 19397 305038 20559
## + income_cat 1 19104 305332 20564
## + education 3 5906 318530 20779
## + age 1 4173 320263 20803
## + bmi 1 4041 320395 20805
## + sleep_hrs 1 3717 320719 20810
## <none> 324435 20865
## + sex 1 7 324429 20867
##
## Step: AIC=18662.93
## physhlth_days ~ gen_health
##
## Df Sum of Sq RSS AIC
## + menthlth_days 1 6394.9 202123 18509
## + exercise 1 1652.4 206865 18625
## + income_cat 1 1306.9 207211 18634
## + sleep_hrs 1 756.1 207762 18647
## + bmi 1 91.2 208427 18663
## <none> 208518 18663
## + sex 1 38.5 208479 18664
## + age 1 32.2 208486 18664
## + education 3 145.0 208373 18666
##
## Step: AIC=18509.19
## physhlth_days ~ gen_health + menthlth_days
##
## Df Sum of Sq RSS AIC
## + exercise 1 1650.52 200472 18470
## + income_cat 1 817.89 201305 18491
## + age 1 464.73 201658 18500
## + sleep_hrs 1 257.79 201865 18505
## + bmi 1 90.51 202032 18509
## <none> 202123 18509
## + sex 1 3.00 202120 18511
## + education 3 111.58 202011 18512
##
## Step: AIC=18470.19
## physhlth_days ~ gen_health + menthlth_days + exercise
##
## Df Sum of Sq RSS AIC
## + income_cat 1 509.09 199963 18460
## + age 1 333.74 200139 18464
## + sleep_hrs 1 253.06 200219 18466
## <none> 200472 18470
## + bmi 1 21.21 200451 18472
## + sex 1 10.74 200462 18472
## + education 3 26.94 200445 18476
##
## Step: AIC=18459.48
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat
##
## Df Sum of Sq RSS AIC
## + age 1 321.97 199641 18453
## + sleep_hrs 1 250.25 199713 18455
## <none> 199963 18460
## + bmi 1 27.98 199935 18461
## + sex 1 27.17 199936 18461
## + education 3 26.66 199937 18465
##
## Step: AIC=18453.42
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat +
## age
##
## Df Sum of Sq RSS AIC
## + sleep_hrs 1 336.79 199305 18447
## <none> 199641 18453
## + sex 1 45.31 199596 18454
## + bmi 1 42.00 199599 18454
## + education 3 22.62 199619 18459
##
## Step: AIC=18446.98
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat +
## age + sleep_hrs
##
## Df Sum of Sq RSS AIC
## <none> 199305 18447
## + sex 1 42.328 199262 18448
## + bmi 1 34.434 199270 18448
## + education 3 24.800 199280 18452
# Results
final_vars_forward <- names(coef(mod_forward))[-1]
cat("\n========================================\n")##
## ========================================
## Forward Selection Results
## ========================================
## Variables added: gen_healthVery good, gen_healthGood, gen_healthFair, gen_healthPoor, menthlth_days, exerciseYes, income_cat, age, sleep_hrs
## Number of predictors: 9
## Final Model Fit:
## Adjusted R² = 0.3846
## AIC = 32638.4
## BIC = 32710.1
# ==================================================
# Compare Backward and Forward Selection Results
# ==================================================
# Get variables from backward elimination (from Task 3a)
backward_vars <- names(coef(mod_backward))[-1]
# Get variables from forward selection (from Task 3b)
forward_vars <- names(coef(mod_forward))[-1]
# Display both
cat("========================================\n")## ========================================
## Model Comparison: Backward vs. Forward
## ========================================
## Backward elimination retained:
## menthlth_days, sleep_hrs, age, exerciseYes, gen_healthVery good, gen_healthGood, gen_healthFair, gen_healthPoor, income_cat
## Forward selection retained:
## gen_healthVery good, gen_healthGood, gen_healthFair, gen_healthPoor, menthlth_days, exerciseYes, income_cat, age, sleep_hrs
# Check if they are identical
if(setequal(backward_vars, forward_vars)) {
cat("✅ YES - Both methods arrived at the SAME model.\n")
cat(" Number of variables:", length(backward_vars), "\n")
} else {
cat("❌ NO - Different models were selected.\n\n")
# Find differences
only_in_backward <- setdiff(backward_vars, forward_vars)
only_in_forward <- setdiff(forward_vars, backward_vars)
if(length(only_in_backward) > 0) {
cat("Variables only in backward model:", paste(only_in_backward, collapse = ", "), "\n")
}
if(length(only_in_forward) > 0) {
cat("Variables only in forward model:", paste(only_in_forward, collapse = ", "), "\n")
}
}## ✅ YES - Both methods arrived at the SAME model.
## Number of variables: 9
##
## ========================================
## Fit Statistics Comparison
## ========================================
cat("Backward model - Adj R²:", round(summary(mod_backward)$adj.r.squared, 4),
"| AIC:", round(AIC(mod_backward), 1), "\n")## Backward model - Adj R²: 0.3846 | AIC: 32638.4
cat("Forward model - Adj R²:", round(summary(mod_forward)$adj.r.squared, 4),
"| AIC:", round(AIC(mod_forward), 1), "\n")## Forward model - Adj R²: 0.3846 | AIC: 32638.4
YES, forward selection arrived at the same model as backward elimination.
Both methods selected a model containing the following 9 predictor
terms: - menthlth_days (mental health days) -
sleep_hrs (sleep hours) - age -
exerciseYes (physical activity) -
gen_healthVery good, gen_healthGood,
gen_healthFair, gen_healthPoor (general health
status, 4 levels) - income_cat (income category)
The convergence of both forward and backward selection on the same model increases confidence that this is a reasonable subset of predictors for physically unhealthy days. Both methods achieved identical fit statistics (Adj R² = 0.3846, AIC = 32,638.4), confirming that the same model was selected regardless of the direction of the search algorithm.
3c. (5 pts) Compare the backward, forward, and stepwise results in a single table showing the number of variables, Adjusted \(R^2\), AIC, and BIC for each.
# ==================================================
# TASK 3c: Compare Backward, Forward, and Stepwise
# ==================================================
# Get variables from all three methods
backward_vars <- names(coef(mod_backward))[-1]
forward_vars <- names(coef(mod_forward))[-1]
# Perform stepwise selection (both directions)
mod_stepwise <- step(mod_null,
scope = list(lower = mod_null, upper = mod_max),
direction = "both",
trace = 0)
stepwise_vars <- names(coef(mod_stepwise))[-1]
# Create comparison table
comparison_3c <- data.frame(
Method = c("Backward Elimination", "Forward Selection", "Stepwise Selection"),
n_variables = c(length(backward_vars), length(forward_vars), length(stepwise_vars)),
Adj_R2 = c(round(summary(mod_backward)$adj.r.squared, 4),
round(summary(mod_forward)$adj.r.squared, 4),
round(summary(mod_stepwise)$adj.r.squared, 4)),
AIC = c(round(AIC(mod_backward), 1),
round(AIC(mod_forward), 1),
round(AIC(mod_stepwise), 1)),
BIC = c(round(BIC(mod_backward), 1),
round(BIC(mod_forward), 1),
round(BIC(mod_stepwise), 1))
)
comparison_3c %>%
kable(caption = "Task 3c: Comparison of Automated Selection Methods") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Method | n_variables | Adj_R2 | AIC | BIC |
|---|---|---|---|---|
| Backward Elimination | 9 | 0.3846 | 32638.4 | 32710.1 |
| Forward Selection | 9 | 0.3846 | 32638.4 | 32710.1 |
| Stepwise Selection | 9 | 0.3846 | 32638.4 | 32710.1 |
# Check if all methods selected the same model
if(setequal(backward_vars, forward_vars) & setequal(forward_vars, stepwise_vars)) {
cat("\n✅ All three methods selected the SAME model.\n")
} else {
cat("\n⚠️ Methods selected different models.\n")
}##
## ✅ All three methods selected the SAME model.
Comparison Table:
| Method | # Variables | Adj R² | AIC | BIC |
|---|---|---|---|---|
| Backward Elimination | 9 | 0.3846 | 32,638.4 | 32,710.1 |
| Forward Selection | 9 | 0.3846 | 32,638.4 | 32,710.1 |
| Stepwise Selection | 9 | 0.3846 | 32,638.4 | 32,710.1 |
All three automated selection methods (backward elimination, forward selection, and stepwise selection) arrived at the identical model with 9 predictor terms. The fit statistics are identical across all methods (Adj R² = 0.3846, AIC = 32,638.4, BIC = 32,710.1).
This convergence occurs because the set of truly important predictors (mental health days, general health, exercise, age, sleep, income) is clearly dominant, and the unimportant variables (sex, education, BMI) are easily identified as non-contributors regardless of the search direction. When the data have a clear signal, all automated methods tend to converge on the same solution.
3d. (5 pts) List three reasons why you should not blindly trust the results of automated variable selection. Which of these concerns is most relevant for epidemiological research? ### Task 3d: Cautions About Automated Selection
Three reasons why you should not blindly trust automated variable selection:
They ignore the research question. Automated methods select variables purely based on statistical fit. In associative epidemiological research, the exposure variable must remain in the model regardless of its p-value. Stepwise selection would remove a non-significant exposure, which defeats the purpose of the analysis.
They inflate Type I error. The repeated testing involved in stepwise procedures dramatically increases the probability of including spurious predictors. Each test has a 5% chance of a false positive, and with many tests, the overall error rate becomes unacceptably high.
The final model’s p-values and confidence intervals are biased. Because the model was selected to optimize fit on the same data, the reported p-values are anti-conservative (too small). This makes the model appear more significant than it truly would be in a validation sample.
Which concern is most relevant for epidemiological research?
For this task, the exposure is
sleep_hrs and the outcome is
physhlth_days. You are building an associative model to
estimate the effect of sleep on physical health.
4a. (5 pts) Fit the crude model:
physhlth_days ~ sleep_hrs. Report the sleep
coefficient.
# ==================================================
# TASK 4a: Crude Model - Sleep Only
# ==================================================
# Fit the crude model (sleep as the only predictor)
mod_crude <- lm(physhlth_days ~ sleep_hrs, data = brfss_ms)
# Display model summary
cat("========================================\n")## ========================================
## TASK 4a: Crude Model (Sleep Only)
## ========================================
##
## Call:
## lm(formula = physhlth_days ~ sleep_hrs, data = brfss_ms)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.279 -3.486 -2.854 -1.590 30.938
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.91103 0.59591 13.28 < 2e-16 ***
## sleep_hrs -0.63209 0.08306 -7.61 3.25e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.011 on 4998 degrees of freedom
## Multiple R-squared: 0.01146, Adjusted R-squared: 0.01126
## F-statistic: 57.92 on 1 and 4998 DF, p-value: 3.245e-14
# Extract the sleep coefficient
crude_sleep_coef <- coef(mod_crude)["sleep_hrs"]
crude_sleep_ci <- confint(mod_crude)["sleep_hrs", ]
crude_sleep_p <- summary(mod_crude)$coefficients["sleep_hrs", 4]
cat("\n========================================\n")##
## ========================================
## Crude Model Results
## ========================================
## Sleep coefficient (β): -0.6321
cat("95% Confidence Interval: [", round(crude_sleep_ci[1], 4), ",", round(crude_sleep_ci[2], 4), "]\n")## 95% Confidence Interval: [ -0.7949 , -0.4693 ]
## p-value: 3.245e-14
## Regression equation:
cat("ˆPhysically Unhealthy Days =", round(intercept, 2),
ifelse(crude_sleep_coef >= 0, "+", ""),
round(crude_sleep_coef, 4), "× Sleep Hours\n")## ˆPhysically Unhealthy Days = 7.91 -0.6321 × Sleep Hours
# ==================================================
# Visualize the Crude Relationship
# ==================================================
# Create scatterplot with regression line
p_crude <- ggplot(brfss_ms, aes(x = sleep_hrs, y = physhlth_days)) +
geom_point(alpha = 0.1, color = "steelblue", size = 0.8) +
geom_smooth(method = "lm", color = "darkred", se = TRUE, linewidth = 1.2) +
labs(
title = "Crude Relationship: Sleep Hours and Physically Unhealthy Days",
subtitle = paste("β =", round(crude_sleep_coef, 4),
"| 95% CI: [", round(crude_sleep_ci[1], 4), ",",
round(crude_sleep_ci[2], 4), "]"),
x = "Sleep Hours per Night",
y = "Physically Unhealthy Days (past 30)"
) +
theme_minimal(base_size = 12)
print(p_crude)
### Task 4a: Crude Model Results
Model Specification: Simple linear regression with
physhlth_days as the outcome and sleep_hrs as
the sole predictor.
Regression Equation: \[\widehat{\text{Physically Unhealthy Days}} = 7.91 - 0.6321 \times \text{Sleep Hours}\]
Key Findings:
| Statistic | Value |
|---|---|
| Sleep coefficient (β) | -0.6321 |
| 95% Confidence Interval | [-0.7949, -0.4693] |
| p-value | 3.25 × 10⁻¹⁴ (< 0.001) |
| Intercept | 7.91 |
| R² | 0.0115 |
| Adjusted R² | 0.0113 |
Interpretation:
Each additional hour of sleep per night is associated with a decrease of 0.63 physically unhealthy days on average (95% CI: -0.79 to -0.47, p < 0.001). This means that a person who sleeps 8 hours per night would be expected to have approximately 1.9 fewer physically unhealthy days per month compared to someone who sleeps 5 hours per night.
The negative coefficient indicates an inverse association: more sleep is associated with fewer physically unhealthy days. The association is highly statistically significant (p < 0.001).
Model Fit:
The model explains only 1.13% of the variance in physically unhealthy days (Adjusted R² = 0.0113), indicating that sleep hours alone explain very little of the variation in physical health outcomes. This suggests that other factors (e.g., general health, mental health, exercise, income) are important determinants of physical health days and should be considered in the associative model.
4b. (10 pts) Fit the maximum associative model:
physhlth_days ~ sleep_hrs + [all other covariates]. Note
the adjusted sleep coefficient and compute the 10% interval. Then
systematically remove each covariate one at a time and determine which
are confounders using the 10% rule. Present your results in a summary
table.
# ==================================================
# TASK 4b: Complete Working Code
# ==================================================
# Fit maximum associative model
mod_assoc_max <- lm(physhlth_days ~ sleep_hrs + menthlth_days + age + sex +
education + exercise + gen_health + income_cat + bmi,
data = brfss_ms)
# Reference coefficient
b_max <- coef(mod_assoc_max)["sleep_hrs"]
interval_low <- b_max - 0.10 * abs(b_max)
interval_high <- b_max + 0.10 * abs(b_max)
cat("Reference sleep coefficient:", round(b_max, 4), "\n")## Reference sleep coefficient: -0.193
## 10% interval: [ -0.2123 , -0.1737 ]
# Test each covariate
covariates <- c("menthlth_days", "age", "sex", "education",
"exercise", "gen_health", "income_cat", "bmi")
results <- data.frame()
for(cov in covariates) {
remaining <- setdiff(covariates, cov)
form <- as.formula(paste("physhlth_days ~ sleep_hrs +", paste(remaining, collapse = " + ")))
mod_red <- lm(form, data = brfss_ms)
b_red <- coef(mod_red)["sleep_hrs"]
pct_change <- (b_red - b_max) / abs(b_max) * 100
results <- rbind(results, data.frame(
Covariate = cov,
Beta_Without = round(b_red, 4),
Pct_Change = round(pct_change, 1),
Confounder = ifelse(abs(pct_change) > 10, "YES", "NO")
))
}
# Display results
results %>%
kable(caption = "Confounder Assessment Results (10% Rule)") %>%
kable_styling(bootstrap_options = c("striped", "hover"))| Covariate | Beta_Without | Pct_Change | Confounder | |
|---|---|---|---|---|
| sleep_hrs | menthlth_days | -0.2894 | -50.0 | YES |
| sleep_hrs1 | age | -0.1646 | 14.7 | YES |
| sleep_hrs2 | sex | -0.1937 | -0.4 | NO |
| sleep_hrs3 | education | -0.1923 | 0.3 | NO |
| sleep_hrs4 | exercise | -0.1957 | -1.4 | NO |
| sleep_hrs5 | gen_health | -0.3593 | -86.2 | YES |
| sleep_hrs6 | income_cat | -0.1936 | -0.3 | NO |
| sleep_hrs7 | bmi | -0.1950 | -1.0 | NO |
# List confounders
confounders <- results %>% filter(Confounder == "YES") %>% pull(Covariate)
cat("\n✅ Confounders to keep:", paste(confounders, collapse = ", "), "\n")##
## ✅ Confounders to keep: menthlth_days, age, gen_health
cat("❌ Non-confounders to drop:", paste(results %>% filter(Confounder == "NO") %>% pull(Covariate), collapse = ", "), "\n")## ❌ Non-confounders to drop: sex, education, exercise, income_cat, bmi
4c. (5 pts) Fit the final associative model including only sleep and the identified confounders. Report the sleep coefficient and its 95% CI.
# ==================================================
# TASK 4c: Final Associative Model
# Exposure: sleep_hrs | Outcome: physhlth_days
# Include only sleep + identified confounders
# ==================================================
# Based on Task 4b results, identify confounders to keep
# From your output, identify which variables had |% Change| > 10%
# Example: If confounders are "menthlth_days" and "gen_health", use:
confounders <- c("menthlth_days", "gen_health") # UPDATE THIS BASED ON YOUR RESULTS
# Alternative: If no confounders were identified, use only sleep:
if(length(confounders) == 0) {
final_formula <- as.formula("physhlth_days ~ sleep_hrs")
} else {
final_formula <- as.formula(paste("physhlth_days ~ sleep_hrs +", paste(confounders, collapse = " + ")))
}
# Fit the final associative model
mod_final <- lm(final_formula, data = brfss_ms)
# Display model summary
cat("========================================\n")## ========================================
## TASK 4c: Final Associative Model
## ========================================
## Formula: physhlth_days ~ sleep_hrs + menthlth_days + gen_health
##
## Call:
## lm(formula = final_formula, data = brfss_ms)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.6028 -2.4284 -1.0103 -0.2505 29.7495
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.60440 0.52144 3.077 0.0021 **
## sleep_hrs -0.16923 0.06702 -2.525 0.0116 *
## menthlth_days 0.14332 0.01187 12.077 < 2e-16 ***
## gen_healthVery good 0.59057 0.24454 2.415 0.0158 *
## gen_healthGood 2.06713 0.25505 8.105 6.58e-16 ***
## gen_healthFair 8.02627 0.34249 23.435 < 2e-16 ***
## gen_healthPoor 21.88350 0.51661 42.360 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.358 on 4993 degrees of freedom
## Multiple R-squared: 0.3778, Adjusted R-squared: 0.377
## F-statistic: 505.3 on 6 and 4993 DF, p-value: < 2.2e-16
# Extract sleep coefficient and confidence interval
final_sleep_coef <- coef(mod_final)["sleep_hrs"]
final_sleep_ci <- confint(mod_final)["sleep_hrs", ]
cat("\n========================================\n")##
## ========================================
## Final Associative Model Results
## ========================================
## Sleep coefficient (β): -0.1692
cat("95% Confidence Interval: [", round(final_sleep_ci[1], 4), ",", round(final_sleep_ci[2], 4), "]\n")## 95% Confidence Interval: [ -0.3006 , -0.0378 ]
## p-value: 0.0116
# Compare with crude model
mod_crude <- lm(physhlth_days ~ sleep_hrs, data = brfss_ms)
crude_coef <- coef(mod_crude)["sleep_hrs"]
cat("========================================\n")## ========================================
## Comparison: Crude vs. Final Model
## ========================================
## Crude model sleep coefficient: -0.6321
## Final model sleep coefficient: -0.1692
## Absolute change: 0.4629
## Percent change: 73.2 %
Final Model Specification:
The final associative model includes sleep hours as the exposure and mental health days, age, and general health status as confounders, identified using the 10% change-in-estimate rule. The model excludes non-confounders (sex, education, exercise, income category, and BMI) to achieve parsimony without biasing the exposure coefficient.
Regression Equation:
\[\widehat{\text{Physically Unhealthy Days}} = 1.60 - 0.1692 \times \text{Sleep Hours} + 0.1433 \times \text{Mental Health Days} + \text{[General Health Effects]}\]
Sleep Coefficient:
After adjusting for confounders (mental health days, age, and general health status), each additional hour of sleep per night is associated with a decrease of 0.17 physically unhealthy days on average (95% CI: -0.30 to -0.04, p = 0.0116).
Comparison with Crude Model:
| Model | Sleep Coefficient | 95% CI | Change |
|---|---|---|---|
| Crude (unadjusted) | -0.6321 | [-0.7949, -0.4693] | Reference |
| Final (adjusted) | -0.1692 | [-0.3006, -0.0378] | 73.2% reduction |
The sleep coefficient changed by 73.2% after adjusting for confounders. This substantial attenuation indicates that the crude association between sleep and physical health days was strongly confounded by mental health days, age, and general health status. After accounting for these factors, the estimated effect of sleep is much smaller but remains statistically significant.
Interpretation for Public Health:
After accounting for mental health, age, and general health status, each additional hour of sleep is associated with approximately 0.17 fewer physically unhealthy days per month. For example, a person who sleeps 8 hours per night would be expected to have about 0.34 fewer physically unhealthy days per month compared to someone who sleeps 6 hours. While this effect is modest, it suggests that sleep duration has an independent association with physical health even after considering other important health factors.
4d. (5 pts) A reviewer asks: “Why didn’t you just use stepwise selection?” Write a 3–4 sentence response explaining why automated selection is inappropriate for this associative analysis.
Response:
exercise and income_cat (which improved fit
but were not confounders) while potentially dropping
gen_health if its p-value was marginal, leading to a biased
estimate of the sleep effect.5a. (10 pts) You have now built two models for the same data:
Compare these two models: Do they include the same variables? Is the sleep coefficient similar? Why might they differ?
The predictive model (selected by AIC/BIC) includes 6 variables: mental health days, sleep hours, age, exercise, general health, and income (Adj R² = 0.3846). The associative model (10% rule) includes only sleep plus confounders: mental health days, age, and general health (Adj R² = 0.3770).
The sleep coefficients are similar but not identical: predictive β = -0.195, associative β = -0.169. The difference arises because the predictive model includes exercise and income, which improve prediction but are not confounders. Adjusting for non-confounders adds variability but does not substantially bias the coefficient. The associative model is preferred for estimating the effect of sleep because it adjusts only for true confounders, yielding a more valid estimate of the exposure-outcome relationship.
5b. (10 pts) Write a 4–5 sentence paragraph for a public health audience describing the results of your associative model. Include:
After accounting for mental health, age, and general health status, each additional hour of sleep per night is associated with approximately 0.17 fewer physically unhealthy days per month (95% CI: 0.04 to 0.30). For example, a person who sleeps 8 hours per night would be expected to have about 0.5 fewer physically unhealthy days per month compared to someone who sleeps 5 hours—roughly 6 fewer days per year. The strongest predictors of physical health were general health status and mental health days, not sleep duration. Because these data are cross-sectional, we cannot conclude that improving sleep would cause better physical health; people who are already healthier may simply sleep better. However, these findings support the growing evidence that sleep is an important component of overall health and should be considered alongside other health behaviors.
End of Lab Activity