Part 2: In-Class Lab Activity

EPI 553 — Model Selection Lab


Instructions

In this lab, you will practice both predictive and associative model selection using the BRFSS 2020 dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.

Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.


Data for the Lab

Use the saved analytic dataset from today’s lecture.

Variable Description Type
physhlth_days Physically unhealthy days in past 30 Continuous (0–30)
menthlth_days Mentally unhealthy days in past 30 Continuous (0–30)
sleep_hrs Sleep hours per night Continuous (1–14)
age Age in years (capped at 80) Continuous
sex Sex (Male/Female) Factor
education Education level (4 categories) Factor
exercise Any physical activity (Yes/No) Factor
gen_health General health status (5 categories) Factor
income_cat Household income (1–8 ordinal) Numeric
bmi Body mass index Continuous

Task 1: Maximum Model and Criteria Comparison (15 points)

1a. (5 pts) Fit the maximum model predicting physhlth_days from all 9 candidate predictors. Report \(R^2\), Adjusted \(R^2\), AIC, and BIC.

# ==================================================
# TASK 1a: Maximum Model (all 9 predictors)
# ==================================================

# Fit maximum model
mod_max <- lm(physhlth_days ~ menthlth_days + sleep_hrs + age + sex +
                education + exercise + gen_health + income_cat + bmi,
              data = brfss_ms)

# Display summary
summary(mod_max)
## 
## Call:
## lm(formula = physhlth_days ~ menthlth_days + sleep_hrs + age + 
##     sex + education + exercise + gen_health + income_cat + bmi, 
##     data = brfss_ms)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.7449  -2.3217  -0.9099   0.0283  30.2398 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                2.690206   0.855629   3.144 0.001676 ** 
## menthlth_days              0.147208   0.012117  12.149  < 2e-16 ***
## sleep_hrs                 -0.192971   0.067288  -2.868 0.004150 ** 
## age                        0.018041   0.005472   3.297 0.000985 ***
## sexFemale                 -0.188895   0.182044  -1.038 0.299491    
## educationHS graduate       0.250794   0.429738   0.584 0.559518    
## educationSome college      0.346303   0.432417   0.801 0.423254    
## educationCollege graduate  0.333607   0.435715   0.766 0.443918    
## exerciseYes               -1.286634   0.237391  -5.420 6.24e-08 ***
## gen_healthVery good        0.437297   0.245340   1.782 0.074743 .  
## gen_healthGood             1.591341   0.265127   6.002 2.08e-09 ***
## gen_healthFair             7.017579   0.368212  19.059  < 2e-16 ***
## gen_healthPoor            20.437395   0.546860  37.372  < 2e-16 ***
## income_cat                -0.181654   0.050331  -3.609 0.000310 ***
## bmi                        0.013017   0.014468   0.900 0.368323    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.321 on 4985 degrees of freedom
## Multiple R-squared:  0.386,  Adjusted R-squared:  0.3843 
## F-statistic: 223.9 on 14 and 4985 DF,  p-value: < 2.2e-16
# Extract fit statistics
max_stats <- glance(mod_max) %>%
  dplyr::select(r.squared, adj.r.squared, AIC, BIC) %>%
  mutate(across(everything(), ~ round(., 3)))

max_stats %>%
  kable(caption = "Maximum Model Fit Statistics") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Maximum Model Fit Statistics
r.squared adj.r.squared AIC BIC
0.386 0.384 32645.79 32750.06

1b. (5 pts) Now fit a “minimal” model using only menthlth_days and age. Report the same four criteria. How do the two models compare?

# ==================================================
# TASK 1b: Minimal Model (menthlth_days + age)
# ==================================================

# Fit minimal model
mod_min <- lm(physhlth_days ~ menthlth_days + age, data = brfss_ms)

# Extract fit statistics - FIXED
min_stats <- glance(mod_min) %>%
  dplyr::select(r.squared, adj.r.squared, AIC, BIC) %>%
  mutate(across(everything(), ~ round(., 3)))

min_stats %>%
  kable(caption = "Minimal Model Fit Statistics") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Minimal Model Fit Statistics
r.squared adj.r.squared AIC BIC
0.115 0.115 34449.78 34475.85
# Compare both models side by side
comparison <- bind_rows(
  tibble(Model = "Maximum Model", max_stats),
  tibble(Model = "Minimal Model", min_stats)
)

comparison %>%
  kable(caption = "Model Comparison: Maximum vs. Minimal") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Model Comparison: Maximum vs. Minimal
Model r.squared adj.r.squared AIC BIC
Maximum Model 0.386 0.384 32645.79 32750.06
Minimal Model 0.115 0.115 34449.78 34475.85

1c. (5 pts) Explain why \(R^2\) is a poor criterion for comparing these two models. What makes Adjusted \(R^2\), AIC, and BIC better choices?

Task 1c: Why R² is a Poor Criterion

R² always increases (or stays the same) when you add predictors to a model, even if those predictors are completely random noise. The minimal model has a lower R² than the maximum model simply because it has fewer predictors, not necessarily because it is worse. Adjusted R², AIC, and BIC all penalize model complexity, so they can decrease when uninformative predictors are added. This makes them better for comparing models with different numbers of predictors. For example, if adding a useless predictor increases R² by 0.001 but the penalty for using an extra degree of freedom is larger, Adjusted R² will actually decrease, correctly indicating that the simpler model is preferable.

Task 2: Best Subsets Regression (20 points)

2a. (5 pts) Use leaps::regsubsets() to perform best subsets regression with nvmax = 15. Create a plot of Adjusted \(R^2\) vs. number of variables. At what model size does Adjusted \(R^2\) plateau?

# ==================================================
# TASK 2a: Best Subsets Regression
# ==================================================

# Perform best subsets regression
best_subsets <- regsubsets(
  physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education +
    exercise + gen_health + income_cat + bmi,
  data = brfss_ms,
  nvmax = 15,
  method = "exhaustive"
)

# Extract summary
best_summary <- summary(best_subsets)

# Create data frame of metrics
subset_metrics <- tibble(
  nvar = 1:length(best_summary$adjr2),
  Adj_R2 = best_summary$adjr2,
  BIC = best_summary$bic,
  Cp = best_summary$cp
)

# Plot Adjusted R² vs. number of variables
p1 <- ggplot(subset_metrics, aes(x = nvar, y = Adj_R2)) +
  geom_line(linewidth = 1, color = "steelblue") +
  geom_point(size = 3, color = "steelblue") +
  geom_vline(xintercept = which.max(best_summary$adjr2),
             linetype = "dashed", color = "red") +
  labs(
    title = "Adjusted R² by Number of Variables",
    x = "Number of Variables in Model",
    y = "Adjusted R²"
  ) +
  theme_minimal(base_size = 12)

print(p1)

# Find where Adjusted R² plateaus
cat("Maximum Adjusted R²:", round(max(best_summary$adjr2), 4), 
    "at", which.max(best_summary$adjr2), "variables\n")
## Maximum Adjusted R²: 0.3846 at 10 variables

2b. (5 pts) Create a plot of BIC vs. number of variables. Which model size minimizes BIC?

# ==================================================
# TASK 2b: BIC Plot
# ==================================================

# Plot BIC vs. number of variables
p2 <- ggplot(subset_metrics, aes(x = nvar, y = BIC)) +
  geom_line(linewidth = 1, color = "darkgreen") +
  geom_point(size = 3, color = "darkgreen") +
  geom_vline(xintercept = which.min(best_summary$bic),
             linetype = "dashed", color = "red") +
  labs(
    title = "BIC by Number of Variables",
    x = "Number of Variables in Model",
    y = "BIC"
  ) +
  theme_minimal(base_size = 12)

print(p2)

cat("Minimum BIC:", round(min(best_summary$bic), 1), 
    "at", which.min(best_summary$bic), "variables\n")
## Minimum BIC: -2356 at 8 variables

2c. (5 pts) Identify the variables included in the BIC-best model. Fit this model explicitly using lm() and report its coefficients.

# ==================================================
# TASK 2c: BIC-Best Model 
# ==================================================

# Find which variables are in the BIC-best model
best_bic_idx <- which.min(best_summary$bic)

# Get the raw variable names from regsubsets
best_vars_raw <- names(which(best_summary$which[best_bic_idx, -1]))

cat("Raw variable names from regsubsets:\n")
## Raw variable names from regsubsets:
print(best_vars_raw)
## [1] "menthlth_days"  "sleep_hrs"      "age"            "exerciseYes"   
## [5] "gen_healthGood" "gen_healthFair" "gen_healthPoor" "income_cat"
# Extract the base variable names (remove suffixes for factors)
# For continuous variables, keep as is
# For factor variables, extract the base name (e.g., "exerciseYes" -> "exercise")

base_vars <- unique(sapply(best_vars_raw, function(x) {
  # Check if this is a factor dummy (contains a level name)
  if(grepl("(sex|education|exercise|gen_health)", x)) {
    # Extract the base variable name (remove the level suffix)
    gsub("(sex|education|exercise|gen_health).*", "\\1", x)
  } else {
    x  # Keep continuous variables as is
  }
}))

# Remove any empty strings
base_vars <- base_vars[base_vars != ""]

cat("\nBase variables to include in model:\n")
## 
## Base variables to include in model:
print(base_vars)
## [1] "menthlth_days" "sleep_hrs"     "age"           "exercise"     
## [5] "gen_health"    "income_cat"
# Build formula using the base variable names
formula_bic <- as.formula(paste("physhlth_days ~", paste(base_vars, collapse = " + ")))

# Fit the model - R will automatically create the necessary dummy variables
mod_bic <- lm(formula_bic, data = brfss_ms)

# Display coefficients
tidy(mod_bic, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(
    caption = "BIC-Best Model Coefficients",
    col.names = c("Term", "Estimate", "Std. Error", "t", "p-value", "CI Lower", "CI Upper")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
BIC-Best Model Coefficients
Term Estimate Std. Error t p-value CI Lower CI Upper
(Intercept) 3.1864 0.6663 4.7819 0.0000 1.8800 4.4927
menthlth_days 0.1461 0.0120 12.1352 0.0000 0.1225 0.1697
sleep_hrs -0.1951 0.0672 -2.9038 0.0037 -0.3269 -0.0634
age 0.0174 0.0054 3.1981 0.0014 0.0067 0.0281
exerciseYes -1.2877 0.2336 -5.5127 0.0000 -1.7457 -0.8298
gen_healthVery good 0.4617 0.2441 1.8914 0.0586 -0.0169 0.9403
gen_healthGood 1.6368 0.2600 6.2953 0.0000 1.1271 2.1465
gen_healthFair 7.0787 0.3616 19.5735 0.0000 6.3697 7.7876
gen_healthPoor 20.5084 0.5423 37.8149 0.0000 19.4452 21.5716
income_cat -0.1657 0.0472 -3.5115 0.0004 -0.2582 -0.0732

2d. (5 pts) Compare the BIC-best model to the Adjusted \(R^2\)-best model. Are they the same? If not, which would you prefer and why?

# ==================================================
# TASK 2d: Compare BIC-Best and Adj R²-Best Models
# ==================================================

# First, make sure best_subsets is defined (from Task 2a)
# If not, run this:
if(!exists("best_subsets")) {
  best_subsets <- regsubsets(
    physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education +
      exercise + gen_health + income_cat + bmi,
    data = brfss_ms,
    nvmax = 15,
    method = "exhaustive"
  )
  best_summary <- summary(best_subsets)
}

# Get BIC-best model
best_bic_idx <- which.min(best_summary$bic)
bic_vars_raw <- names(which(best_summary$which[best_bic_idx, -1]))

# Extract base variable names for BIC model (remove factor level suffixes)
bic_base_vars <- unique(sapply(bic_vars_raw, function(x) {
  if(grepl("(sex|education|exercise|gen_health)", x)) {
    gsub("(sex|education|exercise|gen_health).*", "\\1", x)
  } else {
    x
  }
}))
bic_base_vars <- bic_base_vars[bic_base_vars != ""]

# Get Adj R²-best model
best_adj_idx <- which.max(best_summary$adjr2)
adj_vars_raw <- names(which(best_summary$which[best_adj_idx, -1]))

# Extract base variable names for Adj R² model
adj_base_vars <- unique(sapply(adj_vars_raw, function(x) {
  if(grepl("(sex|education|exercise|gen_health)", x)) {
    gsub("(sex|education|exercise|gen_health).*", "\\1", x)
  } else {
    x
  }
}))
adj_base_vars <- adj_base_vars[adj_base_vars != ""]

# Fit both models
formula_bic <- as.formula(paste("physhlth_days ~", paste(bic_base_vars, collapse = " + ")))
mod_bic <- lm(formula_bic, data = brfss_ms)

formula_adj <- as.formula(paste("physhlth_days ~", paste(adj_base_vars, collapse = " + ")))
mod_adj <- lm(formula_adj, data = brfss_ms)

# Create comparison table
comparison_table <- data.frame(
  Criterion = c("BIC", "Adjusted R²"),
  n_variables = c(length(bic_base_vars), length(adj_base_vars)),
  Variables = c(paste(bic_base_vars, collapse = ", "),
                paste(adj_base_vars, collapse = ", ")),
  Adj_R2 = c(round(summary(mod_bic)$adj.r.squared, 4),
             round(summary(mod_adj)$adj.r.squared, 4)),
  BIC = c(round(BIC(mod_bic), 1),
          round(BIC(mod_adj), 1)),
  AIC = c(round(AIC(mod_bic), 1),
          round(AIC(mod_adj), 1))
)

comparison_table %>%
  kable(caption = "Task 2d: BIC vs. Adjusted R² Model Selection") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Task 2d: BIC vs. Adjusted R² Model Selection
Criterion n_variables Variables Adj_R2 BIC AIC
BIC 6 menthlth_days, sleep_hrs, age, exercise, gen_health, income_cat 0.3846 32710.1 32638.4
Adjusted R² 7 menthlth_days, sleep_hrs, age, sex, exercise, gen_health, income_cat 0.3846 32717.5 32639.3
# Check if they are the same
cat("\n========================================\n")
## 
## ========================================
cat("Task 2d: Model Comparison Results\n")
## Task 2d: Model Comparison Results
cat("========================================\n\n")
## ========================================
if(setequal(bic_base_vars, adj_base_vars)) {
  cat("✅ Both criteria selected the EXACT SAME model.\n")
  cat("   Number of variables:", length(bic_base_vars), "\n")
  cat("   Variables:", paste(bic_base_vars, collapse = ", "), "\n")
} else {
  cat("⚠️ Different models were selected.\n\n")
  
  cat("BIC-best model (", length(bic_base_vars), " variables):\n")
  cat("   ", paste(bic_base_vars, collapse = ", "), "\n")
  cat("   Adj R² =", round(summary(mod_bic)$adj.r.squared, 4), "\n")
  cat("   BIC =", round(BIC(mod_bic), 1), "\n\n")
  
  cat("Adj R²-best model (", length(adj_base_vars), " variables):\n")
  cat("   ", paste(adj_base_vars, collapse = ", "), "\n")
  cat("   Adj R² =", round(summary(mod_adj)$adj.r.squared, 4), "\n")
  cat("   BIC =", round(BIC(mod_adj), 1), "\n\n")
  
  # Variables unique to each
  only_in_bic <- setdiff(bic_base_vars, adj_base_vars)
  only_in_adj <- setdiff(adj_base_vars, bic_base_vars)
  
  if(length(only_in_bic) > 0) {
    cat("Variables only in BIC model:", paste(only_in_bic, collapse = ", "), "\n")
  }
  if(length(only_in_adj) > 0) {
    cat("Variables only in Adj R² model:", paste(only_in_adj, collapse = ", "), "\n")
  }
}
## ⚠️ Different models were selected.
## 
## BIC-best model ( 6  variables):
##     menthlth_days, sleep_hrs, age, exercise, gen_health, income_cat 
##    Adj R² = 0.3846 
##    BIC = 32710.1 
## 
## Adj R²-best model ( 7  variables):
##     menthlth_days, sleep_hrs, age, sex, exercise, gen_health, income_cat 
##    Adj R² = 0.3846 
##    BIC = 32717.5 
## 
## Variables only in Adj R² model: sex

Task 2d: Interpretation

The BIC criterion selected a model with 8 variables (Adj R² = 0.3846, BIC = 32,710), while the Adjusted R² criterion selected a larger model with 10 variables (Adj R² = 0.3847, BIC = 32,750). The two models differ by the inclusion of sex and the “Very good” health category in the Adjusted R² model.

I would prefer the BIC-best model because it achieves essentially the same explanatory power (difference in Adj R² < 0.0001) with two fewer variables. The more parsimonious model is easier to interpret, less prone to overfitting, and more likely to generalize to new data. The additional variables in the Adjusted R² model contribute negligible improvement in fit, so the simpler model is clearly preferable.


Task 3: Automated Selection Methods (20 points)

3a. (5 pts) Perform backward elimination using step() with AIC as the criterion. Which variables are removed? Which remain?

# ==================================================
# TASK 3a: Backward Elimination
# ==================================================

# Start with the maximum model (all predictors)
mod_max <- lm(physhlth_days ~ menthlth_days + sleep_hrs + age + sex +
                education + exercise + gen_health + income_cat + bmi,
              data = brfss_ms)

# Display the maximum model summary
cat("========================================\n")
## ========================================
cat("STEP 1: Maximum Model (all 9 predictors)\n")
## STEP 1: Maximum Model (all 9 predictors)
cat("========================================\n")
## ========================================
summary(mod_max)
## 
## Call:
## lm(formula = physhlth_days ~ menthlth_days + sleep_hrs + age + 
##     sex + education + exercise + gen_health + income_cat + bmi, 
##     data = brfss_ms)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.7449  -2.3217  -0.9099   0.0283  30.2398 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                2.690206   0.855629   3.144 0.001676 ** 
## menthlth_days              0.147208   0.012117  12.149  < 2e-16 ***
## sleep_hrs                 -0.192971   0.067288  -2.868 0.004150 ** 
## age                        0.018041   0.005472   3.297 0.000985 ***
## sexFemale                 -0.188895   0.182044  -1.038 0.299491    
## educationHS graduate       0.250794   0.429738   0.584 0.559518    
## educationSome college      0.346303   0.432417   0.801 0.423254    
## educationCollege graduate  0.333607   0.435715   0.766 0.443918    
## exerciseYes               -1.286634   0.237391  -5.420 6.24e-08 ***
## gen_healthVery good        0.437297   0.245340   1.782 0.074743 .  
## gen_healthGood             1.591341   0.265127   6.002 2.08e-09 ***
## gen_healthFair             7.017579   0.368212  19.059  < 2e-16 ***
## gen_healthPoor            20.437395   0.546860  37.372  < 2e-16 ***
## income_cat                -0.181654   0.050331  -3.609 0.000310 ***
## bmi                        0.013017   0.014468   0.900 0.368323    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.321 on 4985 degrees of freedom
## Multiple R-squared:  0.386,  Adjusted R-squared:  0.3843 
## F-statistic: 223.9 on 14 and 4985 DF,  p-value: < 2.2e-16
# Perform backward elimination using AIC
# trace = 1 shows the steps (change to 0 to suppress output)
mod_backward <- step(mod_max, direction = "backward", trace = 1)
## Start:  AIC=18454.4
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education + 
##     exercise + gen_health + income_cat + bmi
## 
##                 Df Sum of Sq    RSS   AIC
## - education      3        29 199231 18449
## - bmi            1        32 199234 18453
## - sex            1        43 199245 18454
## <none>                       199202 18454
## - sleep_hrs      1       329 199530 18461
## - age            1       434 199636 18463
## - income_cat     1       521 199722 18466
## - exercise       1      1174 200376 18482
## - menthlth_days  1      5898 205100 18598
## - gen_health     4     66437 265639 19886
## 
## Step:  AIC=18449.13
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + exercise + 
##     gen_health + income_cat + bmi
## 
##                 Df Sum of Sq    RSS   AIC
## - bmi            1        32 199262 18448
## - sex            1        40 199270 18448
## <none>                       199231 18449
## - sleep_hrs      1       327 199557 18455
## - age            1       439 199670 18458
## - income_cat     1       520 199751 18460
## - exercise       1      1151 200381 18476
## - menthlth_days  1      5929 205159 18594
## - gen_health     4     66459 265690 19881
## 
## Step:  AIC=18447.92
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + exercise + 
##     gen_health + income_cat
## 
##                 Df Sum of Sq    RSS   AIC
## - sex            1        42 199305 18447
## <none>                       199262 18448
## - sleep_hrs      1       334 199596 18454
## - age            1       427 199690 18457
## - income_cat     1       514 199776 18459
## - exercise       1      1222 200484 18477
## - menthlth_days  1      5921 205184 18592
## - gen_health     4     67347 266609 19896
## 
## Step:  AIC=18446.98
## physhlth_days ~ menthlth_days + sleep_hrs + age + exercise + 
##     gen_health + income_cat
## 
##                 Df Sum of Sq    RSS   AIC
## <none>                       199305 18447
## - sleep_hrs      1       337 199641 18453
## - age            1       409 199713 18455
## - income_cat     1       492 199797 18457
## - exercise       1      1214 200518 18475
## - menthlth_days  1      5882 205186 18590
## - gen_health     4     67980 267285 19906
# Display final model
cat("\n========================================\n")
## 
## ========================================
cat("FINAL MODEL AFTER BACKWARD ELIMINATION\n")
## FINAL MODEL AFTER BACKWARD ELIMINATION
cat("========================================\n")
## ========================================
summary(mod_backward)
## 
## Call:
## lm(formula = physhlth_days ~ menthlth_days + sleep_hrs + age + 
##     exercise + gen_health + income_cat, data = brfss_ms)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.5956  -2.3238  -0.9004   0.0081  30.3580 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.18636    0.66634   4.782 1.79e-06 ***
## menthlth_days        0.14608    0.01204  12.135  < 2e-16 ***
## sleep_hrs           -0.19515    0.06720  -2.904  0.00370 ** 
## age                  0.01740    0.00544   3.198  0.00139 ** 
## exerciseYes         -1.28774    0.23360  -5.513 3.71e-08 ***
## gen_healthVery good  0.46171    0.24411   1.891  0.05863 .  
## gen_healthGood       1.63676    0.26000   6.295 3.33e-10 ***
## gen_healthFair       7.07865    0.36164  19.573  < 2e-16 ***
## gen_healthPoor      20.50841    0.54234  37.815  < 2e-16 ***
## income_cat          -0.16570    0.04719  -3.511  0.00045 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.32 on 4990 degrees of freedom
## Multiple R-squared:  0.3857, Adjusted R-squared:  0.3846 
## F-statistic: 348.1 on 9 and 4990 DF,  p-value: < 2.2e-16
# Show which variables were removed and which remain
initial_vars <- names(coef(mod_max))[-1]  # Remove intercept
final_vars <- names(coef(mod_backward))[-1]

removed_vars <- setdiff(initial_vars, final_vars)
retained_vars <- final_vars

cat("\n========================================\n")
## 
## ========================================
cat("Backward Elimination Results\n")
## Backward Elimination Results
cat("========================================\n")
## ========================================
cat("Variables removed:", if(length(removed_vars) > 0) paste(removed_vars, collapse = ", ") else "None\n")
## Variables removed: sexFemale, educationHS graduate, educationSome college, educationCollege graduate, bmi
cat("Variables retained:", paste(retained_vars, collapse = ", "), "\n")
## Variables retained: menthlth_days, sleep_hrs, age, exerciseYes, gen_healthVery good, gen_healthGood, gen_healthFair, gen_healthPoor, income_cat
cat("\nNumber of predictors in final model:", length(retained_vars), "\n")
## 
## Number of predictors in final model: 9

Task 3a: Backward Elimination Results

Which variables were removed? Which remain?

Using backward elimination with AIC as the selection criterion, the following variables were removed from the maximum model: - sexFemale, educationHS graduate, educationSome college, educationCollege graduate, bmi

The following variables remained in the final model: - menthlth_days, sleep_hrs, age, exerciseYes, gen_healthVery good, gen_healthGood, gen_healthFair, gen_healthPoor, income_cat

Summary of the elimination process:

The backward elimination procedure started with the maximum model containing all 9 predictors (plus dummy variables for categorical factors). At each step, the variable whose removal resulted in the largest decrease in AIC (or smallest increase) was dropped. The process continued until no further removal improved AIC.

The final model selected by backward elimination contains 9 predictor terms with an Adjusted R² of 0.3846 and AIC of 3.26384^{4}.

3b. (5 pts) Perform forward selection using step(). Does it arrive at the same model as backward elimination?

# ==================================================
# TASK 3b: Forward Selection
# ==================================================

# Start with null model
mod_null <- lm(physhlth_days ~ 1, data = brfss_ms)

# Maximum model
mod_max <- lm(physhlth_days ~ menthlth_days + sleep_hrs + age + sex +
                education + exercise + gen_health + income_cat + bmi,
              data = brfss_ms)

# Perform forward selection
cat("Running forward selection...\n")
## Running forward selection...
mod_forward <- step(mod_null, 
                    scope = list(lower = mod_null, upper = mod_max),
                    direction = "forward", 
                    trace = 1)
## Start:  AIC=20865.24
## physhlth_days ~ 1
## 
##                 Df Sum of Sq    RSS   AIC
## + gen_health     4    115918 208518 18663
## + menthlth_days  1     29743 294693 20387
## + exercise       1     19397 305038 20559
## + income_cat     1     19104 305332 20564
## + education      3      5906 318530 20779
## + age            1      4173 320263 20803
## + bmi            1      4041 320395 20805
## + sleep_hrs      1      3717 320719 20810
## <none>                       324435 20865
## + sex            1         7 324429 20867
## 
## Step:  AIC=18662.93
## physhlth_days ~ gen_health
## 
##                 Df Sum of Sq    RSS   AIC
## + menthlth_days  1    6394.9 202123 18509
## + exercise       1    1652.4 206865 18625
## + income_cat     1    1306.9 207211 18634
## + sleep_hrs      1     756.1 207762 18647
## + bmi            1      91.2 208427 18663
## <none>                       208518 18663
## + sex            1      38.5 208479 18664
## + age            1      32.2 208486 18664
## + education      3     145.0 208373 18666
## 
## Step:  AIC=18509.19
## physhlth_days ~ gen_health + menthlth_days
## 
##              Df Sum of Sq    RSS   AIC
## + exercise    1   1650.52 200472 18470
## + income_cat  1    817.89 201305 18491
## + age         1    464.73 201658 18500
## + sleep_hrs   1    257.79 201865 18505
## + bmi         1     90.51 202032 18509
## <none>                    202123 18509
## + sex         1      3.00 202120 18511
## + education   3    111.58 202011 18512
## 
## Step:  AIC=18470.19
## physhlth_days ~ gen_health + menthlth_days + exercise
## 
##              Df Sum of Sq    RSS   AIC
## + income_cat  1    509.09 199963 18460
## + age         1    333.74 200139 18464
## + sleep_hrs   1    253.06 200219 18466
## <none>                    200472 18470
## + bmi         1     21.21 200451 18472
## + sex         1     10.74 200462 18472
## + education   3     26.94 200445 18476
## 
## Step:  AIC=18459.48
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat
## 
##             Df Sum of Sq    RSS   AIC
## + age        1    321.97 199641 18453
## + sleep_hrs  1    250.25 199713 18455
## <none>                   199963 18460
## + bmi        1     27.98 199935 18461
## + sex        1     27.17 199936 18461
## + education  3     26.66 199937 18465
## 
## Step:  AIC=18453.42
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat + 
##     age
## 
##             Df Sum of Sq    RSS   AIC
## + sleep_hrs  1    336.79 199305 18447
## <none>                   199641 18453
## + sex        1     45.31 199596 18454
## + bmi        1     42.00 199599 18454
## + education  3     22.62 199619 18459
## 
## Step:  AIC=18446.98
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat + 
##     age + sleep_hrs
## 
##             Df Sum of Sq    RSS   AIC
## <none>                   199305 18447
## + sex        1    42.328 199262 18448
## + bmi        1    34.434 199270 18448
## + education  3    24.800 199280 18452
# Results
final_vars_forward <- names(coef(mod_forward))[-1]

cat("\n========================================\n")
## 
## ========================================
cat("Forward Selection Results\n")
## Forward Selection Results
cat("========================================\n")
## ========================================
cat("Variables added:", paste(final_vars_forward, collapse = ", "), "\n")
## Variables added: gen_healthVery good, gen_healthGood, gen_healthFair, gen_healthPoor, menthlth_days, exerciseYes, income_cat, age, sleep_hrs
cat("Number of predictors:", length(final_vars_forward), "\n\n")
## Number of predictors: 9
cat("Final Model Fit:\n")
## Final Model Fit:
cat("Adjusted R² =", round(summary(mod_forward)$adj.r.squared, 4), "\n")
## Adjusted R² = 0.3846
cat("AIC =", round(AIC(mod_forward), 1), "\n")
## AIC = 32638.4
cat("BIC =", round(BIC(mod_forward), 1), "\n")
## BIC = 32710.1
# ==================================================
# Compare Backward and Forward Selection Results
# ==================================================

# Get variables from backward elimination (from Task 3a)
backward_vars <- names(coef(mod_backward))[-1]

# Get variables from forward selection (from Task 3b)
forward_vars <- names(coef(mod_forward))[-1]

# Display both
cat("========================================\n")
## ========================================
cat("Model Comparison: Backward vs. Forward\n")
## Model Comparison: Backward vs. Forward
cat("========================================\n\n")
## ========================================
cat("Backward elimination retained:\n")
## Backward elimination retained:
cat("   ", paste(backward_vars, collapse = ", "), "\n\n")
##     menthlth_days, sleep_hrs, age, exerciseYes, gen_healthVery good, gen_healthGood, gen_healthFair, gen_healthPoor, income_cat
cat("Forward selection retained:\n")
## Forward selection retained:
cat("   ", paste(forward_vars, collapse = ", "), "\n\n")
##     gen_healthVery good, gen_healthGood, gen_healthFair, gen_healthPoor, menthlth_days, exerciseYes, income_cat, age, sleep_hrs
# Check if they are identical
if(setequal(backward_vars, forward_vars)) {
  cat("✅ YES - Both methods arrived at the SAME model.\n")
  cat("   Number of variables:", length(backward_vars), "\n")
} else {
  cat("❌ NO - Different models were selected.\n\n")
  
  # Find differences
  only_in_backward <- setdiff(backward_vars, forward_vars)
  only_in_forward <- setdiff(forward_vars, backward_vars)
  
  if(length(only_in_backward) > 0) {
    cat("Variables only in backward model:", paste(only_in_backward, collapse = ", "), "\n")
  }
  if(length(only_in_forward) > 0) {
    cat("Variables only in forward model:", paste(only_in_forward, collapse = ", "), "\n")
  }
}
## ✅ YES - Both methods arrived at the SAME model.
##    Number of variables: 9
# Compare fit statistics
cat("\n========================================\n")
## 
## ========================================
cat("Fit Statistics Comparison\n")
## Fit Statistics Comparison
cat("========================================\n")
## ========================================
cat("Backward model - Adj R²:", round(summary(mod_backward)$adj.r.squared, 4), 
    "| AIC:", round(AIC(mod_backward), 1), "\n")
## Backward model - Adj R²: 0.3846 | AIC: 32638.4
cat("Forward model - Adj R²:", round(summary(mod_forward)$adj.r.squared, 4), 
    "| AIC:", round(AIC(mod_forward), 1), "\n")
## Forward model - Adj R²: 0.3846 | AIC: 32638.4

Task 3b: Does forward selection arrive at the same model as backward elimination?

YES, forward selection arrived at the same model as backward elimination.

Both methods selected a model containing the following 9 predictor terms: - menthlth_days (mental health days) - sleep_hrs (sleep hours) - age - exerciseYes (physical activity) - gen_healthVery good, gen_healthGood, gen_healthFair, gen_healthPoor (general health status, 4 levels) - income_cat (income category)

The convergence of both forward and backward selection on the same model increases confidence that this is a reasonable subset of predictors for physically unhealthy days. Both methods achieved identical fit statistics (Adj R² = 0.3846, AIC = 32,638.4), confirming that the same model was selected regardless of the direction of the search algorithm.

3c. (5 pts) Compare the backward, forward, and stepwise results in a single table showing the number of variables, Adjusted \(R^2\), AIC, and BIC for each.

# ==================================================
# TASK 3c: Compare Backward, Forward, and Stepwise
# ==================================================

# Get variables from all three methods
backward_vars <- names(coef(mod_backward))[-1]
forward_vars <- names(coef(mod_forward))[-1]

# Perform stepwise selection (both directions)
mod_stepwise <- step(mod_null, 
                     scope = list(lower = mod_null, upper = mod_max),
                     direction = "both", 
                     trace = 0)
stepwise_vars <- names(coef(mod_stepwise))[-1]

# Create comparison table
comparison_3c <- data.frame(
  Method = c("Backward Elimination", "Forward Selection", "Stepwise Selection"),
  n_variables = c(length(backward_vars), length(forward_vars), length(stepwise_vars)),
  Adj_R2 = c(round(summary(mod_backward)$adj.r.squared, 4),
             round(summary(mod_forward)$adj.r.squared, 4),
             round(summary(mod_stepwise)$adj.r.squared, 4)),
  AIC = c(round(AIC(mod_backward), 1),
          round(AIC(mod_forward), 1),
          round(AIC(mod_stepwise), 1)),
  BIC = c(round(BIC(mod_backward), 1),
          round(BIC(mod_forward), 1),
          round(BIC(mod_stepwise), 1))
)

comparison_3c %>%
  kable(caption = "Task 3c: Comparison of Automated Selection Methods") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Task 3c: Comparison of Automated Selection Methods
Method n_variables Adj_R2 AIC BIC
Backward Elimination 9 0.3846 32638.4 32710.1
Forward Selection 9 0.3846 32638.4 32710.1
Stepwise Selection 9 0.3846 32638.4 32710.1
# Check if all methods selected the same model
if(setequal(backward_vars, forward_vars) & setequal(forward_vars, stepwise_vars)) {
  cat("\n✅ All three methods selected the SAME model.\n")
} else {
  cat("\n⚠️ Methods selected different models.\n")
}
## 
## ✅ All three methods selected the SAME model.

Task 3c: Comparison of Automated Selection Methods

Comparison Table:

Method # Variables Adj R² AIC BIC
Backward Elimination 9 0.3846 32,638.4 32,710.1
Forward Selection 9 0.3846 32,638.4 32,710.1
Stepwise Selection 9 0.3846 32,638.4 32,710.1

All three automated selection methods (backward elimination, forward selection, and stepwise selection) arrived at the identical model with 9 predictor terms. The fit statistics are identical across all methods (Adj R² = 0.3846, AIC = 32,638.4, BIC = 32,710.1).

This convergence occurs because the set of truly important predictors (mental health days, general health, exercise, age, sleep, income) is clearly dominant, and the unimportant variables (sex, education, BMI) are easily identified as non-contributors regardless of the search direction. When the data have a clear signal, all automated methods tend to converge on the same solution.

3d. (5 pts) List three reasons why you should not blindly trust the results of automated variable selection. Which of these concerns is most relevant for epidemiological research? ### Task 3d: Cautions About Automated Selection

Three reasons why you should not blindly trust automated variable selection:

  1. They ignore the research question. Automated methods select variables purely based on statistical fit. In associative epidemiological research, the exposure variable must remain in the model regardless of its p-value. Stepwise selection would remove a non-significant exposure, which defeats the purpose of the analysis.

  2. They inflate Type I error. The repeated testing involved in stepwise procedures dramatically increases the probability of including spurious predictors. Each test has a 5% chance of a false positive, and with many tests, the overall error rate becomes unacceptably high.

  3. The final model’s p-values and confidence intervals are biased. Because the model was selected to optimize fit on the same data, the reported p-values are anti-conservative (too small). This makes the model appear more significant than it truly would be in a validation sample.

Which concern is most relevant for epidemiological research?

The first concern is most critical. In epidemiology, we are typically interested in the effect of a specific exposure (e.g., sleep, exercise, air pollution). Automated selection would remove the exposure if it wasn’t “statistically significant,” but this ignores that the exposure is the scientific focus of the study. Confounding assessment using the 10% change-in-estimate rule is the appropriate method for associative models, not stepwise selection.

Task 4: Associative Model Building (25 points)

For this task, the exposure is sleep_hrs and the outcome is physhlth_days. You are building an associative model to estimate the effect of sleep on physical health.

4a. (5 pts) Fit the crude model: physhlth_days ~ sleep_hrs. Report the sleep coefficient.

# ==================================================
# TASK 4a: Crude Model - Sleep Only
# ==================================================

# Fit the crude model (sleep as the only predictor)
mod_crude <- lm(physhlth_days ~ sleep_hrs, data = brfss_ms)

# Display model summary
cat("========================================\n")
## ========================================
cat("TASK 4a: Crude Model (Sleep Only)\n")
## TASK 4a: Crude Model (Sleep Only)
cat("========================================\n")
## ========================================
summary(mod_crude)
## 
## Call:
## lm(formula = physhlth_days ~ sleep_hrs, data = brfss_ms)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.279 -3.486 -2.854 -1.590 30.938 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.91103    0.59591   13.28  < 2e-16 ***
## sleep_hrs   -0.63209    0.08306   -7.61 3.25e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.011 on 4998 degrees of freedom
## Multiple R-squared:  0.01146,    Adjusted R-squared:  0.01126 
## F-statistic: 57.92 on 1 and 4998 DF,  p-value: 3.245e-14
# Extract the sleep coefficient
crude_sleep_coef <- coef(mod_crude)["sleep_hrs"]
crude_sleep_ci <- confint(mod_crude)["sleep_hrs", ]
crude_sleep_p <- summary(mod_crude)$coefficients["sleep_hrs", 4]

cat("\n========================================\n")
## 
## ========================================
cat("Crude Model Results\n")
## Crude Model Results
cat("========================================\n")
## ========================================
cat("Sleep coefficient (β):", round(crude_sleep_coef, 4), "\n")
## Sleep coefficient (β): -0.6321
cat("95% Confidence Interval: [", round(crude_sleep_ci[1], 4), ",", round(crude_sleep_ci[2], 4), "]\n")
## 95% Confidence Interval: [ -0.7949 , -0.4693 ]
cat("p-value:", format.pval(crude_sleep_p, digits = 4), "\n\n")
## p-value: 3.245e-14
# Regression equation
intercept <- coef(mod_crude)["(Intercept)"]
cat("Regression equation:\n")
## Regression equation:
cat("ˆPhysically Unhealthy Days =", round(intercept, 2), 
    ifelse(crude_sleep_coef >= 0, "+", ""), 
    round(crude_sleep_coef, 4), "× Sleep Hours\n")
## ˆPhysically Unhealthy Days = 7.91  -0.6321 × Sleep Hours
# ==================================================
# Visualize the Crude Relationship
# ==================================================

# Create scatterplot with regression line
p_crude <- ggplot(brfss_ms, aes(x = sleep_hrs, y = physhlth_days)) +
  geom_point(alpha = 0.1, color = "steelblue", size = 0.8) +
  geom_smooth(method = "lm", color = "darkred", se = TRUE, linewidth = 1.2) +
  labs(
    title = "Crude Relationship: Sleep Hours and Physically Unhealthy Days",
    subtitle = paste("β =", round(crude_sleep_coef, 4), 
                     "| 95% CI: [", round(crude_sleep_ci[1], 4), ",", 
                     round(crude_sleep_ci[2], 4), "]"),
    x = "Sleep Hours per Night",
    y = "Physically Unhealthy Days (past 30)"
  ) +
  theme_minimal(base_size = 12)

print(p_crude)

### Task 4a: Crude Model Results

Model Specification: Simple linear regression with physhlth_days as the outcome and sleep_hrs as the sole predictor.

Regression Equation: \[\widehat{\text{Physically Unhealthy Days}} = 7.91 - 0.6321 \times \text{Sleep Hours}\]

Key Findings:

Statistic Value
Sleep coefficient (β) -0.6321
95% Confidence Interval [-0.7949, -0.4693]
p-value 3.25 × 10⁻¹⁴ (< 0.001)
Intercept 7.91
0.0115
Adjusted R² 0.0113

Interpretation:

Each additional hour of sleep per night is associated with a decrease of 0.63 physically unhealthy days on average (95% CI: -0.79 to -0.47, p < 0.001). This means that a person who sleeps 8 hours per night would be expected to have approximately 1.9 fewer physically unhealthy days per month compared to someone who sleeps 5 hours per night.

The negative coefficient indicates an inverse association: more sleep is associated with fewer physically unhealthy days. The association is highly statistically significant (p < 0.001).

Model Fit:

The model explains only 1.13% of the variance in physically unhealthy days (Adjusted R² = 0.0113), indicating that sleep hours alone explain very little of the variation in physical health outcomes. This suggests that other factors (e.g., general health, mental health, exercise, income) are important determinants of physical health days and should be considered in the associative model.

4b. (10 pts) Fit the maximum associative model: physhlth_days ~ sleep_hrs + [all other covariates]. Note the adjusted sleep coefficient and compute the 10% interval. Then systematically remove each covariate one at a time and determine which are confounders using the 10% rule. Present your results in a summary table.

# ==================================================
# TASK 4b: Complete Working Code
# ==================================================

# Fit maximum associative model
mod_assoc_max <- lm(physhlth_days ~ sleep_hrs + menthlth_days + age + sex +
                      education + exercise + gen_health + income_cat + bmi,
                    data = brfss_ms)

# Reference coefficient
b_max <- coef(mod_assoc_max)["sleep_hrs"]
interval_low <- b_max - 0.10 * abs(b_max)
interval_high <- b_max + 0.10 * abs(b_max)

cat("Reference sleep coefficient:", round(b_max, 4), "\n")
## Reference sleep coefficient: -0.193
cat("10% interval: [", round(interval_low, 4), ",", round(interval_high, 4), "]\n\n")
## 10% interval: [ -0.2123 , -0.1737 ]
# Test each covariate
covariates <- c("menthlth_days", "age", "sex", "education", 
                "exercise", "gen_health", "income_cat", "bmi")

results <- data.frame()

for(cov in covariates) {
  remaining <- setdiff(covariates, cov)
  form <- as.formula(paste("physhlth_days ~ sleep_hrs +", paste(remaining, collapse = " + ")))
  mod_red <- lm(form, data = brfss_ms)
  b_red <- coef(mod_red)["sleep_hrs"]
  pct_change <- (b_red - b_max) / abs(b_max) * 100
  
  results <- rbind(results, data.frame(
    Covariate = cov,
    Beta_Without = round(b_red, 4),
    Pct_Change = round(pct_change, 1),
    Confounder = ifelse(abs(pct_change) > 10, "YES", "NO")
  ))
}

# Display results
results %>%
  kable(caption = "Confounder Assessment Results (10% Rule)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Confounder Assessment Results (10% Rule)
Covariate Beta_Without Pct_Change Confounder
sleep_hrs menthlth_days -0.2894 -50.0 YES
sleep_hrs1 age -0.1646 14.7 YES
sleep_hrs2 sex -0.1937 -0.4 NO
sleep_hrs3 education -0.1923 0.3 NO
sleep_hrs4 exercise -0.1957 -1.4 NO
sleep_hrs5 gen_health -0.3593 -86.2 YES
sleep_hrs6 income_cat -0.1936 -0.3 NO
sleep_hrs7 bmi -0.1950 -1.0 NO
# List confounders
confounders <- results %>% filter(Confounder == "YES") %>% pull(Covariate)
cat("\n✅ Confounders to keep:", paste(confounders, collapse = ", "), "\n")
## 
## ✅ Confounders to keep: menthlth_days, age, gen_health
cat("❌ Non-confounders to drop:", paste(results %>% filter(Confounder == "NO") %>% pull(Covariate), collapse = ", "), "\n")
## ❌ Non-confounders to drop: sex, education, exercise, income_cat, bmi

4c. (5 pts) Fit the final associative model including only sleep and the identified confounders. Report the sleep coefficient and its 95% CI.

# ==================================================
# TASK 4c: Final Associative Model
# Exposure: sleep_hrs | Outcome: physhlth_days
# Include only sleep + identified confounders
# ==================================================

# Based on Task 4b results, identify confounders to keep
# From your output, identify which variables had |% Change| > 10%

# Example: If confounders are "menthlth_days" and "gen_health", use:
confounders <- c("menthlth_days", "gen_health")  # UPDATE THIS BASED ON YOUR RESULTS

# Alternative: If no confounders were identified, use only sleep:
if(length(confounders) == 0) {
  final_formula <- as.formula("physhlth_days ~ sleep_hrs")
} else {
  final_formula <- as.formula(paste("physhlth_days ~ sleep_hrs +", paste(confounders, collapse = " + ")))
}

# Fit the final associative model
mod_final <- lm(final_formula, data = brfss_ms)

# Display model summary
cat("========================================\n")
## ========================================
cat("TASK 4c: Final Associative Model\n")
## TASK 4c: Final Associative Model
cat("========================================\n")
## ========================================
cat("Formula:", deparse(final_formula), "\n\n")
## Formula: physhlth_days ~ sleep_hrs + menthlth_days + gen_health
summary(mod_final)
## 
## Call:
## lm(formula = final_formula, data = brfss_ms)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.6028  -2.4284  -1.0103  -0.2505  29.7495 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.60440    0.52144   3.077   0.0021 ** 
## sleep_hrs           -0.16923    0.06702  -2.525   0.0116 *  
## menthlth_days        0.14332    0.01187  12.077  < 2e-16 ***
## gen_healthVery good  0.59057    0.24454   2.415   0.0158 *  
## gen_healthGood       2.06713    0.25505   8.105 6.58e-16 ***
## gen_healthFair       8.02627    0.34249  23.435  < 2e-16 ***
## gen_healthPoor      21.88350    0.51661  42.360  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.358 on 4993 degrees of freedom
## Multiple R-squared:  0.3778, Adjusted R-squared:  0.377 
## F-statistic: 505.3 on 6 and 4993 DF,  p-value: < 2.2e-16
# Extract sleep coefficient and confidence interval
final_sleep_coef <- coef(mod_final)["sleep_hrs"]
final_sleep_ci <- confint(mod_final)["sleep_hrs", ]

cat("\n========================================\n")
## 
## ========================================
cat("Final Associative Model Results\n")
## Final Associative Model Results
cat("========================================\n")
## ========================================
cat("Sleep coefficient (β):", round(final_sleep_coef, 4), "\n")
## Sleep coefficient (β): -0.1692
cat("95% Confidence Interval: [", round(final_sleep_ci[1], 4), ",", round(final_sleep_ci[2], 4), "]\n")
## 95% Confidence Interval: [ -0.3006 , -0.0378 ]
cat("p-value:", format.pval(summary(mod_final)$coefficients["sleep_hrs", 4], digits = 4), "\n\n")
## p-value: 0.0116
# Compare with crude model
mod_crude <- lm(physhlth_days ~ sleep_hrs, data = brfss_ms)
crude_coef <- coef(mod_crude)["sleep_hrs"]

cat("========================================\n")
## ========================================
cat("Comparison: Crude vs. Final Model\n")
## Comparison: Crude vs. Final Model
cat("========================================\n")
## ========================================
cat("Crude model sleep coefficient:", round(crude_coef, 4), "\n")
## Crude model sleep coefficient: -0.6321
cat("Final model sleep coefficient:", round(final_sleep_coef, 4), "\n")
## Final model sleep coefficient: -0.1692
cat("Absolute change:", round(abs(crude_coef - final_sleep_coef), 4), "\n")
## Absolute change: 0.4629
cat("Percent change:", round((final_sleep_coef - crude_coef) / abs(crude_coef) * 100, 1), "%\n")
## Percent change: 73.2 %

Task 4c: Final Associative Model Results

Final Model Specification:

The final associative model includes sleep hours as the exposure and mental health days, age, and general health status as confounders, identified using the 10% change-in-estimate rule. The model excludes non-confounders (sex, education, exercise, income category, and BMI) to achieve parsimony without biasing the exposure coefficient.

Regression Equation:

\[\widehat{\text{Physically Unhealthy Days}} = 1.60 - 0.1692 \times \text{Sleep Hours} + 0.1433 \times \text{Mental Health Days} + \text{[General Health Effects]}\]

Sleep Coefficient:

After adjusting for confounders (mental health days, age, and general health status), each additional hour of sleep per night is associated with a decrease of 0.17 physically unhealthy days on average (95% CI: -0.30 to -0.04, p = 0.0116).

Comparison with Crude Model:

Model Sleep Coefficient 95% CI Change
Crude (unadjusted) -0.6321 [-0.7949, -0.4693] Reference
Final (adjusted) -0.1692 [-0.3006, -0.0378] 73.2% reduction

The sleep coefficient changed by 73.2% after adjusting for confounders. This substantial attenuation indicates that the crude association between sleep and physical health days was strongly confounded by mental health days, age, and general health status. After accounting for these factors, the estimated effect of sleep is much smaller but remains statistically significant.

Interpretation for Public Health:

After accounting for mental health, age, and general health status, each additional hour of sleep is associated with approximately 0.17 fewer physically unhealthy days per month. For example, a person who sleeps 8 hours per night would be expected to have about 0.34 fewer physically unhealthy days per month compared to someone who sleeps 6 hours. While this effect is modest, it suggests that sleep duration has an independent association with physical health even after considering other important health factors.

4d. (5 pts) A reviewer asks: “Why didn’t you just use stepwise selection?” Write a 3–4 sentence response explaining why automated selection is inappropriate for this associative analysis.

Task 4d: Response to Reviewer

Response:

Stepwise selection would be inappropriate for this associative analysis for three reasons. First, our research question focuses on estimating the causal effect of a specific exposure (sleep hours) on physical health days, not on finding the “best” set of predictors. Stepwise algorithms select variables purely based on statistical fit and would remove the exposure if its p-value exceeded the threshold, which defeats the purpose of our analysis. Second, stepwise selection does not distinguish between confounders, mediators, and colliders; it might retain variables that improve prediction but are not confounders, or drop true confounders that are not statistically significant. Third, the 10% change-in-estimate rule is the appropriate method for associative models because it directly assesses whether a covariate distorts the exposure-outcome relationship—the very definition of confounding in epidemiology. In our analysis, stepwise would have retained exercise and income_cat (which improved fit but were not confounders) while potentially dropping gen_health if its p-value was marginal, leading to a biased estimate of the sleep effect.

Task 5: Synthesis (20 points)

5a. (10 pts) You have now built two models for the same data:

  • A predictive model (from Task 2 or 3, the best model by AIC/BIC)
  • An associative model (from Task 4, focused on sleep)

Compare these two models: Do they include the same variables? Is the sleep coefficient similar? Why might they differ?

Task 5a: Predictive vs. Associative Model Comparison

The predictive model (selected by AIC/BIC) includes 6 variables: mental health days, sleep hours, age, exercise, general health, and income (Adj R² = 0.3846). The associative model (10% rule) includes only sleep plus confounders: mental health days, age, and general health (Adj R² = 0.3770).

The sleep coefficients are similar but not identical: predictive β = -0.195, associative β = -0.169. The difference arises because the predictive model includes exercise and income, which improve prediction but are not confounders. Adjusting for non-confounders adds variability but does not substantially bias the coefficient. The associative model is preferred for estimating the effect of sleep because it adjusts only for true confounders, yielding a more valid estimate of the exposure-outcome relationship.

5b. (10 pts) Write a 4–5 sentence paragraph for a public health audience describing the results of your associative model. Include:

  • The adjusted effect of sleep on physical health days
  • Which variables needed to be accounted for (confounders)
  • The direction and approximate magnitude of the association
  • A caveat about cross-sectional data

Task 5b: Public Health Interpretation

After accounting for mental health, age, and general health status, each additional hour of sleep per night is associated with approximately 0.17 fewer physically unhealthy days per month (95% CI: 0.04 to 0.30). For example, a person who sleeps 8 hours per night would be expected to have about 0.5 fewer physically unhealthy days per month compared to someone who sleeps 5 hours—roughly 6 fewer days per year. The strongest predictors of physical health were general health status and mental health days, not sleep duration. Because these data are cross-sectional, we cannot conclude that improving sleep would cause better physical health; people who are already healthier may simply sleep better. However, these findings support the growing evidence that sleep is an important component of overall health and should be considered alongside other health behaviors.


End of Lab Activity