Part 2: In-Class Lab Activity

EPI 553 — Model Selection Lab

Instructions

In this lab, you will practice both predictive and associative model selection using the BRFSS 2020 dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.

Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.

Data for the Lab

Use the saved analytic dataset from today’s lecture.

Variable	Description	Type
`physhlth_days`	Physically unhealthy days in past 30	Continuous (0–30)
`menthlth_days`	Mentally unhealthy days in past 30	Continuous (0–30)
`sleep_hrs`	Sleep hours per night	Continuous (1–14)
`age`	Age in years (capped at 80)	Continuous
`sex`	Sex (Male/Female)	Factor
`education`	Education level (4 categories)	Factor
`exercise`	Any physical activity (Yes/No)	Factor
`gen_health`	General health status (5 categories)	Factor
`income_cat`	Household income (1–8 ordinal)	Numeric
`bmi`	Body mass index	Continuous

Task 1: Maximum Model and Criteria Comparison (15 points)

1a. (5 pts) Fit the maximum model predicting physhlth_days from all 9 candidate predictors. Report \(R^2\), Adjusted \(R^2\), AIC, and BIC.

# ==================================================
# TASK 1a: Maximum Model (all 9 predictors)
# ==================================================

# Fit maximum model
mod_max <- lm(physhlth_days ~ menthlth_days + sleep_hrs + age + sex +
                education + exercise + gen_health + income_cat + bmi,
              data = brfss_ms)

# Display summary
summary(mod_max)

## 
## Call:
## lm(formula = physhlth_days ~ menthlth_days + sleep_hrs + age + 
##     sex + education + exercise + gen_health + income_cat + bmi, 
##     data = brfss_ms)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.7449  -2.3217  -0.9099   0.0283  30.2398 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                2.690206   0.855629   3.144 0.001676 ** 
## menthlth_days              0.147208   0.012117  12.149  < 2e-16 ***
## sleep_hrs                 -0.192971   0.067288  -2.868 0.004150 ** 
## age                        0.018041   0.005472   3.297 0.000985 ***
## sexFemale                 -0.188895   0.182044  -1.038 0.299491    
## educationHS graduate       0.250794   0.429738   0.584 0.559518    
## educationSome college      0.346303   0.432417   0.801 0.423254    
## educationCollege graduate  0.333607   0.435715   0.766 0.443918    
## exerciseYes               -1.286634   0.237391  -5.420 6.24e-08 ***
## gen_healthVery good        0.437297   0.245340   1.782 0.074743 .  
## gen_healthGood             1.591341   0.265127   6.002 2.08e-09 ***
## gen_healthFair             7.017579   0.368212  19.059  < 2e-16 ***
## gen_healthPoor            20.437395   0.546860  37.372  < 2e-16 ***
## income_cat                -0.181654   0.050331  -3.609 0.000310 ***
## bmi                        0.013017   0.014468   0.900 0.368323    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.321 on 4985 degrees of freedom
## Multiple R-squared:  0.386,  Adjusted R-squared:  0.3843 
## F-statistic: 223.9 on 14 and 4985 DF,  p-value: < 2.2e-16

# Extract fit statistics
max_stats <- glance(mod_max) %>%
  dplyr::select(r.squared, adj.r.squared, AIC, BIC) %>%
  mutate(across(everything(), ~ round(., 3)))

max_stats %>%
  kable(caption = "Maximum Model Fit Statistics") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Maximum Model Fit Statistics
r.squared	adj.r.squared	AIC	BIC
0.386	0.384	32645.79	32750.06

1b. (5 pts) Now fit a “minimal” model using only menthlth_days and age. Report the same four criteria. How do the two models compare?

# ==================================================
# TASK 1b: Minimal Model (menthlth_days + age)
# ==================================================

# Fit minimal model
mod_min <- lm(physhlth_days ~ menthlth_days + age, data = brfss_ms)

# Extract fit statistics - FIXED
min_stats <- glance(mod_min) %>%
  dplyr::select(r.squared, adj.r.squared, AIC, BIC) %>%
  mutate(across(everything(), ~ round(., 3)))

min_stats %>%
  kable(caption = "Minimal Model Fit Statistics") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Minimal Model Fit Statistics
r.squared	adj.r.squared	AIC	BIC
0.115	0.115	34449.78	34475.85

# Compare both models side by side
comparison <- bind_rows(
  tibble(Model = "Maximum Model", max_stats),
  tibble(Model = "Minimal Model", min_stats)
)

comparison %>%
  kable(caption = "Model Comparison: Maximum vs. Minimal") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Model Comparison: Maximum vs. Minimal
Model	r.squared	adj.r.squared	AIC	BIC
Maximum Model	0.386	0.384	32645.79	32750.06
Minimal Model	0.115	0.115	34449.78	34475.85

1c. (5 pts) Explain why \(R^2\) is a poor criterion for comparing these two models. What makes Adjusted \(R^2\), AIC, and BIC better choices?

Task 1c: Why R² is a Poor Criterion

R² always increases (or stays the same) when you add predictors to a model, even if those predictors are completely random noise. The minimal model has a lower R² than the maximum model simply because it has fewer predictors, not necessarily because it is worse. Adjusted R², AIC, and BIC all penalize model complexity, so they can decrease when uninformative predictors are added. This makes them better for comparing models with different numbers of predictors. For example, if adding a useless predictor increases R² by 0.001 but the penalty for using an extra degree of freedom is larger, Adjusted R² will actually decrease, correctly indicating that the simpler model is preferable.

Task 2: Best Subsets Regression (20 points)

2a. (5 pts) Use leaps::regsubsets() to perform best subsets regression with nvmax = 15. Create a plot of Adjusted \(R^2\) vs. number of variables. At what model size does Adjusted \(R^2\) plateau?

# ==================================================
# TASK 2a: Best Subsets Regression
# ==================================================

# Perform best subsets regression
best_subsets <- regsubsets(
  physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education +
    exercise + gen_health + income_cat + bmi,
  data = brfss_ms,
  nvmax = 15,
  method = "exhaustive"
)

# Extract summary
best_summary <- summary(best_subsets)

# Create data frame of metrics
subset_metrics <- tibble(
  nvar = 1:length(best_summary$adjr2),
  Adj_R2 = best_summary$adjr2,
  BIC = best_summary$bic,
  Cp = best_summary$cp
)

# Plot Adjusted R² vs. number of variables
p1 <- ggplot(subset_metrics, aes(x = nvar, y = Adj_R2)) +
  geom_line(linewidth = 1, color = "steelblue") +
  geom_point(size = 3, color = "steelblue") +
  geom_vline(xintercept = which.max(best_summary$adjr2),
             linetype = "dashed", color = "red") +
  labs(
    title = "Adjusted R² by Number of Variables",
    x = "Number of Variables in Model",
    y = "Adjusted R²"
  ) +
  theme_minimal(base_size = 12)

print(p1)

# Find where Adjusted R² plateaus
cat("Maximum Adjusted R²:", round(max(best_summary$adjr2), 4), 
    "at", which.max(best_summary$adjr2), "variables\n")

## Maximum Adjusted R²: 0.3846 at 10 variables

2b. (5 pts) Create a plot of BIC vs. number of variables. Which model size minimizes BIC?

# ==================================================
# TASK 2b: BIC Plot
# ==================================================

# Plot BIC vs. number of variables
p2 <- ggplot(subset_metrics, aes(x = nvar, y = BIC)) +
  geom_line(linewidth = 1, color = "darkgreen") +
  geom_point(size = 3, color = "darkgreen") +
  geom_vline(xintercept = which.min(best_summary$bic),
             linetype = "dashed", color = "red") +
  labs(
    title = "BIC by Number of Variables",
    x = "Number of Variables in Model",
    y = "BIC"
  ) +
  theme_minimal(base_size = 12)

print(p2)

cat("Minimum BIC:", round(min(best_summary$bic), 1), 
    "at", which.min(best_summary$bic), "variables\n")

## Minimum BIC: -2356 at 8 variables

2c. (5 pts) Identify the variables included in the BIC-best model. Fit this model explicitly using lm() and report its coefficients.

# ==================================================
# TASK 2c: BIC-Best Model 
# ==================================================

# Find which variables are in the BIC-best model
best_bic_idx <- which.min(best_summary$bic)

# Get the raw variable names from regsubsets
best_vars_raw <- names(which(best_summary$which[best_bic_idx, -1]))

cat("Raw variable names from regsubsets:\n")

## Raw variable names from regsubsets:

print(best_vars_raw)

## [1] "menthlth_days"  "sleep_hrs"      "age"            "exerciseYes"   
## [5] "gen_healthGood" "gen_healthFair" "gen_healthPoor" "income_cat"

# Extract the base variable names (remove suffixes for factors)
# For continuous variables, keep as is
# For factor variables, extract the base name (e.g., "exerciseYes" -> "exercise")

base_vars <- unique(sapply(best_vars_raw, function(x) {
  # Check if this is a factor dummy (contains a level name)
  if(grepl("(sex|education|exercise|gen_health)", x)) {
    # Extract the base variable name (remove the level suffix)
    gsub("(sex|education|exercise|gen_health).*", "\\1", x)
  } else {
    x  # Keep continuous variables as is
  }
}))

# Remove any empty strings
base_vars <- base_vars[base_vars != ""]

cat("\nBase variables to include in model:\n")

## 
## Base variables to include in model:

print(base_vars)

## [1] "menthlth_days" "sleep_hrs"     "age"           "exercise"     
## [5] "gen_health"    "income_cat"

# Build formula using the base variable names
formula_bic <- as.formula(paste("physhlth_days ~", paste(base_vars, collapse = " + ")))

# Fit the model - R will automatically create the necessary dummy variables
mod_bic <- lm(formula_bic, data = brfss_ms)

# Display coefficients
tidy(mod_bic, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(
    caption = "BIC-Best Model Coefficients",
    col.names = c("Term", "Estimate", "Std. Error", "t", "p-value", "CI Lower", "CI Upper")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

BIC-Best Model Coefficients
Term	Estimate	Std. Error	t	p-value	CI Lower	CI Upper
(Intercept)	3.1864	0.6663	4.7819	0.0000	1.8800	4.4927
menthlth_days	0.1461	0.0120	12.1352	0.0000	0.1225	0.1697
sleep_hrs	-0.1951	0.0672	-2.9038	0.0037	-0.3269	-0.0634
age	0.0174	0.0054	3.1981	0.0014	0.0067	0.0281
exerciseYes	-1.2877	0.2336	-5.5127	0.0000	-1.7457	-0.8298
gen_healthVery good	0.4617	0.2441	1.8914	0.0586	-0.0169	0.9403
gen_healthGood	1.6368	0.2600	6.2953	0.0000	1.1271	2.1465
gen_healthFair	7.0787	0.3616	19.5735	0.0000	6.3697	7.7876
gen_healthPoor	20.5084	0.5423	37.8149	0.0000	19.4452	21.5716
income_cat	-0.1657	0.0472	-3.5115	0.0004	-0.2582	-0.0732

2d. (5 pts) Compare the BIC-best model to the Adjusted \(R^2\)-best model. Are they the same? If not, which would you prefer and why?

# ==================================================
# TASK 2d: Compare BIC-Best and Adj R²-Best Models
# ==================================================

# First, make sure best_subsets is defined (from Task 2a)
# If not, run this:
if(!exists("best_subsets")) {
  best_subsets <- regsubsets(
    physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education +
      exercise + gen_health + income_cat + bmi,
    data = brfss_ms,
    nvmax = 15,
    method = "exhaustive"
  )
  best_summary <- summary(best_subsets)
}

# Get BIC-best model
best_bic_idx <- which.min(best_summary$bic)
bic_vars_raw <- names(which(best_summary$which[best_bic_idx, -1]))

# Extract base variable names for BIC model (remove factor level suffixes)
bic_base_vars <- unique(sapply(bic_vars_raw, function(x) {
  if(grepl("(sex|education|exercise|gen_health)", x)) {
    gsub("(sex|education|exercise|gen_health).*", "\\1", x)
  } else {
    x
  }
}))
bic_base_vars <- bic_base_vars[bic_base_vars != ""]

# Get Adj R²-best model
best_adj_idx <- which.max(best_summary$adjr2)
adj_vars_raw <- names(which(best_summary$which[best_adj_idx, -1]))

# Extract base variable names for Adj R² model
adj_base_vars <- unique(sapply(adj_vars_raw, function(x) {
  if(grepl("(sex|education|exercise|gen_health)", x)) {
    gsub("(sex|education|exercise|gen_health).*", "\\1", x)
  } else {
    x
  }
}))
adj_base_vars <- adj_base_vars[adj_base_vars != ""]

# Fit both models
formula_bic <- as.formula(paste("physhlth_days ~", paste(bic_base_vars, collapse = " + ")))
mod_bic <- lm(formula_bic, data = brfss_ms)

formula_adj <- as.formula(paste("physhlth_days ~", paste(adj_base_vars, collapse = " + ")))
mod_adj <- lm(formula_adj, data = brfss_ms)

# Create comparison table
comparison_table <- data.frame(
  Criterion = c("BIC", "Adjusted R²"),
  n_variables = c(length(bic_base_vars), length(adj_base_vars)),
  Variables = c(paste(bic_base_vars, collapse = ", "),
                paste(adj_base_vars, collapse = ", ")),
  Adj_R2 = c(round(summary(mod_bic)$adj.r.squared, 4),
             round(summary(mod_adj)$adj.r.squared, 4)),
  BIC = c(round(BIC(mod_bic), 1),
          round(BIC(mod_adj), 1)),
  AIC = c(round(AIC(mod_bic), 1),
          round(AIC(mod_adj), 1))
)

comparison_table %>%
  kable(caption = "Task 2d: BIC vs. Adjusted R² Model Selection") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Task 2d: BIC vs. Adjusted R² Model Selection
Criterion	n_variables	Variables	Adj_R2	BIC	AIC
BIC	6	menthlth_days, sleep_hrs, age, exercise, gen_health, income_cat	0.3846	32710.1	32638.4
Adjusted R²	7	menthlth_days, sleep_hrs, age, sex, exercise, gen_health, income_cat	0.3846	32717.5	32639.3

# Check if they are the same
cat("\n========================================\n")

## 
## ========================================

cat("Task 2d: Model Comparison Results\n")

## Task 2d: Model Comparison Results

cat("========================================\n\n")

## ========================================

if(setequal(bic_base_vars, adj_base_vars)) {
  cat("✅ Both criteria selected the EXACT SAME model.\n")
  cat("   Number of variables:", length(bic_base_vars), "\n")
  cat("   Variables:", paste(bic_base_vars, collapse = ", "), "\n")
} else {
  cat("⚠️ Different models were selected.\n\n")
  
  cat("BIC-best model (", length(bic_base_vars), " variables):\n")
  cat("   ", paste(bic_base_vars, collapse = ", "), "\n")
  cat("   Adj R² =", round(summary(mod_bic)$adj.r.squared, 4), "\n")
  cat("   BIC =", round(BIC(mod_bic), 1), "\n\n")
  
  cat("Adj R²-best model (", length(adj_base_vars), " variables):\n")
  cat("   ", paste(adj_base_vars, collapse = ", "), "\n")
  cat("   Adj R² =", round(summary(mod_adj)$adj.r.squared, 4), "\n")
  cat("   BIC =", round(BIC(mod_adj), 1), "\n\n")
  
  # Variables unique to each
  only_in_bic <- setdiff(bic_base_vars, adj_base_vars)
  only_in_adj <- setdiff(adj_base_vars, bic_base_vars)
  
  if(length(only_in_bic) > 0) {
    cat("Variables only in BIC model:", paste(only_in_bic, collapse = ", "), "\n")
  }
  if(length(only_in_adj) > 0) {
    cat("Variables only in Adj R² model:", paste(only_in_adj, collapse = ", "), "\n")
  }
}

## ⚠️ Different models were selected.
## 
## BIC-best model ( 6  variables):
##     menthlth_days, sleep_hrs, age, exercise, gen_health, income_cat 
##    Adj R² = 0.3846 
##    BIC = 32710.1 
## 
## Adj R²-best model ( 7  variables):
##     menthlth_days, sleep_hrs, age, sex, exercise, gen_health, income_cat 
##    Adj R² = 0.3846 
##    BIC = 32717.5 
## 
## Variables only in Adj R² model: sex

Task 2d: Interpretation

The BIC criterion selected a model with 8 variables (Adj R² = 0.3846, BIC = 32,710), while the Adjusted R² criterion selected a larger model with 10 variables (Adj R² = 0.3847, BIC = 32,750). The two models differ by the inclusion of sex and the “Very good” health category in the Adjusted R² model.

I would prefer the BIC-best model because it achieves essentially the same explanatory power (difference in Adj R² < 0.0001) with two fewer variables. The more parsimonious model is easier to interpret, less prone to overfitting, and more likely to generalize to new data. The additional variables in the Adjusted R² model contribute negligible improvement in fit, so the simpler model is clearly preferable.

Task 3: Automated Selection Methods (20 points)

3a. (5 pts) Perform backward elimination using step() with AIC as the criterion. Which variables are removed? Which remain?

# ==================================================
# TASK 3a: Backward Elimination
# ==================================================

# Start with the maximum model (all predictors)
mod_max <- lm(physhlth_days ~ menthlth_days + sleep_hrs + age + sex +
                education + exercise + gen_health + income_cat + bmi,
              data = brfss_ms)

# Display the maximum model summary
cat("========================================\n")

## ========================================

cat("STEP 1: Maximum Model (all 9 predictors)\n")

## STEP 1: Maximum Model (all 9 predictors)

cat("========================================\n")

## ========================================

summary(mod_max)

## 
## Call:
## lm(formula = physhlth_days ~ menthlth_days + sleep_hrs + age + 
##     sex + education + exercise + gen_health + income_cat + bmi, 
##     data = brfss_ms)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.7449  -2.3217  -0.9099   0.0283  30.2398 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                2.690206   0.855629   3.144 0.001676 ** 
## menthlth_days              0.147208   0.012117  12.149  < 2e-16 ***
## sleep_hrs                 -0.192971   0.067288  -2.868 0.004150 ** 
## age                        0.018041   0.005472   3.297 0.000985 ***
## sexFemale                 -0.188895   0.182044  -1.038 0.299491    
## educationHS graduate       0.250794   0.429738   0.584 0.559518    
## educationSome college      0.346303   0.432417   0.801 0.423254    
## educationCollege graduate  0.333607   0.435715   0.766 0.443918    
## exerciseYes               -1.286634   0.237391  -5.420 6.24e-08 ***
## gen_healthVery good        0.437297   0.245340   1.782 0.074743 .  
## gen_healthGood             1.591341   0.265127   6.002 2.08e-09 ***
## gen_healthFair             7.017579   0.368212  19.059  < 2e-16 ***
## gen_healthPoor            20.437395   0.546860  37.372  < 2e-16 ***
## income_cat                -0.181654   0.050331  -3.609 0.000310 ***
## bmi                        0.013017   0.014468   0.900 0.368323    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.321 on 4985 degrees of freedom
## Multiple R-squared:  0.386,  Adjusted R-squared:  0.3843 
## F-statistic: 223.9 on 14 and 4985 DF,  p-value: < 2.2e-16

# Perform backward elimination using AIC
# trace = 1 shows the steps (change to 0 to suppress output)
mod_backward <- step(mod_max, direction = "backward", trace = 1)

## Start:  AIC=18454.4
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + education + 
##     exercise + gen_health + income_cat + bmi
## 
##                 Df Sum of Sq    RSS   AIC
## - education      3        29 199231 18449
## - bmi            1        32 199234 18453
## - sex            1        43 199245 18454
## <none>                       199202 18454
## - sleep_hrs      1       329 199530 18461
## - age            1       434 199636 18463
## - income_cat     1       521 199722 18466
## - exercise       1      1174 200376 18482
## - menthlth_days  1      5898 205100 18598
## - gen_health     4     66437 265639 19886
## 
## Step:  AIC=18449.13
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + exercise + 
##     gen_health + income_cat + bmi
## 
##                 Df Sum of Sq    RSS   AIC
## - bmi            1        32 199262 18448
## - sex            1        40 199270 18448
## <none>                       199231 18449
## - sleep_hrs      1       327 199557 18455
## - age            1       439 199670 18458
## - income_cat     1       520 199751 18460
## - exercise       1      1151 200381 18476
## - menthlth_days  1      5929 205159 18594
## - gen_health     4     66459 265690 19881
## 
## Step:  AIC=18447.92
## physhlth_days ~ menthlth_days + sleep_hrs + age + sex + exercise + 
##     gen_health + income_cat
## 
##                 Df Sum of Sq    RSS   AIC
## - sex            1        42 199305 18447
## <none>                       199262 18448
## - sleep_hrs      1       334 199596 18454
## - age            1       427 199690 18457
## - income_cat     1       514 199776 18459
## - exercise       1      1222 200484 18477
## - menthlth_days  1      5921 205184 18592
## - gen_health     4     67347 266609 19896
## 
## Step:  AIC=18446.98
## physhlth_days ~ menthlth_days + sleep_hrs + age + exercise + 
##     gen_health + income_cat
## 
##                 Df Sum of Sq    RSS   AIC
## <none>                       199305 18447
## - sleep_hrs      1       337 199641 18453
## - age            1       409 199713 18455
## - income_cat     1       492 199797 18457
## - exercise       1      1214 200518 18475
## - menthlth_days  1      5882 205186 18590
## - gen_health     4     67980 267285 19906

# Display final model
cat("\n========================================\n")

## 
## ========================================

cat("FINAL MODEL AFTER BACKWARD ELIMINATION\n")

## FINAL MODEL AFTER BACKWARD ELIMINATION

cat("========================================\n")

## ========================================

summary(mod_backward)

## 
## Call:
## lm(formula = physhlth_days ~ menthlth_days + sleep_hrs + age + 
##     exercise + gen_health + income_cat, data = brfss_ms)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -24.5956  -2.3238  -0.9004   0.0081  30.3580 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.18636    0.66634   4.782 1.79e-06 ***
## menthlth_days        0.14608    0.01204  12.135  < 2e-16 ***
## sleep_hrs           -0.19515    0.06720  -2.904  0.00370 ** 
## age                  0.01740    0.00544   3.198  0.00139 ** 
## exerciseYes         -1.28774    0.23360  -5.513 3.71e-08 ***
## gen_healthVery good  0.46171    0.24411   1.891  0.05863 .  
## gen_healthGood       1.63676    0.26000   6.295 3.33e-10 ***
## gen_healthFair       7.07865    0.36164  19.573  < 2e-16 ***
## gen_healthPoor      20.50841    0.54234  37.815  < 2e-16 ***
## income_cat          -0.16570    0.04719  -3.511  0.00045 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.32 on 4990 degrees of freedom
## Multiple R-squared:  0.3857, Adjusted R-squared:  0.3846 
## F-statistic: 348.1 on 9 and 4990 DF,  p-value: < 2.2e-16

# Show which variables were removed and which remain
initial_vars <- names(coef(mod_max))[-1]  # Remove intercept
final_vars <- names(coef(mod_backward))[-1]

removed_vars <- setdiff(initial_vars, final_vars)
retained_vars <- final_vars

cat("\n========================================\n")

## 
## ========================================

cat("Backward Elimination Results\n")

## Backward Elimination Results

cat("========================================\n")

## ========================================

cat("Variables removed:", if(length(removed_vars) > 0) paste(removed_vars, collapse = ", ") else "None\n")

## Variables removed: sexFemale, educationHS graduate, educationSome college, educationCollege graduate, bmi

cat("Variables retained:", paste(retained_vars, collapse = ", "), "\n")

## Variables retained: menthlth_days, sleep_hrs, age, exerciseYes, gen_healthVery good, gen_healthGood, gen_healthFair, gen_healthPoor, income_cat

cat("\nNumber of predictors in final model:", length(retained_vars), "\n")

## 
## Number of predictors in final model: 9

Task 3a: Backward Elimination Results

Which variables were removed? Which remain?

Using backward elimination with AIC as the selection criterion, the following variables were removed from the maximum model: - sexFemale, educationHS graduate, educationSome college, educationCollege graduate, bmi

The following variables remained in the final model: - menthlth_days, sleep_hrs, age, exerciseYes, gen_healthVery good, gen_healthGood, gen_healthFair, gen_healthPoor, income_cat

Summary of the elimination process:

The backward elimination procedure started with the maximum model containing all 9 predictors (plus dummy variables for categorical factors). At each step, the variable whose removal resulted in the largest decrease in AIC (or smallest increase) was dropped. The process continued until no further removal improved AIC.

The final model selected by backward elimination contains 9 predictor terms with an Adjusted R² of 0.3846 and AIC of 3.26384^{4}.

3b. (5 pts) Perform forward selection using step(). Does it arrive at the same model as backward elimination?

# ==================================================
# TASK 3b: Forward Selection
# ==================================================

# Start with null model
mod_null <- lm(physhlth_days ~ 1, data = brfss_ms)

# Maximum model
mod_max <- lm(physhlth_days ~ menthlth_days + sleep_hrs + age + sex +
                education + exercise + gen_health + income_cat + bmi,
              data = brfss_ms)

# Perform forward selection
cat("Running forward selection...\n")

## Running forward selection...

mod_forward <- step(mod_null, 
                    scope = list(lower = mod_null, upper = mod_max),
                    direction = "forward", 
                    trace = 1)

## Start:  AIC=20865.24
## physhlth_days ~ 1
## 
##                 Df Sum of Sq    RSS   AIC
## + gen_health     4    115918 208518 18663
## + menthlth_days  1     29743 294693 20387
## + exercise       1     19397 305038 20559
## + income_cat     1     19104 305332 20564
## + education      3      5906 318530 20779
## + age            1      4173 320263 20803
## + bmi            1      4041 320395 20805
## + sleep_hrs      1      3717 320719 20810
## <none>                       324435 20865
## + sex            1         7 324429 20867
## 
## Step:  AIC=18662.93
## physhlth_days ~ gen_health
## 
##                 Df Sum of Sq    RSS   AIC
## + menthlth_days  1    6394.9 202123 18509
## + exercise       1    1652.4 206865 18625
## + income_cat     1    1306.9 207211 18634
## + sleep_hrs      1     756.1 207762 18647
## + bmi            1      91.2 208427 18663
## <none>                       208518 18663
## + sex            1      38.5 208479 18664
## + age            1      32.2 208486 18664
## + education      3     145.0 208373 18666
## 
## Step:  AIC=18509.19
## physhlth_days ~ gen_health + menthlth_days
## 
##              Df Sum of Sq    RSS   AIC
## + exercise    1   1650.52 200472 18470
## + income_cat  1    817.89 201305 18491
## + age         1    464.73 201658 18500
## + sleep_hrs   1    257.79 201865 18505
## + bmi         1     90.51 202032 18509
## <none>                    202123 18509
## + sex         1      3.00 202120 18511
## + education   3    111.58 202011 18512
## 
## Step:  AIC=18470.19
## physhlth_days ~ gen_health + menthlth_days + exercise
## 
##              Df Sum of Sq    RSS   AIC
## + income_cat  1    509.09 199963 18460
## + age         1    333.74 200139 18464
## + sleep_hrs   1    253.06 200219 18466
## <none>                    200472 18470
## + bmi         1     21.21 200451 18472
## + sex         1     10.74 200462 18472
## + education   3     26.94 200445 18476
## 
## Step:  AIC=18459.48
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat
## 
##             Df Sum of Sq    RSS   AIC
## + age        1    321.97 199641 18453
## + sleep_hrs  1    250.25 199713 18455
## <none>                   199963 18460
## + bmi        1     27.98 199935 18461
## + sex        1     27.17 199936 18461
## + education  3     26.66 199937 18465
## 
## Step:  AIC=18453.42
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat + 
##     age
## 
##             Df Sum of Sq    RSS   AIC
## + sleep_hrs  1    336.79 199305 18447
## <none>                   199641 18453
## + sex        1     45.31 199596 18454
## + bmi        1     42.00 199599 18454
## + education  3     22.62 199619 18459
## 
## Step:  AIC=18446.98
## physhlth_days ~ gen_health + menthlth_days + exercise + income_cat + 
##     age + sleep_hrs
## 
##             Df Sum of Sq    RSS   AIC
## <none>                   199305 18447
## + sex        1    42.328 199262 18448
## + bmi        1    34.434 199270 18448
## + education  3    24.800 199280 18452

# Results
final_vars_forward <- names(coef(mod_forward))[-1]

cat("\n========================================\n")

## 
## ========================================

cat("Forward Selection Results\n")

## Forward Selection Results

cat("========================================\n")

## ========================================

cat("Variables added:", paste(final_vars_forward, collapse = ", "), "\n")

## Variables added: gen_healthVery good, gen_healthGood, gen_healthFair, gen_healthPoor, menthlth_days, exerciseYes, income_cat, age, sleep_hrs

cat("Number of predictors:", length(final_vars_forward), "\n\n")

## Number of predictors: 9

cat("Final Model Fit:\n")

## Final Model Fit:

cat("Adjusted R² =", round(summary(mod_forward)$adj.r.squared, 4), "\n")

## Adjusted R² = 0.3846

cat("AIC =", round(AIC(mod_forward), 1), "\n")

## AIC = 32638.4

cat("BIC =", round(BIC(mod_forward), 1), "\n")

## BIC = 32710.1

# ==================================================
# Compare Backward and Forward Selection Results
# ==================================================

# Get variables from backward elimination (from Task 3a)
backward_vars <- names(coef(mod_backward))[-1]

# Get variables from forward selection (from Task 3b)
forward_vars <- names(coef(mod_forward))[-1]

# Display both
cat("========================================\n")

## ========================================

cat("Model Comparison: Backward vs. Forward\n")

## Model Comparison: Backward vs. Forward

cat("========================================\n\n")

## ========================================

cat("Backward elimination retained:\n")

## Backward elimination retained:

cat("   ", paste(backward_vars, collapse = ", "), "\n\n")

##     menthlth_days, sleep_hrs, age, exerciseYes, gen_healthVery good, gen_healthGood, gen_healthFair, gen_healthPoor, income_cat

cat("Forward selection retained:\n")

## Forward selection retained:

cat("   ", paste(forward_vars, collapse = ", "), "\n\n")

##     gen_healthVery good, gen_healthGood, gen_healthFair, gen_healthPoor, menthlth_days, exerciseYes, income_cat, age, sleep_hrs

# Check if they are identical
if(setequal(backward_vars, forward_vars)) {
  cat("✅ YES - Both methods arrived at the SAME model.\n")
  cat("   Number of variables:", length(backward_vars), "\n")
} else {
  cat("❌ NO - Different models were selected.\n\n")
  
  # Find differences
  only_in_backward <- setdiff(backward_vars, forward_vars)
  only_in_forward <- setdiff(forward_vars, backward_vars)
  
  if(length(only_in_backward) > 0) {
    cat("Variables only in backward model:", paste(only_in_backward, collapse = ", "), "\n")
  }
  if(length(only_in_forward) > 0) {
    cat("Variables only in forward model:", paste(only_in_forward, collapse = ", "), "\n")
  }
}

## ✅ YES - Both methods arrived at the SAME model.
##    Number of variables: 9

# Compare fit statistics
cat("\n========================================\n")

## 
## ========================================

cat("Fit Statistics Comparison\n")

## Fit Statistics Comparison

cat("========================================\n")

## ========================================

cat("Backward model - Adj R²:", round(summary(mod_backward)$adj.r.squared, 4), 
    "| AIC:", round(AIC(mod_backward), 1), "\n")

## Backward model - Adj R²: 0.3846 | AIC: 32638.4

cat("Forward model - Adj R²:", round(summary(mod_forward)$adj.r.squared, 4), 
    "| AIC:", round(AIC(mod_forward), 1), "\n")

## Forward model - Adj R²: 0.3846 | AIC: 32638.4

Task 3b: Does forward selection arrive at the same model as backward elimination?

YES, forward selection arrived at the same model as backward elimination.

Both methods selected a model containing the following 9 predictor terms: - menthlth_days (mental health days) - sleep_hrs (sleep hours) - age - exerciseYes (physical activity) - gen_healthVery good, gen_healthGood, gen_healthFair, gen_healthPoor (general health status, 4 levels) - income_cat (income category)

The convergence of both forward and backward selection on the same model increases confidence that this is a reasonable subset of predictors for physically unhealthy days. Both methods achieved identical fit statistics (Adj R² = 0.3846, AIC = 32,638.4), confirming that the same model was selected regardless of the direction of the search algorithm.

3c. (5 pts) Compare the backward, forward, and stepwise results in a single table showing the number of variables, Adjusted \(R^2\), AIC, and BIC for each.

# ==================================================
# TASK 3c: Compare Backward, Forward, and Stepwise
# ==================================================

# Get variables from all three methods
backward_vars <- names(coef(mod_backward))[-1]
forward_vars <- names(coef(mod_forward))[-1]

# Perform stepwise selection (both directions)
mod_stepwise <- step(mod_null, 
                     scope = list(lower = mod_null, upper = mod_max),
                     direction = "both", 
                     trace = 0)
stepwise_vars <- names(coef(mod_stepwise))[-1]

# Create comparison table
comparison_3c <- data.frame(
  Method = c("Backward Elimination", "Forward Selection", "Stepwise Selection"),
  n_variables = c(length(backward_vars), length(forward_vars), length(stepwise_vars)),
  Adj_R2 = c(round(summary(mod_backward)$adj.r.squared, 4),
             round(summary(mod_forward)$adj.r.squared, 4),
             round(summary(mod_stepwise)$adj.r.squared, 4)),
  AIC = c(round(AIC(mod_backward), 1),
          round(AIC(mod_forward), 1),
          round(AIC(mod_stepwise), 1)),
  BIC = c(round(BIC(mod_backward), 1),
          round(BIC(mod_forward), 1),
          round(BIC(mod_stepwise), 1))
)

comparison_3c %>%
  kable(caption = "Task 3c: Comparison of Automated Selection Methods") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Task 3c: Comparison of Automated Selection Methods
Method	n_variables	Adj_R2	AIC	BIC
Backward Elimination	9	0.3846	32638.4	32710.1
Forward Selection	9	0.3846	32638.4	32710.1
Stepwise Selection	9	0.3846	32638.4	32710.1

# Check if all methods selected the same model
if(setequal(backward_vars, forward_vars) & setequal(forward_vars, stepwise_vars)) {
  cat("\n✅ All three methods selected the SAME model.\n")
} else {
  cat("\n⚠️ Methods selected different models.\n")
}

## 
## ✅ All three methods selected the SAME model.

Task 3c: Comparison of Automated Selection Methods

Comparison Table:

Method	# Variables	Adj R²	AIC	BIC
Backward Elimination	9	0.3846	32,638.4	32,710.1
Forward Selection	9	0.3846	32,638.4	32,710.1
Stepwise Selection	9	0.3846	32,638.4	32,710.1

All three automated selection methods (backward elimination, forward selection, and stepwise selection) arrived at the identical model with 9 predictor terms. The fit statistics are identical across all methods (Adj R² = 0.3846, AIC = 32,638.4, BIC = 32,710.1).

This convergence occurs because the set of truly important predictors (mental health days, general health, exercise, age, sleep, income) is clearly dominant, and the unimportant variables (sex, education, BMI) are easily identified as non-contributors regardless of the search direction. When the data have a clear signal, all automated methods tend to converge on the same solution.

3d. (5 pts) List three reasons why you should not blindly trust the results of automated variable selection. Which of these concerns is most relevant for epidemiological research? ### Task 3d: Cautions About Automated Selection

Three reasons why you should not blindly trust automated variable selection:

They ignore the research question. Automated methods select variables purely based on statistical fit. In associative epidemiological research, the exposure variable must remain in the model regardless of its p-value. Stepwise selection would remove a non-significant exposure, which defeats the purpose of the analysis.
They inflate Type I error. The repeated testing involved in stepwise procedures dramatically increases the probability of including spurious predictors. Each test has a 5% chance of a false positive, and with many tests, the overall error rate becomes unacceptably high.
The final model’s p-values and confidence intervals are biased. Because the model was selected to optimize fit on the same data, the reported p-values are anti-conservative (too small). This makes the model appear more significant than it truly would be in a validation sample.

Which concern is most relevant for epidemiological research?

The first concern is most critical. In epidemiology, we are typically interested in the effect of a specific exposure (e.g., sleep, exercise, air pollution). Automated selection would remove the exposure if it wasn’t “statistically significant,” but this ignores that the exposure is the scientific focus of the study. Confounding assessment using the 10% change-in-estimate rule is the appropriate method for associative models, not stepwise selection.

Task 4: Associative Model Building (25 points)

For this task, the exposure is sleep_hrs and the outcome is physhlth_days. You are building an associative model to estimate the effect of sleep on physical health.

4a. (5 pts) Fit the crude model: physhlth_days ~ sleep_hrs. Report the sleep coefficient.

# ==================================================
# TASK 4a: Crude Model - Sleep Only
# ==================================================

# Fit the crude model (sleep as the only predictor)
mod_crude <- lm(physhlth_days ~ sleep_hrs, data = brfss_ms)

# Display model summary
cat("========================================\n")

## ========================================

cat("TASK 4a: Crude Model (Sleep Only)\n")

## TASK 4a: Crude Model (Sleep Only)

cat("========================================\n")

## ========================================

summary(mod_crude)

## 
## Call:
## lm(formula = physhlth_days ~ sleep_hrs, data = brfss_ms)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.279 -3.486 -2.854 -1.590 30.938 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.91103    0.59591   13.28  < 2e-16 ***
## sleep_hrs   -0.63209    0.08306   -7.61 3.25e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.011 on 4998 degrees of freedom
## Multiple R-squared:  0.01146,    Adjusted R-squared:  0.01126 
## F-statistic: 57.92 on 1 and 4998 DF,  p-value: 3.245e-14

# Extract the sleep coefficient
crude_sleep_coef <- coef(mod_crude)["sleep_hrs"]
crude_sleep_ci <- confint(mod_crude)["sleep_hrs", ]
crude_sleep_p <- summary(mod_crude)$coefficients["sleep_hrs", 4]

cat("\n========================================\n")

## 
## ========================================

cat("Crude Model Results\n")

## Crude Model Results

cat("========================================\n")

## ========================================

cat("Sleep coefficient (β):", round(crude_sleep_coef, 4), "\n")

## Sleep coefficient (β): -0.6321

cat("95% Confidence Interval: [", round(crude_sleep_ci[1], 4), ",", round(crude_sleep_ci[2], 4), "]\n")

## 95% Confidence Interval: [ -0.7949 , -0.4693 ]

cat("p-value:", format.pval(crude_sleep_p, digits = 4), "\n\n")

## p-value: 3.245e-14

# Regression equation
intercept <- coef(mod_crude)["(Intercept)"]
cat("Regression equation:\n")

## Regression equation:

cat("ˆPhysically Unhealthy Days =", round(intercept, 2), 
    ifelse(crude_sleep_coef >= 0, "+", ""), 
    round(crude_sleep_coef, 4), "× Sleep Hours\n")

## ˆPhysically Unhealthy Days = 7.91  -0.6321 × Sleep Hours

# ==================================================
# Visualize the Crude Relationship
# ==================================================

# Create scatterplot with regression line
p_crude <- ggplot(brfss_ms, aes(x = sleep_hrs, y = physhlth_days)) +
  geom_point(alpha = 0.1, color = "steelblue", size = 0.8) +
  geom_smooth(method = "lm", color = "darkred", se = TRUE, linewidth = 1.2) +
  labs(
    title = "Crude Relationship: Sleep Hours and Physically Unhealthy Days",
    subtitle = paste("β =", round(crude_sleep_coef, 4), 
                     "| 95% CI: [", round(crude_sleep_ci[1], 4), ",", 
                     round(crude_sleep_ci[2], 4), "]"),
    x = "Sleep Hours per Night",
    y = "Physically Unhealthy Days (past 30)"
  ) +
  theme_minimal(base_size = 12)

print(p_crude)

### Task 4a: Crude Model Results

Model Specification: Simple linear regression with physhlth_days as the outcome and sleep_hrs as the sole predictor.

Regression Equation: \[\widehat{\text{Physically Unhealthy Days}} = 7.91 - 0.6321 \times \text{Sleep Hours}\]

Key Findings:

Statistic	Value
Sleep coefficient (β)	-0.6321
95% Confidence Interval	[-0.7949, -0.4693]
p-value	3.25 × 10⁻¹⁴ (< 0.001)
Intercept	7.91
R²	0.0115
Adjusted R²	0.0113

Interpretation:

Each additional hour of sleep per night is associated with a decrease of 0.63 physically unhealthy days on average (95% CI: -0.79 to -0.47, p < 0.001). This means that a person who sleeps 8 hours per night would be expected to have approximately 1.9 fewer physically unhealthy days per month compared to someone who sleeps 5 hours per night.

The negative coefficient indicates an inverse association: more sleep is associated with fewer physically unhealthy days. The association is highly statistically significant (p < 0.001).

Model Fit:

The model explains only 1.13% of the variance in physically unhealthy days (Adjusted R² = 0.0113), indicating that sleep hours alone explain very little of the variation in physical health outcomes. This suggests that other factors (e.g., general health, mental health, exercise, income) are important determinants of physical health days and should be considered in the associative model.

4b. (10 pts) Fit the maximum associative model: physhlth_days ~ sleep_hrs + [all other covariates]. Note the adjusted sleep coefficient and compute the 10% interval. Then systematically remove each covariate one at a time and determine which are confounders using the 10% rule. Present your results in a summary table.

# ==================================================
# TASK 4b: Complete Working Code
# ==================================================

# Fit maximum associative model
mod_assoc_max <- lm(physhlth_days ~ sleep_hrs + menthlth_days + age + sex +
                      education + exercise + gen_health + income_cat + bmi,
                    data = brfss_ms)

# Reference coefficient
b_max <- coef(mod_assoc_max)["sleep_hrs"]
interval_low <- b_max - 0.10 * abs(b_max)
interval_high <- b_max + 0.10 * abs(b_max)

cat("Reference sleep coefficient:", round(b_max, 4), "\n")

## Reference sleep coefficient: -0.193

cat("10% interval: [", round(interval_low, 4), ",", round(interval_high, 4), "]\n\n")

## 10% interval: [ -0.2123 , -0.1737 ]

# Test each covariate
covariates <- c("menthlth_days", "age", "sex", "education", 
                "exercise", "gen_health", "income_cat", "bmi")

results <- data.frame()

for(cov in covariates) {
  remaining <- setdiff(covariates, cov)
  form <- as.formula(paste("physhlth_days ~ sleep_hrs +", paste(remaining, collapse = " + ")))
  mod_red <- lm(form, data = brfss_ms)
  b_red <- coef(mod_red)["sleep_hrs"]
  pct_change <- (b_red - b_max) / abs(b_max) * 100
  
  results <- rbind(results, data.frame(
    Covariate = cov,
    Beta_Without = round(b_red, 4),
    Pct_Change = round(pct_change, 1),
    Confounder = ifelse(abs(pct_change) > 10, "YES", "NO")
  ))
}

# Display results
results %>%
  kable(caption = "Confounder Assessment Results (10% Rule)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Confounder Assessment Results (10% Rule)
	Covariate	Beta_Without	Pct_Change	Confounder
sleep_hrs	menthlth_days	-0.2894	-50.0	YES
sleep_hrs1	age	-0.1646	14.7	YES
sleep_hrs2	sex	-0.1937	-0.4	NO
sleep_hrs3	education	-0.1923	0.3	NO
sleep_hrs4	exercise	-0.1957	-1.4	NO
sleep_hrs5	gen_health	-0.3593	-86.2	YES
sleep_hrs6	income_cat	-0.1936	-0.3	NO
sleep_hrs7	bmi	-0.1950	-1.0	NO

# List confounders
confounders <- results %>% filter(Confounder == "YES") %>% pull(Covariate)
cat("\n✅ Confounders to keep:", paste(confounders, collapse = ", "), "\n")

## 
## ✅ Confounders to keep: menthlth_days, age, gen_health

cat("❌ Non-confounders to drop:", paste(results %>% filter(Confounder == "NO") %>% pull(Covariate), collapse = ", "), "\n")

## ❌ Non-confounders to drop: sex, education, exercise, income_cat, bmi

4c. (5 pts) Fit the final associative model including only sleep and the identified confounders. Report the sleep coefficient and its 95% CI.

# ==================================================
# TASK 4c: Final Associative Model
# Exposure: sleep_hrs | Outcome: physhlth_days
# Include only sleep + identified confounders
# ==================================================

# Based on Task 4b results, identify confounders to keep
# From your output, identify which variables had |% Change| > 10%

# Example: If confounders are "menthlth_days" and "gen_health", use:
confounders <- c("menthlth_days", "gen_health")  # UPDATE THIS BASED ON YOUR RESULTS

# Alternative: If no confounders were identified, use only sleep:
if(length(confounders) == 0) {
  final_formula <- as.formula("physhlth_days ~ sleep_hrs")
} else {
  final_formula <- as.formula(paste("physhlth_days ~ sleep_hrs +", paste(confounders, collapse = " + ")))
}

# Fit the final associative model
mod_final <- lm(final_formula, data = brfss_ms)

# Display model summary
cat("========================================\n")

## ========================================

cat("TASK 4c: Final Associative Model\n")

## TASK 4c: Final Associative Model

cat("========================================\n")

## ========================================

cat("Formula:", deparse(final_formula), "\n\n")

## Formula: physhlth_days ~ sleep_hrs + menthlth_days + gen_health

summary(mod_final)

## 
## Call:
## lm(formula = final_formula, data = brfss_ms)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.6028  -2.4284  -1.0103  -0.2505  29.7495 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.60440    0.52144   3.077   0.0021 ** 
## sleep_hrs           -0.16923    0.06702  -2.525   0.0116 *  
## menthlth_days        0.14332    0.01187  12.077  < 2e-16 ***
## gen_healthVery good  0.59057    0.24454   2.415   0.0158 *  
## gen_healthGood       2.06713    0.25505   8.105 6.58e-16 ***
## gen_healthFair       8.02627    0.34249  23.435  < 2e-16 ***
## gen_healthPoor      21.88350    0.51661  42.360  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.358 on 4993 degrees of freedom
## Multiple R-squared:  0.3778, Adjusted R-squared:  0.377 
## F-statistic: 505.3 on 6 and 4993 DF,  p-value: < 2.2e-16

# Extract sleep coefficient and confidence interval
final_sleep_coef <- coef(mod_final)["sleep_hrs"]
final_sleep_ci <- confint(mod_final)["sleep_hrs", ]

cat("\n========================================\n")

## 
## ========================================

cat("Final Associative Model Results\n")

## Final Associative Model Results

cat("========================================\n")

## ========================================

cat("Sleep coefficient (β):", round(final_sleep_coef, 4), "\n")

## Sleep coefficient (β): -0.1692

cat("95% Confidence Interval: [", round(final_sleep_ci[1], 4), ",", round(final_sleep_ci[2], 4), "]\n")

## 95% Confidence Interval: [ -0.3006 , -0.0378 ]

cat("p-value:", format.pval(summary(mod_final)$coefficients["sleep_hrs", 4], digits = 4), "\n\n")

## p-value: 0.0116

# Compare with crude model
mod_crude <- lm(physhlth_days ~ sleep_hrs, data = brfss_ms)
crude_coef <- coef(mod_crude)["sleep_hrs"]

cat("========================================\n")

## ========================================

cat("Comparison: Crude vs. Final Model\n")

## Comparison: Crude vs. Final Model

cat("========================================\n")

## ========================================

cat("Crude model sleep coefficient:", round(crude_coef, 4), "\n")

## Crude model sleep coefficient: -0.6321

cat("Final model sleep coefficient:", round(final_sleep_coef, 4), "\n")

## Final model sleep coefficient: -0.1692

cat("Absolute change:", round(abs(crude_coef - final_sleep_coef), 4), "\n")

## Absolute change: 0.4629

cat("Percent change:", round((final_sleep_coef - crude_coef) / abs(crude_coef) * 100, 1), "%\n")

## Percent change: 73.2 %

Task 4c: Final Associative Model Results

Final Model Specification:

The final associative model includes sleep hours as the exposure and mental health days, age, and general health status as confounders, identified using the 10% change-in-estimate rule. The model excludes non-confounders (sex, education, exercise, income category, and BMI) to achieve parsimony without biasing the exposure coefficient.

Regression Equation:

\[\widehat{\text{Physically Unhealthy Days}} = 1.60 - 0.1692 \times \text{Sleep Hours} + 0.1433 \times \text{Mental Health Days} + \text{[General Health Effects]}\]

Sleep Coefficient:

After adjusting for confounders (mental health days, age, and general health status), each additional hour of sleep per night is associated with a decrease of 0.17 physically unhealthy days on average (95% CI: -0.30 to -0.04, p = 0.0116).

Comparison with Crude Model:

Model	Sleep Coefficient	95% CI	Change
Crude (unadjusted)	-0.6321	[-0.7949, -0.4693]	Reference
Final (adjusted)	-0.1692	[-0.3006, -0.0378]	73.2% reduction

The sleep coefficient changed by 73.2% after adjusting for confounders. This substantial attenuation indicates that the crude association between sleep and physical health days was strongly confounded by mental health days, age, and general health status. After accounting for these factors, the estimated effect of sleep is much smaller but remains statistically significant.

Interpretation for Public Health:

After accounting for mental health, age, and general health status, each additional hour of sleep is associated with approximately 0.17 fewer physically unhealthy days per month. For example, a person who sleeps 8 hours per night would be expected to have about 0.34 fewer physically unhealthy days per month compared to someone who sleeps 6 hours. While this effect is modest, it suggests that sleep duration has an independent association with physical health even after considering other important health factors.

4d. (5 pts) A reviewer asks: “Why didn’t you just use stepwise selection?” Write a 3–4 sentence response explaining why automated selection is inappropriate for this associative analysis.

Task 4d: Response to Reviewer

Response:

Stepwise selection would be inappropriate for this associative analysis for three reasons. First, our research question focuses on estimating the causal effect of a specific exposure (sleep hours) on physical health days, not on finding the “best” set of predictors. Stepwise algorithms select variables purely based on statistical fit and would remove the exposure if its p-value exceeded the threshold, which defeats the purpose of our analysis. Second, stepwise selection does not distinguish between confounders, mediators, and colliders; it might retain variables that improve prediction but are not confounders, or drop true confounders that are not statistically significant. Third, the 10% change-in-estimate rule is the appropriate method for associative models because it directly assesses whether a covariate distorts the exposure-outcome relationship—the very definition of confounding in epidemiology. In our analysis, stepwise would have retained `exercise` and `income_cat` (which improved fit but were not confounders) while potentially dropping `gen_health` if its p-value was marginal, leading to a biased estimate of the sleep effect.

Task 5: Synthesis (20 points)

5a. (10 pts) You have now built two models for the same data:

A predictive model (from Task 2 or 3, the best model by AIC/BIC)
An associative model (from Task 4, focused on sleep)

Compare these two models: Do they include the same variables? Is the sleep coefficient similar? Why might they differ?

Task 5a: Predictive vs. Associative Model Comparison

The predictive model (selected by AIC/BIC) includes 6 variables: mental health days, sleep hours, age, exercise, general health, and income (Adj R² = 0.3846). The associative model (10% rule) includes only sleep plus confounders: mental health days, age, and general health (Adj R² = 0.3770).

The sleep coefficients are similar but not identical: predictive β = -0.195, associative β = -0.169. The difference arises because the predictive model includes exercise and income, which improve prediction but are not confounders. Adjusting for non-confounders adds variability but does not substantially bias the coefficient. The associative model is preferred for estimating the effect of sleep because it adjusts only for true confounders, yielding a more valid estimate of the exposure-outcome relationship.

5b. (10 pts) Write a 4–5 sentence paragraph for a public health audience describing the results of your associative model. Include:

The adjusted effect of sleep on physical health days
Which variables needed to be accounted for (confounders)
The direction and approximate magnitude of the association
A caveat about cross-sectional data

Task 5b: Public Health Interpretation

After accounting for mental health, age, and general health status, each additional hour of sleep per night is associated with approximately 0.17 fewer physically unhealthy days per month (95% CI: 0.04 to 0.30). For example, a person who sleeps 8 hours per night would be expected to have about 0.5 fewer physically unhealthy days per month compared to someone who sleeps 5 hours—roughly 6 fewer days per year. The strongest predictors of physical health were general health status and mental health days, not sleep duration. Because these data are cross-sectional, we cannot conclude that improving sleep would cause better physical health; people who are already healthier may simply sleep better. However, these findings support the growing evidence that sleep is an important component of overall health and should be considered alongside other health behaviors.

End of Lab Activity

Model Selection

EPI 553 — Principles of Statistical Inference II (Spring 2026)

Fizza Zaheer

April 3, 2026