EPI 553 — Homework 4: Model Selection and Introduction to Logistic Regression

Submission: Knit this file to HTML, publish to RPubs with the title epi553_hw03_lastname_firstname, and paste the RPubs URL in the Brightspace submission comments. Also upload this .Rmd file to Brightspace.

Part 0: Data Preparation (10 points)

# Load required packages
library(tidyverse)
library(haven)
library(broom)
library(kableExtra)
library(car)
library(gtsummary)
library(leaps)
library(pROC)
library(ResourceSelection)

# Import the dataset — update the path if needed
brfss_raw <- read_xpt("LLCP2023.XPT")

# Report dimensions of raw dataset
dim(brfss_raw)

## [1] 433323    350

nrow(brfss_raw)

## [1] 433323

ncol(brfss_raw)

## [1] 350

cat("The raw BRFSS dataset contains", nrow(brfss_raw), "rows and", ncol(brfss_raw), "columns.\n")

## The raw BRFSS dataset contains 433323 rows and 350 columns.

# Select outcome + chosen predictors
brfss_work <- brfss_raw %>%
  select(
    PHYSHLTH,
    MENTHLTH,
    `_BMI5`,
    SEXVAR,
    EXERANY2,
    ADDEPEV3,
    `_AGEG5YR`,
    `_INCOMG1`,
    EDUCA,
    `_SMOKER3`
  )

# Recode variables
brfss_clean <- brfss_work %>%
  mutate(
    physhlth_days = case_when(
      PHYSHLTH == 88 ~ 0,
      PHYSHLTH %in% c(77, 99) ~ NA_real_,
      PHYSHLTH >= 1 & PHYSHLTH <= 30 ~ as.numeric(PHYSHLTH),
      TRUE ~ NA_real_
    ),

    menthlth_days = case_when(
      MENTHLTH == 88 ~ 0,
      MENTHLTH %in% c(77, 99) ~ NA_real_,
      MENTHLTH >= 1 & MENTHLTH <= 30 ~ as.numeric(MENTHLTH),
      TRUE ~ NA_real_
    ),

    bmi = case_when(
      `_BMI5` == 9999 ~ NA_real_,
      TRUE ~ as.numeric(`_BMI5`) / 100
    ),

    sex = case_when(
      SEXVAR == 1 ~ "Male",
      SEXVAR == 2 ~ "Female",
      TRUE ~ NA_character_
    ),

    exercise = case_when(
      EXERANY2 == 1 ~ "Yes",
      EXERANY2 == 2 ~ "No",
      EXERANY2 %in% c(7, 9) ~ NA_character_,
      TRUE ~ NA_character_
    ),

    depression = case_when(
      ADDEPEV3 == 1 ~ "Yes",
      ADDEPEV3 == 2 ~ "No",
      ADDEPEV3 %in% c(7, 9) ~ NA_character_,
      TRUE ~ NA_character_
    ),

    age_group = case_when(
      `_AGEG5YR` %in% 1:3 ~ "18-34",
      `_AGEG5YR` %in% 4:6 ~ "35-49",
      `_AGEG5YR` %in% 7:9 ~ "50-64",
      `_AGEG5YR` %in% 10:13 ~ "65+",
      `_AGEG5YR` == 14 ~ NA_character_,
      TRUE ~ NA_character_
    ),

    income = case_when(
      `_INCOMG1` %in% 1:2 ~ "Less than $25K",
      `_INCOMG1` %in% 3:4 ~ "$25K-$49K",
      `_INCOMG1` %in% 5:6 ~ "$50K-$99K",
      `_INCOMG1` == 7 ~ "$100K+",
      `_INCOMG1` == 9 ~ NA_character_,
      TRUE ~ NA_character_
    ),

    education = case_when(
      EDUCA %in% 1:3 ~ "Less than HS",
      EDUCA == 4 ~ "HS/GED",
      EDUCA == 5 ~ "Some college",
      EDUCA == 6 ~ "College graduate",
      EDUCA == 9 ~ NA_character_,
      TRUE ~ NA_character_
    ),

    smoking = case_when(
      `_SMOKER3` %in% 1:2 ~ "Current",
      `_SMOKER3` == 3 ~ "Former",
      `_SMOKER3` == 4 ~ "Never",
      `_SMOKER3` == 9 ~ NA_character_,
      TRUE ~ NA_character_
    )
  ) %>%
  mutate(
    sex = relevel(factor(sex), ref = "Male"),
    exercise = relevel(factor(exercise), ref = "No"),
    depression = relevel(factor(depression), ref = "No"),
    age_group = relevel(factor(age_group), ref = "18-34"),
    income = relevel(factor(income), ref = "Less than $25K"),
    education = relevel(factor(education), ref = "Less than HS"),
    smoking = relevel(factor(smoking), ref = "Never")
  )

# Keep only cleaned variables for the analytic dataset
brfss_clean_final <- brfss_clean %>%
  select(
    physhlth_days,
    menthlth_days,
    bmi,
    sex,
    exercise,
    depression,
    age_group,
    income,
    education,
    smoking
  )

# Missingness before removing missing data
missing_summary <- tibble(
  variable = c("physhlth_days", "menthlth_days", "bmi"),
  missing_n = c(
    sum(is.na(brfss_clean_final$physhlth_days)),
    sum(is.na(brfss_clean_final$menthlth_days)),
    sum(is.na(brfss_clean_final$bmi))
  )
) %>%
  mutate(
    missing_pct = 100 * missing_n / nrow(brfss_clean_final)
  )

missing_summary

## # A tibble: 3 × 3
##   variable      missing_n missing_pct
##   <chr>             <int>       <dbl>
## 1 physhlth_days     10785        2.49
## 2 menthlth_days      8108        1.87
## 3 bmi               40535        9.35

# Create analytic dataset
set.seed(1220)

brfss_analytic <- brfss_clean_final %>%
  drop_na() %>%
  slice_sample(n = 8000)

# Report final analytic sample size
nrow(brfss_analytic)

## [1] 8000

# Descriptive summary table
tbl_summary(
  brfss_analytic,
  type = list(
    sex ~ "categorical",
    exercise ~ "categorical",
    depression ~ "categorical",
    age_group ~ "categorical",
    income ~ "categorical",
    education ~ "categorical",
    smoking ~ "categorical"
  )
)

Characteristic	N = 8,000¹
physhlth_days	0 (0, 4)
menthlth_days	0 (0, 5)
bmi	27 (24, 32)
sex
Male	4,007 (50%)
Female	3,993 (50%)
exercise
No	1,763 (22%)
Yes	6,237 (78%)
depression
No	6,313 (79%)
Yes	1,687 (21%)
age_group
18-34	1,287 (16%)
35-49	1,683 (21%)
50-64	2,098 (26%)
65+	2,932 (37%)
income
Less than $25K	1,092 (14%)
$100K+	653 (8.2%)
$25K-$49K	1,888 (24%)
$50K-$99K	4,367 (55%)
education
Less than HS	387 (4.8%)
College graduate	3,746 (47%)
HS/GED	1,720 (22%)
Some college	2,147 (27%)
smoking
Never	4,815 (60%)
Current	929 (12%)
Former	2,256 (28%)
¹ Median (Q1, Q3); n (%)

Step 3 Interpretation: Sample Size

Before removing missing data, physhlth_days had 10,785 missing values (2.49%), menthlth_days had 8,108 (1.87%), and BMI had 40,535 (9.35%). After removing observations with missing values and randomly sampling with set.seed(1220), the final analytic dataset included 8,000 respondents.

Part 1: Model Selection for Linear Regression (35 points)

#----------------------------------------------------------#
# PART 1: Model Selection for Linear Regression
#----------------------------------------------------------#
# Assumes brfss_analytic was created in Part 0

# Outcome: physhlth_days
# Predictors:
# menthlth_days, bmi, sex, exercise, depression,
# age_group, income, education, smoking

#----------------------------------------------------------#
# Step 1: Fit the Maximum Model
#----------------------------------------------------------#

max_lm <- lm(
  physhlth_days ~ menthlth_days + bmi + sex + exercise + depression +
    age_group + income + education + smoking,
  data = brfss_analytic
)

summary(max_lm)

## 
## Call:
## lm(formula = physhlth_days ~ menthlth_days + bmi + sex + exercise + 
##     depression + age_group + income + education + smoking, data = brfss_analytic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.9005  -3.9214  -1.8200   0.7445  31.2775 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                3.27832    0.66322   4.943 7.85e-07 ***
## menthlth_days              0.29364    0.01195  24.563  < 2e-16 ***
## bmi                        0.09918    0.01396   7.107 1.29e-12 ***
## sexFemale                 -0.06683    0.18126  -0.369 0.712382    
## exerciseYes               -3.57069    0.22529 -15.849  < 2e-16 ***
## depressionYes              0.83235    0.24556   3.390 0.000703 ***
## age_group35-49             0.92707    0.29954   3.095 0.001975 ** 
## age_group50-64             2.22979    0.28720   7.764 9.26e-15 ***
## age_group65+               3.02349    0.27616  10.948  < 2e-16 ***
## income$100K+              -3.91649    0.42567  -9.201  < 2e-16 ***
## income$25K-$49K           -2.44157    0.30554  -7.991 1.53e-15 ***
## income$50K-$99K           -3.50150    0.29178 -12.001  < 2e-16 ***
## educationCollege graduate  0.42865    0.45029   0.952 0.341157    
## educationHS/GED            0.03516    0.44933   0.078 0.937628    
## educationSome college      0.45289    0.44870   1.009 0.312837    
## smokingCurrent             1.08265    0.29955   3.614 0.000303 ***
## smokingFormer              0.57889    0.20811   2.782 0.005421 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.887 on 7983 degrees of freedom
## Multiple R-squared:  0.203,  Adjusted R-squared:  0.2014 
## F-statistic: 127.1 on 16 and 7983 DF,  p-value: < 2.2e-16

# Model fit statistics for maximum model
max_model_stats <- tibble(
  Model = "Maximum Model",
  R_squared = summary(max_lm)$r.squared,
  Adjusted_R_squared = summary(max_lm)$adj.r.squared,
  AIC = AIC(max_lm),
  BIC = BIC(max_lm)
)

max_model_stats %>%
  mutate(across(where(is.numeric), round, 3)) %>%
  kbl(caption = "Maximum Linear Regression Model Fit Statistics") %>%
  kable_styling(full_width = FALSE)

Maximum Linear Regression Model Fit Statistics
Model	R_squared	Adjusted_R_squared	AIC	BIC
Maximum Model	0.203	0.201	55764.86	55890.63

#----------------------------------------------------------#
# Step 2: Best Subsets Regression
#----------------------------------------------------------#

# Create model matrix so factors are expanded into dummy variables
x <- model.matrix(
  physhlth_days ~ menthlth_days + bmi + sex + exercise + depression +
    age_group + income + education + smoking,
  data = brfss_analytic
)

# Remove intercept column
x <- x[, -1]

# Build dataset for regsubsets
regsub_data <- data.frame(
  physhlth_days = brfss_analytic$physhlth_days,
  x
)

# Set nvmax equal to total number of candidate predictors/parameters
nvmax_val <- ncol(regsub_data) - 1

best_subsets <- regsubsets(
  physhlth_days ~ .,
  data = regsub_data,
  nvmax = nvmax_val,
  method = "exhaustive"
)

best_subsets_summary <- summary(best_subsets)

best_subsets_results <- tibble(
  model_size = 1:nvmax_val,
  adj_r2 = best_subsets_summary$adjr2,
  bic = best_subsets_summary$bic
)

best_subsets_results %>%
  mutate(
    adj_r2 = round(adj_r2, 4),
    bic = round(bic, 2)
  ) %>%
  kbl(caption = "Best Subsets Results by Model Size") %>%
  kable_styling(full_width = FALSE)

Best Subsets Results by Model Size
model_size	adj_r2	bic
1	0.1070	-888.43
2	0.1576	-1347.18
3	0.1673	-1432.08
4	0.1742	-1490.36
5	0.1805	-1543.54
6	0.1865	-1594.84
7	0.1918	-1638.67
8	0.1974	-1686.63
9	0.1987	-1691.62
10	0.2000	-1696.32
11	0.2008	-1695.95
12	0.2014	-1694.30
13	0.2015	-1687.85
14	0.2016	-1679.88
15	0.2015	-1671.02
16	0.2014	-1662.04

# Plot Adjusted R-squared across model sizes
ggplot(best_subsets_results, aes(x = model_size, y = adj_r2)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Best Subsets: Adjusted R-squared by Model Size",
    x = "Model Size",
    y = "Adjusted R-squared"
  ) +
  theme_minimal()

# Plot BIC across model sizes
ggplot(best_subsets_results, aes(x = model_size, y = bic)) +
  geom_line() +
  geom_point() +
  labs(
    title = "Best Subsets: BIC by Model Size",
    x = "Model Size",
    y = "BIC"
  ) +
  theme_minimal()

# Identify best model size by Adjusted R-squared and BIC
best_adj_r2_row <- best_subsets_results %>%
  filter(adj_r2 == max(adj_r2))

best_bic_row <- best_subsets_results %>%
  filter(bic == min(bic))

cat("Model size with maximum Adjusted R-squared:", best_adj_r2_row$model_size, "\n")

## Model size with maximum Adjusted R-squared: 14

cat("Model size with minimum BIC:", best_bic_row$model_size, "\n")

## Model size with minimum BIC: 10

cat("\nVariables in model with maximum Adjusted R-squared:\n")

## 
## Variables in model with maximum Adjusted R-squared:

print(coef(best_subsets, best_adj_r2_row$model_size))

##               (Intercept)             menthlth_days                       bmi 
##                 3.2695545                 0.2933798                 0.0993539 
##               exerciseYes             depressionYes            age_group35.49 
##                -3.5664466                 0.8227978                 0.9171982 
##            age_group50.64              age_group65.              income.100K. 
##                 2.2210320                 3.0151712                -3.8990537 
##           income.25K..49K           income.50K..99K educationCollege.graduate 
##                -2.4381856                -3.4929870                 0.3948283 
##     educationSome.college            smokingCurrent             smokingFormer 
##                 0.4220925                 1.0912107                 0.5848116

cat("\nVariables in model with minimum BIC:\n")

## 
## Variables in model with minimum BIC:

print(coef(best_subsets, best_bic_row$model_size))

##     (Intercept)   menthlth_days             bmi     exerciseYes   depressionYes 
##       3.7739087       0.2974875       0.0962865      -3.5865640       0.9002398 
##  age_group35.49  age_group50.64    age_group65.    income.100K. income.25K..49K 
##       1.0994401       2.3926369       3.1804914      -3.9434145      -2.4191235 
## income.50K..99K 
##      -3.4765315

#----------------------------------------------------------#
# Step 3: Stepwise Selection Methods
#----------------------------------------------------------#

# Backward elimination
backward_model <- step(
  max_lm,
  direction = "backward",
  trace = 0
)

# Forward selection
null_model <- lm(physhlth_days ~ 1, data = brfss_analytic)

forward_model <- step(
  null_model,
  scope = formula(max_lm),
  direction = "forward",
  trace = 0
)

# Stepwise selection (both directions)
stepwise_model <- step(
  null_model,
  scope = formula(max_lm),
  direction = "both",
  trace = 0
)

# Display summaries
summary(backward_model)

## 
## Call:
## lm(formula = physhlth_days ~ menthlth_days + bmi + exercise + 
##     depression + age_group + income + smoking, data = brfss_analytic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.1018  -3.8895  -1.8323   0.7495  30.9167 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.42720    0.55800   6.142 8.54e-10 ***
## menthlth_days    0.29357    0.01193  24.612  < 2e-16 ***
## bmi              0.09891    0.01390   7.114 1.22e-12 ***
## exerciseYes     -3.52573    0.22348 -15.777  < 2e-16 ***
## depressionYes    0.85465    0.24349   3.510 0.000451 ***
## age_group35-49   0.95193    0.29762   3.198 0.001387 ** 
## age_group50-64   2.24872    0.28575   7.870 4.02e-15 ***
## age_group65+     3.05550    0.27408  11.148  < 2e-16 ***
## income$100K+    -3.73413    0.40666  -9.182  < 2e-16 ***
## income$25K-$49K -2.38110    0.30272  -7.866 4.15e-15 ***
## income$50K-$99K -3.35910    0.27753 -12.103  < 2e-16 ***
## smokingCurrent   1.02655    0.29420   3.489 0.000487 ***
## smokingFormer    0.55675    0.20567   2.707 0.006803 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.887 on 7987 degrees of freedom
## Multiple R-squared:  0.2026, Adjusted R-squared:  0.2014 
## F-statistic: 169.1 on 12 and 7987 DF,  p-value: < 2.2e-16

summary(forward_model)

## 
## Call:
## lm(formula = physhlth_days ~ menthlth_days + exercise + income + 
##     age_group + bmi + smoking + depression, data = brfss_analytic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.1018  -3.8895  -1.8323   0.7495  30.9167 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.42720    0.55800   6.142 8.54e-10 ***
## menthlth_days    0.29357    0.01193  24.612  < 2e-16 ***
## exerciseYes     -3.52573    0.22348 -15.777  < 2e-16 ***
## income$100K+    -3.73413    0.40666  -9.182  < 2e-16 ***
## income$25K-$49K -2.38110    0.30272  -7.866 4.15e-15 ***
## income$50K-$99K -3.35910    0.27753 -12.103  < 2e-16 ***
## age_group35-49   0.95193    0.29762   3.198 0.001387 ** 
## age_group50-64   2.24872    0.28575   7.870 4.02e-15 ***
## age_group65+     3.05550    0.27408  11.148  < 2e-16 ***
## bmi              0.09891    0.01390   7.114 1.22e-12 ***
## smokingCurrent   1.02655    0.29420   3.489 0.000487 ***
## smokingFormer    0.55675    0.20567   2.707 0.006803 ** 
## depressionYes    0.85465    0.24349   3.510 0.000451 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.887 on 7987 degrees of freedom
## Multiple R-squared:  0.2026, Adjusted R-squared:  0.2014 
## F-statistic: 169.1 on 12 and 7987 DF,  p-value: < 2.2e-16

summary(stepwise_model)

## 
## Call:
## lm(formula = physhlth_days ~ menthlth_days + exercise + income + 
##     age_group + bmi + smoking + depression, data = brfss_analytic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.1018  -3.8895  -1.8323   0.7495  30.9167 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.42720    0.55800   6.142 8.54e-10 ***
## menthlth_days    0.29357    0.01193  24.612  < 2e-16 ***
## exerciseYes     -3.52573    0.22348 -15.777  < 2e-16 ***
## income$100K+    -3.73413    0.40666  -9.182  < 2e-16 ***
## income$25K-$49K -2.38110    0.30272  -7.866 4.15e-15 ***
## income$50K-$99K -3.35910    0.27753 -12.103  < 2e-16 ***
## age_group35-49   0.95193    0.29762   3.198 0.001387 ** 
## age_group50-64   2.24872    0.28575   7.870 4.02e-15 ***
## age_group65+     3.05550    0.27408  11.148  < 2e-16 ***
## bmi              0.09891    0.01390   7.114 1.22e-12 ***
## smokingCurrent   1.02655    0.29420   3.489 0.000487 ***
## smokingFormer    0.55675    0.20567   2.707 0.006803 ** 
## depressionYes    0.85465    0.24349   3.510 0.000451 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.887 on 7987 degrees of freedom
## Multiple R-squared:  0.2026, Adjusted R-squared:  0.2014 
## F-statistic: 169.1 on 12 and 7987 DF,  p-value: < 2.2e-16

# Extract terms from each final model
backward_terms <- attr(terms(backward_model), "term.labels")
forward_terms <- attr(terms(forward_model), "term.labels")
stepwise_terms <- attr(terms(stepwise_model), "term.labels")

cat("\nBackward model variables:\n")

## 
## Backward model variables:

print(backward_terms)

## [1] "menthlth_days" "bmi"           "exercise"      "depression"   
## [5] "age_group"     "income"        "smoking"

cat("\nForward model variables:\n")

## 
## Forward model variables:

print(forward_terms)

## [1] "menthlth_days" "exercise"      "income"        "age_group"    
## [5] "bmi"           "smoking"       "depression"

cat("\nStepwise model variables:\n")

## 
## Stepwise model variables:

print(stepwise_terms)

## [1] "menthlth_days" "exercise"      "income"        "age_group"    
## [5] "bmi"           "smoking"       "depression"

# Compare variables retained/excluded
all_predictors <- c(
  "menthlth_days", "bmi", "sex", "exercise", "depression",
  "age_group", "income", "education", "smoking"
)

retained_all_three <- Reduce(intersect, list(backward_terms, forward_terms, stepwise_terms))
excluded_all_three <- setdiff(
  all_predictors,
  union(union(backward_terms, forward_terms), stepwise_terms)
)

cat("\nRetained by all three methods:\n")

## 
## Retained by all three methods:

print(retained_all_three)

## [1] "menthlth_days" "bmi"           "exercise"      "depression"   
## [5] "age_group"     "income"        "smoking"

cat("\nExcluded by all three methods:\n")

## 
## Excluded by all three methods:

print(excluded_all_three)

## [1] "sex"       "education"

#----------------------------------------------------------#
# Step 4: Final Model Selection and Interpretation
#----------------------------------------------------------#

# Choose your final model here
# You can change this to backward_model or forward_model if preferred
final_lm <- stepwise_model

summary(final_lm)

## 
## Call:
## lm(formula = physhlth_days ~ menthlth_days + exercise + income + 
##     age_group + bmi + smoking + depression, data = brfss_analytic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.1018  -3.8895  -1.8323   0.7495  30.9167 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.42720    0.55800   6.142 8.54e-10 ***
## menthlth_days    0.29357    0.01193  24.612  < 2e-16 ***
## exerciseYes     -3.52573    0.22348 -15.777  < 2e-16 ***
## income$100K+    -3.73413    0.40666  -9.182  < 2e-16 ***
## income$25K-$49K -2.38110    0.30272  -7.866 4.15e-15 ***
## income$50K-$99K -3.35910    0.27753 -12.103  < 2e-16 ***
## age_group35-49   0.95193    0.29762   3.198 0.001387 ** 
## age_group50-64   2.24872    0.28575   7.870 4.02e-15 ***
## age_group65+     3.05550    0.27408  11.148  < 2e-16 ***
## bmi              0.09891    0.01390   7.114 1.22e-12 ***
## smokingCurrent   1.02655    0.29420   3.489 0.000487 ***
## smokingFormer    0.55675    0.20567   2.707 0.006803 ** 
## depressionYes    0.85465    0.24349   3.510 0.000451 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.887 on 7987 degrees of freedom
## Multiple R-squared:  0.2026, Adjusted R-squared:  0.2014 
## F-statistic: 169.1 on 12 and 7987 DF,  p-value: < 2.2e-16

final_model_stats <- tibble(
  Model = "Final Linear Model",
  R_squared = summary(final_lm)$r.squared,
  Adjusted_R_squared = summary(final_lm)$adj.r.squared,
  AIC = AIC(final_lm),
  BIC = BIC(final_lm)
)

final_model_stats %>%
  mutate(across(where(is.numeric), round, 3)) %>%
  kbl(caption = "Final Linear Model Fit Statistics") %>%
  kable_styling(full_width = FALSE)

Final Linear Model Fit Statistics
Model	R_squared	Adjusted_R_squared	AIC	BIC
Final Linear Model	0.203	0.201	55760.55	55858.37

# Coefficient table with 95% confidence intervals
final_lm_tidy <- tidy(final_lm, conf.int = TRUE)

final_lm_tidy %>%
  mutate(across(where(is.numeric), round, 3)) %>%
  kbl(caption = "Final Linear Regression Model Coefficients with 95% Confidence Intervals") %>%
  kable_styling(full_width = FALSE)

Final Linear Regression Model Coefficients with 95% Confidence Intervals
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	3.427	0.558	6.142	0.000	2.333	4.521
menthlth_days	0.294	0.012	24.612	0.000	0.270	0.317
exerciseYes	-3.526	0.223	-15.777	0.000	-3.964	-3.088
income$100K+	-3.734	0.407	-9.182	0.000	-4.531	-2.937
income$25K-$49K	-2.381	0.303	-7.866	0.000	-2.975	-1.788
income$50K-$99K	-3.359	0.278	-12.103	0.000	-3.903	-2.815
age_group35-49	0.952	0.298	3.198	0.001	0.369	1.535
age_group50-64	2.249	0.286	7.870	0.000	1.689	2.809
age_group65+	3.055	0.274	11.148	0.000	2.518	3.593
bmi	0.099	0.014	7.114	0.000	0.072	0.126
smokingCurrent	1.027	0.294	3.489	0.000	0.450	1.603
smokingFormer	0.557	0.206	2.707	0.007	0.154	0.960
depressionYes	0.855	0.243	3.510	0.000	0.377	1.332

#----------------------------------------------------------#
# Step 5: LINE Assumption Check
#----------------------------------------------------------#

par(mfrow = c(2, 2))
plot(final_lm)

par(mfrow = c(1, 1))

Step 1 Interpretation: Maximum model

The maximum linear regression model included mental health days, BMI, sex, exercise, depression, age group, income, education, and smoking status as predictors of physically unhealthy days. Model fit was assessed using R-squared, adjusted R-squared, AIC, and BIC.

The model explained approximately 20.3% of the variability in physically unhealthy days (R² = 0.203; Adjusted R² = 0.201), indicating a modest but meaningful fit.

The model had an AIC of 55,764.86 and a BIC of 55,890.63, which serve as baseline metrics for comparison with simpler models.

Several predictors were statistically significant, including mental health days, BMI, exercise, depression, age group, income, and smoking status, while sex and education were not statistically significant predictors of physically unhealthy days.

Although the model explains a modest proportion of the variance, this is expected given that physical health outcomes are influenced by many unmeasured behavioral, environmental, and clinical factors.

Step 2 Interpretation: Best subsets

Best subsets regression was used to compare candidate models across different model sizes. Adjusted R-squared was used to identify the model with the strongest explanatory power, while BIC was used to identify the most parsimonious model.

The model with 14 predictors maximized adjusted R-squared (0.2016), while the model with 10 predictors minimized BIC (-1696.32).

These criteria did not agree, which is expected. Adjusted R-squared tends to favor more complex models, whereas BIC applies a stronger penalty for model complexity and therefore favors more parsimonious models.

Based on BIC, a model with approximately 10 predictors provides the best balance between model fit and simplicity, and is therefore preferred.

Step 3 Interpretation: Stepwise comparison

Backward elimination, forward selection, and stepwise selection were performed using the step() function. The final models from each approach were compared to identify predictors that were consistently retained or excluded.

The variables mental health days, BMI, exercise, depression, age group, income, and smoking status were retained by all three methods, indicating they are strong and consistent predictors of physically unhealthy days.

In contrast, sex and education were excluded by all three methods, suggesting that they do not contribute meaningfully to the model after accounting for the other predictors.

Step 4 Interpretation: Final model

The final model was selected based on parsimony, consistency across selection methods, and overall model fit. The stepwise model was chosen because it retained only predictors that were consistently identified as important while maintaining strong model performance.

The final model had an R² of 0.203 and an adjusted R² of 0.201, with an AIC of 55,760.55 and a BIC of 55,858.37, indicating similar explanatory power to the maximum model despite including fewer predictors.

This suggests that the excluded variables did not meaningfully improve model fit and that the final model achieves a more efficient balance between complexity and explanatory power.

Step 5 Interpretation: LINE Assumption Check

Diagnostic plots were examined to assess the LINE assumptions, including linearity, independence, normality, and equal variance of residuals.

The residuals vs. fitted plot showed some evidence of non-constant variance, with a slight funnel shape suggesting mild heteroscedasticity.

The Q-Q plot indicated that residuals were approximately normally distributed, although there were noticeable deviations in the upper tail, suggesting departures from normality.

The scale-location plot further supported the presence of heteroscedasticity, as the spread of residuals increased slightly with fitted values.

The residuals vs. leverage plot did not reveal any highly influential observations with extreme Cook’s distance, although a few points exhibited moderately higher leverage.

Overall, the LINE assumptions appear to be partially satisfied, with evidence of heteroscedasticity and mild departures from normality. While these violations are not severe, the results should be interpreted with some caution.

Part 2: Logistic Regression (40 points)

#----------------------------------------------------------#
# PART 2: Logistic Regression
#----------------------------------------------------------#

# Assumes brfss_analytic already exists from Part 0
# and final linear model predictors were:
# menthlth_days + exercise + income + age_group + bmi + smoking + depression

#----------------------------------------------------------#
# Step 1: Create Binary Outcome
#----------------------------------------------------------#

brfss_analytic <- brfss_analytic %>%
  mutate(
    frequent_distress = case_when(
      physhlth_days >= 14 ~ "Yes",
      physhlth_days < 14 ~ "No",
      TRUE ~ NA_character_
    ),
    frequent_distress = factor(frequent_distress, levels = c("No", "Yes"))
  )

# Prevalence table
distress_prev <- brfss_analytic %>%
  count(frequent_distress) %>%
  mutate(percent = 100 * n / sum(n))

distress_prev %>%
  mutate(percent = round(percent, 2)) %>%
  kbl(caption = "Prevalence of Frequent Physical Distress") %>%
  kable_styling(full_width = FALSE)

Prevalence of Frequent Physical Distress
frequent_distress	n	percent
No	6895	86.19
Yes	1105	13.81

#----------------------------------------------------------#
# Step 2: Simple (Unadjusted) Logistic Regression
#----------------------------------------------------------#
# Choose one strong predictor from the final linear model.
# Here: exercise

mod_simple <- glm(
  frequent_distress ~ exercise,
  data = brfss_analytic,
  family = binomial
)

summary(mod_simple)

## 
## Call:
## glm(formula = frequent_distress ~ exercise, family = binomial, 
##     data = brfss_analytic)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.87145    0.05223  -16.69   <2e-16 ***
## exerciseYes -1.39670    0.06793  -20.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6424.7  on 7999  degrees of freedom
## Residual deviance: 6020.9  on 7998  degrees of freedom
## AIC: 6024.9
## 
## Number of Fisher Scoring iterations: 5

# Odds ratio and 95% CI
simple_or <- exp(coef(mod_simple))
simple_ci <- exp(confint(mod_simple))

simple_results <- tibble(
  term = names(simple_or),
  odds_ratio = as.numeric(simple_or),
  conf.low = simple_ci[, 1],
  conf.high = simple_ci[, 2]
)

simple_results %>%
  mutate(across(where(is.numeric), round, 3)) %>%
  kbl(caption = "Simple Logistic Regression: Odds Ratios and 95% CI") %>%
  kable_styling(full_width = FALSE)

Simple Logistic Regression: Odds Ratios and 95% CI
term	odds_ratio	conf.low	conf.high
(Intercept)	0.418	0.377	0.463
exerciseYes	0.247	0.217	0.283

#----------------------------------------------------------#
# Step 3: Multiple Logistic Regression
#----------------------------------------------------------#

mod_logistic <- glm(
  frequent_distress ~ menthlth_days + exercise + income +
    age_group + bmi + smoking + depression,
  data = brfss_analytic,
  family = binomial
)

summary(mod_logistic)

## 
## Call:
## glm(formula = frequent_distress ~ menthlth_days + exercise + 
##     income + age_group + bmi + smoking + depression, family = binomial, 
##     data = brfss_analytic)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -2.879241   0.218703 -13.165  < 2e-16 ***
## menthlth_days    0.067729   0.003883  17.442  < 2e-16 ***
## exerciseYes     -0.960316   0.076063 -12.625  < 2e-16 ***
## income$100K+    -1.397224   0.208803  -6.692 2.21e-11 ***
## income$25K-$49K -0.568895   0.099366  -5.725 1.03e-08 ***
## income$50K-$99K -0.995229   0.094489 -10.533  < 2e-16 ***
## age_group35-49   0.413981   0.147478   2.807  0.00500 ** 
## age_group50-64   0.924551   0.137207   6.738 1.60e-11 ***
## age_group65+     1.281364   0.133111   9.626  < 2e-16 ***
## bmi              0.031461   0.005047   6.234 4.56e-10 ***
## smokingCurrent   0.360242   0.105900   3.402  0.00067 ***
## smokingFormer    0.218424   0.081922   2.666  0.00767 ** 
## depressionYes    0.258291   0.089178   2.896  0.00378 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6424.7  on 7999  degrees of freedom
## Residual deviance: 5186.1  on 7987  degrees of freedom
## AIC: 5212.1
## 
## Number of Fisher Scoring iterations: 5

# Adjusted odds ratios with 95% CI
logistic_results <- tidy(mod_logistic, conf.int = TRUE, exponentiate = TRUE)

logistic_results %>%
  mutate(across(where(is.numeric), round, 3)) %>%
  kbl(caption = "Multiple Logistic Regression: Adjusted Odds Ratios with 95% CI") %>%
  kable_styling(full_width = FALSE)

Multiple Logistic Regression: Adjusted Odds Ratios with 95% CI
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	0.056	0.219	-13.165	0.000	0.036	0.086
menthlth_days	1.070	0.004	17.442	0.000	1.062	1.078
exerciseYes	0.383	0.076	-12.625	0.000	0.330	0.444
income$100K+	0.247	0.209	-6.692	0.000	0.161	0.367
income$25K-$49K	0.566	0.099	-5.725	0.000	0.466	0.688
income$50K-$99K	0.370	0.094	-10.533	0.000	0.307	0.445
age_group35-49	1.513	0.147	2.807	0.005	1.136	2.026
age_group50-64	2.521	0.137	6.738	0.000	1.934	3.313
age_group65+	3.602	0.133	9.626	0.000	2.787	4.699
bmi	1.032	0.005	6.234	0.000	1.022	1.042
smokingCurrent	1.434	0.106	3.402	0.001	1.163	1.762
smokingFormer	1.244	0.082	2.666	0.008	1.059	1.460
depressionYes	1.295	0.089	2.896	0.004	1.086	1.540

#----------------------------------------------------------#
# Step 4: Likelihood Ratio Test
#----------------------------------------------------------#
# Test income as a group of related predictors

mod_reduced <- glm(
  frequent_distress ~ menthlth_days + exercise +
    age_group + bmi + smoking + depression,
  data = brfss_analytic,
  family = binomial
)

lrt_results <- anova(mod_reduced, mod_logistic, test = "Chisq")
lrt_results

## Analysis of Deviance Table
## 
## Model 1: frequent_distress ~ menthlth_days + exercise + age_group + bmi + 
##     smoking + depression
## Model 2: frequent_distress ~ menthlth_days + exercise + income + age_group + 
##     bmi + smoking + depression
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1      7990     5308.7                          
## 2      7987     5186.1  3    122.6 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Optional formatted display
lrt_table <- broom::tidy(lrt_results)

lrt_table %>%
  mutate(across(where(is.numeric), round, 3)) %>%
  kbl(caption = "Likelihood Ratio Test Comparing Reduced and Full Logistic Models") %>%
  kable_styling(full_width = FALSE)

Likelihood Ratio Test Comparing Reduced and Full Logistic Models
term	df.residual	residual.deviance	df	deviance	p.value
frequent_distress ~ menthlth_days + exercise + age_group + bmi + smoking + depression	7990	5308.706	NA	NA	NA
frequent_distress ~ menthlth_days + exercise + income + age_group + bmi + smoking + depression	7987	5186.106	3	122.6	0

#----------------------------------------------------------#
# Step 5: ROC Curve and AUC
#----------------------------------------------------------#

roc_obj <- roc(brfss_analytic$frequent_distress, fitted(mod_logistic))

plot(
  roc_obj,
  main = "ROC Curve: Frequent Physical Distress Model",
  print.auc = TRUE
)

auc_value <- auc(roc_obj)
auc_ci <- ci.auc(roc_obj)

cat("AUC:", round(as.numeric(auc_value), 3), "\n")

## AUC: 0.793

cat("95% CI for AUC:", round(auc_ci[1], 3), "-", round(auc_ci[3], 3), "\n")

## 95% CI for AUC: 0.778 - 0.808

#----------------------------------------------------------#
# Step 6: Hosmer-Lemeshow Goodness-of-Fit Test
#----------------------------------------------------------#

hl_test <- hoslem.test(mod_logistic$y, fitted(mod_logistic), g = 10)
hl_test

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  mod_logistic$y, fitted(mod_logistic)
## X-squared = 13.516, df = 8, p-value = 0.09529

# Optional formatted display
hl_table <- tibble(
  statistic = as.numeric(hl_test$statistic),
  df = as.numeric(hl_test$parameter),
  p_value = as.numeric(hl_test$p.value)
)

hl_table %>%
  mutate(across(where(is.numeric), round, 3)) %>%
  kbl(caption = "Hosmer-Lemeshow Goodness-of-Fit Test") %>%
  kable_styling(full_width = FALSE)

Hosmer-Lemeshow Goodness-of-Fit Test
statistic	df	p_value
13.516	8	0.095

Step 1 Interpretation: Binary Outcome

A binary outcome variable, frequent_distress, was created and coded as “Yes” for individuals reporting 14 or more physically unhealthy days in the past 30 days and “No” otherwise.

In the analytic sample, 13.81% (n = 1,105) of respondents reported frequent physical distress, while 86.19% (n = 6,895) did not.

This indicates that the outcome is moderately imbalanced, with a smaller proportion of individuals experiencing frequent distress.

Step 2 Interpretation: Simple Logistic Regression

I selected exercise as the predictor for the simple logistic regression because it showed a strong association in the linear regression model and is an important modifiable behavioral factor. The odds ratio for exercise was 0.247 (95% CI: 0.217, 0.283).

Compared to individuals who did not exercise, those who reported exercising had 0.25 times the odds of frequent physical distress. This corresponds to approximately a 75% reduction in the odds of frequent distress. The association is statistically significant (p < 0.001), as the confidence interval does not include 1.0.

Step 3 Interpretation: Multiple Logistic Regression

A multiple logistic regression model was fit using the same predictors as the final linear model, as required.

Key Adjusted Odds Ratios: Mental health (continuous)

Holding all other variables constant, each additional day of poor mental health was associated with 1.07 times higher odds of frequent physical distress (95% CI: 1.06, 1.08).

Exercise (binary)

Holding all other variables constant, individuals who exercised had 0.38 times the odds of frequent physical distress compared to those who did not exercise (95% CI: 0.33, 0.44), indicating a substantial protective effect.

Income (categorical)

Compared to individuals with income less than $25,000:

Those earning $100K+ had 0.25 times the odds (95% CI: 0.16, 0.37) Those earning $50K–$99K had 0.37 times the odds Those earning $25K–$49K had 0.57 times the odds

This shows a strong gradient where higher income is associated with lower odds of distress.

Age group

Compared to adults aged 18–34:

Age 35–49: 1.51 times higher odds Age 50–64: 2.52 times higher odds Age 65+: 3.60 times higher odds

This suggests increasing physical distress with age.

Step 4 Interpretation: Likelihood Ratio Test

Hypotheses: H₀: Income does not improve model fit

Hₐ: Income significantly improves model fit

Results:

The likelihood ratio test comparing the reduced and full models yielded a chi-square statistic of 122.6 with 3 degrees of freedom and a p-value < 0.001.

Conclusion:

At α = 0.05, we reject the null hypothesis, concluding that income significantly improves the model and should be retained.

Step 5 Interpretation: ROC Curve and AUC

The ROC curve was used to evaluate the model’s ability to distinguish between individuals with and without frequent physical distress.

The AUC was 0.793 (95% CI: 0.778–0.808).

This indicates acceptable to good discrimination, meaning the model can reasonably distinguish between individuals who do and do not experience frequent distress.

Step 6 Interpretation: Hosmer-Lemeshow Test

Hypotheses: H₀: The model fits the data well Hₐ: The model does not fit the data well Results:

The Hosmer-Lemeshow test produced a chi-square statistic of 13.52 with 8 degrees of freedom and a p-value of 0.095.

Interpretation:

Because the p-value is greater than 0.05, we fail to reject the null hypothesis, indicating no evidence of poor model fit.

Complement with AUC:

The Hosmer-Lemeshow test assesses model calibration, while the ROC/AUC assesses discrimination. Together, these results indicate that the model both fits the data adequately and has good predictive performance.

Part 3: Reflection (15 points)

The results of this analysis suggest that physical health burden among U.S. adults is influenced by a combination of behavioral, socioeconomic, and health-related factors. Across both the linear and logistic regression models, mental health, exercise, income, age, BMI, smoking, and depression emerged as consistent predictors. Mental health was one of the strongest predictors in both models, with more mentally unhealthy days associated with both an increase in physically unhealthy days and higher odds of frequent physical distress. Exercise showed a strong protective effect, with individuals who exercised reporting fewer unhealthy days and significantly lower odds of distress. Income also demonstrated a clear gradient, with higher income levels associated with better physical health outcomes. Age was another important factor, with older adults experiencing greater health burden.

Some predictors, such as sex and education, were not significant in the linear model and were excluded from the final models, suggesting their effects may be mediated through other variables. A key limitation of this analysis is the use of cross-sectional survey data, which prevents causal inference. Additionally, self-reported measures may introduce reporting bias. Important potential confounders not included in the model include access to healthcare, chronic disease status, and geographic factors, all of which could influence both exposures and outcomes.

The linear and logistic regression models provide complementary insights into the same research question. The linear model quantifies how predictors are associated with the number of physically unhealthy days, offering detailed information about changes in magnitude. In contrast, the logistic model focuses on the probability of experiencing frequent physical distress, which is often more relevant for public health decision-making and risk classification. The evaluation metrics also differ: linear regression uses R-squared to assess explained variance, while logistic regression relies on AUC to assess discrimination. In this analysis, the logistic model demonstrated acceptable discrimination (AUC ≈ 0.79), indicating reasonable predictive performance.

Overall, linear regression is useful when the outcome is continuous and detailed variation is of interest, while logistic regression is preferred when the goal is to identify individuals at higher risk of a clinically meaningful threshold outcome. These findings suggest that interventions promoting physical activity and addressing socioeconomic disparities may reduce the burden of physical distress at the population level.

End of Homework 4