Submission: Knit this file to HTML, publish to RPubs with the title epi553_hw02_lastname_firstname, and paste the RPubs URL in the Brightspace submission comments. Also upload this .Rmd file to Brightspace.

AI Policy: AI tools are NOT permitted on this assignment. See the assignment description for full details.


Part 0: Data Preparation (10 points)

# Load required packages
library(tidyverse)
library(haven)
library(broom)
library(kableExtra)
library(car)
library(janitor)
library(gtsummary)
library(leaps)
library(pROC)
library(ResourceSelection)

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))
# Import the dataset

brfss <- read_xpt("C:/Users/joshm/Documents/UAlbany/Spring 2026/EPI 553/Assignment 3/LLCP2023.xpt") %>%
  clean_names()
#select variables
brfss_work <- brfss |>
  select(physhlth, menthlth, bmi5, sexvar, exerany2, ageg5yr, incomg1, educa, smoker3)
brfss_work <- brfss_work %>%
  mutate(
      physhlth_days = case_when(physhlth == 88 ~ 0,
      physhlth == 77 & physhlth == 99 ~ NA,
      physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
      TRUE                        ~ NA_real_),
    
      menthlth_days = case_when(menthlth == 88 ~ 0,
      menthlth == 77 & menthlth == 99 ~ NA,
      menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
      TRUE                        ~ NA_real_),
  
      bmi = case_when(bmi5 == 9999 ~ NA, bmi5 < 9999 ~ bmi5/100),
      
      sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
      
      exercise = case_when(exerany2 == 7 & exerany2 == 9 ~ NA),
      exercise = factor(exerany2, levels = c(1, 2), labels = c("Yes", "No")),
      
      age_group = case_when(ageg5yr == 14 ~ NA),
      age_group = case_when(ageg5yr <= 3 ~ 0,
                         ageg5yr >= 4 & ageg5yr <= 6 ~ 1,
                         ageg5yr >= 7 & ageg5yr <= 9 ~ 2,
                         ageg5yr >= 10 & ageg5yr <= 13 ~ 3),
      
      age_group = factor(age_group, levels = c(0,1,2,3), labels = c("18-34", "35-49", "50-64", "65+")),
      
      income = case_when(incomg1 == 9 ~ NA),
      income = case_when(incomg1 <= 2 ~ 0,
                         incomg1 == 3 & incomg1 == 4 ~ 1,
                         incomg1 == 5 & incomg1 == 6 ~ 2,
                         incomg1 == 7 ~ 3),
      
      income = factor(income, levels = c(0,1,2,3), labels = c("<$25K", "$25K-$49K", "$50K-$99K", "$100K+")),
  
      education = case_when(educa == 9 ~ NA),
      education = case_when(educa <= 3 ~ 0,
                         educa == 4 ~ 1,
                         educa == 5 ~ 2,
                         educa == 6 ~ 3),

      education = factor(education, levels = c(0,1,2,3), labels = c("Less than HS", "HS/GED", "Some college", "College graduate")),
      
      smoking = case_when(smoker3 == 9 ~ NA),
      smoking = case_when(smoker3 <= 2 ~ 0,
                         smoker3 == 3 ~ 1,
                         smoker3 == 4 ~ 2),
                         
      smoking = factor(smoking, levels = c(0,1,2), labels = c("Current", "Former", "Never")),
                        )
brfss_work$sex <- relevel(brfss_work$sex, ref = "Male")
brfss_work$exercise <- relevel(brfss_work$exercise, ref = "No")
brfss_work$age_group <- relevel(brfss_work$age_group, ref = "18-34")
brfss_work$income <- relevel(brfss_work$income, ref = "<$25K")
brfss_work$education <- relevel(brfss_work$education, ref = "Less than HS")
brfss_work$smoking <- relevel(brfss_work$smoking, ref = "Never")
cat("Total N:", nrow(brfss_work), "\n")
## Total N: 433323
cat("Physhlth_days Missing data:", sum(is.na(brfss_work$physhlth_days)), "\n")
## Physhlth_days Missing data: 10785
a <- nrow(brfss_work)
b <- sum(is.na(brfss_work$physhlth_days))
c <- (b / a)*100

cat("Physhlth_days missing data %:", c , "\n")
## Physhlth_days missing data %: 2.488906
cat("BMI Missing data:", sum(is.na(brfss_work$bmi)), "\n")
## BMI Missing data: 40535
a <- nrow(brfss_work)
b <- sum(is.na(brfss_work$bmi))
c <- (b / a)*100

cat("BMI missing data %:", c , "\n")
## BMI missing data %: 9.354454
cat("Education Missing data:", sum(is.na(brfss_work$education)), "\n")
## Education Missing data: 2325
a <- nrow(brfss_work)
b <- sum(is.na(brfss_work$education))
c <- (b / a)*100

cat("Education missing data %:", c , "\n")
## Education missing data %: 0.5365513
set.seed(1220) 
brfss_analytic <- brfss_work |> 
drop_na() |> 
slice_sample(n = 8000) 

cat("Final analytic N:", nrow(brfss_analytic), "\n")
## Final analytic N: 8000
brfss_analytic %>%
  select(physhlth_days, menthlth_days, bmi, sex, exercise, age_group, income, education, smoking) %>% 
  tbl_summary(
    label = list(
      menthlth_days ~ "Mentally unhealthy days (past 30)",
      physhlth_days ~ "Physically unhealthy days (past 30)",
      bmi     ~ "Body mass index",
      sex           ~ "Sex",
      exercise      ~ "Any exercise in past 30 days",
      age_group      ~ "Age group",
      income           ~ "Annual household income",
      education       ~ "Highest level of education completed",
      smoking         ~ "Smoking status"
    ),
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) %>%
  add_n() %>%
  bold_labels() %>%
  modify_caption("**Table 1. Descriptive Statistics — BRFSS 2023 Analytic Sample (n = 8000)**")
Table 1. Descriptive Statistics — BRFSS 2023 Analytic Sample (n = 8000)
Characteristic N N = 8,0001
Physically unhealthy days (past 30) 8,000 6.2 (10.3)
Mentally unhealthy days (past 30) 8,000 5.6 (9.5)
Body mass index 8,000 28.6 (7.1)
Sex 8,000
    Male
3,819 (48%)
    Female
4,181 (52%)
Any exercise in past 30 days 8,000 5,717 (71%)
Age group 8,000
    18-34
1,159 (14%)
    35-49
1,809 (23%)
    50-64
2,426 (30%)
    65+
2,606 (33%)
Annual household income 8,000
    <$25K
5,005 (63%)
    $25K-$49K
0 (0%)
    $50K-$99K
0 (0%)
    $100K+
2,995 (37%)
Highest level of education completed 8,000
    Less than HS
855 (11%)
    HS/GED
2,135 (27%)
    Some college
1,806 (23%)
    College graduate
3,204 (40%)
Smoking status 8,000
    Never
4,723 (59%)
    Current
1,252 (16%)
    Former
2,025 (25%)
1 Mean (SD); n (%)
model <- lm(physhlth_days ~ menthlth_days + bmi + sex + exercise + age_group + income + education + smoking, data = brfss_analytic)

summary(model)
## 
## Call:
## lm(formula = physhlth_days ~ menthlth_days + bmi + sex + exercise + 
##     age_group + income + education + smoking, data = brfss_analytic)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.250  -5.101  -1.803   2.206  32.710 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                1.97255    0.60487   3.261  0.00111 ** 
## menthlth_days              0.38858    0.01109  35.050  < 2e-16 ***
## bmi                        0.08582    0.01414   6.069 1.35e-09 ***
## sexFemale                 -0.31915    0.20074  -1.590  0.11190    
## exerciseYes               -3.90961    0.23639 -16.539  < 2e-16 ***
## age_group35-49             2.16004    0.33855   6.380 1.87e-10 ***
## age_group50-64             4.16956    0.32001  13.029  < 2e-16 ***
## age_group65+               3.82716    0.31976  11.969  < 2e-16 ***
## income$100K+              -3.12969    0.28390 -11.024  < 2e-16 ***
## educationHS/GED            0.12285    0.35417   0.347  0.72869    
## educationSome college      0.13235    0.36746   0.360  0.71873    
## educationCollege graduate  0.52149    0.38795   1.344  0.17892    
## smokingCurrent             1.36575    0.29937   4.562 5.14e-06 ***
## smokingFormer              1.07425    0.23991   4.478 7.65e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.714 on 7986 degrees of freedom
## Multiple R-squared:  0.2885, Adjusted R-squared:  0.2874 
## F-statistic: 249.1 on 13 and 7986 DF,  p-value: < 2.2e-16
AIC(model)
## [1] 57358.53
BIC(model)
## [1] 57463.34

R2 = 0.2885 and adjusted R2 = 0.2874. This indicates that about 29% of the variance in physically unhealthy days is explained by the predictors in the model. AIC and BIC measure the goodness of fit of a model and penalize unnecessary complexity, but are relative comparison tools and so they are best used to compare models. Thus, the AIC and BIC of just this one model are not particularly informative.

best_subsets <- regsubsets(
  physhlth_days ~ menthlth_days + bmi + sex + exercise + age_group + income + education + smoking,
  data = brfss_analytic,
  nvmax = 8,      # maximum number of variables to consider
  method = "exhaustive"
)
## Reordering variables and trying again:
best_summary <- summary(best_subsets)

subset_metrics <- tibble(
  p = 1:length(best_summary$adjr2),
  `Adj. R²` = best_summary$adjr2,
  BIC = best_summary$bic,
  Cp = best_summary$cp
)
ggplot(subset_metrics, aes(x = p, y = `Adj. R²`)) +
  geom_line(linewidth = 1, color = "steelblue") +
  geom_point(size = 3, color = "steelblue") +
  geom_vline(xintercept = which.max(best_summary$adjr2),
             linetype = "dashed", color = "tomato") +
  labs(title = "Adjusted R² by Model Size", x = "Number of Variables", y = "Adjusted R²") +
  theme_minimal(base_size = 12)

ggplot(subset_metrics, aes(x = p, y = BIC)) +
  geom_line(linewidth = 1, color = "steelblue") +
  geom_point(size = 3, color = "steelblue") +
  geom_vline(xintercept = which.min(best_summary$bic),
             linetype = "dashed", color = "tomato") +
  labs(title = "BIC by Model Size", x = "Number of Variables", y = "BIC") +
  theme_minimal(base_size = 12)

The model with the maximum R2 and minimum BIC is the one with 8 predictors, or the maximum model. The two criteria R2 and BIC agree with one another.

mod_backward <- step(model, direction = "backward", trace = 1)
## Start:  AIC=34653.52
## physhlth_days ~ menthlth_days + bmi + sex + exercise + age_group + 
##     income + education + smoking
## 
##                 Df Sum of Sq    RSS   AIC
## - education      3       203 606659 34650
## <none>                       606455 34654
## - sex            1       192 606647 34654
## - smoking        2      2414 608869 34681
## - bmi            1      2797 609252 34688
## - income         1      9229 615684 34772
## - age_group      3     15901 622356 34855
## - exercise       1     20772 627227 34921
## - menthlth_days  1     93295 699750 35796
## 
## Step:  AIC=34650.2
## physhlth_days ~ menthlth_days + bmi + sex + exercise + age_group + 
##     income + smoking
## 
##                 Df Sum of Sq    RSS   AIC
## <none>                       606659 34650
## - sex            1       183 606842 34651
## - smoking        2      2290 608949 34676
## - bmi            1      2731 609390 34684
## - income         1     10972 617631 34792
## - age_group      3     16060 622719 34853
## - exercise       1     20625 627283 34916
## - menthlth_days  1     93381 700040 35794
mod_null <- lm(physhlth_days ~ 1, data = brfss_analytic)

mod_forward <- step(mod_null,
                    scope = list(lower = mod_null, upper = model),
                    direction = "forward", trace = 1)
## Start:  AIC=37350.91
## physhlth_days ~ 1
## 
##                 Df Sum of Sq    RSS   AIC
## + menthlth_days  1    156378 696021 35732
## + exercise       1     84880 767519 36514
## + income         1     81858 770541 36545
## + education      3     39057 813342 36982
## + smoking        2     38806 813593 36982
## + age_group      3     19275 833124 37174
## + bmi            1     18741 833658 37175
## + sex            1      3841 848558 37317
## <none>                       852399 37351
## 
## Step:  AIC=35731.51
## physhlth_days ~ menthlth_days
## 
##             Df Sum of Sq    RSS   AIC
## + exercise   1     48192 647829 35159
## + income     1     36767 659254 35299
## + age_group  3     30195 665826 35383
## + education  3     15515 680506 35557
## + smoking    2     13425 682596 35580
## + bmi        1      9942 686079 35618
## + sex        1       311 695710 35730
## <none>                   696021 35732
## 
## Step:  AIC=35159.48
## physhlth_days ~ menthlth_days + exercise
## 
##             Df Sum of Sq    RSS   AIC
## + age_group  3   21204.8 626624 34899
## + income     1   16467.0 631362 34956
## + smoking    2    8306.8 639522 35060
## + bmi        1    4643.5 643185 35104
## + education  3    4652.0 643177 35108
## <none>                   647829 35159
## + sex        1       2.8 647826 35161
## 
## Step:  AIC=34899.24
## physhlth_days ~ menthlth_days + exercise + age_group
## 
##             Df Sum of Sq    RSS   AIC
## + income     1   14767.0 611857 34710
## + smoking    2    4731.1 621893 34843
## + bmi        1    3880.4 622744 34852
## + education  3    4051.9 622572 34853
## <none>                   626624 34899
## + sex        1      10.4 626614 34901
## 
## Step:  AIC=34710.46
## physhlth_days ~ menthlth_days + exercise + age_group + income
## 
##             Df Sum of Sq    RSS   AIC
## + bmi        1   2547.42 609310 34679
## + smoking    2   2255.34 609602 34685
## + sex        1    390.43 611467 34707
## <none>                   611857 34710
## + education  3     38.42 611819 34716
## 
## Step:  AIC=34679.08
## physhlth_days ~ menthlth_days + exercise + age_group + income + 
##     bmi
## 
##             Df Sum of Sq    RSS   AIC
## + smoking    2   2467.74 606842 34651
## + sex        1    360.90 608949 34676
## <none>                   609310 34679
## + education  3     66.32 609243 34684
## 
## Step:  AIC=34650.61
## physhlth_days ~ menthlth_days + exercise + age_group + income + 
##     bmi + smoking
## 
##             Df Sum of Sq    RSS   AIC
## + sex        1    183.33 606659 34650
## <none>                   606842 34651
## + education  3    194.56 606647 34654
## 
## Step:  AIC=34650.2
## physhlth_days ~ menthlth_days + exercise + age_group + income + 
##     bmi + smoking + sex
## 
##             Df Sum of Sq    RSS   AIC
## <none>                   606659 34650
## + education  3    203.19 606455 34654
stepwise <- step(mod_null,
                    scope = list(lower = mod_null, upper = model),
                    direction = "both", trace = 1)
## Start:  AIC=37350.91
## physhlth_days ~ 1
## 
##                 Df Sum of Sq    RSS   AIC
## + menthlth_days  1    156378 696021 35732
## + exercise       1     84880 767519 36514
## + income         1     81858 770541 36545
## + education      3     39057 813342 36982
## + smoking        2     38806 813593 36982
## + age_group      3     19275 833124 37174
## + bmi            1     18741 833658 37175
## + sex            1      3841 848558 37317
## <none>                       852399 37351
## 
## Step:  AIC=35731.51
## physhlth_days ~ menthlth_days
## 
##                 Df Sum of Sq    RSS   AIC
## + exercise       1     48192 647829 35159
## + income         1     36767 659254 35299
## + age_group      3     30195 665826 35383
## + education      3     15515 680506 35557
## + smoking        2     13425 682596 35580
## + bmi            1      9942 686079 35618
## + sex            1       311 695710 35730
## <none>                       696021 35732
## - menthlth_days  1    156378 852399 37351
## 
## Step:  AIC=35159.48
## physhlth_days ~ menthlth_days + exercise
## 
##                 Df Sum of Sq    RSS   AIC
## + age_group      3     21205 626624 34899
## + income         1     16467 631362 34956
## + smoking        2      8307 639522 35060
## + bmi            1      4644 643185 35104
## + education      3      4652 643177 35108
## <none>                       647829 35159
## + sex            1         3 647826 35161
## - exercise       1     48192 696021 35732
## - menthlth_days  1    119690 767519 36514
## 
## Step:  AIC=34899.24
## physhlth_days ~ menthlth_days + exercise + age_group
## 
##                 Df Sum of Sq    RSS   AIC
## + income         1     14767 611857 34710
## + smoking        2      4731 621893 34843
## + bmi            1      3880 622744 34852
## + education      3      4052 622572 34853
## <none>                       626624 34899
## + sex            1        10 626614 34901
## - age_group      3     21205 647829 35159
## - exercise       1     39202 665826 35383
## - menthlth_days  1    128509 755133 36390
## 
## Step:  AIC=34710.46
## physhlth_days ~ menthlth_days + exercise + age_group + income
## 
##                 Df Sum of Sq    RSS   AIC
## + bmi            1      2547 609310 34679
## + smoking        2      2255 609602 34685
## + sex            1       390 611467 34707
## <none>                       611857 34710
## + education      3        38 611819 34716
## - income         1     14767 626624 34899
## - age_group      3     19505 631362 34956
## - exercise       1     23098 634955 35005
## - menthlth_days  1    102057 713914 35943
## 
## Step:  AIC=34679.08
## physhlth_days ~ menthlth_days + exercise + age_group + income + 
##     bmi
## 
##                 Df Sum of Sq    RSS   AIC
## + smoking        2      2468 606842 34651
## + sex            1       361 608949 34676
## <none>                       609310 34679
## + education      3        66 609243 34684
## - bmi            1      2547 611857 34710
## - income         1     13434 622744 34852
## - age_group      3     18544 627854 34913
## - exercise       1     21131 630440 34950
## - menthlth_days  1    100663 709973 35900
## 
## Step:  AIC=34650.61
## physhlth_days ~ menthlth_days + exercise + age_group + income + 
##     bmi + smoking
## 
##                 Df Sum of Sq    RSS   AIC
## + sex            1       183 606659 34650
## <none>                       606842 34651
## + education      3       195 606647 34654
## - smoking        2      2468 609310 34679
## - bmi            1      2760 609602 34685
## - income         1     10794 617636 34790
## - age_group      3     15941 622783 34852
## - exercise       1     20535 627377 34915
## - menthlth_days  1     93360 700202 35793
## 
## Step:  AIC=34650.2
## physhlth_days ~ menthlth_days + exercise + age_group + income + 
##     bmi + smoking + sex
## 
##                 Df Sum of Sq    RSS   AIC
## <none>                       606659 34650
## - sex            1       183 606842 34651
## + education      3       203 606455 34654
## - smoking        2      2290 608949 34676
## - bmi            1      2731 609390 34684
## - income         1     10972 617631 34792
## - age_group      3     16060 622719 34853
## - exercise       1     20625 627283 34916
## - menthlth_days  1     93381 700040 35794
tidy(mod_backward, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Backward Elimination Result",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Backward Elimination Result
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 2.1535 0.5454 3.9484 0.0001 1.0843 3.2226
menthlth_days 0.3884 0.0111 35.0674 0.0000 0.3667 0.4101
bmi 0.0847 0.0141 5.9975 0.0000 0.0570 0.1124
sexFemale -0.3111 0.2002 -1.5538 0.1203 -0.7036 0.0814
exerciseYes -3.8766 0.2352 -16.4804 0.0000 -4.3377 -3.4155
age_group35-49 2.1877 0.3377 6.4781 0.0000 1.5257 2.8496
age_group50-64 4.1903 0.3194 13.1183 0.0000 3.5641 4.8164
age_group65+ 3.8505 0.3193 12.0580 0.0000 3.2245 4.4764
income$100K+ -2.8855 0.2400 -12.0204 0.0000 -3.3560 -2.4149
smokingCurrent 1.3124 0.2975 4.4121 0.0000 0.7293 1.8955
smokingFormer 1.0458 0.2390 4.3755 0.0000 0.5773 1.5143
tidy(mod_forward, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Forward Selection Result",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Forward Selection Result
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 2.1535 0.5454 3.9484 0.0001 1.0843 3.2226
menthlth_days 0.3884 0.0111 35.0674 0.0000 0.3667 0.4101
exerciseYes -3.8766 0.2352 -16.4804 0.0000 -4.3377 -3.4155
age_group35-49 2.1877 0.3377 6.4781 0.0000 1.5257 2.8496
age_group50-64 4.1903 0.3194 13.1183 0.0000 3.5641 4.8164
age_group65+ 3.8505 0.3193 12.0580 0.0000 3.2245 4.4764
income$100K+ -2.8855 0.2400 -12.0204 0.0000 -3.3560 -2.4149
bmi 0.0847 0.0141 5.9975 0.0000 0.0570 0.1124
smokingCurrent 1.3124 0.2975 4.4121 0.0000 0.7293 1.8955
smokingFormer 1.0458 0.2390 4.3755 0.0000 0.5773 1.5143
sexFemale -0.3111 0.2002 -1.5538 0.1203 -0.7036 0.0814
tidy(stepwise, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Stepwise Selection Result",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Stepwise Selection Result
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 2.1535 0.5454 3.9484 0.0001 1.0843 3.2226
menthlth_days 0.3884 0.0111 35.0674 0.0000 0.3667 0.4101
exerciseYes -3.8766 0.2352 -16.4804 0.0000 -4.3377 -3.4155
age_group35-49 2.1877 0.3377 6.4781 0.0000 1.5257 2.8496
age_group50-64 4.1903 0.3194 13.1183 0.0000 3.5641 4.8164
age_group65+ 3.8505 0.3193 12.0580 0.0000 3.2245 4.4764
income$100K+ -2.8855 0.2400 -12.0204 0.0000 -3.3560 -2.4149
bmi 0.0847 0.0141 5.9975 0.0000 0.0570 0.1124
smokingCurrent 1.3124 0.2975 4.4121 0.0000 0.7293 1.8955
smokingFormer 1.0458 0.2390 4.3755 0.0000 0.5773 1.5143
sexFemale -0.3111 0.2002 -1.5538 0.1203 -0.7036 0.0814

All three methods agree on the same final model. Seven variables were retained in the automated selection methods, namely mentally unhealthy days, BMI, sex, exercise, age group, income, and smoking status. Education was removed by all three selection methods.

tidy(model, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Final Model",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Final Model
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 1.9725 0.6049 3.2611 0.0011 0.7868 3.1583
menthlth_days 0.3886 0.0111 35.0505 0.0000 0.3668 0.4103
bmi 0.0858 0.0141 6.0688 0.0000 0.0581 0.1135
sexFemale -0.3192 0.2007 -1.5899 0.1119 -0.7127 0.0743
exerciseYes -3.9096 0.2364 -16.5386 0.0000 -4.3730 -3.4462
age_group35-49 2.1600 0.3386 6.3802 0.0000 1.4964 2.8237
age_group50-64 4.1696 0.3200 13.0294 0.0000 3.5423 4.7969
age_group65+ 3.8272 0.3198 11.9689 0.0000 3.2003 4.4540
income$100K+ -3.1297 0.2839 -11.0238 0.0000 -3.6862 -2.5732
educationHS/GED 0.1229 0.3542 0.3469 0.7287 -0.5714 0.8171
educationSome college 0.1324 0.3675 0.3602 0.7187 -0.5880 0.8527
educationCollege graduate 0.5215 0.3880 1.3442 0.1789 -0.2390 1.2820
smokingCurrent 1.3657 0.2994 4.5620 0.0000 0.7789 1.9526
smokingFormer 1.0742 0.2399 4.4777 0.0000 0.6040 1.5445

I chose the maximum model for my final model. The R2 and BIC methods favored the inclusion of all 8 predictors, and the stepwise selection methods only removed one variable (education). Given that education may serve as a confounder or exhibit an interaction effect with another predictor, I decided to keep it in the model.

Exercise: Holding all other variables constant, those who exercise have about 4 less physically unhealthy days per month than those who do not exercise.

BMI: Holding all other variables constant, each 1 unit increase in BMI is associated with an increase of 0.08 physically unhealthy days per month.

Smoking current: Holding all other variables constant, current smokers have about 1.4 more physically unhealthy days per month compared to never smokers.

par(mfrow = c(2, 2)) 
plot(model) 

par(mfrow = c(1, 1))

Residuals vs. Fitted: The points here show a clear downward trend, with the red line deviating from the zero line. This indicates some nonlinearity between the predictors and the outcome.

Normal Q-Q: The Q-Q plot mostly sticks to the dotted line at first, but then deviates. This suggests a slight departure from normality.

Scale-Location: The red line is not flat or horizontal, and the points are not randomly spread, indicating some degree of heteroscedasticity.

Residuals vs. Leverage: No single point falls outside of Cooks Distance 0.5 or 1.0, suggesting the model is not swayed by extreme observations.

The LINE assumptions are not reasonably satisfied in this case. Although the Cook’s Distance and Q-Q plots are acceptable, the Residuals vs. Fitted and Scale Location plots suggest significant departure from the LINE assumptions. However, the large sample size likely ensures that these violations will not drastically affect the results.

brfss_analytic <- brfss_analytic %>%
  mutate(
      frequent_distress = case_when(
      physhlth_days >= 14 ~ 1,
      physhlth_days < 14 ~ 0,
      ),
    

  frequent_distress = factor(frequent_distress, levels = c(0,1), labels = c("No", "Yes")))

brfss_analytic |>
  count(frequent_distress) |>
  mutate(pct = round(100 * n / sum(n), 1)) |>
  kable(col.names = c("Frequent Distress", "N", "%"),
        caption = "Outcome Distribution") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Outcome Distribution
Frequent Distress N %
No 6368 79.6
Yes 1632 20.4

The outcome frequent distress is not very balanced – most individuals (80%) report no frequent distress.

mod_simple <- glm(frequent_distress ~ exercise, 
data = brfss_analytic, family = binomial)

tidy(mod_simple, conf.int = TRUE, exponentiate = TRUE) |>
  kable(digits = 3,
        caption = "Simple Logistic Regression: Frequent Distress ~ Exercise (Odds Ratio Scale)") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Simple Logistic Regression: Frequent Distress ~ Exercise (Odds Ratio Scale)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 0.647 0.043 -10.154 0 0.595 0.704
exerciseYes 0.228 0.058 -25.367 0 0.203 0.255

I chose exercise for the logistic regression because I expect it to have a strong association with physical distress. In the final linear model, exercise was strongly inversely associated with number of poor physical health days (3.9096 less physically unhealthy days in Exercisers vs. Non-Exercisers).

Compared to non-exercisers, exercisers have 0.228 times the odds of frequent distress, 95% CI [0.203, 0.255]

The association is statistically significant because p < 0.05.

mod_logistic <- glm(frequent_distress ~ menthlth_days + bmi + sex + exercise + age_group + income + education + smoking,
data = brfss_analytic, family = binomial)

tidy(mod_logistic, conf.int = TRUE, exponentiate = TRUE) |>
  kable(digits = 3,
        caption = "Multiple Logistic Regression: Frequent Distress (Odds Ratio Scale)") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Multiple Logistic Regression: Frequent Distress (Odds Ratio Scale)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 0.044 0.200 -15.596 0.000 0.030 0.065
menthlth_days 1.080 0.003 24.542 0.000 1.073 1.086
bmi 1.025 0.004 5.940 0.000 1.017 1.034
sexFemale 0.977 0.067 -0.342 0.733 0.856 1.115
exerciseYes 0.426 0.068 -12.545 0.000 0.373 0.487
age_group35-49 2.113 0.135 5.538 0.000 1.625 2.761
age_group50-64 3.903 0.125 10.920 0.000 3.067 5.002
age_group65+ 3.663 0.125 10.388 0.000 2.877 4.697
income$100K+ 0.270 0.112 -11.739 0.000 0.216 0.335
educationHS/GED 1.081 0.102 0.765 0.444 0.886 1.322
educationSome college 1.059 0.108 0.532 0.595 0.858 1.308
educationCollege graduate 1.139 0.119 1.091 0.275 0.902 1.439
smokingCurrent 1.406 0.087 3.911 0.000 1.185 1.668
smokingFormer 1.437 0.078 4.663 0.000 1.233 1.673

Holding all other variables constant, each one-unit increase in BMI is associated with 1.025 times the risk of frequent physical distress.

Holding all other variables constant, those age 65+ have 3.663 times the risk of frequent physical distress compared with those 18-34.

Holding all other variables constant, those who currently smoke have 1.406 times the risk of frequent physical distress compared with those who have never smoked.

red_mod <- glm(frequent_distress ~ menthlth_days + bmi + sex + exercise + age_group + income + education, data = brfss_analytic, family = binomial)

# Likelihood ratio test
lr_test <- anova(red_mod, mod_logistic, test = "Chisq")

lr_test |>
  kable(digits = 3, caption = "Likelihood Ratio Test: Full Model vs. Reduced Model") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Likelihood Ratio Test: Full Model vs. Reduced Model
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
7988 6090.921 NA NA NA
7986 6063.532 2 27.389 0

Null hypothesis: There is no difference in predictive value between the reduced and full models.

Alternative hypothesis: There is a difference in predictive value between the reduced and full models.

p < 0.05. Thus, smoking-related predictors significantly improve this model.

roc_obj <- roc(brfss_analytic$frequent_distress, fitted(mod_logistic)) 
plot(roc_obj, main = "ROC Curve: Frequent Physical Distress Model", 
print.auc = TRUE) 

auc(roc_obj) 
## Area under the curve: 0.8312
ci.auc(roc_obj) 
## 95% CI: 0.8202-0.8422 (DeLong)

The area under the curve was 0.8312. This indicates excellent discrimination - that is, the model is excellent at distinguishing individuals with and without frequent physical distress from the predictors.

hoslem.test(mod_logistic$y, fitted(mod_logistic), g = 10) 
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  mod_logistic$y, fitted(mod_logistic)
## X-squared = 11.315, df = 8, p-value = 0.1845

Null hypothesis: the model fits the data well. Alternative hypothesis: the model does not fit the data well.

p > 0.05, thus we do not reject the null hypothesis and conclude this model fits the data well.

Both the Hosmer-Lemeshow and ROC/AUC findings agree that this particular model fits the data well. In other words, this model, with its particular set of predictors, is good at distinguishing individuals that do and do not exhibit the outcome of frequent distress.

REFLECTION

In the linear and logistic models, number of poor mental health days, BMI, exercise, age, income, and smoking were significantly associated with the outcome of frequent distress. Sex and education were not significantly associated with frequent distress. The predictors with the strongest magnitude of association were exercise, age, and income, suggesting that these factors are most strongly associated with physical health in US adults. In terms of the predictors’ statistical significance and magnitude of associations with the outcome, the models generally agreed with one another. Some limitations of cross-sectional data such as BRFSS is that the data are collected at a single point in time, so it is not possible to gather the temporal exposure-outcome information needed to make causal inferences. Also, since this is observational data, there may be unmeasured confounders distorting the observed associations. One potential confounder might be the diagnosis of a chronic disease, which would be positively associated with age, negatively associated with exercise (those with chronic diseases may be less able to exercise), and associated with the outcome of frequent physical distress. Another confounder might be the use of alcohol or other drugs, which is associated with socioeconomic characteristics such as income and also would be associated with physical distress.

The linear regression model models the linear relationship between a continuous outcome and one or more predictors, allowing you to predict the specific value of an outcome from the values of the predictors. The logistic regression model models the relationship between a binary outcome and one or more predictors, allowing you to predict the likelihood of a particular outcome given the values of the predictors. The linear regression model is able to predict specific values of the outcome, while the logistic regression model predicts the likelihood of one of two values of the outcome. Linear regression is preferred when we have a variable with more than two outcomes, while logistic regression is preferred when we have a binary variable. R2 tells you the proportion of variance in the outcome variable explained by the predictor variables. Thus, a model has a high R2 if it explains a high proportion of outcome variance and is thus a good predictor of the outcome. AUC tells you how well the model is able to discriminate one value of the binary outcome from the other.

I did not use AI to assist with this assignment.