Overview

This assignment examines factors associated with physical health burden among U.S. adults using 2023 BRFSS data. I first model the continuous outcome, physhlth_days, using linear regression and formal model selection procedures. I then reframe the same research question using a binary outcome, frequent_distress, defined as 14 or more physically unhealthy days in the past 30 days, and analyze it with logistic regression.

Required Packages

library(tidyverse)
library(haven)
library(broom)
library(knitr)
library(kableExtra)
library(car)
library(gtsummary)
library(leaps)
library(pROC)
library(ResourceSelection)

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

Part 0: Data Preparation

Predictor Pool Justification

For the maximum model, I selected ten predictors, mentally unhealthy days, BMI, sex, exercise, depression history, age group, income, education, smoking status, and general health status. These variables represent individual, behavioral, socioeconomic, and health-related characteristics that are plausibly associated with physical health burden. This predictor pool includes both continuous predictors, several binary predictors, and multiple categorical predictors.

Step 1: Import Data and Select Variables

brfss_raw <- read_xpt("LLCP2023.XPT ")

tibble(
  Metric = c("Rows in raw dataset", "Columns in raw dataset"),
  Value = c(nrow(brfss_raw), ncol(brfss_raw))
) |>
  kable(caption = "Raw BRFSS 2023 Dataset Dimensions") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Raw BRFSS 2023 Dataset Dimensions
Metric Value
Rows in raw dataset 433323
Columns in raw dataset 350
brfss_work <- brfss_raw |>
  select(
    PHYSHLTH,
    MENTHLTH,
    `_BMI5`,
    SEXVAR,
    EXERANY2,
    ADDEPEV3,
    `_AGEG5YR`,
    `_INCOMG1`,
    EDUCA,
    `_SMOKER3`,
    GENHLTH
  )

glimpse(brfss_work)
## Rows: 433,323
## Columns: 11
## $ PHYSHLTH   <dbl> 88, 88, 6, 2, 88, 2, 8, 1, 5, 88, 88, 2, 88, 88, 88, 4, 30,…
## $ MENTHLTH   <dbl> 88, 88, 2, 88, 88, 3, 77, 88, 88, 88, 88, 88, 88, 88, 88, 8…
## $ `_BMI5`    <dbl> 3047, 2856, 2231, 2744, 2585, 3018, 2441, 2727, 3347, 2296,…
## $ SEXVAR     <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 1, 2, 1,…
## $ EXERANY2   <dbl> 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 1, 2, 2, 1, 7, 2, 2, 1,…
## $ ADDEPEV3   <dbl> 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ `_AGEG5YR` <dbl> 13, 13, 13, 12, 12, 9, 13, 12, 13, 12, 8, 12, 10, 12, 12, 1…
## $ `_INCOMG1` <dbl> 9, 9, 1, 9, 5, 5, 4, 9, 4, 5, 5, 9, 6, 6, 6, 2, 9, 9, 3, 4,…
## $ EDUCA      <dbl> 5, 5, 4, 5, 5, 5, 4, 5, 5, 4, 5, 4, 6, 5, 5, 4, 6, 6, 5, 4,…
## $ `_SMOKER3` <dbl> 4, 4, 3, 4, 4, 4, 3, 4, 4, 3, 4, 4, 3, 3, 3, 4, 4, 4, 1, 3,…
## $ GENHLTH    <dbl> 2, 2, 4, 2, 4, 3, 4, 4, 3, 3, 3, 2, 2, 3, 3, 4, 5, 3, 3, 3,…

Step 2: Recode Variables

brfss_clean <- brfss_work |>
  transmute(
    # Outcome: physically unhealthy days in past 30 days
    physhlth_days = case_when(
      PHYSHLTH == 88 ~ 0,
      PHYSHLTH %in% c(77, 99) ~ NA_real_,
      between(PHYSHLTH, 1, 30) ~ as.numeric(PHYSHLTH),
      TRUE ~ NA_real_
    ),
    # Continuous predictor: mentally unhealthy days in past 30 days
    menthlth_days = case_when(
      MENTHLTH == 88 ~ 0,
      MENTHLTH %in% c(77, 99) ~ NA_real_,
      between(MENTHLTH, 1, 30) ~ as.numeric(MENTHLTH),
      TRUE ~ NA_real_
    ),
    # Continuous predictor: BMI
    bmi = case_when(
      `_BMI5` == 9999 ~ NA_real_,
      TRUE ~ as.numeric(`_BMI5`) / 100
    ),
    # Binary predictors
    sex = factor(case_when(
      SEXVAR == 1 ~ "Male",
      SEXVAR == 2 ~ "Female",
      TRUE ~ NA_character_
    ), levels = c("Male", "Female")),
    exercise = factor(case_when(
      EXERANY2 == 1 ~ "Yes",
      EXERANY2 == 2 ~ "No",
      EXERANY2 %in% c(7, 9) ~ NA_character_,
      TRUE ~ NA_character_
    ), levels = c("No", "Yes")),
    depression = factor(case_when(
      ADDEPEV3 == 1 ~ "Yes",
      ADDEPEV3 == 2 ~ "No",
      ADDEPEV3 %in% c(7, 9) ~ NA_character_,
      TRUE ~ NA_character_
    ), levels = c("No", "Yes")),
    # Categorical predictors
    age_group = factor(case_when(
      `_AGEG5YR` %in% 1:3 ~ "18-34",
      `_AGEG5YR` %in% 4:6 ~ "35-49",
      `_AGEG5YR` %in% 7:9 ~ "50-64",
      `_AGEG5YR` %in% 10:13 ~ "65+",
      `_AGEG5YR` == 14 ~ NA_character_,
      TRUE ~ NA_character_
    ), levels = c("18-34", "35-49", "50-64", "65+")),
    income = factor(case_when(
      `_INCOMG1` %in% 1:2 ~ "Less than $25K",
      `_INCOMG1` %in% 3:4 ~ "$25K-$49K",
      `_INCOMG1` %in% 5:6 ~ "$50K-$99K",
      `_INCOMG1` == 7 ~ "$100K+",
      `_INCOMG1` == 9 ~ NA_character_,
      TRUE ~ NA_character_
    ), levels = c("Less than $25K", "$25K-$49K", "$50K-$99K", "$100K+")),
    education = factor(case_when(
      EDUCA %in% 1:3 ~ "Less than HS",
      EDUCA == 4 ~ "HS/GED",
      EDUCA == 5 ~ "Some college",
      EDUCA == 6 ~ "College graduate",
      EDUCA == 9 ~ NA_character_,
      TRUE ~ NA_character_
    ), levels = c("Less than HS", "HS/GED", "Some college", "College graduate")),
    smoking = factor(case_when(
      `_SMOKER3` %in% 1:2 ~ "Current",
      `_SMOKER3` == 3 ~ "Former",
      `_SMOKER3` == 4 ~ "Never",
      `_SMOKER3` == 9 ~ NA_character_,
      TRUE ~ NA_character_
    ), levels = c("Never", "Former", "Current")),
    gen_health = factor(case_when(
      GENHLTH %in% 1:2 ~ "Excellent/Very Good",
      GENHLTH == 3 ~ "Good",
      GENHLTH %in% 4:5 ~ "Fair/Poor",
      GENHLTH %in% c(7, 9) ~ NA_character_,
      TRUE ~ NA_character_
    ), levels = c("Excellent/Very Good", "Good", "Fair/Poor"))
  )

str(brfss_clean)
## tibble [433,323 × 11] (S3: tbl_df/tbl/data.frame)
##  $ physhlth_days: num [1:433323] 0 0 6 2 0 2 8 1 5 0 ...
##  $ menthlth_days: num [1:433323] 0 0 2 0 0 3 NA 0 0 0 ...
##  $ bmi          : num [1:433323] 30.5 28.6 22.3 27.4 25.9 ...
##  $ sex          : Factor w/ 2 levels "Male","Female": 2 2 2 2 2 2 1 2 2 1 ...
##  $ exercise     : Factor w/ 2 levels "No","Yes": 1 2 2 2 2 2 1 1 1 2 ...
##  $ depression   : Factor w/ 2 levels "No","Yes": 1 2 1 2 2 1 1 1 1 1 ...
##  $ age_group    : Factor w/ 4 levels "18-34","35-49",..: 4 4 4 4 4 3 4 4 4 4 ...
##  $ income       : Factor w/ 4 levels "Less than $25K",..: NA NA 1 NA 3 3 2 NA 2 3 ...
##  $ education    : Factor w/ 4 levels "Less than HS",..: 3 3 2 3 3 3 2 3 3 2 ...
##  $ smoking      : Factor w/ 3 levels "Never","Former",..: 1 1 2 1 1 1 2 1 1 2 ...
##  $ gen_health   : Factor w/ 3 levels "Excellent/Very Good",..: 1 1 3 1 3 2 3 3 2 2 ...

Step 3: Missingness, Analytic Sample, and Descriptive Summary

missing_summary <- brfss_clean |>
  summarise(
    across(
      c(physhlth_days, menthlth_days, bmi, exercise, depression, gen_health),
      ~ sum(is.na(.x)),
      .names = "{.col}"
    )
  ) |>
  pivot_longer(everything(), names_to = "Variable", values_to = "Missing_N") |>
  mutate(Missing_Percent = 100 * Missing_N / nrow(brfss_clean))

missing_summary |>
  kable(digits = 2, caption = "Missingness Before Removing Missing Data") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Missingness Before Removing Missing Data
Variable Missing_N Missing_Percent
physhlth_days 10785 2.49
menthlth_days 8108 1.87
bmi 40535 9.35
exercise 1251 0.29
depression 2587 0.60
gen_health 1262 0.29
# Reproduce random sample
set.seed(1220)
brfss_analytic <- brfss_clean |>
  drop_na() |>
  slice_sample(n = 8000)

tibble(
  Metric = c("Final analytic sample size", "Number of variables"),
  Value = c(nrow(brfss_analytic), ncol(brfss_analytic))
) |>
  kable(caption = "Analytic Dataset Dimensions") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Analytic Dataset Dimensions
Metric Value
Final analytic sample size 8000
Number of variables 11
brfss_analytic |>
  tbl_summary(
    label = list(
      physhlth_days ~ "Physically unhealthy days, past 30",
      menthlth_days ~ "Mentally unhealthy days, past 30",
      bmi ~ "Body mass index",
      sex ~ "Sex assigned at birth",
      exercise ~ "Any exercise in past 30 days",
      depression ~ "Ever told depressive disorder",
      age_group ~ "Age group",
      income ~ "Annual household income",
      education ~ "Education level",
      smoking ~ "Smoking status",
      gen_health ~ "General health status"
    ),
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) |>
  add_n() |>
  bold_labels() |>
  italicize_levels() |>
  modify_caption("**Table 1. Descriptive Statistics — BRFSS 2023 Analytic Sample (n = 8,000)**")
Table 1. Descriptive Statistics — BRFSS 2023 Analytic Sample (n = 8,000)
Characteristic N N = 8,0001
Physically unhealthy days, past 30 8,000 4.3 (8.5)
Mentally unhealthy days, past 30 8,000 4.4 (8.3)
Body mass index 8,000 28.6 (6.7)
Sex assigned at birth 8,000
    Male
3,905 (49%)
    Female
4,095 (51%)
Any exercise in past 30 days 8,000 6,224 (78%)
Ever told depressive disorder 8,000 1,694 (21%)
Age group 8,000
    18-34
1,328 (17%)
    35-49
1,650 (21%)
    50-64
2,145 (27%)
    65+
2,877 (36%)
Annual household income 8,000
    Less than $25K
1,101 (14%)
    $25K-$49K
1,927 (24%)
    $50K-$99K
4,341 (54%)
    $100K+
631 (7.9%)
Education level 8,000
    Less than HS
384 (4.8%)
    HS/GED
1,941 (24%)
    Some college
2,067 (26%)
    College graduate
3,608 (45%)
Smoking status 8,000
    Never
4,884 (61%)
    Former
2,261 (28%)
    Current
855 (11%)
General health status 8,000
    Excellent/Very Good
3,990 (50%)
    Good
2,623 (33%)
    Fair/Poor
1,387 (17%)
1 Mean (SD); n (%)

Part 1: Model Selection for Linear Regression

Step 1: Fit the Maximum Model

mod_max <- lm(
  physhlth_days ~ menthlth_days + bmi + sex + exercise + depression +
    age_group + income + education + smoking + gen_health,
  data = brfss_analytic
)

summary(mod_max)
## 
## Call:
## lm(formula = physhlth_days ~ menthlth_days + bmi + sex + exercise + 
##     depression + age_group + income + education + smoking + gen_health, 
##     data = brfss_analytic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.2174  -2.7092  -1.1314   0.6714  30.0121 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                1.3194460  0.5780595   2.283 0.022483 *  
## menthlth_days              0.1990805  0.0108958  18.271  < 2e-16 ***
## bmi                       -0.0009936  0.0122581  -0.081 0.935399    
## sexFemale                  0.2381885  0.1582722   1.505 0.132382    
## exerciseYes               -2.1205593  0.1995982 -10.624  < 2e-16 ***
## depressionYes              0.3476674  0.2126271   1.635 0.102067    
## age_group35-49             0.3535773  0.2601484   1.359 0.174142    
## age_group50-64             1.5652359  0.2483827   6.302 3.10e-10 ***
## age_group65+               1.6307384  0.2425275   6.724 1.89e-11 ***
## income$25K-$49K           -0.4765953  0.2670005  -1.785 0.074300 .  
## income$50K-$99K           -1.1805694  0.2594204  -4.551 5.42e-06 ***
## income$100K+              -0.9897185  0.3825676  -2.587 0.009698 ** 
## educationHS/GED            0.8888178  0.3931248   2.261 0.023792 *  
## educationSome college      1.1012326  0.3978171   2.768 0.005650 ** 
## educationCollege graduate  1.3386538  0.4027682   3.324 0.000893 ***
## smokingFormer              0.2340748  0.1818188   1.287 0.197990    
## smokingCurrent             0.3990517  0.2697480   1.479 0.139086    
## gen_healthGood             1.1875598  0.1809538   6.563 5.61e-11 ***
## gen_healthFair/Poor       10.0607051  0.2495420  40.317  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.891 on 7981 degrees of freedom
## Multiple R-squared:  0.3465, Adjusted R-squared:  0.345 
## F-statistic: 235.1 on 18 and 7981 DF,  p-value: < 2.2e-16
max_model_metrics <- tibble(
  Metric = c("R-squared", "Adjusted R-squared", "AIC", "BIC"),
  Value = c(
    summary(mod_max)$r.squared,
    summary(mod_max)$adj.r.squared,
    AIC(mod_max),
    BIC(mod_max)
  )
)

max_model_metrics |>
  mutate(Value = round(Value, 3)) |>
  kable(caption = "Maximum Linear Model Fit Statistics") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Maximum Linear Model Fit Statistics
Metric Value
R-squared 0.347
Adjusted R-squared 0.345
AIC 53606.970
BIC 53746.714

The maximum model has an R-squared of 0.347 and an adjusted R-squared of 0.345. This means the full set of predictors explains approximately 34.7% of the variation in physically unhealthy days in this analytic sample, indicating moderate explanatory power. The AIC was 53,606.97 and the BIC was 53,746.71. These model comparison criteria are useful for evaluating competing models fit to the same outcome, with lower values indicating a better balance between goodness of fit and parsimony. Overall, these results suggest the maximum model captures a meaningful portion of the variation in physical health burden and provides a strong starting point for sequential model selection procedures.

Step 2: Best Subsets Regression

x_mat <- model.matrix(mod_max)[, -1]
x_df <- as.data.frame(x_mat)
names(x_df) <- make.names(names(x_df), unique = TRUE)

bestsub_data <- x_df |>
  mutate(physhlth_days = brfss_analytic$physhlth_days)

nvmax_val <- ncol(x_df)

bestsub_fit <- regsubsets(
  physhlth_days ~ .,
  data = bestsub_data,
  nvmax = nvmax_val,
  method = "exhaustive"
)

bestsub_sum <- summary(bestsub_fit)

bestsub_table <- tibble(
  Model_Size = 1:nvmax_val,
  Adjusted_R_Squared = bestsub_sum$adjr2,
  BIC = bestsub_sum$bic
)

bestsub_table |>
  mutate(
    Adjusted_R_Squared = round(Adjusted_R_Squared, 3),
    BIC = round(BIC, 1)
  ) |>
  kable(caption = "Best Subsets Results by Model Size") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Best Subsets Results by Model Size
Model_Size Adjusted_R_Squared BIC
1 0.281 -2618.0
2 0.317 -3027.7
3 0.331 -3183.4
4 0.336 -3231.6
5 0.338 -3256.3
6 0.342 -3296.9
7 0.344 -3303.2
8 0.344 -3299.4
9 0.344 -3293.5
10 0.344 -3287.0
11 0.344 -3280.9
12 0.345 -3275.5
13 0.345 -3270.0
14 0.345 -3263.7
15 0.345 -3256.6
16 0.345 -3249.0
17 0.345 -3241.7
18 0.345 -3232.7
par(mfrow = c(1, 2))
plot(bestsub_table$Model_Size, bestsub_table$Adjusted_R_Squared,
     type = "b", xlab = "Model size", ylab = "Adjusted R-squared",
     main = "Adjusted R-squared by Model Size")
plot(bestsub_table$Model_Size, bestsub_table$BIC,
     type = "b", xlab = "Model size", ylab = "BIC",
     main = "BIC by Model Size")

par(mfrow = c(1, 1))
best_adjr2_size <- bestsub_table$Model_Size[which.max(bestsub_table$Adjusted_R_Squared)]
best_bic_size <- bestsub_table$Model_Size[which.min(bestsub_table$BIC)]

tibble(
  Criterion = c("Maximum adjusted R-squared", "Minimum BIC"),
  Best_Model_Size = c(best_adjr2_size, best_bic_size)
) |>
  kable(caption = "Best Model Size by Selection Criterion") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Best Model Size by Selection Criterion
Criterion Best_Model_Size
Maximum adjusted R-squared 17
Minimum BIC 7

Best subsets regression treats each dummy variable from a categorical predictor as a separate candidate predictor, so some levels of a categorical variable may be selected while others are left out, which is a known limitation of this method. In my results, the model that maximized adjusted R-squared included 17 predictors, while the model that minimized BIC included 7 predictors. The two criteria did not agree on the same best model size. Adjusted R-squared kept improving slightly as more predictors were added and favored a larger model, while BIC favored a simpler model because it gives a stronger penalty for adding too many predictors. Based on these results, BIC chose the simpler model, while adjusted R-squared focused more on improving model fit. Overall, the plots show model fit improved quickly with the first several predictors and then leveled off, suggesting little added benefit after about seven predictors.

Step 3: Stepwise Selection Methods

mod_null <- lm(physhlth_days ~ 1, data = brfss_analytic)

mod_backward <- step(mod_max, direction = "backward", trace = 0)
mod_forward <- step(
  mod_null,
  scope = list(lower = mod_null, upper = mod_max),
  direction = "forward",
  trace = 0
)
mod_both <- step(
  mod_null,
  scope = list(lower = mod_null, upper = mod_max),
  direction = "both",
  trace = 0
)

backward_terms <- attr(terms(mod_backward), "term.labels")
forward_terms <- attr(terms(mod_forward), "term.labels")
both_terms <- attr(terms(mod_both), "term.labels")

stepwise_results <- tibble(
  Method = c("Backward elimination", "Forward selection", "Stepwise selection"),
  Final_Model_Terms = c(
    paste(backward_terms, collapse = ", "),
    paste(forward_terms, collapse = ", "),
    paste(both_terms, collapse = ", ")
  )
)

stepwise_results |>
  kable(caption = "Final Predictors Retained by Each Stepwise Method") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Final Predictors Retained by Each Stepwise Method
Method Final_Model_Terms
Backward elimination menthlth_days, exercise, depression, age_group, income, education, gen_health
Forward selection gen_health, menthlth_days, exercise, age_group, income, education, depression
Stepwise selection gen_health, menthlth_days, exercise, age_group, income, education, depression

Backward elimination, forward selection, and stepwise selection produced very similar results and largely agreed on the same final model. All three methods retained mental health days, exercise, depression, age group, income, education, and general health, suggesting these variables were consistently important predictors of physically unhealthy days. The forward and stepwise models were identical, while backward elimination selected the same predictors but in a different order. Variables excluded by all three methods were BMI, sex, and smoking, suggesting they were less useful for predicting the outcome in this sample after accounting for other variables. Overall, the strong agreement across methods increases confidence that the retained predictors are important contributors to the final model.

Step 4: Final Model Selection and Interpretation

mod_final <- mod_both

final_linear_table <- tidy(mod_final, conf.int = TRUE)

final_linear_table |>
  mutate(across(where(is.numeric), ~ round(.x, 3))) |>
  kable(caption = "Final Linear Regression Model Coefficients") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Final Linear Regression Model Coefficients
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 1.507 0.460 3.279 0.001 0.606 2.408
gen_healthGood 1.195 0.178 6.706 0.000 0.846 1.544
gen_healthFair/Poor 10.066 0.246 40.954 0.000 9.584 10.548
menthlth_days 0.201 0.011 18.474 0.000 0.179 0.222
exerciseYes -2.131 0.198 -10.774 0.000 -2.519 -1.743
age_group35-49 0.421 0.257 1.636 0.102 -0.083 0.925
age_group50-64 1.636 0.245 6.667 0.000 1.155 2.117
age_group65+ 1.700 0.238 7.140 0.000 1.233 2.167
income$25K-$49K -0.488 0.267 -1.829 0.067 -1.010 0.035
income$50K-$99K -1.223 0.258 -4.749 0.000 -1.728 -0.718
income$100K+ -1.065 0.380 -2.804 0.005 -1.810 -0.321
educationHS/GED 0.880 0.393 2.241 0.025 0.110 1.649
educationSome college 1.094 0.397 2.759 0.006 0.317 1.872
educationCollege graduate 1.292 0.400 3.233 0.001 0.509 2.076
depressionYes 0.402 0.210 1.912 0.056 -0.010 0.814

My final model was guided primarily by the stepwise AIC procedure and checked against the best subsets results. I chose this model because it balances model fit with simplicity while keeping predictors that contribute meaningfully to predicting physically unhealthy days.

Use your coefficient table to complete sentences like these:

Holding all other variables constant, each one-day increase in mentally unhealthy days was associated with an average 0.20-day increase in physically unhealthy days. Adults who exercised in the past 30 days had, on average, 2.13 fewer physically unhealthy days than adults who did not exercise, controlling for all other variables. In addition, adults reporting fair or poor general health had, on average, 10.07 more physically unhealthy days than adults reporting excellent or very good health, holding all other variables constant, making general health one of the strongest predictors in the model. Older adults ages 50–64 and 65+ also reported significantly more unhealthy days than adults ages 18–34, while higher income levels were generally associated with fewer unhealthy days. Overall, the final model suggests mental health burden, exercise, self-rated health, age, and socioeconomic factors are important factors associated with physical health burden among U.S. adults.

Step 5: LINE Assumption Check

par(mfrow = c(2, 2))
plot(mod_final)

par(mfrow = c(1, 1))

Residuals vs. Fitted: The residuals are centered around zero, but the plot shows some curvature and a patterned spread rather than completely random scatter, suggesting some evidence of non-linearity and possible non-constant variance. This may reflect the bounded and skewed nature of the physically unhealthy days outcome.

Normal Q-Q: The residuals do not closely follow the reference line, especially in the tails, indicating departures from normality. This is not unexpected for BRFSS health-days outcomes, which are bounded between 0 and 30 and often have skewed distributions.

Scale-Location: The red line is not completely horizontal and the spread of residuals appears to increase somewhat across fitted values, suggesting some unequal variance of residuals.

Residuals vs. Leverage: There are some observations with moderate leverage, but no strong evidence of highly influential observations with large Cook’s distance. This suggests no single observation appears to influence the model.

Overall, the LINE assumptions are mostly met, although there is some evidence that the residuals are not perfectly normal and the variance is not completely constant. These issues are common with large survey data and outcomes like unhealthy days that are limited between 0 and 30. Even with these small violations, the model still appears reasonable to interpret, but the results should be considered with these limitations in mind.

Part 2: Logistic Regression

Step 1: Create the Binary Outcome

brfss_analytic <- brfss_analytic |>
  mutate(
    frequent_distress = factor(
      if_else(physhlth_days >= 14, "Yes", "No"),
      levels = c("No", "Yes")
    )
  )

freq_distress_tab <- brfss_analytic |>
  count(frequent_distress) |>
  mutate(Percent = 100 * n / sum(n))

freq_distress_tab |>
  kable(digits = 2, caption = "Prevalence of Frequent Physical Distress") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Prevalence of Frequent Physical Distress
frequent_distress n Percent
No 6927 86.59
Yes 1073 13.41

Frequent physical distress appears somewhat imbalanced because most participants were in the “No” group, 86.59%, while only 13.41% were in the “Yes” group. Since the outcome has 13.41% in the Yes group, the model still has enough information for estimating frequent distress, although the outcome is less evenly distributed between groups. This imbalance should be noted, but the number of participants reporting frequent distress is still large enough to support logistic regression analysis.

Step 2: Simple Logistic Regression

mod_simple <- glm(
  frequent_distress ~ gen_health,
  data = brfss_analytic,
  family = binomial
)

summary(mod_simple)
## 
## Call:
## glm(formula = frequent_distress ~ gen_health, family = binomial, 
##     data = brfss_analytic)
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -3.34421    0.08725 -38.329   <2e-16 ***
## gen_healthGood       1.07595    0.10999   9.782   <2e-16 ***
## gen_healthFair/Poor  3.33700    0.10245  32.571   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6306.5  on 7999  degrees of freedom
## Residual deviance: 4741.8  on 7997  degrees of freedom
## AIC: 4747.8
## 
## Number of Fisher Scoring iterations: 6
simple_or_table <- tidy(mod_simple, conf.int = TRUE, exponentiate = TRUE) |>
  mutate(across(where(is.numeric), ~ round(.x, 3)))

simple_or_table |>
  kable(caption = "Unadjusted Odds Ratios for Simple Logistic Regression") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Unadjusted Odds Ratios for Simple Logistic Regression
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 0.035 0.087 -38.329 0 0.030 0.042
gen_healthGood 2.933 0.110 9.782 0 2.368 3.646
gen_healthFair/Poor 28.135 0.102 32.571 0 23.089 34.509

I chose general health because adults reporting worse general health would be expected to have higher odds of frequent physical distress. Compared with adults reporting excellent or very good general health, adults reporting fair or poor general health had 28.14 times the odds of frequent physical distress, 95% CI [23.09, 34.51]. This association was statistically significant at alpha = 0.05 because the p-value was <0.001 and the 95% confidence interval does not include 1.0. This suggests that poorer self-rated general health is strongly associated with greater odds of experiencing frequent physical distress.

Step 3: Multiple Logistic Regression

# Use the same predictors as the final linear model.
final_logistic_formula <- update(formula(mod_final), frequent_distress ~ .)

mod_logistic <- glm(
  final_logistic_formula,
  data = brfss_analytic,
  family = binomial
)

summary(mod_logistic)
## 
## Call:
## glm(formula = final_logistic_formula, family = binomial, data = brfss_analytic)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -3.38630    0.22369 -15.138  < 2e-16 ***
## gen_healthGood             0.78944    0.11382   6.936 4.04e-12 ***
## gen_healthFair/Poor        2.64555    0.11183  23.657  < 2e-16 ***
## menthlth_days              0.05319    0.00430  12.368  < 2e-16 ***
## exerciseYes               -0.72914    0.08400  -8.680  < 2e-16 ***
## age_group35-49             0.20493    0.15495   1.323    0.186    
## age_group50-64             0.80095    0.14125   5.670 1.43e-08 ***
## age_group65+               0.78875    0.13972   5.645 1.65e-08 ***
## income$25K-$49K           -0.06743    0.11131  -0.606    0.545    
## income$50K-$99K           -0.46666    0.11349  -4.112 3.93e-05 ***
## income$100K+              -0.51331    0.22749  -2.256    0.024 *  
## educationHS/GED            0.20569    0.16343   1.259    0.208    
## educationSome college      0.14432    0.16703   0.864    0.388    
## educationCollege graduate  0.25825    0.17313   1.492    0.136    
## depressionYes              0.07361    0.09680   0.761    0.447    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6306.5  on 7999  degrees of freedom
## Residual deviance: 4380.9  on 7985  degrees of freedom
## AIC: 4410.9
## 
## Number of Fisher Scoring iterations: 6
logistic_or_table <- tidy(mod_logistic, conf.int = TRUE, exponentiate = TRUE) |>
  mutate(across(where(is.numeric), ~ round(.x, 3)))

logistic_or_table |>
  kable(caption = "Adjusted Odds Ratios for Multiple Logistic Regression") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Adjusted Odds Ratios for Multiple Logistic Regression
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 0.034 0.224 -15.138 0.000 0.022 0.052
gen_healthGood 2.202 0.114 6.936 0.000 1.765 2.758
gen_healthFair/Poor 14.091 0.112 23.657 0.000 11.348 17.595
menthlth_days 1.055 0.004 12.368 0.000 1.046 1.064
exerciseYes 0.482 0.084 -8.680 0.000 0.409 0.569
age_group35-49 1.227 0.155 1.323 0.186 0.907 1.667
age_group50-64 2.228 0.141 5.670 0.000 1.695 2.949
age_group65+ 2.201 0.140 5.645 0.000 1.680 2.906
income$25K-$49K 0.935 0.111 -0.606 0.545 0.752 1.163
income$50K-$99K 0.627 0.113 -4.112 0.000 0.502 0.784
income$100K+ 0.599 0.227 -2.256 0.024 0.378 0.924
educationHS/GED 1.228 0.163 1.259 0.208 0.894 1.697
educationSome college 1.155 0.167 0.864 0.388 0.835 1.607
educationCollege graduate 1.295 0.173 1.492 0.136 0.925 1.823
depressionYes 1.076 0.097 0.761 0.447 0.889 1.300

Holding all other variables constant, each one-unit increase in mentally unhealthy days was associated with multiplying the odds of frequent physical distress by 1.06, meaning the odds increased by about 5.5% for each additional mentally unhealthy day. Holding all other variables constant, adults reporting fair or poor general health had 14.09 times the odds of frequent physical distress compared with adults reporting excellent or very good health. Adults who exercised had lower odds of frequent physical distress, with 0.48 times the odds compared with those who did not exercise, suggesting about a 52% reduction in odds. Because odds ratios above 1 indicate higher odds and odds ratios below 1 indicate lower odds, the findings suggest worse mental and general health are associated with greater odds of frequent physical distress, while exercise is associated with lower odds of frequent physical distress.

Step 4: Likelihood Ratio Test

mod_reduced <- update(mod_logistic, . ~ . - income)

lrt_results <- anova(mod_reduced, mod_logistic, test = "Chisq")
lrt_results
## Analysis of Deviance Table
## 
## Model 1: frequent_distress ~ gen_health + menthlth_days + exercise + age_group + 
##     education + depression
## Model 2: frequent_distress ~ gen_health + menthlth_days + exercise + age_group + 
##     income + education + depression
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1      7988     4404.9                          
## 2      7985     4380.9  3   24.026 2.467e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Null hypothesis: The income terms do not improve the fit of the multiple logistic regression model after accounting for the other predictors

Alternative hypothesis: At least one income term improves the fit of the multiple logistic regression model after accounting for the other predictors.

The likelihood ratio test statistic was 24.03 with 3 degrees of freedom and p = 2.47 × 10^-5. At alpha = 0.05, we reject the null hypothesis, meaning income does significantly improve model fit as a group. The results suggest that including income levels in the multiple logistic regression model provides important information for predicting frequent physical distress beyond the other predictors already in the model.

Step 5: ROC Curve and AUC

roc_obj <- roc(brfss_analytic$frequent_distress, fitted(mod_logistic))
plot(
  roc_obj,
  main = "ROC Curve: Frequent Physical Distress Model",
  print.auc = TRUE
)

auc(roc_obj)
## Area under the curve: 0.8557
ci.auc(roc_obj)
## 95% CI: 0.8424-0.869 (DeLong)

The AUC is 0.856 with a 95% confidence interval of 0.8424 to 0.869. This means that the model has excellent discrimination between adults with and without frequent physical distress. This falls in the excellent discrimination range (0.8–0.9), suggesting the model does a strong job distinguishing individuals who report frequent physical distress from those who do not.

Step 6: Hosmer-Lemeshow Goodness-of-Fit Test

hl_test <- hoslem.test(mod_logistic$y, fitted(mod_logistic), g = 10)
hl_test
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  mod_logistic$y, fitted(mod_logistic)
## X-squared = 6.658, df = 8, p-value = 0.5739

Null hypothesis: The model fits the data well.

Alternative hypothesis: The model does not fit the data well.

The Hosmer-Lemeshow test statistic was 6.658 with 8 degrees of freedom and p = 0.5739. At alpha = 0.05, there is not evidence of poor model fit, so we fail to reject the null hypothesis that the model fits the data well. This suggests the logistic regression model shows good calibration between predicted and observed outcomes. The Hosmer-Lemeshow test evaluates calibration, while the ROC and AUC evaluates discrimination, so the two results provide different but complementary information about model performance. In this case, the non-significant Hosmer-Lemeshow test and the strong AUC both support that the model performs well.

Part 3: Reflection

Write 250-400 words in continuous prose. Do not use bullet points.

This analysis suggests that several individual and health-related factors are associated with physical health burden among U.S. adults. In both the linear and logistic regression models, general health status, mentally unhealthy days, and exercise were among the strongest predictors. Adults reporting fair or poor general health had substantially more physically unhealthy days and much higher odds of frequent physical distress, while exercise was associated with fewer unhealthy days and lower odds of distress. Mental health burden was also consistently associated with worse physical health outcomes in both models. Some predictors, such as certain education levels, appeared significant in the linear model but were not significant in the adjusted logistic model, showing that predictors may differ depending on how the outcome is defined. Because these data are cross-sectional, the results show associations rather than cause-and-effect relationships, since predictors and outcomes were measured at the same time. It is not possible to determine whether these factors caused greater physical health burden or were consequences of it. In addition, there may be unmeasured confounders not included in the model, such as chronic disease status, disability status, access to health care, pain conditions, or occupational factors.

The linear and logistic regression models provided related but different information about the research question. The linear model helped explain factors associated with the average number of physically unhealthy days, while the logistic model focused on the odds of meeting a threshold for frequent physical distress. Linear regression is useful when the outcome is continuous and the goal is estimating average change, while logistic regression is useful when the outcome is binary and the goal is classification or estimating odds. Model evaluation also differed across approaches. For linear regression, model fit was assessed using R-squared, adjusted R-squared, AIC, and BIC, while logistic regression relied more on odds ratios, likelihood ratio tests, ROC/AUC, and the Hosmer-Lemeshow test. Together the methods gave a fuller understanding of model performance.

I used AI assistance to help confirm that everything ran smoothly and to check some interpretation wording. I used it to troubleshoot issues such as file paths, confirm coding steps, and help write some interpretations correctly using the numerical results from my own output. I verified all code and interpretations by checking the assignment instructions, running the R Markdown file from top to bottom, reviewing model output directly, and making sure odds ratios were interpreted as odds rather than probabilities or risks. AI was used as a support tool for clarification, while all analyses and final interpretations were based on my own code and results.

Pre-Submission Checklist

  • File knits to HTML without errors
  • All code chunks run in order from top to bottom
  • Tables use kable() or tbl_summary() formatting
  • Figures have descriptive titles and axis labels
  • Categorical variables display meaningful labels, not numeric codes
  • Sample size is reported at each stage in Part 0
  • set.seed(1220) appears immediately before slice_sample()
  • At least 8 predictors are included in the maximum model
  • All three stepwise methods are run and compared
  • LINE diagnostic plots are produced and interpreted
  • Binary outcome uses the 14-day threshold with the correct reference level
  • Odds ratios, not log-odds, are reported and interpreted
  • Likelihood ratio test includes hypotheses, statistic, df, p-value, and conclusion
  • ROC curve is plotted and AUC is reported
  • Hosmer-Lemeshow test is reported and interpreted
  • Reflection is 250-400 words and addresses A, B, and C
  • Filename follows EPI553_HW04_LastName_FirstName.Rmd
  • RPubs link is pasted into Brightspace comments