This assignment examines factors associated with physical health
burden among U.S. adults using 2023 BRFSS data. I first model the
continuous outcome, physhlth_days, using linear regression
and formal model selection procedures. I then reframe the same research
question using a binary outcome, frequent_distress, defined
as 14 or more physically unhealthy days in the past 30 days, and analyze
it with logistic regression.
For the maximum model, I selected ten predictors, mentally unhealthy days, BMI, sex, exercise, depression history, age group, income, education, smoking status, and general health status. These variables represent individual, behavioral, socioeconomic, and health-related characteristics that are plausibly associated with physical health burden. This predictor pool includes both continuous predictors, several binary predictors, and multiple categorical predictors.
brfss_raw <- read_xpt("LLCP2023.XPT ")
tibble(
Metric = c("Rows in raw dataset", "Columns in raw dataset"),
Value = c(nrow(brfss_raw), ncol(brfss_raw))
) |>
kable(caption = "Raw BRFSS 2023 Dataset Dimensions") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Metric | Value |
|---|---|
| Rows in raw dataset | 433323 |
| Columns in raw dataset | 350 |
brfss_work <- brfss_raw |>
select(
PHYSHLTH,
MENTHLTH,
`_BMI5`,
SEXVAR,
EXERANY2,
ADDEPEV3,
`_AGEG5YR`,
`_INCOMG1`,
EDUCA,
`_SMOKER3`,
GENHLTH
)
glimpse(brfss_work)## Rows: 433,323
## Columns: 11
## $ PHYSHLTH <dbl> 88, 88, 6, 2, 88, 2, 8, 1, 5, 88, 88, 2, 88, 88, 88, 4, 30,…
## $ MENTHLTH <dbl> 88, 88, 2, 88, 88, 3, 77, 88, 88, 88, 88, 88, 88, 88, 88, 8…
## $ `_BMI5` <dbl> 3047, 2856, 2231, 2744, 2585, 3018, 2441, 2727, 3347, 2296,…
## $ SEXVAR <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 1, 2, 1,…
## $ EXERANY2 <dbl> 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 1, 2, 2, 1, 7, 2, 2, 1,…
## $ ADDEPEV3 <dbl> 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ `_AGEG5YR` <dbl> 13, 13, 13, 12, 12, 9, 13, 12, 13, 12, 8, 12, 10, 12, 12, 1…
## $ `_INCOMG1` <dbl> 9, 9, 1, 9, 5, 5, 4, 9, 4, 5, 5, 9, 6, 6, 6, 2, 9, 9, 3, 4,…
## $ EDUCA <dbl> 5, 5, 4, 5, 5, 5, 4, 5, 5, 4, 5, 4, 6, 5, 5, 4, 6, 6, 5, 4,…
## $ `_SMOKER3` <dbl> 4, 4, 3, 4, 4, 4, 3, 4, 4, 3, 4, 4, 3, 3, 3, 4, 4, 4, 1, 3,…
## $ GENHLTH <dbl> 2, 2, 4, 2, 4, 3, 4, 4, 3, 3, 3, 2, 2, 3, 3, 4, 5, 3, 3, 3,…
brfss_clean <- brfss_work |>
transmute(
# Outcome: physically unhealthy days in past 30 days
physhlth_days = case_when(
PHYSHLTH == 88 ~ 0,
PHYSHLTH %in% c(77, 99) ~ NA_real_,
between(PHYSHLTH, 1, 30) ~ as.numeric(PHYSHLTH),
TRUE ~ NA_real_
),
# Continuous predictor: mentally unhealthy days in past 30 days
menthlth_days = case_when(
MENTHLTH == 88 ~ 0,
MENTHLTH %in% c(77, 99) ~ NA_real_,
between(MENTHLTH, 1, 30) ~ as.numeric(MENTHLTH),
TRUE ~ NA_real_
),
# Continuous predictor: BMI
bmi = case_when(
`_BMI5` == 9999 ~ NA_real_,
TRUE ~ as.numeric(`_BMI5`) / 100
),
# Binary predictors
sex = factor(case_when(
SEXVAR == 1 ~ "Male",
SEXVAR == 2 ~ "Female",
TRUE ~ NA_character_
), levels = c("Male", "Female")),
exercise = factor(case_when(
EXERANY2 == 1 ~ "Yes",
EXERANY2 == 2 ~ "No",
EXERANY2 %in% c(7, 9) ~ NA_character_,
TRUE ~ NA_character_
), levels = c("No", "Yes")),
depression = factor(case_when(
ADDEPEV3 == 1 ~ "Yes",
ADDEPEV3 == 2 ~ "No",
ADDEPEV3 %in% c(7, 9) ~ NA_character_,
TRUE ~ NA_character_
), levels = c("No", "Yes")),
# Categorical predictors
age_group = factor(case_when(
`_AGEG5YR` %in% 1:3 ~ "18-34",
`_AGEG5YR` %in% 4:6 ~ "35-49",
`_AGEG5YR` %in% 7:9 ~ "50-64",
`_AGEG5YR` %in% 10:13 ~ "65+",
`_AGEG5YR` == 14 ~ NA_character_,
TRUE ~ NA_character_
), levels = c("18-34", "35-49", "50-64", "65+")),
income = factor(case_when(
`_INCOMG1` %in% 1:2 ~ "Less than $25K",
`_INCOMG1` %in% 3:4 ~ "$25K-$49K",
`_INCOMG1` %in% 5:6 ~ "$50K-$99K",
`_INCOMG1` == 7 ~ "$100K+",
`_INCOMG1` == 9 ~ NA_character_,
TRUE ~ NA_character_
), levels = c("Less than $25K", "$25K-$49K", "$50K-$99K", "$100K+")),
education = factor(case_when(
EDUCA %in% 1:3 ~ "Less than HS",
EDUCA == 4 ~ "HS/GED",
EDUCA == 5 ~ "Some college",
EDUCA == 6 ~ "College graduate",
EDUCA == 9 ~ NA_character_,
TRUE ~ NA_character_
), levels = c("Less than HS", "HS/GED", "Some college", "College graduate")),
smoking = factor(case_when(
`_SMOKER3` %in% 1:2 ~ "Current",
`_SMOKER3` == 3 ~ "Former",
`_SMOKER3` == 4 ~ "Never",
`_SMOKER3` == 9 ~ NA_character_,
TRUE ~ NA_character_
), levels = c("Never", "Former", "Current")),
gen_health = factor(case_when(
GENHLTH %in% 1:2 ~ "Excellent/Very Good",
GENHLTH == 3 ~ "Good",
GENHLTH %in% 4:5 ~ "Fair/Poor",
GENHLTH %in% c(7, 9) ~ NA_character_,
TRUE ~ NA_character_
), levels = c("Excellent/Very Good", "Good", "Fair/Poor"))
)
str(brfss_clean)## tibble [433,323 × 11] (S3: tbl_df/tbl/data.frame)
## $ physhlth_days: num [1:433323] 0 0 6 2 0 2 8 1 5 0 ...
## $ menthlth_days: num [1:433323] 0 0 2 0 0 3 NA 0 0 0 ...
## $ bmi : num [1:433323] 30.5 28.6 22.3 27.4 25.9 ...
## $ sex : Factor w/ 2 levels "Male","Female": 2 2 2 2 2 2 1 2 2 1 ...
## $ exercise : Factor w/ 2 levels "No","Yes": 1 2 2 2 2 2 1 1 1 2 ...
## $ depression : Factor w/ 2 levels "No","Yes": 1 2 1 2 2 1 1 1 1 1 ...
## $ age_group : Factor w/ 4 levels "18-34","35-49",..: 4 4 4 4 4 3 4 4 4 4 ...
## $ income : Factor w/ 4 levels "Less than $25K",..: NA NA 1 NA 3 3 2 NA 2 3 ...
## $ education : Factor w/ 4 levels "Less than HS",..: 3 3 2 3 3 3 2 3 3 2 ...
## $ smoking : Factor w/ 3 levels "Never","Former",..: 1 1 2 1 1 1 2 1 1 2 ...
## $ gen_health : Factor w/ 3 levels "Excellent/Very Good",..: 1 1 3 1 3 2 3 3 2 2 ...
missing_summary <- brfss_clean |>
summarise(
across(
c(physhlth_days, menthlth_days, bmi, exercise, depression, gen_health),
~ sum(is.na(.x)),
.names = "{.col}"
)
) |>
pivot_longer(everything(), names_to = "Variable", values_to = "Missing_N") |>
mutate(Missing_Percent = 100 * Missing_N / nrow(brfss_clean))
missing_summary |>
kable(digits = 2, caption = "Missingness Before Removing Missing Data") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Variable | Missing_N | Missing_Percent |
|---|---|---|
| physhlth_days | 10785 | 2.49 |
| menthlth_days | 8108 | 1.87 |
| bmi | 40535 | 9.35 |
| exercise | 1251 | 0.29 |
| depression | 2587 | 0.60 |
| gen_health | 1262 | 0.29 |
# Reproduce random sample
set.seed(1220)
brfss_analytic <- brfss_clean |>
drop_na() |>
slice_sample(n = 8000)
tibble(
Metric = c("Final analytic sample size", "Number of variables"),
Value = c(nrow(brfss_analytic), ncol(brfss_analytic))
) |>
kable(caption = "Analytic Dataset Dimensions") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Metric | Value |
|---|---|
| Final analytic sample size | 8000 |
| Number of variables | 11 |
brfss_analytic |>
tbl_summary(
label = list(
physhlth_days ~ "Physically unhealthy days, past 30",
menthlth_days ~ "Mentally unhealthy days, past 30",
bmi ~ "Body mass index",
sex ~ "Sex assigned at birth",
exercise ~ "Any exercise in past 30 days",
depression ~ "Ever told depressive disorder",
age_group ~ "Age group",
income ~ "Annual household income",
education ~ "Education level",
smoking ~ "Smoking status",
gen_health ~ "General health status"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1,
missing = "no"
) |>
add_n() |>
bold_labels() |>
italicize_levels() |>
modify_caption("**Table 1. Descriptive Statistics — BRFSS 2023 Analytic Sample (n = 8,000)**")| Characteristic | N | N = 8,0001 |
|---|---|---|
| Physically unhealthy days, past 30 | 8,000 | 4.3 (8.5) |
| Mentally unhealthy days, past 30 | 8,000 | 4.4 (8.3) |
| Body mass index | 8,000 | 28.6 (6.7) |
| Sex assigned at birth | 8,000 | |
| Male | 3,905 (49%) | |
| Female | 4,095 (51%) | |
| Any exercise in past 30 days | 8,000 | 6,224 (78%) |
| Ever told depressive disorder | 8,000 | 1,694 (21%) |
| Age group | 8,000 | |
| 18-34 | 1,328 (17%) | |
| 35-49 | 1,650 (21%) | |
| 50-64 | 2,145 (27%) | |
| 65+ | 2,877 (36%) | |
| Annual household income | 8,000 | |
| Less than $25K | 1,101 (14%) | |
| $25K-$49K | 1,927 (24%) | |
| $50K-$99K | 4,341 (54%) | |
| $100K+ | 631 (7.9%) | |
| Education level | 8,000 | |
| Less than HS | 384 (4.8%) | |
| HS/GED | 1,941 (24%) | |
| Some college | 2,067 (26%) | |
| College graduate | 3,608 (45%) | |
| Smoking status | 8,000 | |
| Never | 4,884 (61%) | |
| Former | 2,261 (28%) | |
| Current | 855 (11%) | |
| General health status | 8,000 | |
| Excellent/Very Good | 3,990 (50%) | |
| Good | 2,623 (33%) | |
| Fair/Poor | 1,387 (17%) | |
| 1 Mean (SD); n (%) | ||
mod_max <- lm(
physhlth_days ~ menthlth_days + bmi + sex + exercise + depression +
age_group + income + education + smoking + gen_health,
data = brfss_analytic
)
summary(mod_max)##
## Call:
## lm(formula = physhlth_days ~ menthlth_days + bmi + sex + exercise +
## depression + age_group + income + education + smoking + gen_health,
## data = brfss_analytic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.2174 -2.7092 -1.1314 0.6714 30.0121
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.3194460 0.5780595 2.283 0.022483 *
## menthlth_days 0.1990805 0.0108958 18.271 < 2e-16 ***
## bmi -0.0009936 0.0122581 -0.081 0.935399
## sexFemale 0.2381885 0.1582722 1.505 0.132382
## exerciseYes -2.1205593 0.1995982 -10.624 < 2e-16 ***
## depressionYes 0.3476674 0.2126271 1.635 0.102067
## age_group35-49 0.3535773 0.2601484 1.359 0.174142
## age_group50-64 1.5652359 0.2483827 6.302 3.10e-10 ***
## age_group65+ 1.6307384 0.2425275 6.724 1.89e-11 ***
## income$25K-$49K -0.4765953 0.2670005 -1.785 0.074300 .
## income$50K-$99K -1.1805694 0.2594204 -4.551 5.42e-06 ***
## income$100K+ -0.9897185 0.3825676 -2.587 0.009698 **
## educationHS/GED 0.8888178 0.3931248 2.261 0.023792 *
## educationSome college 1.1012326 0.3978171 2.768 0.005650 **
## educationCollege graduate 1.3386538 0.4027682 3.324 0.000893 ***
## smokingFormer 0.2340748 0.1818188 1.287 0.197990
## smokingCurrent 0.3990517 0.2697480 1.479 0.139086
## gen_healthGood 1.1875598 0.1809538 6.563 5.61e-11 ***
## gen_healthFair/Poor 10.0607051 0.2495420 40.317 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.891 on 7981 degrees of freedom
## Multiple R-squared: 0.3465, Adjusted R-squared: 0.345
## F-statistic: 235.1 on 18 and 7981 DF, p-value: < 2.2e-16
max_model_metrics <- tibble(
Metric = c("R-squared", "Adjusted R-squared", "AIC", "BIC"),
Value = c(
summary(mod_max)$r.squared,
summary(mod_max)$adj.r.squared,
AIC(mod_max),
BIC(mod_max)
)
)
max_model_metrics |>
mutate(Value = round(Value, 3)) |>
kable(caption = "Maximum Linear Model Fit Statistics") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Metric | Value |
|---|---|
| R-squared | 0.347 |
| Adjusted R-squared | 0.345 |
| AIC | 53606.970 |
| BIC | 53746.714 |
The maximum model has an R-squared of 0.347 and an adjusted R-squared of 0.345. This means the full set of predictors explains approximately 34.7% of the variation in physically unhealthy days in this analytic sample, indicating moderate explanatory power. The AIC was 53,606.97 and the BIC was 53,746.71. These model comparison criteria are useful for evaluating competing models fit to the same outcome, with lower values indicating a better balance between goodness of fit and parsimony. Overall, these results suggest the maximum model captures a meaningful portion of the variation in physical health burden and provides a strong starting point for sequential model selection procedures.
x_mat <- model.matrix(mod_max)[, -1]
x_df <- as.data.frame(x_mat)
names(x_df) <- make.names(names(x_df), unique = TRUE)
bestsub_data <- x_df |>
mutate(physhlth_days = brfss_analytic$physhlth_days)
nvmax_val <- ncol(x_df)
bestsub_fit <- regsubsets(
physhlth_days ~ .,
data = bestsub_data,
nvmax = nvmax_val,
method = "exhaustive"
)
bestsub_sum <- summary(bestsub_fit)
bestsub_table <- tibble(
Model_Size = 1:nvmax_val,
Adjusted_R_Squared = bestsub_sum$adjr2,
BIC = bestsub_sum$bic
)
bestsub_table |>
mutate(
Adjusted_R_Squared = round(Adjusted_R_Squared, 3),
BIC = round(BIC, 1)
) |>
kable(caption = "Best Subsets Results by Model Size") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Model_Size | Adjusted_R_Squared | BIC |
|---|---|---|
| 1 | 0.281 | -2618.0 |
| 2 | 0.317 | -3027.7 |
| 3 | 0.331 | -3183.4 |
| 4 | 0.336 | -3231.6 |
| 5 | 0.338 | -3256.3 |
| 6 | 0.342 | -3296.9 |
| 7 | 0.344 | -3303.2 |
| 8 | 0.344 | -3299.4 |
| 9 | 0.344 | -3293.5 |
| 10 | 0.344 | -3287.0 |
| 11 | 0.344 | -3280.9 |
| 12 | 0.345 | -3275.5 |
| 13 | 0.345 | -3270.0 |
| 14 | 0.345 | -3263.7 |
| 15 | 0.345 | -3256.6 |
| 16 | 0.345 | -3249.0 |
| 17 | 0.345 | -3241.7 |
| 18 | 0.345 | -3232.7 |
par(mfrow = c(1, 2))
plot(bestsub_table$Model_Size, bestsub_table$Adjusted_R_Squared,
type = "b", xlab = "Model size", ylab = "Adjusted R-squared",
main = "Adjusted R-squared by Model Size")
plot(bestsub_table$Model_Size, bestsub_table$BIC,
type = "b", xlab = "Model size", ylab = "BIC",
main = "BIC by Model Size")best_adjr2_size <- bestsub_table$Model_Size[which.max(bestsub_table$Adjusted_R_Squared)]
best_bic_size <- bestsub_table$Model_Size[which.min(bestsub_table$BIC)]
tibble(
Criterion = c("Maximum adjusted R-squared", "Minimum BIC"),
Best_Model_Size = c(best_adjr2_size, best_bic_size)
) |>
kable(caption = "Best Model Size by Selection Criterion") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Criterion | Best_Model_Size |
|---|---|
| Maximum adjusted R-squared | 17 |
| Minimum BIC | 7 |
Best subsets regression treats each dummy variable from a categorical predictor as a separate candidate predictor, so some levels of a categorical variable may be selected while others are left out, which is a known limitation of this method. In my results, the model that maximized adjusted R-squared included 17 predictors, while the model that minimized BIC included 7 predictors. The two criteria did not agree on the same best model size. Adjusted R-squared kept improving slightly as more predictors were added and favored a larger model, while BIC favored a simpler model because it gives a stronger penalty for adding too many predictors. Based on these results, BIC chose the simpler model, while adjusted R-squared focused more on improving model fit. Overall, the plots show model fit improved quickly with the first several predictors and then leveled off, suggesting little added benefit after about seven predictors.
mod_null <- lm(physhlth_days ~ 1, data = brfss_analytic)
mod_backward <- step(mod_max, direction = "backward", trace = 0)
mod_forward <- step(
mod_null,
scope = list(lower = mod_null, upper = mod_max),
direction = "forward",
trace = 0
)
mod_both <- step(
mod_null,
scope = list(lower = mod_null, upper = mod_max),
direction = "both",
trace = 0
)
backward_terms <- attr(terms(mod_backward), "term.labels")
forward_terms <- attr(terms(mod_forward), "term.labels")
both_terms <- attr(terms(mod_both), "term.labels")
stepwise_results <- tibble(
Method = c("Backward elimination", "Forward selection", "Stepwise selection"),
Final_Model_Terms = c(
paste(backward_terms, collapse = ", "),
paste(forward_terms, collapse = ", "),
paste(both_terms, collapse = ", ")
)
)
stepwise_results |>
kable(caption = "Final Predictors Retained by Each Stepwise Method") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Method | Final_Model_Terms |
|---|---|
| Backward elimination | menthlth_days, exercise, depression, age_group, income, education, gen_health |
| Forward selection | gen_health, menthlth_days, exercise, age_group, income, education, depression |
| Stepwise selection | gen_health, menthlth_days, exercise, age_group, income, education, depression |
Backward elimination, forward selection, and stepwise selection produced very similar results and largely agreed on the same final model. All three methods retained mental health days, exercise, depression, age group, income, education, and general health, suggesting these variables were consistently important predictors of physically unhealthy days. The forward and stepwise models were identical, while backward elimination selected the same predictors but in a different order. Variables excluded by all three methods were BMI, sex, and smoking, suggesting they were less useful for predicting the outcome in this sample after accounting for other variables. Overall, the strong agreement across methods increases confidence that the retained predictors are important contributors to the final model.
mod_final <- mod_both
final_linear_table <- tidy(mod_final, conf.int = TRUE)
final_linear_table |>
mutate(across(where(is.numeric), ~ round(.x, 3))) |>
kable(caption = "Final Linear Regression Model Coefficients") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 1.507 | 0.460 | 3.279 | 0.001 | 0.606 | 2.408 |
| gen_healthGood | 1.195 | 0.178 | 6.706 | 0.000 | 0.846 | 1.544 |
| gen_healthFair/Poor | 10.066 | 0.246 | 40.954 | 0.000 | 9.584 | 10.548 |
| menthlth_days | 0.201 | 0.011 | 18.474 | 0.000 | 0.179 | 0.222 |
| exerciseYes | -2.131 | 0.198 | -10.774 | 0.000 | -2.519 | -1.743 |
| age_group35-49 | 0.421 | 0.257 | 1.636 | 0.102 | -0.083 | 0.925 |
| age_group50-64 | 1.636 | 0.245 | 6.667 | 0.000 | 1.155 | 2.117 |
| age_group65+ | 1.700 | 0.238 | 7.140 | 0.000 | 1.233 | 2.167 |
| income$25K-$49K | -0.488 | 0.267 | -1.829 | 0.067 | -1.010 | 0.035 |
| income$50K-$99K | -1.223 | 0.258 | -4.749 | 0.000 | -1.728 | -0.718 |
| income$100K+ | -1.065 | 0.380 | -2.804 | 0.005 | -1.810 | -0.321 |
| educationHS/GED | 0.880 | 0.393 | 2.241 | 0.025 | 0.110 | 1.649 |
| educationSome college | 1.094 | 0.397 | 2.759 | 0.006 | 0.317 | 1.872 |
| educationCollege graduate | 1.292 | 0.400 | 3.233 | 0.001 | 0.509 | 2.076 |
| depressionYes | 0.402 | 0.210 | 1.912 | 0.056 | -0.010 | 0.814 |
My final model was guided primarily by the stepwise AIC procedure and checked against the best subsets results. I chose this model because it balances model fit with simplicity while keeping predictors that contribute meaningfully to predicting physically unhealthy days.
Use your coefficient table to complete sentences like these:
Holding all other variables constant, each one-day increase in mentally unhealthy days was associated with an average 0.20-day increase in physically unhealthy days. Adults who exercised in the past 30 days had, on average, 2.13 fewer physically unhealthy days than adults who did not exercise, controlling for all other variables. In addition, adults reporting fair or poor general health had, on average, 10.07 more physically unhealthy days than adults reporting excellent or very good health, holding all other variables constant, making general health one of the strongest predictors in the model. Older adults ages 50–64 and 65+ also reported significantly more unhealthy days than adults ages 18–34, while higher income levels were generally associated with fewer unhealthy days. Overall, the final model suggests mental health burden, exercise, self-rated health, age, and socioeconomic factors are important factors associated with physical health burden among U.S. adults.
Residuals vs. Fitted: The residuals are centered around zero, but the plot shows some curvature and a patterned spread rather than completely random scatter, suggesting some evidence of non-linearity and possible non-constant variance. This may reflect the bounded and skewed nature of the physically unhealthy days outcome.
Normal Q-Q: The residuals do not closely follow the reference line, especially in the tails, indicating departures from normality. This is not unexpected for BRFSS health-days outcomes, which are bounded between 0 and 30 and often have skewed distributions.
Scale-Location: The red line is not completely horizontal and the spread of residuals appears to increase somewhat across fitted values, suggesting some unequal variance of residuals.
Residuals vs. Leverage: There are some observations with moderate leverage, but no strong evidence of highly influential observations with large Cook’s distance. This suggests no single observation appears to influence the model.
Overall, the LINE assumptions are mostly met, although there is some evidence that the residuals are not perfectly normal and the variance is not completely constant. These issues are common with large survey data and outcomes like unhealthy days that are limited between 0 and 30. Even with these small violations, the model still appears reasonable to interpret, but the results should be considered with these limitations in mind.
brfss_analytic <- brfss_analytic |>
mutate(
frequent_distress = factor(
if_else(physhlth_days >= 14, "Yes", "No"),
levels = c("No", "Yes")
)
)
freq_distress_tab <- brfss_analytic |>
count(frequent_distress) |>
mutate(Percent = 100 * n / sum(n))
freq_distress_tab |>
kable(digits = 2, caption = "Prevalence of Frequent Physical Distress") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| frequent_distress | n | Percent |
|---|---|---|
| No | 6927 | 86.59 |
| Yes | 1073 | 13.41 |
Frequent physical distress appears somewhat imbalanced because most participants were in the “No” group, 86.59%, while only 13.41% were in the “Yes” group. Since the outcome has 13.41% in the Yes group, the model still has enough information for estimating frequent distress, although the outcome is less evenly distributed between groups. This imbalance should be noted, but the number of participants reporting frequent distress is still large enough to support logistic regression analysis.
mod_simple <- glm(
frequent_distress ~ gen_health,
data = brfss_analytic,
family = binomial
)
summary(mod_simple)##
## Call:
## glm(formula = frequent_distress ~ gen_health, family = binomial,
## data = brfss_analytic)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.34421 0.08725 -38.329 <2e-16 ***
## gen_healthGood 1.07595 0.10999 9.782 <2e-16 ***
## gen_healthFair/Poor 3.33700 0.10245 32.571 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6306.5 on 7999 degrees of freedom
## Residual deviance: 4741.8 on 7997 degrees of freedom
## AIC: 4747.8
##
## Number of Fisher Scoring iterations: 6
simple_or_table <- tidy(mod_simple, conf.int = TRUE, exponentiate = TRUE) |>
mutate(across(where(is.numeric), ~ round(.x, 3)))
simple_or_table |>
kable(caption = "Unadjusted Odds Ratios for Simple Logistic Regression") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 0.035 | 0.087 | -38.329 | 0 | 0.030 | 0.042 |
| gen_healthGood | 2.933 | 0.110 | 9.782 | 0 | 2.368 | 3.646 |
| gen_healthFair/Poor | 28.135 | 0.102 | 32.571 | 0 | 23.089 | 34.509 |
I chose general health because adults reporting worse general health would be expected to have higher odds of frequent physical distress. Compared with adults reporting excellent or very good general health, adults reporting fair or poor general health had 28.14 times the odds of frequent physical distress, 95% CI [23.09, 34.51]. This association was statistically significant at alpha = 0.05 because the p-value was <0.001 and the 95% confidence interval does not include 1.0. This suggests that poorer self-rated general health is strongly associated with greater odds of experiencing frequent physical distress.
# Use the same predictors as the final linear model.
final_logistic_formula <- update(formula(mod_final), frequent_distress ~ .)
mod_logistic <- glm(
final_logistic_formula,
data = brfss_analytic,
family = binomial
)
summary(mod_logistic)##
## Call:
## glm(formula = final_logistic_formula, family = binomial, data = brfss_analytic)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.38630 0.22369 -15.138 < 2e-16 ***
## gen_healthGood 0.78944 0.11382 6.936 4.04e-12 ***
## gen_healthFair/Poor 2.64555 0.11183 23.657 < 2e-16 ***
## menthlth_days 0.05319 0.00430 12.368 < 2e-16 ***
## exerciseYes -0.72914 0.08400 -8.680 < 2e-16 ***
## age_group35-49 0.20493 0.15495 1.323 0.186
## age_group50-64 0.80095 0.14125 5.670 1.43e-08 ***
## age_group65+ 0.78875 0.13972 5.645 1.65e-08 ***
## income$25K-$49K -0.06743 0.11131 -0.606 0.545
## income$50K-$99K -0.46666 0.11349 -4.112 3.93e-05 ***
## income$100K+ -0.51331 0.22749 -2.256 0.024 *
## educationHS/GED 0.20569 0.16343 1.259 0.208
## educationSome college 0.14432 0.16703 0.864 0.388
## educationCollege graduate 0.25825 0.17313 1.492 0.136
## depressionYes 0.07361 0.09680 0.761 0.447
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6306.5 on 7999 degrees of freedom
## Residual deviance: 4380.9 on 7985 degrees of freedom
## AIC: 4410.9
##
## Number of Fisher Scoring iterations: 6
logistic_or_table <- tidy(mod_logistic, conf.int = TRUE, exponentiate = TRUE) |>
mutate(across(where(is.numeric), ~ round(.x, 3)))
logistic_or_table |>
kable(caption = "Adjusted Odds Ratios for Multiple Logistic Regression") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 0.034 | 0.224 | -15.138 | 0.000 | 0.022 | 0.052 |
| gen_healthGood | 2.202 | 0.114 | 6.936 | 0.000 | 1.765 | 2.758 |
| gen_healthFair/Poor | 14.091 | 0.112 | 23.657 | 0.000 | 11.348 | 17.595 |
| menthlth_days | 1.055 | 0.004 | 12.368 | 0.000 | 1.046 | 1.064 |
| exerciseYes | 0.482 | 0.084 | -8.680 | 0.000 | 0.409 | 0.569 |
| age_group35-49 | 1.227 | 0.155 | 1.323 | 0.186 | 0.907 | 1.667 |
| age_group50-64 | 2.228 | 0.141 | 5.670 | 0.000 | 1.695 | 2.949 |
| age_group65+ | 2.201 | 0.140 | 5.645 | 0.000 | 1.680 | 2.906 |
| income$25K-$49K | 0.935 | 0.111 | -0.606 | 0.545 | 0.752 | 1.163 |
| income$50K-$99K | 0.627 | 0.113 | -4.112 | 0.000 | 0.502 | 0.784 |
| income$100K+ | 0.599 | 0.227 | -2.256 | 0.024 | 0.378 | 0.924 |
| educationHS/GED | 1.228 | 0.163 | 1.259 | 0.208 | 0.894 | 1.697 |
| educationSome college | 1.155 | 0.167 | 0.864 | 0.388 | 0.835 | 1.607 |
| educationCollege graduate | 1.295 | 0.173 | 1.492 | 0.136 | 0.925 | 1.823 |
| depressionYes | 1.076 | 0.097 | 0.761 | 0.447 | 0.889 | 1.300 |
Holding all other variables constant, each one-unit increase in mentally unhealthy days was associated with multiplying the odds of frequent physical distress by 1.06, meaning the odds increased by about 5.5% for each additional mentally unhealthy day. Holding all other variables constant, adults reporting fair or poor general health had 14.09 times the odds of frequent physical distress compared with adults reporting excellent or very good health. Adults who exercised had lower odds of frequent physical distress, with 0.48 times the odds compared with those who did not exercise, suggesting about a 52% reduction in odds. Because odds ratios above 1 indicate higher odds and odds ratios below 1 indicate lower odds, the findings suggest worse mental and general health are associated with greater odds of frequent physical distress, while exercise is associated with lower odds of frequent physical distress.
mod_reduced <- update(mod_logistic, . ~ . - income)
lrt_results <- anova(mod_reduced, mod_logistic, test = "Chisq")
lrt_results## Analysis of Deviance Table
##
## Model 1: frequent_distress ~ gen_health + menthlth_days + exercise + age_group +
## education + depression
## Model 2: frequent_distress ~ gen_health + menthlth_days + exercise + age_group +
## income + education + depression
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 7988 4404.9
## 2 7985 4380.9 3 24.026 2.467e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Null hypothesis: The income terms do not improve the fit of the multiple logistic regression model after accounting for the other predictors
Alternative hypothesis: At least one income term improves the fit of the multiple logistic regression model after accounting for the other predictors.
The likelihood ratio test statistic was 24.03 with 3 degrees of freedom and p = 2.47 × 10^-5. At alpha = 0.05, we reject the null hypothesis, meaning income does significantly improve model fit as a group. The results suggest that including income levels in the multiple logistic regression model provides important information for predicting frequent physical distress beyond the other predictors already in the model.
roc_obj <- roc(brfss_analytic$frequent_distress, fitted(mod_logistic))
plot(
roc_obj,
main = "ROC Curve: Frequent Physical Distress Model",
print.auc = TRUE
)## Area under the curve: 0.8557
## 95% CI: 0.8424-0.869 (DeLong)
The AUC is 0.856 with a 95% confidence interval of 0.8424 to 0.869. This means that the model has excellent discrimination between adults with and without frequent physical distress. This falls in the excellent discrimination range (0.8–0.9), suggesting the model does a strong job distinguishing individuals who report frequent physical distress from those who do not.
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: mod_logistic$y, fitted(mod_logistic)
## X-squared = 6.658, df = 8, p-value = 0.5739
Null hypothesis: The model fits the data well.
Alternative hypothesis: The model does not fit the data well.
The Hosmer-Lemeshow test statistic was 6.658 with 8 degrees of freedom and p = 0.5739. At alpha = 0.05, there is not evidence of poor model fit, so we fail to reject the null hypothesis that the model fits the data well. This suggests the logistic regression model shows good calibration between predicted and observed outcomes. The Hosmer-Lemeshow test evaluates calibration, while the ROC and AUC evaluates discrimination, so the two results provide different but complementary information about model performance. In this case, the non-significant Hosmer-Lemeshow test and the strong AUC both support that the model performs well.
Write 250-400 words in continuous prose. Do not use bullet points.
This analysis suggests that several individual and health-related factors are associated with physical health burden among U.S. adults. In both the linear and logistic regression models, general health status, mentally unhealthy days, and exercise were among the strongest predictors. Adults reporting fair or poor general health had substantially more physically unhealthy days and much higher odds of frequent physical distress, while exercise was associated with fewer unhealthy days and lower odds of distress. Mental health burden was also consistently associated with worse physical health outcomes in both models. Some predictors, such as certain education levels, appeared significant in the linear model but were not significant in the adjusted logistic model, showing that predictors may differ depending on how the outcome is defined. Because these data are cross-sectional, the results show associations rather than cause-and-effect relationships, since predictors and outcomes were measured at the same time. It is not possible to determine whether these factors caused greater physical health burden or were consequences of it. In addition, there may be unmeasured confounders not included in the model, such as chronic disease status, disability status, access to health care, pain conditions, or occupational factors.
The linear and logistic regression models provided related but different information about the research question. The linear model helped explain factors associated with the average number of physically unhealthy days, while the logistic model focused on the odds of meeting a threshold for frequent physical distress. Linear regression is useful when the outcome is continuous and the goal is estimating average change, while logistic regression is useful when the outcome is binary and the goal is classification or estimating odds. Model evaluation also differed across approaches. For linear regression, model fit was assessed using R-squared, adjusted R-squared, AIC, and BIC, while logistic regression relied more on odds ratios, likelihood ratio tests, ROC/AUC, and the Hosmer-Lemeshow test. Together the methods gave a fuller understanding of model performance.
I used AI assistance to help confirm that everything ran smoothly and to check some interpretation wording. I used it to troubleshoot issues such as file paths, confirm coding steps, and help write some interpretations correctly using the numerical results from my own output. I verified all code and interpretations by checking the assignment instructions, running the R Markdown file from top to bottom, reviewing model output directly, and making sure odds ratios were interpreted as odds rather than probabilities or risks. AI was used as a support tool for clarification, while all analyses and final interpretations were based on my own code and results.
kable() or tbl_summary()
formattingset.seed(1220) appears immediately before
slice_sample()EPI553_HW04_LastName_FirstName.Rmd