What individual, behavioral, and health factors are associated with physical health burden among U.S. adults? How do findings compare when the outcome is modeled as the number of physically unhealthy days (continuous) versus as an indicator of frequent physical distress (binary: 14 or more days in the past 30)?
Source: https://www.cdc.gov/brfss/annual_data/annual_2023.html
If there is already a file named “working_data.rds” in the data folder in the directory, then read the rds cache in. If there is not, read from the XPT. Write the rds cache only if it already exists, to save processing time.
filename <- "LLCP2023.XPT"
filename <- paste0("data/", filename)
df <- if (file.exists("data/working_data.rds")) {
print("Loading working data from cache: data/working_data.rds")
readRDS("data/working_data.rds")
} else {
print(paste0("Loading full XPT from ", filename))
read_xpt(filename)
}## [1] "Loading working data from cache: data/working_data.rds"
if (file.exists("data/working_data.rds")) {
print("nah it's cool don't rewrite")
} else {
saveRDS(df, "data/working_data.rds")
print("Writing rds")
}## [1] "nah it's cool don't rewrite"
There are currently 433323 observations in the dataset.
Outcome: physhlth_days
Predictors: - 2 Continuous: menthlth_days,
bmi - 1 Binary: depression - 2 Categorical:
income, age_group - Any other 3:
exercise (binary), sex (binary),
smoking (categorical)
I included the binary variable depression because I
believe that having a covariate for a depression diagnosis has the
potential to inform interpretation of the exposure, in terms of the
relationship between mental and physical health with more specifics. I
included income and age_group because physical
health is known to be associated negatively with both age and low
socioeconomic status, and these covariates capture these. I included
exercise and smoking to give insight into
behavioral factors with a known relationship to physical health, and
sex for further insight.
| Raw Variable Name | Description | Your Recoded Name |
|---|---|---|
| PHYSHLTH | Physically unhealthy days in past 30 | physhlth_days |
| MENTHLTH | Mentally unhealthy days in past 30 (outcome) | menthlth_days |
| _BMI5 | Body mass index × 100 (divide by 100) | bmi |
| ADDEPEV3 | Ever told depressive disorder (1=Yes, 2 = No) | depression |
| _INCOMG1 | Annual household income (7 levels) | income |
| _AGEG5YR | Age group in 5-year categories (13 levels) | age_group |
| EXERANY2 | Any exercise in past 30 days (1 = Yes, 2 = No) | exercise |
| SEXVAR | Sex (1 = Male, 2 = Female) | sex |
| _SMOKER3 | Smoking status (4 levels) | smoking |
df <- df |>
dplyr::select(MENTHLTH, PHYSHLTH, `_BMI5`, ADDEPEV3,`_INCOMG1`, `_AGEG5YR`, EXERANY2, SEXVAR, `_SMOKER3`) |>
mutate(
##Recode PHYSHLTH to physhlth_days
physhlth_days = case_when(
PHYSHLTH == 88 ~ 0,
PHYSHLTH == 99 ~ NA,
PHYSHLTH == 77 ~ NA,
TRUE ~ PHYSHLTH),
##Recode MENTHLTH to menthlth_days
menthlth_days = case_when(
MENTHLTH == 88 ~ 0,
MENTHLTH == 99 ~ NA,
MENTHLTH == 77 ~ NA,
TRUE ~ MENTHLTH),
##Recode __BMI5 to bmi - divide by 100
bmi = case_when(`_BMI5` == 9999 ~ NA,
TRUE ~ `_BMI5` /100),
## Recode ADDEPEV3 as factor
depression = factor(
ifelse(ADDEPEV3 %in% c(7, 9), NA, ADDEPEV3),
levels = c(1, 2),
labels = c("Yes", "No")
),
## Recode age group - Collapse
age_group = factor(
case_when(
`_AGEG5YR`== 14 ~ NA,
`_AGEG5YR` %in% c(1,2,3) ~ "18-34",
`_AGEG5YR` %in% c(4,5,6) ~ "35-49",
`_AGEG5YR` %in% c(7,8,9) ~ "50-64",
`_AGEG5YR` %in% c(10,11,12,13) ~ "65+",
),
levels = c(
"18-34",
"35-49",
"50-64",
"65+"
)
),
## Recode Income - Collapse
income = factor(
case_when(
`_INCOMG1`== 9 ~ NA,
`_INCOMG1` %in% c(1,2) ~ "Less than $25K",
`_INCOMG1` %in% c(3,4) ~ "$25K - $49K",
`_INCOMG1` %in% c(5,6) ~ "$50K - $99K",
`_INCOMG1` == 7 ~ "$100K+",
),
levels = c(
"Less than $25K",
"$25K - $49K",
"$50K - $99K",
"$100K+"
)
),
## Recode exerany2 as factor
exercise = factor(
ifelse(EXERANY2 %in% c(7, 9), NA, EXERANY2),
levels = c(1, 2),
labels = c("Yes", "No")
),
## Recode sex as factor
sex = factor(
ifelse(SEXVAR %in% c(7, 9), NA, SEXVAR),
levels = c(1, 2),
labels = c("Male", "Female")
),
## Recode Income - Collapse
smoking = factor(
case_when(
`_SMOKER3`== 9 ~ NA,
`_SMOKER3` %in% c(1,2) ~ "Current smoker",
`_SMOKER3` %in% c(3) ~ "Former smoker",
`_SMOKER3` %in% c(4) ~ "Never smoked"
),
levels = c(
"Never smoked",
"Former smoker",
"Current smoker"
)
)) |>
## Adjust reference categories - could've been done earlier if I'd read further in the instructionslol
mutate(
## Binary - no as reference
exercise = fct_relevel(exercise, "No"),
depression = fct_relevel(depression, "No"),
## Smoking - Never smoked
smoking = fct_relevel(smoking, "Never smoked"),
)
#QA - Print top 3 rows, output for new var
(df[1:3, (ncol(df)-9):ncol(df)]) %>%
kable(caption = "3 row sample of recoded variables"
)| _SMOKER3 | physhlth_days | menthlth_days | bmi | depression | age_group | income | exercise | sex | smoking |
|---|---|---|---|---|---|---|---|---|---|
| 4 | 0 | 0 | 30.47 | No | 65+ | NA | No | Female | Never smoked |
| 4 | 0 | 0 | 28.56 | Yes | 65+ | NA | Yes | Female | Never smoked |
| 3 | 6 | 2 | 22.31 | No | 65+ | Less than $25K | Yes | Female | Former smoker |
After recoding, create your analytic dataset by removing observations with missing values on all nine analysis variables, then drawing a random sample of 8,000 observations. Report the number and percentage of missing observations for menthlth_days and at least two other key variables, before removing any missing values
Using set.seed(1220) ensures that your results are reproducible and comparable across submissions (this is critical)
This helper function was written for my final project. Adjusting it so it looks at everything in an input list, rather than just by prefix.
missingness <- function(w_data, colsearch){
w_data %>%
summarize(
across(all_of(colsearch),
~sum(is.na(.x)), .names = "{.col}")) %>%
pivot_longer(cols = everything(),
names_to = "variable",
values_to = "nmissing") %>%
mutate(
run = unique(w_data$v_year),
n = nrow(w_data),
pct_missing = nmissing / n * 100
)
}Call helper function on the list
#list of columns to search in
search <- c("physhlth_days", "menthlth_days", "bmi", "depression", "income", "age_group", "sex", "exercise", "smoking")
#
a_df <- df
missing <- missingness(a_df, search)
missing %>% kable()| variable | nmissing | n | pct_missing |
|---|---|---|---|
| physhlth_days | 10785 | 433323 | 2.4889055 |
| menthlth_days | 8108 | 433323 | 1.8711215 |
| bmi | 40535 | 433323 | 9.3544538 |
| depression | 2587 | 433323 | 0.5970142 |
| income | 86623 | 433323 | 19.9903998 |
| age_group | 7779 | 433323 | 1.7951967 |
| sex | 0 | 433323 | 0.0000000 |
| exercise | 1251 | 433323 | 0.2886992 |
| smoking | 23062 | 433323 | 5.3221269 |
Call helper function on the list
## Drop if missing on any values in this list.
a_df <- a_df |>
drop_na(search)
n_end <- nrow(a_df)
set.seed(1220)
a_df <- a_df |> slice_sample(n=8000)The starting dataset had 433323 observations, and with exclusions of any variables for complete case analysis, the dataset had 310071 observations. A sample of 8000 observations will be used for the analysis.
table1 <- a_df %>%
tbl_summary(
include = search,
statistic = all_continuous() ~ "{mean} ({sd})",
label= list(physhlth_days ~ "Physically unhealthy days in past 30 days",
menthlth_days ~ "Mentally unhealthy days in past 30 days",
bmi ~ "Body mass index",
depression ~ "Ever told have depressive disorder",
income ~ "Anual household income",
age_group ~ "Age (years)",
exercise ~ "Exercise in past 50 days",
sex ~ "Assigned sex at birth",
smoking ~ "Smoking status"
))|>
modify_caption(caption= "Table 1. Characteristics of sample of 2023 BRFSS particpants (n=8000)")
as_flex_table(table1)Characteristic | N = 8,0001 |
|---|---|
Physically unhealthy days in past 30 days | 4 (9) |
Mentally unhealthy days in past 30 days | 5 (8) |
Body mass index | 29 (7) |
Ever told have depressive disorder | 1,677 (21%) |
Anual household income | |
Less than $25K | 1,134 (14%) |
$25K - $49K | 1,838 (23%) |
$50K - $99K | 4,398 (55%) |
$100K+ | 630 (7.9%) |
Age (years) | |
18-34 | 1,288 (16%) |
35-49 | 1,651 (21%) |
50-64 | 2,169 (27%) |
65+ | 2,892 (36%) |
Assigned sex at birth | |
Male | 4,050 (51%) |
Female | 3,950 (49%) |
Exercise in past 50 days | 6,173 (77%) |
Smoking status | |
Never smoked | 4,795 (60%) |
Former smoker | 2,280 (29%) |
Current smoker | 925 (12%) |
1Mean (SD); n (%) | |
mod_max <- lm(physhlth_days ~ menthlth_days + bmi + depression + income +
age_group + exercise + sex + smoking, data = a_df)
tidy(mod_max, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Maximum Model: All Candidate Predictors",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 3.5446 | 0.5684 | 6.2359 | 0.0000 | 2.4304 | 4.6589 |
| menthlth_days | 0.2761 | 0.0120 | 22.9780 | 0.0000 | 0.2526 | 0.2997 |
| bmi | 0.0743 | 0.0140 | 5.3174 | 0.0000 | 0.0469 | 0.1017 |
| depressionYes | 1.3497 | 0.2452 | 5.5038 | 0.0000 | 0.8690 | 1.8304 |
| income$25K - $49K | -1.9927 | 0.3023 | -6.5908 | 0.0000 | -2.5854 | -1.4000 |
| income$50K - $99K | -3.0688 | 0.2744 | -11.1847 | 0.0000 | -3.6066 | -2.5310 |
| income$100K+ | -3.3754 | 0.4112 | -8.2091 | 0.0000 | -4.1814 | -2.5694 |
| age_group35-49 | 0.9924 | 0.3003 | 3.3051 | 0.0010 | 0.4038 | 1.5810 |
| age_group50-64 | 2.4057 | 0.2863 | 8.4014 | 0.0000 | 1.8444 | 2.9671 |
| age_group65+ | 2.8829 | 0.2778 | 10.3757 | 0.0000 | 2.3382 | 3.4275 |
| exerciseYes | -3.2309 | 0.2236 | -14.4527 | 0.0000 | -3.6691 | -2.7927 |
| sexFemale | -0.1406 | 0.1814 | -0.7749 | 0.4384 | -0.4962 | 0.2151 |
| smokingFormer smoker | 0.7184 | 0.2076 | 3.4608 | 0.0005 | 0.3115 | 1.1252 |
| smokingCurrent smoker | 1.1817 | 0.2962 | 3.9896 | 0.0001 | 0.6011 | 1.7622 |
summary_modmax <- glance(mod_max)
glance(mod_max) |>
dplyr::select(r.squared, adj.r.squared, sigma, AIC, BIC, df.residual) |>
mutate(across(everything(), \(x) round(x, 3))) |>
kable(caption = "Maximum Model: Fit Statistics") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| r.squared | adj.r.squared | sigma | AIC | BIC | df.residual |
|---|---|---|---|---|---|
| 0.188 | 0.187 | 7.93 | 55849.88 | 55954.69 | 7986 |
The maximum model’s \(R^2\) is 0.188 and Adjusted \(R^2\) is 0.187, indicating that the model explains approximately 18.8% of the variance in physically unhealthy days. Because \(R^2\) will always increase with additional predictors added to the model, both AIC and BIC penalize complexity more than Adjusted \(R^2\), as a way to balance both goodness of fit and complexity.
## [1] 14
subset_metrics <- tibble(
p = 1:length(best_summary$adjr2),
`Adj. R²` = best_summary$adjr2,
BIC = best_summary$bic,
Cp = best_summary$cp
)
p1 <- ggplot(subset_metrics, aes(x = p, y = `Adj. R²`)) +
geom_line(linewidth = 1, color = "steelblue") +
geom_point(size = 3, color = "steelblue") +
geom_vline(xintercept = which.max(best_summary$adjr2),
linetype = "dashed", color = "tomato") +
labs(title = "Adjusted R² by Model Size", x = "Number of Variables", y = "Adjusted R²") +
theme_minimal(base_size = 12)
p2 <- ggplot(subset_metrics, aes(x = p, y = BIC)) +
geom_line(linewidth = 1, color = "steelblue") +
geom_point(size = 3, color = "steelblue") +
geom_vline(xintercept = which.min(best_summary$bic),
linetype = "dashed", color = "tomato") +
labs(title = "BIC by Model Size", x = "Number of Variables", y = "BIC") +
theme_minimal(base_size = 12)
gridExtra::grid.arrange(p1, p2, ncol = 2)Best Subsets: Adjusted R² and BIC by Model Size
The model that maximizes Adjusted \(R^2\) has 13 parameters, and the model that minimizes BIC has 13 parameters (including the intercept).
Note: In addition to including the intercept, the above counts reflects all parameters that are included in the model, including levels of factors that would normally be counted as a single predictor.
In this case, the two criteria do agree on the best model size - each output similar parameter numbers, at 12, and highlight the same model. BIC would typically favor a simpler model, as it penalizes additional parameters more than Adjusted \(R^2\).
# Automated backward elimination using AIC
mod_backward <- step(mod_max, direction = "backward", trace = 1)## Start: AIC=33144.86
## physhlth_days ~ menthlth_days + bmi + depression + income + age_group +
## exercise + sex + smoking
##
## Df Sum of Sq RSS AIC
## - sex 1 38 502263 33143
## <none> 502226 33145
## - smoking 2 1420 503646 33163
## - bmi 1 1778 504004 33171
## - depression 1 1905 504131 33173
## - income 3 8458 510684 33272
## - age_group 3 8554 510779 33274
## - exercise 1 13136 515362 33349
## - menthlth_days 1 33204 535430 33655
##
## Step: AIC=33143.46
## physhlth_days ~ menthlth_days + bmi + depression + income + age_group +
## exercise + smoking
##
## Df Sum of Sq RSS AIC
## <none> 502263 33143
## - smoking 2 1492 503755 33163
## - bmi 1 1798 504062 33170
## - depression 1 1868 504132 33171
## - income 3 8420 510684 33270
## - age_group 3 8516 510779 33272
## - exercise 1 13104 515367 33348
## - menthlth_days 1 33167 535430 33653
tidy(mod_backward, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Backward Elimination Result (AIC-based)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 3.4662 | 0.5593 | 6.1971 | 0.0000 | 2.3698 | 4.5626 |
| menthlth_days | 0.2758 | 0.0120 | 22.9655 | 0.0000 | 0.2523 | 0.2994 |
| bmi | 0.0747 | 0.0140 | 5.3476 | 0.0000 | 0.0473 | 0.1021 |
| depressionYes | 1.3278 | 0.2436 | 5.4509 | 0.0000 | 0.8503 | 1.8052 |
| income$25K - $49K | -1.9909 | 0.3023 | -6.5853 | 0.0000 | -2.5836 | -1.3983 |
| income$50K - $99K | -3.0593 | 0.2741 | -11.1615 | 0.0000 | -3.5965 | -2.5220 |
| income$100K+ | -3.3533 | 0.4102 | -8.1752 | 0.0000 | -4.1574 | -2.5493 |
| age_group35-49 | 0.9756 | 0.2995 | 3.2577 | 0.0011 | 0.3885 | 1.5626 |
| age_group50-64 | 2.3878 | 0.2854 | 8.3663 | 0.0000 | 1.8283 | 2.9473 |
| age_group65+ | 2.8659 | 0.2770 | 10.3471 | 0.0000 | 2.3230 | 3.4089 |
| exerciseYes | -3.2252 | 0.2234 | -14.4353 | 0.0000 | -3.6631 | -2.7872 |
| smokingFormer smoker | 0.7359 | 0.2063 | 3.5667 | 0.0004 | 0.3315 | 1.1404 |
| smokingCurrent smoker | 1.1982 | 0.2954 | 4.0559 | 0.0001 | 0.6191 | 1.7772 |
Backward elimination removed sex from the model and retained all other variables.
# Automated forward selection using AIC
mod_null <- lm(physhlth_days ~ 1, data = a_df)
mod_forward <- step(mod_null,
scope = list(lower = mod_null, upper = mod_max),
direction = "forward", trace = 1)## Start: AIC=34785.58
## physhlth_days ~ 1
##
## Df Sum of Sq RSS AIC
## + menthlth_days 1 62274 556283 33939
## + exercise 1 38144 580413 34278
## + income 3 30840 587717 34382
## + depression 1 25569 592988 34450
## + smoking 2 11940 606617 34634
## + bmi 1 8532 610025 34676
## + age_group 3 6839 611718 34703
## + sex 1 1128 617429 34773
## <none> 618557 34786
##
## Step: AIC=33938.69
## physhlth_days ~ menthlth_days
##
## Df Sum of Sq RSS AIC
## + exercise 1 27643.6 528639 33533
## + income 3 19576.0 536707 33658
## + age_group 3 15837.8 540445 33714
## + smoking 2 6562.1 549721 33848
## + bmi 1 5267.5 551016 33865
## + depression 1 3240.2 553043 33894
## + sex 1 168.4 556115 33938
## <none> 556283 33939
##
## Step: AIC=33532.92
## physhlth_days ~ menthlth_days + exercise
##
## Df Sum of Sq RSS AIC
## + income 3 11314.2 517325 33366
## + age_group 3 10584.9 518054 33377
## + smoking 2 4154.0 524485 33474
## + depression 1 2683.3 525956 33494
## + bmi 1 2249.6 526390 33501
## <none> 528639 33533
## + sex 1 22.7 528617 33535
##
## Step: AIC=33365.84
## physhlth_days ~ menthlth_days + exercise + income
##
## Df Sum of Sq RSS AIC
## + age_group 3 9534.8 507790 33223
## + smoking 2 2756.9 514568 33327
## + depression 1 2143.4 515182 33335
## + bmi 1 2123.3 515202 33335
## <none> 517325 33366
## + sex 1 2.0 517323 33368
##
## Step: AIC=33223.02
## physhlth_days ~ menthlth_days + exercise + income + age_group
##
## Df Sum of Sq RSS AIC
## + depression 1 2336.90 505454 33188
## + bmi 1 1944.56 505846 33194
## + smoking 2 1614.19 506176 33202
## <none> 507790 33223
## + sex 1 43.68 507747 33224
##
## Step: AIC=33188.12
## physhlth_days ~ menthlth_days + exercise + income + age_group +
## depression
##
## Df Sum of Sq RSS AIC
## + bmi 1 1698.57 503755 33163
## + smoking 2 1391.86 504062 33170
## + sex 1 139.23 505314 33188
## <none> 505454 33188
##
## Step: AIC=33163.19
## physhlth_days ~ menthlth_days + exercise + income + age_group +
## depression + bmi
##
## Df Sum of Sq RSS AIC
## + smoking 2 1491.58 502263 33143
## <none> 503755 33163
## + sex 1 108.85 503646 33163
##
## Step: AIC=33143.46
## physhlth_days ~ menthlth_days + exercise + income + age_group +
## depression + bmi + smoking
##
## Df Sum of Sq RSS AIC
## <none> 502263 33143
## + sex 1 37.758 502226 33145
tidy(mod_forward, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Forward Selection Result (AIC-based)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 3.4662 | 0.5593 | 6.1971 | 0.0000 | 2.3698 | 4.5626 |
| menthlth_days | 0.2758 | 0.0120 | 22.9655 | 0.0000 | 0.2523 | 0.2994 |
| exerciseYes | -3.2252 | 0.2234 | -14.4353 | 0.0000 | -3.6631 | -2.7872 |
| income$25K - $49K | -1.9909 | 0.3023 | -6.5853 | 0.0000 | -2.5836 | -1.3983 |
| income$50K - $99K | -3.0593 | 0.2741 | -11.1615 | 0.0000 | -3.5965 | -2.5220 |
| income$100K+ | -3.3533 | 0.4102 | -8.1752 | 0.0000 | -4.1574 | -2.5493 |
| age_group35-49 | 0.9756 | 0.2995 | 3.2577 | 0.0011 | 0.3885 | 1.5626 |
| age_group50-64 | 2.3878 | 0.2854 | 8.3663 | 0.0000 | 1.8283 | 2.9473 |
| age_group65+ | 2.8659 | 0.2770 | 10.3471 | 0.0000 | 2.3230 | 3.4089 |
| depressionYes | 1.3278 | 0.2436 | 5.4509 | 0.0000 | 0.8503 | 1.8052 |
| bmi | 0.0747 | 0.0140 | 5.3476 | 0.0000 | 0.0473 | 0.1021 |
| smokingFormer smoker | 0.7359 | 0.2063 | 3.5667 | 0.0004 | 0.3315 | 1.1404 |
| smokingCurrent smoker | 1.1982 | 0.2954 | 4.0559 | 0.0001 | 0.6191 | 1.7772 |
Forward Selection has added everything except for sex - this means it has converged on the same outcome as Backward Selection.
## Start: AIC=34785.58
## physhlth_days ~ 1
##
## Df Sum of Sq RSS AIC
## + menthlth_days 1 62274 556283 33939
## + exercise 1 38144 580413 34278
## + income 3 30840 587717 34382
## + depression 1 25569 592988 34450
## + smoking 2 11940 606617 34634
## + bmi 1 8532 610025 34676
## + age_group 3 6839 611718 34703
## + sex 1 1128 617429 34773
## <none> 618557 34786
##
## Step: AIC=33938.69
## physhlth_days ~ menthlth_days
##
## Df Sum of Sq RSS AIC
## + exercise 1 27644 528639 33533
## + income 3 19576 536707 33658
## + age_group 3 15838 540445 33714
## + smoking 2 6562 549721 33848
## + bmi 1 5267 551016 33865
## + depression 1 3240 553043 33894
## + sex 1 168 556115 33938
## <none> 556283 33939
## - menthlth_days 1 62274 618557 34786
##
## Step: AIC=33532.92
## physhlth_days ~ menthlth_days + exercise
##
## Df Sum of Sq RSS AIC
## + income 3 11314 517325 33366
## + age_group 3 10585 518054 33377
## + smoking 2 4154 524485 33474
## + depression 1 2683 525956 33494
## + bmi 1 2250 526390 33501
## <none> 528639 33533
## + sex 1 23 528617 33535
## - exercise 1 27644 556283 33939
## - menthlth_days 1 51773 580413 34278
##
## Step: AIC=33365.84
## physhlth_days ~ menthlth_days + exercise + income
##
## Df Sum of Sq RSS AIC
## + age_group 3 9535 507790 33223
## + smoking 2 2757 514568 33327
## + depression 1 2143 515182 33335
## + bmi 1 2123 515202 33335
## <none> 517325 33366
## + sex 1 2 517323 33368
## - income 3 11314 528639 33533
## - exercise 1 19382 536707 33658
## - menthlth_days 1 45040 562365 34032
##
## Step: AIC=33223.02
## physhlth_days ~ menthlth_days + exercise + income + age_group
##
## Df Sum of Sq RSS AIC
## + depression 1 2337 505454 33188
## + bmi 1 1945 505846 33194
## + smoking 2 1614 506176 33202
## <none> 507790 33223
## + sex 1 44 507747 33224
## - age_group 3 9535 517325 33366
## - income 3 10264 518054 33377
## - exercise 1 15763 523553 33466
## - menthlth_days 1 51103 558894 33988
##
## Step: AIC=33188.12
## physhlth_days ~ menthlth_days + exercise + income + age_group +
## depression
##
## Df Sum of Sq RSS AIC
## + bmi 1 1699 503755 33163
## + smoking 2 1392 504062 33170
## + sex 1 139 505314 33188
## <none> 505454 33188
## - depression 1 2337 507790 33223
## - income 3 9698 515151 33334
## - age_group 3 9728 515182 33335
## - exercise 1 15494 520947 33428
## - menthlth_days 1 34670 540124 33717
##
## Step: AIC=33163.19
## physhlth_days ~ menthlth_days + exercise + income + age_group +
## depression + bmi
##
## Df Sum of Sq RSS AIC
## + smoking 2 1492 502263 33143
## <none> 503755 33163
## + sex 1 109 503646 33163
## - bmi 1 1699 505454 33188
## - depression 1 2091 505846 33194
## - age_group 3 9562 513317 33308
## - income 3 9569 513324 33308
## - exercise 1 13747 517502 33377
## - menthlth_days 1 34268 538023 33688
##
## Step: AIC=33143.46
## physhlth_days ~ menthlth_days + exercise + income + age_group +
## depression + bmi + smoking
##
## Df Sum of Sq RSS AIC
## <none> 502263 33143
## + sex 1 38 502226 33145
## - smoking 2 1492 503755 33163
## - bmi 1 1798 504062 33170
## - depression 1 1868 504132 33171
## - income 3 8420 510684 33270
## - age_group 3 8516 510779 33272
## - exercise 1 13104 515367 33348
## - menthlth_days 1 33167 535430 33653
tidy(mod_stepwise, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Stepwise Selection Result (AIC-based)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 3.4662 | 0.5593 | 6.1971 | 0.0000 | 2.3698 | 4.5626 |
| menthlth_days | 0.2758 | 0.0120 | 22.9655 | 0.0000 | 0.2523 | 0.2994 |
| exerciseYes | -3.2252 | 0.2234 | -14.4353 | 0.0000 | -3.6631 | -2.7872 |
| income$25K - $49K | -1.9909 | 0.3023 | -6.5853 | 0.0000 | -2.5836 | -1.3983 |
| income$50K - $99K | -3.0593 | 0.2741 | -11.1615 | 0.0000 | -3.5965 | -2.5220 |
| income$100K+ | -3.3533 | 0.4102 | -8.1752 | 0.0000 | -4.1574 | -2.5493 |
| age_group35-49 | 0.9756 | 0.2995 | 3.2577 | 0.0011 | 0.3885 | 1.5626 |
| age_group50-64 | 2.3878 | 0.2854 | 8.3663 | 0.0000 | 1.8283 | 2.9473 |
| age_group65+ | 2.8659 | 0.2770 | 10.3471 | 0.0000 | 2.3230 | 3.4089 |
| depressionYes | 1.3278 | 0.2436 | 5.4509 | 0.0000 | 0.8503 | 1.8052 |
| bmi | 0.0747 | 0.0140 | 5.3476 | 0.0000 | 0.0473 | 0.1021 |
| smokingFormer smoker | 0.7359 | 0.2063 | 3.5667 | 0.0004 | 0.3315 | 1.1404 |
| smokingCurrent smoker | 1.1982 | 0.2954 | 4.0559 | 0.0001 | 0.6191 | 1.7772 |
Stepwise selection, which can both add and remove variables, converged on the same model that the forward selection and backward elimination methods of variable selection also landed on. These models all incorporate mentally unhealthy days, exercise, income, age, depression, BMI, and smoking as predictors. They likewise also all remove sex. It makes sense for these models to converge, as they are optimizing based off of the same metric - AIC.
The final model that I have chosen is the same model that all of the different stepwise selection methods have converged on - a model that includes exercise, income, age, depression, bmi, and smoking, and excludes sex. While these automated methods do not identify the best model, they removed sex for a reason, and that is because it explains very little of the variance in Physically Unhealthy Days, and is not statistically significant in the full model.
# Automated backward elimination using AIC
mod_final <- mod_stepwise
tidy(mod_final, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Final Model",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 3.4662 | 0.5593 | 6.1971 | 0.0000 | 2.3698 | 4.5626 |
| menthlth_days | 0.2758 | 0.0120 | 22.9655 | 0.0000 | 0.2523 | 0.2994 |
| exerciseYes | -3.2252 | 0.2234 | -14.4353 | 0.0000 | -3.6631 | -2.7872 |
| income$25K - $49K | -1.9909 | 0.3023 | -6.5853 | 0.0000 | -2.5836 | -1.3983 |
| income$50K - $99K | -3.0593 | 0.2741 | -11.1615 | 0.0000 | -3.5965 | -2.5220 |
| income$100K+ | -3.3533 | 0.4102 | -8.1752 | 0.0000 | -4.1574 | -2.5493 |
| age_group35-49 | 0.9756 | 0.2995 | 3.2577 | 0.0011 | 0.3885 | 1.5626 |
| age_group50-64 | 2.3878 | 0.2854 | 8.3663 | 0.0000 | 1.8283 | 2.9473 |
| age_group65+ | 2.8659 | 0.2770 | 10.3471 | 0.0000 | 2.3230 | 3.4089 |
| depressionYes | 1.3278 | 0.2436 | 5.4509 | 0.0000 | 0.8503 | 1.8052 |
| bmi | 0.0747 | 0.0140 | 5.3476 | 0.0000 | 0.0473 | 0.1021 |
| smokingFormer smoker | 0.7359 | 0.2063 | 3.5667 | 0.0004 | 0.3315 | 1.1404 |
| smokingCurrent smoker | 1.1982 | 0.2954 | 4.0559 | 0.0001 | 0.6191 | 1.7772 |
This is model is not great.
The residuals vs. fitted plot is used to look for both non-linearity and heteroscedasticity at once. The loess line on this diagnostic plot curves downwards in the center, and does not remain flat at zero - this indicates that the linear model does not fully capture the relationship - a transformation of the outcome or polynomial terms may be better suited for this model. Because the outcome is bounded between 0-30, it has a clear diagonal pattern, but it is difficult to identify heteroscedasticity from this plot.
The Q-Q plot indicates a severe departure from normality - the points display both an S-curve shape and a wide deviation above normal in the upper tail. This indicates both a right skew, and the potential need for a different model family entirely - it is closer in function to count data (although it is bounded), so poisson or negative binomial may be a better fit. This sees the same issue that the residuals vs. fitted plot highlighted - the outcome of physically unhealthy days is bounded by 0 and 30, and most people are reporting 0 physically unhealthy days. It also deviates from normal at the bottom of the plot, downwards, though less substantially than the upper extremity.
The scale-location plot is diagnostic of the same issues that the other plots have made clear, the bounded point data is not a good fit for a linear model. The loess line slopes upward from left to right, so the spread of residuals does increase systematically with fitted values, which may be an indication of heteroscedasticity. It does, however, also deviate substantially on the left, indicating that the bounding of the data causes an issue with fit on both sides, and may be more of a matter of poor model family fit or need for transformation than heteroscedasticity.
The leverage plot has called out a few high-leverage observations, but nothing has been placed outside of Cook’s distance, indicating that there are not obvious outliers with high leverage.
The major violation for this model is that the model is not the correct family for the outcome - because it is bounded and count-like, and all of the diagnostic plots reflect LINE assumption violations that are downstream of that issue. This results in violations within many of the LINE assumptions. It appears that the relationship may not be linear, per residuals vs. fitted, and with many loess-line deviations in the residual diagnostic plots. The Q-Q plot indicates that the residual errors do not follow a normal distribution, and all of the tests for homoscedasticity indicate that there may be deviation but it is less conclusive than the general poor model family fit.
a_df <- a_df |> mutate(
frequent_distress = factor(
ifelse(physhlth_days >= 14, 1, 0),
levels = c(0, 1),
labels = c( "No", "Yes")
))
## i know how to do this I've done it a million times i'm just tired
## this is the worst I've ever done it
a_df |> mutate(
total = n()) |>
group_by(frequent_distress, total) |>
summarize(count = n()) |>
mutate(pct = 100* (count/total))|>
kable()| frequent_distress | total | count | pct |
|---|---|---|---|
| No | 8000 | 6910 | 86.375 |
| Yes | 8000 | 1090 | 13.625 |
Most participants in this survey do not report frequent physical distress, with a prevalence of only about 13.6%.
##
## Call:
## glm(formula = frequent_distress ~ age_group, family = binomial,
## data = a_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.5649 0.1082 -23.707 < 2e-16 ***
## age_group35-49 0.4580 0.1341 3.416 0.000635 ***
## age_group50-64 0.9270 0.1228 7.547 4.46e-14 ***
## age_group65+ 0.9151 0.1194 7.662 1.82e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6369.6 on 7999 degrees of freedom
## Residual deviance: 6276.9 on 7996 degrees of freedom
## AIC: 6284.9
##
## Number of Fisher Scoring iterations: 5
I have picked age as the primary exposure for the unadjusted model, because aging brings physical challenges that may heighten vulnerability to injury and pain.
tidy(mod_simple, conf.int = TRUE, exponentiate = TRUE) |>
kable(digits = 3,
caption = "Simple Logistic Regression: FPD ~ Age Group") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 0.077 | 0.108 | -23.707 | 0.000 | 0.062 | 0.095 |
| age_group35-49 | 1.581 | 0.134 | 3.416 | 0.001 | 1.219 | 2.063 |
| age_group50-64 | 2.527 | 0.123 | 7.547 | 0.000 | 1.995 | 3.231 |
| age_group65+ | 2.497 | 0.119 | 7.662 | 0.000 | 1.986 | 3.173 |
Compared to the reference group of people in the age group 18-34, people in the age group 35-49 have 1.6 times the odds of frequent physical distress, 95% CI 1.2 to 2.1.
Compared to the reference group of 18-34, people in the age group of 50-64 have 2.5x the odds of reporting frequent physical distress, 95% CI 2.0 - 3.2.
Compared to the reference group of 18-34, people in the age group 65+ have 2.5 x the odds of reporting frequent physical distress, 95% CI 1.9 - 3.2.
The association between age group and frequent physical distress is statistically significant at alpha = 0.05, this is evident because for all groups in comparison to the reference group of 18-34, all 95% confidence intervals are greater than 1, not crossing 1.
mod_logistic <- glm(frequent_distress ~ menthlth_days + bmi + depression + income + age_group + exercise + smoking,
data = a_df,
family=binomial)
tidy(mod_logistic, conf.int = TRUE, exponentiate=TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 3))) |>
kable(
caption = "Maximum Model: All Candidate Predictors",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 0.058 | 0.219 | -12.988 | 0.000 | 0.038 | 0.089 |
| menthlth_days | 1.066 | 0.004 | 16.630 | 0.000 | 1.058 | 1.074 |
| bmi | 1.024 | 0.005 | 4.712 | 0.000 | 1.014 | 1.034 |
| depressionYes | 1.366 | 0.088 | 3.558 | 0.000 | 1.149 | 1.620 |
| income$25K - $49K | 0.667 | 0.100 | -4.057 | 0.000 | 0.549 | 0.811 |
| income$50K - $99K | 0.417 | 0.094 | -9.324 | 0.000 | 0.347 | 0.502 |
| income$100K+ | 0.317 | 0.199 | -5.781 | 0.000 | 0.212 | 0.462 |
| age_group35-49 | 1.639 | 0.144 | 3.421 | 0.001 | 1.238 | 2.183 |
| age_group50-64 | 2.692 | 0.134 | 7.380 | 0.000 | 2.078 | 3.518 |
| age_group65+ | 3.079 | 0.132 | 8.501 | 0.000 | 2.386 | 4.010 |
| exerciseYes | 0.442 | 0.077 | -10.659 | 0.000 | 0.380 | 0.514 |
| smokingFormer smoker | 1.309 | 0.082 | 3.300 | 0.001 | 1.115 | 1.536 |
| smokingCurrent smoker | 1.494 | 0.105 | 3.834 | 0.000 | 1.215 | 1.832 |
Holding all other variables constant, for each additional day reporting poor mental health, the odds of reporting frequent physical distress are multiplied 1.066x.
Holding all other variables constant, people reporting a previous diagnosis of a depressive disorder have 1.366 times the odds of reporting frequent physical distress.
Income is a protective factor - compared to people with an income below $25k, people reporting an income of $25k - 49k have 0.66 times the odds of reporting frequent physical distress, holding all other variables constant. Compared to the same group and holding all other variables constant, people with an income of $50k to $99k have 0.42 times the odds of experiencing frequent physical distress. And people reporting an income of $100k+ have 0.32 times the odds of reporting frequent physical distress compared to people with an income below $25k, holding all other variables constant.
I am testing the effect that the Smoking variable has on the model, first by fitting a reduced model that does not have this group, and then testing both the full and reduced model with the Likelihood Ratio Test.
mod_reduced <- glm(
frequent_distress ~ menthlth_days + bmi + depression + income + age_group + exercise,
data = a_df,
family = binomial
)
anova(mod_reduced, mod_logistic, test = "LRT") |>
kable(digits = 3,
caption = "LR Test: Does adding smoking status improve the model?") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
|---|---|---|---|---|
| 7989 | 5300.187 | NA | NA | NA |
| 7987 | 5280.669 | 2 | 19.518 | 0 |
The \(H_0\) states that the predictors that were dropped for the reduced model do not have a statistically singificant effect on the model, and the alternate hypothesis states that they do. The LR test gives the statistic on 2 degrees of freedom. The deviance is 19.5, and the P-value is below 0.001, meaning that at an alpha of 0.05, we can reject the null hypothesis. This indicates that smoking does significantly improve the model, so this variable will be retained.
roc_obj <- roc(
response = a_df$frequent_distress,
predictor = fitted(mod_logistic),
levels = c("No", "Yes"),
direction = "<"
)
auc_value <- auc(roc_obj)
ci.auc(roc_obj)## 95% CI: 0.7688-0.7991 (DeLong)
ggroc(roc_obj, color = "cyan", linewidth = 1.2) +
geom_abline(slope = 1, intercept = 1, linetype = "dashed", color = "coral") +
labs(title = "ROC Curve for Frequent Physical Distress Model",
subtitle = paste0("AUC = ", round(auc_value, 3)),
x = "Specificity", y = "Sensitivity") +
theme_classic()The model has an AUC of .784, and a CI of 0.7688 - 0.7991, indicating that it has acceptable discrimination, distinguishing between individuals with and without frequent physical distress well.
hl_test <- hoslem.test(
x = as.numeric(a_df$frequent_distress) - 1,
y = fitted(mod_logistic),
g = 10
)
hl_test##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: as.numeric(a_df$frequent_distress) - 1, fitted(mod_logistic)
## X-squared = 6.6248, df = 8, p-value = 0.5776
The null hypothesis states that the model fits the data well, and the alternate hypothesis is that it does. This test has 8 degrees of freedom, an X^2 of 6.624, and the p-value is 0.5776. At an alpha of 0.05, we do not have sufficient evidence to reject the null hypothesis, indicating that there is not evidence of poor model fit.
Both ROC and the Hosmer Lemeshow test assess validity, but Hosmer-Lemeshow is a measure of calibration, while and ROC is a measure of discrimination. Calibration is a measure of how well predicted probabilities match observed rates, while discrimination is the separation of events from non-events, these are complementary measures, and ideally we look to have both high, but models can have high calibration and low discrimination, and vice-versa. In this case, we find strong calibration and acceptable discrimination.
Our results indicate that physical health burden in the United States is associated with a variety of factors – physical, mental, behavioral, and socioeconomic factors all played a statistically significant role in both models. Both the linear and logistic models supported using the same predictors, with statistical and significance for all of those included in the final model. In both, income level, age group, and exercise status were all the strongest predictors of physical distress.
While this model has added response to the question, “Have you ever been told you have a depressive disorder?” as a predictor, most chronic conditions that may result in days of poor physical health have not been reflected in this analysis – these conditions are likely unmeasured confounders affecting both our outcome and our exposures. Additionally, while some elements of Socioeconomic Status are reflected with our measure of reported household income, this is a proxy for a large number of unmeasured effects, which are likely to further confound the outcome. Further, this analysis is entirely reflective of the results of a cross-sectional survey, where data is taken at a single point in time. This means that there is no way to infer causality, as there is no way to confirm whether an exposure occurred before an outcome.
Linear regression models can tell the magnitude of an effect of a predictor on the outcome, while logistic regressions measure the change in probability of an outcome. Linear regression is useful in being easy to interpret and convey the magnitude of effect, but by assuming that there is an effect uniformly across the predictor and outcome, it can add confusion that lowers interpretability. Logistic regression is more useful when you are looking for classification into categories. Linear regression model selection looks for residuals and numeric differences between observations and the prediction, whereas logistic selection frameworks look for accuracy and precision of the classification.
I did not use AI for this assignment.