#Research Question
What individual, behavioral, and health factors are associated with physical health burden among U.S. adults? How do findings compare when the outcome is modeled as the number of physically unhealthy days (continuous) versus as an indicator of frequent physical distress (binary: 14 or more days in the past 30)?
library(tidyverse)
library(haven)
library(broom)
library(kableExtra)
library(car)
library(gtsummary)
library(pROC)
library(ResourceSelection)
options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))Part 0. Data Preparation
brfss_full<-brfss_raw[,c("PHYSHLTH","MENTHLTH","_BMI5","SEXVAR","EXERANY2","_AGEG5YR","_INCOMG1","EDUCA","_SMOKER3")]Justify your choices briefly (2-3 sentences) in Part 0.
To examine individual, behavioral, and health factors associated with physical health burden in U.S. adults, I selected mentally unhealthy days, BMI, sex assigned at birth, physical activity, income, education, and smoking status as candidate predictors. These variables represent key biological, behavioral, and social determinants of health that are commonly associated with physical health outcomes in epidemiologic research. Including this range of factors allows for a more comprehensive assessment of potential contributors to physical health burden.
brfss_recode <-brfss_full|>
rename(
menthlth_days = MENTHLTH,
physhlth_days = PHYSHLTH,
bmi =`_BMI5`,
sex = SEXVAR,
exercise = EXERANY2,
age_group = `_AGEG5YR`,
income = `_INCOMG1`,
education = EDUCA,
smoking = `_SMOKER3`
) |>
mutate(
# outcome: mentally unhealthy days in past 30
menthlth_days = case_when(
menthlth_days == 88 ~ 0,
menthlth_days %in% c(77,99) ~ NA_real_,
menthlth_days >=1 & menthlth_days<= 30 ~as.numeric(menthlth_days),
TRUE ~NA_real_
),
# Physically unhealthy days in past 30
physhlth_days = case_when(
physhlth_days == 88 ~ 0,
physhlth_days %in% c(77,99) ~ NA_real_,
physhlth_days >=1 & physhlth_days<=30 ~as.numeric(physhlth_days),
TRUE ~NA_real_
),
# BMI
bmi = case_when(
bmi == 9999 ~NA_real_,
TRUE ~as.numeric(bmi)/100
),
# Sex
sex = factor(sex,
levels = c(1,2),
labels = c("Male","Female")),
# Any exercise in the past 30 days
exercise = case_when(
exercise %in% c(7,9) ~NA_real_,
TRUE ~ as.numeric(exercise)
),
exercise = factor(exercise ,
levels = c(2,1),
labels = c("No", "Yes")),
# Age group in 5-year categories (13 levels)
age_group = case_when(
age_group %in% 1:3 ~ "18-34",
age_group %in% 4:6 ~ "35-49",
age_group %in% 7:9 ~ "50-64",
age_group %in% 10:13 ~"65+",
TRUE ~ NA_character_
),
age_group = factor (age_group,
levels = c("18-34","35-49", "50-64", "65+")),
#Annual household Income (7 levels)
income = case_when(
income %in% c(1,2) ~ "Less than $25,000",
income %in% c(3,4) ~ "$25,000 ~ $49,000",
income %in% c(5,6) ~ "$50,000 ~ $99,000",
income == 7 ~ "$100,000+",
income == 9 ~ NA_character_,
TRUE ~ NA_character_
),
income = factor(income,
levels = c("Less than $25,000","$25,000 ~ $49,000", "$50,000 ~ $99,000", "$100,000+" )
),
# Highest level of education completed
education = case_when(
education %in% 1:3 ~ "Less than HS",
education == 4 ~ "HS/GED",
education == 5 ~ "Some College",
education == 6 ~ "College graduate",
education == 9 ~ NA_character_,
TRUE ~ NA_character_),
education = factor(education,
levels = c("Less than HS",
"HS/GED",
"Some College",
"College graduate")),
# Smoking status 4 levels
smoking = case_when(
smoking %in% c(1,2) ~ "Current",
smoking == 3 ~ "Former",
smoking == 4 ~ "Never",
smoking == 9 ~ NA_character_,
TRUE ~ NA_character_
),
smoking = factor(smoking,
levels = c("Never",
"Former",
"Current")
))vars <- c("physhlth_days","menthlth_days", "bmi", "sex", "exercise", "age_group", "income","education","smoking")
missing_summary <- data.frame(
Variable = vars,
Missing_Count = colSums(is.na(brfss_recode[vars])),
Missing_Percent = round(colSums(is.na(brfss_recode[vars])) / nrow(brfss_recode) * 100, 2)
)
print(missing_summary)## Variable Missing_Count Missing_Percent
## physhlth_days physhlth_days 10785 2.49
## menthlth_days menthlth_days 8108 1.87
## bmi bmi 40535 9.35
## sex sex 0 0.00
## exercise exercise 1251 0.29
## age_group age_group 7779 1.80
## income income 86623 19.99
## education education 2325 0.54
## smoking smoking 23062 5.32
set.seed(1220)
brfss_analytic <-brfss_recode |>
drop_na(menthlth_days,physhlth_days,bmi,sex,exercise, age_group,income,education,smoking) |>
slice_sample(n=8000)
tibble(Metric = c("Observations", "Variables"),
Value = c(nrow(brfss_analytic), ncol(brfss_analytic))) |>
kable(caption = "Analytic Dataset Dimensions") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Metric | Value |
|---|---|
| Observations | 8000 |
| Variables | 9 |
Sample Size: 8000
brfss_analytic |>
select( physhlth_days, menthlth_days, bmi, sex , exercise, age_group, income, education, smoking) |>
tbl_summary(
label= list(
physhlth_days ~ "Physically unhealthy days in past 30",
menthlth_days ~ "Mentally unhealthy days in past 30",
bmi ~ "Body mass index",
exercise ~ "Any exercise in past 30 days",
age_group ~ "Age group in 5-year categories (13 levels)",
income ~ "Annual household income (7 levels)",
education ~ "Highest level of education completed",
smoking ~ "Smoking status (4 levels)"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~1,
missing = "no"
) |>
add_n() |>
bold_labels() |>
italicize_levels() |>
modify_caption("**Table 1. Descriptive Statistics - BRFSS 2023 (n=8,000)**" ) |>
as_flex_table()Characteristic | N | N = 8,0001 |
|---|---|---|
Physically unhealthy days in past 30 | 8,000 | 4.4 (8.7) |
Mentally unhealthy days in past 30 | 8,000 | 4.3 (8.1) |
Body mass index | 8,000 | 28.7 (6.5) |
sex | 8,000 | |
Male | 3,936 (49%) | |
Female | 4,064 (51%) | |
Any exercise in past 30 days | 8,000 | 6,240 (78%) |
Age group in 5-year categories (13 levels) | 8,000 | |
18-34 | 1,277 (16%) | |
35-49 | 1,665 (21%) | |
50-64 | 2,055 (26%) | |
65+ | 3,003 (38%) | |
Annual household income (7 levels) | 8,000 | |
Less than $25,000 | 1,048 (13%) | |
$25,000 ~ $49,000 | 1,938 (24%) | |
$50,000 ~ $99,000 | 4,376 (55%) | |
$100,000+ | 638 (8.0%) | |
Highest level of education completed | 8,000 | |
Less than HS | 376 (4.7%) | |
HS/GED | 1,827 (23%) | |
Some College | 2,138 (27%) | |
College graduate | 3,659 (46%) | |
Smoking status (4 levels) | 8,000 | |
Never | 4,812 (60%) | |
Former | 2,262 (28%) | |
Current | 926 (12%) | |
1Mean (SD); n (%) | ||
Step 1: Fit the Maximum Model (5 points) • Fit a linear regression model containing all of your selected predictors 5 • Display the model summary • Report and interpret: R-squared, Adjusted R-squared, AIC (using AIC()), and BIC (using BIC())
mod_max <- lm(physhlth_days ~ menthlth_days + bmi + exercise + age_group +
income + education + smoking,
data = brfss_analytic)
tidy(mod_max, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Maximum Model: All Candidate Predictors",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 5.8000 | 0.6530 | 8.8821 | 0.0000 | 4.5200 | 7.0800 |
| menthlth_days | 0.3042 | 0.0114 | 26.7352 | 0.0000 | 0.2819 | 0.3265 |
| bmi | 0.0480 | 0.0140 | 3.4339 | 0.0006 | 0.0206 | 0.0754 |
| exerciseYes | -3.4505 | 0.2253 | -15.3127 | 0.0000 | -3.8922 | -3.0088 |
| age_group35-49 | 1.3477 | 0.2981 | 4.5215 | 0.0000 | 0.7634 | 1.9320 |
| age_group50-64 | 2.6452 | 0.2870 | 9.2171 | 0.0000 | 2.0826 | 3.2077 |
| age_group65+ | 2.8828 | 0.2740 | 10.5221 | 0.0000 | 2.3457 | 3.4198 |
| income$25,000 ~ $49,000 | -2.6241 | 0.3073 | -8.5398 | 0.0000 | -3.2264 | -2.0217 |
| income$50,000 ~ $99,000 | -3.2385 | 0.2952 | -10.9701 | 0.0000 | -3.8172 | -2.6598 |
| income$100,000+ | -3.8271 | 0.4303 | -8.8949 | 0.0000 | -4.6705 | -2.9837 |
| educationHS/GED | -1.1978 | 0.4508 | -2.6567 | 0.0079 | -2.0815 | -0.3140 |
| educationSome College | -0.7178 | 0.4525 | -1.5863 | 0.1127 | -1.6048 | 0.1692 |
| educationCollege graduate | -1.2319 | 0.4531 | -2.7188 | 0.0066 | -2.1201 | -0.3437 |
| smokingFormer | 0.6768 | 0.2052 | 3.2977 | 0.0010 | 0.2745 | 1.0791 |
| smokingCurrent | 1.1503 | 0.2964 | 3.8807 | 0.0001 | 0.5692 | 1.7313 |
glance(mod_max) |>
dplyr::select(r.squared, adj.r.squared, sigma, AIC, BIC, df.residual) |>
mutate(across(everything(), \(x) round(x, 3))) |>
kable(caption = "Maximum Model: Fit Statistics") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| r.squared | adj.r.squared | sigma | AIC | BIC | df.residual |
|---|---|---|---|---|---|
| 0.184 | 0.183 | 7.858 | 55705.45 | 55817.24 | 7985 |
Interpretation: The maximum model explains approximately 18.4% of the variance in physically unhealthy days ((R²= 0.184, Adjusted R² = 0.183). The strongest predictors are exercise (exerciser report 3.45 fewer physically unhealthy days ) and income (people who have income 100k and above report 3.82 fewer physically unhealthy days than people who have income less than $25,000). Age, education and smoking status have association with physicaly unhealthy days.
AIC=55705.45 AIC balances model fit against complexity using the penalty. BIC=55817.24 BIC applies a stricter complexity penalty than AIC.
Step 2: Best Subsets Regression (10 points) • Use leaps::regsubsets() to perform best subsets regression on your maximum model
• Set nvmax equal to the total number of parameters in your maximum model
• Note: regsubsets() treats each dummy variable from a factor as a separate predictor. This means some levels of a categorical variable could be included while others are excluded. This is a known limitation. You will compare these results with stepwise methods (Step 3), which handle factors as complete units. • Create a summary table or plot showing Adjusted R-squared and BIC across model sizes (number of predictors) • Identify the model size that maximizes Adjusted R-squared • Identify the model size that minimizes BIC • In 2-3 sentences, describe what you observe: Do the two criteria agree on the best model size? If not, which criterion favors a simpler model
library(leaps)
best_subsets <- regsubsets(
physhlth_days ~ menthlth_days+ bmi+ sex +
exercise + age_group + income + education + smoking,
data = brfss_analytic,
nvmax = 8, # maximum number of variables to consider
method = "exhaustive"
)
best_summary <- summary(best_subsets)subset_metrics <- tibble(
p = 1:length(best_summary$adjr2),
`Adj. R²` = best_summary$adjr2,
BIC = best_summary$bic,
Cp = best_summary$cp
)
p1 <- ggplot(subset_metrics, aes(x = p, y = `Adj. R²`)) +
geom_line(linewidth = 1, color = "steelblue") +
geom_point(size = 3, color = "steelblue") +
geom_vline(xintercept = which.max(best_summary$adjr2),
linetype = "dashed", color = "tomato") +
labs(title = "Adjusted R² by Model Size", x = "Number of Variables", y = "Adjusted R²") +
theme_minimal(base_size = 12)
p2 <- ggplot(subset_metrics, aes(x = p, y = BIC)) +
geom_line(linewidth = 1, color = "steelblue") +
geom_point(size = 3, color = "steelblue") +
geom_vline(xintercept = which.min(best_summary$bic),
linetype = "dashed", color = "tomato") +
labs(title = "BIC by Model Size", x = "Number of Variables", y = "BIC") +
theme_minimal(base_size = 12)
gridExtra::grid.arrange(p1, p2, ncol = 2)
Best Subsets: Adjusted R² and BIC by Model Size
## Best model by Adj. R²: 8 variables
## Best model by BIC: 8 variables
# Show which variables are in the BIC-best model
best_bic_idx <- which.min(best_summary$bic)
best_vars <- names(which(best_summary$which[best_bic_idx, -1]))
cat("\nVariables in BIC-best model:\n")##
## Variables in BIC-best model:
## menthlth_days
## exerciseYes
## age_group35-49
## age_group50-64
## age_group65+
## income$25,000 ~ $49,000
## income$50,000 ~ $99,000
## income$100,000+
Interpretation: In this case, both criteria agree. Adjusted R^2 is maximized and BIC is minimized at the same model size of 8 variable. The BIC-best model retains menthlth_days, exerciseYes, all three age group dummies and all three income dummies, while excluding bmi, sex, education and smoking. Those predictors did not contirbute sufficient explanatory power to offset the BIC complexity penalty.
Step 3: Stepwise Selection Methods (10 points) Using the step() function, perform the following three selection procedures. For each, report which variables are in the final model.
• Backward elimination: Start from the maximum model (direction = “backward”)
• Forward selection: Start from an intercept-only model with the maximum model as the upper scope (direction = “forward”)
• Stepwise selection: Start from the intercept-only model with both directions allowed (direction = “both”) In 3-5 sentences, compare the results:
• Do the three methods agree on the same final model?
• Which variables (if any) were excluded by all three methods?
• Which variables were retained by all three methods?
Backward elimination:
## Step 1: Maximum model
## Variables: menthlth_days, bmi, exerciseYes, age_group35-49, age_group50-64, age_group65+, income$25,000 ~ $49,000, income$50,000 ~ $99,000, income$100,000+, educationHS/GED, educationSome College, educationCollege graduate, smokingFormer, smokingCurrent
# Show p-values for the maximum model
pvals <- tidy(mod_back) |>
filter(term != "(Intercept)") |>
arrange(desc(p.value)) |>
dplyr::select(term, estimate, p.value) |>
mutate(across(where(is.numeric), \(x) round(x, 4)))
pvals |>
head(5) |>
kable(caption = "Maximum Model: Variables Sorted by p-value (Highest First)") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | estimate | p.value |
|---|---|---|
| educationSome College | -0.7178 | 0.1127 |
| educationHS/GED | -1.1978 | 0.0079 |
| educationCollege graduate | -1.2319 | 0.0066 |
| smokingFormer | 0.6768 | 0.0010 |
| bmi | 0.0480 | 0.0006 |
# Automated backward elimination using AIC
mod_backward <- step(mod_max, direction = "backward", trace = 1)## Start: AIC=33000.43
## physhlth_days ~ menthlth_days + bmi + exercise + age_group +
## income + education + smoking
##
## Df Sum of Sq RSS AIC
## <none> 493117 33000
## - education 3 745 493862 33007
## - bmi 1 728 493845 33010
## - smoking 2 1295 494412 33017
## - income 3 7909 501026 33122
## - age_group 3 8302 501418 33128
## - exercise 1 14480 507597 33230
## - menthlth_days 1 44141 537257 33684
tidy(mod_backward, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Backward Elimination Result (AIC-based)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 5.8000 | 0.6530 | 8.8821 | 0.0000 | 4.5200 | 7.0800 |
| menthlth_days | 0.3042 | 0.0114 | 26.7352 | 0.0000 | 0.2819 | 0.3265 |
| bmi | 0.0480 | 0.0140 | 3.4339 | 0.0006 | 0.0206 | 0.0754 |
| exerciseYes | -3.4505 | 0.2253 | -15.3127 | 0.0000 | -3.8922 | -3.0088 |
| age_group35-49 | 1.3477 | 0.2981 | 4.5215 | 0.0000 | 0.7634 | 1.9320 |
| age_group50-64 | 2.6452 | 0.2870 | 9.2171 | 0.0000 | 2.0826 | 3.2077 |
| age_group65+ | 2.8828 | 0.2740 | 10.5221 | 0.0000 | 2.3457 | 3.4198 |
| income$25,000 ~ $49,000 | -2.6241 | 0.3073 | -8.5398 | 0.0000 | -3.2264 | -2.0217 |
| income$50,000 ~ $99,000 | -3.2385 | 0.2952 | -10.9701 | 0.0000 | -3.8172 | -2.6598 |
| income$100,000+ | -3.8271 | 0.4303 | -8.8949 | 0.0000 | -4.6705 | -2.9837 |
| educationHS/GED | -1.1978 | 0.4508 | -2.6567 | 0.0079 | -2.0815 | -0.3140 |
| educationSome College | -0.7178 | 0.4525 | -1.5863 | 0.1127 | -1.6048 | 0.1692 |
| educationCollege graduate | -1.2319 | 0.4531 | -2.7188 | 0.0066 | -2.1201 | -0.3437 |
| smokingFormer | 0.6768 | 0.2052 | 3.2977 | 0.0010 | 0.2745 | 1.0791 |
| smokingCurrent | 1.1503 | 0.2964 | 3.8807 | 0.0001 | 0.5692 | 1.7313 |
Interpretation: The backward elimination is identical to the maximum model, retaining all 14 terms. Backward elimination(AIC based) kept all predictors including education, even thought education_some college has a non-significant p-value: 0.113.
• Forward selection: Start from an intercept-only model with the maximum model as the upper scope (direction = “forward”)
# Automated forward selection using AIC
mod_null <- lm(physhlth_days ~ 1, data = brfss_analytic)
mod_forward <- step(mod_null,
scope = list(lower = mod_null, upper = mod_max),
direction = "forward", trace = 1)## Start: AIC=34603.45
## physhlth_days ~ 1
##
## Df Sum of Sq RSS AIC
## + menthlth_days 1 59071 545563 33783
## + exercise 1 37706 566927 34090
## + income 3 32443 572191 34168
## + smoking 2 12018 592616 34447
## + education 3 11873 592761 34451
## + age_group 3 5515 599119 34536
## + bmi 1 4732 599901 34543
## <none> 604634 34603
##
## Step: AIC=33783.01
## physhlth_days ~ menthlth_days
##
## Df Sum of Sq RSS AIC
## + exercise 1 28346.8 517216 33358
## + income 3 19686.5 525876 33495
## + age_group 3 14455.2 531108 33574
## + education 3 8240.9 537322 33667
## + smoking 2 6070.4 539492 33697
## + bmi 1 2821.0 542742 33744
## <none> 545563 33783
##
## Step: AIC=33358.15
## physhlth_days ~ menthlth_days + exercise
##
## Df Sum of Sq RSS AIC
## + income 3 11764.1 505452 33180
## + age_group 3 9921.3 507295 33209
## + smoking 2 3952.9 513263 33301
## + education 3 3409.0 513807 33311
## + bmi 1 753.5 516463 33348
## <none> 517216 33358
##
## Step: AIC=33180.09
## physhlth_days ~ menthlth_days + exercise + income
##
## Df Sum of Sq RSS AIC
## + age_group 3 9400.2 496052 33036
## + smoking 2 2483.5 502968 33145
## + bmi 1 781.6 504670 33170
## + education 3 937.1 504515 33171
## <none> 505452 33180
##
## Step: AIC=33035.91
## physhlth_days ~ menthlth_days + exercise + income + age_group
##
## Df Sum of Sq RSS AIC
## + smoking 2 1427.13 494625 33017
## + education 3 1013.78 495038 33026
## + bmi 1 666.76 495385 33027
## <none> 496052 33036
##
## Step: AIC=33016.86
## physhlth_days ~ menthlth_days + exercise + income + age_group +
## smoking
##
## Df Sum of Sq RSS AIC
## + bmi 1 762.78 493862 33007
## + education 3 779.94 493845 33010
## <none> 494625 33017
##
## Step: AIC=33006.52
## physhlth_days ~ menthlth_days + exercise + income + age_group +
## smoking + bmi
##
## Df Sum of Sq RSS AIC
## + education 3 745.36 493117 33000
## <none> 493862 33007
##
## Step: AIC=33000.43
## physhlth_days ~ menthlth_days + exercise + income + age_group +
## smoking + bmi + education
tidy(mod_forward, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Forward Selection Result (AIC-based)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 5.8000 | 0.6530 | 8.8821 | 0.0000 | 4.5200 | 7.0800 |
| menthlth_days | 0.3042 | 0.0114 | 26.7352 | 0.0000 | 0.2819 | 0.3265 |
| exerciseYes | -3.4505 | 0.2253 | -15.3127 | 0.0000 | -3.8922 | -3.0088 |
| income$25,000 ~ $49,000 | -2.6241 | 0.3073 | -8.5398 | 0.0000 | -3.2264 | -2.0217 |
| income$50,000 ~ $99,000 | -3.2385 | 0.2952 | -10.9701 | 0.0000 | -3.8172 | -2.6598 |
| income$100,000+ | -3.8271 | 0.4303 | -8.8949 | 0.0000 | -4.6705 | -2.9837 |
| age_group35-49 | 1.3477 | 0.2981 | 4.5215 | 0.0000 | 0.7634 | 1.9320 |
| age_group50-64 | 2.6452 | 0.2870 | 9.2171 | 0.0000 | 2.0826 | 3.2077 |
| age_group65+ | 2.8828 | 0.2740 | 10.5221 | 0.0000 | 2.3457 | 3.4198 |
| smokingFormer | 0.6768 | 0.2052 | 3.2977 | 0.0010 | 0.2745 | 1.0791 |
| smokingCurrent | 1.1503 | 0.2964 | 3.8807 | 0.0001 | 0.5692 | 1.7313 |
| bmi | 0.0480 | 0.0140 | 3.4339 | 0.0006 | 0.0206 | 0.0754 |
| educationHS/GED | -1.1978 | 0.4508 | -2.6567 | 0.0079 | -2.0815 | -0.3140 |
| educationSome College | -0.7178 | 0.4525 | -1.5863 | 0.1127 | -1.6048 | 0.1692 |
| educationCollege graduate | -1.2319 | 0.4531 | -2.7188 | 0.0066 | -2.1201 | -0.3437 |
Interpretation: Forward selection arrived at the same final model as backward elimination, including the same 14 predictor terms. The order of entry is informative: menthlth_days entered first (the strongest predictor), followed by exercise, income, age_group, smoking, bmi and education This ordering reflects each variable’s marginal contribution given the variables already in the model. The convergence of forward and backward methods on the same model increases our confidence in this particular subset, though this convergence is not guaranteed in general.
• Stepwise selection: Start from the intercept-only model with both directions allowed (direction = “both”) In 3-5 sentences, compare the results:
mod_stepwise <- step(mod_null,
scope = list(lower = mod_null, upper = mod_max),
direction = "both", trace = 1)## Start: AIC=34603.45
## physhlth_days ~ 1
##
## Df Sum of Sq RSS AIC
## + menthlth_days 1 59071 545563 33783
## + exercise 1 37706 566927 34090
## + income 3 32443 572191 34168
## + smoking 2 12018 592616 34447
## + education 3 11873 592761 34451
## + age_group 3 5515 599119 34536
## + bmi 1 4732 599901 34543
## <none> 604634 34603
##
## Step: AIC=33783.01
## physhlth_days ~ menthlth_days
##
## Df Sum of Sq RSS AIC
## + exercise 1 28347 517216 33358
## + income 3 19686 525876 33495
## + age_group 3 14455 531108 33574
## + education 3 8241 537322 33667
## + smoking 2 6070 539492 33697
## + bmi 1 2821 542742 33744
## <none> 545563 33783
## - menthlth_days 1 59071 604634 34603
##
## Step: AIC=33358.15
## physhlth_days ~ menthlth_days + exercise
##
## Df Sum of Sq RSS AIC
## + income 3 11764 505452 33180
## + age_group 3 9921 507295 33209
## + smoking 2 3953 513263 33301
## + education 3 3409 513807 33311
## + bmi 1 754 516463 33348
## <none> 517216 33358
## - exercise 1 28347 545563 33783
## - menthlth_days 1 49711 566927 34090
##
## Step: AIC=33180.09
## physhlth_days ~ menthlth_days + exercise + income
##
## Df Sum of Sq RSS AIC
## + age_group 3 9400 496052 33036
## + smoking 2 2484 502968 33145
## + bmi 1 782 504670 33170
## + education 3 937 504515 33171
## <none> 505452 33180
## - income 3 11764 517216 33358
## - exercise 1 20424 525876 33495
## - menthlth_days 1 41532 546984 33810
##
## Step: AIC=33035.91
## physhlth_days ~ menthlth_days + exercise + income + age_group
##
## Df Sum of Sq RSS AIC
## + smoking 2 1427 494625 33017
## + education 3 1014 495038 33026
## + bmi 1 667 495385 33027
## <none> 496052 33036
## - age_group 3 9400 505452 33180
## - income 3 11243 507295 33209
## - exercise 1 17199 513251 33307
## - menthlth_days 1 47177 543229 33761
##
## Step: AIC=33016.86
## physhlth_days ~ menthlth_days + exercise + income + age_group +
## smoking
##
## Df Sum of Sq RSS AIC
## + bmi 1 763 493862 33007
## + education 3 780 493845 33010
## <none> 494625 33017
## - smoking 2 1427 496052 33036
## - age_group 3 8344 502968 33145
## - income 3 9897 504522 33169
## - exercise 1 16705 511329 33281
## - menthlth_days 1 44789 539414 33708
##
## Step: AIC=33006.52
## physhlth_days ~ menthlth_days + exercise + income + age_group +
## smoking + bmi
##
## Df Sum of Sq RSS AIC
## + education 3 745 493117 33000
## <none> 493862 33007
## - bmi 1 763 494625 33017
## - smoking 2 1523 495385 33027
## - age_group 3 8245 502107 33133
## - income 3 9781 503643 33157
## - exercise 1 15215 509077 33247
## - menthlth_days 1 44206 538068 33690
##
## Step: AIC=33000.43
## physhlth_days ~ menthlth_days + exercise + income + age_group +
## smoking + bmi + education
##
## Df Sum of Sq RSS AIC
## <none> 493117 33000
## - education 3 745 493862 33007
## - bmi 1 728 493845 33010
## - smoking 2 1295 494412 33017
## - income 3 7909 501026 33122
## - age_group 3 8302 501418 33128
## - exercise 1 14480 507597 33230
## - menthlth_days 1 44141 537257 33684
tidy(mod_stepwise, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Stepwise Selection Result (AIC-based)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 5.8000 | 0.6530 | 8.8821 | 0.0000 | 4.5200 | 7.0800 |
| menthlth_days | 0.3042 | 0.0114 | 26.7352 | 0.0000 | 0.2819 | 0.3265 |
| exerciseYes | -3.4505 | 0.2253 | -15.3127 | 0.0000 | -3.8922 | -3.0088 |
| income$25,000 ~ $49,000 | -2.6241 | 0.3073 | -8.5398 | 0.0000 | -3.2264 | -2.0217 |
| income$50,000 ~ $99,000 | -3.2385 | 0.2952 | -10.9701 | 0.0000 | -3.8172 | -2.6598 |
| income$100,000+ | -3.8271 | 0.4303 | -8.8949 | 0.0000 | -4.6705 | -2.9837 |
| age_group35-49 | 1.3477 | 0.2981 | 4.5215 | 0.0000 | 0.7634 | 1.9320 |
| age_group50-64 | 2.6452 | 0.2870 | 9.2171 | 0.0000 | 2.0826 | 3.2077 |
| age_group65+ | 2.8828 | 0.2740 | 10.5221 | 0.0000 | 2.3457 | 3.4198 |
| smokingFormer | 0.6768 | 0.2052 | 3.2977 | 0.0010 | 0.2745 | 1.0791 |
| smokingCurrent | 1.1503 | 0.2964 | 3.8807 | 0.0001 | 0.5692 | 1.7313 |
| bmi | 0.0480 | 0.0140 | 3.4339 | 0.0006 | 0.0206 | 0.0754 |
| educationHS/GED | -1.1978 | 0.4508 | -2.6567 | 0.0079 | -2.0815 | -0.3140 |
| educationSome College | -0.7178 | 0.4525 | -1.5863 | 0.1127 | -1.6048 | 0.1692 |
| educationCollege graduate | -1.2319 | 0.4531 | -2.7188 | 0.0066 | -2.1201 | -0.3437 |
Interpretation: The stepwise procedure, which allows both addition and removal at each step, also converges on the identical model. This three-way agreement (backward = forward = stepwise) is reassuring but should not be taken as proof that this is the “correct” model. All three methods optimize the same criterion (AIC) on the same data.
# Comparing all three models
method_comparison <- tribble(
~Method, ~`Variables selected`, ~`Adj. R²`, ~AIC, ~BIC,
"Maximum model",
length(coef(mod_max)) - 1,
round(glance(mod_max)$adj.r.squared, 4),
round(AIC(mod_max), 1),
round(BIC(mod_max), 1),
"Backward (AIC)",
length(coef(mod_backward)) - 1,
round(glance(mod_backward)$adj.r.squared, 4),
round(AIC(mod_backward), 1),
round(BIC(mod_backward), 1),
"Forward (AIC)",
length(coef(mod_forward)) - 1,
round(glance(mod_forward)$adj.r.squared, 4),
round(AIC(mod_forward), 1),
round(BIC(mod_forward), 1),
"Stepwise (AIC)",
length(coef(mod_stepwise)) - 1,
round(glance(mod_stepwise)$adj.r.squared, 4),
round(AIC(mod_stepwise), 1),
round(BIC(mod_stepwise), 1)
)
method_comparison |>
kable(caption = "Comparison of Variable Selection Methods") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Method | Variables selected | Adj. R² | AIC | BIC |
|---|---|---|---|---|
| Maximum model | 14 | 0.183 | 55705.4 | 55817.2 |
| Backward (AIC) | 14 | 0.183 | 55705.4 | 55817.2 |
| Forward (AIC) | 14 | 0.183 | 55705.4 | 55817.2 |
| Stepwise (AIC) | 14 | 0.183 | 55705.4 | 55817.2 |
Interpretation All three AIC-based stepwise methods backward elimination, forward selection, and stepwise selection agreed on the same final model, retaining all 14 predictor terms from the maximum model. No variables were excluded by any method, meaning every predictor (menthlth_days, bmi, exercise, age_group, income, education, and smoking) was contribute sufficiently to justify its inclusion under the AIC penalty criterion.
Step 4: Final Model Selection and Interpretation (5 points) • Choose your final model. State which method(s) guided your choice and why. • Display the coefficient table for your final model using tidy() with confidence intervals • Interpret the coefficients for at least three predictors in plain language, including units and “holding all other variables constant” language. Include at least one continuous and one categorical predictor.
tidy(mod_max, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Final Model: Coefficient table",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 5.8000 | 0.6530 | 8.8821 | 0.0000 | 4.5200 | 7.0800 |
| menthlth_days | 0.3042 | 0.0114 | 26.7352 | 0.0000 | 0.2819 | 0.3265 |
| bmi | 0.0480 | 0.0140 | 3.4339 | 0.0006 | 0.0206 | 0.0754 |
| exerciseYes | -3.4505 | 0.2253 | -15.3127 | 0.0000 | -3.8922 | -3.0088 |
| age_group35-49 | 1.3477 | 0.2981 | 4.5215 | 0.0000 | 0.7634 | 1.9320 |
| age_group50-64 | 2.6452 | 0.2870 | 9.2171 | 0.0000 | 2.0826 | 3.2077 |
| age_group65+ | 2.8828 | 0.2740 | 10.5221 | 0.0000 | 2.3457 | 3.4198 |
| income$25,000 ~ $49,000 | -2.6241 | 0.3073 | -8.5398 | 0.0000 | -3.2264 | -2.0217 |
| income$50,000 ~ $99,000 | -3.2385 | 0.2952 | -10.9701 | 0.0000 | -3.8172 | -2.6598 |
| income$100,000+ | -3.8271 | 0.4303 | -8.8949 | 0.0000 | -4.6705 | -2.9837 |
| educationHS/GED | -1.1978 | 0.4508 | -2.6567 | 0.0079 | -2.0815 | -0.3140 |
| educationSome College | -0.7178 | 0.4525 | -1.5863 | 0.1127 | -1.6048 | 0.1692 |
| educationCollege graduate | -1.2319 | 0.4531 | -2.7188 | 0.0066 | -2.1201 | -0.3437 |
| smokingFormer | 0.6768 | 0.2052 | 3.2977 | 0.0010 | 0.2745 | 1.0791 |
| smokingCurrent | 1.1503 | 0.2964 | 3.8807 | 0.0001 | 0.5692 | 1.7313 |
Interpretation: The maximum model is selected as the final model. All three stepwise methods agreed backward elimination, forward selection and stepwise selection all ended up with the same full model. When three different approaches starting from different points and working in different directions all have the same answer, that is a good sign the predictors is well suppoeted.
Menthlth_days: Holding all other variable constant, each additional mentally unhealthy days reported in the past 30 days is associated with 0.304 more poor physical health days (95%CI: 0.2819, 0.3265, p<0.001). This is the strongest estimated predictor in the model.
Bmi: Holding all other variables constant, each one-unit increase in BMI is associated with 0.0480 additional poor physical health days (95%CI: 0.0206, 0.0754 p<0.001). This effect is small but with evidence linking higher BMI to worse physical conditions.
Smoking: Holding all other variables constant, both former and current smokers report more poor physical health days compared to never smokers. Holding all other variable constant, former smoker reported 0.6768 more days of physical unhealthy days compared to never smoker (95%CI: 0.5692, 1.7313; p<0.001).
Step 5: LINE Assumption Check (5 points)
Using your final linear model, produce the four base R diagnostic plots: par(mfrow = c(2, 2)) plot(your_final_model) par(mfrow = c(1, 1))
Interpret each panel in 1-2 sentences: 1. Residuals vs. Fitted: Is there
evidence of non-linearity or non-constant variance? There is no clear
evidence of non-linearity, as the smoothed line stays relatively flat
across fitted values. However, the spread of residuals appears slightly
uneven.
The residuals do not perfectly follow a normal distribution. They align well in the middle range but drift away from the diagonal line in the upper tail, suggesting the distribution is somewhat skewed.
The spread of residuals is not fully constant across fitted values. The upward trend in the smoothed line suggests that variance increases as fitted values get larger, which is a sign of heteroscedasticity.
There are no highly influential observations in this model. No data points come close to exceeding Cook’s distance thresholds, so no individual observations are unduly driving the results.
Write a brief conclusion (2-3 sentences): Are the LINE assumptions reasonably satisfied? What violations, if any, do you observe?
The LINE assumptions are mostly met, with linearity holding reasonably well and no concerning influential observations. The two main violations are mild heteroscedasticity where variance grows slightly with larger fitted values and a departure from normality in the upper tail of the residuals. Given the large sample size (n = 8,000), these violations are unlikely to affect the reliability of the model’s estimates or conclusions.
You will now reframe the research question using a binary outcome. The CDC defines frequent physical distress as reporting 14 or more physically unhealthy days in the past 30 days. Step 1: Create the Binary Outcome (5 points) • Create a new variable frequent_distress: o 1 if physhlth_days >= 14 o 0 if physhlth_days < 14 • Store as a factor with levels “No” (reference) and “Yes” • Report the prevalence of frequent physical distress (count and percentage in each category) • In 1-2 sentences, comment on the balance of the outcome
brfss_new <-brfss_analytic|>
mutate(
frequent_distress = ifelse(physhlth_days >=14, 1, 0),
frequent_distress = factor(frequent_distress,
levels = c(0, 1),
labels = c("No", "Yes"))
)
brfss_new |>
count(frequent_distress) |>
mutate(percent = n/ sum(n) * 100) |>
mutate(percent = round(percent, 1)) |>
kable(
caption = "Prevalence of frequent physical distress",
col.names = c("Frequency Distress", "Count", "Percentage")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Frequency Distress | Count | Percentage |
|---|---|---|
| No | 6914 | 86.4 |
| Yes | 1086 | 13.6 |
Interpretation: 13.6% of people reported frequent physical distress while 86.4% did not report frequent physical distress. The outcome is not balanced with a large proportion of individuals in the “No” category.
Step 2: Simple (Unadjusted) Logistic Regression (10 points) • Choose one predictor from your final linear model that you expect to have a strong association with physical distress. State why you chose it. 6 • Fit a simple logistic regression: mod_simple <- glm(frequent_distress ~ your_predictor, data = brfss_analytic, family = binomial) • Display the model summary • Calculate and report the odds ratio and 95% confidence interval: exp(coef(mod_simple)) # Odds ratio exp(confint(mod_simple)) # 95% CI for OR • Interpret the odds ratio in plain language. For a continuous predictor: “For every one-unit increase in [predictor], the odds of frequent physical distress are multiplied by [OR], 95% CI [lower, upper].” For a categorical predictor: “Compared to [reference group], [comparison group] has [OR] times the odds of frequent physical distress, 95% CI [lower, upper].” • State whether the association is statistically significant at alpha = 0.05 and how you know (Wald test p-value or CI excluding 1.0)
mod_simple <- glm(frequent_distress ~ exercise, data = brfss_new,
family = binomial(link = "logit"))
tidy(mod_simple, conf.int = TRUE, exponentiate = FALSE) |>
kable(digits = 3, caption = "Simple Logistic Regression: Frequent physical distress ~ Exercise (Log-Odds Scale)") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | -0.933 | 0.053 | -17.613 | 0 | -1.037 | -0.830 |
| exerciseYes | -1.329 | 0.068 | -19.423 | 0 | -1.463 | -1.194 |
tidy(mod_simple, conf.int = TRUE, exponentiate = TRUE) |>
kable(digits = 3,
caption = "Simple Logistic Regression: frequent physical distress ~ Exercise (Odds Ratio Scale)") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 0.394 | 0.053 | -17.613 | 0 | 0.354 | 0.436 |
| exerciseYes | 0.265 | 0.068 | -19.423 | 0 | 0.232 | 0.303 |
mod_simple |>
tbl_regression(
exponentiate = TRUE,
label = list(exercise ~ "Any exercise in past 30 days")
) |>
bold_labels() |>
bold_p()| Characteristic | OR | 95% CI | p-value |
|---|---|---|---|
| Any exercise in past 30 days | |||
| No | — | — | |
| Yes | 0.26 | 0.23, 0.30 | <0.001 |
| Abbreviations: CI = Confidence Interval, OR = Odds Ratio | |||
Interpretation: Exercise was selected because it was the strongest predictor producing the largest coefficient -3.45 days among all predictors. It has a clear relationship with physical health in public health literature.
Log-odds scale: The coefficient for exerciseYes is -1.329 meaning exercise is associated with lower log-odds of frequent physical distress. Odds ratio for exercise Yes is 0.265 (95%CI: 0.23, 0.3; p<0.001). Compared to people who did not exercise in the past 30 days, those who did exercise have 0.265 times the odds of experiencing frequent physical distress. In other words, exerciser have roughly 73% lower odds of frequent physical distress than non-exercisers, holding other variables constant. The aossocicaiton is strong and statistically significant.
Step 3: Multiple Logistic Regression (10 points) • Fit a multiple logistic regression using the same predictors as your final linear model from Part 1: mod_logistic <- glm(frequent_distress ~ [same predictors as final linear model], data = brfss_analytic, family = binomial) • Display the model summary • Calculate and display a table of adjusted odds ratios with 95% confidence intervals for all predictors: tidy(mod_logistic, conf.int = TRUE, exponentiate = TRUE) |> kable(digits = 3) • Interpret the adjusted odds ratios for at least three predictors in plain language, using “holding all other variables constant” language. Include at least one continuous and one categorical predictor.
mod_logistic <-glm(
frequent_distress ~ menthlth_days + bmi + sex + exercise + age_group + income + education + smoking,
data = brfss_new,
family= binomial)
summary(mod_logistic)##
## Call:
## glm(formula = frequent_distress ~ menthlth_days + bmi + sex +
## exercise + age_group + income + education + smoking, family = binomial,
## data = brfss_new)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.255697 0.243962 -9.246 < 2e-16 ***
## menthlth_days 0.068198 0.003664 18.611 < 2e-16 ***
## bmi 0.014348 0.005155 2.783 0.005380 **
## sexFemale 0.014039 0.073171 0.192 0.847851
## exerciseYes -0.885038 0.076998 -11.494 < 2e-16 ***
## age_group35-49 0.762689 0.147703 5.164 2.42e-07 ***
## age_group50-64 1.213299 0.139855 8.675 < 2e-16 ***
## age_group65+ 1.291576 0.137163 9.416 < 2e-16 ***
## income$25,000 ~ $49,000 -0.589866 0.101555 -5.808 6.31e-09 ***
## income$50,000 ~ $99,000 -0.853635 0.100733 -8.474 < 2e-16 ***
## income$100,000+ -1.498863 0.230273 -6.509 7.56e-11 ***
## educationHS/GED -0.336043 0.149017 -2.255 0.024129 *
## educationSome College -0.148256 0.149889 -0.989 0.322613
## educationCollege graduate -0.494821 0.155068 -3.191 0.001418 **
## smokingFormer 0.296136 0.081651 3.627 0.000287 ***
## smokingCurrent 0.358801 0.105610 3.397 0.000680 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6354.8 on 7999 degrees of freedom
## Residual deviance: 5280.5 on 7984 degrees of freedom
## AIC: 5312.5
##
## Number of Fisher Scoring iterations: 6
tidy(mod_logistic, conf.int = TRUE, exponentiate = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 3))) |>
kable(
caption = "Adjusted Odds Ratios for Frequent Physical Distress",
digits = 3
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 0.105 | 0.244 | -9.246 | 0.000 | 0.065 | 0.168 |
| menthlth_days | 1.071 | 0.004 | 18.611 | 0.000 | 1.063 | 1.078 |
| bmi | 1.014 | 0.005 | 2.783 | 0.005 | 1.004 | 1.025 |
| sexFemale | 1.014 | 0.073 | 0.192 | 0.848 | 0.879 | 1.171 |
| exerciseYes | 0.413 | 0.077 | -11.494 | 0.000 | 0.355 | 0.480 |
| age_group35-49 | 2.144 | 0.148 | 5.164 | 0.000 | 1.610 | 2.875 |
| age_group50-64 | 3.365 | 0.140 | 8.675 | 0.000 | 2.569 | 4.448 |
| age_group65+ | 3.639 | 0.137 | 9.416 | 0.000 | 2.795 | 4.787 |
| income$25,000 ~ $49,000 | 0.554 | 0.102 | -5.808 | 0.000 | 0.454 | 0.677 |
| income$50,000 ~ $99,000 | 0.426 | 0.101 | -8.474 | 0.000 | 0.350 | 0.519 |
| income$100,000+ | 0.223 | 0.230 | -6.509 | 0.000 | 0.139 | 0.345 |
| educationHS/GED | 0.715 | 0.149 | -2.255 | 0.024 | 0.535 | 0.960 |
| educationSome College | 0.862 | 0.150 | -0.989 | 0.323 | 0.645 | 1.160 |
| educationCollege graduate | 0.610 | 0.155 | -3.191 | 0.001 | 0.451 | 0.829 |
| smokingFormer | 1.345 | 0.082 | 3.627 | 0.000 | 1.145 | 1.577 |
| smokingCurrent | 1.432 | 0.106 | 3.397 | 0.001 | 1.162 | 1.758 |
Interpretation: menthlth_days: holding all other variables constant, for every one additional mentally unhealthy days, the odds of frequenty physical distress are 1.071 times higher (95%CI: 1.063, 1.078) statistically significant.
Sexfemale: After adjusted for all other variables, there is no meaningful differences in the odds of frequenty physical distress between males and female. (95%CI: 0.879, 1.171) The CI crosses 1.0, therefore, the assocaition is not statistically significant.
Exercise Yes: holding all other variables constant, compared to people who do not exercise, people who did exercise have 0.413 times the odds of frequent physical unhealthy days about 59% lower odds. The association is statistically significant (95%CI:0.355, 0.480; p<0.001 ).
Step 4: Likelihood Ratio Test (5 points) • Choose a group of related predictors to test collectively (for example, all income levels, all education levels, or all smoking levels) • Fit a reduced model that excludes this group • Conduct a likelihood ratio test comparing the reduced and full models: anova(mod_reduced, mod_logistic, test = “Chisq”) 7 • State the null and alternative hypotheses • Report the test statistic (deviance), degrees of freedom, and p-value • Conclude at alpha = 0.05: Does this group of predictors significantly improve the model?
mod_reduced <- glm(
frequent_distress ~ menthlth_days + bmi + age_group + income + sex + smoking + exercise,
data = brfss_new,
family = binomial
)
anova(mod_reduced, mod_logistic, test = "LRT") |>
kable(digits = 3,
caption = "LR Test: Does removing education improve the model? ") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
|---|---|---|---|---|
| 7987 | 5299.505 | NA | NA | NA |
| 7984 | 5280.523 | 3 | 18.982 | 0 |
Interpretaiton: Null Hyptohesis: Education level is not assocaited with frequent physical distress ( education coefficient = 0 ).
Alternative Hypothesis: Education level is associated with frequent physical distress ( at least one education coefficients ≠ 0 ).
Chi-square(deviance) = 18.982 df = 3 p-value <0.001 We can reject the null hypothesis since p-value <0.001. Education level significantly improve the model.
Step 5: ROC Curve and AUC (5 points) • Use the pROC package to generate a ROC curve for your multiple logistic regression
roc_obj <- roc(
response = brfss_new$frequent_distress,
predictor = fitted(mod_logistic),
levels = c("No", "Yes"),
direction = "<"
)
auc_value <- auc(roc_obj)
ggroc(roc_obj, color = "steelblue", linewidth = 1.2) +
geom_abline(slope = 1, intercept = 1, linetype = "dashed", color = "red") +
labs(title = "ROC Curve for Frequent Physical Distress Model",
subtitle = paste0("AUC = ", round(auc_value, 3)),
x = "Specificity", y = "Sensitivity") +
theme_minimal()## Area under the curve: 0.7774
## 95% CI: 0.7619-0.7928 (DeLong)
• Report the AUC with 95% confidence interval: Area under the curve: 0.7774 95% CI: 0.7619-0.7928 (DeLong)
An AUC is approximately 0.7774 indicates acceptable to excellent discrimination, meaning the model can distinguish between individuals with and without frequent physical distress reasonably well.
Step 6: Hosmer-Lemeshow Goodness-of-Fit Test (5 points) • Conduct the Hosmer-Lemeshow test using the ResourceSelection package: hoslem.test(mod_logistic$y, fitted(mod_logistic), g = 10) • State the null hypothesis (the model fits the data well) and alternative hypothesis (the model does not fit the data well) • Report the test statistic, degrees of freedom, and p-value • Interpret: At alpha = 0.05, is there evidence of poor model fit? • In 1-2 sentences, discuss how the Hosmer-Lemeshow result complements the ROC/AUC finding
hl_test <- hoslem.test(
x = as.numeric(brfss_new$frequent_distress) - 1,
y = fitted(mod_logistic),
g = 10
)
hl_test##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: as.numeric(brfss_new$frequent_distress) - 1, fitted(mod_logistic)
## X-squared = 11.097, df = 8, p-value = 0.1963
Null Hypothesis: the model fits the data well. Alternative Hypothesis: The model does not fit the data well. Chi-squared test = 11.097 Df=8 p-value = 0.1963 Since p-value > 0.05, we failed to reject the null hypothesis. There is no statistically significant evidence of poor model fit. The ROC/AUC AUC=0.737 told us the model does an acceptable job sperating people with and without frequent physical distress. The Hosmer Lemeshow test further confirmed that the model’s predicted probabilities are also well calibrated to the actual observed rates.
Part 3: Reflection (15 points) Write 250-400 words in continuous prose (no bullet points). A. Statistical Insights (5 points): What do your results suggest about the factors associated with physical health burden among U.S. adults? Which predictors were the strongest in both the linear and logistic models? Were any predictors significant in one model but not the other? What are the key limitations of using cross-sectional survey data for this type of analysis? Name at least two specific potential confounders not measured in your model.
The results suggest that physical health burden among U.S. adults is shaped by behavioral, socioeconomic, and demographic factors. In both the linear and logistic models, exercise and mental health days are the strongest predictors. People who exercised reported roughly 3.45 fewer poor physical health days in the linear model and had about 59% lower odds of frequent physical distress in the logistic model. Mental health days showed a consistent positive association across both frameworks, with each additional poor mental health day linked to more poor physical health days and higher odds of frequent distress. Income and age group were also strong and consistent across both models, showing clear gradients in the expected direction.The key limitations of using cross-sectional survey data is lacking of causality.
B. Linear versus Logistic Regression (5 points): Compare what the linear regression model tells you versus what the logistic regression model tells you about the same research question. What information does each approach provide that the other does not? In what situations would you prefer one approach over the other? How do the model selection criteria differ between the two frameworks (for example, R-squared versus AUC)?
Linear regression estimated how many additional poor physical health days were associated with each predictor, making results easy to interpret in days. Logistic regression instead examined whether someone had 14 or more poor physical health days and reported associations as odds ratios. Linear regression captures the full range of physical health burden, while logistic regression identifies factors linked to a high-burden group that may be more relevant for screening and intervention. Linear regression is preferred when the outcome is continuous, while logistic regression is more useful when a meaningful cutoff is important. Model evaluation also differs: linear regression uses R² and Adjusted R² to assess explained variance, while logistic regression uses AUC to measure discrimination and the Hosmer-Lemeshow test to assess calibration. In this analysis, the logistic model showed acceptable discrimination (AUC = 0.777), and the linear model explained a modest amount of variation (R² = 0.184), indicating both models identified meaningful associations.
C. AI Transparency (5 points): Describe any parts of this assignment where you sought AI assistance. Include the specific prompts you used and how you verified the correctness of the output. If you did not use AI, state this explicitly
I used Claude as an AI assistant for debugging during the data preparation and recoding section, where I encountered syntax errors. Claude helped me identify where these errors occurred and explained why the corrected syntax was right, which helped me learn from the mistakes rather than just fixing them blindly. Prompts “debugging the code”.