Submission: Knit this file to HTML, publish to RPubs with the title
epi553_hw03_lastname_firstname, and paste the RPubs URL in the Brightspace submission comments. Also upload this.Rmdfile to Brightspace.
# Load required packages
library(tidyverse)
library(haven)
library(broom)
library(kableExtra)
library(car)
library(gtsummary)
library(leaps)
library(pROC)
library(ResourceSelection)## [1] 433323 350
## [1] 433323
## [1] 350
## The raw BRFSS dataset contains 433323 rows and 350 columns.
# Select outcome + chosen predictors
brfss_work <- brfss_raw %>%
select(
PHYSHLTH,
MENTHLTH,
`_BMI5`,
SEXVAR,
EXERANY2,
ADDEPEV3,
`_AGEG5YR`,
`_INCOMG1`,
EDUCA,
`_SMOKER3`
)
# Recode variables
brfss_clean <- brfss_work %>%
mutate(
physhlth_days = case_when(
PHYSHLTH == 88 ~ 0,
PHYSHLTH %in% c(77, 99) ~ NA_real_,
PHYSHLTH >= 1 & PHYSHLTH <= 30 ~ as.numeric(PHYSHLTH),
TRUE ~ NA_real_
),
menthlth_days = case_when(
MENTHLTH == 88 ~ 0,
MENTHLTH %in% c(77, 99) ~ NA_real_,
MENTHLTH >= 1 & MENTHLTH <= 30 ~ as.numeric(MENTHLTH),
TRUE ~ NA_real_
),
bmi = case_when(
`_BMI5` == 9999 ~ NA_real_,
TRUE ~ as.numeric(`_BMI5`) / 100
),
sex = case_when(
SEXVAR == 1 ~ "Male",
SEXVAR == 2 ~ "Female",
TRUE ~ NA_character_
),
exercise = case_when(
EXERANY2 == 1 ~ "Yes",
EXERANY2 == 2 ~ "No",
EXERANY2 %in% c(7, 9) ~ NA_character_,
TRUE ~ NA_character_
),
depression = case_when(
ADDEPEV3 == 1 ~ "Yes",
ADDEPEV3 == 2 ~ "No",
ADDEPEV3 %in% c(7, 9) ~ NA_character_,
TRUE ~ NA_character_
),
age_group = case_when(
`_AGEG5YR` %in% 1:3 ~ "18-34",
`_AGEG5YR` %in% 4:6 ~ "35-49",
`_AGEG5YR` %in% 7:9 ~ "50-64",
`_AGEG5YR` %in% 10:13 ~ "65+",
`_AGEG5YR` == 14 ~ NA_character_,
TRUE ~ NA_character_
),
income = case_when(
`_INCOMG1` %in% 1:2 ~ "Less than $25K",
`_INCOMG1` %in% 3:4 ~ "$25K-$49K",
`_INCOMG1` %in% 5:6 ~ "$50K-$99K",
`_INCOMG1` == 7 ~ "$100K+",
`_INCOMG1` == 9 ~ NA_character_,
TRUE ~ NA_character_
),
education = case_when(
EDUCA %in% 1:3 ~ "Less than HS",
EDUCA == 4 ~ "HS/GED",
EDUCA == 5 ~ "Some college",
EDUCA == 6 ~ "College graduate",
EDUCA == 9 ~ NA_character_,
TRUE ~ NA_character_
),
smoking = case_when(
`_SMOKER3` %in% 1:2 ~ "Current",
`_SMOKER3` == 3 ~ "Former",
`_SMOKER3` == 4 ~ "Never",
`_SMOKER3` == 9 ~ NA_character_,
TRUE ~ NA_character_
)
) %>%
mutate(
sex = relevel(factor(sex), ref = "Male"),
exercise = relevel(factor(exercise), ref = "No"),
depression = relevel(factor(depression), ref = "No"),
age_group = relevel(factor(age_group), ref = "18-34"),
income = relevel(factor(income), ref = "Less than $25K"),
education = relevel(factor(education), ref = "Less than HS"),
smoking = relevel(factor(smoking), ref = "Never")
)
# Keep only cleaned variables for the analytic dataset
brfss_clean_final <- brfss_clean %>%
select(
physhlth_days,
menthlth_days,
bmi,
sex,
exercise,
depression,
age_group,
income,
education,
smoking
)
# Missingness before removing missing data
missing_summary <- tibble(
variable = c("physhlth_days", "menthlth_days", "bmi"),
missing_n = c(
sum(is.na(brfss_clean_final$physhlth_days)),
sum(is.na(brfss_clean_final$menthlth_days)),
sum(is.na(brfss_clean_final$bmi))
)
) %>%
mutate(
missing_pct = 100 * missing_n / nrow(brfss_clean_final)
)
missing_summary## # A tibble: 3 × 3
## variable missing_n missing_pct
## <chr> <int> <dbl>
## 1 physhlth_days 10785 2.49
## 2 menthlth_days 8108 1.87
## 3 bmi 40535 9.35
# Create analytic dataset
set.seed(1220)
brfss_analytic <- brfss_clean_final %>%
drop_na() %>%
slice_sample(n = 8000)
# Report final analytic sample size
nrow(brfss_analytic)## [1] 8000
# Descriptive summary table
tbl_summary(
brfss_analytic,
type = list(
sex ~ "categorical",
exercise ~ "categorical",
depression ~ "categorical",
age_group ~ "categorical",
income ~ "categorical",
education ~ "categorical",
smoking ~ "categorical"
)
)| Characteristic | N = 8,0001 |
|---|---|
| physhlth_days | 0 (0, 4) |
| menthlth_days | 0 (0, 5) |
| bmi | 27 (24, 32) |
| sex | |
| Male | 4,007 (50%) |
| Female | 3,993 (50%) |
| exercise | |
| No | 1,763 (22%) |
| Yes | 6,237 (78%) |
| depression | |
| No | 6,313 (79%) |
| Yes | 1,687 (21%) |
| age_group | |
| 18-34 | 1,287 (16%) |
| 35-49 | 1,683 (21%) |
| 50-64 | 2,098 (26%) |
| 65+ | 2,932 (37%) |
| income | |
| Less than $25K | 1,092 (14%) |
| $100K+ | 653 (8.2%) |
| $25K-$49K | 1,888 (24%) |
| $50K-$99K | 4,367 (55%) |
| education | |
| Less than HS | 387 (4.8%) |
| College graduate | 3,746 (47%) |
| HS/GED | 1,720 (22%) |
| Some college | 2,147 (27%) |
| smoking | |
| Never | 4,815 (60%) |
| Current | 929 (12%) |
| Former | 2,256 (28%) |
| 1 Median (Q1, Q3); n (%) | |
Before removing missing data, physhlth_days had 10,785 missing values (2.49%), menthlth_days had 8,108 (1.87%), and BMI had 40,535 (9.35%). After removing observations with missing values and randomly sampling with set.seed(1220), the final analytic dataset included 8,000 respondents.
#----------------------------------------------------------#
# PART 1: Model Selection for Linear Regression
#----------------------------------------------------------#
# Assumes brfss_analytic was created in Part 0
# Outcome: physhlth_days
# Predictors:
# menthlth_days, bmi, sex, exercise, depression,
# age_group, income, education, smoking
#----------------------------------------------------------#
# Step 1: Fit the Maximum Model
#----------------------------------------------------------#
max_lm <- lm(
physhlth_days ~ menthlth_days + bmi + sex + exercise + depression +
age_group + income + education + smoking,
data = brfss_analytic
)
summary(max_lm)##
## Call:
## lm(formula = physhlth_days ~ menthlth_days + bmi + sex + exercise +
## depression + age_group + income + education + smoking, data = brfss_analytic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.9005 -3.9214 -1.8200 0.7445 31.2775
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.27832 0.66322 4.943 7.85e-07 ***
## menthlth_days 0.29364 0.01195 24.563 < 2e-16 ***
## bmi 0.09918 0.01396 7.107 1.29e-12 ***
## sexFemale -0.06683 0.18126 -0.369 0.712382
## exerciseYes -3.57069 0.22529 -15.849 < 2e-16 ***
## depressionYes 0.83235 0.24556 3.390 0.000703 ***
## age_group35-49 0.92707 0.29954 3.095 0.001975 **
## age_group50-64 2.22979 0.28720 7.764 9.26e-15 ***
## age_group65+ 3.02349 0.27616 10.948 < 2e-16 ***
## income$100K+ -3.91649 0.42567 -9.201 < 2e-16 ***
## income$25K-$49K -2.44157 0.30554 -7.991 1.53e-15 ***
## income$50K-$99K -3.50150 0.29178 -12.001 < 2e-16 ***
## educationCollege graduate 0.42865 0.45029 0.952 0.341157
## educationHS/GED 0.03516 0.44933 0.078 0.937628
## educationSome college 0.45289 0.44870 1.009 0.312837
## smokingCurrent 1.08265 0.29955 3.614 0.000303 ***
## smokingFormer 0.57889 0.20811 2.782 0.005421 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.887 on 7983 degrees of freedom
## Multiple R-squared: 0.203, Adjusted R-squared: 0.2014
## F-statistic: 127.1 on 16 and 7983 DF, p-value: < 2.2e-16
# Model fit statistics for maximum model
max_model_stats <- tibble(
Model = "Maximum Model",
R_squared = summary(max_lm)$r.squared,
Adjusted_R_squared = summary(max_lm)$adj.r.squared,
AIC = AIC(max_lm),
BIC = BIC(max_lm)
)
max_model_stats %>%
mutate(across(where(is.numeric), round, 3)) %>%
kbl(caption = "Maximum Linear Regression Model Fit Statistics") %>%
kable_styling(full_width = FALSE)| Model | R_squared | Adjusted_R_squared | AIC | BIC |
|---|---|---|---|---|
| Maximum Model | 0.203 | 0.201 | 55764.86 | 55890.63 |
#----------------------------------------------------------#
# Step 2: Best Subsets Regression
#----------------------------------------------------------#
# Create model matrix so factors are expanded into dummy variables
x <- model.matrix(
physhlth_days ~ menthlth_days + bmi + sex + exercise + depression +
age_group + income + education + smoking,
data = brfss_analytic
)
# Remove intercept column
x <- x[, -1]
# Build dataset for regsubsets
regsub_data <- data.frame(
physhlth_days = brfss_analytic$physhlth_days,
x
)
# Set nvmax equal to total number of candidate predictors/parameters
nvmax_val <- ncol(regsub_data) - 1
best_subsets <- regsubsets(
physhlth_days ~ .,
data = regsub_data,
nvmax = nvmax_val,
method = "exhaustive"
)
best_subsets_summary <- summary(best_subsets)
best_subsets_results <- tibble(
model_size = 1:nvmax_val,
adj_r2 = best_subsets_summary$adjr2,
bic = best_subsets_summary$bic
)
best_subsets_results %>%
mutate(
adj_r2 = round(adj_r2, 4),
bic = round(bic, 2)
) %>%
kbl(caption = "Best Subsets Results by Model Size") %>%
kable_styling(full_width = FALSE)| model_size | adj_r2 | bic |
|---|---|---|
| 1 | 0.1070 | -888.43 |
| 2 | 0.1576 | -1347.18 |
| 3 | 0.1673 | -1432.08 |
| 4 | 0.1742 | -1490.36 |
| 5 | 0.1805 | -1543.54 |
| 6 | 0.1865 | -1594.84 |
| 7 | 0.1918 | -1638.67 |
| 8 | 0.1974 | -1686.63 |
| 9 | 0.1987 | -1691.62 |
| 10 | 0.2000 | -1696.32 |
| 11 | 0.2008 | -1695.95 |
| 12 | 0.2014 | -1694.30 |
| 13 | 0.2015 | -1687.85 |
| 14 | 0.2016 | -1679.88 |
| 15 | 0.2015 | -1671.02 |
| 16 | 0.2014 | -1662.04 |
# Plot Adjusted R-squared across model sizes
ggplot(best_subsets_results, aes(x = model_size, y = adj_r2)) +
geom_line() +
geom_point() +
labs(
title = "Best Subsets: Adjusted R-squared by Model Size",
x = "Model Size",
y = "Adjusted R-squared"
) +
theme_minimal()# Plot BIC across model sizes
ggplot(best_subsets_results, aes(x = model_size, y = bic)) +
geom_line() +
geom_point() +
labs(
title = "Best Subsets: BIC by Model Size",
x = "Model Size",
y = "BIC"
) +
theme_minimal()# Identify best model size by Adjusted R-squared and BIC
best_adj_r2_row <- best_subsets_results %>%
filter(adj_r2 == max(adj_r2))
best_bic_row <- best_subsets_results %>%
filter(bic == min(bic))
cat("Model size with maximum Adjusted R-squared:", best_adj_r2_row$model_size, "\n")## Model size with maximum Adjusted R-squared: 14
## Model size with minimum BIC: 10
##
## Variables in model with maximum Adjusted R-squared:
## (Intercept) menthlth_days bmi
## 3.2695545 0.2933798 0.0993539
## exerciseYes depressionYes age_group35.49
## -3.5664466 0.8227978 0.9171982
## age_group50.64 age_group65. income.100K.
## 2.2210320 3.0151712 -3.8990537
## income.25K..49K income.50K..99K educationCollege.graduate
## -2.4381856 -3.4929870 0.3948283
## educationSome.college smokingCurrent smokingFormer
## 0.4220925 1.0912107 0.5848116
##
## Variables in model with minimum BIC:
## (Intercept) menthlth_days bmi exerciseYes depressionYes
## 3.7739087 0.2974875 0.0962865 -3.5865640 0.9002398
## age_group35.49 age_group50.64 age_group65. income.100K. income.25K..49K
## 1.0994401 2.3926369 3.1804914 -3.9434145 -2.4191235
## income.50K..99K
## -3.4765315
#----------------------------------------------------------#
# Step 3: Stepwise Selection Methods
#----------------------------------------------------------#
# Backward elimination
backward_model <- step(
max_lm,
direction = "backward",
trace = 0
)
# Forward selection
null_model <- lm(physhlth_days ~ 1, data = brfss_analytic)
forward_model <- step(
null_model,
scope = formula(max_lm),
direction = "forward",
trace = 0
)
# Stepwise selection (both directions)
stepwise_model <- step(
null_model,
scope = formula(max_lm),
direction = "both",
trace = 0
)
# Display summaries
summary(backward_model)##
## Call:
## lm(formula = physhlth_days ~ menthlth_days + bmi + exercise +
## depression + age_group + income + smoking, data = brfss_analytic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.1018 -3.8895 -1.8323 0.7495 30.9167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.42720 0.55800 6.142 8.54e-10 ***
## menthlth_days 0.29357 0.01193 24.612 < 2e-16 ***
## bmi 0.09891 0.01390 7.114 1.22e-12 ***
## exerciseYes -3.52573 0.22348 -15.777 < 2e-16 ***
## depressionYes 0.85465 0.24349 3.510 0.000451 ***
## age_group35-49 0.95193 0.29762 3.198 0.001387 **
## age_group50-64 2.24872 0.28575 7.870 4.02e-15 ***
## age_group65+ 3.05550 0.27408 11.148 < 2e-16 ***
## income$100K+ -3.73413 0.40666 -9.182 < 2e-16 ***
## income$25K-$49K -2.38110 0.30272 -7.866 4.15e-15 ***
## income$50K-$99K -3.35910 0.27753 -12.103 < 2e-16 ***
## smokingCurrent 1.02655 0.29420 3.489 0.000487 ***
## smokingFormer 0.55675 0.20567 2.707 0.006803 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.887 on 7987 degrees of freedom
## Multiple R-squared: 0.2026, Adjusted R-squared: 0.2014
## F-statistic: 169.1 on 12 and 7987 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = physhlth_days ~ menthlth_days + exercise + income +
## age_group + bmi + smoking + depression, data = brfss_analytic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.1018 -3.8895 -1.8323 0.7495 30.9167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.42720 0.55800 6.142 8.54e-10 ***
## menthlth_days 0.29357 0.01193 24.612 < 2e-16 ***
## exerciseYes -3.52573 0.22348 -15.777 < 2e-16 ***
## income$100K+ -3.73413 0.40666 -9.182 < 2e-16 ***
## income$25K-$49K -2.38110 0.30272 -7.866 4.15e-15 ***
## income$50K-$99K -3.35910 0.27753 -12.103 < 2e-16 ***
## age_group35-49 0.95193 0.29762 3.198 0.001387 **
## age_group50-64 2.24872 0.28575 7.870 4.02e-15 ***
## age_group65+ 3.05550 0.27408 11.148 < 2e-16 ***
## bmi 0.09891 0.01390 7.114 1.22e-12 ***
## smokingCurrent 1.02655 0.29420 3.489 0.000487 ***
## smokingFormer 0.55675 0.20567 2.707 0.006803 **
## depressionYes 0.85465 0.24349 3.510 0.000451 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.887 on 7987 degrees of freedom
## Multiple R-squared: 0.2026, Adjusted R-squared: 0.2014
## F-statistic: 169.1 on 12 and 7987 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = physhlth_days ~ menthlth_days + exercise + income +
## age_group + bmi + smoking + depression, data = brfss_analytic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.1018 -3.8895 -1.8323 0.7495 30.9167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.42720 0.55800 6.142 8.54e-10 ***
## menthlth_days 0.29357 0.01193 24.612 < 2e-16 ***
## exerciseYes -3.52573 0.22348 -15.777 < 2e-16 ***
## income$100K+ -3.73413 0.40666 -9.182 < 2e-16 ***
## income$25K-$49K -2.38110 0.30272 -7.866 4.15e-15 ***
## income$50K-$99K -3.35910 0.27753 -12.103 < 2e-16 ***
## age_group35-49 0.95193 0.29762 3.198 0.001387 **
## age_group50-64 2.24872 0.28575 7.870 4.02e-15 ***
## age_group65+ 3.05550 0.27408 11.148 < 2e-16 ***
## bmi 0.09891 0.01390 7.114 1.22e-12 ***
## smokingCurrent 1.02655 0.29420 3.489 0.000487 ***
## smokingFormer 0.55675 0.20567 2.707 0.006803 **
## depressionYes 0.85465 0.24349 3.510 0.000451 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.887 on 7987 degrees of freedom
## Multiple R-squared: 0.2026, Adjusted R-squared: 0.2014
## F-statistic: 169.1 on 12 and 7987 DF, p-value: < 2.2e-16
# Extract terms from each final model
backward_terms <- attr(terms(backward_model), "term.labels")
forward_terms <- attr(terms(forward_model), "term.labels")
stepwise_terms <- attr(terms(stepwise_model), "term.labels")
cat("\nBackward model variables:\n")##
## Backward model variables:
## [1] "menthlth_days" "bmi" "exercise" "depression"
## [5] "age_group" "income" "smoking"
##
## Forward model variables:
## [1] "menthlth_days" "exercise" "income" "age_group"
## [5] "bmi" "smoking" "depression"
##
## Stepwise model variables:
## [1] "menthlth_days" "exercise" "income" "age_group"
## [5] "bmi" "smoking" "depression"
# Compare variables retained/excluded
all_predictors <- c(
"menthlth_days", "bmi", "sex", "exercise", "depression",
"age_group", "income", "education", "smoking"
)
retained_all_three <- Reduce(intersect, list(backward_terms, forward_terms, stepwise_terms))
excluded_all_three <- setdiff(
all_predictors,
union(union(backward_terms, forward_terms), stepwise_terms)
)
cat("\nRetained by all three methods:\n")##
## Retained by all three methods:
## [1] "menthlth_days" "bmi" "exercise" "depression"
## [5] "age_group" "income" "smoking"
##
## Excluded by all three methods:
## [1] "sex" "education"
#----------------------------------------------------------#
# Step 4: Final Model Selection and Interpretation
#----------------------------------------------------------#
# Choose your final model here
# You can change this to backward_model or forward_model if preferred
final_lm <- stepwise_model
summary(final_lm)##
## Call:
## lm(formula = physhlth_days ~ menthlth_days + exercise + income +
## age_group + bmi + smoking + depression, data = brfss_analytic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.1018 -3.8895 -1.8323 0.7495 30.9167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.42720 0.55800 6.142 8.54e-10 ***
## menthlth_days 0.29357 0.01193 24.612 < 2e-16 ***
## exerciseYes -3.52573 0.22348 -15.777 < 2e-16 ***
## income$100K+ -3.73413 0.40666 -9.182 < 2e-16 ***
## income$25K-$49K -2.38110 0.30272 -7.866 4.15e-15 ***
## income$50K-$99K -3.35910 0.27753 -12.103 < 2e-16 ***
## age_group35-49 0.95193 0.29762 3.198 0.001387 **
## age_group50-64 2.24872 0.28575 7.870 4.02e-15 ***
## age_group65+ 3.05550 0.27408 11.148 < 2e-16 ***
## bmi 0.09891 0.01390 7.114 1.22e-12 ***
## smokingCurrent 1.02655 0.29420 3.489 0.000487 ***
## smokingFormer 0.55675 0.20567 2.707 0.006803 **
## depressionYes 0.85465 0.24349 3.510 0.000451 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.887 on 7987 degrees of freedom
## Multiple R-squared: 0.2026, Adjusted R-squared: 0.2014
## F-statistic: 169.1 on 12 and 7987 DF, p-value: < 2.2e-16
final_model_stats <- tibble(
Model = "Final Linear Model",
R_squared = summary(final_lm)$r.squared,
Adjusted_R_squared = summary(final_lm)$adj.r.squared,
AIC = AIC(final_lm),
BIC = BIC(final_lm)
)
final_model_stats %>%
mutate(across(where(is.numeric), round, 3)) %>%
kbl(caption = "Final Linear Model Fit Statistics") %>%
kable_styling(full_width = FALSE)| Model | R_squared | Adjusted_R_squared | AIC | BIC |
|---|---|---|---|---|
| Final Linear Model | 0.203 | 0.201 | 55760.55 | 55858.37 |
# Coefficient table with 95% confidence intervals
final_lm_tidy <- tidy(final_lm, conf.int = TRUE)
final_lm_tidy %>%
mutate(across(where(is.numeric), round, 3)) %>%
kbl(caption = "Final Linear Regression Model Coefficients with 95% Confidence Intervals") %>%
kable_styling(full_width = FALSE)| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 3.427 | 0.558 | 6.142 | 0.000 | 2.333 | 4.521 |
| menthlth_days | 0.294 | 0.012 | 24.612 | 0.000 | 0.270 | 0.317 |
| exerciseYes | -3.526 | 0.223 | -15.777 | 0.000 | -3.964 | -3.088 |
| income$100K+ | -3.734 | 0.407 | -9.182 | 0.000 | -4.531 | -2.937 |
| income$25K-$49K | -2.381 | 0.303 | -7.866 | 0.000 | -2.975 | -1.788 |
| income$50K-$99K | -3.359 | 0.278 | -12.103 | 0.000 | -3.903 | -2.815 |
| age_group35-49 | 0.952 | 0.298 | 3.198 | 0.001 | 0.369 | 1.535 |
| age_group50-64 | 2.249 | 0.286 | 7.870 | 0.000 | 1.689 | 2.809 |
| age_group65+ | 3.055 | 0.274 | 11.148 | 0.000 | 2.518 | 3.593 |
| bmi | 0.099 | 0.014 | 7.114 | 0.000 | 0.072 | 0.126 |
| smokingCurrent | 1.027 | 0.294 | 3.489 | 0.000 | 0.450 | 1.603 |
| smokingFormer | 0.557 | 0.206 | 2.707 | 0.007 | 0.154 | 0.960 |
| depressionYes | 0.855 | 0.243 | 3.510 | 0.000 | 0.377 | 1.332 |
#----------------------------------------------------------#
# Step 5: LINE Assumption Check
#----------------------------------------------------------#
par(mfrow = c(2, 2))
plot(final_lm)The maximum linear regression model included mental health days, BMI, sex, exercise, depression, age group, income, education, and smoking status as predictors of physically unhealthy days. Model fit was assessed using R-squared, adjusted R-squared, AIC, and BIC.
The model explained approximately 20.3% of the variability in physically unhealthy days (R² = 0.203; Adjusted R² = 0.201), indicating a modest but meaningful fit.
The model had an AIC of 55,764.86 and a BIC of 55,890.63, which serve as baseline metrics for comparison with simpler models.
Several predictors were statistically significant, including mental health days, BMI, exercise, depression, age group, income, and smoking status, while sex and education were not statistically significant predictors of physically unhealthy days.
Although the model explains a modest proportion of the variance, this is expected given that physical health outcomes are influenced by many unmeasured behavioral, environmental, and clinical factors.
Best subsets regression was used to compare candidate models across different model sizes. Adjusted R-squared was used to identify the model with the strongest explanatory power, while BIC was used to identify the most parsimonious model.
The model with 14 predictors maximized adjusted R-squared (0.2016), while the model with 10 predictors minimized BIC (-1696.32).
These criteria did not agree, which is expected. Adjusted R-squared tends to favor more complex models, whereas BIC applies a stronger penalty for model complexity and therefore favors more parsimonious models.
Based on BIC, a model with approximately 10 predictors provides the best balance between model fit and simplicity, and is therefore preferred.
Backward elimination, forward selection, and stepwise selection were
performed using the step() function. The final models from
each approach were compared to identify predictors that were
consistently retained or excluded.
The variables mental health days, BMI, exercise, depression, age group, income, and smoking status were retained by all three methods, indicating they are strong and consistent predictors of physically unhealthy days.
In contrast, sex and education were excluded by all three methods, suggesting that they do not contribute meaningfully to the model after accounting for the other predictors.
The final model was selected based on parsimony, consistency across selection methods, and overall model fit. The stepwise model was chosen because it retained only predictors that were consistently identified as important while maintaining strong model performance.
The final model had an R² of 0.203 and an adjusted R² of 0.201, with an AIC of 55,760.55 and a BIC of 55,858.37, indicating similar explanatory power to the maximum model despite including fewer predictors.
This suggests that the excluded variables did not meaningfully improve model fit and that the final model achieves a more efficient balance between complexity and explanatory power.
Diagnostic plots were examined to assess the LINE assumptions, including linearity, independence, normality, and equal variance of residuals.
The residuals vs. fitted plot showed some evidence of non-constant variance, with a slight funnel shape suggesting mild heteroscedasticity.
The Q-Q plot indicated that residuals were approximately normally distributed, although there were noticeable deviations in the upper tail, suggesting departures from normality.
The scale-location plot further supported the presence of heteroscedasticity, as the spread of residuals increased slightly with fitted values.
The residuals vs. leverage plot did not reveal any highly influential observations with extreme Cook’s distance, although a few points exhibited moderately higher leverage.
Overall, the LINE assumptions appear to be partially satisfied, with evidence of heteroscedasticity and mild departures from normality. While these violations are not severe, the results should be interpreted with some caution.
#----------------------------------------------------------#
# PART 2: Logistic Regression
#----------------------------------------------------------#
# Assumes brfss_analytic already exists from Part 0
# and final linear model predictors were:
# menthlth_days + exercise + income + age_group + bmi + smoking + depression
#----------------------------------------------------------#
# Step 1: Create Binary Outcome
#----------------------------------------------------------#
brfss_analytic <- brfss_analytic %>%
mutate(
frequent_distress = case_when(
physhlth_days >= 14 ~ "Yes",
physhlth_days < 14 ~ "No",
TRUE ~ NA_character_
),
frequent_distress = factor(frequent_distress, levels = c("No", "Yes"))
)
# Prevalence table
distress_prev <- brfss_analytic %>%
count(frequent_distress) %>%
mutate(percent = 100 * n / sum(n))
distress_prev %>%
mutate(percent = round(percent, 2)) %>%
kbl(caption = "Prevalence of Frequent Physical Distress") %>%
kable_styling(full_width = FALSE)| frequent_distress | n | percent |
|---|---|---|
| No | 6895 | 86.19 |
| Yes | 1105 | 13.81 |
#----------------------------------------------------------#
# Step 2: Simple (Unadjusted) Logistic Regression
#----------------------------------------------------------#
# Choose one strong predictor from the final linear model.
# Here: exercise
mod_simple <- glm(
frequent_distress ~ exercise,
data = brfss_analytic,
family = binomial
)
summary(mod_simple)##
## Call:
## glm(formula = frequent_distress ~ exercise, family = binomial,
## data = brfss_analytic)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.87145 0.05223 -16.69 <2e-16 ***
## exerciseYes -1.39670 0.06793 -20.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6424.7 on 7999 degrees of freedom
## Residual deviance: 6020.9 on 7998 degrees of freedom
## AIC: 6024.9
##
## Number of Fisher Scoring iterations: 5
# Odds ratio and 95% CI
simple_or <- exp(coef(mod_simple))
simple_ci <- exp(confint(mod_simple))
simple_results <- tibble(
term = names(simple_or),
odds_ratio = as.numeric(simple_or),
conf.low = simple_ci[, 1],
conf.high = simple_ci[, 2]
)
simple_results %>%
mutate(across(where(is.numeric), round, 3)) %>%
kbl(caption = "Simple Logistic Regression: Odds Ratios and 95% CI") %>%
kable_styling(full_width = FALSE)| term | odds_ratio | conf.low | conf.high |
|---|---|---|---|
| (Intercept) | 0.418 | 0.377 | 0.463 |
| exerciseYes | 0.247 | 0.217 | 0.283 |
#----------------------------------------------------------#
# Step 3: Multiple Logistic Regression
#----------------------------------------------------------#
mod_logistic <- glm(
frequent_distress ~ menthlth_days + exercise + income +
age_group + bmi + smoking + depression,
data = brfss_analytic,
family = binomial
)
summary(mod_logistic)##
## Call:
## glm(formula = frequent_distress ~ menthlth_days + exercise +
## income + age_group + bmi + smoking + depression, family = binomial,
## data = brfss_analytic)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.879241 0.218703 -13.165 < 2e-16 ***
## menthlth_days 0.067729 0.003883 17.442 < 2e-16 ***
## exerciseYes -0.960316 0.076063 -12.625 < 2e-16 ***
## income$100K+ -1.397224 0.208803 -6.692 2.21e-11 ***
## income$25K-$49K -0.568895 0.099366 -5.725 1.03e-08 ***
## income$50K-$99K -0.995229 0.094489 -10.533 < 2e-16 ***
## age_group35-49 0.413981 0.147478 2.807 0.00500 **
## age_group50-64 0.924551 0.137207 6.738 1.60e-11 ***
## age_group65+ 1.281364 0.133111 9.626 < 2e-16 ***
## bmi 0.031461 0.005047 6.234 4.56e-10 ***
## smokingCurrent 0.360242 0.105900 3.402 0.00067 ***
## smokingFormer 0.218424 0.081922 2.666 0.00767 **
## depressionYes 0.258291 0.089178 2.896 0.00378 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6424.7 on 7999 degrees of freedom
## Residual deviance: 5186.1 on 7987 degrees of freedom
## AIC: 5212.1
##
## Number of Fisher Scoring iterations: 5
# Adjusted odds ratios with 95% CI
logistic_results <- tidy(mod_logistic, conf.int = TRUE, exponentiate = TRUE)
logistic_results %>%
mutate(across(where(is.numeric), round, 3)) %>%
kbl(caption = "Multiple Logistic Regression: Adjusted Odds Ratios with 95% CI") %>%
kable_styling(full_width = FALSE)| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 0.056 | 0.219 | -13.165 | 0.000 | 0.036 | 0.086 |
| menthlth_days | 1.070 | 0.004 | 17.442 | 0.000 | 1.062 | 1.078 |
| exerciseYes | 0.383 | 0.076 | -12.625 | 0.000 | 0.330 | 0.444 |
| income$100K+ | 0.247 | 0.209 | -6.692 | 0.000 | 0.161 | 0.367 |
| income$25K-$49K | 0.566 | 0.099 | -5.725 | 0.000 | 0.466 | 0.688 |
| income$50K-$99K | 0.370 | 0.094 | -10.533 | 0.000 | 0.307 | 0.445 |
| age_group35-49 | 1.513 | 0.147 | 2.807 | 0.005 | 1.136 | 2.026 |
| age_group50-64 | 2.521 | 0.137 | 6.738 | 0.000 | 1.934 | 3.313 |
| age_group65+ | 3.602 | 0.133 | 9.626 | 0.000 | 2.787 | 4.699 |
| bmi | 1.032 | 0.005 | 6.234 | 0.000 | 1.022 | 1.042 |
| smokingCurrent | 1.434 | 0.106 | 3.402 | 0.001 | 1.163 | 1.762 |
| smokingFormer | 1.244 | 0.082 | 2.666 | 0.008 | 1.059 | 1.460 |
| depressionYes | 1.295 | 0.089 | 2.896 | 0.004 | 1.086 | 1.540 |
#----------------------------------------------------------#
# Step 4: Likelihood Ratio Test
#----------------------------------------------------------#
# Test income as a group of related predictors
mod_reduced <- glm(
frequent_distress ~ menthlth_days + exercise +
age_group + bmi + smoking + depression,
data = brfss_analytic,
family = binomial
)
lrt_results <- anova(mod_reduced, mod_logistic, test = "Chisq")
lrt_results## Analysis of Deviance Table
##
## Model 1: frequent_distress ~ menthlth_days + exercise + age_group + bmi +
## smoking + depression
## Model 2: frequent_distress ~ menthlth_days + exercise + income + age_group +
## bmi + smoking + depression
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 7990 5308.7
## 2 7987 5186.1 3 122.6 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Optional formatted display
lrt_table <- broom::tidy(lrt_results)
lrt_table %>%
mutate(across(where(is.numeric), round, 3)) %>%
kbl(caption = "Likelihood Ratio Test Comparing Reduced and Full Logistic Models") %>%
kable_styling(full_width = FALSE)| term | df.residual | residual.deviance | df | deviance | p.value |
|---|---|---|---|---|---|
| frequent_distress ~ menthlth_days + exercise + age_group + bmi + smoking + depression | 7990 | 5308.706 | NA | NA | NA |
| frequent_distress ~ menthlth_days + exercise + income + age_group + bmi + smoking + depression | 7987 | 5186.106 | 3 | 122.6 | 0 |
#----------------------------------------------------------#
# Step 5: ROC Curve and AUC
#----------------------------------------------------------#
roc_obj <- roc(brfss_analytic$frequent_distress, fitted(mod_logistic))
plot(
roc_obj,
main = "ROC Curve: Frequent Physical Distress Model",
print.auc = TRUE
)auc_value <- auc(roc_obj)
auc_ci <- ci.auc(roc_obj)
cat("AUC:", round(as.numeric(auc_value), 3), "\n")## AUC: 0.793
## 95% CI for AUC: 0.778 - 0.808
#----------------------------------------------------------#
# Step 6: Hosmer-Lemeshow Goodness-of-Fit Test
#----------------------------------------------------------#
hl_test <- hoslem.test(mod_logistic$y, fitted(mod_logistic), g = 10)
hl_test##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: mod_logistic$y, fitted(mod_logistic)
## X-squared = 13.516, df = 8, p-value = 0.09529
# Optional formatted display
hl_table <- tibble(
statistic = as.numeric(hl_test$statistic),
df = as.numeric(hl_test$parameter),
p_value = as.numeric(hl_test$p.value)
)
hl_table %>%
mutate(across(where(is.numeric), round, 3)) %>%
kbl(caption = "Hosmer-Lemeshow Goodness-of-Fit Test") %>%
kable_styling(full_width = FALSE)| statistic | df | p_value |
|---|---|---|
| 13.516 | 8 | 0.095 |
A binary outcome variable, frequent_distress, was created and coded as “Yes” for individuals reporting 14 or more physically unhealthy days in the past 30 days and “No” otherwise.
In the analytic sample, 13.81% (n = 1,105) of respondents reported frequent physical distress, while 86.19% (n = 6,895) did not.
This indicates that the outcome is moderately imbalanced, with a smaller proportion of individuals experiencing frequent distress.
I selected exercise as the predictor for the simple logistic regression because it showed a strong association in the linear regression model and is an important modifiable behavioral factor. The odds ratio for exercise was 0.247 (95% CI: 0.217, 0.283).
Compared to individuals who did not exercise, those who reported exercising had 0.25 times the odds of frequent physical distress. This corresponds to approximately a 75% reduction in the odds of frequent distress. The association is statistically significant (p < 0.001), as the confidence interval does not include 1.0.
A multiple logistic regression model was fit using the same predictors as the final linear model, as required.
Key Adjusted Odds Ratios: Mental health (continuous)
Holding all other variables constant, each additional day of poor mental health was associated with 1.07 times higher odds of frequent physical distress (95% CI: 1.06, 1.08).
Exercise (binary)
Holding all other variables constant, individuals who exercised had 0.38 times the odds of frequent physical distress compared to those who did not exercise (95% CI: 0.33, 0.44), indicating a substantial protective effect.
Income (categorical)
Compared to individuals with income less than $25,000:
Those earning $100K+ had 0.25 times the odds (95% CI: 0.16, 0.37) Those earning $50K–$99K had 0.37 times the odds Those earning $25K–$49K had 0.57 times the odds
This shows a strong gradient where higher income is associated with lower odds of distress.
Age group
Compared to adults aged 18–34:
Age 35–49: 1.51 times higher odds Age 50–64: 2.52 times higher odds Age 65+: 3.60 times higher odds
This suggests increasing physical distress with age.
Hypotheses: H₀: Income does not improve model fit
Hₐ: Income significantly improves model fit
Results:
The likelihood ratio test comparing the reduced and full models yielded a chi-square statistic of 122.6 with 3 degrees of freedom and a p-value < 0.001.
Conclusion:
At α = 0.05, we reject the null hypothesis, concluding that income significantly improves the model and should be retained.
The ROC curve was used to evaluate the model’s ability to distinguish between individuals with and without frequent physical distress.
The AUC was 0.793 (95% CI: 0.778–0.808).
This indicates acceptable to good discrimination, meaning the model can reasonably distinguish between individuals who do and do not experience frequent distress.
Hypotheses: H₀: The model fits the data well Hₐ: The model does not fit the data well Results:
The Hosmer-Lemeshow test produced a chi-square statistic of 13.52 with 8 degrees of freedom and a p-value of 0.095.
Interpretation:
Because the p-value is greater than 0.05, we fail to reject the null hypothesis, indicating no evidence of poor model fit.
Complement with AUC:
The Hosmer-Lemeshow test assesses model calibration, while the ROC/AUC assesses discrimination. Together, these results indicate that the model both fits the data adequately and has good predictive performance.
The results of this analysis suggest that physical health burden among U.S. adults is influenced by a combination of behavioral, socioeconomic, and health-related factors. Across both the linear and logistic regression models, mental health, exercise, income, age, BMI, smoking, and depression emerged as consistent predictors. Mental health was one of the strongest predictors in both models, with more mentally unhealthy days associated with both an increase in physically unhealthy days and higher odds of frequent physical distress. Exercise showed a strong protective effect, with individuals who exercised reporting fewer unhealthy days and significantly lower odds of distress. Income also demonstrated a clear gradient, with higher income levels associated with better physical health outcomes. Age was another important factor, with older adults experiencing greater health burden.
Some predictors, such as sex and education, were not significant in the linear model and were excluded from the final models, suggesting their effects may be mediated through other variables. A key limitation of this analysis is the use of cross-sectional survey data, which prevents causal inference. Additionally, self-reported measures may introduce reporting bias. Important potential confounders not included in the model include access to healthcare, chronic disease status, and geographic factors, all of which could influence both exposures and outcomes.
The linear and logistic regression models provide complementary insights into the same research question. The linear model quantifies how predictors are associated with the number of physically unhealthy days, offering detailed information about changes in magnitude. In contrast, the logistic model focuses on the probability of experiencing frequent physical distress, which is often more relevant for public health decision-making and risk classification. The evaluation metrics also differ: linear regression uses R-squared to assess explained variance, while logistic regression relies on AUC to assess discrimination. In this analysis, the logistic model demonstrated acceptable discrimination (AUC ≈ 0.79), indicating reasonable predictive performance.
Overall, linear regression is useful when the outcome is continuous and detailed variation is of interest, while logistic regression is preferred when the goal is to identify individuals at higher risk of a clinically meaningful threshold outcome. These findings suggest that interventions promoting physical activity and addressing socioeconomic disparities may reduce the burden of physical distress at the population level.
End of Homework 4