#Assignment 4
In this initial chunk I am loading the packages that I will need for this assignment. The full data set has 433323 observations and 9 selected variables. I selected physical health days (outcome variable) as well as the following predictors mental health days, bmi, depression, sex, age by age group, smoking status, general health and marital status. These variables were selected in particular to see if they have any impact on physical unhealthy days.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.1 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(haven)
## Warning: package 'haven' was built under R version 4.5.2
library(janitor)
## Warning: package 'janitor' was built under R version 4.5.2
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(knitr)
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.5.2
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
library(broom)
library(gtsummary)
## Warning: package 'gtsummary' was built under R version 4.5.2
library(GGally)
## Warning: package 'GGally' was built under R version 4.5.2
library(car)
## Warning: package 'car' was built under R version 4.5.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.5.2
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(ggeffects)
## Warning: package 'ggeffects' was built under R version 4.5.2
library(plotly)
## Warning: package 'plotly' was built under R version 4.5.2
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(leaps)
## Warning: package 'leaps' was built under R version 4.5.2
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:plotly':
##
## select
##
## The following object is masked from 'package:gtsummary':
##
## select
##
## The following object is masked from 'package:dplyr':
##
## select
library(ResourceSelection) # for Hosmer-Lemeshow
## Warning: package 'ResourceSelection' was built under R version 4.5.3
## ResourceSelection 0.3-6 2023-06-27
library(pROC) # for ROC/AUC
## Warning: package 'pROC' was built under R version 4.5.3
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(performance) # for model performance metrics
## Warning: package 'performance' was built under R version 4.5.3
library(sjPlot)
## Warning: package 'sjPlot' was built under R version 4.5.3
##
## Attaching package: 'sjPlot'
##
## The following object is masked from 'package:ggplot2':
##
## set_theme
library(modelsummary)
## Warning: package 'modelsummary' was built under R version 4.5.3
library(dplyr)
options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))
## Setting theme "Compact"
## Setting theme "Compact"
brfss_23 <- read_xpt(
"C:/Users/ariel/OneDrive/DrPH - Albany/4. Spring Semester 2026/HEPI 553/HEPI_553/data/LLCP2023.XPT") %>%
clean_names()
brfss_23 <- brfss_23 %>%
dplyr::select(physhlth, menthlth, bmi5, addepev3, sexvar, ageg5yr, smoker3, genhlth, marital)
In this section I recoded the categorical and binary variables to factor variables and ensured that they have meaningful labes to ensure easy readiablity for the reader.I also made sure that the reference macthed the folloing list for the categorical variables.
• sex: “Male” as reference • depression: “No” as reference • age_group: “18-34” as reference • smoking: “Never” as reference • gen_health: “Excellent/Very Good” as reference • marital: “Married/Partnered” as reference
brfss_23_clean <- brfss_23 |>
mutate(
#Outcome: physically unhealthy days in past 30
physhlth_days = case_when(
physhlth == 88 ~ 0,
physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
TRUE ~ NA_real_ ),
## Selected Predictors for model
#Mentally unhealthy days in past 30
menthlth_days = case_when(
menthlth == 88 ~ 0,
menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
TRUE ~ NA_real_),
#BMI
bmi = ifelse(bmi5 > 0, bmi5 / 100, NA_real_),
#Sex
sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
#Depression
depression = factor(addepev3, levels = c(2, 1), labels = c("No", "Yes")),
#Age groups (13-level raw BRFSS variable Age Group)
# 1 +2 + 3 = 18 to 34
# 4 +5 +6= 35 - 39
# 7 +8+9= 50 - 64
# 10 +11 +12 +13 = 65+
# 14 = Refused
age_group = factor(case_when(
ageg5yr %in% c(1, 2, 3) ~ "18-34",
ageg5yr %in% c(4, 5, 6) ~ "35-49",
ageg5yr %in% c(7, 8, 9) ~ "50-64",
ageg5yr %in% c(10, 11, 12, 13) ~ "65+",
TRUE ~ NA_character_),
levels = c("18-34", "35-49","50-64", "65+")),
#Smoker (4-cat raw BRFSS variable)
# 1 + 2 = Current
# 3 = Former
# 4 = Never
# 9 = Refused
smoking = factor(case_when(
smoker3 %in% c(1, 2) ~ "Current",
smoker3 == 3 ~ "Former",
smoker3 == 4 ~ "Never",
TRUE ~ NA_character_),
levels = c("Current","Former","Never")),
#General Health
gen_health = case_when(
genhlth %in% c(1, 2) ~ "Excellent/VG",
genhlth == 3 ~ "Good",
genhlth %in% c(4, 5) ~ "Fair/Poor",
TRUE ~ NA_character_),
#Marital Status
marital = case_when(
marital %in% c(1, 6) ~ "Married/Partnered",
marital %in% c(2, 3, 4) ~ "Previously Married",
marital == 5 ~ "Never married",
TRUE ~ NA_character_ ))
In this next chunk, I will be looking at the number and percent of missing data for the outcome and predictors. I will also present the final analytic samples.
# Percent Missing
brfss_23_clean %>%
dplyr::select(physhlth_days, menthlth_days, bmi, sex, depression,age_group,smoking,gen_health, marital) %>%
summarise(across(everything(), ~ sum(is.na(.) | . %in% c(77, 88, 99)) / n() * 100)) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "Pct_Missing_or_DK") %>%
mutate(Pct_Missing_or_DK = round(Pct_Missing_or_DK, 1)) %>%
arrange(desc(Pct_Missing_or_DK)) %>%
kable(
caption = "Table 1. Missing or Don't Know/Refused (%) — BRFSS 2023 Full Sample",
col.names = c("Variable", "% Missing / DK / Refused")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Variable | % Missing / DK / Refused |
|---|---|
| bmi | 9.4 |
| smoking | 5.3 |
| physhlth_days | 2.5 |
| menthlth_days | 1.9 |
| age_group | 1.8 |
| marital | 1.0 |
| depression | 0.6 |
| gen_health | 0.3 |
| sex | 0.0 |
The number of missing for physical unhealthy days was XXX (2.5%) in full dataset. BMI had the most amount of missing in the dataset with a count of XXX (9.4%) missing. Sex had no missing values in the dataset.
The next step I will be dropping all the missing data from the final analytic dataset for all the variables in the dataset and creating a sample dataset of 8.000. I saved this dataset in a working directory and produced a table of the final analytic sample and number of predictors in the dataset.
# Reproducible random sample
set.seed(1220)
brfss_23_hw4 <- brfss_23_clean |>
dplyr::select(physhlth_days, menthlth_days, bmi, sex, depression, age_group, smoking , gen_health, marital) |>
tidyr::drop_na(physhlth_days,physhlth_days, menthlth_days, bmi, sex, depression, age_group, smoking, gen_health, marital) |>
dplyr::slice_sample(n = 8000)
# Save for review
saveRDS(brfss_23_hw4,
"C:/Users/ariel/OneDrive/DrPH - Albany/4. Spring Semester 2026/HEPI 553/HEPI_553/data/brfss_hw4_2023.rds")
tibble(Metric = c("Observations", "Variables"),
Value = c(nrow(brfss_23_hw4), ncol(brfss_23_hw4))) |>
kable(caption = "Table 2: Analytic Dataset Dimensions") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| Metric | Value |
|---|---|
| Observations | 8000 |
| Variables | 9 |
The next chunk will produce Table 3, this table contains all the variables in dataset. The count and standard deviation is reported for continuos variables and count and percentages are reported for binary and categorical variables.
#Descriptive Summary Table
brfss_23_hw4 %>%
dplyr::select(physhlth_days, menthlth_days, bmi, sex, depression, age_group, smoking , gen_health, marital) %>%
tbl_summary(
label = list(
physhlth_days ~ "Physically unhealthy days (past 30)",
menthlth_days ~ "Mentally unhealthy days (past 30)",
bmi ~ "BMI (kg/m^2)",
sex ~ "Sex",
depression ~ "Depression (Yes)",
age_group ~ "Age (Age Groups)",
smoking ~ "Smoking Status",
gen_health ~ "General Health Status",
marital ~ "Marital Status"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1,
missing = "no"
) %>%
add_n() %>%
bold_labels() %>%
modify_caption("**Table 3: Descriptive Statistics — BRFSS 2023 Analytic Sample (n = 8,000)**")
| Characteristic | N | N = 8,0001 |
|---|---|---|
| Physically unhealthy days (past 30) | 8,000 | 4.4 (8.7) |
| Mentally unhealthy days (past 30) | 8,000 | 4.3 (8.2) |
| BMI (kg/m^2) | 8,000 | 28.5 (6.5) |
| Sex | 8,000 | |
| Male | 3,881 (49%) | |
| Female | 4,119 (51%) | |
| Depression (Yes) | 8,000 | 1,607 (20%) |
| Age (Age Groups) | 8,000 | |
| 18-34 | 1,284 (16%) | |
| 35-49 | 1,570 (20%) | |
| 50-64 | 2,048 (26%) | |
| 65+ | 3,098 (39%) | |
| Smoking Status | 8,000 | |
| Current | 884 (11%) | |
| Former | 2,183 (27%) | |
| Never | 4,933 (62%) | |
| General Health Status | 8,000 | |
| Excellent/VG | 3,927 (49%) | |
| Fair/Poor | 1,387 (17%) | |
| Good | 2,686 (34%) | |
| Marital Status | 8,000 | |
| Married/Partnered | 4,620 (58%) | |
| Never married | 1,355 (17%) | |
| Previously Married | 2,025 (25%) | |
| 1 Mean (SD); n (%) | ||
In this next part I will be conducted a full multi-linear regression with physical unhealthy days and the predictors listed above.
The chunk will run the full multi-variable linear regression and report the significant varirables in Table 4. This chuck will also report out the R squared, adjusted R squared, AIC and BIC for the full model in Table 5.
# Full Linear Model: Full multivariable model
mlr_m1 <- lm(physhlth_days ~ menthlth_days + bmi + depression + sex + age_group + smoking + gen_health + marital,
data = brfss_23_hw4)
summary(mlr_m1)
##
## Call:
## lm(formula = physhlth_days ~ menthlth_days + bmi + depression +
## sex + age_group + smoking + gen_health + marital, data = brfss_23_hw4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.9985 -2.7171 -1.3056 0.5627 30.0197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.009531 0.483882 -0.020 0.98429
## menthlth_days 0.191592 0.011450 16.732 < 2e-16 ***
## bmi 0.033258 0.012774 2.604 0.00924 **
## depressionYes 1.145953 0.225315 5.086 3.74e-07 ***
## sexFemale 0.078565 0.165641 0.474 0.63530
## age_group35-49 0.580859 0.286844 2.025 0.04290 *
## age_group50-64 1.679050 0.281832 5.958 2.67e-09 ***
## age_group65+ 1.852247 0.276606 6.696 2.28e-11 ***
## smokingFormer -1.433778 0.289619 -4.951 7.55e-07 ***
## smokingNever -1.412889 0.268650 -5.259 1.48e-07 ***
## gen_healthFair/Poor 10.751555 0.246041 43.698 < 2e-16 ***
## gen_healthGood 1.299468 0.184367 7.048 1.96e-12 ***
## maritalNever married -0.164972 0.243975 -0.676 0.49894
## maritalPreviously Married 0.215368 0.199590 1.079 0.28060
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.126 on 7986 degrees of freedom
## Multiple R-squared: 0.3313, Adjusted R-squared: 0.3302
## F-statistic: 304.4 on 13 and 7986 DF, p-value: < 2.2e-16
tidy(mlr_m1, conf.int = TRUE) %>%
mutate(
term = dplyr::recode(term,
"(Intercept)" = "Intercept",
"menthlth_days" = "Mental health days",
"bmi" = "BMI (kg/m^2)",
"depression" = "Depression: Yes (ref = No)",
"sexFemale" = "Sex: Female (ref = Male)",
"age_group35-49" = "Age Group: 35-49 (ref = <25)",
"age_group50-64" = "Age Group: 50-64 (ref = <25)",
"age_group65+" = "Age Group: 65+ (ref = <25)",
"smokingFormer" = "Smoking: Former (ref = Current)",
"smokingNever" = "Smoking: Never (ref = Current)",
"gen_healthFair/Poor" = "General Health Status: Fair/Poor (ref = Excellent/VG)",
"gen_healthGood" = "General Health Status: Good (ref = Excellent/VG)",
"maritalNever married " = "Marital Status: Never Married (ref = Married/Partnered)",
"maritalPreviously Married" = "Marital Status: Good (ref = Married/Partnered)",
),
across(where(is.numeric), ~ round(., 4))
) %>%
kable(
caption = "Table 4. Multiple Linear Regression: Phyically Unhealthy Days ~ Multiple Predictors (BRFSS 2023, n = 8,000)",
col.names = c("Term", "Estimate (β̂)", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE) %>%
row_spec(c(2, 3, 4,6,7,8,10,11,12,14), background = "#EBF5FB") # highlight key predictors
| Term | Estimate (β̂ |
Std. Erro
| ||||
|---|---|---|---|---|---|---|
| Intercept | -0.0095 | 0.4839 | -0.0197 | 0.9843 | -0.9581 | 0.9390 |
| Mental health days | 0.1916 | 0.0115 | 16.7324 | 0.0000 | 0.1691 | 0.2140 |
| BMI (kg/m^2) | 0.0333 | 0.0128 | 2.6036 | 0.0092 | 0.0082 | 0.0583 |
| depressionYes | 1.1460 | 0.2253 | 5.0860 | 0.0000 | 0.7043 | 1.5876 |
| Sex: Female (ref = Male) | 0.0786 | 0.1656 | 0.4743 | 0.6353 | -0.2461 | 0.4033 |
| Age Group: 35-49 (ref = <25) | 0.5809 | 0.2868 | 2.0250 | 0.0429 | 0.0186 | 1.1431 |
| Age Group: 50-64 (ref = <25) | 1.6790 | 0.2818 | 5.9576 | 0.0000 | 1.1266 | 2.2315 |
| Age Group: 65+ (ref = <25) | 1.8522 | 0.2766 | 6.6963 | 0.0000 | 1.3100 | 2.3945 |
| Smoking: Former (ref = Current) | -1.4338 | 0.2896 | -4.9506 | 0.0000 | -2.0015 | -0.8660 |
| Smoking: Never (ref = Current) | -1.4129 | 0.2687 | -5.2592 | 0.0000 | -1.9395 | -0.8863 |
| General Health Status: Fair/Poor (ref = Excellent/VG) | 10.7516 | 0.2460 | 43.6982 | 0.0000 | 10.2693 | 11.2339 |
| General Health Status: Good (ref = Excellent/VG) | 1.2995 | 0.1844 | 7.0483 | 0.0000 | 0.9381 | 1.6609 |
| maritalNever married | -0.1650 | 0.2440 | -0.6762 | 0.4989 | -0.6432 | 0.3133 |
| Marital Status: Good (ref = Married/Partnered) | 0.2154 | 0.1996 | 1.0791 | 0.2806 | -0.1759 | 0.6066 |
glance(mlr_m1) |>
dplyr::select(r.squared, adj.r.squared, sigma, AIC, BIC, df.residual) |>
mutate(across(everything(), \(x) round(x, 3))) |>
kable(caption = " Table 5: Maximum Model - Fit Statistics") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| r.squared | adj.r.squared | sigma | AIC | BIC | df.residual |
|---|---|---|---|---|---|
| 0.331 | 0.33 | 7.126 | 54138.96 | 54243.77 | 7986 |
# Prepare a model matrix (need numeric predictors for leaps)
# Use the formula interface approach
best_subsets_hw4 <- regsubsets(
physhlth_days ~ menthlth_days + bmi + depression + sex + age_group + smoking + gen_health + marital,
data = brfss_23_hw4,
nvmax = 14, # maximum number of variables to consider
method = "exhaustive"
)
best_summary_hw4 <- summary(best_subsets_hw4)
subset_metrics_hw4 <- tibble(
p = 1:length(best_summary_hw4$adjr2),
`Adj. R^2` = best_summary_hw4$adjr2,
BIC = best_summary_hw4$bic,
Cp = best_summary_hw4$cp
)
p1_hw <- ggplot(subset_metrics_hw4, aes(x = p, y = `Adj. R^2`)) +
geom_line(linewidth = 1, color = "steelblue") +
geom_point(size = 3, color = "steelblue") +
geom_vline(xintercept = which.max(best_summary_hw4$adjr2),
linetype = "dashed", color = "tomato") +
labs(title = "Adjusted R^2 by Model Size", x = "Number of Variables", y = "Adjusted R^2") +
theme_minimal(base_size = 12)
p2_hw <- ggplot(subset_metrics_hw4, aes(x = p, y = BIC)) +
geom_line(linewidth = 1, color = "steelblue") +
geom_point(size = 3, color = "steelblue") +
geom_vline(xintercept = which.min(best_summary_hw4$bic),
linetype = "dashed", color = "tomato") +
labs(title = "BIC by Model Size", x = "Number of Variables", y = "BIC") +
theme_minimal(base_size = 12)
gridExtra::grid.arrange(p1_hw, p2_hw, ncol = 2)
Interpretation of Adjusted R^2 and BIC:
The Adjusted R^2 suggest that 11 variables is optimal for this model whereas the BIC suggest that 9 variables optimize the model. The two criteria in this case do not agree and the BIC has a simpler model with less predictors.
In the following chunk I will be performing a backward elimination.
# Automated backward elimination using AIC
mod_backward_hw <- step(mlr_m1, direction = "backward", trace = 1)
## Start: AIC=31433.95
## physhlth_days ~ menthlth_days + bmi + depression + sex + age_group +
## smoking + gen_health + marital
##
## Df Sum of Sq RSS AIC
## - marital 2 99 405625 31432
## - sex 1 11 405537 31432
## <none> 405526 31434
## - bmi 1 344 405870 31439
## - depression 1 1314 406839 31458
## - smoking 2 1502 407028 31460
## - age_group 3 3200 408726 31491
## - menthlth_days 1 14217 419743 31708
## - gen_health 2 102526 508052 33233
##
## Step: AIC=31431.9
## physhlth_days ~ menthlth_days + bmi + depression + sex + age_group +
## smoking + gen_health
##
## Df Sum of Sq RSS AIC
## - sex 1 24 405648 31430
## <none> 405625 31432
## - bmi 1 335 405959 31436
## - depression 1 1333 406958 31456
## - smoking 2 1544 407169 31458
## - age_group 3 4352 409977 31511
## - menthlth_days 1 14281 419906 31707
## - gen_health 2 103643 509268 33248
##
## Step: AIC=31430.36
## physhlth_days ~ menthlth_days + bmi + depression + age_group +
## smoking + gen_health
##
## Df Sum of Sq RSS AIC
## <none> 405648 31430
## - bmi 1 329 405977 31435
## - depression 1 1426 407074 31456
## - smoking 2 1532 407180 31457
## - age_group 3 4485 410133 31512
## - menthlth_days 1 14349 419998 31706
## - gen_health 2 103620 509269 33246
tidy(mod_backward_hw, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Backward Elimination Result (AIC-based)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | -0.0420 | 0.4584 | -0.0917 | 0.9270 | -0.9406 | 0.8566 |
| menthlth_days | 0.1919 | 0.0114 | 16.8108 | 0.0000 | 0.1696 | 0.2143 |
| bmi | 0.0325 | 0.0128 | 2.5441 | 0.0110 | 0.0074 | 0.0575 |
| depressionYes | 1.1784 | 0.2224 | 5.2991 | 0.0000 | 0.7425 | 1.6142 |
| age_group35-49 | 0.6765 | 0.2718 | 2.4894 | 0.0128 | 0.1438 | 1.2093 |
| age_group50-64 | 1.8089 | 0.2592 | 6.9799 | 0.0000 | 1.3009 | 2.3170 |
| age_group65+ | 2.0193 | 0.2468 | 8.1808 | 0.0000 | 1.5354 | 2.5031 |
| smokingFormer | -1.4382 | 0.2889 | -4.9776 | 0.0000 | -2.0045 | -0.8718 |
| smokingNever | -1.4223 | 0.2670 | -5.3278 | 0.0000 | -1.9456 | -0.8990 |
| gen_healthFair/Poor | 10.7668 | 0.2449 | 43.9633 | 0.0000 | 10.2867 | 11.2468 |
| gen_healthGood | 1.3053 | 0.1841 | 7.0897 | 0.0000 | 0.9444 | 1.6662 |
The backward elimination removed sex from the final model.
In the following chunk I will be performing a forward selection.
# Automated forward selection using AIC
mod_null_hw <- lm(physhlth_days ~ 1, data = brfss_23_hw4)
mod_forward_hw <- step(mod_null_hw,
scope = list(lower = mod_null_hw, upper = mlr_m1),
direction = "forward", trace = 1)
## Start: AIC=34627.33
## physhlth_days ~ 1
##
## Df Sum of Sq RSS AIC
## + gen_health 2 173555 432886 31934
## + menthlth_days 1 67344 539097 33688
## + depression 1 28283 578158 34247
## + smoking 2 13891 592550 34446
## + bmi 1 11255 595186 34479
## + marital 2 7369 599072 34534
## + age_group 3 5377 601064 34562
## + sex 1 1093 605348 34615
## <none> 606441 34627
##
## Step: AIC=31934.26
## physhlth_days ~ gen_health
##
## Df Sum of Sq RSS AIC
## + menthlth_days 1 19322.2 413564 31571
## + depression 1 8323.4 424562 31781
## + smoking 2 2603.2 430283 31890
## + age_group 3 1623.0 431263 31910
## + marital 2 927.7 431958 31921
## + sex 1 705.5 432180 31923
## + bmi 1 588.8 432297 31925
## <none> 432886 31934
##
## Step: AIC=31570.96
## physhlth_days ~ gen_health + menthlth_days
##
## Df Sum of Sq RSS AIC
## + age_group 3 4600.7 408963 31488
## + smoking 2 1692.5 411871 31542
## + depression 1 1484.6 412079 31544
## + marital 2 1572.8 411991 31545
## + bmi 1 346.5 413217 31566
## + sex 1 235.3 413328 31568
## <none> 413564 31571
##
## Step: AIC=31487.47
## physhlth_days ~ gen_health + menthlth_days + age_group
##
## Df Sum of Sq RSS AIC
## + depression 1 1545.56 407417 31459
## + smoking 2 1484.64 407478 31462
## + bmi 1 302.80 408660 31484
## <none> 408963 31488
## + marital 2 186.28 408777 31488
## + sex 1 75.57 408887 31488
##
## Step: AIC=31459.17
## physhlth_days ~ gen_health + menthlth_days + age_group + depression
##
## Df Sum of Sq RSS AIC
## + smoking 2 1440.33 405977 31435
## + bmi 1 237.36 407180 31457
## <none> 407417 31459
## + marital 2 135.15 407282 31461
## + sex 1 7.63 407410 31461
##
## Step: AIC=31434.84
## physhlth_days ~ gen_health + menthlth_days + age_group + depression +
## smoking
##
## Df Sum of Sq RSS AIC
## + bmi 1 328.65 405648 31430
## <none> 405977 31435
## + sex 1 17.66 405959 31437
## + marital 2 99.04 405878 31437
##
## Step: AIC=31430.36
## physhlth_days ~ gen_health + menthlth_days + age_group + depression +
## smoking + bmi
##
## Df Sum of Sq RSS AIC
## <none> 405648 31430
## + sex 1 23.519 405625 31432
## + marital 2 111.120 405537 31432
tidy(mod_forward_hw, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Forward Selection Result (AIC-based)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | -0.0420 | 0.4584 | -0.0917 | 0.9270 | -0.9406 | 0.8566 |
| gen_healthFair/Poor | 10.7668 | 0.2449 | 43.9633 | 0.0000 | 10.2867 | 11.2468 |
| gen_healthGood | 1.3053 | 0.1841 | 7.0897 | 0.0000 | 0.9444 | 1.6662 |
| menthlth_days | 0.1919 | 0.0114 | 16.8108 | 0.0000 | 0.1696 | 0.2143 |
| age_group35-49 | 0.6765 | 0.2718 | 2.4894 | 0.0128 | 0.1438 | 1.2093 |
| age_group50-64 | 1.8089 | 0.2592 | 6.9799 | 0.0000 | 1.3009 | 2.3170 |
| age_group65+ | 2.0193 | 0.2468 | 8.1808 | 0.0000 | 1.5354 | 2.5031 |
| depressionYes | 1.1784 | 0.2224 | 5.2991 | 0.0000 | 0.7425 | 1.6142 |
| smokingFormer | -1.4382 | 0.2889 | -4.9776 | 0.0000 | -2.0045 | -0.8718 |
| smokingNever | -1.4223 | 0.2670 | -5.3278 | 0.0000 | -1.9456 | -0.8990 |
| bmi | 0.0325 | 0.0128 | 2.5441 | 0.0110 | 0.0074 | 0.0575 |
The forward selection process also removed sex from the model.
Finally, this chunk will perform a stepwise selection.
#Stepwise Selection
mod_stepwise_hw <- step(mod_null_hw,
scope = list(lower = mod_null_hw, upper = mlr_m1),
direction = "both", trace = 1)
## Start: AIC=34627.33
## physhlth_days ~ 1
##
## Df Sum of Sq RSS AIC
## + gen_health 2 173555 432886 31934
## + menthlth_days 1 67344 539097 33688
## + depression 1 28283 578158 34247
## + smoking 2 13891 592550 34446
## + bmi 1 11255 595186 34479
## + marital 2 7369 599072 34534
## + age_group 3 5377 601064 34562
## + sex 1 1093 605348 34615
## <none> 606441 34627
##
## Step: AIC=31934.26
## physhlth_days ~ gen_health
##
## Df Sum of Sq RSS AIC
## + menthlth_days 1 19322 413564 31571
## + depression 1 8323 424562 31781
## + smoking 2 2603 430283 31890
## + age_group 3 1623 431263 31910
## + marital 2 928 431958 31921
## + sex 1 706 432180 31923
## + bmi 1 589 432297 31925
## <none> 432886 31934
## - gen_health 2 173555 606441 34627
##
## Step: AIC=31570.96
## physhlth_days ~ gen_health + menthlth_days
##
## Df Sum of Sq RSS AIC
## + age_group 3 4601 408963 31487
## + smoking 2 1692 411871 31542
## + depression 1 1485 412079 31544
## + marital 2 1573 411991 31544
## + bmi 1 347 413217 31566
## + sex 1 235 413328 31568
## <none> 413564 31571
## - menthlth_days 1 19322 432886 31934
## - gen_health 2 125533 539097 33688
##
## Step: AIC=31487.47
## physhlth_days ~ gen_health + menthlth_days + age_group
##
## Df Sum of Sq RSS AIC
## + depression 1 1546 407417 31459
## + smoking 2 1485 407478 31462
## + bmi 1 303 408660 31484
## <none> 408963 31487
## + marital 2 186 408777 31488
## + sex 1 76 408887 31488
## - age_group 3 4601 413564 31571
## - menthlth_days 1 22300 431263 31910
## - gen_health 2 115603 524566 33475
##
## Step: AIC=31459.17
## physhlth_days ~ gen_health + menthlth_days + age_group + depression
##
## Df Sum of Sq RSS AIC
## + smoking 2 1440 405977 31435
## + bmi 1 237 407180 31457
## <none> 407417 31459
## + marital 2 135 407282 31461
## + sex 1 8 407410 31461
## - depression 1 1546 408963 31487
## - age_group 3 4662 412079 31544
## - menthlth_days 1 14996 422413 31746
## - gen_health 2 113276 520694 33418
##
## Step: AIC=31434.84
## physhlth_days ~ gen_health + menthlth_days + age_group + depression +
## smoking
##
## Df Sum of Sq RSS AIC
## + bmi 1 329 405648 31430
## <none> 405977 31435
## + sex 1 18 405959 31436
## + marital 2 99 405878 31437
## - smoking 2 1440 407417 31459
## - depression 1 1501 407478 31462
## - age_group 3 4497 410474 31517
## - menthlth_days 1 14416 420393 31712
## - gen_health 2 108818 514795 33331
##
## Step: AIC=31430.36
## physhlth_days ~ gen_health + menthlth_days + age_group + depression +
## smoking + bmi
##
## Df Sum of Sq RSS AIC
## <none> 405648 31430
## + sex 1 24 405625 31432
## + marital 2 111 405537 31432
## - bmi 1 329 405977 31435
## - depression 1 1426 407074 31456
## - smoking 2 1532 407180 31457
## - age_group 3 4485 410133 31512
## - menthlth_days 1 14349 419998 31706
## - gen_health 2 103620 509269 33246
tidy(mod_stepwise_hw, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Stepwise Selection Result (AIC-based)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | -0.0420 | 0.4584 | -0.0917 | 0.9270 | -0.9406 | 0.8566 |
| gen_healthFair/Poor | 10.7668 | 0.2449 | 43.9633 | 0.0000 | 10.2867 | 11.2468 |
| gen_healthGood | 1.3053 | 0.1841 | 7.0897 | 0.0000 | 0.9444 | 1.6662 |
| menthlth_days | 0.1919 | 0.0114 | 16.8108 | 0.0000 | 0.1696 | 0.2143 |
| age_group35-49 | 0.6765 | 0.2718 | 2.4894 | 0.0128 | 0.1438 | 1.2093 |
| age_group50-64 | 1.8089 | 0.2592 | 6.9799 | 0.0000 | 1.3009 | 2.3170 |
| age_group65+ | 2.0193 | 0.2468 | 8.1808 | 0.0000 | 1.5354 | 2.5031 |
| depressionYes | 1.1784 | 0.2224 | 5.2991 | 0.0000 | 0.7425 | 1.6142 |
| smokingFormer | -1.4382 | 0.2889 | -4.9776 | 0.0000 | -2.0045 | -0.8718 |
| smokingNever | -1.4223 | 0.2670 | -5.3278 | 0.0000 | -1.9456 | -0.8990 |
| bmi | 0.0325 | 0.0128 | 2.5441 | 0.0110 | 0.0074 | 0.0575 |
The stepwise removed sex from the model as well.
The following chunk will compare the three model selection is best.
method_comparison_hw <- tribble(
~Method, ~`Variables selected`, ~`Adj. R^2`, ~AIC, ~BIC,
"Maximum model",
length(coef(mlr_m1)) - 1,
round(glance(mlr_m1)$adj.r.squared, 4),
round(AIC(mlr_m1), 1),
round(BIC(mlr_m1), 1),
"Backward (AIC)",
length(coef(mod_backward_hw)) - 1,
round(glance(mod_backward_hw)$adj.r.squared, 4),
round(AIC(mod_backward_hw), 1),
round(BIC(mod_backward_hw), 1),
"Forward (AIC)",
length(coef(mod_forward_hw)) - 1,
round(glance(mod_forward_hw)$adj.r.squared, 4),
round(AIC(mod_forward_hw), 1),
round(BIC(mod_forward_hw), 1),
"Stepwise (AIC)",
length(coef(mod_stepwise_hw)) - 1,
round(glance(mod_stepwise_hw)$adj.r.squared, 4),
round(AIC(mod_stepwise_hw), 1),
round(BIC(mod_stepwise_hw), 1)
)
method_comparison_hw |>
kable(caption = "Table 6: Comparison of Variable Selection Methods") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Method | Variables selected | Adj. R^2 | AIC | BIC |
|---|---|---|---|---|
| Maximum model | 13 | 0.3302 | 54139.0 | 54243.8 |
| Backward (AIC) | 10 | 0.3303 | 54135.4 | 54219.2 |
| Forward (AIC) | 10 | 0.3303 | 54135.4 | 54219.2 |
| Stepwise (AIC) | 10 | 0.3303 | 54135.4 | 54219.2 |
Interpretation: All three automated methods selected the same model with 9 predictor terms (Adjusted R² = 0.346, AIC = 56139.8, BIC = 56237.7). This model has a lower AIC and BIC than the maximum model (AIC = 56141.4, BIC = 56246.2), confirming that removing sex improved parsimony without sacrificing fit. General Health Status, unhealthy mental health days, age group, marital status, depression, bmi and smoking status all held in all three models. The modest improvement in BIC (8.5 points lower) is more notable than the AIC improvement (1.6 points lower), consistent with BIC’s stronger preference for simpler models.In practice, the maximum model and the selected model would produce very similar predictions, but the selected model is preferred for its efficiency.
# Full Linear Model: Full multivariable model
mlr_m1 <- lm(physhlth_days ~ menthlth_days + bmi + depression + sex + age_group + smoking + gen_health + marital,
data = brfss_23_hw4)
summary(mlr_m1)
##
## Call:
## lm(formula = physhlth_days ~ menthlth_days + bmi + depression +
## sex + age_group + smoking + gen_health + marital, data = brfss_23_hw4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.9985 -2.7171 -1.3056 0.5627 30.0197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.009531 0.483882 -0.020 0.98429
## menthlth_days 0.191592 0.011450 16.732 < 2e-16 ***
## bmi 0.033258 0.012774 2.604 0.00924 **
## depressionYes 1.145953 0.225315 5.086 3.74e-07 ***
## sexFemale 0.078565 0.165641 0.474 0.63530
## age_group35-49 0.580859 0.286844 2.025 0.04290 *
## age_group50-64 1.679050 0.281832 5.958 2.67e-09 ***
## age_group65+ 1.852247 0.276606 6.696 2.28e-11 ***
## smokingFormer -1.433778 0.289619 -4.951 7.55e-07 ***
## smokingNever -1.412889 0.268650 -5.259 1.48e-07 ***
## gen_healthFair/Poor 10.751555 0.246041 43.698 < 2e-16 ***
## gen_healthGood 1.299468 0.184367 7.048 1.96e-12 ***
## maritalNever married -0.164972 0.243975 -0.676 0.49894
## maritalPreviously Married 0.215368 0.199590 1.079 0.28060
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.126 on 7986 degrees of freedom
## Multiple R-squared: 0.3313, Adjusted R-squared: 0.3302
## F-statistic: 304.4 on 13 and 7986 DF, p-value: < 2.2e-16
tidy(mlr_m1, conf.int = TRUE) %>%
mutate(
term = dplyr::recode(term,
"(Intercept)" = "Intercept",
"menthlth_days" = "Mental health days",
"bmi" = "BMI (kg/m^2)",
"depression" = "Depression: Yes (ref = No)",
"sexFemale" = "Sex: Female (ref = Male)",
"age_group35-49" = "Age Group: 35-49 (ref = <25)",
"age_group50-64" = "Age Group: 50-64 (ref = <25)",
"age_group65+" = "Age Group: 65+ (ref = <25)",
"smokingFormer" = "Smoking: Former (ref = Current)",
"smokingNever" = "Smoking: Never (ref = Current)",
"gen_healthFair/Poor" = "General Health Status: Fair/Poor (ref = Excellent/VG)",
"gen_healthGood" = "General Health Status: Good (ref = Excellent/VG)",
"maritalNever married " = "Marital Status: Never Married (ref = Married/Partnered)",
"maritalPreviously Married" = "Marital Status: Good (ref = Married/Partnered)",
),
across(where(is.numeric), ~ round(., 4))
) %>%
kable(
caption = "Table 7. Multiple Linear Regression: Phyically Unhealthy Days ~ Multiple Predictors (BRFSS 2023, n = 8,000)",
col.names = c("Term", "Estimate (β̂)", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE) %>%
row_spec(c(2, 3, 4,6,7,8,10,11,12,14), background = "#EBF5FB") # highlight key predictors
| Term | Estimate (β̂ |
Std. Erro
| ||||
|---|---|---|---|---|---|---|
| Intercept | -0.0095 | 0.4839 | -0.0197 | 0.9843 | -0.9581 | 0.9390 |
| Mental health days | 0.1916 | 0.0115 | 16.7324 | 0.0000 | 0.1691 | 0.2140 |
| BMI (kg/m^2) | 0.0333 | 0.0128 | 2.6036 | 0.0092 | 0.0082 | 0.0583 |
| depressionYes | 1.1460 | 0.2253 | 5.0860 | 0.0000 | 0.7043 | 1.5876 |
| Sex: Female (ref = Male) | 0.0786 | 0.1656 | 0.4743 | 0.6353 | -0.2461 | 0.4033 |
| Age Group: 35-49 (ref = <25) | 0.5809 | 0.2868 | 2.0250 | 0.0429 | 0.0186 | 1.1431 |
| Age Group: 50-64 (ref = <25) | 1.6790 | 0.2818 | 5.9576 | 0.0000 | 1.1266 | 2.2315 |
| Age Group: 65+ (ref = <25) | 1.8522 | 0.2766 | 6.6963 | 0.0000 | 1.3100 | 2.3945 |
| Smoking: Former (ref = Current) | -1.4338 | 0.2896 | -4.9506 | 0.0000 | -2.0015 | -0.8660 |
| Smoking: Never (ref = Current) | -1.4129 | 0.2687 | -5.2592 | 0.0000 | -1.9395 | -0.8863 |
| General Health Status: Fair/Poor (ref = Excellent/VG) | 10.7516 | 0.2460 | 43.6982 | 0.0000 | 10.2693 | 11.2339 |
| General Health Status: Good (ref = Excellent/VG) | 1.2995 | 0.1844 | 7.0483 | 0.0000 | 0.9381 | 1.6609 |
| maritalNever married | -0.1650 | 0.2440 | -0.6762 | 0.4989 | -0.6432 | 0.3133 |
| Marital Status: Good (ref = Married/Partnered) | 0.2154 | 0.1996 | 1.0791 | 0.2806 | -0.1759 | 0.6066 |
glance(mlr_m1) |>
dplyr::select(r.squared, adj.r.squared, sigma, AIC, BIC, df.residual) |>
mutate(across(everything(), \(x) round(x, 3))) |>
kable(caption = "Maximum Model: Fit Statistics") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| r.squared | adj.r.squared | sigma | AIC | BIC | df.residual |
|---|---|---|---|---|---|
| 0.331 | 0.33 | 7.126 | 54138.96 | 54243.77 | 7986 |
Mental Unhealthy Days: Each additional unhealthy mental health day is associated with an estimated .2519 increase in physically unhealthy days on average, adjusting for all other covariates (95% CI: .2292 to .2746).
Depression: Yes: Compared to those without depression (the reference group), those with depression report an estimated .9862 more physically unhealthy days on average, holding all other variables constant.
General Health Status: Fair/Poor Compared to those with Excellent/VG (the reference group), those with fair/poor general health status report an estimated 9.8970 more physically unhealthy days on average, holding all other variables constant.
par(mfrow = c(2, 2))
plot(mlr_m1, which = 1:4, col = adjustcolor("steelblue", alpha.f = 0.3), pch = 16)
par(mfrow = c(1, 1))
Residuals vs. Fitted: Is there evidence of non-linearity or non-constant variance? Normal Q-Q: Do the residuals approximately follow a normal distribution? Where do they deviate? Scale-Location: Is the spread of residuals roughly constant across fitted values? Residuals vs. Leverage: Are there any highly influential observations (large Cook’s distance)?
brfss_23_hw4 <- brfss_23_hw4 |>
mutate(
fpd = factor(
ifelse(physhlth_days >= 14, 1, 0),
levels = c(0, 1),
labels = c("No", "Yes")
))
brfss_23_hw4 %>%
count(fpd) %>%
mutate(percent = n / sum(n) * 100)
## # A tibble: 2 × 3
## fpd n percent
## <fct> <int> <dbl>
## 1 No 6924 86.6
## 2 Yes 1076 13.4
brfss_23_hw4 %>%
dplyr::select(fpd) %>%
tbl_summary(
label = list(fpd ~ "Frequent Physical Distress"),
statistic = list(all_categorical() ~ "{n} ({p}%)")
)
| Characteristic | N = 8,0001 |
|---|---|
| Frequent Physical Distress | 1,076 (13%) |
| 1 n (%) | |
1,504 respondents (19%) reported having physically unhealthy days in the last 30 days.
I chose to view general health status as one predictor to look at more closely because the association was rather high in the multi linear model compared to the other predictors.
mod_simple_hw <- glm(fpd ~ gen_health, data = brfss_23_hw4,
family = binomial(link = "logit"))
tidy(mod_simple_hw, conf.int = TRUE, exponentiate = TRUE) |>
kable(digits = 3, caption = "Table 8: Simple Logistic Regression: FPD ~ Exercise (Log-Odds Scale)") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 0.035 | 0.089 | -37.934 | 0 | 0.029 | 0.041 |
| gen_healthFair/Poor | 28.135 | 0.104 | 32.225 | 0 | 23.042 | 34.589 |
| gen_healthGood | 3.055 | 0.110 | 10.141 | 0 | 2.467 | 3.800 |
Interpretation Compared to those with Excellent/very good general health status, those with Fair/poor general health status has the odds of frequent physical distress are multiplied by 23.064, 95% CI [19.013, 28.202], this is statistically significant at a 95% confidence interval and the CI does not cross 1.
Compared to those with Excellent/very good general health status, those with Good general health status has the the odds of frequent physical distress are multiplied by 2.627, 95% CI [2.120, 3.272], this is statistically significant at a 95% confidence interval and the CI does not cross 1.
mod_logistic_hw <- glm(fpd ~ menthlth_days + bmi + depression + sex + age_group + smoking + gen_health + marital,
data = brfss_23_hw4,
family = binomial(link = "logit"))
tidy(mod_logistic_hw, conf.int = TRUE, exponentiate = TRUE) |>
kable(digits = 3, caption = "Table 9: Simple Logistic Regression:FMD ~ all Predictors (Log-Odds Scale)") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 0.016 | 0.233 | -17.684 | 0.000 | 0.010 | 0.026 |
| menthlth_days | 1.050 | 0.004 | 11.364 | 0.000 | 1.042 | 1.059 |
| bmi | 1.010 | 0.005 | 1.807 | 0.071 | 0.999 | 1.020 |
| depressionYes | 1.395 | 0.096 | 3.458 | 0.001 | 1.154 | 1.683 |
| sexFemale | 1.063 | 0.081 | 0.751 | 0.453 | 0.907 | 1.247 |
| age_group35-49 | 1.448 | 0.160 | 2.308 | 0.021 | 1.059 | 1.986 |
| age_group50-64 | 2.357 | 0.152 | 5.630 | 0.000 | 1.754 | 3.188 |
| age_group65+ | 2.426 | 0.152 | 5.847 | 0.000 | 1.809 | 3.278 |
| smokingFormer | 0.639 | 0.121 | -3.709 | 0.000 | 0.504 | 0.810 |
| smokingNever | 0.638 | 0.112 | -3.999 | 0.000 | 0.513 | 0.796 |
| gen_healthFair/Poor | 17.350 | 0.110 | 25.844 | 0.000 | 14.014 | 21.610 |
| gen_healthGood | 2.441 | 0.114 | 7.858 | 0.000 | 1.958 | 3.057 |
| maritalNever married | 0.931 | 0.123 | -0.583 | 0.560 | 0.730 | 1.182 |
| maritalPreviously Married | 1.068 | 0.090 | 0.722 | 0.470 | 0.894 | 1.274 |
Mental Unhealthy Days: For every one-unit increase in mentally unhealthy days, the odds of frequent physical distress are multiplied by 1.060, 95% CI [1.052, 1.068] holding all other variables constant..
Depression: Yes: Compared to those without depression (the reference group), the odds of frequent physical distress are multiplied by 1.194, 95% CI [1.007, 1.413] holding all other variables constant.
General Health Status: Fair/Poor Compared to those with Excellent/VG (the reference group), those with fair/poor general health the odds of frequent physical distress are multiplied by 13.311, 95% CI [10.856, 16.438] holding all other variables constant.
#SImple Logistic Regression - income
mod_nosmoking <- glm(fpd ~ menthlth_days + bmi + depression + sex + age_group + gen_health + marital, data = brfss_23_hw4,
family = binomial(link = "logit"))
tidy(mod_nosmoking, conf.int = TRUE, exponentiate = TRUE) |>
kable(digits = 3, caption = " Table 10: Simple Logistic Regression: FPD ~ all but smoking (Log-Odds Scale)") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 0.011 | 0.214 | -21.052 | 0.000 | 0.007 | 0.017 |
| menthlth_days | 1.051 | 0.004 | 11.594 | 0.000 | 1.042 | 1.060 |
| bmi | 1.007 | 0.005 | 1.350 | 0.177 | 0.997 | 1.018 |
| depressionYes | 1.415 | 0.096 | 3.621 | 0.000 | 1.171 | 1.706 |
| sexFemale | 1.046 | 0.080 | 0.555 | 0.579 | 0.893 | 1.224 |
| age_group35-49 | 1.539 | 0.159 | 2.704 | 0.007 | 1.128 | 2.108 |
| age_group50-64 | 2.518 | 0.151 | 6.104 | 0.000 | 1.878 | 3.398 |
| age_group65+ | 2.472 | 0.151 | 5.993 | 0.000 | 1.845 | 3.337 |
| gen_healthFair/Poor | 18.210 | 0.110 | 26.476 | 0.000 | 14.732 | 22.645 |
| gen_healthGood | 2.532 | 0.113 | 8.221 | 0.000 | 2.033 | 3.167 |
| maritalNever married | 0.960 | 0.122 | -0.333 | 0.739 | 0.754 | 1.218 |
| maritalPreviously Married | 1.099 | 0.090 | 1.047 | 0.295 | 0.921 | 1.310 |
# Likelihood ratio test
lr_test_hw <- anova(mod_nosmoking, mod_logistic_hw, test = "Chisq")
lr_test_hw |>
kable(digits = 3, caption = " Table 11: Likelihood Ratio Test: Full Model vs. Null Smoking") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
|---|---|---|---|---|
| 7988 | 4513.487 | NA | NA | NA |
| 7986 | 4496.368 | 2 | 17.119 | 0 |
glance(mod_logistic_hw) |>
dplyr::select(null.deviance, deviance, AIC, BIC, nobs) |>
kable(digits = 1, caption = "Model Fit Statistics") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| null.deviance | deviance | AIC | BIC | nobs |
|---|---|---|---|---|
| 6317.6 | 4496.4 | 4524.4 | 4622.2 | 8000 |
roc_obj_hw <- roc(
response = brfss_23_hw4$fpd,
predictor = fitted(mod_logistic_hw),
levels = c("No", "Yes"),
direction = "<"
)
auc_value <- auc(roc_obj_hw)
ggroc(roc_obj_hw, color = "steelblue", linewidth = 1.2) +
geom_abline(slope = 1, intercept = 1, linetype = "dashed", color = "red") +
labs(title = "Table 12: ROC Curve for Frequent Physical Distress Model",
subtitle = paste0("AUC = ", round(auc_value, 3)),
x = "Specificity", y = "Sensitivity") +
theme_minimal()
This model is able to distinguish between individuals with and without physical distress days by having an AUC of 0.851 with excellent discrimination.
hl_test_hw <- hoslem.test(
x = as.numeric(brfss_23_hw4$fpd) - 1,
y = fitted(mod_logistic_hw),
g = 10
)
hl_test_hw
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: as.numeric(brfss_23_hw4$fpd) - 1, fitted(mod_logistic_hw)
## X-squared = 5.8446, df = 8, p-value = 0.6646
brfss_23_hw4 |>
mutate(pred_prob_hw = fitted(mod_logistic_hw),
obs_outcome_hw = as.numeric(fpd) - 1,
decile_hw = ntile(pred_prob_hw, 10)) |>
group_by(decile_hw) |>
summarise(
mean_pred = mean(pred_prob_hw),
mean_obs = mean(obs_outcome_hw),
.groups = "drop"
) |>
ggplot(aes(x = mean_pred, y = mean_obs)) +
geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
geom_point(size = 3, color = "steelblue") +
geom_line(color = "steelblue") +
labs(title = " Table 13: Calibration Plot: Observed vs. Predicted Probability of FPD",
subtitle = "Points should fall on the dashed line for perfect calibration",
x = "Mean Predicted Probability (within decile)",
y = "Observed Proportion (within decile)") +
theme_minimal()
Interpretation: A small p-value (< 0.05) suggests that the model does not fit well in some regions of predicted probability. With large samples (like ours), the Hosmer-Lemeshow test can be over-powered and detect trivial misfit. Always pair it with a calibration plot. The calibration plot is well calibrated as the pointed tend to fall along the dashed line.
A. Statistical Insights:
The results from this analysis suggest that menthlth_days, bmi, sex, depression, age_group, smoking status , general health status, and marital status have a significant statistcal burden among U.S adults. General health status had the strongest association in both the linear and logistic models.The key limitations for cross-sectional survey data is that temporality can not be determined. Exercise and income are two potential confounders.
B. Linear versus Logistic Regression:
The linear regression model tells us what variables are correlated with physical distress days and identify who is likely to report more physical distress days. The logistic regression is used for binary ouctomes such as those who experiecne more than 14 physical disress days in 30 days, yes or no. The logistic regression tell us the probability of the event occuring. The model selection criteria differ between the two frameworks R squared measures the variances whereas AUC measures the level of discrimination of how well the model is able to distinguish between those with and without the outcome.
C. AI Transparency I did not use AI for this assignment.