This step is loading the necessary packages for this assignment.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(haven)
## Warning: package 'haven' was built under R version 4.5.2
library(janitor)
## Warning: package 'janitor' was built under R version 4.5.2
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(knitr)
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.5.2
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
library(broom)
library(gtsummary)
## Warning: package 'gtsummary' was built under R version 4.5.2
library(GGally)
## Warning: package 'GGally' was built under R version 4.5.2
library(car)
## Warning: package 'car' was built under R version 4.5.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.5.2
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(ggeffects)
## Warning: package 'ggeffects' was built under R version 4.5.2
library(plotly)
## Warning: package 'plotly' was built under R version 4.5.2
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))
## Setting theme "Compact"
## Setting theme "Compact"
Importing the BRFSS 2023 data that will be used for the analysis.
brfss_23 <- read_xpt(
"C:/Users/ariel/OneDrive/DrPH - Albany/4. Spring Semester 2026/HEPI 553/HEPI_553/data/LLCP2023.XPT"
) %>%
clean_names()
brfss_23 <- brfss_23 %>%
select(menthlth, physhlth, bmi5, sexvar, exerany2, ageg5yr, incomg1, educa, smoker3)
This next step will be cleaning the variables that are needed for the assignment.
brfss_23_clean <- brfss_23 |>
mutate(
# Outcome: mentally unhealthy days in past 30
menthlth_days = case_when(
menthlth == 88 ~ 0,
menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
TRUE ~ NA_real_
),
# Physical health days
physhlth_days = case_when(
physhlth == 88 ~ 0,
physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
TRUE ~ NA_real_
),
#BMI
bmi = ifelse(bmi5 > 0, bmi5 / 100, NA_real_),
# Sex
sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
#Exercise
exercise = factor(exerany2, levels = c(2, 1), labels = c("No", "Yes")),
# Age groups (13-level raw BRFSS variable Age Group)
# 1 = 18 to 24
# 2 = 25 - 29
# 3 = 30 - 34
# 4 = 35 - 39
# 5 = 40 - 44
# 6 = 45 - 49
# 7 = 50 - 54
# 8 = 55 - 59
# 9 = 60 - 64
# 10 = 65 - 69
# 11 = 70 - 74
# 12 = 75 - 79
# 13 = 80+
# 14 = Refused
age_group = factor(case_when(
ageg5yr == 1 ~ "18-24",
ageg5yr == 2 ~ "25-29",
ageg5yr == 3 ~ "30-34",
ageg5yr == 4 ~ "35-39",
ageg5yr == 5 ~ "40-44",
ageg5yr == 6 ~ "45-49",
ageg5yr == 7 ~ "50-54",
ageg5yr == 8 ~ "55-59",
ageg5yr == 9 ~ "60-64",
ageg5yr == 10 ~ "65-69",
ageg5yr == 11 ~ "70-74",
ageg5yr == 12 ~ "75-79",
ageg5yr == 13 ~ "80+",
TRUE ~ NA_character_
), levels = c("18-24", "25-29", "30-34", "35-39", "40-44","45-49", "50-54", "55-59", "60-64", "65-69","70-74", "75-79", "80+")),
# Income groups (7-level raw BRFSS variable EDUCA)
# 1 = < $15,000
# 2 = $15,000 - $24,000
# 3 = $25,000 - $34,000
# 4 = $35,000 - $49,000
# 5 = $50,000 - $99,000
# 6 = $100,000 - $199,000
# 7 = >= $200,000
# 9 = Refused
income = factor(case_when(
incomg1 == 1 ~ "< $15,000",
incomg1 == 2 ~ "$15,000 - $24,000",
incomg1 == 3 ~ "$25,000 - $34,000",
incomg1 == 4 ~ "$35,000 - $49,000",
incomg1 == 5 ~ "$50,000 - $99,000",
incomg1 == 6 ~ "$100,000 - $199,000",
incomg1 == 7 ~ ">= $200,000",
TRUE ~ NA_character_
), levels = c("< $15,000", "$15,000 - $24,000", "$25,000 - $34,000", "$35,000 - $49,000", "$50,000 - $100,000","$100,000 - $199,000", ">= $200,000")),
# Education (6-level raw BRFSS variable EDUCA)
# 1 = Never attended school or only kindergarten
# 2 = Grades 1 through 8 (Elementary)
# 3 = Grades 9 through 11 (Some high school)
# 4 = Grade 12 or GED (High school graduate)
# 5 = College 1 year to 3 years (Some college or technical school)
# 6 = College 4 years or more (College graduate)
# 9 = Refused
education = factor(case_when(
educa %in% c(1, 2) ~ "Less than HS",
educa == 3 ~ "HS diploma or GED",
educa == 4 ~ "Some College or Technical School",
educa == 5 ~ "College Graduate",
educa == 6 ~ "Graduate or Professional Degree",
TRUE ~ NA_character_
), levels = c("Less than HS", "HS diploma or GED", "Some College or Technical School", "College graduate", "Graduate or Professional Degree")),
# Smoker (4-cat raw BRFSS variable)
# 1 = Current smoker - now smokes every day
# 2 = Current smoker - now smokes some days
# 3 = Former smoker
# 4 = Never smoked
# 9 = Refused
smoking = factor(case_when(
smoker3 == 1 ~ "Currently daily smoker",
smoker3 == 2 ~ "Current some-day smoker",
smoker3 == 3 ~ "Former Smoker",
smoker3 == 4 ~ "Never Smoker",
TRUE ~ NA_character_),
levels = c("Currently daily smoker", "Current some-day smoker","Former Smoker","Never Smoker"))) %>%
filter(
!is.na(menthlth_days),
!is.na(physhlth_days),
!is.na(exercise),
!is.na(age_group),
!is.na(sex),
!is.na(education),
!is.na(income),
!is.na(bmi),
!is.na(smoking)
)
# Reproducible random sample
set.seed(1220)
brfss_23_hw <- brfss_23_clean |>
select(menthlth_days, physhlth_days, bmi, exercise, age_group, sex, income, education, smoking) |>
slice_sample(n = 8000)
# Save for review
saveRDS(brfss_23_hw,
"C:/Users/ariel/OneDrive/DrPH - Albany/4. Spring Semester 2026/HEPI 553/HEPI_553/data/brfss_hw3_2023.rds")
tibble(Metric = c("Observations", "Variables"),
Value = c(nrow(brfss_23_hw), ncol(brfss_23_hw))) |>
kable(caption = "Analytic Dataset Dimensions") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| Metric | Value |
|---|---|
| Observations | 8000 |
| Variables | 9 |
In this code chunk, I will be assessing the missing data in the orginal brfss 2023 data.
brfss_23 %>%
select(menthlth, physhlth,bmi5,sexvar,exerany2, ageg5yr, incomg1, educa ,smoker3) %>%
summarise(across(everything(), ~ sum(is.na(.) | . %in% c(77, 88, 99)) / n() * 100)) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "Pct_Missing_or_DK") %>%
mutate(Pct_Missing_or_DK = round(Pct_Missing_or_DK, 1)) %>%
arrange(desc(Pct_Missing_or_DK)) %>%
kable(
caption = "Table 1. Missing or Don't Know/Refused (%) — BRFSS 2023 Full Sample",
col.names = c("Variable", "% Missing / DK / Refused")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Variable | % Missing / DK / Refused |
|---|---|
| physhlth | 61.7 |
| menthlth | 61.2 |
| bmi5 | 9.4 |
| sexvar | 0.0 |
| exerany2 | 0.0 |
| ageg5yr | 0.0 |
| incomg1 | 0.0 |
| educa | 0.0 |
| smoker3 | 0.0 |
brfss_23_hw %>%
select(menthlth_days, physhlth_days, exercise , age_group, income, sex, bmi, smoking) %>%
tbl_summary(
label = list(
menthlth_days ~ "Mentally unhealthy days (past 30)",
physhlth_days ~ "Physically unhealthy days (past 30)",
exercise ~ "Any physical activity (past 30 days)",
age_group ~ "Age (years)",
income ~ "Household income",
sex ~ "Sex",
bmi ~ "BMI (kg/m²)",
smoking ~ "Smoking Status"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1,
missing = "no"
) %>%
add_n() %>%
bold_labels() %>%
modify_caption("**Table 2. Descriptive Statistics — BRFSS 2023 Analytic Sample (n = 8,000)**")
| Characteristic | N | N = 8,0001 |
|---|---|---|
| Mentally unhealthy days (past 30) | 8,000 | 4.4 (8.3) |
| Physically unhealthy days (past 30) | 8,000 | 4.6 (8.9) |
| Any physical activity (past 30 days) | 8,000 | 6,154 (77%) |
| Age (years) | 8,000 | |
| 18-24 | 423 (5.3%) | |
| 25-29 | 381 (4.8%) | |
| 30-34 | 455 (5.7%) | |
| 35-39 | 577 (7.2%) | |
| 40-44 | 618 (7.7%) | |
| 45-49 | 574 (7.2%) | |
| 50-54 | 713 (8.9%) | |
| 55-59 | 748 (9.4%) | |
| 60-64 | 810 (10%) | |
| 65-69 | 852 (11%) | |
| 70-74 | 679 (8.5%) | |
| 75-79 | 554 (6.9%) | |
| 80+ | 616 (7.7%) | |
| Household income | 8,000 | |
| < $15,000 | 568 (7.1%) | |
| $15,000 - $24,000 | 988 (12%) | |
| $25,000 - $34,000 | 1,188 (15%) | |
| $35,000 - $49,000 | 1,413 (18%) | |
| $50,000 - $100,000 | 0 (0%) | |
| $100,000 - $199,000 | 2,732 (34%) | |
| >= $200,000 | 1,111 (14%) | |
| Sex | 8,000 | |
| Male | 3,982 (50%) | |
| Female | 4,018 (50%) | |
| BMI (kg/m²) | 8,000 | 28.5 (6.5) |
| Smoking Status | 8,000 | |
| Currently daily smoker | 586 (7.3%) | |
| Current some-day smoker | 289 (3.6%) | |
| Former Smoker | 2,114 (26%) | |
| Never Smoker | 5,011 (63%) | |
| 1 Mean (SD); n (%) | ||
mlr_1 <- lm(menthlth_days ~ physhlth_days + bmi + sex + exercise + age_group + income + education + smoking,
data = brfss_23_hw)
summary(mlr_1)
##
## Call:
## lm(formula = menthlth_days ~ physhlth_days + bmi + sex + exercise +
## age_group + income + education + smoking, data = brfss_23_hw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.4646 -3.7646 -1.7218 0.8144 29.7247
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.42489 0.83736 10.061 < 2e-16
## physhlth_days 0.27658 0.01006 27.486 < 2e-16
## bmi 0.03099 0.01354 2.288 0.022177
## sexFemale 1.34866 0.17156 7.861 4.30e-15
## exerciseYes -0.41491 0.21972 -1.888 0.059018
## age_group25-29 -0.24722 0.53387 -0.463 0.643317
## age_group30-34 -1.88075 0.51573 -3.647 0.000267
## age_group35-39 -2.77938 0.49222 -5.647 1.69e-08
## age_group40-44 -3.29517 0.48665 -6.771 1.37e-11
## age_group45-49 -2.73500 0.49684 -5.505 3.81e-08
## age_group50-54 -2.92787 0.47506 -6.163 7.48e-10
## age_group55-59 -4.32729 0.47062 -9.195 < 2e-16
## age_group60-64 -4.42429 0.46166 -9.583 < 2e-16
## age_group65-69 -5.65667 0.45712 -12.375 < 2e-16
## age_group70-74 -6.36341 0.47649 -13.355 < 2e-16
## age_group75-79 -6.12956 0.49560 -12.368 < 2e-16
## age_group80+ -6.57363 0.48348 -13.597 < 2e-16
## income$15,000 - $24,000 -1.00417 0.39745 -2.527 0.011538
## income$25,000 - $34,000 -0.97399 0.38967 -2.500 0.012456
## income$35,000 - $49,000 -1.54576 0.38518 -4.013 6.05e-05
## income$100,000 - $199,000 -2.45837 0.38373 -6.407 1.57e-10
## income>= $200,000 -2.69793 0.43130 -6.255 4.17e-10
## educationHS diploma or GED 1.31318 0.60605 2.167 0.030280
## educationSome College or Technical School 1.83884 0.51711 3.556 0.000379
## educationGraduate or Professional Degree 2.11765 0.52622 4.024 5.77e-05
## smokingCurrent some-day smoker -0.97626 0.54242 -1.800 0.071924
## smokingFormer Smoker -2.34531 0.36303 -6.460 1.11e-10
## smokingNever Smoker -3.36439 0.34610 -9.721 < 2e-16
##
## (Intercept) ***
## physhlth_days ***
## bmi *
## sexFemale ***
## exerciseYes .
## age_group25-29
## age_group30-34 ***
## age_group35-39 ***
## age_group40-44 ***
## age_group45-49 ***
## age_group50-54 ***
## age_group55-59 ***
## age_group60-64 ***
## age_group65-69 ***
## age_group70-74 ***
## age_group75-79 ***
## age_group80+ ***
## income$15,000 - $24,000 *
## income$25,000 - $34,000 *
## income$35,000 - $49,000 ***
## income$100,000 - $199,000 ***
## income>= $200,000 ***
## educationHS diploma or GED *
## educationSome College or Technical School ***
## educationGraduate or Professional Degree ***
## smokingCurrent some-day smoker .
## smokingFormer Smoker ***
## smokingNever Smoker ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.508 on 7972 degrees of freedom
## Multiple R-squared: 0.182, Adjusted R-squared: 0.1792
## F-statistic: 65.68 on 27 and 7972 DF, p-value: < 2.2e-16
tidy(mlr_1, conf.int = TRUE) %>%
mutate(
term = dplyr::recode(term,
"(Intercept)" = "Intercept",
"physhlth_days" = "Physical health days",
"bmi" = "BMI (kg/m²)",
"sexFemale" = "Sex: Female (ref = Male)",
"exerciseNo" = "Exercise: No (ref = Yes)",
"age_group25-29" = "Age group 25 - 29 (ref = 18 - 24)",
"age_group30-34" = "Age group 30 - 34 (ref = 18 - 24)",
"age_group35-39" = "Age group 35 - 39 (ref = 18 - 24)",
"age_group40-44" = "Age group 40 - 44 (ref = 18 - 24)",
"age_group45-49" = "Age group 45 - 49 (ref = 18 - 24)",
"age_group50-54" = "Age group 50 - 54 (ref = 18 - 24)",
"age_group55-59" = "Age group 55 - 59 (ref = 18 - 24)",
"age_group60-64" = "Age group 60 - 64 (ref = 18 - 24)",
"age_group65-69" = "Age group 65 - 69 (ref = 18 - 24)",
"age_group70-74" = "Age group 70 - 74 (ref = 18 - 24)",
"age_group75-79" = "Age group 75 - 79 (ref = 18 - 24)",
"age_group80+" = "Age group 80+ (ref = 18 - 24)",
"income$15,000 - $24,000" = "Income $15,000 - $24,000 (ref = Less than $15,000)",
"income$25,000 - $34,000" = "Income $25,000 - $34,000 (ref = Less than $15,000)",
"income$35,000 - $49,000" = "Income $35,000 - $49,000 (ref = Less than $15,000)",
"income$100,000 - $199,000" = "Income $100,000 - $199,000 (ref = Less than $15,000)",
"income>= $200,000" = "Income >= $200,000 (ref = Less than $15,000)",
"educationHS diploma or GED" = "HS Diploma or GED (ref = Less than HS)",
"educationSome College or Technical School" = "Some College or Technical School (ref = Less than HS)",
"educationGraduate or Professional Degree" = "Graduate or Professional Degree (ref = Less than HS)",
"smokingCurrent some-day smoker" = "Current some-day Smoker (ref = Currently daily smoker)",
"smokingFormer Smoker" = "Former Smoker (ref = Currently daily smoker)",
"smokingNever Smoker" = "Never Smoker (ref = Currently daily smoker)"
),
across(where(is.numeric), ~ round(., 4))
) %>%
kable(
caption = "Table 3. Multiple Linear Regression: Mentally Unhealthy Days ~ Multiple Predictors (BRFSS 2020, n = 5,000)",
col.names = c("Term", "Estimate (β̂)", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE) %>%
row_spec(c(2, 3,4,5,7,8,9,10,11,12,13,14,15,16,17,18,19,20,22,23,24,25,26,27,28), background = "#EBF5FB") # highlight key predictors
| Term | Estimate (β̂ |
Std. Erro
| ||||
|---|---|---|---|---|---|---|
| Intercept | 8.4249 | 0.8374 | 10.0613 | 0.0000 | 6.7834 | 10.0663 |
| Physical health days | 0.2766 | 0.0101 | 27.4864 | 0.0000 | 0.2569 | 0.2963 |
| BMI (kg/m²) | 0.0310 | 0.0135 | 2.2878 | 0.0222 | 0.0044 | 0.0575 |
| Sex: Female (ref = Male) | 1.3487 | 0.1716 | 7.8611 | 0.0000 | 1.0124 | 1.6850 |
| exerciseYes | -0.4149 | 0.2197 | -1.8883 | 0.0590 | -0.8456 | 0.0158 |
| Age group 25 - 29 (ref = 18 - 24) | -0.2472 | 0.5339 | -0.4631 | 0.6433 | -1.2937 | 0.7993 |
| Age group 30 - 34 (ref = 18 - 24) | -1.8808 | 0.5157 | -3.6468 | 0.0003 | -2.8917 | -0.8698 |
| Age group 35 - 39 (ref = 18 - 24) | -2.7794 | 0.4922 | -5.6466 | 0.0000 | -3.7443 | -1.8145 |
| Age group 40 - 44 (ref = 18 - 24) | -3.2952 | 0.4867 | -6.7711 | 0.0000 | -4.2491 | -2.3412 |
| Age group 45 - 49 (ref = 18 - 24) | -2.7350 | 0.4968 | -5.5047 | 0.0000 | -3.7089 | -1.7611 |
| Age group 50 - 54 (ref = 18 - 24) | -2.9279 | 0.4751 | -6.1631 | 0.0000 | -3.8591 | -1.9966 |
| Age group 55 - 59 (ref = 18 - 24) | -4.3273 | 0.4706 | -9.1949 | 0.0000 | -5.2498 | -3.4048 |
| Age group 60 - 64 (ref = 18 - 24) | -4.4243 | 0.4617 | -9.5835 | 0.0000 | -5.3293 | -3.5193 |
| Age group 65 - 69 (ref = 18 - 24) | -5.6567 | 0.4571 | -12.3746 | 0.0000 | -6.5527 | -4.7606 |
| Age group 70 - 74 (ref = 18 - 24) | -6.3634 | 0.4765 | -13.3547 | 0.0000 | -7.2975 | -5.4294 |
| Age group 75 - 79 (ref = 18 - 24) | -6.1296 | 0.4956 | -12.3679 | 0.0000 | -7.1011 | -5.1581 |
| Age group 80+ (ref = 18 - 24) | -6.5736 | 0.4835 | -13.5965 | 0.0000 | -7.5214 | -5.6259 |
| Income $15,000 - $24,000 (ref = Less than $15,000) | -1.0042 | 0.3974 | -2.5266 | 0.0115 | -1.7833 | -0.2251 |
| Income $25,000 - $34,000 (ref = Less than $15,000) | -0.9740 | 0.3897 | -2.4995 | 0.0125 | -1.7378 | -0.2101 |
| Income $35,000 - $49,000 (ref = Less than $15,000) | -1.5458 | 0.3852 | -4.0130 | 0.0001 | -2.3008 | -0.7907 |
| Income $100,000 - $199,000 (ref = Less than $15,000) | -2.4584 | 0.3837 | -6.4066 | 0.0000 | -3.2106 | -1.7062 |
| Income >= $200,000 (ref = Less than $15,000) | -2.6979 | 0.4313 | -6.2553 | 0.0000 | -3.5434 | -1.8525 |
| HS Diploma or GED (ref = Less than HS) | 1.3132 | 0.6060 | 2.1668 | 0.0303 | 0.1252 | 2.5012 |
| Some College or Technical School (ref = Less than HS) | 1.8388 | 0.5171 | 3.5560 | 0.0004 | 0.8252 | 2.8525 |
| Graduate or Professional Degree (ref = Less than HS) | 2.1177 | 0.5262 | 4.0243 | 0.0001 | 1.0861 | 3.1492 |
| Current some-day Smoker (ref = Currently daily smoker) | -0.9763 | 0.5424 | -1.7998 | 0.0719 | -2.0395 | 0.0870 |
| Former Smoker (ref = Currently daily smoker) | -2.3453 | 0.3630 | -6.4604 | 0.0000 | -3.0569 | -1.6337 |
| Never Smoker (ref = Currently daily smoker) | -3.3644 | 0.3461 | -9.7208 | 0.0000 | -4.0428 | -2.6859 |
b_hw <- round(coef(mlr_1), 3)
ci_hw <- round(confint(mlr_1), 3)
The full fitted regression equation is below.
\[\widehat{\text{Mental Health Days}} = 8.425 + 0.277(\text{Phys. Health Days}) + 0.031(\text{BMI}) + 1.349(\text{Female}) + -0.415(\text{Exercise: No}) + -0.247(\text{Age:25-29}) + -1.881(\text{Age:30-34}) + -2.779(\text{Age:35-39}) + -3.295(\text{Age:40-44}) + -2.735(\text{Age:45-49}) + -2.928(\text{Age:50-54}) + -4.327(\text{Age:55-59}) + -4.424(\text{Age:60-64}) + -5.657(\text{Age:65-69}) + -6.363(\text{Age:70-74}) + -6.13(\text{Age:75-79}) + -6.574(\text{Age:80+}) + -1.004(\text{Income:15,000 - 24,000}) + -0.974(\text{Income:25,000 - 34,000}) + -1.546(\text{Income:35,000 - 49,000}) + -2.458(\text{Income:100,000 - 199,000}) + -2.698(\text{Income:200,000}) +1.313(\text{Education: HS Diploma or GED}) + 1.839(\text{Education: Some College or Technical School}) + 2.118(\text{Education: Graduate or Professional Degree}) + -0.976(\text{Smoking Status: Current some-day Smoker}) + -2.345(\text{Smoking Status: Former Smoker}) + -3.364(\text{Smoking Status: Never Smoker}) \] ## Step 3
Interpretation:
Physical health days (\(\hat{\beta}\) = 0.277): Each additional day of poor physical health is associated with an estimated 0.277 additional mentally unhealthy day on average, holding BMI, sex, exercise, age group, education, and smoking (95% CI: 0.257 to 0.296).
BMI (\(\hat{\beta}\) = 0.031): Each additional increase in BMI hour is associated with an estimated 0.031 fewer mentally unhealthy days on average, adjusting for all other covariates (95% CI: 0.004 to 0.058). The negative sign indicates a protective association.
Sex: Female (\(\hat{\beta}\) = 1.349): Compared to males (the reference group), females report an estimated 1.349 more mentally unhealthy days on average, holding all other variables constant.
Exercise: No (\(\hat{\beta}\) = -0.415): People who engaged in any physical activity in the past 30 days report an estimated 0.415 fewer mentally unhealthy days compared to those who did not exercise, adjusting for all other covariates.
Age Group: 25-29 (\(\hat{\beta}\) = -0.247): Compared to 18-24 (the reference group), 25-29 age group report an estimated 0.247 more mentally unhealthy days on average, holding all other variables constant.
Income: $15,000 - \(24,000 (\)$ = -1.004): Compared to less than $15,000 (the reference group), 15,000 - 24,000 income group report an estimated 0.247 more mentally unhealthy days on average, holding all other variables constant.
Income: $25,000 - \(34,000 (\)$ = -0.974): Compared to less than $15,000 (the reference group), 25,000 - 34,000 income group report an estimated 0.247 more mentally unhealthy days on average, holding all other variables constant.
Report and interpret each of the following model-level statistics: • R-squared: proportion of variance in mentally unhealthy days explained by all predictors • Adjusted R-squared: how it differs from R-squared and what it adds • Root MSE (residual standard error): average prediction error of the model • Overall F-test: state H0, F-statistic, degrees of freedom (numerator and denominator), p-value, and conclusion
\[H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0 \quad \text{(none of the predictors matter)}\]
\[H_A: \text{At least one } \beta_j \neq 0\]
tribble(
~Model, ~Predictors, ~R2, ~`Adj. R²`,
"M3: Full (physhlth_days, bmi, sex, exercise, age_group, income,education, and smoking)", 28,
round(summary(mlr_1)$r.squared, 4), round(summary(mlr_1)$adj.r.squared, 4)
) %>%
kable(caption = "R² and Adjusted R² Across Sequential Models") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(1, bold = TRUE, background = "#EBF5FB")
| Model | Predictors | R2 | Adj. R² |
|---|---|---|---|
| M3: Full (physhlth_days, bmi, sex, exercise, age_group, income,education, and smoking) | 28 | 0.182 | 0.1792 |
# Augment dataset with fitted values and residuals
augmented_hw <- augment(mlr_1)
# Show a sample of fitted values and residuals
augmented_hw %>%
select(menthlth_days, physhlth_days, .fitted, .resid) %>%
slice_head(n = 10) %>%
mutate(across(where(is.numeric), ~ round(., 3))) %>%
kable(
caption = "First 10 Observations: Observed, Fitted, and Residual Values",
col.names = c("Observed menthlth_days (Y)", "physhlth_days (X)", "Fitted (Ŷ)", "Residual (e = Y − Ŷ)")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Observed menthlth_days (Y) | physhlth_days (X) | Fitted (Ŷ) | Residual (e = Y − Ŷ) |
|---|---|---|---|
| 28 | 0 | 0.375 | 27.625 |
| 0 | 0 | -0.470 | 0.470 |
| 0 | 0 | 2.830 | -2.830 |
| 0 | 0 | 7.097 | -7.097 |
| 30 | 3 | 7.549 | 22.451 |
| 0 | 0 | -0.402 | 0.402 |
| 0 | 0 | 2.402 | -2.402 |
| 5 | 30 | 8.837 | -3.837 |
| 2 | 0 | 1.691 | 0.309 |
| 1 | 0 | 8.320 | -7.320 |
rmse_mlr_1 <- round(summary(mlr_1)$sigma, 2)
cat("Root MSE (Model 1):", rmse_mlr_1, "physhlth_days")
## Root MSE (Model 1): 7.51 physhlth_days
n_hw <- nrow(brfss_23_hw)
ss_resid_hw <- sum(augmented_hw$.resid^2)
mse_hw <- ss_resid_hw / (n_hw - 2)
sigma_hat_hw <- sqrt(mse_hw)
tibble(
Quantity = c("SS Residual", "MSE (σ̂²)", "Residual Std. Error (σ̂)"),
Value = c(round(ss_resid_hw, 2), round(mse_hw, 3), round(sigma_hat_hw, 3)),
Units = c("", "", "days")
) %>%
kable(caption = "Model Error Estimates") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)
| Quantity | Value | Units |
|---|---|---|
| SS Residual | 449416.090 | |
| MSE (σ̂²) |
56.19
|
anova_mlr_1 <- anova(mlr_1)
print(anova_mlr_1)
## Analysis of Variance Table
##
## Response: menthlth_days
## Df Sum Sq Mean Sq F value Pr(>F)
## physhlth_days 1 55705 55705 988.1238 < 2.2e-16 ***
## bmi 1 667 667 11.8275 0.0005866 ***
## sex 1 3682 3682 65.3220 7.296e-16 ***
## exercise 1 339 339 6.0209 0.0141586 *
## age_group 12 26654 2221 39.3996 < 2.2e-16 ***
## income 5 5388 1078 19.1140 < 2.2e-16 ***
## education 3 912 304 5.3918 0.0010514 **
## smoking 3 6630 2210 39.2009 < 2.2e-16 ***
## Residuals 7972 449416 56
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
glance(mlr_1) %>%
select(r.squared, adj.r.squared, sigma, statistic, p.value, df, df.residual, nobs) %>%
mutate(across(where(is.numeric), ~ round(., 4))) %>%
pivot_longer(everything(), names_to = "Statistic", values_to = "Value") %>%
mutate(Statistic = dplyr::recode(Statistic,
"r.squared" = "R²",
"adj.r.squared" = "Adjusted R²",
"sigma" = "Residual Std. Error (Root MSE)",
"statistic" = "F-statistic",
"p.value" = "p-value (overall F-test)",
"df" = "Model df (p)",
"df.residual" = "Residual df (n − p − 1)",
"nobs" = "n (observations)"
)) %>%
kable(caption = "Overall Model Summary — Model C") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Statistic | Value |
|---|---|
| R² | 0.1820 |
| Adjusted R² | 0.1792 |
| Residual Std. Error (Root MSE) | 7.5083 |
| F-statistic | 65.6829 |
| p-value (overall F-test) | 0.0000 |
| Model df (p) | 27.0000 |
| Residual df (n − p − 1) | 7972.0000 |
| n (observations) | 8000.0000 |
\[H_0: \beta_j = 0 \quad \text{given all other predictors are in the model}\]
Anova(mlr_1, type = "III") %>%
as.data.frame() %>%
rownames_to_column("Source") %>%
filter(Source != "(Intercept)") %>%
mutate(
Significant = ifelse(`Pr(>F)` < 0.05, "Yes", "No"),
across(where(is.numeric), ~ round(., 4))
) %>%
kable(
caption = "Type III Partial F-Tests for Each Predictor",
col.names = c("Source", "Sum of Sq", "df", "F value", "p-value", "Significant (α = 0.05)")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
column_spec(6, bold = TRUE)
| Source | Sum of Sq | df | F value | p-value | Significant (α = 0.05) |
|---|---|---|---|---|---|
| physhlth_days | 42590.9955 | 1 | 755.5035 | 0.0000 | Yes |
| bmi | 295.0575 | 1 | 5.2339 | 0.0222 | Yes |
| sex | 3483.7945 | 1 | 61.7975 | 0.0000 | Yes |
| exercise | 201.0196 | 1 | 3.5658 | 0.0590 | No |
| age_group | 27683.0536 | 12 | 40.9215 | 0.0000 | Yes |
| income | 3373.3685 | 5 | 11.9677 | 0.0000 | Yes |
| education | 1007.8623 | 3 | 5.9593 | 0.0005 | Yes |
| smoking | 6629.7737 | 3 | 39.2009 | 0.0000 | Yes |
| Residuals | 449416.0904 | 7972 | NA | NA | NA |
# Reduced model: No income
mlr_noincome <- lm(menthlth_days ~ physhlth_days + bmi + sex + exercise + age_group + education + smoking,
data = brfss_23_hw)
# Full model: Full model
mlr_1_full <- lm(menthlth_days ~ physhlth_days + bmi + sex + exercise + age_group + income + education + smoking,
data = brfss_23_hw)
# Chunk test
anova(mlr_noincome, mlr_1_full) %>%
as.data.frame() %>%
rownames_to_column("Model") %>%
mutate(
Model = c("Reduced (No Income)", "Full (+ All Predictors)"),
across(where(is.numeric), ~ round(., 4))
) %>%
kable(
caption = "Chunk Test: Do demographic variables collectively add to the model?",
col.names = c("Model", "Res. df", "RSS", "df", "Sum of Sq", "F", "p-value")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Model | Res. df | RSS | df | Sum of Sq | F | p-value |
|---|---|---|---|---|---|---|
| Reduced (No Income) | 7977 | 452789.5 | NA | NA | NA | NA |
| Full (+ All Predictors) | 7972 | 449416.1 | 5 | 3373.369 | 11.9677 | 0 |
# Reduced model: No income
mlr_noeducation <- lm(menthlth_days ~ physhlth_days + bmi + sex + exercise + age_group + income + smoking,
data = brfss_23_hw)
# Full model: Full model
mlr_1_full <- lm(menthlth_days ~ physhlth_days + bmi + sex + exercise + age_group + income + education + smoking,
data = brfss_23_hw)
# Chunk test
anova(mlr_noeducation, mlr_1_full) %>%
as.data.frame() %>%
rownames_to_column("Model") %>%
mutate(
Model = c("Reduced (No education)", "Full (+ All Predictors)"),
across(where(is.numeric), ~ round(., 4))
) %>%
kable(
caption = "Chunk Test: Do demographic variables collectively add to the model?",
col.names = c("Model", "Res. df", "RSS", "df", "Sum of Sq", "F", "p-value")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Model | Res. df | RSS | df | Sum of Sq | F | p-value |
|---|---|---|---|---|---|---|
| Reduced (No education) | 7975 | 450424.0 | NA | NA | NA | NA |
| Full (+ All Predictors) | 7972 | 449416.1 | 3 | 1007.862 | 5.9593 | 5e-04 |
Interpretation: Both income and education adds statistically significant information to the model. Since Education has a smaller p-value we know that it is a stronger independent variable compared to income. The chunk test allows us to adjusts for all variables (including others in the group being tested).
brfss_23_hw <- brfss_23_hw %>%
mutate(
smoker_current = factor(case_when(
smoking %in% c("Currently daily smoker","Current some-day smoker") ~ "Current Smoker",
smoking %in% c("Former Smoker", "Never Smoker") ~ "Non-Smoker",
TRUE ~ NA_character_ ),
levels = c("Non-Smoker", "Current Smoker")))
Model A (main effects only): regress menthlth_days on physhlth_days, bmi, sex, smoker_current, exercise, age_group, income, and education.
• Model B (with interaction): same as Model A, but use sex * smoker_current in the formula to include the interaction term
# Model without interaction (additive model)
mod_a_hw <- lm(menthlth_days ~ physhlth_days + bmi + sex + smoker_current + exercise + age_group + income + education, data = brfss_23_hw)
# Model with interaction
mod_b_hw <- lm(menthlth_days ~ physhlth_days + bmi + exercise + age_group + income + education + sex*smoker_current, data = brfss_23_hw)
tidy(mod_b_hw, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Interaction Model: Sleep * Sex → Physical Health Days",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 5.1637 | 0.7901 | 6.5352 | 0.0000 | 3.6148 | 6.7126 |
| physhlth_days | 0.2797 | 0.0100 | 27.8377 | 0.0000 | 0.2600 | 0.2994 |
| bmi | 0.0314 | 0.0136 | 2.3205 | 0.0203 | 0.0049 | 0.0580 |
| exerciseYes | -0.3949 | 0.2198 | -1.7964 | 0.0725 | -0.8257 | 0.0360 |
| age_group25-29 | -0.1179 | 0.5338 | -0.2208 | 0.8252 | -1.1643 | 0.9285 |
| age_group30-34 | -1.7075 | 0.5151 | -3.3152 | 0.0009 | -2.7171 | -0.6979 |
| age_group35-39 | -2.5745 | 0.4910 | -5.2438 | 0.0000 | -3.5369 | -1.6121 |
| age_group40-44 | -3.0759 | 0.4846 | -6.3476 | 0.0000 | -4.0257 | -2.1260 |
| age_group45-49 | -2.4694 | 0.4945 | -4.9943 | 0.0000 | -3.4387 | -1.5002 |
| age_group50-54 | -2.6979 | 0.4730 | -5.7035 | 0.0000 | -3.6251 | -1.7706 |
| age_group55-59 | -4.0907 | 0.4684 | -8.7336 | 0.0000 | -5.0089 | -3.1726 |
| age_group60-64 | -4.1570 | 0.4587 | -9.0620 | 0.0000 | -5.0562 | -3.2577 |
| age_group65-69 | -5.3182 | 0.4527 | -11.7469 | 0.0000 | -6.2057 | -4.4307 |
| age_group70-74 | -6.0272 | 0.4720 | -12.7701 | 0.0000 | -6.9524 | -5.1020 |
| age_group75-79 | -5.7415 | 0.4904 | -11.7079 | 0.0000 | -6.7028 | -4.7802 |
| age_group80+ | -6.2432 | 0.4796 | -13.0185 | 0.0000 | -7.1833 | -5.3031 |
| income$15,000 - $24,000 | -0.9455 | 0.3975 | -2.3784 | 0.0174 | -1.7247 | -0.1662 |
| income$25,000 - $34,000 | -0.8982 | 0.3899 | -2.3037 | 0.0213 | -1.6625 | -0.1339 |
| income$35,000 - $49,000 | -1.4272 | 0.3854 | -3.7030 | 0.0002 | -2.1827 | -0.6717 |
| income$100,000 - $199,000 | -2.3992 | 0.3840 | -6.2476 | 0.0000 | -3.1520 | -1.6464 |
| income>= $200,000 | -2.6826 | 0.4316 | -6.2154 | 0.0000 | -3.5287 | -1.8366 |
| educationHS diploma or GED | 1.4292 | 0.6057 | 2.3598 | 0.0183 | 0.2420 | 2.6165 |
| educationSome College or Technical School | 1.9280 | 0.5169 | 3.7300 | 0.0002 | 0.9148 | 2.9412 |
| educationGraduate or Professional Degree | 2.0664 | 0.5269 | 3.9220 | 0.0001 | 1.0336 | 3.0993 |
| sexFemale | 1.0339 | 0.1806 | 5.7260 | 0.0000 | 0.6799 | 1.3878 |
| smoker_currentCurrent Smoker | 1.6156 | 0.3880 | 4.1644 | 0.0000 | 0.8551 | 2.3761 |
| sexFemale:smoker_currentCurrent Smoker | 2.1849 | 0.5398 | 4.0475 | 0.0001 | 1.1267 | 3.2431 |
\[H_0: \beta_3 = 0 \quad \text{(slopes are equal, lines are parallel, no interaction)}\]
\[H_A: \beta_3 \neq 0 \quad \text{(slopes differ, interaction is present)}\]
mod_b_hw_term <- tidy(mod_b_hw) |> filter(term == "sexFemale:smoker_currentCurrent Smoker")
cat("Interaction term (sexFemale:smoker_currentCurrent Smoker):\n")
## Interaction term (sexFemale:smoker_currentCurrent Smoker):
cat(" Estimate:", round(mod_b_hw_term$estimate, 4), "\n")
## Estimate: 2.1849
cat(" t-statistic:", round(mod_b_hw_term$statistic, 3), "\n")
## t-statistic: 4.048
cat(" p-value:", round(mod_b_hw_term$p.value, 4), "\n")
## p-value: 1e-04
Interpretation: The interaction term
(sexFemale:smoker_currentCurrent Smoker) has a coefficient
of 2.1849, with a t-statistic of 4.048 and p-value of 10^{-4}. Since the
p-value exceeds the conventional \(\alpha =
0.05\) threshold, we reject the null hypothesis that the slopes
are equal. In other words, there is statistically significant evidence
that the association between current smoking and mental unhealthy days
differs by sex.
anova(mod_a_hw, mod_b_hw) |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Test for Coincidence: Does Sex Affect the Sleep-Physical Health Relationship at All?") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
| term | df.residual | rss | df | sumsq | statistic | p.value |
|---|---|---|---|---|---|---|
| menthlth_days ~ physhlth_days + bmi + sex + smoker_current + exercise + age_group + income + education | 7974 | 451014.1 | NA | NA | NA | NA |
| menthlth_days ~ physhlth_days + bmi + exercise + age_group + income + education + sex * smoker_current | 7973 | 450089.3 | 1 | 924.8145 | 16.3824 | 1e-04 |
Interpretation: The partial F-test evaluates whether sex contributes anything to the model (either through a different intercept, a different slope, or both). The F-statistic is 16.38 with a p-value of 10^{-4}. Since p < 0.05, we o reject the null hypothesis that the two sex-specific regression lines are identical. This means that, in this unadjusted model, sex does significantly modify either the baseline level or the slope of the current smoking-mental health relationship.
pred_hw <- ggpredict(mod_b_hw, terms = c("smoker_current", "sex"))
ggplot(pred_hw, aes(x = x, y = predicted, color = group, fill = group)) +
geom_line(linewidth = 1.2) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.2, color = NA) +
labs(
title = "Predicted Mental Health Days by Smoking Status and Sex",
subtitle = "From interaction model: Current Smoker * sex",
x = "Smoking Status",
y = "Predicted Mental Unhealthy Days",
color = "Sex",
fill = "Sex"
) +
theme_minimal(base_size = 13) +
scale_color_brewer(palette = "Set1")
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?
# Plot
ggplot(pred_hw, aes(x = x, y = predicted, color = group, fill = group)) +
geom_line(linewidth = 1.2) +
geom_point(size = 3, position = position_dodge(width = 0.2)) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high),
alpha = 0.15, color = NA, position = position_dodge(width = 0.2)) +
scale_x_discrete(limits = c("Non-Smoker", "Current Smoker")) +
scale_color_brewer(palette = "Set1") +
labs(
title = "Predicted Mental Unhealthy Days by Smoking Status and Sex",
subtitle = "From interaction model: smoker_current * sex",
x = "Smoking Status",
y = "Predicted Mental Unhealthy Days",
color = "Sex",
fill = "Sex"
) +
theme_minimal(base_size = 13)
## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?
Interpretation: The predicted values plot confirms what the formal test indicated: the regression lines for males and females are not parallel. Both groups show an increase in predicted mentally unhealthy days depending on smoking status. The female points are above below the male points across most of the smoking status, which is statistically significant. This visualization reinforces the conclusion that sex does meaningfully modify the smoking status mental health days in this sample.
What do the results suggest about the determinants of mental health among U.S. adults? Which predictors were most strongly associated? Which were not statistically significant? What are the key limitations of using cross-sectional survey data? Name at least two specific confounders that could bias the associations you estimated.
Response: The results suggest that physical health, bmi, sex, excercise, age, income, education and smoking status are all assocaited with mental health among U.S residents. Age had the largest assocaited across all age groups when compared to the reference group (18-24), the only age group that was not statistically significant was age group 25-29 years old. The other group that was not statisically significant when compared to the reference group was income group 100,000 - 199,000 when compared to the reference group less than 15,000. The key limitations of using cross-sectional data is that temporarlity can not be established therefore we are unable to be certain that the predictors happened prior to the outcome of poor mental health days. We know that there are associations between the predictors but we don’t know if they influfenced the respondents poor mental health days or vice versa. I think physical health and excersise could be potential confounders.
What does Adjusted R-squared tell you that regular R-squared does not, and why is this distinction especially important when adding multiple categorical predictors? Why is it useful to test groups of predictors (income, education) collectively with chunk tests, rather than relying only on the individual t-test for each level?
Response: The Adjusted R-square adds a pentaly to the model for extra degress of freedom that are added to the model, whereas the r-quared will always increase as predictors are added to the model. It is useful to test groups of predictors collectively with chunk test rather than relying on individual t-test for each level because groups can add statistcailly significant information to to the model even if a individual predictor is not significant.
Describe which parts of the assignment (if any) you sought AI assistance for, what specific prompts you used, and how you verified the AI-assisted output was correct. If you did not use AI, state that explicitly.
Response:I used AI for Step 5: Visualize the Interaction to verfiy my code, for whatever reason the code that I developed was producing an empty graph. I enetered my code into the AI and asked if anything looked off in the code. I read through the output and tested my code to ensure that everything was set up correctly. All my information was properly set up but the graph was still not producing anything, I asked AI to modify the code and it provided the code that I used. It is slightly different but when I used it, the graph appeared. Not sure what was causing the error but was able to get it to run and interpret the visual to answer the questions.