Part 0. Data Acquisition and Preparation

This step is loading the necessary packages for this assignment.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(haven)

## Warning: package 'haven' was built under R version 4.5.2

library(janitor)

## Warning: package 'janitor' was built under R version 4.5.2

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(knitr)
library(kableExtra)

## Warning: package 'kableExtra' was built under R version 4.5.2

## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(broom)
library(gtsummary)

## Warning: package 'gtsummary' was built under R version 4.5.2

library(GGally)

## Warning: package 'GGally' was built under R version 4.5.2

library(car)

## Warning: package 'car' was built under R version 4.5.2

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.5.2

## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

library(ggeffects)

## Warning: package 'ggeffects' was built under R version 4.5.2

library(plotly)

## Warning: package 'plotly' was built under R version 4.5.2

## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

## Setting theme "Compact"
## Setting theme "Compact"

Importing the BRFSS 2023 data that will be used for the analysis.

brfss_23 <- read_xpt(
  "C:/Users/ariel/OneDrive/DrPH - Albany/4. Spring Semester 2026/HEPI 553/HEPI_553/data/LLCP2023.XPT"
) %>% 
  clean_names()

brfss_23 <- brfss_23 %>% 
  select(menthlth, physhlth, bmi5, sexvar, exerany2, ageg5yr, incomg1, educa, smoker3)

This next step will be cleaning the variables that are needed for the assignment.

brfss_23_clean <- brfss_23 |>
  mutate(
    # Outcome: mentally unhealthy days in past 30
    menthlth_days = case_when(
      menthlth == 88                    ~ 0,
      menthlth >= 1 & menthlth <= 30   ~ as.numeric(menthlth),
      TRUE                             ~ NA_real_
    ),
    # Physical health days
    physhlth_days = case_when(
      physhlth == 88                    ~ 0,
      physhlth >= 1 & physhlth <= 30   ~ as.numeric(physhlth),
      TRUE                             ~ NA_real_
    ),
    
    #BMI
    bmi = ifelse(bmi5 > 0, bmi5 / 100, NA_real_),
    
    # Sex
    sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
    
    #Exercise
    exercise = factor(exerany2, levels = c(2, 1), labels = c("No", "Yes")),
    
    # Age groups (13-level raw BRFSS variable Age Group)
    # 1 = 18 to 24
    # 2 = 25 - 29
    # 3 = 30 - 34
    # 4 = 35 - 39
    # 5 = 40 - 44
    # 6 = 45 - 49
    # 7 = 50 - 54
    # 8 = 55 - 59
    # 9 = 60 - 64
    # 10 = 65 - 69
    # 11 = 70 - 74
    # 12 = 75 - 79   
    # 13 = 80+
    # 14 = Refused
    age_group = factor(case_when(
      ageg5yr == 1 ~ "18-24",
      ageg5yr == 2 ~ "25-29",
      ageg5yr == 3 ~ "30-34",
      ageg5yr == 4 ~ "35-39",
      ageg5yr == 5 ~ "40-44",
      ageg5yr == 6 ~ "45-49",
      ageg5yr == 7 ~ "50-54",
      ageg5yr == 8 ~ "55-59",
      ageg5yr == 9 ~ "60-64",
      ageg5yr == 10 ~ "65-69",
      ageg5yr == 11 ~ "70-74",
      ageg5yr == 12 ~ "75-79",
      ageg5yr == 13 ~ "80+",
      TRUE         ~ NA_character_
    ), levels = c("18-24", "25-29", "30-34", "35-39", "40-44","45-49", "50-54", "55-59", "60-64", "65-69","70-74", "75-79", "80+")),

    # Income groups (7-level raw BRFSS variable EDUCA)
    # 1 = < $15,000
    # 2 = $15,000 - $24,000
    # 3 = $25,000 - $34,000
    # 4 = $35,000 - $49,000
    # 5 = $50,000 - $99,000
    # 6 = $100,000 - $199,000
    # 7 = >= $200,000
    # 9 = Refused
    income = factor(case_when(
      incomg1 == 1 ~ "< $15,000",
      incomg1 == 2 ~ "$15,000 - $24,000",
      incomg1 == 3 ~ "$25,000 - $34,000",
      incomg1 == 4 ~ "$35,000 - $49,000",
      incomg1 == 5 ~ "$50,000 - $99,000",
      incomg1 == 6 ~ "$100,000 - $199,000",
      incomg1 == 7 ~ ">= $200,000",
      TRUE         ~ NA_character_
    ), levels = c("< $15,000", "$15,000 - $24,000", "$25,000 - $34,000", "$35,000 - $49,000", "$50,000 - $100,000","$100,000 - $199,000", ">= $200,000")),
    
    # Education (6-level raw BRFSS variable EDUCA)
    # 1 = Never attended school or only kindergarten
    # 2 = Grades 1 through 8 (Elementary)
    # 3 = Grades 9 through 11 (Some high school)
    # 4 = Grade 12 or GED (High school graduate)
    # 5 = College 1 year to 3 years (Some college or technical school)
    # 6 = College 4 years or more (College graduate)
    # 9 = Refused
    
    education = factor(case_when(
      educa %in% c(1, 2) ~ "Less than HS",
      educa == 3             ~ "HS diploma or GED",      
      educa == 4             ~ "Some College or Technical School",
      educa == 5             ~ "College Graduate",
      educa == 6             ~ "Graduate or Professional Degree",
      TRUE                   ~ NA_character_
    ), levels = c("Less than HS", "HS diploma or GED", "Some College or Technical School", "College graduate", "Graduate or Professional Degree")),

    # Smoker (4-cat raw BRFSS variable)
    # 1 = Current smoker - now smokes every day
    # 2 = Current smoker - now smokes some days
    # 3 = Former smoker
    # 4 = Never smoked
    # 9 = Refused 
    
    smoking = factor(case_when(
      smoker3 == 1 ~ "Currently daily smoker",
      smoker3 == 2 ~ "Current some-day smoker",
      smoker3 == 3 ~ "Former Smoker",
      smoker3 == 4 ~ "Never Smoker",
      TRUE         ~ NA_character_), 
      
      levels = c("Currently daily smoker", "Current some-day smoker","Former Smoker","Never Smoker"))) %>% 
      
  filter(
    !is.na(menthlth_days),
    !is.na(physhlth_days),
    !is.na(exercise),
    !is.na(age_group),
    !is.na(sex),
    !is.na(education),
    !is.na(income),
    !is.na(bmi),
    !is.na(smoking)
    )
  

# Reproducible random sample
set.seed(1220)
brfss_23_hw <- brfss_23_clean |>
  select(menthlth_days, physhlth_days, bmi, exercise, age_group, sex, income, education, smoking) |>
  slice_sample(n = 8000)

# Save for review
saveRDS(brfss_23_hw,
  "C:/Users/ariel/OneDrive/DrPH - Albany/4. Spring Semester 2026/HEPI 553/HEPI_553/data/brfss_hw3_2023.rds")

tibble(Metric = c("Observations", "Variables"),
       Value  = c(nrow(brfss_23_hw), ncol(brfss_23_hw))) |>
  kable(caption = "Analytic Dataset Dimensions") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Analytic Dataset Dimensions
Metric	Value
Observations	8000
Variables	9

In this code chunk, I will be assessing the missing data in the orginal brfss 2023 data.

brfss_23 %>%
  select(menthlth, physhlth,bmi5,sexvar,exerany2, ageg5yr, incomg1, educa ,smoker3) %>%
  summarise(across(everything(), ~ sum(is.na(.) | . %in% c(77, 88, 99)) / n() * 100)) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Pct_Missing_or_DK") %>%
  mutate(Pct_Missing_or_DK = round(Pct_Missing_or_DK, 1)) %>%
  arrange(desc(Pct_Missing_or_DK)) %>%
  kable(
    caption = "Table 1. Missing or Don't Know/Refused (%) — BRFSS 2023 Full Sample",
    col.names = c("Variable", "% Missing / DK / Refused")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 1. Missing or Don’t Know/Refused (%) — BRFSS 2023 Full Sample
Variable	% Missing / DK / Refused
physhlth	61.7
menthlth	61.2
bmi5	9.4
sexvar	0.0
exerany2	0.0
ageg5yr	0.0
incomg1	0.0
educa	0.0
smoker3	0.0

brfss_23_hw %>%
  select(menthlth_days, physhlth_days, exercise , age_group, income, sex, bmi, smoking) %>%
  tbl_summary(
    label = list(
      menthlth_days ~ "Mentally unhealthy days (past 30)",
      physhlth_days ~ "Physically unhealthy days (past 30)",
      exercise      ~ "Any physical activity (past 30 days)",      
      age_group     ~ "Age (years)",
      income        ~ "Household income",
      sex           ~ "Sex",
      bmi           ~ "BMI (kg/m²)",      
      smoking           ~ "Smoking Status"      

    ),
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) %>%
  add_n() %>%
  bold_labels() %>%
  modify_caption("**Table 2. Descriptive Statistics — BRFSS 2023 Analytic Sample (n = 8,000)**")

**Table 2. Descriptive Statistics — BRFSS 2023 Analytic Sample (n = 8,000)**
Characteristic	N	N = 8,000¹
Mentally unhealthy days (past 30)	8,000	4.4 (8.3)
Physically unhealthy days (past 30)	8,000	4.6 (8.9)
Any physical activity (past 30 days)	8,000	6,154 (77%)
Age (years)	8,000
18-24		423 (5.3%)
25-29		381 (4.8%)
30-34		455 (5.7%)
35-39		577 (7.2%)
40-44		618 (7.7%)
45-49		574 (7.2%)
50-54		713 (8.9%)
55-59		748 (9.4%)
60-64		810 (10%)
65-69		852 (11%)
70-74		679 (8.5%)
75-79		554 (6.9%)
80+		616 (7.7%)
Household income	8,000
< $15,000		568 (7.1%)
$15,000 - $24,000		988 (12%)
$25,000 - $34,000		1,188 (15%)
$35,000 - $49,000		1,413 (18%)
$50,000 - $100,000		0 (0%)
$100,000 - $199,000		2,732 (34%)
>= $200,000		1,111 (14%)
Sex	8,000
Male		3,982 (50%)
Female		4,018 (50%)
BMI (kg/m²)	8,000	28.5 (6.5)
Smoking Status	8,000
Currently daily smoker		586 (7.3%)
Current some-day smoker		289 (3.6%)
Former Smoker		2,114 (26%)
Never Smoker		5,011 (63%)
¹ Mean (SD); n (%)

Part 1: Multiple Linear Regression

Step 1: Fitting a multiple linear regression with mental health days as the outcome and the following predictors: physhelth_days, bmi, sex, exercise, age_group, education and smoking

mlr_1 <- lm(menthlth_days ~ physhlth_days + bmi + sex + exercise + age_group + income + education + smoking,
         data = brfss_23_hw)

summary(mlr_1)

## 
## Call:
## lm(formula = menthlth_days ~ physhlth_days + bmi + sex + exercise + 
##     age_group + income + education + smoking, data = brfss_23_hw)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.4646  -3.7646  -1.7218   0.8144  29.7247 
## 
## Coefficients:
##                                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                                8.42489    0.83736  10.061  < 2e-16
## physhlth_days                              0.27658    0.01006  27.486  < 2e-16
## bmi                                        0.03099    0.01354   2.288 0.022177
## sexFemale                                  1.34866    0.17156   7.861 4.30e-15
## exerciseYes                               -0.41491    0.21972  -1.888 0.059018
## age_group25-29                            -0.24722    0.53387  -0.463 0.643317
## age_group30-34                            -1.88075    0.51573  -3.647 0.000267
## age_group35-39                            -2.77938    0.49222  -5.647 1.69e-08
## age_group40-44                            -3.29517    0.48665  -6.771 1.37e-11
## age_group45-49                            -2.73500    0.49684  -5.505 3.81e-08
## age_group50-54                            -2.92787    0.47506  -6.163 7.48e-10
## age_group55-59                            -4.32729    0.47062  -9.195  < 2e-16
## age_group60-64                            -4.42429    0.46166  -9.583  < 2e-16
## age_group65-69                            -5.65667    0.45712 -12.375  < 2e-16
## age_group70-74                            -6.36341    0.47649 -13.355  < 2e-16
## age_group75-79                            -6.12956    0.49560 -12.368  < 2e-16
## age_group80+                              -6.57363    0.48348 -13.597  < 2e-16
## income$15,000 - $24,000                   -1.00417    0.39745  -2.527 0.011538
## income$25,000 - $34,000                   -0.97399    0.38967  -2.500 0.012456
## income$35,000 - $49,000                   -1.54576    0.38518  -4.013 6.05e-05
## income$100,000 - $199,000                 -2.45837    0.38373  -6.407 1.57e-10
## income>= $200,000                         -2.69793    0.43130  -6.255 4.17e-10
## educationHS diploma or GED                 1.31318    0.60605   2.167 0.030280
## educationSome College or Technical School  1.83884    0.51711   3.556 0.000379
## educationGraduate or Professional Degree   2.11765    0.52622   4.024 5.77e-05
## smokingCurrent some-day smoker            -0.97626    0.54242  -1.800 0.071924
## smokingFormer Smoker                      -2.34531    0.36303  -6.460 1.11e-10
## smokingNever Smoker                       -3.36439    0.34610  -9.721  < 2e-16
##                                              
## (Intercept)                               ***
## physhlth_days                             ***
## bmi                                       *  
## sexFemale                                 ***
## exerciseYes                               .  
## age_group25-29                               
## age_group30-34                            ***
## age_group35-39                            ***
## age_group40-44                            ***
## age_group45-49                            ***
## age_group50-54                            ***
## age_group55-59                            ***
## age_group60-64                            ***
## age_group65-69                            ***
## age_group70-74                            ***
## age_group75-79                            ***
## age_group80+                              ***
## income$15,000 - $24,000                   *  
## income$25,000 - $34,000                   *  
## income$35,000 - $49,000                   ***
## income$100,000 - $199,000                 ***
## income>= $200,000                         ***
## educationHS diploma or GED                *  
## educationSome College or Technical School ***
## educationGraduate or Professional Degree  ***
## smokingCurrent some-day smoker            .  
## smokingFormer Smoker                      ***
## smokingNever Smoker                       ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.508 on 7972 degrees of freedom
## Multiple R-squared:  0.182,  Adjusted R-squared:  0.1792 
## F-statistic: 65.68 on 27 and 7972 DF,  p-value: < 2.2e-16

tidy(mlr_1, conf.int = TRUE) %>%
  mutate(
    term     = dplyr::recode(term,
      "(Intercept)" = "Intercept",
      "physhlth_days" = "Physical health days",
      "bmi" = "BMI (kg/m²)",
      "sexFemale"  = "Sex: Female (ref = Male)",
      "exerciseNo" = "Exercise: No (ref = Yes)",
      "age_group25-29" = "Age group 25 - 29 (ref = 18 - 24)",
      "age_group30-34" = "Age group 30 - 34 (ref = 18 - 24)",
      "age_group35-39" = "Age group 35 - 39 (ref = 18 - 24)",
      "age_group40-44" = "Age group 40 - 44 (ref = 18 - 24)",
      "age_group45-49" = "Age group 45 - 49 (ref = 18 - 24)",
      "age_group50-54" = "Age group 50 - 54 (ref = 18 - 24)", 
      "age_group55-59" = "Age group 55 - 59 (ref = 18 - 24)",      
      "age_group60-64" = "Age group 60 - 64 (ref = 18 - 24)",      
      "age_group65-69" = "Age group 65 - 69 (ref = 18 - 24)",      
      "age_group70-74" = "Age group 70 - 74 (ref = 18 - 24)",      
      "age_group75-79" = "Age group 75 - 79 (ref = 18 - 24)",  
      "age_group80+" = "Age group 80+ (ref = 18 - 24)",
      "income$15,000 - $24,000" = "Income $15,000 - $24,000 (ref = Less than $15,000)",      
      "income$25,000 - $34,000" = "Income $25,000 - $34,000 (ref = Less than $15,000)",      
      "income$35,000 - $49,000" = "Income $35,000 - $49,000 (ref = Less than $15,000)",  
      "income$100,000 - $199,000" = "Income $100,000 - $199,000 (ref = Less than $15,000)",   
      "income>= $200,000" = "Income >= $200,000 (ref = Less than $15,000)",      
      "educationHS diploma or GED" = "HS Diploma or GED (ref = Less than HS)",      
      "educationSome College or Technical School" = "Some College or Technical School (ref = Less than HS)",  
      "educationGraduate or Professional Degree" = "Graduate or Professional Degree (ref = Less than HS)",      
      "smokingCurrent some-day smoker" = "Current some-day Smoker (ref = Currently daily smoker)",      
      "smokingFormer Smoker" = "Former Smoker (ref = Currently daily smoker)",  
      "smokingNever Smoker" = "Never Smoker (ref = Currently daily smoker)"      
    ),
    across(where(is.numeric), ~ round(., 4))
  ) %>%
  kable(
    caption  = "Table 3. Multiple Linear Regression: Mentally Unhealthy Days ~ Multiple Predictors (BRFSS 2020, n = 5,000)",
    col.names = c("Term", "Estimate (β̂)", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE) %>%
  row_spec(c(2, 3,4,5,7,8,9,10,11,12,13,14,15,16,17,18,19,20,22,23,24,25,26,27,28), background = "#EBF5FB")  # highlight key predictors

Table 3. Multiple Linear Regression: Mentally Unhealthy Days ~ Multiple Predictors (BRFSS 2020, n = 5,000)
Term	Estimate (β̂	Std. Erro
Intercept	8.4249	0.8374	10.0613	0.0000	6.7834	10.0663
Physical health days	0.2766	0.0101	27.4864	0.0000	0.2569	0.2963
BMI (kg/m²)	0.0310	0.0135	2.2878	0.0222	0.0044	0.0575
Sex: Female (ref = Male)	1.3487	0.1716	7.8611	0.0000	1.0124	1.6850
exerciseYes	-0.4149	0.2197	-1.8883	0.0590	-0.8456	0.0158
Age group 25 - 29 (ref = 18 - 24)	-0.2472	0.5339	-0.4631	0.6433	-1.2937	0.7993
Age group 30 - 34 (ref = 18 - 24)	-1.8808	0.5157	-3.6468	0.0003	-2.8917	-0.8698
Age group 35 - 39 (ref = 18 - 24)	-2.7794	0.4922	-5.6466	0.0000	-3.7443	-1.8145
Age group 40 - 44 (ref = 18 - 24)	-3.2952	0.4867	-6.7711	0.0000	-4.2491	-2.3412
Age group 45 - 49 (ref = 18 - 24)	-2.7350	0.4968	-5.5047	0.0000	-3.7089	-1.7611
Age group 50 - 54 (ref = 18 - 24)	-2.9279	0.4751	-6.1631	0.0000	-3.8591	-1.9966
Age group 55 - 59 (ref = 18 - 24)	-4.3273	0.4706	-9.1949	0.0000	-5.2498	-3.4048
Age group 60 - 64 (ref = 18 - 24)	-4.4243	0.4617	-9.5835	0.0000	-5.3293	-3.5193
Age group 65 - 69 (ref = 18 - 24)	-5.6567	0.4571	-12.3746	0.0000	-6.5527	-4.7606
Age group 70 - 74 (ref = 18 - 24)	-6.3634	0.4765	-13.3547	0.0000	-7.2975	-5.4294
Age group 75 - 79 (ref = 18 - 24)	-6.1296	0.4956	-12.3679	0.0000	-7.1011	-5.1581
Age group 80+ (ref = 18 - 24)	-6.5736	0.4835	-13.5965	0.0000	-7.5214	-5.6259
Income $15,000 - $24,000 (ref = Less than $15,000)	-1.0042	0.3974	-2.5266	0.0115	-1.7833	-0.2251
Income $25,000 - $34,000 (ref = Less than $15,000)	-0.9740	0.3897	-2.4995	0.0125	-1.7378	-0.2101
Income $35,000 - $49,000 (ref = Less than $15,000)	-1.5458	0.3852	-4.0130	0.0001	-2.3008	-0.7907
Income $100,000 - $199,000 (ref = Less than $15,000)	-2.4584	0.3837	-6.4066	0.0000	-3.2106	-1.7062
Income >= $200,000 (ref = Less than $15,000)	-2.6979	0.4313	-6.2553	0.0000	-3.5434	-1.8525
HS Diploma or GED (ref = Less than HS)	1.3132	0.6060	2.1668	0.0303	0.1252	2.5012
Some College or Technical School (ref = Less than HS)	1.8388	0.5171	3.5560	0.0004	0.8252	2.8525
Graduate or Professional Degree (ref = Less than HS)	2.1177	0.5262	4.0243	0.0001	1.0861	3.1492
Current some-day Smoker (ref = Currently daily smoker)	-0.9763	0.5424	-1.7998	0.0719	-2.0395	0.0870
Former Smoker (ref = Currently daily smoker)	-2.3453	0.3630	-6.4604	0.0000	-3.0569	-1.6337
Never Smoker (ref = Currently daily smoker)	-3.3644	0.3461	-9.7208	0.0000	-4.0428	-2.6859

b_hw <- round(coef(mlr_1), 3)
ci_hw <- round(confint(mlr_1), 3)

Step 2

The full fitted regression equation is below.

\[\widehat{\text{Mental Health Days}} = 8.425 + 0.277(\text{Phys. Health Days}) + 0.031(\text{BMI}) + 1.349(\text{Female}) + -0.415(\text{Exercise: No}) + -0.247(\text{Age:25-29}) + -1.881(\text{Age:30-34}) + -2.779(\text{Age:35-39}) + -3.295(\text{Age:40-44}) + -2.735(\text{Age:45-49}) + -2.928(\text{Age:50-54}) + -4.327(\text{Age:55-59}) + -4.424(\text{Age:60-64}) + -5.657(\text{Age:65-69}) + -6.363(\text{Age:70-74}) + -6.13(\text{Age:75-79}) + -6.574(\text{Age:80+}) + -1.004(\text{Income:15,000 - 24,000}) + -0.974(\text{Income:25,000 - 34,000}) + -1.546(\text{Income:35,000 - 49,000}) + -2.458(\text{Income:100,000 - 199,000}) + -2.698(\text{Income:200,000}) +1.313(\text{Education: HS Diploma or GED}) + 1.839(\text{Education: Some College or Technical School}) + 2.118(\text{Education: Graduate or Professional Degree}) + -0.976(\text{Smoking Status: Current some-day Smoker}) + -2.345(\text{Smoking Status: Former Smoker}) + -3.364(\text{Smoking Status: Never Smoker}) \] ## Step 3

Interpretation:

Physical health days ($\hat{\beta}$ = 0.277): Each additional day of poor physical health is associated with an estimated 0.277 additional mentally unhealthy day on average, holding BMI, sex, exercise, age group, education, and smoking (95% CI: 0.257 to 0.296).

BMI ($\hat{\beta}$ = 0.031): Each additional increase in BMI hour is associated with an estimated 0.031 fewer mentally unhealthy days on average, adjusting for all other covariates (95% CI: 0.004 to 0.058). The negative sign indicates a protective association.

Sex: Female ($\hat{\beta}$ = 1.349): Compared to males (the reference group), females report an estimated 1.349 more mentally unhealthy days on average, holding all other variables constant.

Exercise: No ($\hat{\beta}$ = -0.415): People who engaged in any physical activity in the past 30 days report an estimated 0.415 fewer mentally unhealthy days compared to those who did not exercise, adjusting for all other covariates.

Age Group: 25-29 ($\hat{\beta}$ = -0.247): Compared to 18-24 (the reference group), 25-29 age group report an estimated 0.247 more mentally unhealthy days on average, holding all other variables constant.

Income: $15,000 - $24,000 ($$ = -1.004): Compared to less than $15,000 (the reference group), 15,000 - 24,000 income group report an estimated 0.247 more mentally unhealthy days on average, holding all other variables constant.

Income: $25,000 - $34,000 ($$ = -0.974): Compared to less than $15,000 (the reference group), 25,000 - 34,000 income group report an estimated 0.247 more mentally unhealthy days on average, holding all other variables constant.

Step 4

Report and interpret each of the following model-level statistics: • R-squared: proportion of variance in mentally unhealthy days explained by all predictors • Adjusted R-squared: how it differs from R-squared and what it adds • Root MSE (residual standard error): average prediction error of the model • Overall F-test: state H0, F-statistic, degrees of freedom (numerator and denominator), p-value, and conclusion

\[H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0 \quad \text{(none of the predictors matter)}\]

\[H_A: \text{At least one } \beta_j \neq 0\]

tribble(
  ~Model, ~Predictors, ~R2, ~`Adj. R²`,
  "M3: Full (physhlth_days, bmi, sex, exercise, age_group, income,education, and smoking)", 28,
    round(summary(mlr_1)$r.squared, 4), round(summary(mlr_1)$adj.r.squared, 4)
) %>%
  kable(caption = "R² and Adjusted R² Across Sequential Models") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(1, bold = TRUE, background = "#EBF5FB")

R² and Adjusted R² Across Sequential Models
Model	Predictors	R2	Adj. R²
M3: Full (physhlth_days, bmi, sex, exercise, age_group, income,education, and smoking)	28	0.182	0.1792

# Augment dataset with fitted values and residuals
augmented_hw <- augment(mlr_1)

# Show a sample of fitted values and residuals
augmented_hw %>%
  select(menthlth_days, physhlth_days, .fitted, .resid) %>%
  slice_head(n = 10) %>%
  mutate(across(where(is.numeric), ~ round(., 3))) %>%
  kable(
    caption = "First 10 Observations: Observed, Fitted, and Residual Values",
    col.names = c("Observed menthlth_days (Y)", "physhlth_days (X)", "Fitted (Ŷ)", "Residual (e = Y − Ŷ)")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

First 10 Observations: Observed, Fitted, and Residual Values
Observed menthlth_days (Y)	physhlth_days (X)	Fitted (Ŷ)	Residual (e = Y − Ŷ)
28	0	0.375	27.625
0	0	-0.470	0.470
0	0	2.830	-2.830
0	0	7.097	-7.097
30	3	7.549	22.451
0	0	-0.402	0.402
0	0	2.402	-2.402
5	30	8.837	-3.837
2	0	1.691	0.309
1	0	8.320	-7.320

rmse_mlr_1 <- round(summary(mlr_1)$sigma, 2)
cat("Root MSE (Model 1):", rmse_mlr_1, "physhlth_days")

## Root MSE (Model 1): 7.51 physhlth_days

n_hw <- nrow(brfss_23_hw)
ss_resid_hw <- sum(augmented_hw$.resid^2)
mse_hw <- ss_resid_hw / (n_hw - 2)
sigma_hat_hw <- sqrt(mse_hw)

tibble(
  Quantity = c("SS Residual", "MSE (σ̂²)", "Residual Std. Error (σ̂)"),
  Value    = c(round(ss_resid_hw, 2), round(mse_hw, 3), round(sigma_hat_hw, 3)),
  Units    = c("", "", "days")
) %>%
  kable(caption = "Model Error Estimates") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Model Error Estimates
Quantity	Value	Units
SS Residual	449416.090
MSE (σ̂²)	56.19

anova_mlr_1 <- anova(mlr_1)
print(anova_mlr_1)

## Analysis of Variance Table
## 
## Response: menthlth_days
##                 Df Sum Sq Mean Sq  F value    Pr(>F)    
## physhlth_days    1  55705   55705 988.1238 < 2.2e-16 ***
## bmi              1    667     667  11.8275 0.0005866 ***
## sex              1   3682    3682  65.3220 7.296e-16 ***
## exercise         1    339     339   6.0209 0.0141586 *  
## age_group       12  26654    2221  39.3996 < 2.2e-16 ***
## income           5   5388    1078  19.1140 < 2.2e-16 ***
## education        3    912     304   5.3918 0.0010514 ** 
## smoking          3   6630    2210  39.2009 < 2.2e-16 ***
## Residuals     7972 449416      56                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

glance(mlr_1) %>%
  select(r.squared, adj.r.squared, sigma, statistic, p.value, df, df.residual, nobs) %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  pivot_longer(everything(), names_to = "Statistic", values_to = "Value") %>%
  mutate(Statistic = dplyr::recode(Statistic,
    "r.squared"     = "R²",
    "adj.r.squared" = "Adjusted R²",
    "sigma"         = "Residual Std. Error (Root MSE)",
    "statistic"     = "F-statistic",
    "p.value"       = "p-value (overall F-test)",
    "df"            = "Model df (p)",
    "df.residual"   = "Residual df (n − p − 1)",
    "nobs"          = "n (observations)"
  )) %>%
  kable(caption = "Overall Model Summary — Model C") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Overall Model Summary — Model C
Statistic	Value
R²	0.1820
Adjusted R²	0.1792
Residual Std. Error (Root MSE)	7.5083
F-statistic	65.6829
p-value (overall F-test)	0.0000
Model df (p)	27.0000
Residual df (n − p − 1)	7972.0000
n (observations)	8000.0000

Part 2: Tests of Hypotheses

\[H_0: \beta_j = 0 \quad \text{given all other predictors are in the model}\]

Step 1:

Anova(mlr_1, type = "III") %>%
  as.data.frame() %>%
  rownames_to_column("Source") %>%
  filter(Source != "(Intercept)") %>%
  mutate(
    Significant = ifelse(`Pr(>F)` < 0.05, "Yes", "No"),
    across(where(is.numeric), ~ round(., 4))
  ) %>%
  kable(
    caption = "Type III Partial F-Tests for Each Predictor",
    col.names = c("Source", "Sum of Sq", "df", "F value", "p-value", "Significant (α = 0.05)")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  column_spec(6, bold = TRUE)

Type III Partial F-Tests for Each Predictor
Source	Sum of Sq	df	F value	p-value	Significant (α = 0.05)
physhlth_days	42590.9955	1	755.5035	0.0000	Yes
bmi	295.0575	1	5.2339	0.0222	Yes
sex	3483.7945	1	61.7975	0.0000	Yes
exercise	201.0196	1	3.5658	0.0590	No
age_group	27683.0536	12	40.9215	0.0000	Yes
income	3373.3685	5	11.9677	0.0000	Yes
education	1007.8623	3	5.9593	0.0005	Yes
smoking	6629.7737	3	39.2009	0.0000	Yes
Residuals	449416.0904	7972	NA	NA	NA

Step 2: Fitting a reduced and fitted model for income

# Reduced model: No income
mlr_noincome <- lm(menthlth_days ~ physhlth_days + bmi + sex + exercise + age_group + education + smoking,
         data = brfss_23_hw)

# Full model: Full model
mlr_1_full <- lm(menthlth_days ~ physhlth_days + bmi + sex + exercise + age_group + income + education + smoking,
         data = brfss_23_hw)

# Chunk test
anova(mlr_noincome, mlr_1_full) %>%
  as.data.frame() %>%
  rownames_to_column("Model") %>%
  mutate(
    Model = c("Reduced (No Income)", "Full (+ All Predictors)"),
    across(where(is.numeric), ~ round(., 4))
  ) %>%
  kable(
    caption = "Chunk Test: Do demographic variables collectively add to the model?",
    col.names = c("Model", "Res. df", "RSS", "df", "Sum of Sq", "F", "p-value")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Chunk Test: Do demographic variables collectively add to the model?
Model	Res. df	RSS	df	Sum of Sq	F	p-value
Reduced (No Income)	7977	452789.5	NA	NA	NA	NA
Full (+ All Predictors)	7972	449416.1	5	3373.369	11.9677	0

Step 3: Fitting a reduced and fitted model for education

# Reduced model: No income
mlr_noeducation <- lm(menthlth_days ~ physhlth_days + bmi + sex + exercise + age_group + income + smoking,
         data = brfss_23_hw)

# Full model: Full model
mlr_1_full <- lm(menthlth_days ~ physhlth_days + bmi + sex + exercise + age_group + income + education + smoking,
         data = brfss_23_hw)

# Chunk test
anova(mlr_noeducation, mlr_1_full) %>%
  as.data.frame() %>%
  rownames_to_column("Model") %>%
  mutate(
    Model = c("Reduced (No education)", "Full (+ All Predictors)"),
    across(where(is.numeric), ~ round(., 4))
  ) %>%
  kable(
    caption = "Chunk Test: Do demographic variables collectively add to the model?",
    col.names = c("Model", "Res. df", "RSS", "df", "Sum of Sq", "F", "p-value")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Chunk Test: Do demographic variables collectively add to the model?
Model	Res. df	RSS	df	Sum of Sq	F	p-value
Reduced (No education)	7975	450424.0	NA	NA	NA	NA
Full (+ All Predictors)	7972	449416.1	3	1007.862	5.9593	5e-04

Step 4: Interpretation

Interpretation: Both income and education adds statistically significant information to the model. Since Education has a smaller p-value we know that it is a stronger independent variable compared to income. The chunk test allows us to adjusts for all variables (including others in the group being tested).

Part 3: Interaction Analysis

Step 1 : Creating Binary Smoking Variable

brfss_23_hw <- brfss_23_hw %>% 
  mutate(
    smoker_current = factor(case_when(
      smoking %in% c("Currently daily smoker","Current some-day smoker")  ~ "Current Smoker",
      smoking %in% c("Former Smoker", "Never Smoker") ~ "Non-Smoker",
      TRUE ~ NA_character_ ),
    levels = c("Non-Smoker", "Current Smoker")))

Step 2: Fit two models

Model A (main effects only): regress menthlth_days on physhlth_days, bmi, sex, smoker_current, exercise, age_group, income, and education.

• Model B (with interaction): same as Model A, but use sex * smoker_current in the formula to include the interaction term

# Model without interaction (additive model)
mod_a_hw <- lm(menthlth_days ~ physhlth_days + bmi + sex + smoker_current + exercise + age_group + income +  education, data = brfss_23_hw)

# Model with interaction

mod_b_hw <- lm(menthlth_days ~ physhlth_days + bmi  + exercise + age_group + income +  education + sex*smoker_current, data = brfss_23_hw)


tidy(mod_b_hw, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Interaction Model: Sleep * Sex → Physical Health Days",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Interaction Model: Sleep * Sex → Physical Health Days
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	5.1637	0.7901	6.5352	0.0000	3.6148	6.7126
physhlth_days	0.2797	0.0100	27.8377	0.0000	0.2600	0.2994
bmi	0.0314	0.0136	2.3205	0.0203	0.0049	0.0580
exerciseYes	-0.3949	0.2198	-1.7964	0.0725	-0.8257	0.0360
age_group25-29	-0.1179	0.5338	-0.2208	0.8252	-1.1643	0.9285
age_group30-34	-1.7075	0.5151	-3.3152	0.0009	-2.7171	-0.6979
age_group35-39	-2.5745	0.4910	-5.2438	0.0000	-3.5369	-1.6121
age_group40-44	-3.0759	0.4846	-6.3476	0.0000	-4.0257	-2.1260
age_group45-49	-2.4694	0.4945	-4.9943	0.0000	-3.4387	-1.5002
age_group50-54	-2.6979	0.4730	-5.7035	0.0000	-3.6251	-1.7706
age_group55-59	-4.0907	0.4684	-8.7336	0.0000	-5.0089	-3.1726
age_group60-64	-4.1570	0.4587	-9.0620	0.0000	-5.0562	-3.2577
age_group65-69	-5.3182	0.4527	-11.7469	0.0000	-6.2057	-4.4307
age_group70-74	-6.0272	0.4720	-12.7701	0.0000	-6.9524	-5.1020
age_group75-79	-5.7415	0.4904	-11.7079	0.0000	-6.7028	-4.7802
age_group80+	-6.2432	0.4796	-13.0185	0.0000	-7.1833	-5.3031
income$15,000 - $24,000	-0.9455	0.3975	-2.3784	0.0174	-1.7247	-0.1662
income$25,000 - $34,000	-0.8982	0.3899	-2.3037	0.0213	-1.6625	-0.1339
income$35,000 - $49,000	-1.4272	0.3854	-3.7030	0.0002	-2.1827	-0.6717
income$100,000 - $199,000	-2.3992	0.3840	-6.2476	0.0000	-3.1520	-1.6464
income>= $200,000	-2.6826	0.4316	-6.2154	0.0000	-3.5287	-1.8366
educationHS diploma or GED	1.4292	0.6057	2.3598	0.0183	0.2420	2.6165
educationSome College or Technical School	1.9280	0.5169	3.7300	0.0002	0.9148	2.9412
educationGraduate or Professional Degree	2.0664	0.5269	3.9220	0.0001	1.0336	3.0993
sexFemale	1.0339	0.1806	5.7260	0.0000	0.6799	1.3878
smoker_currentCurrent Smoker	1.6156	0.3880	4.1644	0.0000	0.8551	2.3761
sexFemale:smoker_currentCurrent Smoker	2.1849	0.5398	4.0475	0.0001	1.1267	3.2431

Step 3: Testing the Interaction

\[H_0: \beta_3 = 0 \quad \text{(slopes are equal, lines are parallel, no interaction)}\]

\[H_A: \beta_3 \neq 0 \quad \text{(slopes differ, interaction is present)}\]

mod_b_hw_term <- tidy(mod_b_hw) |> filter(term == "sexFemale:smoker_currentCurrent Smoker")
cat("Interaction term (sexFemale:smoker_currentCurrent Smoker):\n")

## Interaction term (sexFemale:smoker_currentCurrent Smoker):

cat("  Estimate:", round(mod_b_hw_term$estimate, 4), "\n")

##   Estimate: 2.1849

cat("  t-statistic:", round(mod_b_hw_term$statistic, 3), "\n")

##   t-statistic: 4.048

cat("  p-value:", round(mod_b_hw_term$p.value, 4), "\n")

##   p-value: 1e-04

Interpretation: The interaction term (sexFemale:smoker_currentCurrent Smoker) has a coefficient of 2.1849, with a t-statistic of 4.048 and p-value of 10^{-4}. Since the p-value exceeds the conventional $\alpha = 0.05$ threshold, we reject the null hypothesis that the slopes are equal. In other words, there is statistically significant evidence that the association between current smoking and mental unhealthy days differs by sex.

anova(mod_a_hw, mod_b_hw) |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Test for Coincidence: Does Sex Affect the Sleep-Physical Health Relationship at All?") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Test for Coincidence: Does Sex Affect the Sleep-Physical Health Relationship at All?
term	df.residual	rss	df	sumsq	statistic	p.value
menthlth_days ~ physhlth_days + bmi + sex + smoker_current + exercise + age_group + income + education	7974	451014.1	NA	NA	NA	NA
menthlth_days ~ physhlth_days + bmi + exercise + age_group + income + education + sex * smoker_current	7973	450089.3	1	924.8145	16.3824	1e-04

Step 4: Interpret the Interaction

Interpretation: The partial F-test evaluates whether sex contributes anything to the model (either through a different intercept, a different slope, or both). The F-statistic is 16.38 with a p-value of 10^{-4}. Since p < 0.05, we o reject the null hypothesis that the two sex-specific regression lines are identical. This means that, in this unadjusted model, sex does significantly modify either the baseline level or the slope of the current smoking-mental health relationship.

Step 5: Visualize the Interaction

pred_hw <- ggpredict(mod_b_hw, terms = c("smoker_current", "sex"))

ggplot(pred_hw, aes(x = x, y = predicted, color = group, fill = group)) +
  geom_line(linewidth = 1.2) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.2, color = NA) +
  labs(
    title    = "Predicted Mental Health Days by Smoking Status and Sex",
    subtitle = "From interaction model: Current Smoker * sex",
    x        = "Smoking Status",
    y        = "Predicted Mental Unhealthy Days",
    color    = "Sex",
    fill     = "Sex"
  ) +
  theme_minimal(base_size = 13) +
  scale_color_brewer(palette = "Set1")

## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?

# Plot
ggplot(pred_hw, aes(x = x, y = predicted, color = group, fill = group)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3, position = position_dodge(width = 0.2)) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high),
              alpha = 0.15, color = NA, position = position_dodge(width = 0.2)) +
  scale_x_discrete(limits = c("Non-Smoker", "Current Smoker")) +
  scale_color_brewer(palette = "Set1") +
  labs(
    title = "Predicted Mental Unhealthy Days by Smoking Status and Sex",
    subtitle = "From interaction model: smoker_current * sex",
    x = "Smoking Status",
    y = "Predicted Mental Unhealthy Days",
    color = "Sex",
    fill = "Sex"
  ) +
  theme_minimal(base_size = 13)

## `geom_line()`: Each group consists of only one observation.
## ℹ Do you need to adjust the group aesthetic?

Step 6: Interpret the interaction

Interpretation: The predicted values plot confirms what the formal test indicated: the regression lines for males and females are not parallel. Both groups show an increase in predicted mentally unhealthy days depending on smoking status. The female points are above below the male points across most of the smoking status, which is statistically significant. This visualization reinforces the conclusion that sex does meaningfully modify the smoking status mental health days in this sample.

Part 4: Reflection

A.

What do the results suggest about the determinants of mental health among U.S. adults? Which predictors were most strongly associated? Which were not statistically significant? What are the key limitations of using cross-sectional survey data? Name at least two specific confounders that could bias the associations you estimated.

Response: The results suggest that physical health, bmi, sex, excercise, age, income, education and smoking status are all assocaited with mental health among U.S residents. Age had the largest assocaited across all age groups when compared to the reference group (18-24), the only age group that was not statistically significant was age group 25-29 years old. The other group that was not statisically significant when compared to the reference group was income group 100,000 - 199,000 when compared to the reference group less than 15,000. The key limitations of using cross-sectional data is that temporarlity can not be established therefore we are unable to be certain that the predictors happened prior to the outcome of poor mental health days. We know that there are associations between the predictors but we don’t know if they influfenced the respondents poor mental health days or vice versa. I think physical health and excersise could be potential confounders.

B.

What does Adjusted R-squared tell you that regular R-squared does not, and why is this distinction especially important when adding multiple categorical predictors? Why is it useful to test groups of predictors (income, education) collectively with chunk tests, rather than relying only on the individual t-test for each level?

Response: The Adjusted R-square adds a pentaly to the model for extra degress of freedom that are added to the model, whereas the r-quared will always increase as predictors are added to the model. It is useful to test groups of predictors collectively with chunk test rather than relying on individual t-test for each level because groups can add statistcailly significant information to to the model even if a individual predictor is not significant.

C.

Describe which parts of the assignment (if any) you sought AI assistance for, what specific prompts you used, and how you verified the AI-assisted output was correct. If you did not use AI, state that explicitly.

Response:I used AI for Step 5: Visualize the Interaction to verfiy my code, for whatever reason the code that I developed was producing an empty graph. I enetered my code into the AI and asked if anything looked off in the code. I read through the output and tested my code to ensure that everything was set up correctly. All my information was properly set up but the graph was still not producing anything, I asked AI to modify the code and it provided the code that I used. It is slightly different but when I used it, the graph appeared. Not sure what was causing the error but was able to get it to run and interpret the visual to answer the questions.

EPI553_HW03_Coq_Arielle

Arielle Coq

2026-03-24