Introduction

Mental health is a leading public health concern in the United States. The BRFSS survey asks respondents: “Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good?” The resulting variable, mentally unhealthy days, captures a broad measure of psychological distress and is widely used in public health surveillance. Multiple regression allows us to examine the independent association between each predictor and mental health, holding all other factors constant. Interaction analysis tests whether the association between two predictors (such as smoking and mental health) differs across levels of a third variable (such as sex), a concept epidemiologists call effect modification.

Research Question

What individual and behavioral factors are associated with the number of mentally unhealthy days reported by U.S. adults, and does the association between current smoking and mental health differ by sex?

#dataset

About the BRFSS

The Behavioral Risk Factor Surveillance System (BRFSS) is a nationwide telephone survey conducted annually by state health departments in collaboration with the CDC. The 2023 data contain approximately 433,000 respondents across the United States and cover a broad range of health behaviors, chronic conditions, and sociodemographic characteristics.

Setup and Data

library(tidyverse)
library(haven)
library(here)
library(knitr)
library(kableExtra)
library(plotly)
library(broom)
library(ggeffects)
library(gtsummary)
library (GGally)
library (car)

Loading BRFSS 2023 Data

We will use the Behavioral Risk Factor Surveillance System (BRFSS) 2023 data.

# Load raw BRFSS 2023 data

brfss_raw <- read_xpt(
  "C:/Users/suruc/OneDrive/Desktop/R/EPI553_Rclass/LLCP2023.XPT"
) %>%
  janitor::clean_names()

# Examine the raw data
dim(brfss_raw)

## [1] 433323    350

head(names(brfss_raw),350)

##   [1] "state"    "fmonth"   "idate"    "imonth"   "iday"     "iyear"   
##   [7] "dispcode" "seqno"    "psu"      "ctelenm1" "pvtresd1" "colghous"
##  [13] "statere1" "celphon1" "ladult1"  "numadult" "respslc1" "landsex2"
##  [19] "lndsxbrt" "safetime" "ctelnum1" "cellfon5" "cadult1"  "cellsex2"
##  [25] "celsxbrt" "pvtresd3" "cclghous" "cstate1"  "landline" "hhadult" 
##  [31] "sexvar"   "genhlth"  "physhlth" "menthlth" "poorhlth" "primins1"
##  [37] "persdoc3" "medcost1" "checkup1" "exerany2" "exract12" "exeroft1"
##  [43] "exerhmm1" "exract22" "exeroft2" "exerhmm2" "strength" "bphigh6" 
##  [49] "bpmeds1"  "cholchk3" "toldhi3"  "cholmed3" "cvdinfr4" "cvdcrhd4"
##  [55] "cvdstrk3" "asthma3"  "asthnow"  "chcscnc1" "chcocnc1" "chccopd3"
##  [61] "addepev3" "chckdny2" "havarth4" "diabete4" "diabage4" "marital" 
##  [67] "educa"    "renthom1" "numhhol4" "numphon4" "cpdemo1c" "veteran3"
##  [73] "employ1"  "children" "income3"  "pregnant" "weight2"  "height3" 
##  [79] "deaf"     "blind"    "decide"   "diffwalk" "diffdres" "diffalon"
##  [85] "fall12mn" "fallinj5" "smoke100" "smokday2" "usenow3"  "ecignow2"
##  [91] "alcday4"  "avedrnk3" "drnk3ge5" "maxdrnks" "flushot7" "flshtmy3"
##  [97] "pneuvac4" "shingle2" "hivtst7"  "hivtstd3" "seatbelt" "drnkdri2"
## [103] "covidpo1" "covidsm1" "covidact" "pdiabts1" "prediab2" "diabtype"
## [109] "insulin1" "chkhemo3" "eyeexam1" "diabeye1" "diabedu1" "feetsore"
## [115] "arthexer" "arthedu"  "lmtjoin3" "arthdis2" "joinpai2" "lcsfirst"
## [121] "lcslast"  "lcsnumcg" "lcsctsc1" "lcsscncr" "lcsctwhn" "hadmam"  
## [127] "howlong"  "cervscrn" "crvclcnc" "crvclpap" "crvclhpv" "hadhyst2"
## [133] "psatest1" "psatime1" "pcpsars2" "psasugs1" "pcstalk2" "hadsigm4"
## [139] "colnsigm" "colntes1" "sigmtes1" "lastsig4" "colncncr" "vircolo1"
## [145] "vclntes2" "smalstol" "stoltest" "stooldn2" "bldstfit" "sdnates1"
## [151] "cncrdiff" "cncrage"  "cncrtyp2" "csrvtrt3" "csrvdoc1" "csrvsum" 
## [157] "csrvrtrn" "csrvinst" "csrvinsr" "csrvdein" "csrvclin" "csrvpain"
## [163] "csrvctl2" "indortan" "numburn3" "sunprtct" "wkdayout" "wkendout"
## [169] "cimemlo1" "cdworry"  "cddiscu1" "cdhous1"  "cdsocia1" "caregiv1"
## [175] "crgvrel4" "crgvlng1" "crgvhrs1" "crgvprb3" "crgvalzd" "crgvper1"
## [181] "crgvhou1" "crgvexpt" "lastsmk2" "stopsmk2" "mentcigs" "mentecig"
## [187] "heattbco" "firearm5" "gunload"  "loadulk2" "hasymp1"  "hasymp2" 
## [193] "hasymp3"  "hasymp4"  "hasymp5"  "hasymp6"  "strsymp1" "strsymp2"
## [199] "strsymp3" "strsymp4" "strsymp5" "strsymp6" "firstaid" "aspirin" 
## [205] "birthsex" "somale"   "sofemale" "trnsgndr" "marijan1" "marjsmok"
## [211] "marjeat"  "marjvape" "marjdab"  "marjothr" "usemrjn4" "acedeprs"
## [217] "acedrink" "acedrugs" "aceprisn" "acedivrc" "acepunch" "acehurt1"
## [223] "aceswear" "acetouch" "acetthem" "acehvsex" "aceadsaf" "aceadned"
## [229] "imfvpla4" "hpvadvc4" "hpvadsht" "tetanus1" "covidva1" "covacge1"
## [235] "covidnu2" "lsatisfy" "emtsuprt" "sdlonely" "sdhemply" "foodstmp"
## [241] "sdhfood1" "sdhbills" "sdhutils" "sdhtrnsp" "sdhstre1" "rrclass3"
## [247] "rrcognt2" "rrtreat"  "rratwrk2" "rrhcare4" "rrphysm2" "rcsgend1"
## [253] "rcsxbrth" "rcsrltn2" "casthdx2" "casthno2" "qstver"   "qstlang" 
## [259] "metstat"  "urbstat"  "mscode"   "ststr"    "strwt"    "rawrake" 
## [265] "wt2rake"  "imprace"  "chispnc"  "crace1"   "cageg"    "cllcpwt" 
## [271] "dualuse"  "dualcor"  "llcpwt2"  "llcpwt"   "rfhlth"   "phys14d" 
## [277] "ment14d"  "hlthpl1"  "hcvu653"  "totinda"  "metvl12"  "metvl22" 
## [283] "maxvo21"  "fc601"    "actin13"  "actin23"  "padur1"   "padur2"  
## [289] "pafreq1"  "pafreq2"  "minac12"  "minac22"  "strfreq"  "pamiss3" 
## [295] "pamin13"  "pamin23"  "pa3min"   "pavig13"  "pavig23"  "pa3vigm" 
## [301] "pacat3"   "paindx3"  "pa150r4"  "pa300r4"  "pa30023"  "pastrng" 
## [307] "parec3"   "pastae3"  "rfhype6"  "cholch3"  "rfchol3"  "michd"   
## [313] "ltasth1"  "casthm1"  "asthms1"  "drdxar2"  "mrace1"   "hispanc" 
## [319] "race"     "raceg21"  "racegr3"  "raceprv"  "sex"      "ageg5yr" 
## [325] "age65yr"  "age80"    "age_g"    "htin4"    "htm4"     "wtkg3"   
## [331] "bmi5"     "bmi5cat"  "rfbmi5"   "chldcnt"  "educag"   "incomg1" 
## [337] "smoker3"  "rfsmok3"  "cureci2"  "drnkany6" "drocdy4"  "rfbing6" 
## [343] "drnkwk2"  "rfdrhv8"  "flshot7"  "pneumo3"  "aidtst4"  "rfseat2" 
## [349] "rfseat3"  "drnkdrv"

# Select 9 variables of interest
brfss_hw3 <- brfss_raw %>%
  select(
    menthlth,       # outcome: mentally unhealthy days
    physhlth,       # physical health days
    bmi5,        # BMI (with backticks because of underscore)
    sexvar,         # sex
    exerany2,       # exercise
    ageg5yr,     # age group
    incomg1,     # income
    educa,          # education
    smoker3      # smoking status
  )

# Check
dim(brfss_hw3)      # 433,000 × 9

## [1] 433323      9

head(brfss_hw3)     # Show first few rows

## # A tibble: 6 × 9
##   menthlth physhlth  bmi5 sexvar exerany2 ageg5yr incomg1 educa smoker3
##      <dbl>    <dbl> <dbl>  <dbl>    <dbl>   <dbl>   <dbl> <dbl>   <dbl>
## 1       88       88  3047      2        2      13       9     5       4
## 2       88       88  2856      2        1      13       9     5       4
## 3        2        6  2231      2        1      13       1     4       3
## 4       88        2  2744      2        1      12       9     5       4
## 5       88       88  2585      2        1      12       5     5       4
## 6        3        2  3018      2        1       9       5     5       4

# Recode variables

brfss_clean <- brfss_hw3 %>%
  mutate(
    menthlth_days = case_when (
    menthlth == 88 ~ 0, #none means 0 days
    menthlth %in% c(77,99) ~ NA_real_, #dont know and refused means NA
    menthlth >=1 & menthlth <=30 ~ as.numeric (menthlth), #valid values
    TRUE ~ NA_real_
    ),
    
    physhlth_days = case_when (
    physhlth == 88 ~ 0, #none means 0 days
    physhlth %in% c(77,99) ~ NA_real_, #dont know and refused means NA
    physhlth >=1 & physhlth <=30 ~ as.numeric (physhlth), #valid values
    TRUE ~ NA_real_
    ),
    
    bmi = case_when (
      bmi5 == 9999 ~ NA_real_, 
      bmi5 > 0 ~ bmi5 /100, # convert to actual BMI
      TRUE ~ NA_real_
    ),
    
    sex = factor(ifelse (sexvar == 1, "Male", "Female")
    ),
    
    exercise = factor(case_when(
      exerany2 == 1 ~ "Yes",
      exerany2 == 2 ~ "No",
      exerany2 %in% c(7,9) ~ NA_character_, 
      TRUE ~ NA_character_
    ),
    levels = c ("No", "Yes") # making no as reference
  ),
  
  age_group = factor(case_when(
        ageg5yr == 1 ~ "18-24",
        ageg5yr == 2 ~ "25-29",
        ageg5yr == 3 ~ "30-34",
        ageg5yr == 4 ~ "35-39",
        ageg5yr == 5 ~ "40-44",
        ageg5yr == 6 ~ "45-49",
        ageg5yr == 7 ~ "50-54",
        ageg5yr == 8 ~ "55-59",
        ageg5yr == 9 ~ "60-64",
        ageg5yr == 10 ~ "65-69",
        ageg5yr == 11 ~ "70-74",
        ageg5yr == 12 ~ "75-79",
        ageg5yr == 13 ~ "80+",
        ageg5yr == 14 ~ NA_character_,
        TRUE ~ NA_character_
      ),
      levels = c("18-24", "25-29", "30-34", "35-39", "40-44", "45-49", 
                 "50-54", "55-59", "60-64", "65-69", "70-74", "75-79", "80+")
      #making "18-24" reference 
    ),  
  
   income = factor(case_when(
        incomg1 == 1 ~ "Less than $15,000",
        incomg1 == 2 ~ "$15,000 to <$25,000",
        incomg1 == 3 ~ "$25,000 to <$35,000",
        incomg1 == 4 ~ "$35,000 to <$50,000",
        incomg1 == 5 ~ "$50,000 to <$75,000",
        incomg1 == 6 ~ "$75,000 to <$100,000",
        incomg1 == 7 ~ "$100,000 or more",
        incomg1 == 9 ~ NA_character_,
        TRUE ~ NA_character_
      ),
      levels = c("Less than $15,000", "$15,000 to <$25,000", 
                 "$25,000 to <$35,000", "$35,000 to <$50,000",
                 "$50,000 to <$75,000", "$75,000 to <$100,000",
                 "$100,000 or more")
       # making "Less than $15,000"  reference
    ),
   
  education = factor(case_when(
        educa %in% c(1, 2) ~ "Less than high school",
        educa == 3 ~ "High school diploma or GED",
        educa == 4 ~ "Some college or technical school",
        educa == 5 ~ "College graduate",
        educa == 6 ~ "Graduate or professional degree",
        educa == 9 ~ NA_character_,
        TRUE ~ NA_character_
      ),
      levels = c("Less than high school", "High school diploma or GED",
                 "Some college or technical school", "College graduate",
                 "Graduate or professional degree")
      # making "Less than high school"  reference
    ),
    
    smoking = factor(case_when(
        smoker3 == 1 ~ "Current daily smoker",
        smoker3 == 2 ~ "Current some-day smoker",
        smoker3 == 3 ~ "Former smoker",
        smoker3 == 4 ~ "Never smoker",
        smoker3 == 9 ~ NA_character_,
        TRUE ~ NA_character_
      ),
      levels = c("Never smoker", "Former smoker", 
                 "Current some-day smoker", "Current daily smoker")
    )
    # making "Never smoker" is reference (first in levels)
    
  )

brfss_clean %>%
  select(menthlth_days, bmi, income) %>%
  tbl_summary(
    missing = "always", 
    missing_text = "Missing (n, %)",
    label = list(
      menthlth_days ~ "Mental Health Days",
      bmi ~ "BMI",
      income ~ "Annual Income"
    )
  ) %>%
  modify_caption("**Table: Missing Before Exclusions**")

**Table: Missing Before Exclusions**
Characteristic	N = 433,323¹
Mental Health Days	0 (0, 5)
Missing (n, %)	8,108
BMI	27 (24, 32)
Missing (n, %)	40,535
Annual Income
Less than $15,000	19,187 (5.5%)
$15,000 to <$25,000	31,069 (9.0%)
$25,000 to <$35,000	38,508 (11%)
$35,000 to <$50,000	47,502 (14%)
$50,000 to <$75,000	107,027 (31%)
$75,000 to <$100,000	76,637 (22%)
$100,000 or more	26,770 (7.7%)
Missing (n, %)	86,623
¹ Median (Q1, Q3); n (%)

# Select recoded variables, apply filters, drop missing, take sample
set.seed (1220) # CRITICAL: reproducibility
# Same seed = same random sample across all computers

brfss_analytic <- brfss_clean %>%
  drop_na(menthlth_days, physhlth_days, bmi, sex, exercise, 
         age_group, income, education, smoking) %>%
    # drop_na removes any row with NA in ANY of these columns
  slice_sample(n = 8000) # Randomly select 8,000 rows

# Save analytic dataset
saveRDS(brfss_clean,("C:/Users/suruc/OneDrive/Desktop/R/EPI553_Rclass/brfss_hw3_clean.rds"
))

# Check the recoded data
head(brfss_hw3)

## # A tibble: 6 × 9
##   menthlth physhlth  bmi5 sexvar exerany2 ageg5yr incomg1 educa smoker3
##      <dbl>    <dbl> <dbl>  <dbl>    <dbl>   <dbl>   <dbl> <dbl>   <dbl>
## 1       88       88  3047      2        2      13       9     5       4
## 2       88       88  2856      2        1      13       9     5       4
## 3        2        6  2231      2        1      13       1     4       3
## 4       88        2  2744      2        1      12       9     5       4
## 5       88       88  2585      2        1      12       5     5       4
## 6        3        2  3018      2        1       9       5     5       4

glimpse(brfss_hw3)

## Rows: 433,323
## Columns: 9
## $ menthlth <dbl> 88, 88, 2, 88, 88, 3, 77, 88, 88, 88, 88, 88, 88, 88, 88, 88,…
## $ physhlth <dbl> 88, 88, 6, 2, 88, 2, 8, 1, 5, 88, 88, 2, 88, 88, 88, 4, 30, 1…
## $ bmi5     <dbl> 3047, 2856, 2231, 2744, 2585, 3018, 2441, 2727, 3347, 2296, 4…
## $ sexvar   <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 1, 2, 1, 2…
## $ exerany2 <dbl> 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 1, 2, 2, 1, 7, 2, 2, 1, 1…
## $ ageg5yr  <dbl> 13, 13, 13, 12, 12, 9, 13, 12, 13, 12, 8, 12, 10, 12, 12, 13,…
## $ incomg1  <dbl> 9, 9, 1, 9, 5, 5, 4, 9, 4, 5, 5, 9, 6, 6, 6, 2, 9, 9, 3, 4, 3…
## $ educa    <dbl> 5, 5, 4, 5, 5, 5, 4, 5, 5, 4, 5, 4, 6, 5, 5, 4, 6, 6, 5, 4, 4…
## $ smoker3  <dbl> 4, 4, 3, 4, 4, 4, 3, 4, 4, 3, 4, 4, 3, 3, 3, 4, 4, 4, 1, 3, 3…

head(brfss_clean)

## # A tibble: 6 × 18
##   menthlth physhlth  bmi5 sexvar exerany2 ageg5yr incomg1 educa smoker3
##      <dbl>    <dbl> <dbl>  <dbl>    <dbl>   <dbl>   <dbl> <dbl>   <dbl>
## 1       88       88  3047      2        2      13       9     5       4
## 2       88       88  2856      2        1      13       9     5       4
## 3        2        6  2231      2        1      13       1     4       3
## 4       88        2  2744      2        1      12       9     5       4
## 5       88       88  2585      2        1      12       5     5       4
## 6        3        2  3018      2        1       9       5     5       4
## # ℹ 9 more variables: menthlth_days <dbl>, physhlth_days <dbl>, bmi <dbl>,
## #   sex <fct>, exercise <fct>, age_group <fct>, income <fct>, education <fct>,
## #   smoking <fct>

glimpse(brfss_clean)

## Rows: 433,323
## Columns: 18
## $ menthlth      <dbl> 88, 88, 2, 88, 88, 3, 77, 88, 88, 88, 88, 88, 88, 88, 88…
## $ physhlth      <dbl> 88, 88, 6, 2, 88, 2, 8, 1, 5, 88, 88, 2, 88, 88, 88, 4, …
## $ bmi5          <dbl> 3047, 2856, 2231, 2744, 2585, 3018, 2441, 2727, 3347, 22…
## $ sexvar        <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 1, 2,…
## $ exerany2      <dbl> 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 1, 2, 2, 1, 7, 2, 2,…
## $ ageg5yr       <dbl> 13, 13, 13, 12, 12, 9, 13, 12, 13, 12, 8, 12, 10, 12, 12…
## $ incomg1       <dbl> 9, 9, 1, 9, 5, 5, 4, 9, 4, 5, 5, 9, 6, 6, 6, 2, 9, 9, 3,…
## $ educa         <dbl> 5, 5, 4, 5, 5, 5, 4, 5, 5, 4, 5, 4, 6, 5, 5, 4, 6, 6, 5,…
## $ smoker3       <dbl> 4, 4, 3, 4, 4, 4, 3, 4, 4, 3, 4, 4, 3, 3, 3, 4, 4, 4, 1,…
## $ menthlth_days <dbl> 0, 0, 2, 0, 0, 3, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10, 0, …
## $ physhlth_days <dbl> 0, 0, 6, 2, 0, 2, 8, 1, 5, 0, 0, 2, 0, 0, 0, 4, 30, 15, …
## $ bmi           <dbl> 30.47, 28.56, 22.31, 27.44, 25.85, 30.18, 24.41, 27.27, …
## $ sex           <fct> Female, Female, Female, Female, Female, Female, Male, Fe…
## $ exercise      <fct> No, Yes, Yes, Yes, Yes, Yes, No, No, No, Yes, Yes, No, Y…
## $ age_group     <fct> 80+, 80+, 80+, 75-79, 75-79, 60-64, 80+, 75-79, 80+, 75-…
## $ income        <fct> NA, NA, "Less than $15,000", NA, "$50,000 to <$75,000", …
## $ education     <fct> College graduate, College graduate, Some college or tech…
## $ smoking       <fct> Never smoker, Never smoker, Former smoker, Never smoker,…

head(brfss_analytic)

## # A tibble: 6 × 18
##   menthlth physhlth  bmi5 sexvar exerany2 ageg5yr incomg1 educa smoker3
##      <dbl>    <dbl> <dbl>  <dbl>    <dbl>   <dbl>   <dbl> <dbl>   <dbl>
## 1       88       88  2789      1        1      13       6     6       3
## 2       88       88  2583      1        1       4       6     5       3
## 3       15       88  4743      2        2      10       2     5       4
## 4        4        2  3430      1        1       5       6     5       4
## 5        7        2  2231      2        1       1       1     5       4
## 6        2        3  2652      2        1       4       6     6       3
## # ℹ 9 more variables: menthlth_days <dbl>, physhlth_days <dbl>, bmi <dbl>,
## #   sex <fct>, exercise <fct>, age_group <fct>, income <fct>, education <fct>,
## #   smoking <fct>

glimpse(brfss_analytic)

## Rows: 8,000
## Columns: 18
## $ menthlth      <dbl> 88, 88, 15, 4, 7, 2, 88, 1, 88, 88, 5, 88, 1, 1, 88, 20,…
## $ physhlth      <dbl> 88, 88, 88, 2, 2, 3, 30, 88, 88, 88, 88, 3, 2, 88, 88, 4…
## $ bmi5          <dbl> 2789, 2583, 4743, 3430, 2231, 2652, 2554, 2048, 3361, 21…
## $ sexvar        <dbl> 1, 1, 2, 1, 2, 2, 1, 2, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 1,…
## $ exerany2      <dbl> 1, 1, 2, 1, 1, 1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2,…
## $ ageg5yr       <dbl> 13, 4, 10, 5, 1, 4, 6, 8, 13, 3, 3, 13, 7, 2, 6, 7, 6, 1…
## $ incomg1       <dbl> 6, 6, 2, 6, 1, 6, 7, 5, 5, 1, 5, 5, 6, 2, 6, 4, 5, 6, 4,…
## $ educa         <dbl> 6, 5, 5, 5, 5, 6, 5, 5, 6, 4, 5, 6, 6, 5, 4, 3, 5, 6, 3,…
## $ smoker3       <dbl> 3, 3, 4, 4, 4, 3, 4, 4, 3, 4, 4, 4, 4, 4, 4, 2, 4, 3, 3,…
## $ menthlth_days <dbl> 0, 0, 15, 4, 7, 2, 0, 1, 0, 0, 5, 0, 1, 1, 0, 20, 2, 0, …
## $ physhlth_days <dbl> 0, 0, 0, 2, 2, 3, 30, 0, 0, 0, 0, 3, 2, 0, 0, 4, 0, 0, 1…
## $ bmi           <dbl> 27.89, 25.83, 47.43, 34.30, 22.31, 26.52, 25.54, 20.48, …
## $ sex           <fct> Male, Male, Female, Male, Female, Female, Male, Female, …
## $ exercise      <fct> Yes, Yes, No, Yes, Yes, Yes, No, Yes, No, No, Yes, Yes, …
## $ age_group     <fct> 80+, 35-39, 65-69, 40-44, 18-24, 35-39, 45-49, 55-59, 80…
## $ income        <fct> "$75,000 to <$100,000", "$75,000 to <$100,000", "$15,000…
## $ education     <fct> Graduate or professional degree, College graduate, Colle…
## $ smoking       <fct> Former smoker, Former smoker, Never smoker, Never smoker…

tibble(
  Metric = c("Observations", "Variables"),
  Value  = c(nrow(brfss_analytic), ncol(brfss_analytic))
) %>%
  kable(caption = "Analytic Dataset Dimensions") %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Analytic Dataset Dimensions
Metric	Value
Observations	8000
Variables	18

brfss_analytic %>%
  select(menthlth_days, physhlth_days, bmi, sex, exercise, 
                age_group, income, education, smoking) %>%
  summary() %>%
  kable(caption = "Descriptive Statistics: Key Continuous Variables") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Descriptive Statistics: Key Continuous Variables
menthlth_days	physhlth_days	bmi	sex	exercise	age_group	income	education	smoking
Min. : 0.000	Min. : 0.000	Min. :12.26	Female:4064	No :1760	65-69 : 901	Less than $15,000 : 407	Less than high school : 124	Never smoker :4812
1st Qu.: 0.000	1st Qu.: 0.000	1st Qu.:24.33	Male :3936	Yes:6240	70-74 : 808	$15,000 to <$25,000 : 641	High school diploma or GED : 252	Former smoker :2262
Median : 0.000	Median : 0.000	Median :27.50	NA	NA	60-64 : 787	$25,000 to <$35,000 : 871	Some college or technical school:1827	Current some-day smoker: 268
Mean : 4.318	Mean : 4.426	Mean :28.70	NA	NA	75-79 : 663	$35,000 to <$50,000 :1067	College graduate :2138	Current daily smoker : 658
3rd Qu.: 5.000	3rd Qu.: 4.000	3rd Qu.:31.89	NA	NA	55-59 : 660	$50,000 to <$75,000 :2511	Graduate or professional degree :3659	NA
Max. :30.000	Max. :30.000	Max. :94.02	NA	NA	80+ : 631	$75,000 to <$100,000:1865	NA	NA
NA	NA	NA	NA	NA	(Other):3550	$100,000 or more : 638	NA	NA

brfss_analytic %>%
  select(menthlth_days, physhlth_days, bmi, sex, exercise, 
                age_group, income, education, smoking) %>%
  tbl_summary( # Stratify by sex (Male vs. Female)
    by = sex, 
    label = list(
      menthlth_days ~ "Mental Health Days",
      physhlth_days ~ "Physical Health Days",
      bmi ~ "Body Mass Index",
      exercise ~ "Any Exercise",
      age_group ~ "Age Group",
      income ~ "Annual Income",
      education ~ "Education Level",
      smoking ~ "Smoking Status"
    ),
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1
  ) %>%
  add_n() %>%
  bold_labels() %>%
  modify_caption("**Table 1. Descriptive Statistics (BRFSS 2023, n = 8,000)**")

**Table 1. Descriptive Statistics (BRFSS 2023, n = 8,000)**
Characteristic	N	Female N = 4,064¹	Male N = 3,936¹
Mental Health Days	8,000	5.1 (8.6)	3.5 (7.5)
Physical Health Days	8,000	4.9 (8.9)	4.0 (8.4)
Body Mass Index	8,000	28.7 (7.0)	28.7 (6.0)
Any Exercise	8,000	3,094 (76%)	3,146 (80%)
Age Group	8,000
18-24		171 (4.2%)	235 (6.0%)
25-29		189 (4.7%)	219 (5.6%)
30-34		210 (5.2%)	253 (6.4%)
35-39		302 (7.4%)	263 (6.7%)
40-44		292 (7.2%)	290 (7.4%)
45-49		252 (6.2%)	266 (6.8%)
50-54		303 (7.5%)	305 (7.7%)
55-59		352 (8.7%)	308 (7.8%)
60-64		379 (9.3%)	408 (10%)
65-69		483 (12%)	418 (11%)
70-74		426 (10%)	382 (9.7%)
75-79		338 (8.3%)	325 (8.3%)
80+		367 (9.0%)	264 (6.7%)
Annual Income	8,000
Less than $15,000		247 (6.1%)	160 (4.1%)
$15,000 to <$25,000		370 (9.1%)	271 (6.9%)
$25,000 to <$35,000		495 (12%)	376 (9.6%)
$35,000 to <$50,000		585 (14%)	482 (12%)
$50,000 to <$75,000		1,260 (31%)	1,251 (32%)
$75,000 to <$100,000		869 (21%)	996 (25%)
$100,000 or more		238 (5.9%)	400 (10%)
Education Level	8,000
Less than high school		49 (1.2%)	75 (1.9%)
High school diploma or GED		122 (3.0%)	130 (3.3%)
Some college or technical school		877 (22%)	950 (24%)
College graduate		1,120 (28%)	1,018 (26%)
Graduate or professional degree		1,896 (47%)	1,763 (45%)
Smoking Status	8,000
Never smoker		2,573 (63%)	2,239 (57%)
Former smoker		1,055 (26%)	1,207 (31%)
Current some-day smoker		117 (2.9%)	151 (3.8%)
Current daily smoker		319 (7.8%)	339 (8.6%)
¹ Mean (SD); n (%)

p_hist <- ggplot(brfss_analytic, aes(x = menthlth_days)) +
  geom_histogram(binwidth = 1, fill = "steelblue", color = "white", alpha = 0.85) +
  labs(
    title    = "Distribution of Mentally Unhealthy Days in the Past 30 Days",
    subtitle = "BRFSS 2020 Analytic Sample (n = 8,000)",
    x        = "Number of Mentally Unhealthy Days",
    y        = "Count"
  ) +
  theme_minimal(base_size = 13)

ggplotly(p_hist)

Distribution of Mentally Unhealthy Days (BRFSS 2023)

#Part 1: Multiple Linear Regression (25 Points)

Research question: What is the independent association of each predictor with the number of mentally unhealthy days in the past 30 days?

# Pairs plot of continuous predictors vs outcome
brfss_analytic %>%
  select(menthlth_days, physhlth_days, bmi) %>%
  rename(
    `Mental Health\nDays`  = menthlth_days,
    `Physical Health\nDays` = physhlth_days,
    `BMI\n(kg/m2)`         = bmi,
  ) %>%
  ggpairs(
    lower  = list(continuous = wrap("points", alpha = 0.05, size = 0.5)),
    diag   = list(continuous = wrap("densityDiag", fill = "steelblue", alpha = 0.5)),
    upper  = list(continuous = wrap("cor", size = 4)),
    title  = "Pairwise Relationships Among Key Variables (BRFSS 2023)"
  ) +
  theme_minimal(base_size = 11)

Visualizing Key Predictors vs. Mental Health Days

The pair-wise correlation matrix above shows the distribution of each variable, scatterplots and correlation coefficient for eeach pair. 1. Mental health days and Physical health days (correlation = 0.313 i.e. moderate positive) It shows uoward trend, which means as mental health days increase, physical health days also increase and people who report more days of poor mental health also tend to report more days of poor physical health.

Mental Health days and BMI (correlation = 0.065 ~ weak relationship) It means BMI is onl weakly related. People with higher or lower BMI do not differ as much.
Physical health days and BMI (correlation = 0.088 ~ also weak) It means BMI is also only slightly related to physical health days.

Before fitting any model, we must address missing data. Let’s ensure we don’t with our BRFSS context

# Assess missingness in our analytic variables before the sample was taken
brfss_analytic %>%
  select(menthlth_days, physhlth_days, bmi, sex, exercise, 
                age_group, income, education, smoking) %>%
  summarise(across(everything(), ~ sum(is.na(.) | . %in% c(77, 88, 99)) / n() * 100)) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Pct_Missing_or_DK") %>%
  mutate(Pct_Missing_or_DK = round(Pct_Missing_or_DK, 1)) %>%
  arrange(desc(Pct_Missing_or_DK)) %>%
  kable(
    caption = "Table 2. Missing or Don't Know/Refused (%) — BRFSS 2023 Full Sample",
    col.names = c("Variable", "% Missing / DK / Refused")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 2. Missing or Don’t Know/Refused (%) — BRFSS 2023 Full Sample
Variable	% Missing / DK / Refused
menthlth_days	0
physhlth_days	0
bmi	0
sex	0
exercise	0
age_group	0
income	0
education	0
smoking	0

##Step 1: Fit a multiple linear regression model with menthlth_days as the outcome and the following predictors: physhlth_days, bmi, sex, exercise, age_group, income, education, and smoking. Display the full summary() output

# Model: Full multivariable model
mod_mlr_full <- lm(
  menthlth_days ~ physhlth_days + bmi + sex + exercise + age_group + 
                  income + education + smoking,
  data = brfss_analytic)

# Display the full summary

summary(mod_mlr_full)

## 
## Call:
## lm(formula = menthlth_days ~ physhlth_days + bmi + sex + exercise + 
##     age_group + income + education + smoking, data = brfss_analytic)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.080  -3.865  -1.597   0.712  30.471 
## 
## Coefficients:
##                                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                                8.10409    0.92801   8.733  < 2e-16
## physhlth_days                              0.26558    0.01007  26.384  < 2e-16
## bmi                                        0.03338    0.01321   2.527 0.011510
## sexMale                                   -1.39038    0.16750  -8.301  < 2e-16
## exerciseYes                               -0.65116    0.21472  -3.033 0.002432
## age_group25-29                            -1.05660    0.51938  -2.034 0.041950
## age_group30-34                            -1.09395    0.50646  -2.160 0.030803
## age_group35-39                            -1.81103    0.48851  -3.707 0.000211
## age_group40-44                            -2.89307    0.48749  -5.935 3.07e-09
## age_group45-49                            -3.05618    0.49769  -6.141 8.61e-10
## age_group50-54                            -3.51674    0.48323  -7.277 3.72e-13
## age_group55-59                            -4.49597    0.47555  -9.454  < 2e-16
## age_group60-64                            -4.52215    0.45848  -9.863  < 2e-16
## age_group65-69                            -5.57795    0.45019 -12.390  < 2e-16
## age_group70-74                            -6.02536    0.45741 -13.173  < 2e-16
## age_group75-79                            -6.28656    0.47484 -13.239  < 2e-16
## age_group80+                              -6.81968    0.47684 -14.302  < 2e-16
## income$15,000 to <$25,000                 -1.63703    0.46899  -3.491 0.000485
## income$25,000 to <$35,000                 -2.07637    0.44797  -4.635 3.63e-06
## income$35,000 to <$50,000                 -2.54685    0.43819  -5.812 6.40e-09
## income$50,000 to <$75,000                 -3.05048    0.41069  -7.428 1.22e-13
## income$75,000 to <$100,000                -3.49984    0.42923  -8.154 4.07e-16
## income$100,000 or more                    -3.78409    0.50036  -7.563 4.38e-14
## educationHigh school diploma or GED        0.07991    0.81066   0.099 0.921484
## educationSome college or technical school  1.07679    0.68973   1.561 0.118520
## educationCollege graduate                  1.79091    0.69119   2.591 0.009585
## educationGraduate or professional degree   1.73781    0.69250   2.509 0.012111
## smokingFormer smoker                       0.94710    0.19329   4.900 9.77e-07
## smokingCurrent some-day smoker             1.35011    0.46820   2.884 0.003942
## smokingCurrent daily smoker                2.93681    0.32162   9.131  < 2e-16
##                                              
## (Intercept)                               ***
## physhlth_days                             ***
## bmi                                       *  
## sexMale                                   ***
## exerciseYes                               ** 
## age_group25-29                            *  
## age_group30-34                            *  
## age_group35-39                            ***
## age_group40-44                            ***
## age_group45-49                            ***
## age_group50-54                            ***
## age_group55-59                            ***
## age_group60-64                            ***
## age_group65-69                            ***
## age_group70-74                            ***
## age_group75-79                            ***
## age_group80+                              ***
## income$15,000 to <$25,000                 ***
## income$25,000 to <$35,000                 ***
## income$35,000 to <$50,000                 ***
## income$50,000 to <$75,000                 ***
## income$75,000 to <$100,000                ***
## income$100,000 or more                    ***
## educationHigh school diploma or GED          
## educationSome college or technical school    
## educationCollege graduate                 ** 
## educationGraduate or professional degree  *  
## smokingFormer smoker                      ***
## smokingCurrent some-day smoker            ** 
## smokingCurrent daily smoker               ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.352 on 7970 degrees of freedom
## Multiple R-squared:  0.1853, Adjusted R-squared:  0.1824 
## F-statistic: 62.52 on 29 and 7970 DF,  p-value: < 2.2e-16

#model tidy table

tidy(mod_mlr_full, conf.int = TRUE) %>%
  mutate(
    term = dplyr::recode(term,
      
      "(Intercept)"                               = "Intercept",  
      "physhlth_days"                             = "Physical health days",
      "bmi"                                       = "Body Mass Index",
      "sexMale"                                   = "Sex: Male (ref = Female)",
      "exerciseYes"                               = "Any Exercise: Yes (ref = No)",
      "age_group25-29"                            = "Age: 25-29",
      "age_group30-34"                            = "Age: 30-34",
      "age_group35-39"                            = "Age: 35-39",
      "age_group40-44"                            = "Age: 40-44",
      "age_group45-49"                            = "Age: 45-49",
      "age_group50-54"                            = "Age: 50-54",
      "age_group55-59"                            = "Age: 55-59",
      "age_group60-64"                            = "Age: 60-64",
      "age_group65-69"                            = "Age: 65-69",
      "age_group70-74"                            = "Age: 70-74",
      "age_group75-79"                            = "Age: 75-79",
      "age_group80+"                              = "Age: 80+",
      `income$15,000 to <$25,000`                 = "Income: $15k to <$25k",
      `income$25,000 to <$35,000`                 = "Income: $25k to <$35k",
      `income$35,000 to <$50,000`                 = "Income: $35k to <$50k",
      `income$50,000 to <$75,000`                 = "Income: $50k to <$75k",
      `income$75,000 to <$100,000`                = "Income: $75k to <$100k",
      `income$100,000 or more`                    = "Income: $100k+",
      "educationHigh school diploma or GED"       = "Education: HS Diploma/GED",
      "educationSome college or technical school" = "Education: Some College",
      "educationCollege graduate"                 = "Education: College Graduate",
      "educationGraduate or professional degree"  = "Education: Graduate Degree",
      "smokingFormer smoker"                      = "Smoking: Former",
      "smokingCurrent some-day smoker"             = "Smoking: Current (some days)",
      "smokingCurrent daily smoker"               = "Smoking: Current (daily)"
),
    across(where(is.numeric), ~ round(., 4))
  ) %>%
  kable(
    caption   = "Table. Multiple Linear Regression: Mentally Unhealthy Days ~ Multiple Predictors (BRFSS 2023, n = 8,000)",
    col.names = c("Term", "Estimate", "Std. Error", "t-statistic", "p-value", "95% CI Lower", "95% CI Upper"),
    format = "html"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = TRUE) %>%
  row_spec(0, bold = TRUE) %>%
  row_spec(c(2, 3), background = "#EBF5FB")

Table. Multiple Linear Regression: Mentally Unhealthy Days ~ Multiple Predictors (BRFSS 2023, n = 8,000)
Term	Estimate	Std. Error	t-statistic	p-value	95% CI Lower	95% CI Upper
Intercept	8.1041	0.9280	8.7328	0.0000	6.2850	9.9232
Physical health days	0.2656	0.0101	26.3841	0.0000	0.2459	0.2853
Body Mass Index	0.0334	0.0132	2.5274	0.0115	0.0075	0.0593
Sex: Male (ref = Female)	-1.3904	0.1675	-8.3007	0.0000	-1.7187	-1.0620
Any Exercise: Yes (ref = No)	-0.6512	0.2147	-3.0326	0.0024	-1.0721	-0.2303
Age: 25-29	-1.0566	0.5194	-2.0343	0.0420	-2.0747	-0.0385
Age: 30-34	-1.0939	0.5065	-2.1600	0.0308	-2.0867	-0.1012
Age: 35-39	-1.8110	0.4885	-3.7072	0.0002	-2.7686	-0.8534
Age: 40-44	-2.8931	0.4875	-5.9346	0.0000	-3.8487	-1.9375
Age: 45-49	-3.0562	0.4977	-6.1408	0.0000	-4.0318	-2.0806
Age: 50-54	-3.5167	0.4832	-7.2775	0.0000	-4.4640	-2.5695
Age: 55-59	-4.4960	0.4755	-9.4543	0.0000	-5.4282	-3.5638
Age: 60-64	-4.5221	0.4585	-9.8633	0.0000	-5.4209	-3.6234
Age: 65-69	-5.5779	0.4502	-12.3903	0.0000	-6.4604	-4.6955
Age: 70-74	-6.0254	0.4574	-13.1728	0.0000	-6.9220	-5.1287
Age: 75-79	-6.2866	0.4748	-13.2392	0.0000	-7.2174	-5.3557
Age: 80+	-6.8197	0.4768	-14.3019	0.0000	-7.7544	-5.8850
Income: $15k to <$25k	-1.6370	0.4690	-3.4905	0.0005	-2.5564	-0.7177
Income: $25k to <$35k	-2.0764	0.4480	-4.6351	0.0000	-2.9545	-1.1982
Income: $35k to <$50k	-2.5469	0.4382	-5.8122	0.0000	-3.4058	-1.6879
Income: $50k to <$75k	-3.0505	0.4107	-7.4277	0.0000	-3.8555	-2.2454
Income: $75k to <$100k	-3.4998	0.4292	-8.1537	0.0000	-4.3413	-2.6584
Income: $100k+	-3.7841	0.5004	-7.5628	0.0000	-4.7649	-2.8033
Education: HS Diploma/GED	0.0799	0.8107	0.0986	0.9215	-1.5092	1.6690
Education: Some College	1.0768	0.6897	1.5612	0.1185	-0.2753	2.4288
Education: College Graduate	1.7909	0.6912	2.5911	0.0096	0.4360	3.1458
Education: Graduate Degree	1.7378	0.6925	2.5095	0.0121	0.3803	3.0953
Smoking: Former	0.9471	0.1933	4.9000	0.0000	0.5682	1.3260
Smoking: Current (some days)	1.3501	0.4682	2.8836	0.0039	0.4323	2.2679
Smoking: Current (daily)	2.9368	0.3216	9.1313	0.0000	2.3063	3.5673

Physical health days For each additional day of poor physical health, people report about 0.27 more mentally unhealthy days.People who feel physically unwell more often also tend to feel mentally unwell more often.
Age All age coefficients are negative which means fewer mentally unhealthy days compared to the reference group of age 18-24. By age 50-54, about 3.5 and by age 80+, about 6.8 fewer mentally unhealthy days. In a way, mental distress seems to be more among youngest adults and steadily decreases more with age.
Income Higher income is linked to better mental health.Compared to the lowest income group, middle-income income group report 1.6 to 3 fewer mentally unhealthy days.
Sex Men report fewer mentally unhealthy days (1.4).
Exercise: People who exercise report better mental health (0.65).
BMI: Each unit increase in BMI added about 0.033 mentally unhealthy days.
Smoking: Smoking is strongly associated with worse mental health, especially daily smoking (294 more days).
Education: Only college graduates and those with graduate degrees show slightly higher mentally unhealthy days (about +1.7 to +1.8 days).

#Step 2: Write the fitted regression equation including all terms, with coefficients rounded to three decimal places.

The Fitted Equation

\[\widehat{\text{Ment. Health Days}} = \beta_0 + \beta_1(\text{Phys. Health}) + \beta_2(\text{BMI}) + \beta_3(\text{Sex: Male}) + \beta_4(\text{Exercise: Yes}) + \sum \beta_{age}(\text{Age Group}) + \sum \beta_{inc}(\text{Income}) + \sum \beta_{edu}(\text{Education}) + \sum \beta_{smk}(\text{Smoking})\]

coefs <- round(coef(mod_mlr_full), 3)
ci <- round(confint(mod_mlr_full), 3)

coef(mod_mlr_full)["physhlth_days"]

## physhlth_days 
##     0.2655846

coef(mod_mlr_full)["bmi"]

##        bmi 
## 0.03337537

coef(mod_mlr_full)["sexMale"]

##   sexMale 
## -1.390377

coef(mod_mlr_full)["exerciseYes"]

## exerciseYes 
##  -0.6511597

coef(mod_mlr_full)["age_group45-49"]

## age_group45-49 
##      -3.056179

coef(mod_mlr_full)["income$75,000 to <$100,000"]

## income$75,000 to <$100,000 
##                   -3.49984

coef(mod_mlr_full)["educationHigh school diploma or GED"]

## educationHigh school diploma or GED 
##                          0.07990525

coef(mod_mlr_full)["smokingFormer smoker"]

## smokingFormer smoker 
##            0.9471031

Interpreting Each Term

Intercept (β₀ = 8.1041): This represents the predicted mentally unhealthy days for someone with zero physically unhealthy days, BMI = 0, age = 0, income = 0, female, and no exercise — a scenario that is not meaningful in practice. It is simply a mathematical baseline.

Physical health days (β = 0.2656): Each additional day of poor physical health is associated with 0.27 more mentally unhealthy days on average, holding all other variables constant (95% CI: 0.2459 to 0.2853). This is the strongest predictor in the model and reflects the well‑documented connection between physical and mental well‑being.
Body Mass Index (β = 0.0334): For each one‑unit increase in BMI, people report about 0.03 more mentally unhealthy days on average (95% CI: 0.0075 to 0.0593). This is a small but statistically detectable association.
Sex: Male (β = –1.3904): Men report about 1.39 fewer mentally unhealthy days than women, adjusting for all other variables (95% CI: –1.7187 to –1.0620).
Exercise: Yes (β = –0.6512): People who exercised in the past 30 days report 0.65 fewer mentally unhealthy days than those who did not (95% CI: –1.0721 to –0.2303).
Age (reference = 18–24): All age groups report fewer mentally unhealthy days than the youngest adults. For example:

Age 25–29: 1.06 fewer days (95% CI: –2.0747 to –0.0385)
Age 50–54: 3.52 fewer days (95% CI: –4.4640 to –2.5695)
Age 80+: 6.82 fewer days (95% CI: –7.7544 to –5.8850)

Income (reference = <$15k): Higher income is strongly associated with fewer mentally unhealthy days. For example:

$25k–35k: 2.08 fewer days (95% CI: –2.9545 to –1.1982)
$50k–75k: 3.05 fewer days (95% CI: –3.8555 to –2.2454)
$100k+: 3.78 fewer days (95% CI: –4.7649 to –2.8033)**

Education: Most education categories are not strongly associated with mental health days, except:

College graduates: 1.79 more days (95% CI: 0.4360 to 3.1458)
Graduate degree: 1.74 more days (95% CI: 0.3803 to 3.0953)**

Smoking: Compared to never‑smokers:

Former smokers: 0.95 more days
Some‑day smokers: 1.35 more days
Daily smokers: 2.94 more days Smoking shows a clear dose‑response pattern with worse mental health.

##Step 3: Interpret the following coefficients in plain language. For each, state the direction, magnitude, and meaning in terms of mentally unhealthy days, holding all other variables constant:

• physhlth_days (continuous predictor) • bmi (continuous predictor) • sex: Female vs. Male (reference) • exercise: Yes vs. No (reference) • One age_group coefficient of your choice • Two income coefficients of your choice, compared to the reference category (Less than $15,000)

Physical health days: Each additional day of poor physical health is linked with about 0.27 more mentally unhealthy days, meaning people who feel physically unwell more often also tend to feel mentally unwell more often, even after accounting for all other variables.

BMI: Each one‑unit increase in BMI is associated with about 0.03 more mentally unhealthy days, a very small increase that suggests only a weak relationship between body weight and mental health holding all other predictors constant.

Sex (Female vs. Male): Women report about 1.39 more mentally unhealthy days than men on average, holding all other predictors constant, indicating that females experience more frequent mental distress in this dataset.

Exercise (Yes vs. No): People who exercised in the past month report about 0.65 fewer mentally unhealthy days than those who did not exercise, suggesting that physical activity is associated with better mental well‑being, holding all other predictors constant.

Age group (example: 50–54 vs. 18–24 (ref)): Adults aged 50–54 report about 3.52 fewer mentally unhealthy days than adults aged 18–24, showing that mental distress decreases substantially with age after holding all other predictors comstant.

Income (examples: $25k–35k and $100k+ vs. <$15k) compared to the reference category (Less than $15,000): Adults earning $25k–35k report about 2.08 fewer mentally unhealthy days than those earning under $15k, and adults earning $100k or more report about 3.78 fewer mentally unhealthy days, demonstrating a strong gradient where higher income is associated with better mental health, holding all other predictors constant.

#Step 4: Report and interpret each of the following model-level statistics:

\[H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0 \quad \text{(no predictor is useful)}\]

H₀: All regression coefficients for the predictors are zero. (None of the predictors collectively explain variation in mentally unhealthy days.)

\[H_1: \text{At least one } \beta_j \neq 0\]

H₁: At least one regression coefficient is non‑zero. (At least one predictor contributes to explaining mentally unhealthy days.)

glance(mod_mlr_full) %>%
  select(r.squared, adj.r.squared, sigma, statistic, p.value, df, df.residual, nobs) %>%
  pivot_longer(everything(), names_to = "Metric", values_to = "Value") %>%
  mutate(
    Value = round(Value, 4),
    Metric = dplyr::recode(Metric,
      "r.squared"     = "R²",
      "adj.r.squared" = "Adjusted R²",
      "sigma"         = "Root MSE (σ̂)",
      "statistic"     = "F-statistic (overall)",
      "p.value"       = "p-value (overall F-test)",
      "df"            = "Model df (p)",
      "df.residual"   = "Residual df (n − p − 1)",
      "nobs"          = "n (observations)"
    )
  ) %>%
  kable(caption = "Table. Overall Model MLR Summary") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table. Overall Model MLR Summary
Metric	Value
R²	0.1853
Adjusted R²	0.1824
Root MSE (σ̂)	7.351

• R-squared: proportion of variance in mentally unhealthy days explained by all predictors

The model had an R‑squared of 0.1853, meaning that about 18.5% of the differences in mentally unhealthy days across adults in the sample were explained by the predictors included in the model.

• Adjusted R-squared: how it differs from R-squared and what it adds

The adjusted R‑squared was slightly lower at 0.1824, which accounts for the number of predictors and provides a more realistic estimate of how well the model would perform in new data.

• Root MSE (residual standard error): average prediction error of the model

The Root MSE (residual standard error) was 7.35, indicating that, on average, the model’s predictions differ from people’s actual mentally unhealthy days by about seven days.

• Overall F-test: state H0, F-statistic, degrees of freedom (numerator and denominator), p-value, and conclusion

The overall F‑test was significant (F = 62.52 on 29 and 7,970 degrees of freedom, p < 0.001), meaning that the full set of predictors collectively explains far more variation in mentally unhealthy days than a model with no predictors.

#Part 2: Tests of Hypotheses (20 Points)

#Step 1: Obtain Type III (partial) sums of squares for all predictors using car::Anova(model, type = “III”). Display the full output. Identify which predictors have statistically significant partial associations at α = 0.05.

# Type III using car::Anova()

anova_type3 <- Anova(mod_mlr_full, type = "III")
print(anova_type3)

## Anova Table (Type III tests)
## 
## Response: menthlth_days
##               Sum Sq   Df  F value    Pr(>F)    
## (Intercept)     4122    1  76.2618 < 2.2e-16 ***
## physhlth_days  37623    1 696.1198 < 2.2e-16 ***
## bmi              345    1   6.3879 0.0115095 *  
## sex             3724    1  68.9012 < 2.2e-16 ***
## exercise         497    1   9.1966 0.0024325 ** 
## age_group      30092   12  46.3986 < 2.2e-16 ***
## income          4734    6  14.5982 < 2.2e-16 ***
## education       1265    4   5.8521 0.0001064 ***
## smoking         5101    3  31.4613 < 2.2e-16 ***
## Residuals     430750 7970                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# output in table
anova_type3 %>%
  as.data.frame() %>%
  rownames_to_column("Source") %>%
  mutate(across(where(is.numeric), ~ round(., 4))) %>%
  kable(
    caption = "Table 5. Type III (Partial) Sums of Squares — car::Anova()",
    col.names = c("Source", "Sum of Sq", "df", "F value", "p-value")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  footnote(general = "Each F-test uses the MSE from the full model. Each variable is tested given ALL other variables.")

Table 5. Type III (Partial) Sums of Squares — car::Anova()
Source	Sum of Sq	df	F value	p-value
(Intercept)	4121.6807	1	76.2618	0.0000
physhlth_days	37622.7952	1	696.1198	0.0000
bmi	345.2408	1	6.3879	0.0115
sex	3723.8662	1	68.9012	0.0000
exercise	497.0434	1	9.1966	0.0024
age_group	30092.1774	12	46.3986	0.0000
income	4733.8943	6	14.5982	0.0000
education	1265.1504	4	5.8521	0.0001
smoking	5101.1076	3	31.4613	0.0000
Residuals	430750.0872	7970	NA	NA
Note:
Each F-test uses the MSE from the full model. Each variable is tested given ALL other variables.

Interpretation: Every predictor except the intercept significantly improves the model, and physical health days, age group, income, sex, and smoking are the strongest contributors.Detailed interpretation for each predictor are as below:

physHlth_days (F = 696.12, p < 0.001) This is by far the strongest predictor. Even after adjusting for BMI, sex, exercise, age, income, education, and smoking, physical health days still explains a large amount of variation in mentally unhealthy days.

bmi (F = 6.39, p = 0.0115) BMI adds a small but statistically significant contribution. It matters, but not nearly as much as the major predictors.

sex (F = 68.90, p < 0.001) Sex is an important predictor even after adjusting for everything else. Women report more mentally unhealthy days than men.

exercise (F = 9.20, p = 0.0024) Exercise still contributes meaningfully to the model after controlling for other variables.

age_group (F = 46.40, p < 0.001) Age is a major predictor. The entire set of age categories collectively improves the model.

income (F = 14.60, p < 0.001) Income categories, as a group, significantly improve the model. This matches your chunk test.

education (F = 5.85, p = 0.0001) Education adds a modest but statistically significant contribution.

smoking (F = 31.46, p < 0.001) Smoking status is a strong predictor of mentally unhealthy days.

#Step 2 (Chunk test for income): Test whether income collectively improves the model significantly, after accounting for all other predictors:

Income has 7 categories → 6 dummy variables. Do they collectively matter?

We test this by comparing:

Model A (reduced): Without income variables Model B (full): With income variables

#13.Fit a reduced model that includes all predictors except income.

# Reduced model: without income
m_noincome <- lm(menthlth_days ~ physhlth_days + bmi + sex + exercise + age_group + 
                  education + smoking,  # income removed
  data = brfss_analytic
)
# Test: does income improve the fit?
income_test <- anova(m_noincome, mod_mlr_full)
print(income_test)

## Analysis of Variance Table
## 
## Model 1: menthlth_days ~ physhlth_days + bmi + sex + exercise + age_group + 
##     education + smoking
## Model 2: menthlth_days ~ physhlth_days + bmi + sex + exercise + age_group + 
##     income + education + smoking
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   7976 435484                                  
## 2   7970 430750  6    4733.9 14.598 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#14.Use anova(model_reduced, model_full) to compare the two models.

# Compare using anova() — this performs the partial F-test
# Chunk test
anova(m_noincome, mod_mlr_full) %>%
  as.data.frame() %>%
  rownames_to_column("Model") %>%
  mutate(
    Model = c("Reduced (without income)", "Full (+ income)"),
    across(where(is.numeric), ~ round(., 4))
  ) %>%
  kable(
    caption = "Table Chunk Test: Does Income Collectively Improve the Model?",
    col.names = c("Model", "Res. df", "RSS", "df", "Sum of Sq", "F", "p-value")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table Chunk Test: Does Income Collectively Improve the Model?
Model	Res. df	RSS	df	Sum of Sq	F	p-value
Reduced (without income)	7976	435484.0	NA	NA	NA	NA
Full (+ income)	7970	430750.1	6	4733.894	14.5982	0

Interpretation: Adding income reduces RSS by 4733.894. Income adds 6 degrees of freedom and F = 14.5982 with p < 0.001 Income categories collectively improve the model beyond all other predictors.

#15.State H0 and HA for this test clearly.

\[H_0: \beta^* = 0 \quad \text{($X^*$ does not add to the model given the other variables)}\]

\[H_1: \beta^* \neq 0\]

H0: All income coefficients = 0 (income doesn’t matter) H1: At least one income coefficient ≠ 0 (income matters)

#16.Report the F-statistic, degrees of freedom, and p-value.

# Extract the overall F-test from summary
f_stat <- summary(m_noincome)$fstatistic
cat("F-statistic:", round(f_stat[1], 2), "\n")

## F-statistic: 74.27

cat("Numerator df (p):", f_stat[2], "\n")

## Numerator df (p): 23

cat("Denominator df (n-p-1):", f_stat[3], "\n")

## Denominator df (n-p-1): 7976

cat("p-value:", format.pval(pf(f_stat[1], f_stat[2], f_stat[3], lower.tail = FALSE)), "\n")

## p-value: < 2.22e-16

F‑statistic: 23
Numerator df: 29
Denominator df: 7,976
p‑value: < 2.2e‑16

#17.State your conclusion at α = 0.05.

Because the p‑value is far below 0.05, we reject the null hypothesis and conclude that income categories explains a statistically significant amount of variation in mentally unhealthy days.

#Step 3 (Chunk test for education): Repeat the same procedure for education as a group of predictors.

Fit a reduced model that includes all predictors except education

# Reduced model: without education
m_noedu <- lm(menthlth_days ~ physhlth_days + bmi + sex + exercise + age_group + 
                  income + smoking,  # education removed
  data = brfss_analytic
)
# Test: does income improve the fit?
edu_test <- anova(m_noedu, mod_mlr_full)
print(edu_test)

## Analysis of Variance Table
## 
## Model 1: menthlth_days ~ physhlth_days + bmi + sex + exercise + age_group + 
##     income + smoking
## Model 2: menthlth_days ~ physhlth_days + bmi + sex + exercise + age_group + 
##     income + education + smoking
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   7974 432015                                  
## 2   7970 430750  4    1265.2 5.8521 0.0001064 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Use anova(model_reduced, model_full) to compare the two models.

# Compare using anova() — this performs the partial F-test
# Chunk test
anova(m_noedu, mod_mlr_full) %>%
  as.data.frame() %>%
  rownames_to_column("Model") %>%
  mutate(
    Model = c("Reduced (without education)", "Full (+ education)"),
    across(where(is.numeric), ~ round(., 4))
  ) %>%
  kable(
    caption = "Table Chunk Test: Does education Collectively Improve the Model?",
    col.names = c("Model", "Res. df", "RSS", "df", "Sum of Sq", "F", "p-value")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table Chunk Test: Does education Collectively Improve the Model?
Model	Res. df	RSS	df	Sum of Sq	F	p-value
Reduced (without education)	7974	432015.2	NA	NA	NA	NA
Full (+ education)	7970	430750.1	4	1265.15	5.8521	1e-04

Interpretation: Adding education reduces RSS by 1265.15. Education adds 4 degrees of freedom and F = 5.8521 with p < 0.001 so, Education categories collectively improve the model beyond all other predictors.

#State H0 and HA for this test clearly.

\[H_0: \beta^* = 0 \quad \text{($X^*$ does not add to the model given the other variables)}\]

H₀: All education coefficients = 0

\[H_1: \beta^* \neq 0\] H1: At least one education coefficient ≠ 0

Report the F-statistic, degrees of freedom, and p-value.

# Extract the overall F-test from summary
f_stat <- summary(m_noedu)$fstatistic
cat("F-statistic:", round(f_stat[1], 2), "\n")

## F-statistic: 71.42

cat("Numerator df (p):", f_stat[2], "\n")

## Numerator df (p): 25

cat("Denominator df (n-p-1):", f_stat[3], "\n")

## Denominator df (n-p-1): 7974

cat("p-value:", format.pval(pf(f_stat[1], f_stat[2], f_stat[3], lower.tail = FALSE)), "\n")

## p-value: < 2.22e-16

F‑statistic: 71.42
Numerator df: 25
Denominator df: 7,974
p‑value: < 2.2e‑16

#17.State your conclusion at α = 0.05.

Because the p‑value is far below 0.05, we reject the null hypothesis and conclude that education categories also explains a statistically significant amount of variation in mentally unhealthy days.

#Step 4: Write a 3 to 5 sentence synthesis interpreting the Type III results and your chunk test findings. Which predictors make the strongest independent contributions? What do the chunk tests add beyond the individual t-tests from summary()?

The Type III ANOVA results show that several predictors make strong independent contributions to mentally unhealthy days, especially physical health days, age group, income, sex, and smoking, all of which have large F‑values and highly significant p‑values. Education also shows a statistically significant partial effect, though its contribution is smaller than the major predictors. The chunk tests also agree with these findings by showing that both income (F = 14.60, p < 0.001) and education (F = 5.85, p = 0.0001) collectively improve the model beyond all other variables, even when individual coefficients may appear modest. Chunk tests add value because they evaluate entire blocks of related predictors such as all income categories or all education levels rather than testing each coefficient separately, which is especially important for multi‑level categorical variables. The Type III results and chunk tests demonstrate that while some individual coefficients are small, the grouped predictors still meaningfully enhance model fit.

#Part 3: Interaction Analysis (25 Points)

Background: Effect modification occurs when the association between a predictor and an outcome differs across levels of another variable. You will test whether the association between current smoking and mentally unhealthy days differs between men and women.

#Step 1: Create a binary smoking variable called smoker_current:

• “Current smoker” includes current daily smokers and current some-day smokers. • “Non-smoker” includes former smokers and never smokers. Use “Non-smoker” as the reference category.

brfss_analytic <- brfss_analytic %>%
  mutate(
    smoker_current = factor(
      case_when(
        smoking %in% c("Current daily smoker", "Current some-day smoker") 
          ~ "Current smoker",
        smoking %in% c("Former smoker", "Never smoker") 
          ~ "Non-smoker",
        TRUE ~ NA_character_
      ),
      levels = c("Non-smoker", "Current smoker")
      # Non-smoker is reference (first)
    )
  )

# Check the new variable
table(brfss_analytic$smoker_current)

## 
##     Non-smoker Current smoker 
##           7074            926

#Step 2: Fit two models: • Model A (main effects only): regress menthlth_days on physhlth_days, bmi, sex, smoker_current, exercise, age_group, income, and education. • Model B (with interaction): same as Model A, but use sex * smoker_current in the formula to include the interaction term.

# Model without interaction  (Model A: No sex × smoker interaction)
mod_A <- lm(menthlth_days ~ physhlth_days + bmi + sex + smoker_current + exercise + 
                  age_group + income + education, data = brfss_analytic)

# Model with interaction (Model B: Includes sex × smoker_current interaction)
mod_B <- lm(
  menthlth_days ~ physhlth_days + bmi + sex + smoker_current + exercise + 
                  age_group + income + education + 
                  sex:smoker_current,  # INTERACTION TERM
  data = brfss_analytic)

summary (mod_A)

## 
## Call:
## lm(formula = menthlth_days ~ physhlth_days + bmi + sex + smoker_current + 
##     exercise + age_group + income + education, data = brfss_analytic)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.2913  -3.8732  -1.6219   0.6681  30.4937 
## 
## Coefficients:
##                                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                                8.08842    0.92973   8.700  < 2e-16
## physhlth_days                              0.26862    0.01007  26.673  < 2e-16
## bmi                                        0.03341    0.01323   2.525 0.011589
## sexMale                                   -1.33309    0.16732  -7.967 1.84e-15
## smoker_currentCurrent smoker               2.12874    0.27121   7.849 4.74e-15
## exerciseYes                               -0.67248    0.21503  -3.127 0.001770
## age_group25-29                            -0.91492    0.51978  -1.760 0.078412
## age_group30-34                            -0.88226    0.50606  -1.743 0.081301
## age_group35-39                            -1.58102    0.48765  -3.242 0.001191
## age_group40-44                            -2.61572    0.48577  -5.385 7.46e-08
## age_group45-49                            -2.82462    0.49698  -5.684 1.37e-08
## age_group50-54                            -3.26004    0.48206  -6.763 1.45e-11
## age_group55-59                            -4.23010    0.47413  -8.922  < 2e-16
## age_group60-64                            -4.24840    0.45682  -9.300  < 2e-16
## age_group65-69                            -5.23383    0.44671 -11.716  < 2e-16
## age_group70-74                            -5.70233    0.45453 -12.546  < 2e-16
## age_group75-79                            -5.89774    0.47031 -12.540  < 2e-16
## age_group80+                              -6.48879    0.47372 -13.697  < 2e-16
## income$15,000 to <$25,000                 -1.67974    0.46981  -3.575 0.000352
## income$25,000 to <$35,000                 -2.10230    0.44863  -4.686 2.83e-06
## income$35,000 to <$50,000                 -2.58691    0.43898  -5.893 3.95e-09
## income$50,000 to <$75,000                 -3.08227    0.41140  -7.492 7.50e-14
## income$75,000 to <$100,000                -3.53598    0.43000  -8.223 2.30e-16
## income$100,000 or more                    -3.86251    0.50112  -7.708 1.43e-14
## educationHigh school diploma or GED        0.21386    0.81186   0.263 0.792237
## educationSome college or technical school  1.19653    0.69073   1.732 0.083264
## educationCollege graduate                  1.90349    0.69219   2.750 0.005974
## educationGraduate or professional degree   1.74564    0.69382   2.516 0.011890
##                                              
## (Intercept)                               ***
## physhlth_days                             ***
## bmi                                       *  
## sexMale                                   ***
## smoker_currentCurrent smoker              ***
## exerciseYes                               ** 
## age_group25-29                            .  
## age_group30-34                            .  
## age_group35-39                            ** 
## age_group40-44                            ***
## age_group45-49                            ***
## age_group50-54                            ***
## age_group55-59                            ***
## age_group60-64                            ***
## age_group65-69                            ***
## age_group70-74                            ***
## age_group75-79                            ***
## age_group80+                              ***
## income$15,000 to <$25,000                 ***
## income$25,000 to <$35,000                 ***
## income$35,000 to <$50,000                 ***
## income$50,000 to <$75,000                 ***
## income$75,000 to <$100,000                ***
## income$100,000 or more                    ***
## educationHigh school diploma or GED          
## educationSome college or technical school .  
## educationCollege graduate                 ** 
## educationGraduate or professional degree  *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.366 on 7972 degrees of freedom
## Multiple R-squared:  0.182,  Adjusted R-squared:  0.1792 
## F-statistic:  65.7 on 27 and 7972 DF,  p-value: < 2.2e-16

summary (mod_B)

## 
## Call:
## lm(formula = menthlth_days ~ physhlth_days + bmi + sex + smoker_current + 
##     exercise + age_group + income + education + sex:smoker_current, 
##     data = brfss_analytic)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.337  -3.837  -1.604   0.628  30.426 
## 
## Coefficients:
##                                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                                8.08491    0.92943   8.699  < 2e-16
## physhlth_days                              0.26859    0.01007  26.679  < 2e-16
## bmi                                        0.03309    0.01323   2.502 0.012380
## sexMale                                   -1.18553    0.17752  -6.678 2.58e-11
## smoker_currentCurrent smoker               2.80406    0.38412   7.300 3.16e-13
## exerciseYes                               -0.67285    0.21496  -3.130 0.001753
## age_group25-29                            -0.92018    0.51962  -1.771 0.076616
## age_group30-34                            -0.89242    0.50591  -1.764 0.077774
## age_group35-39                            -1.59287    0.48752  -3.267 0.001090
## age_group40-44                            -2.62864    0.48564  -5.413 6.39e-08
## age_group45-49                            -2.84255    0.49687  -5.721 1.10e-08
## age_group50-54                            -3.27784    0.48195  -6.801 1.11e-11
## age_group55-59                            -4.24987    0.47404  -8.965  < 2e-16
## age_group60-64                            -4.26404    0.45671  -9.336  < 2e-16
## age_group65-69                            -5.25062    0.44662 -11.756  < 2e-16
## age_group70-74                            -5.71106    0.45439 -12.569  < 2e-16
## age_group75-79                            -5.90758    0.47018 -12.565  < 2e-16
## age_group80+                              -6.49946    0.47359 -13.724  < 2e-16
## income$15,000 to <$25,000                 -1.63574    0.46999  -3.480 0.000503
## income$25,000 to <$35,000                 -2.07457    0.44862  -4.624 3.82e-06
## income$35,000 to <$50,000                 -2.54551    0.43915  -5.796 7.03e-09
## income$50,000 to <$75,000                 -3.04298    0.41157  -7.394 1.57e-13
## income$75,000 to <$100,000                -3.50972    0.42999  -8.162 3.79e-16
## income$100,000 or more                    -3.84047    0.50103  -7.665 2.00e-14
## educationHigh school diploma or GED        0.12556    0.81238   0.155 0.877177
## educationSome college or technical school  1.11789    0.69123   1.617 0.105865
## educationCollege graduate                  1.81788    0.69283   2.624 0.008711
## educationGraduate or professional degree   1.66905    0.69428   2.404 0.016240
## sexMale:smoker_currentCurrent smoker      -1.28327    0.51705  -2.482 0.013089
##                                              
## (Intercept)                               ***
## physhlth_days                             ***
## bmi                                       *  
## sexMale                                   ***
## smoker_currentCurrent smoker              ***
## exerciseYes                               ** 
## age_group25-29                            .  
## age_group30-34                            .  
## age_group35-39                            ** 
## age_group40-44                            ***
## age_group45-49                            ***
## age_group50-54                            ***
## age_group55-59                            ***
## age_group60-64                            ***
## age_group65-69                            ***
## age_group70-74                            ***
## age_group75-79                            ***
## age_group80+                              ***
## income$15,000 to <$25,000                 ***
## income$25,000 to <$35,000                 ***
## income$35,000 to <$50,000                 ***
## income$50,000 to <$75,000                 ***
## income$75,000 to <$100,000                ***
## income$100,000 or more                    ***
## educationHigh school diploma or GED          
## educationSome college or technical school    
## educationCollege graduate                 ** 
## educationGraduate or professional degree  *  
## sexMale:smoker_currentCurrent smoker      *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.363 on 7971 degrees of freedom
## Multiple R-squared:  0.1826, Adjusted R-squared:  0.1798 
## F-statistic: 63.61 on 28 and 7971 DF,  p-value: < 2.2e-16

tidy(mod_B, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Model B: Coefficients for Sex, Smoking, and Interaction",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Model B: Coefficients for Sex, Smoking, and Interaction
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	8.0849	0.9294	8.6988	0.0000	6.2630	9.9068
physhlth_days	0.2686	0.0101	26.6788	0.0000	0.2489	0.2883
bmi	0.0331	0.0132	2.5017	0.0124	0.0072	0.0590
sexMale	-1.1855	0.1775	-6.6784	0.0000	-1.5335	-0.8376
smoker_currentCurrent smoker	2.8041	0.3841	7.2999	0.0000	2.0511	3.5570
exerciseYes	-0.6728	0.2150	-3.1301	0.0018	-1.0942	-0.2515
age_group25-29	-0.9202	0.5196	-1.7709	0.0766	-1.9388	0.0984
age_group30-34	-0.8924	0.5059	-1.7640	0.0778	-1.8841	0.0993
age_group35-39	-1.5929	0.4875	-3.2673	0.0011	-2.5485	-0.6372
age_group40-44	-2.6286	0.4856	-5.4127	0.0000	-3.5806	-1.6766
age_group45-49	-2.8425	0.4969	-5.7209	0.0000	-3.8165	-1.8686
age_group50-54	-3.2778	0.4820	-6.8012	0.0000	-4.2226	-2.3331
age_group55-59	-4.2499	0.4740	-8.9652	0.0000	-5.1791	-3.3206
age_group60-64	-4.2640	0.4567	-9.3364	0.0000	-5.1593	-3.3688
age_group65-69	-5.2506	0.4466	-11.7563	0.0000	-6.1261	-4.3751
age_group70-74	-5.7111	0.4544	-12.5686	0.0000	-6.6018	-4.8203
age_group75-79	-5.9076	0.4702	-12.5646	0.0000	-6.8292	-4.9859
age_group80+	-6.4995	0.4736	-13.7239	0.0000	-7.4278	-5.5711
income$15,000 to <$25,000	-1.6357	0.4700	-3.4804	0.0005	-2.5570	-0.7144
income$25,000 to <$35,000	-2.0746	0.4486	-4.6243	0.0000	-2.9540	-1.1952
income$35,000 to <$50,000	-2.5455	0.4392	-5.7964	0.0000	-3.4064	-1.6847
income$50,000 to <$75,000	-3.0430	0.4116	-7.3935	0.0000	-3.8498	-2.2362
income$75,000 to <$100,000	-3.5097	0.4300	-8.1623	0.0000	-4.3526	-2.6668
income$100,000 or more	-3.8405	0.5010	-7.6651	0.0000	-4.8226	-2.8583
educationHigh school diploma or GED	0.1256	0.8124	0.1546	0.8772	-1.4669	1.7180
educationSome college or technical school	1.1179	0.6912	1.6172	0.1059	-0.2371	2.4729
educationCollege graduate	1.8179	0.6928	2.6239	0.0087	0.4598	3.1760
educationGraduate or professional degree	1.6691	0.6943	2.4040	0.0162	0.3081	3.0300
sexMale:smoker_currentCurrent smoker	-1.2833	0.5171	-2.4819	0.0131	-2.2968	-0.2697

#Step 3: Test whether the interaction is statistically significant:

18.Use anova(model_A, model_B) to compare the two models.

anova(mod_A, mod_B) |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Test for INTERACTION: Does smoking's effect differ by sex??") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Test for INTERACTION: Does smoking’s effect differ by sex??
term	df.residual	rss	df	sumsq	statistic	p.value
menthlth_days ~ physhlth_days + bmi + sex + smoker_current + exercise + age_group + income + education	7972	432508.9	NA	NA	NA	NA
menthlth_days ~ physhlth_days + bmi + sex + smoker_current + exercise + age_group + income + education + sex:smoker_current	7971	432174.9	1	333.9749	6.1598	0.0131

Interpretation: The effect of smoking on mentally unhealthy days differs by sex. The association between smoking status and mental health is not the same for men and women as we can see the F = 6.16 with p = 0.0131 < 0.05.

19.State H0 and HA for this test.

\[H_0: \beta_2 = \beta_3 = 0 \quad \text{(the two lines are identical)}\] \[H_1: \text{At least one } \neq 0 \quad \text{(the lines differ in intercept and/or slope)}\] Hypotheses: H0: The interaction coefficient = 0 (smoking effect is SAME for men and women)

H1: The interaction coefficient ≠ 0 (smoking effect DIFFERS by sex)

20.Report the F-statistic, degrees of freedom, and p-value.

F‑statistic: 6.16
df: 1
p‑value: < 2.22e‑16

21.State your conclusion at α = 0.05.

Because the p‑value is below α = 0.05, we reject the null hypothesis that the interaction coefficient equals zero. This means there is statistically significant evidence that the association between smoking status and mentally unhealthy days differs for men and women.

#Step 4: Interpret the interaction using the coefficients from Model B:

# Extract model B  coefficients

b_coef  <- coef(mod_B)

b_sexFemale <- b_coef ["sexFemale"]
b_smokingcurrent <- b_coef ["smoker_currentCurrent smoker"]
b_interaction <- b_coef["sexMale:smoker_currentCurrent smoker"]

#Reference group: Male non-smoker

# Effect of smoking for MALES (the reference sex)
effect_smokingMale <- b_smokingcurrent
cat("Effect of smoking for Males:", round(effect_smokingMale, 4), "\n")

## Effect of smoking for Males: 2.8041

# Effect of smoking for FEMALES
# main effect of smoking + interaction term

effect_smokingFemale <- b_smokingcurrent + b_interaction
cat("Effect of smoking for Females:", round(effect_smokingFemale, 4), "\n")

## Effect of smoking for Females: 1.5208

# How much is smoking's effect for males vs females?

difference <- effect_smokingMale - effect_smokingFemale
cat("Difference (Male - Female):", round(difference, 4), "\n")

## Difference (Male - Female): 1.2833

22.Report the estimated effect of current smoking on mentally unhealthy days among men (the coefficient for smoker_current). Interpret in plain language.

Among men, the estimated effect of current smoking on mentally unhealthy days is 2.8041 additional days per month. This means that, holding all other variables constant, male current smokers report about 2.8 more mentally unhealthy days than male non‑smokers. This coefficient represents the smoking effect specifically for men because men are the reference category for sex.

23.Compute and report the estimated effect among women (smoker_current coefficient plus the interaction coefficient). Interpret in plain language.

Among women, the estimated effect of current smoking is 1.52 additional mentally unhealthy days per month. This estimate combines the main smoking coefficient and the sex × smoking interaction term, meaning female current smokers report about 1.5 more mentally unhealthy days than female non‑smokers, adjusting for all other predictors.

24.In 2 to 3 sentences, explain what these estimates mean together: is smoking more strongly associated with poor mental health in one sex than the other, and by how much?

The estimates show that smoking is associated with worse mental health for both men and women, but the association is stronger for men. Male smokers experience about 1.3 more mentally unhealthy days from smoking than female smokers (2.8 − 1.52). This difference reflects the statistically significant interaction: the mental health impact of smoking is not the same across sexes, with men showing a larger increase in mentally unhealthy days when they smoke.

#Step 5: Visualize the interaction using ggeffects::ggpredict(model_B, terms = c(“smoker_current”, “sex”)). The plot should show predicted mentally unhealthy days for each combination of smoking status and sex, with labeled axes, a descriptive title, and a legend identifying men and women.

pred_menthlth <- ggpredict(mod_B, terms = c("smoker_current", "sex"))

ggplot(pred_menthlth, aes(x = x, y = predicted, color = group, fill = group, group = group)) +
  geom_point(size = 4) +
  geom_line(linewidth = 1.1) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.1, color = NA) +
  labs(
    title    = "Effect of Smoking on Mental Health by Sex",
    subtitle = "Predicted mentally unhealthy days with 95% confidence intervals",
    x        = "Smoking Status",
    y        = "Predicted Mentally Unhealthy Days",
    color    = "Sex",
    fill     = "Sex"
  ) +
  theme_minimal(base_size = 13) +
  scale_color_brewer(palette = "Set2")

Interaction: Sleep Effect by Education Level

Step 6: Write 3 to 5 sentences interpreting the interaction for a general public health audience. Do not include coefficient values or p-values. Focus on the substantive finding and its implications.

Both men and women who currently smoke report more mentally unhealthy days than those who do not smoke, suggesting that smoking is broadly linked to poorer mental health regardless of sex. However, this association appears to be stronger among men than women. This pattern of effect modification is important for public health because it suggests that smoking cessation interventions may need to consider sex-specific mental health impacts rather than assuming the relationship is uniform across all groups. For men in particular, smoking may be related with mental distress in a way that makes addressing both simultaneously especially important.

#Part 4: Reflection (15 Points) Write a focused reflection of 250 to 400 words in continuous prose (not as a bulleted list) addressing all three topics below.

A. Findings and Limitations (6 points) What do the results suggest about the determinants of mental health among U.S. adults? Which predictors were most strongly associated? Which were not statistically significant? What are the key limitations of using cross-sectional survey data? Name at least two specific confounders that could bias the associations you estimated. B. From Simple to Multiple Regression (4 points) What does Adjusted R-squared tell you that regular R-squared does not, and why is this distinction especially important when adding multiple categorical predictors? Why is it useful to test groups of predictors (income, education) collectively with chunk tests, rather than relying only on the individual t-test for each level? C. AI Transparency (5 points) Describe which parts of the assignment (if any) you sought AI assistance for, what specific prompts you used, and how you verified the AI-assisted output was correct. If you did not use AI, state that explicitly.

The results of this analysis suggest that mental health among U.S. adults is shaped by a wide range of individual, behavioral, and socioeconomic factors simultaneously. Physical health days was by far the strongest predictor, which means that people experiencing frequent physical illness are likely also experiencing mental illness and the two dimensions of health are connected. Age showed a strong negative association, with older adults reporting fewer mentally unhealthy days than the youngest group and it could be because of the greater emotional regulation or support available or it also could be that people with severe mental illness are less likely to survive to older age. Income was also a strong predictor, with higher-earning adults consistently reporting fewer mentally unhealthy days. It shows the link between financial security and psychological well-being. Smoking, sex, and exercise also contributed meaningfully but BMI had a very small association and education showed a somewhat reverse pattern, with higher education linked to slightly more mentally unhealthy days in some categories, possibly because more educated respondents may be more willing to report distress.

A key limitation of this analysis is that the BRFSS is cross-sectional, meaning all variables are measured at a single point in time. This makes it impossible to establish whether smoking causes worse mental health or whether people with worse mental health are more likely to smoke so, reverse causation is a real concern. Two specific confounders are history of trauma or adverse childhood experiences (ACEs) who may bring long-term coping behaviors like smoking and they also have worse mental health, and it could effect of smoking modified and socioeconomic status (lower income or unstable employment can increase both smoking and mental distress).

Regular R-squared always increases when you add more predictors to a model, even if those predictors are useless noise, because each new variable absorbs at least a tiny amount of residual variance by chance. Adjusted R-squared corrects this by penalizing the statistic for each additional predictor added, so it only increases when a new variable improves the model more than would be expected by random chance alone. This distinction is important because adding multi-level categorical variables like age group, income, and education each contribute many degrees of freedom at once. Age alone adds 12 indicator variables and without the penalty, R-squared could look high simply because so many predictors are included. Chunk tests address the similar issue: when a categorical variable like income is split into six dummy variables, each individual t-test only tells you whether one specific income level differs from the reference category, holding everything else constant. The chunk test asks a broader and more meaningful question : do the income categories as a whole explain significant variation in the outcome beyond what all other variables already explain? This is the right question when we are deciding whether to keep an entire variable in the model, rather than making decisions one coefficient at a time.

C. AI Transparency

I used search tools including AI mode of google, couple of times during this assignment, always to troubleshoot specific problems rather than to generate analysis or interpretation.

The first time was during variable recoding in the data setup section. After recoding, I noticed that values like 77, 88, and 99 were still appearing in my summary output, which confused me because I thought I had handled them. I searched Google to understand why, and eventually realized that I was looking at the wrong columns — my dataset still had the original 9 variables alongside the new recoded ones, and I was reading the old columns by mistake. This taught me to be more careful about checking which version of a variable I am actually looking at. The second instance was when I was writing the fitted regression equation in Part 1 Step 2. Because there were so many coefficients across age, income, education, and smoking categories, I searched to learn how to represent groups of coefficients more compactly in a equation. I found that using the summation notation with a subscript was the standard way to handle this, and I applied that to keep the equation readable without listing every single term. The third was in Part 3 Step 5, when my interaction plot was showing only two isolated points per group with no lines connecting them. I asked GitHub Copilot why this was happening and it suggested adding group = group inside the aes() call, which tells ggplot which points belong to the same line. I added it, the lines appeared, and I confirmed visually that the plot now correctly showed the interaction between smoking status and sex.

A smaller thing I also searched was why backtick notation was no longer needed for my variable names after loading the data. I found that janitor::clean_names() had already converted all names to small case, removing the need for backticks and underscore entirely, which I confirmed by checking the column names in the cleaned dataset. In every case, I verified the suggestion worked by checking the output directly in R and making sure it matched what I expected based on my understanding of the data and the assignment.

##End of homework 3 #Epi553

Factors associated with Poor Mental health and does smoking effect differ between men and women

A Secondary analysis of the 2023 BRFSS data

Suruchi Shahi

2026-04-05

Introduction

Research Question

Setup and Data

Loading BRFSS 2023 Data