#Research Question What individual, behavioral, and health factors are associated with physical health burden among U.S. adults? How do findings compare when the outcome is modeled as the number of physically unhealthy days (continuous) versus as an indicator of frequent physical distress (binary: 14 or more days in the past 30)?
#dataset
About the BRFSS
The Behavioral Risk Factor Surveillance System (BRFSS) is a nationwide telephone survey conducted annually by state health departments in collaboration with the CDC.The 2023 data contain approximately 433,000 respondents across the United States and cover a broad range of health behaviors, chronic conditions, and sociodemographic characteristics.
library(tidyverse)
library(haven)
library(here)
library(knitr)
library(kableExtra)
library(plotly)
library(broom)
library(ggeffects)
library(gtsummary)
library (GGally)
library (car)
library (naniar)
library (flextable)
library (dplyr)
library(leaps)
library(pROC)
library(ResourceSelection)We will use the Behavioral Risk Factor Surveillance System (BRFSS) 2023 data.
# Load raw BRFSS 2023 data
brfss_raw <- read_xpt(
"C:/Users/suruc/OneDrive/Desktop/R/EPI553_Rclass/LLCP2023.XPT"
) %>%
janitor::clean_names()
# Examine the raw data
dim(brfss_raw)## [1] 433323 350
## [1] "state" "fmonth" "idate" "imonth" "iday" "iyear"
## [7] "dispcode" "seqno" "psu" "ctelenm1" "pvtresd1" "colghous"
## [13] "statere1" "celphon1" "ladult1" "numadult" "respslc1" "landsex2"
## [19] "lndsxbrt" "safetime" "ctelnum1" "cellfon5" "cadult1" "cellsex2"
## [25] "celsxbrt" "pvtresd3" "cclghous" "cstate1" "landline" "hhadult"
## [31] "sexvar" "genhlth" "physhlth" "menthlth" "poorhlth" "primins1"
## [37] "persdoc3" "medcost1" "checkup1" "exerany2" "exract12" "exeroft1"
## [43] "exerhmm1" "exract22" "exeroft2" "exerhmm2" "strength" "bphigh6"
## [49] "bpmeds1" "cholchk3" "toldhi3" "cholmed3" "cvdinfr4" "cvdcrhd4"
## [55] "cvdstrk3" "asthma3" "asthnow" "chcscnc1" "chcocnc1" "chccopd3"
## [61] "addepev3" "chckdny2" "havarth4" "diabete4" "diabage4" "marital"
## [67] "educa" "renthom1" "numhhol4" "numphon4" "cpdemo1c" "veteran3"
## [73] "employ1" "children" "income3" "pregnant" "weight2" "height3"
## [79] "deaf" "blind" "decide" "diffwalk" "diffdres" "diffalon"
## [85] "fall12mn" "fallinj5" "smoke100" "smokday2" "usenow3" "ecignow2"
## [91] "alcday4" "avedrnk3" "drnk3ge5" "maxdrnks" "flushot7" "flshtmy3"
## [97] "pneuvac4" "shingle2" "hivtst7" "hivtstd3" "seatbelt" "drnkdri2"
## [103] "covidpo1" "covidsm1" "covidact" "pdiabts1" "prediab2" "diabtype"
## [109] "insulin1" "chkhemo3" "eyeexam1" "diabeye1" "diabedu1" "feetsore"
## [115] "arthexer" "arthedu" "lmtjoin3" "arthdis2" "joinpai2" "lcsfirst"
## [121] "lcslast" "lcsnumcg" "lcsctsc1" "lcsscncr" "lcsctwhn" "hadmam"
## [127] "howlong" "cervscrn" "crvclcnc" "crvclpap" "crvclhpv" "hadhyst2"
## [133] "psatest1" "psatime1" "pcpsars2" "psasugs1" "pcstalk2" "hadsigm4"
## [139] "colnsigm" "colntes1" "sigmtes1" "lastsig4" "colncncr" "vircolo1"
## [145] "vclntes2" "smalstol" "stoltest" "stooldn2" "bldstfit" "sdnates1"
## [151] "cncrdiff" "cncrage" "cncrtyp2" "csrvtrt3" "csrvdoc1" "csrvsum"
## [157] "csrvrtrn" "csrvinst" "csrvinsr" "csrvdein" "csrvclin" "csrvpain"
## [163] "csrvctl2" "indortan" "numburn3" "sunprtct" "wkdayout" "wkendout"
## [169] "cimemlo1" "cdworry" "cddiscu1" "cdhous1" "cdsocia1" "caregiv1"
## [175] "crgvrel4" "crgvlng1" "crgvhrs1" "crgvprb3" "crgvalzd" "crgvper1"
## [181] "crgvhou1" "crgvexpt" "lastsmk2" "stopsmk2" "mentcigs" "mentecig"
## [187] "heattbco" "firearm5" "gunload" "loadulk2" "hasymp1" "hasymp2"
## [193] "hasymp3" "hasymp4" "hasymp5" "hasymp6" "strsymp1" "strsymp2"
## [199] "strsymp3" "strsymp4" "strsymp5" "strsymp6" "firstaid" "aspirin"
## [205] "birthsex" "somale" "sofemale" "trnsgndr" "marijan1" "marjsmok"
## [211] "marjeat" "marjvape" "marjdab" "marjothr" "usemrjn4" "acedeprs"
## [217] "acedrink" "acedrugs" "aceprisn" "acedivrc" "acepunch" "acehurt1"
## [223] "aceswear" "acetouch" "acetthem" "acehvsex" "aceadsaf" "aceadned"
## [229] "imfvpla4" "hpvadvc4" "hpvadsht" "tetanus1" "covidva1" "covacge1"
## [235] "covidnu2" "lsatisfy" "emtsuprt" "sdlonely" "sdhemply" "foodstmp"
## [241] "sdhfood1" "sdhbills" "sdhutils" "sdhtrnsp" "sdhstre1" "rrclass3"
## [247] "rrcognt2" "rrtreat" "rratwrk2" "rrhcare4" "rrphysm2" "rcsgend1"
## [253] "rcsxbrth" "rcsrltn2" "casthdx2" "casthno2" "qstver" "qstlang"
## [259] "metstat" "urbstat" "mscode" "ststr" "strwt" "rawrake"
## [265] "wt2rake" "imprace" "chispnc" "crace1" "cageg" "cllcpwt"
## [271] "dualuse" "dualcor" "llcpwt2" "llcpwt" "rfhlth" "phys14d"
## [277] "ment14d" "hlthpl1" "hcvu653" "totinda" "metvl12" "metvl22"
## [283] "maxvo21" "fc601" "actin13" "actin23" "padur1" "padur2"
## [289] "pafreq1" "pafreq2" "minac12" "minac22" "strfreq" "pamiss3"
## [295] "pamin13" "pamin23" "pa3min" "pavig13" "pavig23" "pa3vigm"
## [301] "pacat3" "paindx3" "pa150r4" "pa300r4" "pa30023" "pastrng"
## [307] "parec3" "pastae3" "rfhype6" "cholch3" "rfchol3" "michd"
## [313] "ltasth1" "casthm1" "asthms1" "drdxar2" "mrace1" "hispanc"
## [319] "race" "raceg21" "racegr3" "raceprv" "sex" "ageg5yr"
## [325] "age65yr" "age80" "age_g" "htin4" "htm4" "wtkg3"
## [331] "bmi5" "bmi5cat" "rfbmi5" "chldcnt" "educag" "incomg1"
## [337] "smoker3" "rfsmok3" "cureci2" "drnkany6" "drocdy4" "rfbing6"
## [343] "drnkwk2" "rfdrhv8" "flshot7" "pneumo3" "aidtst4" "rfseat2"
## [349] "rfseat3" "drnkdrv"
# Select 9 variables of interest
brfss_hw4 <- brfss_raw %>%
select(
menthlth, # outcome: mentally unhealthy days
physhlth, # physical health days
bmi5, # BMI (with backticks because of underscore)
sexvar, # sex
exerany2, # any exercise past 30 days
addepev3, #ever told depressive disorder
ageg5yr, # age group
incomg1, # annual household income
educa, # education level
smoker3, # smoking status
genhlth, #general health status
marital #marital status
)
# Check
dim(brfss_hw4) # 433,000 × 9## [1] 433323 12
## # A tibble: 6 × 12
## menthlth physhlth bmi5 sexvar exerany2 addepev3 ageg5yr incomg1 educa smoker3
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 88 88 3047 2 2 2 13 9 5 4
## 2 88 88 2856 2 1 1 13 9 5 4
## 3 2 6 2231 2 1 2 13 1 4 3
## 4 88 2 2744 2 1 1 12 9 5 4
## 5 88 88 2585 2 1 1 12 5 5 4
## 6 3 2 3018 2 1 2 9 5 5 4
## # ℹ 2 more variables: genhlth <dbl>, marital <dbl>
# Recode variables
brfss_clean <- brfss_hw4 %>%
mutate(
# OUTCOME
physhlth_days = case_when(
physhlth == 88 ~ 0, # 88 = none = 0 days
physhlth %in% c(77, 99) ~ NA_real_, # don't know and refused means NA
physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth), # valid values
TRUE ~ NA_real_
),
# CONTINUOUS PREDICTORS
menthlth_days = case_when(
menthlth == 88 ~ 0, # none means 0 days
menthlth %in% c(77, 99) ~ NA_real_, # don't know and refused means NA
menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth), # valid values
TRUE ~ NA_real_
),
bmi = case_when(
bmi5 == 9999 ~ NA_real_,
bmi5 > 0 ~ bmi5 / 100, # convert to actual BMI
TRUE ~ NA_real_
),
# BINARY PREDICTORS
sex = factor(ifelse(sexvar == 1, "Male", "Female")),
exercise = factor(
case_when(
exerany2 == 1 ~ "Yes",
exerany2 == 2 ~ "No",
exerany2 %in% c(7, 9) ~ NA_character_,
TRUE ~ NA_character_
),
levels = c("No", "Yes") # making no as reference
),
depression = factor(
case_when(
addepev3 == 1 ~ "Yes",
addepev3 == 2 ~ "No",
addepev3 %in% c(7, 9) ~ NA_character_,
TRUE ~ NA_character_
),
levels = c("No", "Yes") # making no as reference
),
# CATEGORICAL PREDICTORS
age_group = factor(
case_when(
ageg5yr %in% 1:3 ~ "18-34",
ageg5yr %in% 4:6 ~ "35-49",
ageg5yr %in% 7:9 ~ "50-64",
ageg5yr %in% 10:13 ~ "65+",
ageg5yr == 14 ~ NA_character_,
TRUE ~ NA_character_
),
levels = c("18-34", "35-49", "50-64", "65+") # making "18-34" reference
),
income = factor(
case_when(
incomg1 %in% 1:2 ~ "Less than $25K",
incomg1 %in% 3:4 ~ "$25K-$49K",
incomg1 %in% 5:6 ~ "$50K-$99K",
incomg1 == 7 ~ "$100K+",
incomg1 == 9 ~ NA_character_,
TRUE ~ NA_character_
),
levels = c("Less than $25K", "$25K-$49K", "$50K-$99K", "$100K+") # making "Less than $25K" reference
),
education = factor(
case_when(
educa %in% 1:3 ~ "Less than HS",
educa == 4 ~ "HS/GED",
educa == 5 ~ "Some college",
educa == 6 ~ "College graduate",
educa == 9 ~ NA_character_,
TRUE ~ NA_character_
),
levels = c("Less than HS", "HS/GED", "Some college", "College graduate")
),
smoking = factor(
case_when(
smoker3 %in% 1:2 ~ "Current smoker",
smoker3 == 3 ~ "Former smoker",
smoker3 == 4 ~ "Never smoker",
smoker3 == 9 ~ NA_character_,
TRUE ~ NA_character_
),
levels = c("Never smoker", "Former smoker", "Current smoker") # making "Never smoker" reference
),
gen_health = factor(
case_when(
genhlth %in% 1:2 ~ "Excellent/Very Good",
genhlth == 3 ~ "Good",
genhlth %in% 4:5 ~ "Fair/Poor",
TRUE ~ NA_character_
),
levels = c("Excellent/Very Good", "Good", "Fair/Poor")
),
marital = factor(
case_when(
marital %in% c(1, 6) ~ "Married/Partnered",
marital %in% 2:4 ~ "Previously married",
marital == 5 ~ "Never married",
TRUE ~ NA_character_
),
levels = c("Married/Partnered", "Previously married", "Never married")
)
)brfss_clean %>%
select(physhlth_days, menthlth_days, bmi, marital) %>%
miss_var_summary() %>%
# Rename columns for the table report
rename(
Variable = variable,
`Number Missing` = n_miss,
`Percent Missing` = pct_miss
) %>%
# Create the formatted table
flextable() %>%
set_caption(caption = "Table: Missing Before Exclusions")Variable | Number Missing | Percent Missing |
|---|---|---|
bmi | 40,535 | 9.35 |
physhlth_days | 10,785 | 2.49 |
menthlth_days | 8,108 | 1.87 |
marital | 4,289 | 0.990 |
# Select recoded variables, apply filters, drop missing, take sample
set.seed(1220) # CRITICAL: reproducibility
# Same seed = same random sample across all computers
brfss_analytic <- brfss_clean %>%
drop_na() %>%
# drop_na removes any row with NA in ANY of these columns
slice_sample(n = 8000) # Randomly select 8,000 rows
# Save analytic dataset
saveRDS(brfss_clean,("C:/Users/suruc/OneDrive/Desktop/R/EPI553_Rclass/brfss_hw4_clean.rds"
))
# Check the recoded data
head(brfss_hw4)## # A tibble: 6 × 12
## menthlth physhlth bmi5 sexvar exerany2 addepev3 ageg5yr incomg1 educa smoker3
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 88 88 3047 2 2 2 13 9 5 4
## 2 88 88 2856 2 1 1 13 9 5 4
## 3 2 6 2231 2 1 2 13 1 4 3
## 4 88 2 2744 2 1 1 12 9 5 4
## 5 88 88 2585 2 1 1 12 5 5 4
## 6 3 2 3018 2 1 2 9 5 5 4
## # ℹ 2 more variables: genhlth <dbl>, marital <dbl>
## Rows: 433,323
## Columns: 12
## $ menthlth <dbl> 88, 88, 2, 88, 88, 3, 77, 88, 88, 88, 88, 88, 88, 88, 88, 88,…
## $ physhlth <dbl> 88, 88, 6, 2, 88, 2, 8, 1, 5, 88, 88, 2, 88, 88, 88, 4, 30, 1…
## $ bmi5 <dbl> 3047, 2856, 2231, 2744, 2585, 3018, 2441, 2727, 3347, 2296, 4…
## $ sexvar <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 1, 2, 1, 2…
## $ exerany2 <dbl> 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 1, 2, 2, 1, 7, 2, 2, 1, 1…
## $ addepev3 <dbl> 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ ageg5yr <dbl> 13, 13, 13, 12, 12, 9, 13, 12, 13, 12, 8, 12, 10, 12, 12, 13,…
## $ incomg1 <dbl> 9, 9, 1, 9, 5, 5, 4, 9, 4, 5, 5, 9, 6, 6, 6, 2, 9, 9, 3, 4, 3…
## $ educa <dbl> 5, 5, 4, 5, 5, 5, 4, 5, 5, 4, 5, 4, 6, 5, 5, 4, 6, 6, 5, 4, 4…
## $ smoker3 <dbl> 4, 4, 3, 4, 4, 4, 3, 4, 4, 3, 4, 4, 3, 3, 3, 4, 4, 4, 1, 3, 3…
## $ genhlth <dbl> 2, 2, 4, 2, 4, 3, 4, 4, 3, 3, 3, 2, 2, 3, 3, 4, 5, 3, 3, 3, 3…
## $ marital <dbl> 1, 2, 3, 1, 3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 3, 3, 1, 3, 3, 2…
## # A tibble: 6 × 23
## menthlth physhlth bmi5 sexvar exerany2 addepev3 ageg5yr incomg1 educa smoker3
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 88 88 3047 2 2 2 13 9 5 4
## 2 88 88 2856 2 1 1 13 9 5 4
## 3 2 6 2231 2 1 2 13 1 4 3
## 4 88 2 2744 2 1 1 12 9 5 4
## 5 88 88 2585 2 1 1 12 5 5 4
## 6 3 2 3018 2 1 2 9 5 5 4
## # ℹ 13 more variables: genhlth <dbl>, marital <fct>, physhlth_days <dbl>,
## # menthlth_days <dbl>, bmi <dbl>, sex <fct>, exercise <fct>,
## # depression <fct>, age_group <fct>, income <fct>, education <fct>,
## # smoking <fct>, gen_health <fct>
## Rows: 433,323
## Columns: 23
## $ menthlth <dbl> 88, 88, 2, 88, 88, 3, 77, 88, 88, 88, 88, 88, 88, 88, 88…
## $ physhlth <dbl> 88, 88, 6, 2, 88, 2, 8, 1, 5, 88, 88, 2, 88, 88, 88, 4, …
## $ bmi5 <dbl> 3047, 2856, 2231, 2744, 2585, 3018, 2441, 2727, 3347, 22…
## $ sexvar <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 1, 2,…
## $ exerany2 <dbl> 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 1, 2, 2, 1, 7, 2, 2,…
## $ addepev3 <dbl> 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ ageg5yr <dbl> 13, 13, 13, 12, 12, 9, 13, 12, 13, 12, 8, 12, 10, 12, 12…
## $ incomg1 <dbl> 9, 9, 1, 9, 5, 5, 4, 9, 4, 5, 5, 9, 6, 6, 6, 2, 9, 9, 3,…
## $ educa <dbl> 5, 5, 4, 5, 5, 5, 4, 5, 5, 4, 5, 4, 6, 5, 5, 4, 6, 6, 5,…
## $ smoker3 <dbl> 4, 4, 3, 4, 4, 4, 3, 4, 4, 3, 4, 4, 3, 3, 3, 4, 4, 4, 1,…
## $ genhlth <dbl> 2, 2, 4, 2, 4, 3, 4, 4, 3, 3, 3, 2, 2, 3, 3, 4, 5, 3, 3,…
## $ marital <fct> Married/Partnered, Previously married, Previously marrie…
## $ physhlth_days <dbl> 0, 0, 6, 2, 0, 2, 8, 1, 5, 0, 0, 2, 0, 0, 0, 4, 30, 15, …
## $ menthlth_days <dbl> 0, 0, 2, 0, 0, 3, NA, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10, 0, …
## $ bmi <dbl> 30.47, 28.56, 22.31, 27.44, 25.85, 30.18, 24.41, 27.27, …
## $ sex <fct> Female, Female, Female, Female, Female, Female, Male, Fe…
## $ exercise <fct> No, Yes, Yes, Yes, Yes, Yes, No, No, No, Yes, Yes, No, Y…
## $ depression <fct> No, Yes, No, Yes, Yes, No, No, No, No, No, No, No, No, N…
## $ age_group <fct> 65+, 65+, 65+, 65+, 65+, 50-64, 65+, 65+, 65+, 65+, 50-6…
## $ income <fct> NA, NA, Less than $25K, NA, $50K-$99K, $50K-$99K, $25K-$…
## $ education <fct> Some college, Some college, HS/GED, Some college, Some c…
## $ smoking <fct> Never smoker, Never smoker, Former smoker, Never smoker,…
## $ gen_health <fct> Excellent/Very Good, Excellent/Very Good, Fair/Poor, Exc…
## # A tibble: 6 × 23
## menthlth physhlth bmi5 sexvar exerany2 addepev3 ageg5yr incomg1 educa smoker3
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 88 88 2873 1 1 2 8 5 5 3
## 2 2 88 2889 1 2 2 6 5 6 4
## 3 88 88 4003 2 2 2 10 4 5 3
## 4 15 5 2746 2 1 1 8 1 5 3
## 5 5 88 4659 2 2 2 8 5 6 4
## 6 88 1 2597 1 1 2 10 5 5 4
## # ℹ 13 more variables: genhlth <dbl>, marital <fct>, physhlth_days <dbl>,
## # menthlth_days <dbl>, bmi <dbl>, sex <fct>, exercise <fct>,
## # depression <fct>, age_group <fct>, income <fct>, education <fct>,
## # smoking <fct>, gen_health <fct>
## Rows: 8,000
## Columns: 23
## $ menthlth <dbl> 88, 2, 88, 15, 5, 88, 88, 20, 88, 15, 88, 88, 2, 88, 88,…
## $ physhlth <dbl> 88, 88, 88, 5, 88, 1, 88, 30, 20, 7, 30, 88, 88, 88, 88,…
## $ bmi5 <dbl> 2873, 2889, 4003, 2746, 4659, 2597, 3255, 2663, 3948, 43…
## $ sexvar <dbl> 1, 1, 2, 2, 2, 1, 1, 2, 2, 1, 2, 1, 1, 2, 2, 1, 1, 1, 1,…
## $ exerany2 <dbl> 1, 2, 2, 1, 2, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,…
## $ addepev3 <dbl> 2, 2, 2, 1, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ ageg5yr <dbl> 8, 6, 10, 8, 8, 10, 3, 8, 10, 8, 10, 9, 7, 6, 6, 7, 10, …
## $ incomg1 <dbl> 5, 5, 4, 1, 5, 5, 4, 2, 5, 2, 3, 5, 7, 6, 5, 6, 2, 7, 5,…
## $ educa <dbl> 5, 6, 5, 5, 6, 5, 6, 5, 6, 5, 4, 4, 6, 5, 6, 6, 6, 6, 6,…
## $ smoker3 <dbl> 3, 4, 3, 3, 4, 4, 4, 2, 4, 4, 3, 4, 4, 4, 4, 3, 3, 3, 4,…
## $ genhlth <dbl> 2, 3, 5, 4, 4, 2, 1, 3, 3, 3, 4, 3, 2, 2, 2, 4, 2, 1, 2,…
## $ marital <fct> Married/Partnered, Married/Partnered, Married/Partnered,…
## $ physhlth_days <dbl> 0, 0, 0, 5, 0, 1, 0, 30, 20, 7, 30, 0, 0, 0, 0, 30, 0, 0…
## $ menthlth_days <dbl> 0, 2, 0, 15, 5, 0, 0, 20, 0, 15, 0, 0, 2, 0, 0, 4, 30, 0…
## $ bmi <dbl> 28.73, 28.89, 40.03, 27.46, 46.59, 25.97, 32.55, 26.63, …
## $ sex <fct> Male, Male, Female, Female, Female, Male, Male, Female, …
## $ exercise <fct> Yes, No, No, Yes, No, Yes, Yes, Yes, Yes, No, No, No, Ye…
## $ depression <fct> No, No, No, Yes, No, No, No, Yes, No, Yes, No, No, No, N…
## $ age_group <fct> 50-64, 35-49, 65+, 50-64, 50-64, 65+, 18-34, 50-64, 65+,…
## $ income <fct> $50K-$99K, $50K-$99K, $25K-$49K, Less than $25K, $50K-$9…
## $ education <fct> Some college, College graduate, Some college, Some colle…
## $ smoking <fct> Former smoker, Never smoker, Former smoker, Former smoke…
## $ gen_health <fct> Excellent/Very Good, Good, Fair/Poor, Fair/Poor, Fair/Po…
tibble(
Metric = c("Observations", "Variables"),
Value = c(nrow(brfss_analytic), ncol(brfss_analytic))
) %>%
kable(caption = "Analytic Dataset Dimensions") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Metric | Value |
|---|---|
| Observations | 8000 |
| Variables | 23 |
#descriptive table
brfss_analytic %>%
select(physhlth_days, menthlth_days, bmi, sex, exercise,
depression, age_group, income, education, smoking, gen_health, marital) %>%
tbl_summary(
label = list(
physhlth_days ~ "Physical Health Days",
menthlth_days ~ "Mental Health Days",
bmi ~ "Body Mass Index",
sex ~ "Sex",
exercise ~ "Exercise (past 30 days)",
depression ~ "Depression Diagnosis",
age_group ~ "Age Group",
income ~ "Household Income",
education ~ "Education Level",
smoking ~ "Smoking Status",
gen_health ~ "General Health",
marital ~ "Marital Status"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = list(all_continuous() ~ 1),
missing = "no"
) %>%
bold_labels() %>%
modify_caption("**Descriptive Statistics: Analytic Sample (n=8,000)**")| Characteristic | N = 8,0001 |
|---|---|
| Physical Health Days | 4.3 (8.7) |
| Mental Health Days | 4.3 (8.2) |
| Body Mass Index | 28.7 (6.5) |
| Sex | |
| Female | 4,029 (50%) |
| Male | 3,971 (50%) |
| Exercise (past 30 days) | 6,157 (77%) |
| Depression Diagnosis | 1,776 (22%) |
| Age Group | |
| 18-34 | 1,294 (16%) |
| 35-49 | 1,657 (21%) |
| 50-64 | 2,109 (26%) |
| 65+ | 2,940 (37%) |
| Household Income | |
| Less than $25K | 1,090 (14%) |
| $25K-$49K | 1,931 (24%) |
| $50K-$99K | 4,297 (54%) |
| $100K+ | 682 (8.5%) |
| Education Level | |
| Less than HS | 391 (4.9%) |
| HS/GED | 1,887 (24%) |
| Some college | 2,115 (26%) |
| College graduate | 3,607 (45%) |
| Smoking Status | |
| Never smoker | 4,808 (60%) |
| Former smoker | 2,230 (28%) |
| Current smoker | 962 (12%) |
| General Health | |
| Excellent/Very Good | 3,926 (49%) |
| Good | 2,648 (33%) |
| Fair/Poor | 1,426 (18%) |
| Marital Status | |
| Married/Partnered | 4,582 (57%) |
| Previously married | 2,050 (26%) |
| Never married | 1,368 (17%) |
| 1 Mean (SD); n (%) | |
p_hist <- ggplot(brfss_analytic, aes(x = physhlth_days)) +
geom_histogram(binwidth = 1, fill = "steelblue", color = "white", alpha = 0.85) +
labs(
title = "Distribution of Physically Unhealthy Days in the Past 30 Days",
subtitle = "BRFSS 2023 Analytic Sample (n = 8,000)",
x = "Number of Physically Unhealthy Days",
y = "Count"
) +
theme_minimal(base_size = 13)
ggplotly(p_hist)Distribution of Physically Unhealthy Days (BRFSS 2023)
#Part 1: Model Selection for Linear Regression
#step1 : Fit maximum model
Before fitting any model, always visualize the bivariate relationship.
p_scatter <- ggplot(brfss_analytic, aes(x = physhlth_days, y = menthlth_days)) +
geom_point(alpha = 0.15, color = "steelblue", size = 1.2) +
geom_smooth(method = "lm", color = "red", linewidth = 1.2, se = TRUE) +
geom_smooth(method = "loess", color = "blue", linewidth = 1,
linetype = "dashed", se = FALSE) +
labs(
title = "Physically Unhealthy days vs. Mentally unhealthy days (BRFSS 2023)",
subtitle = "Red = Linear fit | Orange dashed = LOESS smoother",
x = "Physical Health (days)",
y = "Mental Health (days)"
) +
theme_minimal(base_size = 13)
ggplotly(p_scatter)BMI vs. Age — BRFSS 2020
The plot shows a clear positive association: as physically unhealthy days increase, mentally unhealthy days also tend to increase. The relationship is upward‑sloping, moderately strong, and not purely linear.
# The maximum model with all candidate predictors
mod_max <- lm(physhlth_days ~ menthlth_days + bmi + sex + exercise + depression + age_group + income + education + smoking + gen_health + marital,
data = brfss_analytic
)
summary(mod_max)##
## Call:
## lm(formula = physhlth_days ~ menthlth_days + bmi + sex + exercise +
## depression + age_group + income + education + smoking + gen_health +
## marital, data = brfss_analytic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.795 -2.839 -1.145 0.690 31.047
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.43581 0.62963 3.869 0.000110 ***
## menthlth_days 0.18362 0.01135 16.172 < 2e-16 ***
## bmi -0.01877 0.01287 -1.459 0.144638
## sexMale -0.23449 0.16389 -1.431 0.152538
## exerciseYes -1.93289 0.20408 -9.471 < 2e-16 ***
## depressionYes 0.77536 0.21528 3.602 0.000318 ***
## age_group35-49 0.76935 0.28365 2.712 0.006696 **
## age_group50-64 1.40108 0.27874 5.026 5.11e-07 ***
## age_group65+ 1.72725 0.27852 6.202 5.87e-10 ***
## income$25K-$49K -1.08440 0.27627 -3.925 8.74e-05 ***
## income$50K-$99K -1.52385 0.27607 -5.520 3.50e-08 ***
## income$100K+ -1.31580 0.39821 -3.304 0.000956 ***
## educationHS/GED 0.29371 0.40089 0.733 0.463806
## educationSome college 1.00198 0.40273 2.488 0.012867 *
## educationCollege graduate 1.05262 0.40445 2.603 0.009269 **
## smokingFormer smoker 0.43777 0.18776 2.332 0.019749 *
## smokingCurrent smoker 0.46414 0.26617 1.744 0.081241 .
## gen_healthGood 1.40982 0.18635 7.565 4.30e-14 ***
## gen_healthFair/Poor 10.06509 0.25239 39.879 < 2e-16 ***
## maritalPreviously married -0.26701 0.20679 -1.291 0.196680
## maritalNever married -0.41244 0.24899 -1.656 0.097666 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.112 on 7979 degrees of freedom
## Multiple R-squared: 0.3258, Adjusted R-squared: 0.3242
## F-statistic: 192.8 on 20 and 7979 DF, p-value: < 2.2e-16
#in table
tidy(mod_max, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Maximum Model: All Candidate Predictors",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 2.4358 | 0.6296 | 3.8686 | 0.0001 | 1.2016 | 3.6700 |
| menthlth_days | 0.1836 | 0.0114 | 16.1723 | 0.0000 | 0.1614 | 0.2059 |
| bmi | -0.0188 | 0.0129 | -1.4589 | 0.1446 | -0.0440 | 0.0065 |
| sexMale | -0.2345 | 0.1639 | -1.4308 | 0.1525 | -0.5558 | 0.0868 |
| exerciseYes | -1.9329 | 0.2041 | -9.4711 | 0.0000 | -2.3329 | -1.5328 |
| depressionYes | 0.7754 | 0.2153 | 3.6016 | 0.0003 | 0.3533 | 1.1974 |
| age_group35-49 | 0.7693 | 0.2836 | 2.7123 | 0.0067 | 0.2133 | 1.3254 |
| age_group50-64 | 1.4011 | 0.2787 | 5.0264 | 0.0000 | 0.8547 | 1.9475 |
| age_group65+ | 1.7272 | 0.2785 | 6.2016 | 0.0000 | 1.1813 | 2.2732 |
| income$25K-$49K | -1.0844 | 0.2763 | -3.9251 | 0.0001 | -1.6260 | -0.5428 |
| income$50K-$99K | -1.5238 | 0.2761 | -5.5199 | 0.0000 | -2.0650 | -0.9827 |
| income$100K+ | -1.3158 | 0.3982 | -3.3043 | 0.0010 | -2.0964 | -0.5352 |
| educationHS/GED | 0.2937 | 0.4009 | 0.7326 | 0.4638 | -0.4921 | 1.0796 |
| educationSome college | 1.0020 | 0.4027 | 2.4880 | 0.0129 | 0.2125 | 1.7914 |
| educationCollege graduate | 1.0526 | 0.4044 | 2.6026 | 0.0093 | 0.2598 | 1.8454 |
| smokingFormer smoker | 0.4378 | 0.1878 | 2.3316 | 0.0197 | 0.0697 | 0.8058 |
| smokingCurrent smoker | 0.4641 | 0.2662 | 1.7438 | 0.0812 | -0.0576 | 0.9859 |
| gen_healthGood | 1.4098 | 0.1863 | 7.5655 | 0.0000 | 1.0445 | 1.7751 |
| gen_healthFair/Poor | 10.0651 | 0.2524 | 39.8790 | 0.0000 | 9.5703 | 10.5598 |
| maritalPreviously married | -0.2670 | 0.2068 | -1.2912 | 0.1967 | -0.6724 | 0.1384 |
| maritalNever married | -0.4124 | 0.2490 | -1.6565 | 0.0977 | -0.9005 | 0.0756 |
glance(mod_max) |>
dplyr::select(r.squared, adj.r.squared, sigma, AIC, BIC, df.residual) |>
mutate(across(everything(), \(x) round(x, 3))) |>
kable(caption = "Maximum Model: Fit Statistics") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| r.squared | adj.r.squared | sigma | AIC | BIC | df.residual |
|---|---|---|---|---|---|
| 0.326 | 0.324 | 7.112 | 54115.05 | 54268.77 | 7979 |
#Interpretation F‑statistic is highly significant (p < 2e‑16), meaning the full set of predictors collectively improves prediction of physically unhealthy days. Residual SD ≈ 7.1 days on 7979 df, which is reasonable given the 0–30 scale.
Overall, mental health, exercise, depression, age, income, smoking, and general health are the major drivers of physically unhealthy days. General health and exercise have the largest effect sizes. Income shows a clear protective gradient. BMI, sex, and marital status do not meaningfully predict physically unhealthy days once other factors are controlled.
# Report model fit criteria
r2 <- summary(mod_max)$r.squared
adj_r2 <- summary(mod_max)$adj.r.squared
aic <- AIC(mod_max)
bic <- BIC(mod_max)
tibble(
Criterion = c("R-squared", "Adjusted R-squared", "AIC", "BIC"),
Value = round(c(r2, adj_r2, aic, bic), 3)
) |>
kable(
caption = "Model Fit Criteria for Maximum Linear Regression Model",
align = c("l", "r")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Criterion | Value |
|---|---|
| R-squared | 0.326 |
| Adjusted R-squared | 0.324 |
| AIC | 54115.053 |
| BIC | 54268.772 |
#Interpretation R‑squared = 0.326; Adjusted R‑squared = 0.324 The model explains about 32 percent of the variation in physically unhealthy days.For BRFSS health‑day outcomes, this is typical because health‑day measures are influenced by many unmeasured factors. The adjusted R² being close to R² indicates the model is not overfitting despite many predictors.
AIC = 54115; BIC = 54269 Lower AIC/BIC would indicate a better‑fitting model if we compare this to reduced models.This is a moderately strong model for BRFSS data, and the fit is consistent with what we expect for self‑reported health outcomes. —
#Step 2: Best Subsets Regression
best_subset <- regsubsets(
physhlth_days ~ menthlth_days + bmi + sex + exercise + depression + age_group + income + education + smoking + gen_health + marital,
data = brfss_analytic,
nvmax = 25, # generous upper bound to capture all dummy terms
really.big = TRUE
)
best_sub_summary <- summary(best_subset)# Summary table of Adjusted R-squared and BIC by model size
best_sub_df <- tibble(
`Model Size` = seq_along(best_sub_summary$adjr2),
`Adjusted R-sq` = round(best_sub_summary$adjr2, 4),
`BIC` = round(best_sub_summary$bic, 2)
) |>
mutate(
`Best Adj R-sq` = ifelse(`Adjusted R-sq` == max(`Adjusted R-sq`), "★", ""),
`Best BIC` = ifelse(`BIC` == min(`BIC`), "★", "")
)
best_sub_df |>
kable(
caption = "Best Subsets Regression: Adjusted R-squared and BIC by Model Size",
align = c("c", "r", "r", "c", "c")
) |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE) |>
row_spec(which(best_sub_summary$adjr2 == max(best_sub_summary$adjr2)),
bold = TRUE, color = "white", background = "#2196F3") |>
row_spec(which(best_sub_summary$bic == min(best_sub_summary$bic)),
bold = TRUE, color = "white", background = "#4CAF50")| Model Size | Adjusted R-sq | BIC | Best Adj R-sq | Best BIC |
|---|---|---|---|---|
| 1 | 0.2627 | -2420.70 | ||
| 2 | 0.2956 | -2778.12 | ||
| 3 | 0.3074 | -2905.97 | ||
| 4 | 0.3133 | -2966.10 | ||
| 5 | 0.3160 | -2989.73 | ||
| 6 | 0.3183 | -3007.98 | ||
| 7 | 0.3197 | -3016.83 | ||
| 8 | 0.3208 | -3021.26 | ||
| 9 | 0.3215 | -3021.91 | ★ | |
| 10 | 0.3221 | -3021.17 | ||
| 11 | 0.3225 | -3018.29 | ||
| 12 | 0.3231 | -3016.99 | ||
| 13 | 0.3235 | -3013.96 | ||
| 14 | 0.3237 | -3008.54 | ||
| 15 | 0.3239 | -3002.54 | ||
| 16 | 0.3240 | -2995.63 | ||
| 17 | 0.3241 | -2988.57 | ||
| 18 | 0.3241 | -2981.41 | ||
| 19 | 0.3242 | -2974.06 | ★ | |
| 20 | 0.3242 | -2965.61 | ★ |
subset_metrics <- tibble(
p = 1:length(best_sub_summary$adjr2),
`Adj. R²` = best_sub_summary$adjr2,
BIC = best_sub_summary$bic,
Cp = best_sub_summary$cp
)
p1 <- ggplot(subset_metrics, aes(x = p, y = `Adj. R²`)) +
geom_line(linewidth = 1, color = "steelblue") +
geom_point(size = 3, color = "steelblue") +
geom_vline(xintercept = which.max(best_sub_summary$adjr2),
linetype = "dashed", color = "tomato") +
labs(title = "Adjusted R² by Model Size", x = "Number of Variables", y = "Adjusted R²") +
theme_minimal(base_size = 12)
p2 <- ggplot(subset_metrics, aes(x = p, y = BIC)) +
geom_line(linewidth = 1, color = "green") +
geom_point(size = 3, color = "green") +
geom_vline(xintercept = which.min(best_sub_summary$bic),
linetype = "dashed", color = "tomato") +
labs(title = "BIC by Model Size", x = "Number of Variables", y = "BIC") +
theme_minimal(base_size = 12)
gridExtra::grid.arrange(p1, p2, ncol = 2)Best Subsets: Adjusted R² and BIC by Model Size
## Best model by Adj. R²: 19 variables
## Best model by BIC: 9 variables
# Show which variables are in the BIC-best model
best_bic_idx <- which.min(best_sub_summary$bic)
best_vars <- names(which(best_sub_summary$which[best_bic_idx, -1]))
cat("\nVariables in BIC-best model:\n")##
## Variables in BIC-best model:
## menthlth_days
## exerciseYes
## depressionYes
## age_group35-49
## age_group50-64
## age_group65+
## income$50K-$99K
## gen_healthGood
## gen_healthFair/Poor
#Interpretation: Adjusted R² increases steadily from 0.2627 (1 predictor) to 0.3242 (19–20 predictors). BIC decreases sharply early on, reaches its minimum at 9 predictors, then increases again.Adjusted R² suggests that adding predictors improves fit, but the improvement after ~10 predictors is negligible. Because the additional predictors in the larger models contribute very little explanatory power, the BIC‑selected model is preferred.
#Step 3: Stepwise Selection Methods
#Backward
# Automated backward elimination using AIC
mod_backward <- step(mod_max, direction = "backward", trace = 1)## Start: AIC=31410.04
## physhlth_days ~ menthlth_days + bmi + sex + exercise + depression +
## age_group + income + education + smoking + gen_health + marital
##
## Df Sum of Sq RSS AIC
## - marital 2 180 403789 31410
## <none> 403609 31410
## - sex 1 104 403712 31410
## - bmi 1 108 403716 31410
## - smoking 2 347 403956 31413
## - depression 1 656 404265 31421
## - education 3 876 404485 31421
## - income 3 1550 405158 31435
## - age_group 3 2235 405844 31448
## - exercise 1 4537 408146 31497
## - menthlth_days 1 13230 416839 31666
## - gen_health 2 84670 488279 32930
##
## Step: AIC=31409.61
## physhlth_days ~ menthlth_days + bmi + sex + exercise + depression +
## age_group + income + education + smoking + gen_health
##
## Df Sum of Sq RSS AIC
## - bmi 1 98 403887 31410
## <none> 403789 31410
## - sex 1 102 403891 31410
## - smoking 2 345 404134 31412
## - depression 1 637 404427 31420
## - education 3 875 404665 31421
## - income 3 1388 405177 31431
## - age_group 3 3048 406837 31464
## - exercise 1 4538 408327 31497
## - menthlth_days 1 13195 416984 31665
## - gen_health 2 84707 488496 32929
##
## Step: AIC=31409.55
## physhlth_days ~ menthlth_days + sex + exercise + depression +
## age_group + income + education + smoking + gen_health
##
## Df Sum of Sq RSS AIC
## <none> 403887 31410
## - sex 1 103 403990 31410
## - smoking 2 357 404244 31413
## - depression 1 612 404499 31420
## - education 3 875 404762 31421
## - income 3 1394 405281 31431
## - age_group 3 3049 406936 31464
## - exercise 1 4451 408338 31495
## - menthlth_days 1 13238 417125 31666
## - gen_health 2 85645 489532 32944
tidy(mod_backward, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Backward Elimination Result (AIC-based)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 1.6182 | 0.4912 | 3.2946 | 0.0010 | 0.6554 | 2.5811 |
| menthlth_days | 0.1836 | 0.0114 | 16.1748 | 0.0000 | 0.1614 | 0.2059 |
| sexMale | -0.2329 | 0.1633 | -1.4262 | 0.1539 | -0.5530 | 0.0872 |
| exerciseYes | -1.9048 | 0.2031 | -9.3793 | 0.0000 | -2.3029 | -1.5067 |
| depressionYes | 0.7471 | 0.2149 | 3.4769 | 0.0005 | 0.3259 | 1.1683 |
| age_group35-49 | 0.8374 | 0.2703 | 3.0980 | 0.0020 | 0.3075 | 1.3672 |
| age_group50-64 | 1.4778 | 0.2583 | 5.7206 | 0.0000 | 0.9714 | 1.9841 |
| age_group65+ | 1.8366 | 0.2513 | 7.3099 | 0.0000 | 1.3441 | 2.3292 |
| income$25K-$49K | -1.0324 | 0.2749 | -3.7557 | 0.0002 | -1.5713 | -0.4936 |
| income$50K-$99K | -1.3884 | 0.2658 | -5.2239 | 0.0000 | -1.9094 | -0.8674 |
| income$100K+ | -1.1122 | 0.3849 | -2.8897 | 0.0039 | -1.8666 | -0.3577 |
| educationHS/GED | 0.2576 | 0.4006 | 0.6431 | 0.5202 | -0.5276 | 1.0428 |
| educationSome college | 0.9636 | 0.4025 | 2.3943 | 0.0167 | 0.1747 | 1.7525 |
| educationCollege graduate | 1.0310 | 0.4043 | 2.5502 | 0.0108 | 0.2385 | 1.8236 |
| smokingFormer smoker | 0.4399 | 0.1876 | 2.3442 | 0.0191 | 0.0720 | 0.8077 |
| smokingCurrent smoker | 0.4776 | 0.2647 | 1.8038 | 0.0713 | -0.0414 | 0.9965 |
| gen_healthGood | 1.3636 | 0.1839 | 7.4161 | 0.0000 | 1.0032 | 1.7241 |
| gen_healthFair/Poor | 10.0009 | 0.2486 | 40.2245 | 0.0000 | 9.5136 | 10.4883 |
Interpretation:Backward elimination began with the full model and removed predictors that did not meaningfully improve AIC. The procedure ultimately retained a broad set of variables, including general health, mental health days, exercise, age group, income, depression, education, smoking status, and sex. This pattern reflects the cumulative contribution of many predictors once the strongest health‑related variables were accounted for. The backward path favored a more inclusive model, keeping any term that provided even modest incremental improvement in fit.
#Forward
# Automated forward selection using AIC
mod_null <- lm(physhlth_days ~ 1, data = brfss_analytic)
mod_forward <- step(mod_null,
scope = list(lower = mod_null, upper = mod_max),
direction = "forward", trace = 1)## Start: AIC=34524.38
## physhlth_days ~ 1
##
## Df Sum of Sq RSS AIC
## + gen_health 2 164022 434665 31967
## + menthlth_days 1 60303 538384 33677
## + exercise 1 34542 564145 34051
## + income 3 28842 569845 34135
## + depression 1 20750 577937 34244
## + smoking 2 12790 585897 34356
## + education 3 8593 590094 34415
## + marital 2 7321 591366 34430
## + bmi 1 6253 592434 34442
## + age_group 3 6527 592160 34443
## + sex 1 1649 597038 34504
## <none> 598687 34524
##
## Step: AIC=31967.07
## physhlth_days ~ gen_health
##
## Df Sum of Sq RSS AIC
## + menthlth_days 1 17940.1 416725 31632
## + exercise 1 6466.7 428198 31849
## + depression 1 6070.1 428595 31857
## + income 3 3231.0 431434 31913
## + smoking 2 1776.3 432889 31938
## + age_group 3 1535.7 433129 31945
## + sex 1 1254.2 433411 31946
## + marital 2 830.3 433835 31956
## + education 3 415.5 434250 31965
## <none> 434665 31967
## + bmi 1 0.2 434665 31969
##
## Step: AIC=31631.88
## physhlth_days ~ gen_health + menthlth_days
##
## Df Sum of Sq RSS AIC
## + exercise 1 5820.8 410904 31521
## + age_group 3 4661.5 412064 31548
## + income 3 1987.0 414738 31600
## + marital 2 1402.5 415322 31609
## + smoking 2 1034.6 415690 31616
## + depression 1 738.1 415987 31620
## + sex 1 605.2 416120 31622
## + education 3 318.9 416406 31632
## <none> 416725 31632
## + bmi 1 3.6 416721 31634
##
## Step: AIC=31521.35
## physhlth_days ~ gen_health + menthlth_days + exercise
##
## Df Sum of Sq RSS AIC
## + age_group 3 3720.5 407184 31455
## + income 3 1198.4 409706 31504
## + marital 2 1064.8 409839 31505
## + depression 1 739.1 410165 31509
## + smoking 2 706.2 410198 31512
## + education 3 686.9 410217 31514
## + sex 1 449.3 410455 31515
## <none> 410904 31521
## + bmi 1 82.9 410821 31522
##
## Step: AIC=31454.58
## physhlth_days ~ gen_health + menthlth_days + exercise + age_group
##
## Df Sum of Sq RSS AIC
## + income 3 1162.69 406021 31438
## + depression 1 932.70 406251 31438
## + education 3 501.69 406682 31451
## + sex 1 261.03 406923 31452
## + smoking 2 360.51 406823 31452
## <none> 407184 31455
## + bmi 1 83.59 407100 31455
## + marital 2 40.11 407144 31458
##
## Step: AIC=31437.71
## physhlth_days ~ gen_health + menthlth_days + exercise + age_group +
## income
##
## Df Sum of Sq RSS AIC
## + depression 1 861.57 405159 31423
## + education 3 953.62 405067 31425
## + smoking 2 305.02 405716 31436
## + sex 1 203.01 405818 31436
## <none> 406021 31438
## + bmi 1 74.91 405946 31438
## + marital 2 153.70 405867 31439
##
## Step: AIC=31422.71
## physhlth_days ~ gen_health + menthlth_days + exercise + age_group +
## income + depression
##
## Df Sum of Sq RSS AIC
## + education 3 837.21 404322 31412
## + smoking 2 259.11 404900 31422
## + sex 1 110.16 405049 31423
## + bmi 1 105.12 405054 31423
## <none> 405159 31423
## + marital 2 169.72 404990 31423
##
## Step: AIC=31412.16
## physhlth_days ~ gen_health + menthlth_days + exercise + age_group +
## income + depression + education
##
## Df Sum of Sq RSS AIC
## + smoking 2 332.18 403990 31410
## + bmi 1 110.19 404212 31412
## <none> 404322 31412
## + sex 1 77.87 404244 31413
## + marital 2 168.26 404154 31413
##
## Step: AIC=31409.59
## physhlth_days ~ gen_health + menthlth_days + exercise + age_group +
## income + depression + education + smoking
##
## Df Sum of Sq RSS AIC
## + sex 1 102.92 403887 31410
## <none> 403990 31410
## + bmi 1 98.80 403891 31410
## + marital 2 169.54 403820 31410
##
## Step: AIC=31409.55
## physhlth_days ~ gen_health + menthlth_days + exercise + age_group +
## income + depression + education + smoking + sex
##
## Df Sum of Sq RSS AIC
## <none> 403887 31410
## + bmi 1 97.861 403789 31410
## + marital 2 170.601 403716 31410
tidy(mod_forward, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Forward Selection Result (AIC-based)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 1.6182 | 0.4912 | 3.2946 | 0.0010 | 0.6554 | 2.5811 |
| gen_healthGood | 1.3636 | 0.1839 | 7.4161 | 0.0000 | 1.0032 | 1.7241 |
| gen_healthFair/Poor | 10.0009 | 0.2486 | 40.2245 | 0.0000 | 9.5136 | 10.4883 |
| menthlth_days | 0.1836 | 0.0114 | 16.1748 | 0.0000 | 0.1614 | 0.2059 |
| exerciseYes | -1.9048 | 0.2031 | -9.3793 | 0.0000 | -2.3029 | -1.5067 |
| age_group35-49 | 0.8374 | 0.2703 | 3.0980 | 0.0020 | 0.3075 | 1.3672 |
| age_group50-64 | 1.4778 | 0.2583 | 5.7206 | 0.0000 | 0.9714 | 1.9841 |
| age_group65+ | 1.8366 | 0.2513 | 7.3099 | 0.0000 | 1.3441 | 2.3292 |
| income$25K-$49K | -1.0324 | 0.2749 | -3.7557 | 0.0002 | -1.5713 | -0.4936 |
| income$50K-$99K | -1.3884 | 0.2658 | -5.2239 | 0.0000 | -1.9094 | -0.8674 |
| income$100K+ | -1.1122 | 0.3849 | -2.8897 | 0.0039 | -1.8666 | -0.3577 |
| depressionYes | 0.7471 | 0.2149 | 3.4769 | 0.0005 | 0.3259 | 1.1683 |
| educationHS/GED | 0.2576 | 0.4006 | 0.6431 | 0.5202 | -0.5276 | 1.0428 |
| educationSome college | 0.9636 | 0.4025 | 2.3943 | 0.0167 | 0.1747 | 1.7525 |
| educationCollege graduate | 1.0310 | 0.4043 | 2.5502 | 0.0108 | 0.2385 | 1.8236 |
| smokingFormer smoker | 0.4399 | 0.1876 | 2.3442 | 0.0191 | 0.0720 | 0.8077 |
| smokingCurrent smoker | 0.4776 | 0.2647 | 1.8038 | 0.0713 | -0.0414 | 0.9965 |
| sexMale | -0.2329 | 0.1633 | -1.4262 | 0.1539 | -0.5530 | 0.0872 |
Interpretation: Forward selection started from the null model and added predictors in order of their marginal AIC improvement. General health entered first, followed by mental health days and exercise, confirming these as the dominant predictors. Age group and income entered next, with depression, education, smoking, and sex added later as smaller but still meaningful contributors. The forward procedure built the model gradually, and its final set of predictors matched the backward solution, reinforcing the stability of this subset.
#Stepwise
mod_stepwise <- step(mod_null,
scope = list(lower = mod_null, upper = mod_max),
direction = "both", trace = 1)## Start: AIC=34524.38
## physhlth_days ~ 1
##
## Df Sum of Sq RSS AIC
## + gen_health 2 164022 434665 31967
## + menthlth_days 1 60303 538384 33677
## + exercise 1 34542 564145 34051
## + income 3 28842 569845 34135
## + depression 1 20750 577937 34244
## + smoking 2 12790 585897 34356
## + education 3 8593 590094 34415
## + marital 2 7321 591366 34430
## + bmi 1 6253 592434 34442
## + age_group 3 6527 592160 34443
## + sex 1 1649 597038 34504
## <none> 598687 34524
##
## Step: AIC=31967.07
## physhlth_days ~ gen_health
##
## Df Sum of Sq RSS AIC
## + menthlth_days 1 17940 416725 31632
## + exercise 1 6467 428198 31849
## + depression 1 6070 428595 31857
## + income 3 3231 431434 31913
## + smoking 2 1776 432889 31938
## + age_group 3 1536 433129 31945
## + sex 1 1254 433411 31946
## + marital 2 830 433835 31956
## + education 3 416 434250 31965
## <none> 434665 31967
## + bmi 1 0 434665 31969
## - gen_health 2 164022 598687 34524
##
## Step: AIC=31631.88
## physhlth_days ~ gen_health + menthlth_days
##
## Df Sum of Sq RSS AIC
## + exercise 1 5821 410904 31521
## + age_group 3 4661 412064 31548
## + income 3 1987 414738 31600
## + marital 2 1402 415322 31609
## + smoking 2 1035 415690 31616
## + depression 1 738 415987 31620
## + sex 1 605 416120 31622
## + education 3 319 416406 31632
## <none> 416725 31632
## + bmi 1 4 416721 31634
## - menthlth_days 1 17940 434665 31967
## - gen_health 2 121660 538384 33677
##
## Step: AIC=31521.35
## physhlth_days ~ gen_health + menthlth_days + exercise
##
## Df Sum of Sq RSS AIC
## + age_group 3 3720 407184 31455
## + income 3 1198 409706 31504
## + marital 2 1065 409839 31505
## + depression 1 739 410165 31509
## + smoking 2 706 410198 31512
## + education 3 687 410217 31514
## + sex 1 449 410455 31515
## <none> 410904 31521
## + bmi 1 83 410821 31522
## - exercise 1 5821 416725 31632
## - menthlth_days 1 17294 428198 31849
## - gen_health 2 101805 512709 33288
##
## Step: AIC=31454.58
## physhlth_days ~ gen_health + menthlth_days + exercise + age_group
##
## Df Sum of Sq RSS AIC
## + income 3 1163 406021 31438
## + depression 1 933 406251 31438
## + education 3 502 406682 31451
## + sex 1 261 406923 31451
## + smoking 2 361 406823 31451
## <none> 407184 31455
## + bmi 1 84 407100 31455
## + marital 2 40 407144 31458
## - age_group 3 3720 410904 31521
## - exercise 1 4880 412064 31548
## - menthlth_days 1 19957 427141 31835
## - gen_health 2 94875 502059 33126
##
## Step: AIC=31437.71
## physhlth_days ~ gen_health + menthlth_days + exercise + age_group +
## income
##
## Df Sum of Sq RSS AIC
## + depression 1 862 405159 31423
## + education 3 954 405067 31425
## + smoking 2 305 405716 31436
## + sex 1 203 405818 31436
## <none> 406021 31438
## + bmi 1 75 405946 31438
## + marital 2 154 405867 31439
## - income 3 1163 407184 31455
## - age_group 3 3685 409706 31504
## - exercise 1 4214 410235 31518
## - menthlth_days 1 18945 424966 31801
## - gen_health 2 87509 493530 32995
##
## Step: AIC=31422.71
## physhlth_days ~ gen_health + menthlth_days + exercise + age_group +
## income + depression
##
## Df Sum of Sq RSS AIC
## + education 3 837 404322 31412
## + smoking 2 259 404900 31422
## + sex 1 110 405049 31423
## + bmi 1 105 405054 31423
## <none> 405159 31423
## + marital 2 170 404990 31423
## - depression 1 862 406021 31438
## - income 3 1092 406251 31438
## - age_group 3 3871 409030 31493
## - exercise 1 4219 409378 31504
## - menthlth_days 1 13674 418834 31686
## - gen_health 2 86610 491770 32969
##
## Step: AIC=31412.16
## physhlth_days ~ gen_health + menthlth_days + exercise + age_group +
## income + depression + education
##
## Df Sum of Sq RSS AIC
## + smoking 2 332 403990 31410
## + bmi 1 110 404212 31412
## <none> 404322 31412
## + sex 1 78 404244 31413
## + marital 2 168 404154 31413
## - education 3 837 405159 31423
## - depression 1 745 405067 31425
## - income 3 1495 405818 31436
## - age_group 3 3558 407880 31476
## - exercise 1 4604 408926 31501
## - menthlth_days 1 13607 417929 31675
## - gen_health 2 86813 491135 32964
##
## Step: AIC=31409.59
## physhlth_days ~ gen_health + menthlth_days + exercise + age_group +
## income + depression + education + smoking
##
## Df Sum of Sq RSS AIC
## + sex 1 103 403887 31410
## <none> 403990 31410
## + bmi 1 99 403891 31410
## + marital 2 170 403820 31410
## - smoking 2 332 404322 31412
## - depression 1 691 404681 31421
## - education 3 910 404900 31422
## - income 3 1457 405447 31432
## - age_group 3 3178 407168 31466
## - exercise 1 4502 408492 31496
## - menthlth_days 1 13347 417337 31668
## - gen_health 2 85552 489542 32942
##
## Step: AIC=31409.55
## physhlth_days ~ gen_health + menthlth_days + exercise + age_group +
## income + depression + education + smoking + sex
##
## Df Sum of Sq RSS AIC
## <none> 403887 31410
## - sex 1 103 403990 31410
## + bmi 1 98 403789 31410
## + marital 2 171 403716 31410
## - smoking 2 357 404244 31413
## - depression 1 612 404499 31420
## - education 3 875 404762 31421
## - income 3 1394 405281 31431
## - age_group 3 3049 406936 31464
## - exercise 1 4451 408338 31495
## - menthlth_days 1 13238 417125 31666
## - gen_health 2 85645 489532 32944
tidy(mod_stepwise, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Stepwise Selection Result (AIC-based)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 1.6182 | 0.4912 | 3.2946 | 0.0010 | 0.6554 | 2.5811 |
| gen_healthGood | 1.3636 | 0.1839 | 7.4161 | 0.0000 | 1.0032 | 1.7241 |
| gen_healthFair/Poor | 10.0009 | 0.2486 | 40.2245 | 0.0000 | 9.5136 | 10.4883 |
| menthlth_days | 0.1836 | 0.0114 | 16.1748 | 0.0000 | 0.1614 | 0.2059 |
| exerciseYes | -1.9048 | 0.2031 | -9.3793 | 0.0000 | -2.3029 | -1.5067 |
| age_group35-49 | 0.8374 | 0.2703 | 3.0980 | 0.0020 | 0.3075 | 1.3672 |
| age_group50-64 | 1.4778 | 0.2583 | 5.7206 | 0.0000 | 0.9714 | 1.9841 |
| age_group65+ | 1.8366 | 0.2513 | 7.3099 | 0.0000 | 1.3441 | 2.3292 |
| income$25K-$49K | -1.0324 | 0.2749 | -3.7557 | 0.0002 | -1.5713 | -0.4936 |
| income$50K-$99K | -1.3884 | 0.2658 | -5.2239 | 0.0000 | -1.9094 | -0.8674 |
| income$100K+ | -1.1122 | 0.3849 | -2.8897 | 0.0039 | -1.8666 | -0.3577 |
| depressionYes | 0.7471 | 0.2149 | 3.4769 | 0.0005 | 0.3259 | 1.1683 |
| educationHS/GED | 0.2576 | 0.4006 | 0.6431 | 0.5202 | -0.5276 | 1.0428 |
| educationSome college | 0.9636 | 0.4025 | 2.3943 | 0.0167 | 0.1747 | 1.7525 |
| educationCollege graduate | 1.0310 | 0.4043 | 2.5502 | 0.0108 | 0.2385 | 1.8236 |
| smokingFormer smoker | 0.4399 | 0.1876 | 2.3442 | 0.0191 | 0.0720 | 0.8077 |
| smokingCurrent smoker | 0.4776 | 0.2647 | 1.8038 | 0.0713 | -0.0414 | 0.9965 |
| sexMale | -0.2329 | 0.1633 | -1.4262 | 0.1539 | -0.5530 | 0.0872 |
Interpretation: Stepwise selection combined forward addition and backward removal, re‑evaluating AIC after each change. The algorithm followed a similar trajectory to forward selection, beginning with general health and then adding mental health days, exercise, age group, income, depression, education, smoking, and sex. Because stepwise continually checks whether previously added variables remain necessary, its convergence with both forward and backward elimination indicates that each retained predictor consistently improves model fit across multiple search strategies.
method_comparison <- tribble(
~Method, ~`Variables selected`, ~`Adj. R²`, ~AIC, ~BIC,
"Maximum model",
length(coef(mod_max)) - 1,
round(glance(mod_max)$adj.r.squared, 4),
round(AIC(mod_max), 1),
round(BIC(mod_max), 1),
"Backward (AIC)",
length(coef(mod_backward)) - 1,
round(glance(mod_backward)$adj.r.squared, 4),
round(AIC(mod_backward), 1),
round(BIC(mod_backward), 1),
"Forward (AIC)",
length(coef(mod_forward)) - 1,
round(glance(mod_forward)$adj.r.squared, 4),
round(AIC(mod_forward), 1),
round(BIC(mod_forward), 1),
"Stepwise (AIC)",
length(coef(mod_stepwise)) - 1,
round(glance(mod_stepwise)$adj.r.squared, 4),
round(AIC(mod_stepwise), 1),
round(BIC(mod_stepwise), 1)
)
method_comparison |>
kable(caption = "Comparison of Variable Selection Methods") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Method | Variables selected | Adj. R² | AIC | BIC |
|---|---|---|---|---|
| Maximum model | 20 | 0.3242 | 54115.1 | 54268.8 |
| Backward (AIC) | 17 | 0.3239 | 54114.6 | 54247.3 |
| Forward (AIC) | 17 | 0.3239 | 54114.6 | 54247.3 |
| Stepwise (AIC) | 17 | 0.3239 | 54114.6 | 54247.3 |
Interpretation: All three AIC‑based procedures—backward, forward, and stepwise—converged on the same 17‑variable model, indicating strong agreement across selection strategies. The only variables excluded by all three methods were BMI and marital status, which consistently failed to improve AIC once other predictors were included. In contrast, variables such as general health, mental health days, exercise, age group, income, depression, education, smoking status, and sex were retained in every method, reflecting their stable and meaningful contribution to predicting physically unhealthy days. The near‑identical adjusted R² and AIC values across the three models further reinforce the consistency of these results.
#Step 4: Final Model Selection and Interpretation
For the final model, I selected the AIC‑based model identified by backward, forward, and stepwise selection, because all three procedures converged on the same 17‑predictor specification. AIC: Lower (54114.65 vs 54115.15) BIC: Much lower (247.3 vs 268.8) Adj R²: Essentially the same (0.3239 vs 0.3247) I preferred AIC‑based selection because it balances goodness‑of‑fit with model complexity and performed more reliably.
# Final model = maximum model (all stepwise methods agreed)
mod_final <- mod_stepwise
# Display coefficient table
tidy(mod_final, conf.int = TRUE) |>
mutate(across(where(is.numeric), ~ round(., 3))) |>
kable(
caption = "Final Linear Regression Model: Coefficient Table (N = 8,000)",
col.names = c("Term", "Estimate", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper"),
align = c("l", rep("r", 6))
) |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE) |>
scroll_box(width = "100%")| Term | Estimate | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 1.618 | 0.491 | 3.295 | 0.001 | 0.655 | 2.581 |
| gen_healthGood | 1.364 | 0.184 | 7.416 | 0.000 | 1.003 | 1.724 |
| gen_healthFair/Poor | 10.001 | 0.249 | 40.225 | 0.000 | 9.514 | 10.488 |
| menthlth_days | 0.184 | 0.011 | 16.175 | 0.000 | 0.161 | 0.206 |
| exerciseYes | -1.905 | 0.203 | -9.379 | 0.000 | -2.303 | -1.507 |
| age_group35-49 | 0.837 | 0.270 | 3.098 | 0.002 | 0.308 | 1.367 |
| age_group50-64 | 1.478 | 0.258 | 5.721 | 0.000 | 0.971 | 1.984 |
| age_group65+ | 1.837 | 0.251 | 7.310 | 0.000 | 1.344 | 2.329 |
| income$25K-$49K | -1.032 | 0.275 | -3.756 | 0.000 | -1.571 | -0.494 |
| income$50K-$99K | -1.388 | 0.266 | -5.224 | 0.000 | -1.909 | -0.867 |
| income$100K+ | -1.112 | 0.385 | -2.890 | 0.004 | -1.867 | -0.358 |
| depressionYes | 0.747 | 0.215 | 3.477 | 0.001 | 0.326 | 1.168 |
| educationHS/GED | 0.258 | 0.401 | 0.643 | 0.520 | -0.528 | 1.043 |
| educationSome college | 0.964 | 0.402 | 2.394 | 0.017 | 0.175 | 1.753 |
| educationCollege graduate | 1.031 | 0.404 | 2.550 | 0.011 | 0.239 | 1.824 |
| smokingFormer smoker | 0.440 | 0.188 | 2.344 | 0.019 | 0.072 | 0.808 |
| smokingCurrent smoker | 0.478 | 0.265 | 1.804 | 0.071 | -0.041 | 0.997 |
| sexMale | -0.233 | 0.163 | -1.426 | 0.154 | -0.553 | 0.087 |
#Interpretation of Coefficients 1. Mental health days (continuous
predictor).
Holding all other variables constant, each additional mentally unhealthy
day a person reports is associated with 0.18 more physically unhealthy
days per month. This means that worse mental health is linked to
slightly more days of poor physical health, on average.
Exercise (categorical predictor: Yes vs. No).
Compared to people who do not exercise, individuals who report
exercising have 1.9 fewer physically unhealthy days per month, holding
all other variables constant. This indicates a meaningful protective
association between regular exercise and physical health.
General health (categorical predictor: Fair/Poor
vs. Excellent/Very Good).
Holding all else constant, individuals who rate their general health as
fair or poor report 10 more physically unhealthy days per month than
those in excellent or very good health. This is the largest effect in
the model and reflects the strong link between global health perception
and physical functioning.
#Step 5: LINE Assumption Check
#Interpretation 1. Residuals vs. Fitted The residuals show a slight curved pattern and uneven vertical spread, suggesting mild non‑linearity and non‑constant variance. The red smoothing line deviates from a flat horizontal line, reinforcing this.
Normal Q‑Q Plot The residuals do not follow a normal distribution perfectly: they deviate from the diagonal in both tails, with the upper tail showing the largest departure. This indicates right‑skewness and heavier‑than‑normal tails.
Scale‑Location Plot The spread of the standardized residuals increases slightly at higher fitted values, indicating heteroscedasticity rather than constant variance. The red line slopes upward instead of remaining flat.
Residuals vs. Leverage There are no highly influential observations, as no points exceed the Cook’s distance contours. A few points have moderate leverage, but none appear problematic.
Overall, the LINE assumptions are partially satisfied. The model shows mild violations of linearity, normality, and homoscedasticity, which are common with bounded count‑type outcomes like health‑day measures. However, there are no influential outliers, and the model remains reasonably stable despite these assumption departures.
#Part 2: Logistic Regression (40 points)
# Step 1: Create binary outcome
brfss_analytic <- brfss_analytic %>%
mutate(
frequent_distress = factor(
case_when(
physhlth_days >= 14 ~ "Yes",
physhlth_days < 14 ~ "No",
TRUE ~ NA_character_
),
levels = c("No", "Yes") # No is reference
)
)
# Report
distress_table <- brfss_analytic |>
count(frequent_distress) |>
mutate(Percentage = round(n / sum(n) * 100, 1))
distress_table |>
kable(
caption = "Prevalence of Frequent Physical Distress (≥14 days in past 30)",
col.names = c("Frequent Physical Distress", "N", "Percentage (%)"),
align = c("l", "r", "r")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Frequent Physical Distress | N | Percentage (%) |
|---|---|---|
| No | 6948 | 86.9 |
| Yes | 1052 | 13.2 |
#Interpretation: Among 8,000 adults, 1,052 (13.2%) reported frequent physical distress (≥14 physically unhealthy days in the past 30 days), while 6,948 (86.9%) did not. The outcome is imbalanced, with the “No distress” category making up the large majority of the sample.
#Step 2: Simple (Unadjusted) Logistic Regression (10 points)
Predictor: Mentally Unhealthy days WHy: general health is almost similar summary of physical health, while mental health is a distinct domain, and could make the logistic model more interpretable.It is also the strongest continuous predictor in the final linear model (t- statistic = 16.2), showing a clear association with physical health.
# Fit simple logistic regression
mod_simple <- glm(
frequent_distress ~ menthlth_days,
data = brfss_analytic,
family = binomial
)
summary(mod_simple)##
## Call:
## glm(formula = frequent_distress ~ menthlth_days, family = binomial,
## data = brfss_analytic)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.346000 0.042986 -54.58 <2e-16 ***
## menthlth_days 0.073236 0.003181 23.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6227.7 on 7999 degrees of freedom
## Residual deviance: 5734.5 on 7998 degrees of freedom
## AIC: 5738.5
##
## Number of Fisher Scoring iterations: 5
# Display results with odds ratios
tidy(mod_simple, exponentiate = TRUE, conf.int = TRUE) %>%
kable(caption = "Simple Logistic Regression: Frequent Distress ~ Mentally unhealthy Days (Odds Ratios)",
digits = 3,
col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE)| Term | Odds Ratio | Std. Error | z-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 0.096 | 0.043 | -54.576 | 0 | 0.088 | 0.104 |
| menthlth_days | 1.076 | 0.003 | 23.020 | 0 | 1.069 | 1.083 |
# Odds ratios and 95% CIs
or_simple <- exp(coef(mod_simple))
ci_simple <- exp(confint(mod_simple))
tibble(
Term = names(or_simple),
`Odds Ratio` = round(or_simple, 3),
`95% CI Lower` = round(ci_simple[, 1], 3),
`95% CI Upper` = round(ci_simple[, 2], 3)
) |>
kable(
caption = "Simple Logistic Regression: Odds Ratios for Mentally Unhealthy Days",
align = c("l", "r", "r", "r")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Odds Ratio | 95% CI Lower | 95% CI Upper |
|---|---|---|---|
| (Intercept) | 0.096 | 0.088 | 0.104 |
| menthlth_days | 1.076 | 1.069 | 1.083 |
#Interpretation: For every one‑unit increase in mentally unhealthy days, the odds of frequent physical distress are multiplied by 1.076, with a 95% CI [1.069, 1.083], holding all other variables constant. This means that each additional mentally unhealthy day is associated with about a 7.6% increase in the odds of experiencing ≥14 physically unhealthy days in the past month.
The association is statistically significant at α = 0.05, because the Wald test p‑value < 0.001 and the 95% confidence interval does not include 1.0.
#Step 3: Multiple Logistic Regression (10 points)
# Fit multiple logistic regression using same predictors as final linear model
mod_mlr <- glm(
frequent_distress ~ menthlth_days + bmi + sex + exercise + depression + age_group + income + education + smoking + gen_health + marital,
data = brfss_analytic,
family = binomial
)
summary(mod_mlr)##
## Call:
## glm(formula = frequent_distress ~ menthlth_days + bmi + sex +
## exercise + depression + age_group + income + education +
## smoking + gen_health + marital, family = binomial, data = brfss_analytic)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.463024 0.292454 -11.841 < 2e-16 ***
## menthlth_days 0.044787 0.004358 10.277 < 2e-16 ***
## bmi -0.004822 0.005556 -0.868 0.38546
## sexMale -0.176389 0.081382 -2.167 0.03020 *
## exerciseYes -0.586264 0.085768 -6.835 8.17e-12 ***
## depressionYes 0.267888 0.095954 2.792 0.00524 **
## age_group35-49 0.512672 0.162401 3.157 0.00160 **
## age_group50-64 0.767269 0.154816 4.956 7.20e-07 ***
## age_group65+ 0.964882 0.155220 6.216 5.09e-10 ***
## income$25K-$49K -0.215375 0.111712 -1.928 0.05386 .
## income$50K-$99K -0.501752 0.118036 -4.251 2.13e-05 ***
## income$100K+ -0.541456 0.227454 -2.381 0.01729 *
## educationHS/GED 0.049334 0.165069 0.299 0.76504
## educationSome college 0.331979 0.165058 2.011 0.04430 *
## educationCollege graduate 0.283640 0.171335 1.655 0.09783 .
## smokingFormer smoker 0.200388 0.089714 2.234 0.02551 *
## smokingCurrent smoker 0.208048 0.116374 1.788 0.07381 .
## gen_healthGood 0.895044 0.116270 7.698 1.38e-14 ***
## gen_healthFair/Poor 2.696530 0.115309 23.385 < 2e-16 ***
## maritalPreviously married -0.138611 0.096205 -1.441 0.14965
## maritalNever married -0.143542 0.127789 -1.123 0.26132
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6227.7 on 7999 degrees of freedom
## Residual deviance: 4420.7 on 7979 degrees of freedom
## AIC: 4462.7
##
## Number of Fisher Scoring iterations: 6
# Display results with odds ratios
tidy(mod_mlr, exponentiate = TRUE, conf.int = TRUE) %>%
kable(caption = "Multiple Logistic Regression (Adjusted Odds Ratios)",
digits = 3,
col.names = c("Term", "Odds Ratio", "Std. Error", "z-statistic", "p-value", "95% CI Lower", "95% CI Upper")) %>%
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE)| Term | Odds Ratio | Std. Error | z-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 0.031 | 0.292 | -11.841 | 0.000 | 0.018 | 0.055 |
| menthlth_days | 1.046 | 0.004 | 10.277 | 0.000 | 1.037 | 1.055 |
| bmi | 0.995 | 0.006 | -0.868 | 0.385 | 0.984 | 1.006 |
| sexMale | 0.838 | 0.081 | -2.167 | 0.030 | 0.715 | 0.983 |
| exerciseYes | 0.556 | 0.086 | -6.835 | 0.000 | 0.470 | 0.658 |
| depressionYes | 1.307 | 0.096 | 2.792 | 0.005 | 1.082 | 1.576 |
| age_group35-49 | 1.670 | 0.162 | 3.157 | 0.002 | 1.217 | 2.302 |
| age_group50-64 | 2.154 | 0.155 | 4.956 | 0.000 | 1.596 | 2.929 |
| age_group65+ | 2.624 | 0.155 | 6.216 | 0.000 | 1.944 | 3.573 |
| income$25K-$49K | 0.806 | 0.112 | -1.928 | 0.054 | 0.648 | 1.004 |
| income$50K-$99K | 0.605 | 0.118 | -4.251 | 0.000 | 0.481 | 0.763 |
| income$100K+ | 0.582 | 0.227 | -2.381 | 0.017 | 0.368 | 0.899 |
| educationHS/GED | 1.051 | 0.165 | 0.299 | 0.765 | 0.762 | 1.456 |
| educationSome college | 1.394 | 0.165 | 2.011 | 0.044 | 1.011 | 1.932 |
| educationCollege graduate | 1.328 | 0.171 | 1.655 | 0.098 | 0.952 | 1.864 |
| smokingFormer smoker | 1.222 | 0.090 | 2.234 | 0.026 | 1.024 | 1.456 |
| smokingCurrent smoker | 1.231 | 0.116 | 1.788 | 0.074 | 0.979 | 1.545 |
| gen_healthGood | 2.447 | 0.116 | 7.698 | 0.000 | 1.952 | 3.081 |
| gen_healthFair/Poor | 14.828 | 0.115 | 23.385 | 0.000 | 11.863 | 18.648 |
| maritalPreviously married | 0.871 | 0.096 | -1.441 | 0.150 | 0.720 | 1.050 |
| maritalNever married | 0.866 | 0.128 | -1.123 | 0.261 | 0.673 | 1.111 |
#Interpretation: 1. Mentally unhealthy days (continuous predictor) Holding all other variables constant, each additional mentally unhealthy day is associated with 1.046 times higher odds of frequent physical distress, 95% CI [1.037, 1.055]. This association is statistically significant, as the p‑value is <0.001 and the confidence interval does not include 1.0.
Exercise (Yes vs. No) Compared to adults who do not exercise, those who do exercise have 0.556 times the odds of frequent physical distress, 95% CI [0.470, 0.658], holding all other variables constant. This indicates a strong protective association, and it is statistically significant based on the p‑value <0.001 and CI excluding 1.0.
General health (Fair/Poor vs. Excellent/Very Good) Holding all else constant, individuals reporting fair or poor general health have 14.83 times the odds of frequent physical distress compared to those in excellent or very good health, 95% CI [11.86, 18.65]. This is the strongest predictor in the model and is highly statistically significant (p < 0.001, CI excludes 1.0).
#Step 4: Likelihood Ratio Test
Hypotheses Null hypothesis (H₀): The education variables do not improve model fit; all education coefficients = 0.
Alternative hypothesis (H₁): At least one education coefficient ≠ 0; the education group improves model fit.
# Fit reduced model excluding education
mod_reduced <- glm(
frequent_distress ~ menthlth_days + bmi + sex + exercise + depression + age_group + income + smoking + gen_health + marital,
data = brfss_analytic,
family = binomial
)
# Likelihood ratio test
lrt_result <- anova(mod_reduced, mod_mlr, test = "Chisq")
print(lrt_result)## Analysis of Deviance Table
##
## Model 1: frequent_distress ~ menthlth_days + bmi + sex + exercise + depression +
## age_group + income + smoking + gen_health + marital
## Model 2: frequent_distress ~ menthlth_days + bmi + sex + exercise + depression +
## age_group + income + education + smoking + gen_health + marital
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 7982 4430.3
## 2 7979 4420.7 3 9.6201 0.02209 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Report LRT results
tibble(
Model = c("Reduced (no education)", "Full (with education)"),
Deviance = round(c(mod_reduced$deviance, mod_mlr$deviance), 2),
df = c(mod_reduced$df.residual, mod_mlr$df.residual)
) |>
kable(
caption = "Model Deviances for Likelihood Ratio Test",
align = c("l", "r", "r")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Model | Deviance | df |
|---|---|---|
| Reduced (no education) | 4430.28 | 7982 |
| Full (with education) | 4420.66 | 7979 |
#Interpretation Comparing the reduced model (without education) to the full model (with education) gives:
Deviance difference (χ²): 9.62 Degrees of freedom: 3 p‑value: 0.022
At α = 0.05, we reject the null hypothesis, because the p‑value (0.022) is below 0.05. This means that, collectively, the education variables significantly improve the fit of the logistic regression model for predicting frequent physical distress. Even though some individual education levels were not significant on their own, the group as a whole contributes meaningful explanatory value.
#STEP 5: ROC Curve and AUC
# Generate ROC curve
roc_obj <- roc(
brfss_analytic$frequent_distress,
fitted(mod_mlr),
levels = c("No", "Yes"),
direction = "<"
)
# Plot ROC curve
plot(
roc_obj,
main = "ROC Curve: Frequent Physical Distress Model",
col = "steelblue",
lwd = 2,
print.auc = TRUE,
auc.polygon = TRUE,
auc.polygon.col = "lightgreen",
print.auc.y = 0.45
)
abline(a = 0, b = 1, lty = 2, col = "grey")# Report AUC with 95% CI
auc_val <- auc(roc_obj)
auc_ci <- ci.auc(roc_obj)
cat("AUC:", round(auc_val, 4), "\n")## AUC: 0.8555
## 95% CI: 0.8426 – 0.8683
#Interpretation: An AUC of 0.8555 indicates that the model has excellent discrimination according to the interpretation scale. This means that, when presented with one person who does have frequent physical distress and one person who does not, the model will correctly rank them (assigning a higher predicted probability to the distressed individual) about 86% of the time.
The 95% CI (0.8426–0.8683) is narrow and entirely within the excellent range (0.8–0.9).
#Step 6: Hosmer-Lemeshow Goodness-of-Fit Test (5 points)
#Hypotheses H₀ (null): The model fits the data well; observed and expected counts are similar across deciles of predicted risk.
H₁ (alternative): The model does not fit the data well; observed and expected counts differ in at least some deciles.
# Hosmer-Lemeshow goodness-of-fit test
hl_test <- hoslem.test(mod_mlr$y, fitted(mod_mlr), g = 10)
print(hl_test)##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: mod_mlr$y, fitted(mod_mlr)
## X-squared = 5.5773, df = 8, p-value = 0.6945
# Observed vs. expected
hl_obs_exp <- cbind(hl_test$observed, hl_test$expected)
colnames(hl_obs_exp) <- c("Obs. No", "Obs. Yes", "Exp. No", "Exp. Yes")
hl_obs_exp |>
as.data.frame() |>
rownames_to_column("Decile") |>
mutate(across(where(is.numeric), ~ round(., 1))) |>
kable(
caption = "Hosmer-Lemeshow Test: Observed vs. Expected by Predicted Probability Decile",
align = c("l", rep("r", 4))
) |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)| Decile | Obs. No | Obs. Yes | Exp. No | Exp. Yes |
|---|---|---|---|---|
| [0.00675,0.0185] | 786 | 14 | 789.0 | 11.0 |
| (0.0185,0.025] | 784 | 16 | 782.6 | 17.4 |
| (0.025,0.0305] | 785 | 15 | 778.1 | 21.9 |
| (0.0305,0.0378] | 777 | 23 | 773.1 | 26.9 |
| (0.0378,0.0507] | 765 | 35 | 764.9 | 35.1 |
| (0.0507,0.0679] | 753 | 47 | 752.7 | 47.3 |
| (0.0679,0.0946] | 728 | 72 | 736.1 | 63.9 |
| (0.0946,0.187] | 692 | 108 | 695.8 | 104.2 |
| (0.187,0.425] | 546 | 254 | 552.1 | 247.9 |
| (0.425,0.893] | 332 | 468 | 323.6 | 476.4 |
Interpretation: From the analysis of deviance table:
Test statistic (χ²): 5.57 Degrees of freedom: 8 p‑value: 0.6945
At α = 0.05, we fail to reject the null hypothesis because the p‑value (0.6945) is much greater than 0.05. This indicates no evidence of poor model fit, hence the model’s predicted probabilities align well with the observed outcomes across risk deciles.
The Hosmer–Lemeshow test evaluates calibration (how close predicted probabilities are to actual outcomes), while the ROC/AUC evaluates discrimination (how well the model separates cases from non‑cases). Together, the model has excellent discrimination (AUC = 0.856) and good calibration and it ranks individuals correctly as well as produces aligned predicted probabilities.
#Part 3: Reflection
A. Statistical Insights
Whild conducting both the linear and logistic regression analyses, the results showed that physical health burden among U.S. adults is strongly shaped by overall health status, mental health, health behaviors, and socioeconomic position. General health emerged as the single strongest predictor in every model: individuals reporting fair or poor health had dramatically higher physically unhealthy days and nearly 15 fold increased odds of frequent physical distress with poor general health. Mentally unhealthy days was also strong predictor, showing a clear dose–response pattern in both modeling frameworks. Exercise, age group, depression, income, and smoking also showed meaningful associations, though their magnitudes were smaller. Some predictors, such as education, were statistically significant in the linear model but only partially significant in the logistic model, reflecting differences in how each framework captures variation in the outcome. Because the BRFSS is cross‑sectional, the analysis cannot establish temporal ordering or causality; poor physical health may influence mental health, income, or behaviors just as much as the reverse. Additionally, unmeasured confounders such as chronic disease burden, access to healthcare, occupational exposures, or neighborhood environment could bias associations if they influence both predictors and physical distress but were not included in the model.
B. Linear vs Logistic Regression
The linear regression model showed how many physically unhealthy days are associated with each predictor, offering an interpretation of changes in the mean number of days. On the other hand, the logistic regression model focus on the probability of crossing a clinically meaningful threshold (≥14 days) by CDC definitions and public‑health surveillance. Logistic regression also allows evaluation of discrimination through the AUC, which showed excellent ability (AUC = 0.856) to distinguish individuals with and without frequent distress which the linear model cannot provide. Linear regression is preferable when the outcome is approximately continuous and normally distributed, whereas logistic is more appropriate for binary outcomes. Model selection criteria also differ: linear models rely on R², adjusted R², and AIC/BIC for fit and parsimony, while logistic models emphasize deviance, likelihood‑ratio tests, calibration (Hosmer–Lemeshow), and discrimination metrics such as AUC. Using both of these approaches showed the important angle to look at the physical health burden in the US population.
C. AI Transparency
I consulted an AI assistant during Part 0 (Data Preparation) to help debug a data‑quality issue. After selecting 12 raw BRFSS variables and creating 12 recoded versions using mutate(), I expected my analytic dataset to contain 24 columns. However, after applying drop_na() and slice_sample(), the dataset only showed 23 variables. I asked the AI why this might happen, and it suggested that one of the variable names in my select() statement might not match the name in the raw BRFSS dataset. I double‑checked this by inspecting each dataset individually—for example, using unique(brfss_clean$marital) and verifying that “marital” %in% names(brfss_raw) returned TRUE for all four datasets. Even so, the final dataset continued to show only 23 variables. Aside from this issue, I followed lectures and it was fairly easy to follow through.
#end of homework 4