Submission: Knit this file to HTML, publish to RPubs with the title
epi553_hw04_lastname_firstname, and paste the RPubs URL in the Brightspace submission comments. Also upload this.Rmdfile to Brightspace.AI Policy: AI tools are permitted on this assignment for debugging not coding. See the assignment description for full details.
Step 1.1: Load required packages: tidyverse, haven, broom, kableExtra, car, summary, leaps, pROC, ResourceSelection
# Load required packages
library(kableExtra)
library(broom)
library(haven)
library(janitor)
library(plotly)
library(broom)
library(ggeffects)
library(ggstats)
library(gtsummary)
library(GGally)
library(car)
library(lmtest)
library(corrplot)
library(leaps)
library(pROC)
library(ResourceSelection)
library(tidyverse)Step 1.2: Import using read_xpt. Report the total number of rows and columns in the raw dataset
# Import the dataset — update the path if needed
brfss_raw <- read_xpt("LLCP2023.XPT") |> clean_names()
# Quick check
glimpse(brfss_raw)## Rows: 433,323
## Columns: 350
## $ state <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ fmonth <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ idate <chr> "03012023", "01062023", "03082023", "03062023", "01062023", "…
## $ imonth <chr> "03", "01", "03", "03", "01", "01", "03", "01", "03", "01", "…
## $ iday <chr> "01", "06", "08", "06", "06", "09", "21", "06", "15", "08", "…
## $ iyear <chr> "2023", "2023", "2023", "2023", "2023", "2023", "2023", "2023…
## $ dispcode <dbl> 1100, 1100, 1100, 1100, 1100, 1100, 1100, 1100, 1100, 1100, 1…
## $ seqno <chr> "2023000001", "2023000002", "2023000003", "2023000004", "2023…
## $ psu <dbl> 2.023e+09, 2.023e+09, 2.023e+09, 2.023e+09, 2.023e+09, 2.023e…
## $ ctelenm1 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ pvtresd1 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ colghous <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ statere1 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ celphon1 <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ ladult1 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ numadult <dbl> 2, 1, 1, 2, 1, 2, 1, 1, 1, 2, 2, 4, 2, 2, 2, 1, 1, 2, 1, 1, 1…
## $ respslc1 <dbl> 1, NA, NA, 1, NA, 1, NA, NA, NA, 2, 1, 1, 1, 1, 1, NA, NA, 1,…
## $ landsex2 <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 1, 2, 1, 2…
## $ lndsxbrt <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ safetime <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ ctelnum1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ cellfon5 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ cadult1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ cellsex2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ celsxbrt <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ pvtresd3 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ cclghous <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ cstate1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ landline <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ hhadult <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ sexvar <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 1, 2, 1, 2…
## $ genhlth <dbl> 2, 2, 4, 2, 4, 3, 4, 4, 3, 3, 3, 2, 2, 3, 3, 4, 5, 3, 3, 3, 3…
## $ physhlth <dbl> 88, 88, 6, 2, 88, 2, 8, 1, 5, 88, 88, 2, 88, 88, 88, 4, 30, 1…
## $ menthlth <dbl> 88, 88, 2, 88, 88, 3, 77, 88, 88, 88, 88, 88, 88, 88, 88, 88,…
## $ poorhlth <dbl> NA, NA, 1, 88, NA, 88, 88, 1, 88, NA, NA, 88, NA, NA, NA, 1, …
## $ primins1 <dbl> 3, 3, 3, 3, 3, 7, 3, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2…
## $ persdoc3 <dbl> 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2…
## $ medcost1 <dbl> 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ checkup1 <dbl> 2, 2, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1…
## $ exerany2 <dbl> 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 1, 2, 2, 1, 7, 2, 2, 1, 1…
## $ exract12 <dbl> NA, 1, 1, 1, 1, 1, NA, NA, NA, 1, 1, NA, 1, NA, NA, 1, NA, NA…
## $ exeroft1 <dbl> NA, 106, 205, 103, 102, 212, NA, NA, NA, 230, 105, NA, 105, N…
## $ exerhmm1 <dbl> NA, 30, 15, 30, 45, 45, NA, NA, NA, 30, 30, NA, 45, NA, NA, 1…
## $ exract22 <dbl> NA, 88, 88, 88, 8, 88, NA, NA, NA, 88, 77, NA, 3, NA, NA, 88,…
## $ exeroft2 <dbl> NA, NA, NA, NA, 107, NA, NA, NA, NA, NA, NA, NA, 105, NA, NA,…
## $ exerhmm2 <dbl> NA, NA, NA, NA, 100, NA, NA, NA, NA, NA, NA, NA, 45, NA, NA, …
## $ strength <dbl> 888, 888, 205, 888, 888, 203, 777, 888, 888, 888, 105, 888, 8…
## $ bphigh6 <dbl> 1, 1, 1, 3, 1, 1, 3, 1, 1, 1, 1, 1, 3, 3, 1, 1, 1, 1, 1, 1, 1…
## $ bpmeds1 <dbl> 1, 2, 1, NA, 1, 1, NA, 1, 1, 1, 1, 1, NA, NA, 1, 1, 1, 1, 1, …
## $ cholchk3 <dbl> 3, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2…
## $ toldhi3 <dbl> 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 1, 2…
## $ cholmed3 <dbl> 2, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 2, 1, 2…
## $ cvdinfr4 <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ cvdcrhd4 <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ cvdstrk3 <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2…
## $ asthma3 <dbl> 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2…
## $ asthnow <dbl> NA, NA, 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ chcscnc1 <dbl> 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ chcocnc1 <dbl> 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 1, 1…
## $ chccopd3 <dbl> 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2…
## $ addepev3 <dbl> 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ chckdny2 <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 7, 1, 2, 2, 2…
## $ havarth4 <dbl> 2, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1…
## $ diabete4 <dbl> 1, 3, 3, 3, 1, 3, 3, 1, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1…
## $ diabage4 <dbl> 57, NA, NA, NA, 68, NA, NA, 98, NA, 60, NA, NA, NA, NA, NA, N…
## $ marital <dbl> 1, 2, 3, 1, 3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 3, 3, 1, 3, 3, 2…
## $ educa <dbl> 5, 5, 4, 5, 5, 5, 4, 5, 5, 4, 5, 4, 6, 5, 5, 4, 6, 6, 5, 4, 4…
## $ renthom1 <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ numhhol4 <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2…
## $ numphon4 <dbl> NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ cpdemo1c <dbl> 1, 1, 8, 1, 3, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ veteran3 <dbl> 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 1, 2, 2, 2…
## $ employ1 <dbl> 7, 7, 7, 7, 8, 7, 8, 7, 7, 7, 4, 1, 7, 7, 7, 5, 7, 7, 7, 7, 7…
## $ children <dbl> 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 8…
## $ income3 <dbl> 99, 99, 2, 99, 7, 7, 6, 99, 6, 7, 7, 99, 9, 9, 9, 3, 77, 99, …
## $ pregnant <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ weight2 <dbl> 172, 132, 130, 170, 170, 165, 180, 135, 195, 160, 300, 145, 1…
## $ height3 <dbl> 503, 409, 504, 506, 508, 502, 600, 411, 504, 510, 511, 505, 5…
## $ deaf <dbl> 2, 1, 7, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2…
## $ blind <dbl> 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ decide <dbl> 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 7, 2…
## $ diffwalk <dbl> 1, 2, 1, 1, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1, 1, 1, 2, 1, 1…
## $ diffdres <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 7, 2, 2, 2, 2…
## $ diffalon <dbl> 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1…
## $ fall12mn <dbl> 88, 88, 88, 88, 3, 88, 8, 88, 88, 88, 88, 88, 88, 1, 88, 1, 3…
## $ fallinj5 <dbl> NA, NA, NA, NA, 88, NA, 88, NA, NA, NA, NA, NA, NA, 1, NA, 88…
## $ smoke100 <dbl> 2, 2, 1, 2, 2, 2, 1, 2, 2, 1, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1…
## $ smokday2 <dbl> NA, NA, 3, NA, NA, NA, 3, NA, NA, 3, NA, NA, 3, 3, 3, NA, NA,…
## $ usenow3 <dbl> 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
## $ ecignow2 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 1, 4, 1, 1, 1, 1, 4, 1, 4…
## $ alcday4 <dbl> 888, 888, 888, 888, 202, 205, 888, 888, 888, 888, 888, 888, 1…
## $ avedrnk3 <dbl> NA, NA, NA, NA, 1, 2, NA, NA, NA, NA, NA, NA, 1, 2, NA, NA, N…
## $ drnk3ge5 <dbl> NA, NA, NA, NA, 88, 88, NA, NA, NA, NA, NA, NA, 88, 88, NA, N…
## $ maxdrnks <dbl> NA, NA, NA, NA, 1, 2, NA, NA, NA, NA, NA, NA, 1, 3, NA, NA, N…
## $ flushot7 <dbl> 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1…
## $ flshtmy3 <dbl> NA, 92023, 112022, 102022, NA, NA, 32023, 102023, 777777, 112…
## $ pneuvac4 <dbl> 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 2, 1, 1, 1…
## $ shingle2 <dbl> 2, 2, 1, 2, 2, 2, 2, 1, 1, 2, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2, 1…
## $ hivtst7 <dbl> 2, 2, 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 7, 2, 2, 2, 1…
## $ hivtstd3 <dbl> NA, NA, NA, 777777, NA, 777777, 92022, NA, NA, NA, NA, NA, NA…
## $ seatbelt <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ drnkdri2 <dbl> NA, NA, NA, NA, 88, 88, NA, NA, NA, NA, NA, NA, 88, 88, NA, N…
## $ covidpo1 <dbl> 2, 2, 2, 2, 2, 1, 2, 1, 1, 2, 1, 1, 2, 1, 1, 2, 2, 2, 2, 1, 2…
## $ covidsm1 <dbl> NA, NA, NA, NA, NA, 2, NA, 2, 2, NA, 1, 2, NA, 2, 2, NA, NA, …
## $ covidact <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 3, NA, NA, NA, NA, NA…
## $ pdiabts1 <dbl> NA, 2, 1, 1, NA, 1, 1, NA, 1, NA, 1, 1, 1, 1, 1, 1, 7, 1, 8, …
## $ prediab2 <dbl> NA, 3, 3, 3, NA, 3, 3, NA, 1, NA, 3, 3, 1, 3, 3, 3, 3, 3, 3, …
## $ diabtype <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ insulin1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ chkhemo3 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ eyeexam1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ diabeye1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ diabedu1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ feetsore <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ arthexer <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ arthedu <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ lmtjoin3 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ arthdis2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ joinpai2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ lcsfirst <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ lcslast <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ lcsnumcg <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ lcsctsc1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ lcsscncr <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ lcsctwhn <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ hadmam <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ howlong <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ cervscrn <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ crvclcnc <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ crvclpap <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ crvclhpv <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ hadhyst2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ psatest1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ psatime1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ pcpsars2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ psasugs1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ pcstalk2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ hadsigm4 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ colnsigm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ colntes1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ sigmtes1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ lastsig4 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ colncncr <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ vircolo1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ vclntes2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ smalstol <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ stoltest <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ stooldn2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ bldstfit <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ sdnates1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ cncrdiff <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ cncrage <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ cncrtyp2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ csrvtrt3 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ csrvdoc1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ csrvsum <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ csrvrtrn <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ csrvinst <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ csrvinsr <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ csrvdein <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ csrvclin <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ csrvpain <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ csrvctl2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ indortan <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ numburn3 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ sunprtct <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ wkdayout <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ wkendout <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ cimemlo1 <dbl> 2, 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, NA, 2, 2, 2, 1, 2, 2, …
## $ cdworry <dbl> NA, NA, 2, NA, 1, 1, NA, NA, NA, NA, NA, NA, 7, NA, NA, NA, N…
## $ cddiscu1 <dbl> NA, NA, 1, NA, 2, 2, NA, NA, NA, NA, NA, NA, 2, NA, NA, NA, N…
## $ cdhous1 <dbl> NA, NA, 2, NA, 2, 1, NA, NA, NA, NA, NA, NA, 2, NA, NA, NA, N…
## $ cdsocia1 <dbl> NA, NA, 2, NA, 2, 2, NA, NA, NA, NA, NA, NA, 2, NA, NA, NA, N…
## $ caregiv1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ crgvrel4 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ crgvlng1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ crgvhrs1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ crgvprb3 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ crgvalzd <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ crgvper1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ crgvhou1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ crgvexpt <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ lastsmk2 <dbl> NA, NA, 7, NA, NA, NA, 7, NA, NA, 7, NA, NA, 7, NA, 7, NA, NA…
## $ stopsmk2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ mentcigs <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ mentecig <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ heattbco <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ firearm5 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ gunload <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ loadulk2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ hasymp1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ hasymp2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ hasymp3 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ hasymp4 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ hasymp5 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ hasymp6 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ strsymp1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ strsymp2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ strsymp3 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ strsymp4 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ strsymp5 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ strsymp6 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ firstaid <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ aspirin <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ birthsex <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ somale <dbl> NA, NA, NA, NA, NA, NA, 2, NA, NA, 2, 2, NA, NA, 2, NA, NA, N…
## $ sofemale <dbl> 2, 2, 2, 2, 2, 2, NA, 2, 2, NA, NA, 2, 2, NA, 2, 2, 2, NA, 9,…
## $ trnsgndr <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4…
## $ marijan1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ marjsmok <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ marjeat <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ marjvape <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ marjdab <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ marjothr <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ usemrjn4 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ acedeprs <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ acedrink <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ acedrugs <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ aceprisn <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ acedivrc <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ acepunch <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ acehurt1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ aceswear <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ acetouch <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ acetthem <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ acehvsex <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ aceadsaf <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ aceadned <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ imfvpla4 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ hpvadvc4 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ hpvadsht <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ tetanus1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ covidva1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ covacge1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ covidnu2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ lsatisfy <dbl> 2, 1, 2, 1, 2, 2, 2, 1, 2, 2, 2, 1, 1, NA, 1, 2, 3, 1, 4, 2, …
## $ emtsuprt <dbl> 1, 2, 4, 1, 2, 2, 1, 1, 1, 3, 1, 1, 2, NA, 1, 1, 2, 1, 4, 2, …
## $ sdlonely <dbl> 5, 3, 3, 3, 2, 3, 4, 5, 3, 3, 4, 5, 4, NA, 4, 3, 4, 5, 1, 2, …
## $ sdhemply <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, NA, 2, 2, 2, 2, 2, 2, …
## $ foodstmp <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, NA, 2, 2, 2, 2, 2, 2, …
## $ sdhfood1 <dbl> 5, 5, 5, 5, 4, 5, 5, 5, 5, 5, 4, 5, 5, NA, 5, 5, 5, 5, 5, 5, …
## $ sdhbills <dbl> 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, NA, 2, 2, 2, 2, 2, 2, …
## $ sdhutils <dbl> 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, NA, 2, 2, 2, 2, 2, 2, …
## $ sdhtrnsp <dbl> 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, NA, 2, 2, 2, 2, 2, 2, …
## $ sdhstre1 <dbl> 5, 5, 3, 5, 2, 3, 5, 5, 4, 4, 4, 4, 4, NA, 4, 5, 4, 5, 4, 4, …
## $ rrclass3 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ rrcognt2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ rrtreat <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ rratwrk2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ rrhcare4 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ rrphysm2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ rcsgend1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ rcsxbrth <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ rcsrltn2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ casthdx2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ casthno2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ qstver <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1…
## $ qstlang <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ metstat <dbl> 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 1…
## $ urbstat <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ mscode <dbl> 1, 1, 3, 2, 5, 1, 2, 5, 3, 1, 1, 1, 1, 5, 1, 2, 5, 5, 5, 1, 1…
## $ ststr <dbl> 11011, 11012, 11011, 11011, 11011, 11011, 11011, 11012, 11011…
## $ strwt <dbl> 42.79158, 170.42939, 42.79158, 42.79158, 42.79158, 42.79158, …
## $ rawrake <dbl> 2, 1, 1, 2, 1, 2, 1, 1, 1, 2, 2, 4, 2, 2, 2, 1, 1, 2, 1, 1, 1…
## $ wt2rake <dbl> 85.58316, 170.42939, 42.79158, 85.58316, 42.79158, 85.58316, …
## $ imprace <dbl> 1, 1, 2, 1, 1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2…
## $ chispnc <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9…
## $ crace1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ cageg <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ cllcpwt <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ dualuse <dbl> 1, 1, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dualcor <dbl> 0.5026388, 0.5026388, NA, 0.5026388, 0.5026388, 0.5026388, 0.…
## $ llcpwt2 <dbl> 941.1640, 1874.2239, 1151.6032, 941.1640, 470.5820, 941.1640,…
## $ llcpwt <dbl> 605.4279, 1121.9927, 600.9633, 605.4279, 281.7110, 1060.3352,…
## $ rfhlth <dbl> 1, 1, 2, 1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1…
## $ phys14d <dbl> 1, 1, 2, 2, 1, 2, 2, 2, 2, 1, 1, 2, 1, 1, 1, 2, 3, 3, 1, 1, 2…
## $ ment14d <dbl> 1, 1, 2, 1, 1, 2, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 3, 1, 1…
## $ hlthpl1 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ hcvu653 <dbl> 9, 9, 9, 9, 9, 1, 9, 9, 9, 9, 1, 9, 9, 9, 9, 9, 1, 9, 9, 9, 9…
## $ totinda <dbl> 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 1, 2, 2, 1, 9, 2, 2, 1, 1…
## $ metvl12 <dbl> NA, 35, 35, 35, 35, 35, NA, NA, NA, 35, 35, NA, 35, NA, NA, 3…
## $ metvl22 <dbl> NA, NA, NA, NA, 33, NA, NA, NA, NA, NA, NA, NA, 45, NA, NA, N…
## $ maxvo21 <dbl> 1840, 1803, 1322, 1914, 1988, 2506, 1545, 1914, 1655, 1875, 2…
## $ fc601 <dbl> 315, 309, 227, 328, 341, 430, 265, 328, 284, 321, 472, 328, 3…
## $ actin13 <dbl> NA, 2, 2, 2, 2, 1, NA, NA, NA, 2, 1, NA, 1, NA, NA, 2, NA, NA…
## $ actin23 <dbl> NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA, NA, 2, NA, NA, NA,…
## $ padur1 <dbl> NA, 30, 15, 30, 45, 45, NA, NA, NA, 30, 30, NA, 45, NA, NA, 6…
## $ padur2 <dbl> NA, NA, NA, NA, 60, NA, NA, NA, NA, NA, NA, NA, 45, NA, NA, N…
## $ pafreq1 <dbl> NA, 6000, 1167, 3000, 2000, 2800, NA, NA, NA, 7000, 5000, NA,…
## $ pafreq2 <dbl> NA, NA, NA, NA, 7000, NA, NA, NA, NA, NA, NA, NA, 5000, NA, N…
## $ minac12 <dbl> NA, 180, 18, 90, 90, 126, NA, NA, NA, 210, 150, NA, 225, NA, …
## $ minac22 <dbl> NA, 0, 0, 0, 420, 0, NA, NA, NA, 0, NA, NA, 225, NA, NA, 0, N…
## $ strfreq <dbl> 0, 0, 1167, 0, 0, 700, NA, 0, 0, 0, 5000, 0, 0, 2000, 2000, 0…
## $ pamiss3 <dbl> 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 9, 0, 0, 1, 1…
## $ pamin13 <dbl> NA, 360, 36, 180, 180, 126, NA, NA, NA, 420, 150, NA, 225, NA…
## $ pamin23 <dbl> NA, NA, NA, NA, 420, NA, NA, NA, NA, NA, NA, NA, 450, NA, NA,…
## $ pa3min <dbl> NA, 360, 36, 180, 600, 126, NA, NA, NA, 420, 150, NA, 675, NA…
## $ pavig13 <dbl> NA, 180, 18, 90, 90, 0, NA, NA, NA, 210, 0, NA, 0, NA, NA, 12…
## $ pavig23 <dbl> NA, NA, NA, NA, 0, NA, NA, NA, NA, NA, NA, NA, 225, NA, NA, N…
## $ pa3vigm <dbl> NA, 180, 18, 90, 90, 0, NA, NA, NA, 210, 0, NA, 225, NA, NA, …
## $ pacat3 <dbl> 4, 1, 9, 9, 1, 9, 4, 4, 4, 1, 9, 4, 1, 4, 4, 9, 9, 4, 4, 9, 9…
## $ paindx3 <dbl> 2, 1, 9, 1, 1, 9, 2, 2, 2, 1, 1, 2, 1, 2, 2, 1, 9, 2, 2, 1, 1…
## $ pa150r4 <dbl> 3, 1, 9, 1, 1, 9, 3, 3, 3, 1, 1, 3, 1, 3, 3, 1, 9, 3, 3, 1, 1…
## $ pa300r4 <dbl> 3, 1, 9, 9, 1, 9, 3, 3, 3, 1, 9, 3, 1, 3, 3, 9, 9, 3, 3, 9, 9…
## $ pa30023 <dbl> 2, 1, 9, 9, 1, 9, 2, 2, 2, 1, 9, 2, 1, 2, 2, 9, 9, 2, 2, 9, 9…
## $ pastrng <dbl> 2, 2, 2, 2, 2, 2, 9, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 1…
## $ parec3 <dbl> 4, 2, 9, 2, 2, 9, 9, 4, 4, 2, 1, 4, 2, 3, 3, 2, 9, 4, 4, 2, 1…
## $ pastae3 <dbl> 2, 2, 9, 2, 2, 9, 9, 2, 2, 2, 1, 2, 2, 2, 2, 2, 9, 2, 2, 2, 1…
## $ rfhype6 <dbl> 2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2…
## $ cholch3 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ rfchol3 <dbl> 1, 2, 2, 1, 1, 2, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1…
## $ michd <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ ltasth1 <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1…
## $ casthm1 <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 9, 1, 1, 1, 1…
## $ asthms1 <dbl> 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 9, 3, 3, 3, 3…
## $ drdxar2 <dbl> 2, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1…
## $ mrace1 <dbl> 1, 1, 2, 1, 1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2…
## $ hispanc <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ race <dbl> 1, 1, 2, 1, 1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2…
## $ raceg21 <dbl> 1, 1, 2, 1, 1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2…
## $ racegr3 <dbl> 1, 1, 2, 1, 1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2…
## $ raceprv <dbl> 1, 1, 2, 1, 1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2…
## $ sex <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 1, 2, 1, 2…
## $ ageg5yr <dbl> 13, 13, 13, 12, 12, 9, 13, 12, 13, 12, 8, 12, 10, 12, 12, 13,…
## $ age65yr <dbl> 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2…
## $ age80 <dbl> 80, 80, 80, 78, 76, 62, 80, 78, 80, 75, 59, 78, 68, 79, 79, 8…
## $ age_g <dbl> 6, 6, 6, 6, 6, 5, 6, 6, 6, 6, 5, 6, 6, 6, 6, 6, 5, 6, 6, 6, 6…
## $ htin4 <dbl> 63, 57, 64, 66, 68, 62, 72, 59, 64, 70, 71, 65, 68, 73, 62, 6…
## $ htm4 <dbl> 160, 145, 163, 168, 173, 157, 183, 150, 163, 178, 180, 165, 1…
## $ wtkg3 <dbl> 7802, 5987, 5897, 7711, 7711, 7484, 8165, 6123, 8845, 7257, 1…
## $ bmi5 <dbl> 3047, 2856, 2231, 2744, 2585, 3018, 2441, 2727, 3347, 2296, 4…
## $ bmi5cat <dbl> 4, 3, 2, 3, 3, 4, 2, 3, 4, 2, 4, 2, 3, 2, 3, 3, 4, 3, 3, 3, 4…
## $ rfbmi5 <dbl> 2, 2, 1, 2, 2, 2, 1, 2, 2, 1, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2…
## $ chldcnt <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ educag <dbl> 3, 3, 2, 3, 3, 3, 2, 3, 3, 2, 3, 2, 4, 3, 3, 2, 4, 4, 3, 2, 2…
## $ incomg1 <dbl> 9, 9, 1, 9, 5, 5, 4, 9, 4, 5, 5, 9, 6, 6, 6, 2, 9, 9, 3, 4, 3…
## $ smoker3 <dbl> 4, 4, 3, 4, 4, 4, 3, 4, 4, 3, 4, 4, 3, 3, 3, 4, 4, 4, 1, 3, 3…
## $ rfsmok3 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1…
## $ cureci2 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ drnkany6 <dbl> 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 1, 2…
## $ drocdy4 <dbl> 0, 0, 0, 0, 7, 17, 0, 0, 0, 0, 0, 0, 29, 100, 0, 0, 0, 0, 33,…
## $ rfbing6 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ drnkwk2 <dbl> 0, 0, 0, 0, 47, 233, 0, 0, 0, 0, 0, 0, 200, 1400, 0, 0, 0, 0,…
## $ rfdrhv8 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ flshot7 <dbl> 2, 1, 1, 1, 2, NA, 1, 1, 1, 1, NA, 1, 1, 1, 1, 2, NA, 1, 2, 1…
## $ pneumo3 <dbl> 2, 1, 1, 1, 1, NA, 1, 1, 1, 1, NA, 2, 1, 1, 2, 1, NA, 2, 1, 1…
## $ aidtst4 <dbl> 2, 2, 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 9, 2, 2, 2, 1…
## $ rfseat2 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ rfseat3 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ drnkdrv <dbl> 9, 9, 9, 9, 2, 2, 9, 9, 9, 9, 9, 9, 2, 2, 9, 9, 9, 9, 2, 2, 9…
## [1] "state" "fmonth" "idate" "imonth" "iday" "iyear"
## [7] "dispcode" "seqno" "psu" "ctelenm1" "pvtresd1" "colghous"
## [13] "statere1" "celphon1" "ladult1" "numadult" "respslc1" "landsex2"
## [19] "lndsxbrt" "safetime" "ctelnum1" "cellfon5" "cadult1" "cellsex2"
## [25] "celsxbrt" "pvtresd3" "cclghous" "cstate1" "landline" "hhadult"
## [31] "sexvar" "genhlth" "physhlth" "menthlth" "poorhlth" "primins1"
## [37] "persdoc3" "medcost1" "checkup1" "exerany2" "exract12" "exeroft1"
## [43] "exerhmm1" "exract22" "exeroft2" "exerhmm2" "strength" "bphigh6"
## [49] "bpmeds1" "cholchk3" "toldhi3" "cholmed3" "cvdinfr4" "cvdcrhd4"
## [55] "cvdstrk3" "asthma3" "asthnow" "chcscnc1" "chcocnc1" "chccopd3"
## [61] "addepev3" "chckdny2" "havarth4" "diabete4" "diabage4" "marital"
## [67] "educa" "renthom1" "numhhol4" "numphon4" "cpdemo1c" "veteran3"
## [73] "employ1" "children" "income3" "pregnant" "weight2" "height3"
## [79] "deaf" "blind" "decide" "diffwalk" "diffdres" "diffalon"
## [85] "fall12mn" "fallinj5" "smoke100" "smokday2" "usenow3" "ecignow2"
## [91] "alcday4" "avedrnk3" "drnk3ge5" "maxdrnks" "flushot7" "flshtmy3"
## [97] "pneuvac4" "shingle2" "hivtst7" "hivtstd3" "seatbelt" "drnkdri2"
## [103] "covidpo1" "covidsm1" "covidact" "pdiabts1" "prediab2" "diabtype"
## [109] "insulin1" "chkhemo3" "eyeexam1" "diabeye1" "diabedu1" "feetsore"
## [115] "arthexer" "arthedu" "lmtjoin3" "arthdis2" "joinpai2" "lcsfirst"
## [121] "lcslast" "lcsnumcg" "lcsctsc1" "lcsscncr" "lcsctwhn" "hadmam"
## [127] "howlong" "cervscrn" "crvclcnc" "crvclpap" "crvclhpv" "hadhyst2"
## [133] "psatest1" "psatime1" "pcpsars2" "psasugs1" "pcstalk2" "hadsigm4"
## [139] "colnsigm" "colntes1" "sigmtes1" "lastsig4" "colncncr" "vircolo1"
## [145] "vclntes2" "smalstol" "stoltest" "stooldn2" "bldstfit" "sdnates1"
## [151] "cncrdiff" "cncrage" "cncrtyp2" "csrvtrt3" "csrvdoc1" "csrvsum"
## [157] "csrvrtrn" "csrvinst" "csrvinsr" "csrvdein" "csrvclin" "csrvpain"
## [163] "csrvctl2" "indortan" "numburn3" "sunprtct" "wkdayout" "wkendout"
## [169] "cimemlo1" "cdworry" "cddiscu1" "cdhous1" "cdsocia1" "caregiv1"
## [175] "crgvrel4" "crgvlng1" "crgvhrs1" "crgvprb3" "crgvalzd" "crgvper1"
## [181] "crgvhou1" "crgvexpt" "lastsmk2" "stopsmk2" "mentcigs" "mentecig"
## [187] "heattbco" "firearm5" "gunload" "loadulk2" "hasymp1" "hasymp2"
## [193] "hasymp3" "hasymp4" "hasymp5" "hasymp6" "strsymp1" "strsymp2"
## [199] "strsymp3" "strsymp4" "strsymp5" "strsymp6" "firstaid" "aspirin"
## [205] "birthsex" "somale" "sofemale" "trnsgndr" "marijan1" "marjsmok"
## [211] "marjeat" "marjvape" "marjdab" "marjothr" "usemrjn4" "acedeprs"
## [217] "acedrink" "acedrugs" "aceprisn" "acedivrc" "acepunch" "acehurt1"
## [223] "aceswear" "acetouch" "acetthem" "acehvsex" "aceadsaf" "aceadned"
## [229] "imfvpla4" "hpvadvc4" "hpvadsht" "tetanus1" "covidva1" "covacge1"
## [235] "covidnu2" "lsatisfy" "emtsuprt" "sdlonely" "sdhemply" "foodstmp"
## [241] "sdhfood1" "sdhbills" "sdhutils" "sdhtrnsp" "sdhstre1" "rrclass3"
## [247] "rrcognt2" "rrtreat" "rratwrk2" "rrhcare4" "rrphysm2" "rcsgend1"
## [253] "rcsxbrth" "rcsrltn2" "casthdx2" "casthno2" "qstver" "qstlang"
## [259] "metstat" "urbstat" "mscode" "ststr" "strwt" "rawrake"
## [265] "wt2rake" "imprace" "chispnc" "crace1" "cageg" "cllcpwt"
## [271] "dualuse" "dualcor" "llcpwt2" "llcpwt" "rfhlth" "phys14d"
## [277] "ment14d" "hlthpl1" "hcvu653" "totinda" "metvl12" "metvl22"
## [283] "maxvo21" "fc601" "actin13" "actin23" "padur1" "padur2"
## [289] "pafreq1" "pafreq2" "minac12" "minac22" "strfreq" "pamiss3"
## [295] "pamin13" "pamin23" "pa3min" "pavig13" "pavig23" "pa3vigm"
## [301] "pacat3" "paindx3" "pa150r4" "pa300r4" "pa30023" "pastrng"
## [307] "parec3" "pastae3" "rfhype6" "cholch3" "rfchol3" "michd"
## [313] "ltasth1" "casthm1" "asthms1" "drdxar2" "mrace1" "hispanc"
## [319] "race" "raceg21" "racegr3" "raceprv" "sex" "ageg5yr"
## [325] "age65yr" "age80" "age_g" "htin4" "htm4" "wtkg3"
## [331] "bmi5" "bmi5cat" "rfbmi5" "chldcnt" "educag" "incomg1"
## [337] "smoker3" "rfsmok3" "cureci2" "drnkany6" "drocdy4" "rfbing6"
## [343] "drnkwk2" "rfdrhv8" "flshot7" "pneumo3" "aidtst4" "rfseat2"
## [349] "rfseat3" "drnkdrv"
tibble(Metric = c("Observations", "Variables"),
Value = c(nrow(brfss_raw), ncol(brfss_raw))) |>
kable(caption = "Full Dataset Dimensions") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Metric | Value |
|---|---|
| Observations | 433323 |
| Variables | 350 |
In the raw dataset, there are 433,323 (observations) rows and 350 (variables) columns.
Step 1.3: Select the variables listed in the table
# Select only the variables we want to work with
brfss_select <- brfss_raw |>
select(menthlth, physhlth, bmi5, sexvar, exerany2, ageg5yr, educa, genhlth, marital)
tibble(Metric = c("Observations", "Variables"),
Value = c(nrow(brfss_select), ncol(brfss_select))) |>
kable(caption = "Only Necessary Dataset Dimensions") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Metric | Value |
|---|---|
| Observations | 433323 |
| Variables | 9 |
What variables did you select and why? After selecting only the variables needed for the homework, there are 433,323 observations and 9 variables (menthlth, psyhlth, bmi5, sexvar, exerany2, ageg5yr, educa, genhlth, and marital). I chose menthlth and bmi5 as my continous variables, sexvar and exerany2 as my binary, and ageg5yr, educa, genhlth, and marital as my categorical predictors. Mental health, exercise, education, and general health have consistently been demonstrated to be predictors in all of the other BRFSS labs that we have completed. I am always interested in how women and men differ hence picking sex. BMI always intrigues me because I do not believe it is the best scale to measure an individual’s health but I want to understand more how it can influence physical distress especially in someone who may be considered overweight. As we get older, overall health appears to decline suggesting it could be a good predictor to analyze. Also we have discussed constantly that married individuals are the healthiest, so I was curious to see how that would impact this overall relationship.
Step 2: Recode all variables as described. Store all categorical variables as labeled factors and set appropriate reference levels
# Recode all the variables selected
brfss_recode <- brfss_select |>
mutate(
# Outcome: Mentally unhealthy days in past 30
menthlth_days = case_when(
menthlth == 88 ~ 0,
menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
menthlth %in% c(77, 99) ~ NA_real_,
TRUE ~ NA_real_
),
# Physical health days
physhlth_days = case_when(
physhlth == 88 ~ 0,
physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
physhlth %in% c(77, 99) ~ NA_real_,
TRUE ~ NA_real_
),
# BMI
bmi = case_when(
bmi5 == 9999 ~ NA_real_,
TRUE ~ bmi5 / 100
),
# Sex
sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
# Exercise in past 30 days
exercise = factor(
case_when(
exerany2 == 2 ~ "No",
exerany2 == 1 ~ "Yes",
exerany2 %in% c(7, 9) ~ NA_character_,
TRUE ~ NA_character_
),
levels = c("No", "Yes")
),
# Age groups
age_group = factor(
case_when(
ageg5yr %in% c(1, 2, 3) ~ "18-34",
ageg5yr %in% c(4, 5, 6) ~ "35-49",
ageg5yr %in% c(7, 8, 9) ~ "50-64",
ageg5yr %in% c(10, 11, 12, 13) ~ "65+",
ageg5yr == 14 ~ NA_character_,
TRUE ~ NA_character_
),
levels = c("18-34", "35-49", "50-64", "65+")
),
# Education
education = factor(
case_when(
educa %in% c(1, 2, 3) ~ "Less than HS",
educa == 4 ~ "HS/GED",
educa == 5 ~ "Some College",
educa == 6 ~ "College Graduate",
educa == 9 ~ NA_character_,
TRUE ~ NA_character_
),
levels = c("Less than HS", "HS/GED", "Some College", "College Graduate")
),
# General Health Status
gen_health = factor(
case_when(
genhlth %in% c(1, 2) ~ "Excellent/Very Good",
genhlth == 3 ~ "Good",
genhlth %in% c(4, 5) ~ "Fair/Poor",
genhlth %in% c(7, 9) ~ NA_character_,
TRUE ~ NA_character_
),
levels = c("Excellent/Very Good", "Good", "Fair/Poor")
),
# Marital Status
marital = factor(
case_when(
marital %in% c(1, 6) ~ "Married/Partnered",
marital %in% c(2, 3, 4) ~ "Previously married",
marital == 5 ~ "Never married",
marital == 9 ~ NA_character_,
TRUE ~ NA_character_
),
levels = c("Married/Partnered", "Previously married", "Never married")
)
)Step 3.1: Create the analytic dataset and report a missing descriptive summary
brfss_analytic <- brfss_recode |>
select (menthlth_days, physhlth_days, bmi, sex, exercise, age_group, education, gen_health, marital)
# Report missing values for the key variables and print out total sample size
missing_summary <- data.frame(
Variable = names(brfss_analytic),
Missing_Count = colSums(is.na(brfss_analytic)),
Missing_Percent = round(colSums(is.na(brfss_analytic)) / nrow(brfss_analytic) * 100, 2)
)
# Show variables with the most missing data
print(missing_summary[order(-missing_summary$Missing_Count), ][1:9, ]) |>
kable(
caption = "Variables by Missingness"
) |>
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE,
position = "center"
)## Variable Missing_Count Missing_Percent
## bmi bmi 40535 9.35
## physhlth_days physhlth_days 10785 2.49
## menthlth_days menthlth_days 8108 1.87
## age_group age_group 7779 1.80
## marital marital 4289 0.99
## education education 2325 0.54
## gen_health gen_health 1262 0.29
## exercise exercise 1251 0.29
## sex sex 0 0.00
| Variable | Missing_Count | Missing_Percent | |
|---|---|---|---|
| bmi | bmi | 40535 | 9.35 |
| physhlth_days | physhlth_days | 10785 | 2.49 |
| menthlth_days | menthlth_days | 8108 | 1.87 |
| age_group | age_group | 7779 | 1.80 |
| marital | marital | 4289 | 0.99 |
| education | education | 2325 | 0.54 |
| gen_health | gen_health | 1262 | 0.29 |
| exercise | exercise | 1251 | 0.29 |
| sex | sex | 0 | 0.00 |
Physhlth_days had 10,785 observations missing resulting in 2.49% of missingness. BMI had 40,535 observations missing resulting in 9.35% of missingness. Menthlth_days had 8,108 observations missing resulting in 1.87% missingness.
Step 3.2: Create the analytic dataset using drop_na() and slice_sample() with set.seed(1220). Report the final analytic n.
# Create analytic dataset
set.seed(1220)
brfss_analytic <- brfss_analytic |>
drop_na() |>
slice_sample(n = 8000)
# Display subset dimensions
cat("Working subset dimensions:",
nrow(brfss_analytic), "observations,",
ncol(brfss_analytic), "variables\n")## Working subset dimensions: 8000 observations, 9 variables
# Save for later use
saveRDS(brfss_analytic,
"brfss_2023_analytic.rds")
# Print out total sample size
cat("Final Analytic N:", nrow(brfss_analytic), "\n")## Final Analytic N: 8000
tibble(Metric = c("Observations", "Variables"),
Value = c(nrow(brfss_analytic), ncol(brfss_analytic))) |>
kable(caption = "Analytic Dataset Dimensions") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Metric | Value |
|---|---|
| Observations | 8000 |
| Variables | 9 |
The final analytic n is 8,000 with 9 recoded variables (menthlth_days, physhlth_days, bmi, sex, exercise, age_group, education, gen_health, marital).
Section 3.3: Produce a descriptive statistics table using gtsummary::tbl_summary()
# Create Table 1 Descriptive Statistics
brfss_analytic |>
select(menthlth_days, physhlth_days, bmi, sex, exercise, age_group, education, gen_health, marital) |>
tbl_summary(
type = list(exercise ~ "categorical",
sex ~ "categorical"),
label = list(
menthlth_days ~ "Mentally Unhealthy Days (Last 30 Days)",
physhlth_days ~ "Physically Unhealthy Days (Last 30 Days)",
bmi ~ "Body Mass Index (kg/m2)",
sex ~ "Sex",
exercise ~ "Exercise Status",
age_group ~ "Age Category",
education ~ "Education Category",
gen_health ~ "General Health Status",
marital ~ "Marital Status"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1,
missing = "no"
) |>
add_n() |>
bold_labels() |>
italicize_levels() |>
modify_caption("**Table 1. Descriptive Statistics — BRFSS 2023 Analytic Sample (n = 8,000)**") |>
modify_footnote(
all_stat_cols() ~
"Mean (SD); n(%); Analytic sample includes 8,000 complete-case observations. Missing values were excluded listwise for all variables included in this table."
) |>
as_flex_table()Characteristic | N | N = 8,0001 |
|---|---|---|
Mentally Unhealthy Days (Last 30 Days) | 8,000 | 4.3 (8.2) |
Physically Unhealthy Days (Last 30 Days) | 8,000 | 4.5 (8.8) |
Body Mass Index (kg/m2) | 8,000 | 28.5 (6.7) |
Sex | 8,000 | |
Male | 3,899 (49%) | |
Female | 4,101 (51%) | |
Exercise Status | 8,000 | |
No | 1,880 (24%) | |
Yes | 6,120 (77%) | |
Age Category | 8,000 | |
18-34 | 1,355 (17%) | |
35-49 | 1,522 (19%) | |
50-64 | 2,108 (26%) | |
65+ | 3,015 (38%) | |
Education Category | 8,000 | |
Less than HS | 435 (5.4%) | |
HS/GED | 1,988 (25%) | |
Some College | 2,097 (26%) | |
College Graduate | 3,480 (44%) | |
General Health Status | 8,000 | |
Excellent/Very Good | 3,954 (49%) | |
Good | 2,576 (32%) | |
Fair/Poor | 1,470 (18%) | |
Marital Status | 8,000 | |
Married/Partnered | 4,533 (57%) | |
Previously married | 1,995 (25%) | |
Never married | 1,472 (18%) | |
1Mean (SD); n(%); Analytic sample includes 8,000 complete-case observations. Missing values were excluded listwise for all variables included in this table. | ||
Research Question: What individual, behavioral, and health factors are associated with physical health burden among U.S. adults? How do findings compare when the outcome is modeled as the number of physically unhealthy days (continuous) versus as an indicator of frequent physical distress (binary: 14 or more days in the past 30)?
# Fitting maximum linear regression
max_model <- lm(physhlth_days ~ menthlth_days + bmi + sex + exercise + age_group + education + gen_health + marital, data = brfss_analytic)
tidy(max_model, conf.int = TRUE) |>
mutate(
across(where(is.numeric), ~ round(., 3))
) |>
kable(
caption = "Table 3. Maximum Model Linear Regression: Physically Unhealthy Days ~ All Predictors (BRFSS 2023, n = 8,000)",
col.names = c("Term", "Estimate(β)", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate(β) | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 1.335 | 0.589 | 2.268 | 0.023 | 0.181 | 2.489 |
| menthlth_days | 0.205 | 0.011 | 19.274 | 0.000 | 0.184 | 0.226 |
| bmi | 0.022 | 0.013 | 1.712 | 0.087 | -0.003 | 0.046 |
| sexFemale | 0.070 | 0.165 | 0.424 | 0.672 | -0.253 | 0.393 |
| exerciseYes | -1.759 | 0.205 | -8.561 | 0.000 | -2.162 | -1.356 |
| age_group35-49 | 0.178 | 0.288 | 0.618 | 0.537 | -0.387 | 0.743 |
| age_group50-64 | 0.919 | 0.281 | 3.274 | 0.001 | 0.369 | 1.470 |
| age_group65+ | 1.505 | 0.278 | 5.419 | 0.000 | 0.960 | 2.049 |
| educationHS/GED | -0.317 | 0.386 | -0.822 | 0.411 | -1.073 | 0.439 |
| educationSome College | -0.189 | 0.386 | -0.488 | 0.626 | -0.946 | 0.569 |
| educationCollege Graduate | -0.190 | 0.379 | -0.501 | 0.616 | -0.934 | 0.553 |
| gen_healthGood | 1.285 | 0.190 | 6.758 | 0.000 | 0.912 | 1.658 |
| gen_healthFair/Poor | 10.349 | 0.251 | 41.219 | 0.000 | 9.857 | 10.842 |
| maritalPreviously married | 0.108 | 0.204 | 0.531 | 0.595 | -0.292 | 0.508 |
| maritalNever married | -0.244 | 0.242 | -1.006 | 0.314 | -0.719 | 0.231 |
## [1] 54378.57
## [1] 54490.37
# Report R-squared, adjusted r-squared, AIC, and BIC in one table
glance(max_model) |>
dplyr::select(r.squared, adj.r.squared, sigma, AIC, BIC, df.residual) |>
mutate(across(everything(), \(x) round(x, 3))) |>
kable(caption = "Maximum Model: Fit Statistics") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| r.squared | adj.r.squared | sigma | AIC | BIC | df.residual |
|---|---|---|---|---|---|
| 0.323 | 0.322 | 7.233 | 54378.57 | 54490.36 | 7985 |
R-squared = 0.323 It appears that the full model with all predictors explains 32.3% of the variance in physically unhealthy days. R-squared is not the best to use because it always increases when we add a predictor to the mode, even if it random noise.
Adjusted R-squared = 0.322 It appears that the full model with all predictors explains 32.2% of the variance in physically unhealthy days when using adjusted r-squared which can penalize for the number of predictors and will only increase when a new predictor improves the model by more than chance alone.
AIC = According to AIC, the model produces a 54,378.57. The lower the AIC, the better for model fit.
BIC = According to BIC, the model produces a 54,490.36. Similar to AIC, the lower the BIC, the better for model fit. —
# Prepare a model matrix (need numeric predictors for leaps)
# Use the formula interface approach
best_subsets <- regsubsets(
physhlth_days ~ menthlth_days + bmi + sex + exercise + age_group + education + gen_health + marital, data = brfss_analytic,
nvmax = 15, # maximum number of variables to consider
method = "exhaustive"
)
best_summary <- summary(best_subsets)
subset_metrics <- tibble(
p = 1:length(best_summary$adjr2),
`Adj. R²` = best_summary$adjr2,
BIC = best_summary$bic,
Cp = best_summary$cp
)
# Create a summary table showing Adjusted R-squared and BIC across model sizes (number of predictors)
models <- list(
"Mental health" = lm(physhlth_days ~ menthlth_days, data = brfss_analytic),
"+ bmi" = lm(physhlth_days ~ menthlth_days + bmi, data = brfss_analytic),
"+ sex" = lm(physhlth_days ~ menthlth_days + bmi + sex, data = brfss_analytic),
"+ exercise" = lm(physhlth_days ~ menthlth_days + bmi + sex + exercise, data = brfss_analytic),
"+ age_group" = lm(physhlth_days ~ menthlth_days + bmi + sex + exercise + age_group, data = brfss_analytic),
"+ education" = lm(physhlth_days ~ menthlth_days + bmi + sex + exercise + age_group + education, data = brfss_analytic),
"+ general health" = lm(physhlth_days ~ menthlth_days + bmi + sex + exercise + age_group + education + gen_health, data = brfss_analytic),
"+ marital (full)" = lm(physhlth_days ~ menthlth_days + bmi + sex + exercise + age_group + education + gen_health + marital, data = brfss_analytic)
)
comp_table <- map_dfr(names(models), \(name) {
g <- glance(models[[name]])
tibble(
Model = name,
p = length(coef(models[[name]])) - 1,
`Adj. R²` = round(g$adj.r.squared, 4),
BIC = round(g$BIC, 1)
)
})
comp_table |>
kable(caption = "Model Comparison: Which Variables Add") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Model | p | Adj. R² | BIC |
|---|---|---|---|
| Mental health | 1 | 0.0982 | 56667.7 |
|
2 | 0.1084 | 56585.0 |
|
3 | 0.1085 | 56591.8 |
|
4 | 0.1455 | 56260.7 |
|
7 | 0.1637 | 56112.6 |
|
10 | 0.1674 | 56101.1 |
|
12 | 0.3220 | 54474.0 |
|
14 | 0.3220 | 54490.4 |
Identify: The model size that maximizes Adjusted R-squared
If we look at the model comparison 12 parameters is the model size that appears to maximize adjusted R-squared. We see that when adding general health (p = 12), the adjusted R-Squared is 0.3220, however, when we add marital (p = 14), the adjusted R-squared is also 0.3220, suggesting that both of these could be considered a model size that maximizes adjusted R-squared.
Identify: The model size that minimizes BIC
If we again look at the model comparisons, 12 parameters is the model size that appears to maximize BIC. This matches the first adjusted R-squared with 12 parameters and demonstrates the lowest BIC of 54,474.0. Adding marital status (p = 14) makes the BIC go up to 54,490.4.
In 2-3 sentences: Describe what you observe: Do the two criteria agree on the best model size? If not, which criterion favors a simpler model
The two criteria do agree to a point on the best model size. If we go by the first set of parameters that give us the lowest adjusted R-squared and lowest BIC, then the two criteria agree on the best model size. They both suggest that 12 parameters is the best model size to produce an adjusted R-squared of 0.3220 and a BIC of 54,474.0 which is the lowest in the model comparison. If we include 14 parameters, the adjsuted R-squared remains the same however our BIC increases to 54,490.4.
## Start: AIC=31673.55
## physhlth_days ~ menthlth_days + bmi + sex + exercise + age_group +
## education + gen_health + marital
##
## Df Sum of Sq RSS AIC
## - education 3 44 417795 31668
## - marital 2 82 417833 31671
## - sex 1 9 417760 31672
## <none> 417751 31674
## - bmi 1 153 417905 31674
## - age_group 3 2301 420052 31711
## - exercise 1 3835 421586 31745
## - menthlth_days 1 19435 437186 32035
## - gen_health 2 94594 512345 33302
##
## Step: AIC=31668.4
## physhlth_days ~ menthlth_days + bmi + sex + exercise + age_group +
## gen_health + marital
##
## Df Sum of Sq RSS AIC
## - marital 2 84 417879 31666
## - sex 1 10 417805 31667
## <none> 417795 31668
## - bmi 1 153 417948 31669
## - age_group 3 2301 420096 31706
## - exercise 1 3918 421713 31741
## - menthlth_days 1 19459 437254 32031
## - gen_health 2 96640 514435 33329
##
## Step: AIC=31666
## physhlth_days ~ menthlth_days + bmi + sex + exercise + age_group +
## gen_health
##
## Df Sum of Sq RSS AIC
## - sex 1 15 417894 31664
## <none> 417879 31666
## - bmi 1 150 418029 31667
## - age_group 3 3198 421077 31721
## - exercise 1 3924 421803 31739
## - menthlth_days 1 19538 437417 32030
## - gen_health 2 97825 515704 33345
##
## Step: AIC=31664.28
## physhlth_days ~ menthlth_days + bmi + exercise + age_group +
## gen_health
##
## Df Sum of Sq RSS AIC
## <none> 417894 31664
## - bmi 1 149 418043 31665
## - age_group 3 3250 421144 31720
## - exercise 1 3971 421864 31738
## - menthlth_days 1 19882 437776 32034
## - gen_health 2 97843 515737 33343
# Create a pretty table
tidy(mod_backward, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Backward Elimination Result (AIC-based)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 1.0211 | 0.4526 | 2.2562 | 0.0241 | 0.1339 | 1.9082 |
| menthlth_days | 0.2057 | 0.0106 | 19.4985 | 0.0000 | 0.1851 | 0.2264 |
| bmi | 0.0212 | 0.0126 | 1.6870 | 0.0916 | -0.0034 | 0.0459 |
| exerciseYes | -1.7623 | 0.2022 | -8.7137 | 0.0000 | -2.1588 | -1.3659 |
| age_group35-49 | 0.3061 | 0.2723 | 1.1241 | 0.2610 | -0.2277 | 0.8398 |
| age_group50-64 | 1.0723 | 0.2565 | 4.1804 | 0.0000 | 0.5695 | 1.5751 |
| age_group65+ | 1.6841 | 0.2447 | 6.8820 | 0.0000 | 1.2044 | 2.1638 |
| gen_healthGood | 1.2797 | 0.1888 | 6.7785 | 0.0000 | 0.9096 | 1.6498 |
| gen_healthFair/Poor | 10.3518 | 0.2463 | 42.0277 | 0.0000 | 9.8690 | 10.8347 |
Backward elimation contains menthlth_days, bmi, exerciseYes, age_group35-49, age_group50-64, age_group65+, gen_healthGood, and gen_healthFair/Poor.
# Perform forward elimination
# Automated forward selection using AIC
mod_null <- lm(physhlth_days ~ 1, data = brfss_analytic)
mod_forward <- step(mod_null,
scope = list(lower = mod_null, upper = max_model),
direction = "forward", trace = 1)## Start: AIC=34767.94
## physhlth_days ~ 1
##
## Df Sum of Sq RSS AIC
## + gen_health 2 172936 444258 32142
## + menthlth_days 1 60700 556494 33942
## + exercise 1 36694 580500 34280
## + education 3 10632 606562 34635
## + bmi 1 9513 607682 34646
## + marital 2 8185 609009 34665
## + age_group 3 6839 610355 34685
## + sex 1 1231 615963 34754
## <none> 617194 34768
##
## Step: AIC=32141.72
## physhlth_days ~ gen_health
##
## Df Sum of Sq RSS AIC
## + menthlth_days 1 18156.4 426102 31810
## + exercise 1 5642.9 438615 32041
## + age_group 3 984.8 443273 32130
## + sex 1 646.6 443612 32132
## + marital 2 670.1 443588 32134
## + bmi 1 361.0 443897 32137
## <none> 444258 32142
## + education 3 175.3 444083 32145
##
## Step: AIC=31809.89
## physhlth_days ~ gen_health + menthlth_days
##
## Df Sum of Sq RSS AIC
## + exercise 1 4865.1 421237 31720
## + age_group 3 3865.5 422236 31743
## + marital 2 1206.0 424896 31791
## + bmi 1 291.1 425811 31806
## + sex 1 163.4 425938 31809
## <none> 426102 31810
## + education 3 112.8 425989 31814
##
## Step: AIC=31720.03
## physhlth_days ~ gen_health + menthlth_days + exercise
##
## Df Sum of Sq RSS AIC
## + age_group 3 3194.2 418043 31665
## + marital 2 1020.6 420216 31705
## <none> 421237 31720
## + bmi 1 93.0 421144 31720
## + sex 1 63.9 421173 31721
## + education 3 64.9 421172 31725
##
## Step: AIC=31665.13
## physhlth_days ~ gen_health + menthlth_days + exercise + age_group
##
## Df Sum of Sq RSS AIC
## + bmi 1 148.834 417894 31664
## <none> 418043 31665
## + sex 1 13.102 418029 31667
## + marital 2 86.081 417956 31667
## + education 3 45.539 417997 31670
##
## Step: AIC=31664.28
## physhlth_days ~ gen_health + menthlth_days + exercise + age_group +
## bmi
##
## Df Sum of Sq RSS AIC
## <none> 417894 31664
## + sex 1 14.648 417879 31666
## + marital 2 89.092 417805 31667
## + education 3 45.895 417848 31669
# Create a pretty table
tidy(mod_forward, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Forward Selection Result (AIC-based)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 1.0211 | 0.4526 | 2.2562 | 0.0241 | 0.1339 | 1.9082 |
| gen_healthGood | 1.2797 | 0.1888 | 6.7785 | 0.0000 | 0.9096 | 1.6498 |
| gen_healthFair/Poor | 10.3518 | 0.2463 | 42.0277 | 0.0000 | 9.8690 | 10.8347 |
| menthlth_days | 0.2057 | 0.0106 | 19.4985 | 0.0000 | 0.1851 | 0.2264 |
| exerciseYes | -1.7623 | 0.2022 | -8.7137 | 0.0000 | -2.1588 | -1.3659 |
| age_group35-49 | 0.3061 | 0.2723 | 1.1241 | 0.2610 | -0.2277 | 0.8398 |
| age_group50-64 | 1.0723 | 0.2565 | 4.1804 | 0.0000 | 0.5695 | 1.5751 |
| age_group65+ | 1.6841 | 0.2447 | 6.8820 | 0.0000 | 1.2044 | 2.1638 |
| bmi | 0.0212 | 0.0126 | 1.6870 | 0.0916 | -0.0034 | 0.0459 |
Forward selection also contains menthlth_days, bmi, exerciseYes, age_group35-49, age_group50-64, age_group65+, gen_healthGood, and gen_healthFair/Poor.
# Perform stepwise elimination
mod_stepwise <- step(mod_null,
scope = list(lower = mod_null, upper = max_model),
direction = "both", trace = 1)## Start: AIC=34767.94
## physhlth_days ~ 1
##
## Df Sum of Sq RSS AIC
## + gen_health 2 172936 444258 32142
## + menthlth_days 1 60700 556494 33942
## + exercise 1 36694 580500 34280
## + education 3 10632 606562 34635
## + bmi 1 9513 607682 34646
## + marital 2 8185 609009 34665
## + age_group 3 6839 610355 34685
## + sex 1 1231 615963 34754
## <none> 617194 34768
##
## Step: AIC=32141.72
## physhlth_days ~ gen_health
##
## Df Sum of Sq RSS AIC
## + menthlth_days 1 18156 426102 31810
## + exercise 1 5643 438615 32041
## + age_group 3 985 443273 32130
## + sex 1 647 443612 32132
## + marital 2 670 443588 32134
## + bmi 1 361 443897 32137
## <none> 444258 32142
## + education 3 175 444083 32145
## - gen_health 2 172936 617194 34768
##
## Step: AIC=31809.89
## physhlth_days ~ gen_health + menthlth_days
##
## Df Sum of Sq RSS AIC
## + exercise 1 4865 421237 31720
## + age_group 3 3866 422236 31743
## + marital 2 1206 424896 31791
## + bmi 1 291 425811 31806
## + sex 1 163 425938 31809
## <none> 426102 31810
## + education 3 113 425989 31814
## - menthlth_days 1 18156 444258 32142
## - gen_health 2 130393 556494 33942
##
## Step: AIC=31720.03
## physhlth_days ~ gen_health + menthlth_days + exercise
##
## Df Sum of Sq RSS AIC
## + age_group 3 3194 418043 31665
## + marital 2 1021 420216 31705
## <none> 421237 31720
## + bmi 1 93 421144 31720
## + sex 1 64 421173 31721
## + education 3 65 421172 31725
## - exercise 1 4865 426102 31810
## - menthlth_days 1 17379 438615 32041
## - gen_health 2 108844 530080 33555
##
## Step: AIC=31665.13
## physhlth_days ~ gen_health + menthlth_days + exercise + age_group
##
## Df Sum of Sq RSS AIC
## + bmi 1 149 417894 31664
## <none> 418043 31665
## + sex 1 13 418029 31667
## + marital 2 86 417956 31667
## + education 3 46 417997 31670
## - age_group 3 3194 421237 31720
## - exercise 1 4194 422236 31743
## - menthlth_days 1 19893 437935 32035
## - gen_health 2 100730 518772 33388
##
## Step: AIC=31664.28
## physhlth_days ~ gen_health + menthlth_days + exercise + age_group +
## bmi
##
## Df Sum of Sq RSS AIC
## <none> 417894 31664
## - bmi 1 149 418043 31665
## + sex 1 15 417879 31666
## + marital 2 89 417805 31667
## + education 3 46 417848 31669
## - age_group 3 3250 421144 31720
## - exercise 1 3971 421864 31738
## - menthlth_days 1 19882 437776 32034
## - gen_health 2 97843 515737 33343
# Create a pretty table
tidy(mod_stepwise, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Stepwise Selection Result (AIC-based)",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 1.0211 | 0.4526 | 2.2562 | 0.0241 | 0.1339 | 1.9082 |
| gen_healthGood | 1.2797 | 0.1888 | 6.7785 | 0.0000 | 0.9096 | 1.6498 |
| gen_healthFair/Poor | 10.3518 | 0.2463 | 42.0277 | 0.0000 | 9.8690 | 10.8347 |
| menthlth_days | 0.2057 | 0.0106 | 19.4985 | 0.0000 | 0.1851 | 0.2264 |
| exerciseYes | -1.7623 | 0.2022 | -8.7137 | 0.0000 | -2.1588 | -1.3659 |
| age_group35-49 | 0.3061 | 0.2723 | 1.1241 | 0.2610 | -0.2277 | 0.8398 |
| age_group50-64 | 1.0723 | 0.2565 | 4.1804 | 0.0000 | 0.5695 | 1.5751 |
| age_group65+ | 1.6841 | 0.2447 | 6.8820 | 0.0000 | 1.2044 | 2.1638 |
| bmi | 0.0212 | 0.0126 | 1.6870 | 0.0916 | -0.0034 | 0.0459 |
Stepwise selection also contains menthlth_days, bmi, exerciseYes, age_group35-49, age_group50-64, age_group65+, gen_healthGood, and gen_healthFair/Poor.
In 3-5 sentences: Compare the results: Do the three methods agree on the same final model? Which variables (if any) were excluded by all three methods? Which variables were retained by all three methods?
Yes, the backward elimination, forward selection, and stepwise selection methods do agree on the same final model. Marital, education, and sex, were excluded by all three models. All three models retained menthlth_days, bmi, exercise, age_group, and gen_health suggesting this may be the best set of predictors to use for the final model selection.
# Create table to compare all the models for model selection
model_comparison <- tribble(
~Method, ~`Variables selected`, ~`Adj. R²`, ~AIC, ~BIC,
"Maximum model",
length(coef(max_model)) - 1,
round(glance(max_model)$adj.r.squared, 4),
round(AIC(max_model), 1),
round(BIC(max_model), 1),
"Backward Elimination",
length(coef(mod_backward)) - 1,
round(glance(mod_backward)$adj.r.squared, 4),
round(AIC(mod_backward), 1),
round(BIC(mod_backward), 1),
"Forward Selection",
length(coef(mod_forward)) - 1,
round(glance(mod_forward)$adj.r.squared, 4),
round(AIC(mod_forward), 1),
round(BIC(mod_forward), 1),
"Stepwise Selection",
length(coef(mod_stepwise)) - 1,
round(glance(mod_stepwise)$adj.r.squared, 4),
round(AIC(mod_stepwise), 1),
round(BIC(mod_stepwise), 1)
)
model_comparison |>
kable(caption = "Model Comparison: R-Squared, Adjusted R-Squared AIC, BIC",
digits = 3,
align = "lrrr") |>
kable_styling(bootstrap_options = c("striped", "hover"),
full_width = FALSE)| Method | Variables selected | Adj. R² | AIC | BIC |
|---|---|---|---|---|
| Maximum model | 14 | 0.322 | 54378.6 | 54490.4 |
| Backward Elimination | 8 | 0.322 | 54369.3 | 54439.2 |
| Forward Selection | 8 | 0.322 | 54369.3 | 54439.2 |
| Stepwise Selection | 8 | 0.322 | 54369.3 | 54439.2 |
Why this Final Model: I chose this final model which contains menthlth_days, bmi, exercise, age_group, and gen_health after creating a maximum model and comparing it to backward elimination, forward selection, and stepwise selection. For the maximum model with 14 parameters, the adjusted R-squared is 0.322 which is exactly the same output as the backward elimination, forward selection, and stepwise selection. However, in these three models, there are only 8 parameters included which demonstrates an overall lowered AIC of 54,369.3 and BIC of 54,439.2 when compared to the maximum model which had an AIC of 54,378.6 and 54,490.4. I decided to choose the most parsimonious model which contains less parameters but maximizes adjusted R-squared, AIC, and BIC.
# Fitting final model based on all selection methods
final_model <- lm(physhlth_days ~ menthlth_days + bmi + exercise + age_group + gen_health, data = brfss_analytic)
tidy(final_model, conf.int = TRUE) |>
mutate(
across(where(is.numeric), ~ round(., 3))
) |>
kable(
caption = "Table 4. Final Model Linear Regression: Physically Unhealthy Days ~ menthlth_days, bmi, exercise, age_group, and gen_health (BRFSS 2023, n = 8,000)",
col.names = c("Term", "Estimate(β)", "Std. Error", "t-statistic",
"p-value", "95% CI Lower", "95% CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate(β) | Std. Error | t-statistic | p-value | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 1.021 | 0.453 | 2.256 | 0.024 | 0.134 | 1.908 |
| menthlth_days | 0.206 | 0.011 | 19.499 | 0.000 | 0.185 | 0.226 |
| bmi | 0.021 | 0.013 | 1.687 | 0.092 | -0.003 | 0.046 |
| exerciseYes | -1.762 | 0.202 | -8.714 | 0.000 | -2.159 | -1.366 |
| age_group35-49 | 0.306 | 0.272 | 1.124 | 0.261 | -0.228 | 0.840 |
| age_group50-64 | 1.072 | 0.257 | 4.180 | 0.000 | 0.569 | 1.575 |
| age_group65+ | 1.684 | 0.245 | 6.882 | 0.000 | 1.204 | 2.164 |
| gen_healthGood | 1.280 | 0.189 | 6.778 | 0.000 | 0.910 | 1.650 |
| gen_healthFair/Poor | 10.352 | 0.246 | 42.028 | 0.000 | 9.869 | 10.835 |
Interpret the coefficients: for at least three predictors in plain language, including units and “holding all other variables constant” language. Include at least one continuous and one categorical predictor.
Menthlth_days (0.206): For every additional mentally unhealthy day, physically unhealthy days tend to increase by 0.206 days on average holding all other variables constant.
Exercise(Yes) (-1.762): Individuals who exercise report 1.76 fewer physically unhealthy days compared to those who do not exercise holding all other variables constant.
Age_group35-49 (0.306): Individuals who are aged 35-49 report 0.306 more physically unhealthy days compared to those who are 18 to 34 holding all other variables constant.
Residuals vs. Fitted: Is there evidence of non-linearity or non-constant variance?
There is slight evidence of non-linearity because the red line is not entirely flat around 0 with a slight dip in the middle. The residuals could slightly be demonstrating a funnel shape which would suggest a non-constant variance.
Normal Q-Q: Do the residuals approximately follow a normal distribution? Where do they deviate?
The residuals do not follow a normal distribution along the 45 degree reference line. The points deviate at both ends with heavy tails/s-curves.
Scale-Location: Is the spread of residuals roughly constant across fitted values?
The spread of residuals is not roughly constant across the fitted values. If it was, the red line would be flat, in our scale-location, a curved line is apparent. The residuals do form a funnel shape further emphasizing non-constant variance.
Residuals vs. Leverage: Are there any highly influential observations (large Cook’s distance)?
There do appear to be a couple of highly influential observations including one value at 2445 and one at 2831.
Write a brief conclusion (2-3 sentences): Are the LINE assumptions reasonable satisfied? What violations, if any, do you observe?
The LINE assumptions do not seem to be reasonable satisfied as every one of the graphs depicts violations. There appears to be non-linearity, non-normal distribution of residuals, non-constant spread of residuals across fitted values, and some highly influential observations.
• 1 if physhlth_days >= 14. • 0 if physhlth_days < 14 • Store as factor with levels “No” (reference) and “Yes” • Report prevalence of frequent physical distress (count and percentage in each category)
brfss_analytic <- brfss_analytic |>
mutate(
frequent_distress = factor(
case_when(
physhlth_days < 14 ~ "No",
physhlth_days >= 14 ~ "Yes",
TRUE ~ NA_character_
),
levels = c("No", "Yes")
))
# Prevalence of frequent physical distress (count and percentage)
brfss_analytic|>
select(frequent_distress) |>
tbl_summary(
type = list(frequent_distress ~ "categorical"),
label = list(
frequent_distress ~ "Frequent Physical Distress"
),
statistic = list(
all_categorical() ~ "{n} ({p}%)"
),
missing = "no"
) |>
add_n() |>
bold_labels() |>
modify_caption("**Frequent Physical Distress Counts and Percentages**")| Characteristic | N | N = 8,0001 |
|---|---|---|
| Frequent Physical Distress | 8,000 | |
| No | 6,928 (87%) | |
| Yes | 1,072 (13%) | |
| 1 n (%) | ||
# Create Table to show across categories
brfss_analytic |>
select(frequent_distress, sex, exercise, age_group, education, gen_health, marital) |>
tbl_summary(
by = frequent_distress,
type = list(exercise ~ "categorical",
sex ~ "categorical"),
label = list(
sex ~ "Sex",
exercise ~ "Exercise Status",
age_group ~ "Age Category",
education ~ "Education Category",
gen_health ~ "General Health Status",
marital ~ "Marital Status"
),
statistic = list(
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1,
missing = "no"
) |>
add_n() |>
bold_labels() |>
italicize_levels() |>
modify_caption("** Categorical Breakdowns of Physical Distress Counts**") |>
modify_footnote(
all_stat_cols() ~
"n(%)"
) |>
as_flex_table()Characteristic | N | No | Yes |
|---|---|---|---|
Sex | 8,000 | ||
Male | 3,431 (50%) | 468 (44%) | |
Female | 3,497 (50%) | 604 (56%) | |
Exercise Status | 8,000 | ||
No | 1,375 (20%) | 505 (47%) | |
Yes | 5,553 (80%) | 567 (53%) | |
Age Category | 8,000 | ||
18-34 | 1,249 (18%) | 106 (9.9%) | |
35-49 | 1,377 (20%) | 145 (14%) | |
50-64 | 1,784 (26%) | 324 (30%) | |
65+ | 2,518 (36%) | 497 (46%) | |
Education Category | 8,000 | ||
Less than HS | 318 (4.6%) | 117 (11%) | |
HS/GED | 1,665 (24%) | 323 (30%) | |
Some College | 1,792 (26%) | 305 (28%) | |
College Graduate | 3,153 (46%) | 327 (31%) | |
General Health Status | 8,000 | ||
Excellent/Very Good | 3,821 (55%) | 133 (12%) | |
Good | 2,335 (34%) | 241 (22%) | |
Fair/Poor | 772 (11%) | 698 (65%) | |
Marital Status | 8,000 | ||
Married/Partnered | 4,027 (58%) | 506 (47%) | |
Previously married | 1,590 (23%) | 405 (38%) | |
Never married | 1,311 (19%) | 161 (15%) | |
1n(%) | |||
In 2-3 sentences: Comment on the balance of the outcome
It appears that 87% of individuals are categorized into the no frequent physical distress category whereas only 13% are in the yes frequent physical distress category. This suggests that there are more than five times as many individuals in one category than the other meaning the outcome is not very balanced. I also created a table for the categorical predictors to understand its distribution and noticed that for general health and frequent physical distress, those who did demonstrate frequent physical distress were 65% of those in the fair/poor health status category.
Why Choose Gen_Health: I chose general health because it appeared to be one of the strongest predictors. In the maximum linear regression model, fair/poor health was associated with about 10 more unhealthy days compared to the excellent/very good category holding all else constant. For individuals who reported good health status, they were associated with 1.29 more unhealthy days compared to the excellent/very good category holding all else constant. Both good and fair/poor demonstrated a p-value of less than 0.001 suggesting this would be a great starting point. We have also had countless discussions in class surrounding how important of a predictor general health can be for an individual. In addition, when looking at the model comparison, it appears that general health status contributed a large amount to the adjusted R-squared which helps edxplain the variance in physically unhealthy days.
# Fit a simple logistic regression
mod_simple <- glm(frequent_distress ~ gen_health,
data = brfss_analytic, family = binomial)
# Display the model summary
tidy(mod_simple, conf.int = TRUE, exponentiate = FALSE) |>
mutate(across(c(estimate, std.error, statistic, conf.low, conf.high),
\(x) round(x, 3)),
p.value = format.pval(p.value, digits = 3)) |>
kable(caption = "Simple Logistic Regression: Frequent Distress ~ General Health Status (Log-Odds Scale)") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | -3.358 | 0.088 | -38.069 | <2e-16 | -3.536 | -3.190 |
| gen_healthGood | 1.087 | 0.111 | 9.778 | <2e-16 | 0.871 | 1.307 |
| gen_healthFair/Poor | 3.257 | 0.103 | 31.774 | <2e-16 | 3.060 | 3.462 |
# Calculate and report the Odds ratio and 95% confidence interval
tidy(mod_simple, conf.int = TRUE, exponentiate = TRUE) |>
mutate(across(c(estimate, std.error, statistic, conf.low, conf.high),
\(x) round(x, 3)),
p.value = format.pval(p.value, digits = 3)) |>
kable(caption = "Wald Tests for General Health (Exponentiated)") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 0.035 | 0.088 | -38.069 | <2e-16 | 0.029 | 0.041 |
| gen_healthGood | 2.965 | 0.111 | 9.778 | <2e-16 | 2.389 | 3.695 |
| gen_healthFair/Poor | 25.975 | 0.103 | 31.774 | <2e-16 | 21.317 | 31.869 |
# Publication ready table
mod_simple |>
tbl_regression(
exponentiate = TRUE,
label = list(gen_health ~ "General Health Status"),
) |>
bold_labels() |>
bold_p()| Characteristic | OR | 95% CI | p-value |
|---|---|---|---|
| General Health Status | |||
| Excellent/Very Good | — | — | |
| Good | 2.97 | 2.39, 3.70 | <0.001 |
| Fair/Poor | 26.0 | 21.3, 31.9 | <0.001 |
| Abbreviations: CI = Confidence Interval, OR = Odds Ratio | |||
Interpret the odds ratio in plain language: State whether the association is statistically significant at alpha = 0.05 and how you know (Wald test p-value or CI excluding 1.0)
Gen_healthGood (OR = 2.97): Compared to excellent/very good health status, individuals who reported good health status have almost 3 times the odds of frequent physical distress, 95% CI [2.39, 3.70].
Gen_healthFair/Poor (OR = 26.0): Compared to excellent/very good health status, individuals who reported fair/poor health status have 26 times the odds of frequent physical distress, 95% CI [21.3, 31.9].
The association is statistically significant at alpha = 0.05 because the 95% confidence interval does not include 1, and by default the Wald test is computed and outputted a p-value of <0.001 suggesting that we should consider the evidence that the association is statistically significant.
# Fit a multiple logistic regression
mod_logistic <- glm(frequent_distress ~ menthlth_days + bmi + exercise + age_group + gen_health,
data = brfss_analytic, family = binomial)
# Display the model summary with pretty table
tidy(mod_logistic, conf.int = TRUE, exponentiate = FALSE) |>
kable(digits = 3, caption = "Multiple Logistic Regression: ~ menthlth_days, bmi, exercise, age_group, and gen_health (Log-Odds Scale)") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | -3.745 | 0.219 | -17.130 | 0.000 | -4.177 | -3.320 |
| menthlth_days | 0.055 | 0.004 | 13.999 | 0.000 | 0.047 | 0.062 |
| bmi | 0.004 | 0.005 | 0.807 | 0.420 | -0.006 | 0.015 |
| exerciseYes | -0.510 | 0.082 | -6.208 | 0.000 | -0.671 | -0.349 |
| age_group35-49 | 0.178 | 0.154 | 1.156 | 0.248 | -0.123 | 0.481 |
| age_group50-64 | 0.556 | 0.139 | 4.009 | 0.000 | 0.287 | 0.832 |
| age_group65+ | 0.847 | 0.134 | 6.317 | 0.000 | 0.588 | 1.114 |
| gen_healthGood | 0.843 | 0.114 | 7.367 | 0.000 | 0.620 | 1.069 |
| gen_healthFair/Poor | 2.719 | 0.109 | 24.891 | 0.000 | 2.508 | 2.937 |
# Calculate and report the Adjusted Odds ratio and 95% confidence interval
tidy(mod_logistic, conf.int = TRUE, exponentiate = TRUE) |>
kable(digits = 3, caption = "Multiple Logistic Regression: Frequent Distress ~ menthlth_days, bmi, exercise, age_group, and gen_health (Odds Ratio)") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 0.024 | 0.219 | -17.130 | 0.000 | 0.015 | 0.036 |
| menthlth_days | 1.056 | 0.004 | 13.999 | 0.000 | 1.048 | 1.064 |
| bmi | 1.004 | 0.005 | 0.807 | 0.420 | 0.994 | 1.015 |
| exerciseYes | 0.600 | 0.082 | -6.208 | 0.000 | 0.511 | 0.705 |
| age_group35-49 | 1.195 | 0.154 | 1.156 | 0.248 | 0.885 | 1.618 |
| age_group50-64 | 1.744 | 0.139 | 4.009 | 0.000 | 1.333 | 2.297 |
| age_group65+ | 2.333 | 0.134 | 6.317 | 0.000 | 1.800 | 3.047 |
| gen_healthGood | 2.324 | 0.114 | 7.367 | 0.000 | 1.860 | 2.914 |
| gen_healthFair/Poor | 15.171 | 0.109 | 24.891 | 0.000 | 12.281 | 18.852 |
# Publication ready table
mod_logistic |>
tbl_regression(
exponentiate = TRUE,
label = list(menthlth_days ~ "Mentally Unhealthy Days",
bmi ~ "Body Mass Index",
exercise ~ "Exercise in past 30 days",
age_group ~ "Age Category",
gen_health ~ "General Health Status")
) |>
bold_labels() |>
bold_p()| Characteristic | OR | 95% CI | p-value |
|---|---|---|---|
| Mentally Unhealthy Days | 1.06 | 1.05, 1.06 | <0.001 |
| Body Mass Index | 1.00 | 0.99, 1.01 | 0.4 |
| Exercise in past 30 days | |||
| No | — | — | |
| Yes | 0.60 | 0.51, 0.71 | <0.001 |
| Age Category | |||
| 18-34 | — | — | |
| 35-49 | 1.19 | 0.88, 1.62 | 0.2 |
| 50-64 | 1.74 | 1.33, 2.30 | <0.001 |
| 65+ | 2.33 | 1.80, 3.05 | <0.001 |
| General Health Status | |||
| Excellent/Very Good | — | — | |
| Good | 2.32 | 1.86, 2.91 | <0.001 |
| Fair/Poor | 15.2 | 12.3, 18.9 | <0.001 |
| Abbreviations: CI = Confidence Interval, OR = Odds Ratio | |||
Interpret the odds ratios for at least three predictors in plain language: Ensure using “holding all other variables constant” language (Include at least one continuous and one categorical predictor)
Mentally Unhealthy Days (1.06): With each increased day of mentally unhealthy days, individuals have 6% higher odds of experiencing frequent distress holding all other variables constant.
Exercise(Yes) (0.60): Individuals who exercise have 40% decreased odds of experiencing frequent physical distress when compared to those who do not exercise holding all other variables constant.
General Health Status Good (2.32): Individuals who classify themselves in the good general health status have 2.3 times higher odds of experiencing frequent physical distress compared to individuals in the excellent/very good health status category holding all other variables constant.
# Create a reduced model without age_group
mod_reduced <- glm(frequent_distress ~ menthlth_days + bmi + exercise + gen_health,
data = brfss_analytic,
family = binomial
)
# Conduct a likelihood ratio test to determine if age as a whole has an impact
anova(mod_reduced, mod_logistic, test = "Chisq") |>
kable(digits = 3,
caption = "LR Test: Does adding age_group improve the model?") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
|---|---|---|---|---|
| 7994 | 4586.561 | NA | NA | NA |
| 7991 | 4528.178 | 3 | 58.383 | 0 |
Report: State the null and alternative hypotheses, report the test statistic (deviance), degrees of freedom, and p-value. Conclude at alpha = 0.05: Does this group of predictors significantly improve the model?
H0: The dropped group of predictors does not improve the model (age_group).
H1: The dropped group predictors does significantly improve the model (age_group).
The test statistic (deviance) was 58.383. Degrees of freedom was 3, and the p-value was < .001.
This p-value is less than the alpha level of 0.05 suggesting that we should reject the null hypothesis and consider the evidence that age group significantly improves the model.
# Fit the ROC object
roc_obj <- roc(
response = brfss_analytic$frequent_distress,
predictor = fitted(mod_logistic),
levels = c("No", "Yes"),
direction = "<"
)
auc_value <- auc(roc_obj)
# Provides both AUC and 95% confidence intervals
ci_auc <- ci.auc(roc_obj)
ci_auc## 95% CI: 0.8362-0.8627 (DeLong)
cat("AUC:", round(auc_value, 3),
"\n95% CI:", paste(round(ci_auc[1], 3),
round(ci_auc[2], 3),
round(ci_auc[3], 3), sep = " - "))## AUC: 0.849
## 95% CI: 0.836 - 0.849 - 0.863
# Graph that demonstrates the ROC curve
ggroc(roc_obj, color = "steelblue", linewidth = 1.2) +
geom_abline(slope = 1, intercept = 1, linetype = "dashed", color = "red") +
labs(title = "ROC Curve: Frequent Physical Distress Model",
subtitle = paste0("AUC = ", round(auc_value, 3)),
x = "Specificity", y = "Sensitivity") +
theme_minimal()Report: the AUC with 95% confidence interval
The AUC with 95% CIs is 0.849 [0.8362, 0.8627].
Interpret the AUC: What does this value tell you about the model’s ability to discriminate between individuals with and without frequent physical distress?
This value tells me that the model has an excellent ability (0.8-0.9) to discriminate between individuals with and without physical distress.
# Perform Hosmer-Lemeshow Test
hl_test <- hoslem.test(
x = as.numeric(brfss_analytic$frequent_distress) - 1,
y = fitted(mod_logistic),
g = 10
)
hl_test##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: as.numeric(brfss_analytic$frequent_distress) - 1, fitted(mod_logistic)
## X-squared = 5.0826, df = 8, p-value = 0.7487
# Always pair with a calibration plot
brfss_analytic |>
mutate(pred_prob = fitted(mod_logistic),
obs_outcome = as.numeric(frequent_distress) - 1,
decile = ntile(pred_prob, 10)) |>
group_by(decile) |>
summarise(
mean_pred = mean(pred_prob),
mean_obs = mean(obs_outcome),
.groups = "drop"
) |>
ggplot(aes(x = mean_pred, y = mean_obs)) +
geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
geom_point(size = 3, color = "steelblue") +
geom_line(color = "steelblue") +
labs(title = "Calibration Plot: Observed vs. Predicted Probability of Frequent Physical Distress",
subtitle = "Points should fall on the dashed line for perfect calibration",
x = "Mean Predicted Probability (within decile)",
y = "Observed Proportion (within decile)") +
theme_minimal()Report: State the null and alternative hypotheses, report the test statistic, degrees of freedom, and p-value. Interpret: At alpha = 0.05 is there evidence of poor model fit?
H0: The model fits well in the regions of predicted probability.
H1: The model does not fit well in some regions of predicted probability.
The test-statistic, X-squared, was 5.0826. The degrees of freedom was 8, and the p-value was 0.7487. At alpha 0.05, there is not evidence of poor model fit, so we accept the null hypothesis that the model fits well in the regions of the predicted probability. This is further demonstrated with the corresponding calibration plot.
In 1-2 sentences: Discuss how the Hosmer-Lemeshow result complements the ROC/AUC finding.
They compliment each other because Hosmer-Lemeshow is a calibration assessment that determines how well predicted probabilities match observed rates while the ROC/AUC finding is a discrimination assessment that determines how well the model separates events from non-events. This gives both a perspective from predicted values as well as outcomes to gain a well-rounded understanding of the data.
Write 250–400 words in continuous prose (not bullet points) addressing all three topics below.
A. Statistical Insights (5 points)
[What do your results suggest about the factors associated with physical health burden among U.S. adults? Which predictors were the strongest in both the linear and logistic models? Were any predictors significant in one model but not the other? What are the key limitations of using cross-sectional survey data for this type of analysis? Name at least two specific potential confounders not measured in your model.]
When fitting the maximum linear model, I used all eight variables including menthlth_days, bmi, sex, exercise, age_group, education, gen_health, and marital. After performing different selection methods, I determined that menthlth_days, bmi, exercise, age_group, and gen_health were the best predictors. In this final linear regression, menthlth_days, exercise_Yes, age_group50-64, age_group65+, gen_healthGood, and gen_healthFair/Poor were found to be significant with p-values less than 0.001. Whereas, bmi and age_group35-49 were not significant with respective p-values of 0.092 and 0.261. This linear model was able to explain 32.2% of the varaince in physically unhealthy days. The same results were discovered for the multiple logistic regression with only bmi and age_group35-49 not being signiciant (p-values of 0.420 and 0.248). These results suggest that unhealthy mental health days, exercise, age(50-64, 65+), and general health are all significant factors associated with the physical health burden among U.S. adults. The most significant predictors appeared to be general health status and being in the age group of 65+ with large estimates in both the linear and logistic regressions. They key limitations of using cross-sectional survey data is that observations from the same household or geographic may not be fully independent. Furthermore, cross-sectional survey data can not establish temporality and causality; however, it can reflect associations/relationships. Two confounders that were not measures in this model were rural versus urban status because those living in rural areas may demonstrate more frequent physical distress days as well as income which was a variable that I did not choose as predictor however those who with lower income or socioeconomic status may have increased levels of frequent physical distress.
B. Linear versus Logistic Regression (5 points)
[Compare what the linear regression model tells you versus what the logistic regression model tells you about the same research question. What information does each approach provide that the other does not? In what situations would you prefer one approach over the other? How do the model selection criteria differ between the two frameworks (for example, R-squared versus AUC)]
Simple linear regression models allows us to quantify the linear relationship between a continuous outcome and a single predictor, tells us how to predict values of an outcome given a predictor value, test hypotheses about whether a predictor is associated with an outcome, and lay the groundwork for multiple linear regression which can allow us to control for confounding. For linear regression, we employ R-squared and adjusted R-squared to understand the proportion of total variability in our outcome explained by the linear regression. Typically when selecting a linear regression, we want to pick the one with the highest R-squared or adjusted R-Squared. Logistic regressions helps us when the outcome variable is binary instead of continuous, we want to estimate the probability of an event occurring, we need adjusted odds ratios that control for confounders, and we want to identify risk factors for a dichotomous health outcome. We prefer to use logistic regression in cross-sectional studies where we have the prevalence of a condition, in case-control studies with odds of exposure given disease status and cohort studies for incidence of disease. Typically to select a model in logistic regression, we want to employ AIC/BIC and choose the model that demonstrates the lowest value.
C. AI Transparency (5 points)
[Describe any parts of this assignment where you sought AI assistance. Include the specific prompts you used and how you verified the correctness of the output. If you did not use AI, state this explicitly.]
I did not use any AI for this assignment. I already had all of the code to perform the tasks that were assigned between all of the labs that we have completed, the lecture notes, and my own final project codes because I employed logistic regression. If I did have any errors arise, I looked them up on Google which often directed me to R blogs where others had found answers. This included trying to determine how to get confidence intervals for the AUC curve. I found an R blog that explained to use ci.auc. Having an error appear and looking up its meaning seems to be one of the best ways for me to learn how to code. —
End of Homework 4