EPI 553 — Homework 4: Model Selection and Introduction to Logistic Regression

Submission: Knit this file to HTML, publish to RPubs with the title epi553_hw04_lastname_firstname, and paste the RPubs URL in the Brightspace submission comments. Also upload this .Rmd file to Brightspace.

AI Policy: AI tools are permitted on this assignment for debugging not coding. See the assignment description for full details.

Part 0: Data Preparation (10 points)

Step 1.1: Load required packages: tidyverse, haven, broom, kableExtra, car, summary, leaps, pROC, ResourceSelection

# Load required packages
library(kableExtra)
library(broom)
library(haven)
library(janitor)
library(plotly)
library(broom)
library(ggeffects)
library(ggstats)
library(gtsummary)
library(GGally)
library(car)
library(lmtest)
library(corrplot)
library(leaps)
library(pROC)
library(ResourceSelection)
library(tidyverse)

Step 1.2: Import using read_xpt. Report the total number of rows and columns in the raw dataset

# Import the dataset — update the path if needed
brfss_raw <- read_xpt("LLCP2023.XPT") |> clean_names()

# Quick check
glimpse(brfss_raw)

## Rows: 433,323
## Columns: 350
## $ state    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ fmonth   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ idate    <chr> "03012023", "01062023", "03082023", "03062023", "01062023", "…
## $ imonth   <chr> "03", "01", "03", "03", "01", "01", "03", "01", "03", "01", "…
## $ iday     <chr> "01", "06", "08", "06", "06", "09", "21", "06", "15", "08", "…
## $ iyear    <chr> "2023", "2023", "2023", "2023", "2023", "2023", "2023", "2023…
## $ dispcode <dbl> 1100, 1100, 1100, 1100, 1100, 1100, 1100, 1100, 1100, 1100, 1…
## $ seqno    <chr> "2023000001", "2023000002", "2023000003", "2023000004", "2023…
## $ psu      <dbl> 2.023e+09, 2.023e+09, 2.023e+09, 2.023e+09, 2.023e+09, 2.023e…
## $ ctelenm1 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ pvtresd1 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ colghous <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ statere1 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ celphon1 <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ ladult1  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ numadult <dbl> 2, 1, 1, 2, 1, 2, 1, 1, 1, 2, 2, 4, 2, 2, 2, 1, 1, 2, 1, 1, 1…
## $ respslc1 <dbl> 1, NA, NA, 1, NA, 1, NA, NA, NA, 2, 1, 1, 1, 1, 1, NA, NA, 1,…
## $ landsex2 <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 1, 2, 1, 2…
## $ lndsxbrt <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ safetime <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ ctelnum1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ cellfon5 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ cadult1  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ cellsex2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ celsxbrt <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ pvtresd3 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ cclghous <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ cstate1  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ landline <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ hhadult  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ sexvar   <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 1, 2, 1, 2…
## $ genhlth  <dbl> 2, 2, 4, 2, 4, 3, 4, 4, 3, 3, 3, 2, 2, 3, 3, 4, 5, 3, 3, 3, 3…
## $ physhlth <dbl> 88, 88, 6, 2, 88, 2, 8, 1, 5, 88, 88, 2, 88, 88, 88, 4, 30, 1…
## $ menthlth <dbl> 88, 88, 2, 88, 88, 3, 77, 88, 88, 88, 88, 88, 88, 88, 88, 88,…
## $ poorhlth <dbl> NA, NA, 1, 88, NA, 88, 88, 1, 88, NA, NA, 88, NA, NA, NA, 1, …
## $ primins1 <dbl> 3, 3, 3, 3, 3, 7, 3, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2…
## $ persdoc3 <dbl> 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2…
## $ medcost1 <dbl> 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ checkup1 <dbl> 2, 2, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1…
## $ exerany2 <dbl> 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 1, 2, 2, 1, 7, 2, 2, 1, 1…
## $ exract12 <dbl> NA, 1, 1, 1, 1, 1, NA, NA, NA, 1, 1, NA, 1, NA, NA, 1, NA, NA…
## $ exeroft1 <dbl> NA, 106, 205, 103, 102, 212, NA, NA, NA, 230, 105, NA, 105, N…
## $ exerhmm1 <dbl> NA, 30, 15, 30, 45, 45, NA, NA, NA, 30, 30, NA, 45, NA, NA, 1…
## $ exract22 <dbl> NA, 88, 88, 88, 8, 88, NA, NA, NA, 88, 77, NA, 3, NA, NA, 88,…
## $ exeroft2 <dbl> NA, NA, NA, NA, 107, NA, NA, NA, NA, NA, NA, NA, 105, NA, NA,…
## $ exerhmm2 <dbl> NA, NA, NA, NA, 100, NA, NA, NA, NA, NA, NA, NA, 45, NA, NA, …
## $ strength <dbl> 888, 888, 205, 888, 888, 203, 777, 888, 888, 888, 105, 888, 8…
## $ bphigh6  <dbl> 1, 1, 1, 3, 1, 1, 3, 1, 1, 1, 1, 1, 3, 3, 1, 1, 1, 1, 1, 1, 1…
## $ bpmeds1  <dbl> 1, 2, 1, NA, 1, 1, NA, 1, 1, 1, 1, 1, NA, NA, 1, 1, 1, 1, 1, …
## $ cholchk3 <dbl> 3, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2…
## $ toldhi3  <dbl> 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 1, 2…
## $ cholmed3 <dbl> 2, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 2, 1, 2…
## $ cvdinfr4 <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ cvdcrhd4 <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ cvdstrk3 <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2…
## $ asthma3  <dbl> 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2…
## $ asthnow  <dbl> NA, NA, 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ chcscnc1 <dbl> 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ chcocnc1 <dbl> 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 1, 1…
## $ chccopd3 <dbl> 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2…
## $ addepev3 <dbl> 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ chckdny2 <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 7, 1, 2, 2, 2…
## $ havarth4 <dbl> 2, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1…
## $ diabete4 <dbl> 1, 3, 3, 3, 1, 3, 3, 1, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1…
## $ diabage4 <dbl> 57, NA, NA, NA, 68, NA, NA, 98, NA, 60, NA, NA, NA, NA, NA, N…
## $ marital  <dbl> 1, 2, 3, 1, 3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 3, 3, 1, 3, 3, 2…
## $ educa    <dbl> 5, 5, 4, 5, 5, 5, 4, 5, 5, 4, 5, 4, 6, 5, 5, 4, 6, 6, 5, 4, 4…
## $ renthom1 <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ numhhol4 <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2…
## $ numphon4 <dbl> NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ cpdemo1c <dbl> 1, 1, 8, 1, 3, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ veteran3 <dbl> 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 1, 2, 2, 2…
## $ employ1  <dbl> 7, 7, 7, 7, 8, 7, 8, 7, 7, 7, 4, 1, 7, 7, 7, 5, 7, 7, 7, 7, 7…
## $ children <dbl> 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 8…
## $ income3  <dbl> 99, 99, 2, 99, 7, 7, 6, 99, 6, 7, 7, 99, 9, 9, 9, 3, 77, 99, …
## $ pregnant <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ weight2  <dbl> 172, 132, 130, 170, 170, 165, 180, 135, 195, 160, 300, 145, 1…
## $ height3  <dbl> 503, 409, 504, 506, 508, 502, 600, 411, 504, 510, 511, 505, 5…
## $ deaf     <dbl> 2, 1, 7, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2…
## $ blind    <dbl> 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ decide   <dbl> 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 7, 2…
## $ diffwalk <dbl> 1, 2, 1, 1, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1, 1, 1, 2, 1, 1…
## $ diffdres <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 7, 2, 2, 2, 2…
## $ diffalon <dbl> 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1…
## $ fall12mn <dbl> 88, 88, 88, 88, 3, 88, 8, 88, 88, 88, 88, 88, 88, 1, 88, 1, 3…
## $ fallinj5 <dbl> NA, NA, NA, NA, 88, NA, 88, NA, NA, NA, NA, NA, NA, 1, NA, 88…
## $ smoke100 <dbl> 2, 2, 1, 2, 2, 2, 1, 2, 2, 1, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1, 1…
## $ smokday2 <dbl> NA, NA, 3, NA, NA, NA, 3, NA, NA, 3, NA, NA, 3, 3, 3, NA, NA,…
## $ usenow3  <dbl> 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
## $ ecignow2 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 1, 4, 1, 1, 1, 1, 4, 1, 4…
## $ alcday4  <dbl> 888, 888, 888, 888, 202, 205, 888, 888, 888, 888, 888, 888, 1…
## $ avedrnk3 <dbl> NA, NA, NA, NA, 1, 2, NA, NA, NA, NA, NA, NA, 1, 2, NA, NA, N…
## $ drnk3ge5 <dbl> NA, NA, NA, NA, 88, 88, NA, NA, NA, NA, NA, NA, 88, 88, NA, N…
## $ maxdrnks <dbl> NA, NA, NA, NA, 1, 2, NA, NA, NA, NA, NA, NA, 1, 3, NA, NA, N…
## $ flushot7 <dbl> 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1…
## $ flshtmy3 <dbl> NA, 92023, 112022, 102022, NA, NA, 32023, 102023, 777777, 112…
## $ pneuvac4 <dbl> 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 2, 1, 1, 1…
## $ shingle2 <dbl> 2, 2, 1, 2, 2, 2, 2, 1, 1, 2, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2, 1…
## $ hivtst7  <dbl> 2, 2, 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 7, 2, 2, 2, 1…
## $ hivtstd3 <dbl> NA, NA, NA, 777777, NA, 777777, 92022, NA, NA, NA, NA, NA, NA…
## $ seatbelt <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ drnkdri2 <dbl> NA, NA, NA, NA, 88, 88, NA, NA, NA, NA, NA, NA, 88, 88, NA, N…
## $ covidpo1 <dbl> 2, 2, 2, 2, 2, 1, 2, 1, 1, 2, 1, 1, 2, 1, 1, 2, 2, 2, 2, 1, 2…
## $ covidsm1 <dbl> NA, NA, NA, NA, NA, 2, NA, 2, 2, NA, 1, 2, NA, 2, 2, NA, NA, …
## $ covidact <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 3, NA, NA, NA, NA, NA…
## $ pdiabts1 <dbl> NA, 2, 1, 1, NA, 1, 1, NA, 1, NA, 1, 1, 1, 1, 1, 1, 7, 1, 8, …
## $ prediab2 <dbl> NA, 3, 3, 3, NA, 3, 3, NA, 1, NA, 3, 3, 1, 3, 3, 3, 3, 3, 3, …
## $ diabtype <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ insulin1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ chkhemo3 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ eyeexam1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ diabeye1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ diabedu1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ feetsore <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ arthexer <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ arthedu  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ lmtjoin3 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ arthdis2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ joinpai2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ lcsfirst <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ lcslast  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ lcsnumcg <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ lcsctsc1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ lcsscncr <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ lcsctwhn <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ hadmam   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ howlong  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ cervscrn <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ crvclcnc <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ crvclpap <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ crvclhpv <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ hadhyst2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ psatest1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ psatime1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ pcpsars2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ psasugs1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ pcstalk2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ hadsigm4 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ colnsigm <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ colntes1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ sigmtes1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ lastsig4 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ colncncr <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ vircolo1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ vclntes2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ smalstol <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ stoltest <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ stooldn2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ bldstfit <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ sdnates1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ cncrdiff <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ cncrage  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ cncrtyp2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ csrvtrt3 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ csrvdoc1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ csrvsum  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ csrvrtrn <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ csrvinst <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ csrvinsr <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ csrvdein <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ csrvclin <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ csrvpain <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ csrvctl2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ indortan <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ numburn3 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ sunprtct <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ wkdayout <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ wkendout <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ cimemlo1 <dbl> 2, 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, NA, 2, 2, 2, 1, 2, 2, …
## $ cdworry  <dbl> NA, NA, 2, NA, 1, 1, NA, NA, NA, NA, NA, NA, 7, NA, NA, NA, N…
## $ cddiscu1 <dbl> NA, NA, 1, NA, 2, 2, NA, NA, NA, NA, NA, NA, 2, NA, NA, NA, N…
## $ cdhous1  <dbl> NA, NA, 2, NA, 2, 1, NA, NA, NA, NA, NA, NA, 2, NA, NA, NA, N…
## $ cdsocia1 <dbl> NA, NA, 2, NA, 2, 2, NA, NA, NA, NA, NA, NA, 2, NA, NA, NA, N…
## $ caregiv1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ crgvrel4 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ crgvlng1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ crgvhrs1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ crgvprb3 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ crgvalzd <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ crgvper1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ crgvhou1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ crgvexpt <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ lastsmk2 <dbl> NA, NA, 7, NA, NA, NA, 7, NA, NA, 7, NA, NA, 7, NA, 7, NA, NA…
## $ stopsmk2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ mentcigs <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ mentecig <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ heattbco <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ firearm5 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ gunload  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ loadulk2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ hasymp1  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ hasymp2  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ hasymp3  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ hasymp4  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ hasymp5  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ hasymp6  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ strsymp1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ strsymp2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ strsymp3 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ strsymp4 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ strsymp5 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ strsymp6 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ firstaid <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ aspirin  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ birthsex <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ somale   <dbl> NA, NA, NA, NA, NA, NA, 2, NA, NA, 2, 2, NA, NA, 2, NA, NA, N…
## $ sofemale <dbl> 2, 2, 2, 2, 2, 2, NA, 2, 2, NA, NA, 2, 2, NA, 2, 2, 2, NA, 9,…
## $ trnsgndr <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4…
## $ marijan1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ marjsmok <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ marjeat  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ marjvape <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ marjdab  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ marjothr <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ usemrjn4 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ acedeprs <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ acedrink <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ acedrugs <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ aceprisn <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ acedivrc <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ acepunch <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ acehurt1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ aceswear <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ acetouch <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ acetthem <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ acehvsex <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ aceadsaf <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ aceadned <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ imfvpla4 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ hpvadvc4 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ hpvadsht <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ tetanus1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ covidva1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ covacge1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ covidnu2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ lsatisfy <dbl> 2, 1, 2, 1, 2, 2, 2, 1, 2, 2, 2, 1, 1, NA, 1, 2, 3, 1, 4, 2, …
## $ emtsuprt <dbl> 1, 2, 4, 1, 2, 2, 1, 1, 1, 3, 1, 1, 2, NA, 1, 1, 2, 1, 4, 2, …
## $ sdlonely <dbl> 5, 3, 3, 3, 2, 3, 4, 5, 3, 3, 4, 5, 4, NA, 4, 3, 4, 5, 1, 2, …
## $ sdhemply <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, NA, 2, 2, 2, 2, 2, 2, …
## $ foodstmp <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, NA, 2, 2, 2, 2, 2, 2, …
## $ sdhfood1 <dbl> 5, 5, 5, 5, 4, 5, 5, 5, 5, 5, 4, 5, 5, NA, 5, 5, 5, 5, 5, 5, …
## $ sdhbills <dbl> 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, NA, 2, 2, 2, 2, 2, 2, …
## $ sdhutils <dbl> 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, NA, 2, 2, 2, 2, 2, 2, …
## $ sdhtrnsp <dbl> 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, NA, 2, 2, 2, 2, 2, 2, …
## $ sdhstre1 <dbl> 5, 5, 3, 5, 2, 3, 5, 5, 4, 4, 4, 4, 4, NA, 4, 5, 4, 5, 4, 4, …
## $ rrclass3 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ rrcognt2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ rrtreat  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ rratwrk2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ rrhcare4 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ rrphysm2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ rcsgend1 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ rcsxbrth <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ rcsrltn2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ casthdx2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ casthno2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ qstver   <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 1…
## $ qstlang  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ metstat  <dbl> 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2, 1…
## $ urbstat  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ mscode   <dbl> 1, 1, 3, 2, 5, 1, 2, 5, 3, 1, 1, 1, 1, 5, 1, 2, 5, 5, 5, 1, 1…
## $ ststr    <dbl> 11011, 11012, 11011, 11011, 11011, 11011, 11011, 11012, 11011…
## $ strwt    <dbl> 42.79158, 170.42939, 42.79158, 42.79158, 42.79158, 42.79158, …
## $ rawrake  <dbl> 2, 1, 1, 2, 1, 2, 1, 1, 1, 2, 2, 4, 2, 2, 2, 1, 1, 2, 1, 1, 1…
## $ wt2rake  <dbl> 85.58316, 170.42939, 42.79158, 85.58316, 42.79158, 85.58316, …
## $ imprace  <dbl> 1, 1, 2, 1, 1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2…
## $ chispnc  <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9…
## $ crace1   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ cageg    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ cllcpwt  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ dualuse  <dbl> 1, 1, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dualcor  <dbl> 0.5026388, 0.5026388, NA, 0.5026388, 0.5026388, 0.5026388, 0.…
## $ llcpwt2  <dbl> 941.1640, 1874.2239, 1151.6032, 941.1640, 470.5820, 941.1640,…
## $ llcpwt   <dbl> 605.4279, 1121.9927, 600.9633, 605.4279, 281.7110, 1060.3352,…
## $ rfhlth   <dbl> 1, 1, 2, 1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1…
## $ phys14d  <dbl> 1, 1, 2, 2, 1, 2, 2, 2, 2, 1, 1, 2, 1, 1, 1, 2, 3, 3, 1, 1, 2…
## $ ment14d  <dbl> 1, 1, 2, 1, 1, 2, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 3, 1, 1…
## $ hlthpl1  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ hcvu653  <dbl> 9, 9, 9, 9, 9, 1, 9, 9, 9, 9, 1, 9, 9, 9, 9, 9, 1, 9, 9, 9, 9…
## $ totinda  <dbl> 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 1, 2, 2, 1, 9, 2, 2, 1, 1…
## $ metvl12  <dbl> NA, 35, 35, 35, 35, 35, NA, NA, NA, 35, 35, NA, 35, NA, NA, 3…
## $ metvl22  <dbl> NA, NA, NA, NA, 33, NA, NA, NA, NA, NA, NA, NA, 45, NA, NA, N…
## $ maxvo21  <dbl> 1840, 1803, 1322, 1914, 1988, 2506, 1545, 1914, 1655, 1875, 2…
## $ fc601    <dbl> 315, 309, 227, 328, 341, 430, 265, 328, 284, 321, 472, 328, 3…
## $ actin13  <dbl> NA, 2, 2, 2, 2, 1, NA, NA, NA, 2, 1, NA, 1, NA, NA, 2, NA, NA…
## $ actin23  <dbl> NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA, NA, 2, NA, NA, NA,…
## $ padur1   <dbl> NA, 30, 15, 30, 45, 45, NA, NA, NA, 30, 30, NA, 45, NA, NA, 6…
## $ padur2   <dbl> NA, NA, NA, NA, 60, NA, NA, NA, NA, NA, NA, NA, 45, NA, NA, N…
## $ pafreq1  <dbl> NA, 6000, 1167, 3000, 2000, 2800, NA, NA, NA, 7000, 5000, NA,…
## $ pafreq2  <dbl> NA, NA, NA, NA, 7000, NA, NA, NA, NA, NA, NA, NA, 5000, NA, N…
## $ minac12  <dbl> NA, 180, 18, 90, 90, 126, NA, NA, NA, 210, 150, NA, 225, NA, …
## $ minac22  <dbl> NA, 0, 0, 0, 420, 0, NA, NA, NA, 0, NA, NA, 225, NA, NA, 0, N…
## $ strfreq  <dbl> 0, 0, 1167, 0, 0, 700, NA, 0, 0, 0, 5000, 0, 0, 2000, 2000, 0…
## $ pamiss3  <dbl> 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 9, 0, 0, 1, 1…
## $ pamin13  <dbl> NA, 360, 36, 180, 180, 126, NA, NA, NA, 420, 150, NA, 225, NA…
## $ pamin23  <dbl> NA, NA, NA, NA, 420, NA, NA, NA, NA, NA, NA, NA, 450, NA, NA,…
## $ pa3min   <dbl> NA, 360, 36, 180, 600, 126, NA, NA, NA, 420, 150, NA, 675, NA…
## $ pavig13  <dbl> NA, 180, 18, 90, 90, 0, NA, NA, NA, 210, 0, NA, 0, NA, NA, 12…
## $ pavig23  <dbl> NA, NA, NA, NA, 0, NA, NA, NA, NA, NA, NA, NA, 225, NA, NA, N…
## $ pa3vigm  <dbl> NA, 180, 18, 90, 90, 0, NA, NA, NA, 210, 0, NA, 225, NA, NA, …
## $ pacat3   <dbl> 4, 1, 9, 9, 1, 9, 4, 4, 4, 1, 9, 4, 1, 4, 4, 9, 9, 4, 4, 9, 9…
## $ paindx3  <dbl> 2, 1, 9, 1, 1, 9, 2, 2, 2, 1, 1, 2, 1, 2, 2, 1, 9, 2, 2, 1, 1…
## $ pa150r4  <dbl> 3, 1, 9, 1, 1, 9, 3, 3, 3, 1, 1, 3, 1, 3, 3, 1, 9, 3, 3, 1, 1…
## $ pa300r4  <dbl> 3, 1, 9, 9, 1, 9, 3, 3, 3, 1, 9, 3, 1, 3, 3, 9, 9, 3, 3, 9, 9…
## $ pa30023  <dbl> 2, 1, 9, 9, 1, 9, 2, 2, 2, 1, 9, 2, 1, 2, 2, 9, 9, 2, 2, 9, 9…
## $ pastrng  <dbl> 2, 2, 2, 2, 2, 2, 9, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 1…
## $ parec3   <dbl> 4, 2, 9, 2, 2, 9, 9, 4, 4, 2, 1, 4, 2, 3, 3, 2, 9, 4, 4, 2, 1…
## $ pastae3  <dbl> 2, 2, 9, 2, 2, 9, 9, 2, 2, 2, 1, 2, 2, 2, 2, 2, 9, 2, 2, 2, 1…
## $ rfhype6  <dbl> 2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2…
## $ cholch3  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ rfchol3  <dbl> 1, 2, 2, 1, 1, 2, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1…
## $ michd    <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ ltasth1  <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1…
## $ casthm1  <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 9, 1, 1, 1, 1…
## $ asthms1  <dbl> 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 9, 3, 3, 3, 3…
## $ drdxar2  <dbl> 2, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1…
## $ mrace1   <dbl> 1, 1, 2, 1, 1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2…
## $ hispanc  <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ race     <dbl> 1, 1, 2, 1, 1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2…
## $ raceg21  <dbl> 1, 1, 2, 1, 1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2…
## $ racegr3  <dbl> 1, 1, 2, 1, 1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2…
## $ raceprv  <dbl> 1, 1, 2, 1, 1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2…
## $ sex      <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 1, 2, 1, 2…
## $ ageg5yr  <dbl> 13, 13, 13, 12, 12, 9, 13, 12, 13, 12, 8, 12, 10, 12, 12, 13,…
## $ age65yr  <dbl> 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2…
## $ age80    <dbl> 80, 80, 80, 78, 76, 62, 80, 78, 80, 75, 59, 78, 68, 79, 79, 8…
## $ age_g    <dbl> 6, 6, 6, 6, 6, 5, 6, 6, 6, 6, 5, 6, 6, 6, 6, 6, 5, 6, 6, 6, 6…
## $ htin4    <dbl> 63, 57, 64, 66, 68, 62, 72, 59, 64, 70, 71, 65, 68, 73, 62, 6…
## $ htm4     <dbl> 160, 145, 163, 168, 173, 157, 183, 150, 163, 178, 180, 165, 1…
## $ wtkg3    <dbl> 7802, 5987, 5897, 7711, 7711, 7484, 8165, 6123, 8845, 7257, 1…
## $ bmi5     <dbl> 3047, 2856, 2231, 2744, 2585, 3018, 2441, 2727, 3347, 2296, 4…
## $ bmi5cat  <dbl> 4, 3, 2, 3, 3, 4, 2, 3, 4, 2, 4, 2, 3, 2, 3, 3, 4, 3, 3, 3, 4…
## $ rfbmi5   <dbl> 2, 2, 1, 2, 2, 2, 1, 2, 2, 1, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2…
## $ chldcnt  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ educag   <dbl> 3, 3, 2, 3, 3, 3, 2, 3, 3, 2, 3, 2, 4, 3, 3, 2, 4, 4, 3, 2, 2…
## $ incomg1  <dbl> 9, 9, 1, 9, 5, 5, 4, 9, 4, 5, 5, 9, 6, 6, 6, 2, 9, 9, 3, 4, 3…
## $ smoker3  <dbl> 4, 4, 3, 4, 4, 4, 3, 4, 4, 3, 4, 4, 3, 3, 3, 4, 4, 4, 1, 3, 3…
## $ rfsmok3  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1…
## $ cureci2  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ drnkany6 <dbl> 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 1, 2…
## $ drocdy4  <dbl> 0, 0, 0, 0, 7, 17, 0, 0, 0, 0, 0, 0, 29, 100, 0, 0, 0, 0, 33,…
## $ rfbing6  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ drnkwk2  <dbl> 0, 0, 0, 0, 47, 233, 0, 0, 0, 0, 0, 0, 200, 1400, 0, 0, 0, 0,…
## $ rfdrhv8  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ flshot7  <dbl> 2, 1, 1, 1, 2, NA, 1, 1, 1, 1, NA, 1, 1, 1, 1, 2, NA, 1, 2, 1…
## $ pneumo3  <dbl> 2, 1, 1, 1, 1, NA, 1, 1, 1, 1, NA, 2, 1, 1, 2, 1, NA, 2, 1, 1…
## $ aidtst4  <dbl> 2, 2, 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 9, 2, 2, 2, 1…
## $ rfseat2  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ rfseat3  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2…
## $ drnkdrv  <dbl> 9, 9, 9, 9, 2, 2, 9, 9, 9, 9, 9, 9, 2, 2, 9, 9, 9, 9, 2, 2, 9…

names(brfss_raw)

##   [1] "state"    "fmonth"   "idate"    "imonth"   "iday"     "iyear"   
##   [7] "dispcode" "seqno"    "psu"      "ctelenm1" "pvtresd1" "colghous"
##  [13] "statere1" "celphon1" "ladult1"  "numadult" "respslc1" "landsex2"
##  [19] "lndsxbrt" "safetime" "ctelnum1" "cellfon5" "cadult1"  "cellsex2"
##  [25] "celsxbrt" "pvtresd3" "cclghous" "cstate1"  "landline" "hhadult" 
##  [31] "sexvar"   "genhlth"  "physhlth" "menthlth" "poorhlth" "primins1"
##  [37] "persdoc3" "medcost1" "checkup1" "exerany2" "exract12" "exeroft1"
##  [43] "exerhmm1" "exract22" "exeroft2" "exerhmm2" "strength" "bphigh6" 
##  [49] "bpmeds1"  "cholchk3" "toldhi3"  "cholmed3" "cvdinfr4" "cvdcrhd4"
##  [55] "cvdstrk3" "asthma3"  "asthnow"  "chcscnc1" "chcocnc1" "chccopd3"
##  [61] "addepev3" "chckdny2" "havarth4" "diabete4" "diabage4" "marital" 
##  [67] "educa"    "renthom1" "numhhol4" "numphon4" "cpdemo1c" "veteran3"
##  [73] "employ1"  "children" "income3"  "pregnant" "weight2"  "height3" 
##  [79] "deaf"     "blind"    "decide"   "diffwalk" "diffdres" "diffalon"
##  [85] "fall12mn" "fallinj5" "smoke100" "smokday2" "usenow3"  "ecignow2"
##  [91] "alcday4"  "avedrnk3" "drnk3ge5" "maxdrnks" "flushot7" "flshtmy3"
##  [97] "pneuvac4" "shingle2" "hivtst7"  "hivtstd3" "seatbelt" "drnkdri2"
## [103] "covidpo1" "covidsm1" "covidact" "pdiabts1" "prediab2" "diabtype"
## [109] "insulin1" "chkhemo3" "eyeexam1" "diabeye1" "diabedu1" "feetsore"
## [115] "arthexer" "arthedu"  "lmtjoin3" "arthdis2" "joinpai2" "lcsfirst"
## [121] "lcslast"  "lcsnumcg" "lcsctsc1" "lcsscncr" "lcsctwhn" "hadmam"  
## [127] "howlong"  "cervscrn" "crvclcnc" "crvclpap" "crvclhpv" "hadhyst2"
## [133] "psatest1" "psatime1" "pcpsars2" "psasugs1" "pcstalk2" "hadsigm4"
## [139] "colnsigm" "colntes1" "sigmtes1" "lastsig4" "colncncr" "vircolo1"
## [145] "vclntes2" "smalstol" "stoltest" "stooldn2" "bldstfit" "sdnates1"
## [151] "cncrdiff" "cncrage"  "cncrtyp2" "csrvtrt3" "csrvdoc1" "csrvsum" 
## [157] "csrvrtrn" "csrvinst" "csrvinsr" "csrvdein" "csrvclin" "csrvpain"
## [163] "csrvctl2" "indortan" "numburn3" "sunprtct" "wkdayout" "wkendout"
## [169] "cimemlo1" "cdworry"  "cddiscu1" "cdhous1"  "cdsocia1" "caregiv1"
## [175] "crgvrel4" "crgvlng1" "crgvhrs1" "crgvprb3" "crgvalzd" "crgvper1"
## [181] "crgvhou1" "crgvexpt" "lastsmk2" "stopsmk2" "mentcigs" "mentecig"
## [187] "heattbco" "firearm5" "gunload"  "loadulk2" "hasymp1"  "hasymp2" 
## [193] "hasymp3"  "hasymp4"  "hasymp5"  "hasymp6"  "strsymp1" "strsymp2"
## [199] "strsymp3" "strsymp4" "strsymp5" "strsymp6" "firstaid" "aspirin" 
## [205] "birthsex" "somale"   "sofemale" "trnsgndr" "marijan1" "marjsmok"
## [211] "marjeat"  "marjvape" "marjdab"  "marjothr" "usemrjn4" "acedeprs"
## [217] "acedrink" "acedrugs" "aceprisn" "acedivrc" "acepunch" "acehurt1"
## [223] "aceswear" "acetouch" "acetthem" "acehvsex" "aceadsaf" "aceadned"
## [229] "imfvpla4" "hpvadvc4" "hpvadsht" "tetanus1" "covidva1" "covacge1"
## [235] "covidnu2" "lsatisfy" "emtsuprt" "sdlonely" "sdhemply" "foodstmp"
## [241] "sdhfood1" "sdhbills" "sdhutils" "sdhtrnsp" "sdhstre1" "rrclass3"
## [247] "rrcognt2" "rrtreat"  "rratwrk2" "rrhcare4" "rrphysm2" "rcsgend1"
## [253] "rcsxbrth" "rcsrltn2" "casthdx2" "casthno2" "qstver"   "qstlang" 
## [259] "metstat"  "urbstat"  "mscode"   "ststr"    "strwt"    "rawrake" 
## [265] "wt2rake"  "imprace"  "chispnc"  "crace1"   "cageg"    "cllcpwt" 
## [271] "dualuse"  "dualcor"  "llcpwt2"  "llcpwt"   "rfhlth"   "phys14d" 
## [277] "ment14d"  "hlthpl1"  "hcvu653"  "totinda"  "metvl12"  "metvl22" 
## [283] "maxvo21"  "fc601"    "actin13"  "actin23"  "padur1"   "padur2"  
## [289] "pafreq1"  "pafreq2"  "minac12"  "minac22"  "strfreq"  "pamiss3" 
## [295] "pamin13"  "pamin23"  "pa3min"   "pavig13"  "pavig23"  "pa3vigm" 
## [301] "pacat3"   "paindx3"  "pa150r4"  "pa300r4"  "pa30023"  "pastrng" 
## [307] "parec3"   "pastae3"  "rfhype6"  "cholch3"  "rfchol3"  "michd"   
## [313] "ltasth1"  "casthm1"  "asthms1"  "drdxar2"  "mrace1"   "hispanc" 
## [319] "race"     "raceg21"  "racegr3"  "raceprv"  "sex"      "ageg5yr" 
## [325] "age65yr"  "age80"    "age_g"    "htin4"    "htm4"     "wtkg3"   
## [331] "bmi5"     "bmi5cat"  "rfbmi5"   "chldcnt"  "educag"   "incomg1" 
## [337] "smoker3"  "rfsmok3"  "cureci2"  "drnkany6" "drocdy4"  "rfbing6" 
## [343] "drnkwk2"  "rfdrhv8"  "flshot7"  "pneumo3"  "aidtst4"  "rfseat2" 
## [349] "rfseat3"  "drnkdrv"

tibble(Metric = c("Observations", "Variables"),
       Value  = c(nrow(brfss_raw), ncol(brfss_raw))) |> 
  kable(caption = "Full Dataset Dimensions") |> 
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Full Dataset Dimensions
Metric	Value
Observations	433323
Variables	350

In the raw dataset, there are 433,323 (observations) rows and 350 (variables) columns.

Step 1.3: Select the variables listed in the table

# Select only the variables we want to work with
brfss_select <- brfss_raw |> 
  select(menthlth, physhlth, bmi5, sexvar, exerany2, ageg5yr, educa, genhlth, marital) 

tibble(Metric = c("Observations", "Variables"),
       Value  = c(nrow(brfss_select), ncol(brfss_select))) |> 
  kable(caption = "Only Necessary Dataset Dimensions") |> 
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Only Necessary Dataset Dimensions
Metric	Value
Observations	433323
Variables	9

# Save necessary data and variables as an RDS file
saveRDS(brfss_select,
  "brfss_2023_select.rds")

What variables did you select and why? After selecting only the variables needed for the homework, there are 433,323 observations and 9 variables (menthlth, psyhlth, bmi5, sexvar, exerany2, ageg5yr, educa, genhlth, and marital). I chose menthlth and bmi5 as my continous variables, sexvar and exerany2 as my binary, and ageg5yr, educa, genhlth, and marital as my categorical predictors. Mental health, exercise, education, and general health have consistently been demonstrated to be predictors in all of the other BRFSS labs that we have completed. I am always interested in how women and men differ hence picking sex. BMI always intrigues me because I do not believe it is the best scale to measure an individual’s health but I want to understand more how it can influence physical distress especially in someone who may be considered overweight. As we get older, overall health appears to decline suggesting it could be a good predictor to analyze. Also we have discussed constantly that married individuals are the healthiest, so I was curious to see how that would impact this overall relationship.

Step 2: Recode all variables as described. Store all categorical variables as labeled factors and set appropriate reference levels

# Recode all the variables selected
brfss_recode <- brfss_select |> 
  mutate(
    # Outcome: Mentally unhealthy days in past 30 
    menthlth_days = case_when(
      menthlth == 88 ~ 0,
      menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
      menthlth %in% c(77, 99) ~ NA_real_,
      TRUE ~ NA_real_
    ),
    # Physical health days
    physhlth_days = case_when(
      physhlth == 88 ~ 0,
      physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
      physhlth %in% c(77, 99) ~ NA_real_,
      TRUE ~ NA_real_
    ),
    # BMI
    bmi = case_when(
      bmi5 == 9999 ~ NA_real_,
      TRUE ~ bmi5 / 100
    ),
    # Sex
    sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
    # Exercise in past 30 days
    exercise = factor(
      case_when(
        exerany2 == 2 ~ "No",
        exerany2 == 1 ~ "Yes",
        exerany2 %in% c(7, 9) ~ NA_character_,
        TRUE ~ NA_character_
      ),
      levels = c("No", "Yes")
    ),
    # Age groups
    age_group = factor(
      case_when(
        ageg5yr %in% c(1, 2, 3) ~ "18-34",
        ageg5yr %in% c(4, 5, 6) ~ "35-49",
        ageg5yr %in% c(7, 8, 9) ~ "50-64",
        ageg5yr %in% c(10, 11, 12, 13) ~ "65+",
        ageg5yr == 14 ~ NA_character_,
        TRUE ~ NA_character_
      ),
      levels = c("18-34", "35-49", "50-64", "65+")
    ),
    # Education 
    education = factor(
      case_when(
        educa %in% c(1, 2, 3) ~ "Less than HS",
        educa == 4 ~ "HS/GED",
        educa == 5 ~ "Some College",
        educa == 6 ~ "College Graduate",
        educa == 9 ~ NA_character_,
        TRUE ~ NA_character_
      ),
      levels = c("Less than HS", "HS/GED", "Some College", "College Graduate")
    ),
    # General Health Status 
    gen_health = factor(
      case_when(
        genhlth %in% c(1, 2) ~ "Excellent/Very Good",
        genhlth == 3 ~ "Good",
        genhlth %in% c(4, 5) ~ "Fair/Poor",
        genhlth %in% c(7, 9) ~ NA_character_,
        TRUE ~ NA_character_
      ),
      levels = c("Excellent/Very Good", "Good", "Fair/Poor")
    ),
    # Marital Status
    marital = factor(
      case_when(
        marital %in% c(1, 6) ~ "Married/Partnered",
        marital %in% c(2, 3, 4) ~ "Previously married",
        marital == 5 ~ "Never married",
        marital == 9 ~ NA_character_,
        TRUE ~ NA_character_
      ),
      levels = c("Married/Partnered", "Previously married", "Never married")
    )
  )

Step 3.1: Create the analytic dataset and report a missing descriptive summary

brfss_analytic <- brfss_recode |> 
  select (menthlth_days, physhlth_days, bmi, sex, exercise, age_group, education, gen_health, marital)

# Report missing values for the key variables and print out total sample size
missing_summary <- data.frame(
  Variable = names(brfss_analytic),
  Missing_Count = colSums(is.na(brfss_analytic)),
  Missing_Percent = round(colSums(is.na(brfss_analytic)) / nrow(brfss_analytic) * 100, 2)
)

# Show variables with the most missing data
print(missing_summary[order(-missing_summary$Missing_Count), ][1:9, ]) |>  
  kable(
    caption = "Variables by Missingness"
  ) |> 
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE,
    position = "center"
  )

##                    Variable Missing_Count Missing_Percent
## bmi                     bmi         40535            9.35
## physhlth_days physhlth_days         10785            2.49
## menthlth_days menthlth_days          8108            1.87
## age_group         age_group          7779            1.80
## marital             marital          4289            0.99
## education         education          2325            0.54
## gen_health       gen_health          1262            0.29
## exercise           exercise          1251            0.29
## sex                     sex             0            0.00

Variables by Missingness
	Variable	Missing_Count	Missing_Percent
bmi	bmi	40535	9.35
physhlth_days	physhlth_days	10785	2.49
menthlth_days	menthlth_days	8108	1.87
age_group	age_group	7779	1.80
marital	marital	4289	0.99
education	education	2325	0.54
gen_health	gen_health	1262	0.29
exercise	exercise	1251	0.29
sex	sex	0	0.00

Physhlth_days had 10,785 observations missing resulting in 2.49% of missingness. BMI had 40,535 observations missing resulting in 9.35% of missingness. Menthlth_days had 8,108 observations missing resulting in 1.87% missingness.

Step 3.2: Create the analytic dataset using drop_na() and slice_sample() with set.seed(1220). Report the final analytic n.

# Create analytic dataset
set.seed(1220)
brfss_analytic <- brfss_analytic |> 
  drop_na() |> 
  slice_sample(n = 8000)

# Display subset dimensions
cat("Working subset dimensions:",
    nrow(brfss_analytic), "observations,",
    ncol(brfss_analytic), "variables\n")

## Working subset dimensions: 8000 observations, 9 variables

# Save for later use
saveRDS(brfss_analytic,
  "brfss_2023_analytic.rds")

# Print out total sample size
cat("Final Analytic N:", nrow(brfss_analytic), "\n")

## Final Analytic N: 8000

tibble(Metric = c("Observations", "Variables"),
       Value  = c(nrow(brfss_analytic), ncol(brfss_analytic))) |> 
  kable(caption = "Analytic Dataset Dimensions") |> 
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Analytic Dataset Dimensions
Metric	Value
Observations	8000
Variables	9

The final analytic n is 8,000 with 9 recoded variables (menthlth_days, physhlth_days, bmi, sex, exercise, age_group, education, gen_health, marital).

Section 3.3: Produce a descriptive statistics table using gtsummary::tbl_summary()

# Create Table 1 Descriptive Statistics
brfss_analytic |>  
  select(menthlth_days, physhlth_days, bmi, sex, exercise, age_group, education, gen_health, marital) |>  
  tbl_summary(
    type = list(exercise ~ "categorical",
                sex ~ "categorical"),
    label = list(
      menthlth_days ~ "Mentally Unhealthy Days (Last 30 Days)",
      physhlth_days ~ "Physically Unhealthy Days (Last 30 Days)",
      bmi ~ "Body Mass Index (kg/m2)",
      sex ~ "Sex",
      exercise ~ "Exercise Status",
      age_group ~ "Age Category",
      education ~ "Education Category",
      gen_health ~ "General Health Status",
      marital ~ "Marital Status"
    ),
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) |> 
  add_n() |> 
  bold_labels() |> 
  italicize_levels() |> 
  modify_caption("**Table 1. Descriptive Statistics — BRFSS 2023 Analytic Sample (n = 8,000)**") |>  
  modify_footnote(
    all_stat_cols() ~ 
      "Mean (SD); n(%); Analytic sample includes 8,000 complete-case observations. Missing values were excluded listwise for all variables included in this table."
  ) |> 
  as_flex_table()

**Table 1. Descriptive Statistics — BRFSS 2023 Analytic Sample (n = 8,000)**
Characteristic	N	N = 8,0001
Mentally Unhealthy Days (Last 30 Days)	8,000	4.3 (8.2)
Physically Unhealthy Days (Last 30 Days)	8,000	4.5 (8.8)
Body Mass Index (kg/m2)	8,000	28.5 (6.7)
Sex	8,000
Male		3,899 (49%)
Female		4,101 (51%)
Exercise Status	8,000
No		1,880 (24%)
Yes		6,120 (77%)
Age Category	8,000
18-34		1,355 (17%)
35-49		1,522 (19%)
50-64		2,108 (26%)
65+		3,015 (38%)
Education Category	8,000
Less than HS		435 (5.4%)
HS/GED		1,988 (25%)
Some College		2,097 (26%)
College Graduate		3,480 (44%)
General Health Status	8,000
Excellent/Very Good		3,954 (49%)
Good		2,576 (32%)
Fair/Poor		1,470 (18%)
Marital Status	8,000
Married/Partnered		4,533 (57%)
Previously married		1,995 (25%)
Never married		1,472 (18%)
1Mean (SD); n(%); Analytic sample includes 8,000 complete-case observations. Missing values were excluded listwise for all variables included in this table.

Part 1: Model Selection for Linear Regression (35 points)

Research Question: What individual, behavioral, and health factors are associated with physical health burden among U.S. adults? How do findings compare when the outcome is modeled as the number of physically unhealthy days (continuous) versus as an indicator of frequent physical distress (binary: 14 or more days in the past 30)?

Step 1: Fit the Maximum Model (5 Points)

# Fitting maximum linear regression 
max_model <- lm(physhlth_days ~ menthlth_days + bmi + sex + exercise + age_group + education + gen_health + marital, data = brfss_analytic)

tidy(max_model, conf.int = TRUE) |> 
    mutate(
    across(where(is.numeric), ~ round(., 3))
  ) |> 
  kable(
    caption  = "Table 3. Maximum Model Linear Regression: Physically Unhealthy Days ~ All Predictors (BRFSS 2023, n = 8,000)",
    col.names = c("Term", "Estimate(β)", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper")
  ) |> 
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 3. Maximum Model Linear Regression: Physically Unhealthy Days ~ All Predictors (BRFSS 2023, n = 8,000)
Term	Estimate(β)	Std. Error	t-statistic	p-value	95% CI Lower	95% CI Upper
(Intercept)	1.335	0.589	2.268	0.023	0.181	2.489
menthlth_days	0.205	0.011	19.274	0.000	0.184	0.226
bmi	0.022	0.013	1.712	0.087	-0.003	0.046
sexFemale	0.070	0.165	0.424	0.672	-0.253	0.393
exerciseYes	-1.759	0.205	-8.561	0.000	-2.162	-1.356
age_group35-49	0.178	0.288	0.618	0.537	-0.387	0.743
age_group50-64	0.919	0.281	3.274	0.001	0.369	1.470
age_group65+	1.505	0.278	5.419	0.000	0.960	2.049
educationHS/GED	-0.317	0.386	-0.822	0.411	-1.073	0.439
educationSome College	-0.189	0.386	-0.488	0.626	-0.946	0.569
educationCollege Graduate	-0.190	0.379	-0.501	0.616	-0.934	0.553
gen_healthGood	1.285	0.190	6.758	0.000	0.912	1.658
gen_healthFair/Poor	10.349	0.251	41.219	0.000	9.857	10.842
maritalPreviously married	0.108	0.204	0.531	0.595	-0.292	0.508
maritalNever married	-0.244	0.242	-1.006	0.314	-0.719	0.231

# See AIC and BIC of the full model
AIC(max_model)

## [1] 54378.57

BIC(max_model)

## [1] 54490.37

# Report R-squared, adjusted r-squared, AIC, and BIC in one table
glance(max_model) |>
  dplyr::select(r.squared, adj.r.squared, sigma, AIC, BIC, df.residual) |>
  mutate(across(everything(), \(x) round(x, 3))) |>
  kable(caption = "Maximum Model: Fit Statistics") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Maximum Model: Fit Statistics
r.squared	adj.r.squared	sigma	AIC	BIC	df.residual
0.323	0.322	7.233	54378.57	54490.36	7985

R-squared = 0.323 It appears that the full model with all predictors explains 32.3% of the variance in physically unhealthy days. R-squared is not the best to use because it always increases when we add a predictor to the mode, even if it random noise.

Adjusted R-squared = 0.322 It appears that the full model with all predictors explains 32.2% of the variance in physically unhealthy days when using adjusted r-squared which can penalize for the number of predictors and will only increase when a new predictor improves the model by more than chance alone.

AIC = According to AIC, the model produces a 54,378.57. The lower the AIC, the better for model fit.

BIC = According to BIC, the model produces a 54,490.36. Similar to AIC, the lower the BIC, the better for model fit. —

Step 2: Best Subsets Regression (10 points)

# Prepare a model matrix (need numeric predictors for leaps)
# Use the formula interface approach
best_subsets <- regsubsets(
  physhlth_days ~ menthlth_days + bmi + sex + exercise + age_group + education + gen_health + marital, data = brfss_analytic,
  nvmax = 15,      # maximum number of variables to consider
  method = "exhaustive"
)

best_summary <- summary(best_subsets)

subset_metrics <- tibble(
  p = 1:length(best_summary$adjr2),
  `Adj. R²` = best_summary$adjr2,
  BIC = best_summary$bic,
  Cp = best_summary$cp
)

# Create a summary table showing Adjusted R-squared and BIC across model sizes (number of predictors)
models <- list(
  "Mental health"     = lm(physhlth_days ~ menthlth_days, data = brfss_analytic),
  "+ bmi"             = lm(physhlth_days ~ menthlth_days + bmi, data = brfss_analytic),
  "+ sex"             = lm(physhlth_days ~ menthlth_days + bmi + sex, data = brfss_analytic),
  "+ exercise"        = lm(physhlth_days ~ menthlth_days + bmi + sex + exercise, data = brfss_analytic),
  "+ age_group"       = lm(physhlth_days ~ menthlth_days + bmi + sex + exercise + age_group, data = brfss_analytic),
  "+ education"       = lm(physhlth_days ~ menthlth_days + bmi + sex + exercise + age_group + education, data = brfss_analytic),
  "+ general health"  = lm(physhlth_days ~ menthlth_days + bmi + sex + exercise + age_group + education + gen_health, data = brfss_analytic),
  "+ marital (full)"  = lm(physhlth_days ~ menthlth_days + bmi + sex + exercise + age_group + education + gen_health + marital, data = brfss_analytic)
)

comp_table <- map_dfr(names(models), \(name) {
  g <- glance(models[[name]])
  tibble(
    Model = name,
    p = length(coef(models[[name]])) - 1,
    `Adj. R²` = round(g$adj.r.squared, 4),
    BIC = round(g$BIC, 1)
  )
})

comp_table |>
  kable(caption = "Model Comparison: Which Variables Add") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Model Comparison: Which Variables Add
Model	p	Adj. R²	BIC
Mental health	1	0.0982	56667.7
bmi	2	0.1084	56585.0
sex	3	0.1085	56591.8
exercise	4	0.1455	56260.7
age_group	7	0.1637	56112.6
education	10	0.1674	56101.1
general health	12	0.3220	54474.0
marital (full)	14	0.3220	54490.4

Identify: The model size that maximizes Adjusted R-squared

If we look at the model comparison 12 parameters is the model size that appears to maximize adjusted R-squared. We see that when adding general health (p = 12), the adjusted R-Squared is 0.3220, however, when we add marital (p = 14), the adjusted R-squared is also 0.3220, suggesting that both of these could be considered a model size that maximizes adjusted R-squared.

Identify: The model size that minimizes BIC

If we again look at the model comparisons, 12 parameters is the model size that appears to maximize BIC. This matches the first adjusted R-squared with 12 parameters and demonstrates the lowest BIC of 54,474.0. Adding marital status (p = 14) makes the BIC go up to 54,490.4.

In 2-3 sentences: Describe what you observe: Do the two criteria agree on the best model size? If not, which criterion favors a simpler model

The two criteria do agree to a point on the best model size. If we go by the first set of parameters that give us the lowest adjusted R-squared and lowest BIC, then the two criteria agree on the best model size. They both suggest that 12 parameters is the best model size to produce an adjusted R-squared of 0.3220 and a BIC of 54,474.0 which is the lowest in the model comparison. If we include 14 parameters, the adjsuted R-squared remains the same however our BIC increases to 54,490.4.

Step 3: Stepwise Selection Methods (10 points)

Section 1: Backward Elimination

# Perform backward elimination 
mod_backward <- step(max_model, direction = "backward", trace = 1)

## Start:  AIC=31673.55
## physhlth_days ~ menthlth_days + bmi + sex + exercise + age_group + 
##     education + gen_health + marital
## 
##                 Df Sum of Sq    RSS   AIC
## - education      3        44 417795 31668
## - marital        2        82 417833 31671
## - sex            1         9 417760 31672
## <none>                       417751 31674
## - bmi            1       153 417905 31674
## - age_group      3      2301 420052 31711
## - exercise       1      3835 421586 31745
## - menthlth_days  1     19435 437186 32035
## - gen_health     2     94594 512345 33302
## 
## Step:  AIC=31668.4
## physhlth_days ~ menthlth_days + bmi + sex + exercise + age_group + 
##     gen_health + marital
## 
##                 Df Sum of Sq    RSS   AIC
## - marital        2        84 417879 31666
## - sex            1        10 417805 31667
## <none>                       417795 31668
## - bmi            1       153 417948 31669
## - age_group      3      2301 420096 31706
## - exercise       1      3918 421713 31741
## - menthlth_days  1     19459 437254 32031
## - gen_health     2     96640 514435 33329
## 
## Step:  AIC=31666
## physhlth_days ~ menthlth_days + bmi + sex + exercise + age_group + 
##     gen_health
## 
##                 Df Sum of Sq    RSS   AIC
## - sex            1        15 417894 31664
## <none>                       417879 31666
## - bmi            1       150 418029 31667
## - age_group      3      3198 421077 31721
## - exercise       1      3924 421803 31739
## - menthlth_days  1     19538 437417 32030
## - gen_health     2     97825 515704 33345
## 
## Step:  AIC=31664.28
## physhlth_days ~ menthlth_days + bmi + exercise + age_group + 
##     gen_health
## 
##                 Df Sum of Sq    RSS   AIC
## <none>                       417894 31664
## - bmi            1       149 418043 31665
## - age_group      3      3250 421144 31720
## - exercise       1      3971 421864 31738
## - menthlth_days  1     19882 437776 32034
## - gen_health     2     97843 515737 33343

# Create a pretty table 
tidy(mod_backward, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Backward Elimination Result (AIC-based)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Backward Elimination Result (AIC-based)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	1.0211	0.4526	2.2562	0.0241	0.1339	1.9082
menthlth_days	0.2057	0.0106	19.4985	0.0000	0.1851	0.2264
bmi	0.0212	0.0126	1.6870	0.0916	-0.0034	0.0459
exerciseYes	-1.7623	0.2022	-8.7137	0.0000	-2.1588	-1.3659
age_group35-49	0.3061	0.2723	1.1241	0.2610	-0.2277	0.8398
age_group50-64	1.0723	0.2565	4.1804	0.0000	0.5695	1.5751
age_group65+	1.6841	0.2447	6.8820	0.0000	1.2044	2.1638
gen_healthGood	1.2797	0.1888	6.7785	0.0000	0.9096	1.6498
gen_healthFair/Poor	10.3518	0.2463	42.0277	0.0000	9.8690	10.8347

Backward elimation contains menthlth_days, bmi, exerciseYes, age_group35-49, age_group50-64, age_group65+, gen_healthGood, and gen_healthFair/Poor.

Section 2: Forward Selection

# Perform forward elimination 
# Automated forward selection using AIC
mod_null <- lm(physhlth_days ~ 1, data = brfss_analytic)

mod_forward <- step(mod_null,
                    scope = list(lower = mod_null, upper = max_model),
                    direction = "forward", trace = 1)

## Start:  AIC=34767.94
## physhlth_days ~ 1
## 
##                 Df Sum of Sq    RSS   AIC
## + gen_health     2    172936 444258 32142
## + menthlth_days  1     60700 556494 33942
## + exercise       1     36694 580500 34280
## + education      3     10632 606562 34635
## + bmi            1      9513 607682 34646
## + marital        2      8185 609009 34665
## + age_group      3      6839 610355 34685
## + sex            1      1231 615963 34754
## <none>                       617194 34768
## 
## Step:  AIC=32141.72
## physhlth_days ~ gen_health
## 
##                 Df Sum of Sq    RSS   AIC
## + menthlth_days  1   18156.4 426102 31810
## + exercise       1    5642.9 438615 32041
## + age_group      3     984.8 443273 32130
## + sex            1     646.6 443612 32132
## + marital        2     670.1 443588 32134
## + bmi            1     361.0 443897 32137
## <none>                       444258 32142
## + education      3     175.3 444083 32145
## 
## Step:  AIC=31809.89
## physhlth_days ~ gen_health + menthlth_days
## 
##             Df Sum of Sq    RSS   AIC
## + exercise   1    4865.1 421237 31720
## + age_group  3    3865.5 422236 31743
## + marital    2    1206.0 424896 31791
## + bmi        1     291.1 425811 31806
## + sex        1     163.4 425938 31809
## <none>                   426102 31810
## + education  3     112.8 425989 31814
## 
## Step:  AIC=31720.03
## physhlth_days ~ gen_health + menthlth_days + exercise
## 
##             Df Sum of Sq    RSS   AIC
## + age_group  3    3194.2 418043 31665
## + marital    2    1020.6 420216 31705
## <none>                   421237 31720
## + bmi        1      93.0 421144 31720
## + sex        1      63.9 421173 31721
## + education  3      64.9 421172 31725
## 
## Step:  AIC=31665.13
## physhlth_days ~ gen_health + menthlth_days + exercise + age_group
## 
##             Df Sum of Sq    RSS   AIC
## + bmi        1   148.834 417894 31664
## <none>                   418043 31665
## + sex        1    13.102 418029 31667
## + marital    2    86.081 417956 31667
## + education  3    45.539 417997 31670
## 
## Step:  AIC=31664.28
## physhlth_days ~ gen_health + menthlth_days + exercise + age_group + 
##     bmi
## 
##             Df Sum of Sq    RSS   AIC
## <none>                   417894 31664
## + sex        1    14.648 417879 31666
## + marital    2    89.092 417805 31667
## + education  3    45.895 417848 31669

# Create a pretty table
tidy(mod_forward, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Forward Selection Result (AIC-based)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Forward Selection Result (AIC-based)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	1.0211	0.4526	2.2562	0.0241	0.1339	1.9082
gen_healthGood	1.2797	0.1888	6.7785	0.0000	0.9096	1.6498
gen_healthFair/Poor	10.3518	0.2463	42.0277	0.0000	9.8690	10.8347
menthlth_days	0.2057	0.0106	19.4985	0.0000	0.1851	0.2264
exerciseYes	-1.7623	0.2022	-8.7137	0.0000	-2.1588	-1.3659
age_group35-49	0.3061	0.2723	1.1241	0.2610	-0.2277	0.8398
age_group50-64	1.0723	0.2565	4.1804	0.0000	0.5695	1.5751
age_group65+	1.6841	0.2447	6.8820	0.0000	1.2044	2.1638
bmi	0.0212	0.0126	1.6870	0.0916	-0.0034	0.0459

Forward selection also contains menthlth_days, bmi, exerciseYes, age_group35-49, age_group50-64, age_group65+, gen_healthGood, and gen_healthFair/Poor.

Section 3: Stepwise Selection

# Perform stepwise elimination 
mod_stepwise <- step(mod_null,
                     scope = list(lower = mod_null, upper = max_model),
                     direction = "both", trace = 1)

## Start:  AIC=34767.94
## physhlth_days ~ 1
## 
##                 Df Sum of Sq    RSS   AIC
## + gen_health     2    172936 444258 32142
## + menthlth_days  1     60700 556494 33942
## + exercise       1     36694 580500 34280
## + education      3     10632 606562 34635
## + bmi            1      9513 607682 34646
## + marital        2      8185 609009 34665
## + age_group      3      6839 610355 34685
## + sex            1      1231 615963 34754
## <none>                       617194 34768
## 
## Step:  AIC=32141.72
## physhlth_days ~ gen_health
## 
##                 Df Sum of Sq    RSS   AIC
## + menthlth_days  1     18156 426102 31810
## + exercise       1      5643 438615 32041
## + age_group      3       985 443273 32130
## + sex            1       647 443612 32132
## + marital        2       670 443588 32134
## + bmi            1       361 443897 32137
## <none>                       444258 32142
## + education      3       175 444083 32145
## - gen_health     2    172936 617194 34768
## 
## Step:  AIC=31809.89
## physhlth_days ~ gen_health + menthlth_days
## 
##                 Df Sum of Sq    RSS   AIC
## + exercise       1      4865 421237 31720
## + age_group      3      3866 422236 31743
## + marital        2      1206 424896 31791
## + bmi            1       291 425811 31806
## + sex            1       163 425938 31809
## <none>                       426102 31810
## + education      3       113 425989 31814
## - menthlth_days  1     18156 444258 32142
## - gen_health     2    130393 556494 33942
## 
## Step:  AIC=31720.03
## physhlth_days ~ gen_health + menthlth_days + exercise
## 
##                 Df Sum of Sq    RSS   AIC
## + age_group      3      3194 418043 31665
## + marital        2      1021 420216 31705
## <none>                       421237 31720
## + bmi            1        93 421144 31720
## + sex            1        64 421173 31721
## + education      3        65 421172 31725
## - exercise       1      4865 426102 31810
## - menthlth_days  1     17379 438615 32041
## - gen_health     2    108844 530080 33555
## 
## Step:  AIC=31665.13
## physhlth_days ~ gen_health + menthlth_days + exercise + age_group
## 
##                 Df Sum of Sq    RSS   AIC
## + bmi            1       149 417894 31664
## <none>                       418043 31665
## + sex            1        13 418029 31667
## + marital        2        86 417956 31667
## + education      3        46 417997 31670
## - age_group      3      3194 421237 31720
## - exercise       1      4194 422236 31743
## - menthlth_days  1     19893 437935 32035
## - gen_health     2    100730 518772 33388
## 
## Step:  AIC=31664.28
## physhlth_days ~ gen_health + menthlth_days + exercise + age_group + 
##     bmi
## 
##                 Df Sum of Sq    RSS   AIC
## <none>                       417894 31664
## - bmi            1       149 418043 31665
## + sex            1        15 417879 31666
## + marital        2        89 417805 31667
## + education      3        46 417848 31669
## - age_group      3      3250 421144 31720
## - exercise       1      3971 421864 31738
## - menthlth_days  1     19882 437776 32034
## - gen_health     2     97843 515737 33343

# Create a pretty table
tidy(mod_stepwise, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Stepwise Selection Result (AIC-based)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Stepwise Selection Result (AIC-based)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	1.0211	0.4526	2.2562	0.0241	0.1339	1.9082
gen_healthGood	1.2797	0.1888	6.7785	0.0000	0.9096	1.6498
gen_healthFair/Poor	10.3518	0.2463	42.0277	0.0000	9.8690	10.8347
menthlth_days	0.2057	0.0106	19.4985	0.0000	0.1851	0.2264
exerciseYes	-1.7623	0.2022	-8.7137	0.0000	-2.1588	-1.3659
age_group35-49	0.3061	0.2723	1.1241	0.2610	-0.2277	0.8398
age_group50-64	1.0723	0.2565	4.1804	0.0000	0.5695	1.5751
age_group65+	1.6841	0.2447	6.8820	0.0000	1.2044	2.1638
bmi	0.0212	0.0126	1.6870	0.0916	-0.0034	0.0459

Stepwise selection also contains menthlth_days, bmi, exerciseYes, age_group35-49, age_group50-64, age_group65+, gen_healthGood, and gen_healthFair/Poor.

In 3-5 sentences: Compare the results: Do the three methods agree on the same final model? Which variables (if any) were excluded by all three methods? Which variables were retained by all three methods?

Yes, the backward elimination, forward selection, and stepwise selection methods do agree on the same final model. Marital, education, and sex, were excluded by all three models. All three models retained menthlth_days, bmi, exercise, age_group, and gen_health suggesting this may be the best set of predictors to use for the final model selection.

Step 4: Final Model Selection and Interpretation (5 points)

# Create table to compare all the models for model selection 
model_comparison <- tribble(
  ~Method, ~`Variables selected`, ~`Adj. R²`, ~AIC, ~BIC,
  "Maximum model",
    length(coef(max_model)) - 1,
    round(glance(max_model)$adj.r.squared, 4),
    round(AIC(max_model), 1),
    round(BIC(max_model), 1),
  "Backward Elimination",
    length(coef(mod_backward)) - 1,
    round(glance(mod_backward)$adj.r.squared, 4),
    round(AIC(mod_backward), 1),
    round(BIC(mod_backward), 1),
  "Forward Selection",
    length(coef(mod_forward)) - 1,
    round(glance(mod_forward)$adj.r.squared, 4),
    round(AIC(mod_forward), 1),
    round(BIC(mod_forward), 1),
  "Stepwise Selection",
    length(coef(mod_stepwise)) - 1,
    round(glance(mod_stepwise)$adj.r.squared, 4),
    round(AIC(mod_stepwise), 1),
    round(BIC(mod_stepwise), 1)
)

model_comparison |> 
  kable(caption = "Model Comparison: R-Squared, Adjusted R-Squared AIC, BIC",
        digits = 3,
        align = "lrrr") |> 
  kable_styling(bootstrap_options = c("striped", "hover"),
                full_width = FALSE)

Model Comparison: R-Squared, Adjusted R-Squared AIC, BIC
Method	Variables selected	Adj. R²	AIC	BIC
Maximum model	14	0.322	54378.6	54490.4
Backward Elimination	8	0.322	54369.3	54439.2
Forward Selection	8	0.322	54369.3	54439.2
Stepwise Selection	8	0.322	54369.3	54439.2

Why this Final Model: I chose this final model which contains menthlth_days, bmi, exercise, age_group, and gen_health after creating a maximum model and comparing it to backward elimination, forward selection, and stepwise selection. For the maximum model with 14 parameters, the adjusted R-squared is 0.322 which is exactly the same output as the backward elimination, forward selection, and stepwise selection. However, in these three models, there are only 8 parameters included which demonstrates an overall lowered AIC of 54,369.3 and BIC of 54,439.2 when compared to the maximum model which had an AIC of 54,378.6 and 54,490.4. I decided to choose the most parsimonious model which contains less parameters but maximizes adjusted R-squared, AIC, and BIC.

# Fitting final model based on all selection methods
final_model <- lm(physhlth_days ~ menthlth_days + bmi + exercise + age_group + gen_health, data = brfss_analytic)

tidy(final_model, conf.int = TRUE) |> 
  mutate(
    across(where(is.numeric), ~ round(., 3))
  ) |> 
  kable(
    caption  = "Table 4. Final Model Linear Regression: Physically Unhealthy Days ~ menthlth_days, bmi, exercise, age_group, and gen_health (BRFSS 2023, n = 8,000)",
    col.names = c("Term", "Estimate(β)", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper")
  ) |> 
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 4. Final Model Linear Regression: Physically Unhealthy Days ~ menthlth_days, bmi, exercise, age_group, and gen_health (BRFSS 2023, n = 8,000)
Term	Estimate(β)	Std. Error	t-statistic	p-value	95% CI Lower	95% CI Upper
(Intercept)	1.021	0.453	2.256	0.024	0.134	1.908
menthlth_days	0.206	0.011	19.499	0.000	0.185	0.226
bmi	0.021	0.013	1.687	0.092	-0.003	0.046
exerciseYes	-1.762	0.202	-8.714	0.000	-2.159	-1.366
age_group35-49	0.306	0.272	1.124	0.261	-0.228	0.840
age_group50-64	1.072	0.257	4.180	0.000	0.569	1.575
age_group65+	1.684	0.245	6.882	0.000	1.204	2.164
gen_healthGood	1.280	0.189	6.778	0.000	0.910	1.650
gen_healthFair/Poor	10.352	0.246	42.028	0.000	9.869	10.835

Interpret the coefficients: for at least three predictors in plain language, including units and “holding all other variables constant” language. Include at least one continuous and one categorical predictor.

Menthlth_days (0.206): For every additional mentally unhealthy day, physically unhealthy days tend to increase by 0.206 days on average holding all other variables constant.

Exercise(Yes) (-1.762): Individuals who exercise report 1.76 fewer physically unhealthy days compared to those who do not exercise holding all other variables constant.

Age_group35-49 (0.306): Individuals who are aged 35-49 report 0.306 more physically unhealthy days compared to those who are 18 to 34 holding all other variables constant.

Step 5: LINE Assumption Check (5 points)

# Create the four base R diagnostic plots
par(mfrow = c(2, 2))
plot(final_model)

par(mfrow = c(1, 1))

Residuals vs. Fitted: Is there evidence of non-linearity or non-constant variance?

There is slight evidence of non-linearity because the red line is not entirely flat around 0 with a slight dip in the middle. The residuals could slightly be demonstrating a funnel shape which would suggest a non-constant variance.

Normal Q-Q: Do the residuals approximately follow a normal distribution? Where do they deviate?

The residuals do not follow a normal distribution along the 45 degree reference line. The points deviate at both ends with heavy tails/s-curves.

Scale-Location: Is the spread of residuals roughly constant across fitted values?

The spread of residuals is not roughly constant across the fitted values. If it was, the red line would be flat, in our scale-location, a curved line is apparent. The residuals do form a funnel shape further emphasizing non-constant variance.

Residuals vs. Leverage: Are there any highly influential observations (large Cook’s distance)?

There do appear to be a couple of highly influential observations including one value at 2445 and one at 2831.

Write a brief conclusion (2-3 sentences): Are the LINE assumptions reasonable satisfied? What violations, if any, do you observe?

The LINE assumptions do not seem to be reasonable satisfied as every one of the graphs depicts violations. There appears to be non-linearity, non-normal distribution of residuals, non-constant spread of residuals across fitted values, and some highly influential observations.

Part 2: Logistic Regression (40 points)

Step 1: Create the Binary Outcome (5 points):

• 1 if physhlth_days >= 14. • 0 if physhlth_days < 14 • Store as factor with levels “No” (reference) and “Yes” • Report prevalence of frequent physical distress (count and percentage in each category)

brfss_analytic <- brfss_analytic |> 
  mutate(
      frequent_distress = factor(
      case_when(
        physhlth_days < 14 ~ "No",
        physhlth_days >= 14 ~ "Yes",
        TRUE ~ NA_character_
      ),
      levels = c("No", "Yes")
    ))

# Prevalence of frequent physical distress (count and percentage) 
 brfss_analytic|>  
  select(frequent_distress) |>  
  tbl_summary(
    type = list(frequent_distress ~ "categorical"),
    label = list(
      frequent_distress ~ "Frequent Physical Distress"
    ),
    statistic = list(
      all_categorical() ~ "{n} ({p}%)"
    ),
    missing = "no"
  ) |> 
  add_n() |> 
  bold_labels() |> 
  modify_caption("**Frequent Physical Distress Counts and Percentages**")

**Frequent Physical Distress Counts and Percentages**
Characteristic	N	N = 8,000¹
Frequent Physical Distress	8,000
No		6,928 (87%)
Yes		1,072 (13%)
¹ n (%)

# Create Table to show across categories
brfss_analytic |>  
  select(frequent_distress, sex, exercise, age_group, education, gen_health, marital) |>  
  tbl_summary(
    by = frequent_distress,
    type = list(exercise ~ "categorical",
                sex ~ "categorical"),
    label = list(
      sex ~ "Sex",
      exercise ~ "Exercise Status",
      age_group ~ "Age Category",
      education ~ "Education Category",
      gen_health ~ "General Health Status",
      marital ~ "Marital Status"
    ),
    statistic = list(
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) |> 
  add_n() |> 
  bold_labels() |> 
  italicize_levels() |> 
  modify_caption("** Categorical Breakdowns of Physical Distress Counts**") |>  
  modify_footnote(
    all_stat_cols() ~ 
      "n(%)"
  ) |> 
  as_flex_table()

** Categorical Breakdowns of Physical Distress Counts**
Characteristic	N	No N = 6,9281	Yes N = 1,0721
Sex	8,000
Male		3,431 (50%)	468 (44%)
Female		3,497 (50%)	604 (56%)
Exercise Status	8,000
No		1,375 (20%)	505 (47%)
Yes		5,553 (80%)	567 (53%)
Age Category	8,000
18-34		1,249 (18%)	106 (9.9%)
35-49		1,377 (20%)	145 (14%)
50-64		1,784 (26%)	324 (30%)
65+		2,518 (36%)	497 (46%)
Education Category	8,000
Less than HS		318 (4.6%)	117 (11%)
HS/GED		1,665 (24%)	323 (30%)
Some College		1,792 (26%)	305 (28%)
College Graduate		3,153 (46%)	327 (31%)
General Health Status	8,000
Excellent/Very Good		3,821 (55%)	133 (12%)
Good		2,335 (34%)	241 (22%)
Fair/Poor		772 (11%)	698 (65%)
Marital Status	8,000
Married/Partnered		4,027 (58%)	506 (47%)
Previously married		1,590 (23%)	405 (38%)
Never married		1,311 (19%)	161 (15%)
1n(%)

In 2-3 sentences: Comment on the balance of the outcome

It appears that 87% of individuals are categorized into the no frequent physical distress category whereas only 13% are in the yes frequent physical distress category. This suggests that there are more than five times as many individuals in one category than the other meaning the outcome is not very balanced. I also created a table for the categorical predictors to understand its distribution and noticed that for general health and frequent physical distress, those who did demonstrate frequent physical distress were 65% of those in the fair/poor health status category.

Step 2: Simple (Unadjusted) Logistic Regression (10 points)

Why Choose Gen_Health: I chose general health because it appeared to be one of the strongest predictors. In the maximum linear regression model, fair/poor health was associated with about 10 more unhealthy days compared to the excellent/very good category holding all else constant. For individuals who reported good health status, they were associated with 1.29 more unhealthy days compared to the excellent/very good category holding all else constant. Both good and fair/poor demonstrated a p-value of less than 0.001 suggesting this would be a great starting point. We have also had countless discussions in class surrounding how important of a predictor general health can be for an individual. In addition, when looking at the model comparison, it appears that general health status contributed a large amount to the adjusted R-squared which helps edxplain the variance in physically unhealthy days.

# Fit a simple logistic regression
mod_simple <- glm(frequent_distress ~ gen_health,
data = brfss_analytic, family = binomial)

# Display the model summary
tidy(mod_simple, conf.int = TRUE, exponentiate = FALSE) |>
  mutate(across(c(estimate, std.error, statistic, conf.low, conf.high),
                \(x) round(x, 3)),
         p.value = format.pval(p.value, digits = 3)) |>
  kable(caption = "Simple Logistic Regression: Frequent Distress ~ General Health Status (Log-Odds Scale)") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Simple Logistic Regression: Frequent Distress ~ General Health Status (Log-Odds Scale)
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-3.358	0.088	-38.069	<2e-16	-3.536	-3.190
gen_healthGood	1.087	0.111	9.778	<2e-16	0.871	1.307
gen_healthFair/Poor	3.257	0.103	31.774	<2e-16	3.060	3.462

# Calculate and report the Odds ratio and 95% confidence interval
tidy(mod_simple, conf.int = TRUE, exponentiate = TRUE) |>
  mutate(across(c(estimate, std.error, statistic, conf.low, conf.high),
                \(x) round(x, 3)),
         p.value = format.pval(p.value, digits = 3)) |>
  kable(caption = "Wald Tests for General Health (Exponentiated)") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Wald Tests for General Health (Exponentiated)
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	0.035	0.088	-38.069	<2e-16	0.029	0.041
gen_healthGood	2.965	0.111	9.778	<2e-16	2.389	3.695
gen_healthFair/Poor	25.975	0.103	31.774	<2e-16	21.317	31.869

# Publication ready table
mod_simple |>
  tbl_regression(
    exponentiate = TRUE,
    label = list(gen_health ~ "General Health Status"),
  ) |>
  bold_labels() |>
  bold_p()

Characteristic	OR	95% CI	p-value
General Health Status
Excellent/Very Good	—	—
Good	2.97	2.39, 3.70	<0.001
Fair/Poor	26.0	21.3, 31.9	<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Interpret the odds ratio in plain language: State whether the association is statistically significant at alpha = 0.05 and how you know (Wald test p-value or CI excluding 1.0)

Gen_healthGood (OR = 2.97): Compared to excellent/very good health status, individuals who reported good health status have almost 3 times the odds of frequent physical distress, 95% CI [2.39, 3.70].

Gen_healthFair/Poor (OR = 26.0): Compared to excellent/very good health status, individuals who reported fair/poor health status have 26 times the odds of frequent physical distress, 95% CI [21.3, 31.9].

The association is statistically significant at alpha = 0.05 because the 95% confidence interval does not include 1, and by default the Wald test is computed and outputted a p-value of <0.001 suggesting that we should consider the evidence that the association is statistically significant.

Step 3: Multiple Logistic Regression (10 points)

# Fit a multiple logistic regression 
mod_logistic <- glm(frequent_distress ~ menthlth_days + bmi + exercise + age_group + gen_health,
data = brfss_analytic, family = binomial)

# Display the model summary with pretty table 
tidy(mod_logistic, conf.int = TRUE, exponentiate = FALSE) |>
  kable(digits = 3, caption = "Multiple Logistic Regression:  ~ menthlth_days, bmi, exercise, age_group, and gen_health (Log-Odds Scale)") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Multiple Logistic Regression: ~ menthlth_days, bmi, exercise, age_group, and gen_health (Log-Odds Scale)
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-3.745	0.219	-17.130	0.000	-4.177	-3.320
menthlth_days	0.055	0.004	13.999	0.000	0.047	0.062
bmi	0.004	0.005	0.807	0.420	-0.006	0.015
exerciseYes	-0.510	0.082	-6.208	0.000	-0.671	-0.349
age_group35-49	0.178	0.154	1.156	0.248	-0.123	0.481
age_group50-64	0.556	0.139	4.009	0.000	0.287	0.832
age_group65+	0.847	0.134	6.317	0.000	0.588	1.114
gen_healthGood	0.843	0.114	7.367	0.000	0.620	1.069
gen_healthFair/Poor	2.719	0.109	24.891	0.000	2.508	2.937

# Calculate and report the Adjusted Odds ratio and 95% confidence interval
tidy(mod_logistic, conf.int = TRUE, exponentiate = TRUE) |>
  kable(digits = 3, caption = "Multiple Logistic Regression: Frequent Distress ~ menthlth_days, bmi, exercise, age_group, and gen_health (Odds Ratio)") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Multiple Logistic Regression: Frequent Distress ~ menthlth_days, bmi, exercise, age_group, and gen_health (Odds Ratio)
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	0.024	0.219	-17.130	0.000	0.015	0.036
menthlth_days	1.056	0.004	13.999	0.000	1.048	1.064
bmi	1.004	0.005	0.807	0.420	0.994	1.015
exerciseYes	0.600	0.082	-6.208	0.000	0.511	0.705
age_group35-49	1.195	0.154	1.156	0.248	0.885	1.618
age_group50-64	1.744	0.139	4.009	0.000	1.333	2.297
age_group65+	2.333	0.134	6.317	0.000	1.800	3.047
gen_healthGood	2.324	0.114	7.367	0.000	1.860	2.914
gen_healthFair/Poor	15.171	0.109	24.891	0.000	12.281	18.852

# Publication ready table
mod_logistic |>
  tbl_regression(
    exponentiate = TRUE,
    label = list(menthlth_days ~ "Mentally Unhealthy Days",
      bmi ~ "Body Mass Index",
      exercise ~ "Exercise in past 30 days",
      age_group ~ "Age Category",
      gen_health ~ "General Health Status")
  ) |>
  bold_labels() |>
  bold_p()

Characteristic	OR	95% CI	p-value
Mentally Unhealthy Days	1.06	1.05, 1.06	<0.001
Body Mass Index	1.00	0.99, 1.01	0.4
Exercise in past 30 days
No	—	—
Yes	0.60	0.51, 0.71	<0.001
Age Category
18-34	—	—
35-49	1.19	0.88, 1.62	0.2
50-64	1.74	1.33, 2.30	<0.001
65+	2.33	1.80, 3.05	<0.001
General Health Status
Excellent/Very Good	—	—
Good	2.32	1.86, 2.91	<0.001
Fair/Poor	15.2	12.3, 18.9	<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Interpret the odds ratios for at least three predictors in plain language: Ensure using “holding all other variables constant” language (Include at least one continuous and one categorical predictor)

Mentally Unhealthy Days (1.06): With each increased day of mentally unhealthy days, individuals have 6% higher odds of experiencing frequent distress holding all other variables constant.

Exercise(Yes) (0.60): Individuals who exercise have 40% decreased odds of experiencing frequent physical distress when compared to those who do not exercise holding all other variables constant.

General Health Status Good (2.32): Individuals who classify themselves in the good general health status have 2.3 times higher odds of experiencing frequent physical distress compared to individuals in the excellent/very good health status category holding all other variables constant.

Step 4: Likelihood Ratio Test

# Create a reduced model without age_group
mod_reduced <- glm(frequent_distress ~ menthlth_days + bmi + exercise + gen_health,
  data = brfss_analytic,
  family = binomial
)

# Conduct a likelihood ratio test to determine if age as a whole has an impact
anova(mod_reduced, mod_logistic, test = "Chisq") |>
  kable(digits = 3,
        caption = "LR Test: Does adding age_group improve the model?") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

LR Test: Does adding age_group improve the model?
Resid. Df	Resid. Dev	Df	Deviance	Pr(>Chi)
7994	4586.561	NA	NA	NA
7991	4528.178	3	58.383	0

Report: State the null and alternative hypotheses, report the test statistic (deviance), degrees of freedom, and p-value. Conclude at alpha = 0.05: Does this group of predictors significantly improve the model?

H0: The dropped group of predictors does not improve the model (age_group).

H1: The dropped group predictors does significantly improve the model (age_group).

The test statistic (deviance) was 58.383. Degrees of freedom was 3, and the p-value was < .001.

This p-value is less than the alpha level of 0.05 suggesting that we should reject the null hypothesis and consider the evidence that age group significantly improves the model.

Step 5: ROC Curve and AUC (5 points).

# Fit the ROC object
roc_obj <- roc(
  response = brfss_analytic$frequent_distress,
  predictor = fitted(mod_logistic),
  levels = c("No", "Yes"),
  direction = "<"
)

auc_value <- auc(roc_obj)

# Provides both AUC and 95% confidence intervals
ci_auc <- ci.auc(roc_obj)
ci_auc

## 95% CI: 0.8362-0.8627 (DeLong)

cat("AUC:", round(auc_value, 3),
    "\n95% CI:", paste(round(ci_auc[1], 3),
                       round(ci_auc[2], 3),
                       round(ci_auc[3], 3), sep = " - "))

## AUC: 0.849 
## 95% CI: 0.836 - 0.849 - 0.863

# Graph that demonstrates the ROC curve 
ggroc(roc_obj, color = "steelblue", linewidth = 1.2) +
  geom_abline(slope = 1, intercept = 1, linetype = "dashed", color = "red") +
  labs(title = "ROC Curve: Frequent Physical Distress Model",
       subtitle = paste0("AUC = ", round(auc_value, 3)),
       x = "Specificity", y = "Sensitivity") +
  theme_minimal()

Report: the AUC with 95% confidence interval

The AUC with 95% CIs is 0.849 [0.8362, 0.8627].

Interpret the AUC: What does this value tell you about the model’s ability to discriminate between individuals with and without frequent physical distress?

This value tells me that the model has an excellent ability (0.8-0.9) to discriminate between individuals with and without physical distress.

Step 6: Hosmer-Lemeshow Goodness-of-Fit Test (5 points)

# Perform Hosmer-Lemeshow Test
hl_test <- hoslem.test(
  x = as.numeric(brfss_analytic$frequent_distress) - 1,
  y = fitted(mod_logistic),
  g = 10
)

hl_test

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  as.numeric(brfss_analytic$frequent_distress) - 1, fitted(mod_logistic)
## X-squared = 5.0826, df = 8, p-value = 0.7487

# Always pair with a calibration plot 
brfss_analytic |>
  mutate(pred_prob = fitted(mod_logistic),
         obs_outcome = as.numeric(frequent_distress) - 1,
         decile = ntile(pred_prob, 10)) |>
  group_by(decile) |>
  summarise(
    mean_pred = mean(pred_prob),
    mean_obs  = mean(obs_outcome),
    .groups = "drop"
  ) |>
  ggplot(aes(x = mean_pred, y = mean_obs)) +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  geom_point(size = 3, color = "steelblue") +
  geom_line(color = "steelblue") +
  labs(title = "Calibration Plot: Observed vs. Predicted Probability of Frequent Physical Distress",
       subtitle = "Points should fall on the dashed line for perfect calibration",
       x = "Mean Predicted Probability (within decile)",
       y = "Observed Proportion (within decile)") +
  theme_minimal()

Report: State the null and alternative hypotheses, report the test statistic, degrees of freedom, and p-value. Interpret: At alpha = 0.05 is there evidence of poor model fit?

H0: The model fits well in the regions of predicted probability.

H1: The model does not fit well in some regions of predicted probability.

The test-statistic, X-squared, was 5.0826. The degrees of freedom was 8, and the p-value was 0.7487. At alpha 0.05, there is not evidence of poor model fit, so we accept the null hypothesis that the model fits well in the regions of the predicted probability. This is further demonstrated with the corresponding calibration plot.

In 1-2 sentences: Discuss how the Hosmer-Lemeshow result complements the ROC/AUC finding.

They compliment each other because Hosmer-Lemeshow is a calibration assessment that determines how well predicted probabilities match observed rates while the ROC/AUC finding is a discrimination assessment that determines how well the model separates events from non-events. This gives both a perspective from predicted values as well as outcomes to gain a well-rounded understanding of the data.

Part 3: Reflections (15 points)

Write 250–400 words in continuous prose (not bullet points) addressing all three topics below.

A. Statistical Insights (5 points)

[What do your results suggest about the factors associated with physical health burden among U.S. adults? Which predictors were the strongest in both the linear and logistic models? Were any predictors significant in one model but not the other? What are the key limitations of using cross-sectional survey data for this type of analysis? Name at least two specific potential confounders not measured in your model.]

When fitting the maximum linear model, I used all eight variables including menthlth_days, bmi, sex, exercise, age_group, education, gen_health, and marital. After performing different selection methods, I determined that menthlth_days, bmi, exercise, age_group, and gen_health were the best predictors. In this final linear regression, menthlth_days, exercise_Yes, age_group50-64, age_group65+, gen_healthGood, and gen_healthFair/Poor were found to be significant with p-values less than 0.001. Whereas, bmi and age_group35-49 were not significant with respective p-values of 0.092 and 0.261. This linear model was able to explain 32.2% of the varaince in physically unhealthy days. The same results were discovered for the multiple logistic regression with only bmi and age_group35-49 not being signiciant (p-values of 0.420 and 0.248). These results suggest that unhealthy mental health days, exercise, age(50-64, 65+), and general health are all significant factors associated with the physical health burden among U.S. adults. The most significant predictors appeared to be general health status and being in the age group of 65+ with large estimates in both the linear and logistic regressions. They key limitations of using cross-sectional survey data is that observations from the same household or geographic may not be fully independent. Furthermore, cross-sectional survey data can not establish temporality and causality; however, it can reflect associations/relationships. Two confounders that were not measures in this model were rural versus urban status because those living in rural areas may demonstrate more frequent physical distress days as well as income which was a variable that I did not choose as predictor however those who with lower income or socioeconomic status may have increased levels of frequent physical distress.

B. Linear versus Logistic Regression (5 points)

[Compare what the linear regression model tells you versus what the logistic regression model tells you about the same research question. What information does each approach provide that the other does not? In what situations would you prefer one approach over the other? How do the model selection criteria differ between the two frameworks (for example, R-squared versus AUC)]

Simple linear regression models allows us to quantify the linear relationship between a continuous outcome and a single predictor, tells us how to predict values of an outcome given a predictor value, test hypotheses about whether a predictor is associated with an outcome, and lay the groundwork for multiple linear regression which can allow us to control for confounding. For linear regression, we employ R-squared and adjusted R-squared to understand the proportion of total variability in our outcome explained by the linear regression. Typically when selecting a linear regression, we want to pick the one with the highest R-squared or adjusted R-Squared. Logistic regressions helps us when the outcome variable is binary instead of continuous, we want to estimate the probability of an event occurring, we need adjusted odds ratios that control for confounders, and we want to identify risk factors for a dichotomous health outcome. We prefer to use logistic regression in cross-sectional studies where we have the prevalence of a condition, in case-control studies with odds of exposure given disease status and cohort studies for incidence of disease. Typically to select a model in logistic regression, we want to employ AIC/BIC and choose the model that demonstrates the lowest value.

C. AI Transparency (5 points)

[Describe any parts of this assignment where you sought AI assistance. Include the specific prompts you used and how you verified the correctness of the output. If you did not use AI, state this explicitly.]

I did not use any AI for this assignment. I already had all of the code to perform the tasks that were assigned between all of the labs that we have completed, the lecture notes, and my own final project codes because I employed logistic regression. If I did have any errors arise, I looked them up on Google which often directed me to R blogs where others had found answers. This included trying to determine how to get confidence intervals for the AUC curve. I found an R blog that explained to use ci.auc. Having an error appear and looking up its meaning seems to be one of the best ways for me to learn how to code. —

End of Homework 4