Computer Homework #4: Model Selection and Introduction to Logistic Regression

Transparent Assignment Description (TAD) and Grading Rubric

Due: Friday, April 24, 2026 (11:59 pm ET)
Points: 100 Topics: Model selection (best subsets, backward, forward, stepwise), model comparison criteria (Adjusted R-squared, AIC, BIC), LINE assumption checking, binary logistic regression, odds ratios, likelihood ratio tests, ROC/AUC, Hosmer-Lemeshow goodness-of-fit

Purpose

In this assignment, you will bring together the model-building tools you have learned this semester. You will start with a pool of candidate predictors and use formal model selection procedures to identify the best linear regression model for a continuous health outcome. After checking model assumptions, you will then reframe the same research question as a binary classification problem and apply logistic regression. By the end, you will have practiced the full analytic pipeline from variable selection through model evaluation for both linear and logistic regression

Learning Objectives

After completing this assignment, you will be able to: 1. Apply multiple model selection methods (best subsets, backward elimination, forward selection, stepwise) and explain how they differ 2. Compare candidate models using Adjusted R-squared, AIC, and BIC, and justify a final model choice 3. Assess LINE assumptions using diagnostic plots and interpret violations 4. Fit and interpret simple and multiple logistic regression models 5. Calculate and interpret odds ratios with 95% confidence intervals 6. Conduct a likelihood ratio test comparing nested logistic regression models 7. Evaluate logistic model performance using ROC/AUC and Hosmer-Lemeshow tests 8. Compare and contrast linear and logistic regression approaches for the same research question

Dataset

You will use the 2023 Behavioral Risk Factor Surveillance System (BRFSS), the same data source you downloaded for HW3. If you still have the original LLCP2023.XPT file, you can reuse it. Otherwise, follow the download instructions from HW3 to obtain it from the CDC website. Important: Because HW4 requires variables not used in HW3, you must go back to the original XPT file and select the new set of variables described below

Research Question:

What individual, behavioral, and health factors are associated with physical health burden among U.S. adults? How do findings compare when the outcome is modeled as the number of 2 physically unhealthy days (continuous) versus as an indicator of frequent physical distress (binary: 14 or more days in the past 30)?

Part 0: Data Preparation (10 points)

Step 1: Load Required packages (3 points)

# Load required packages
library(tidyverse) 
library(haven) # Reading XPT files
library(broom) # Tidying model output
library(kableExtra) # Formatting tables
library(car) # Diagnostic tools
library(gtsummary) # Descriptive tables
library(leaps) # Best subsets regression
library(pROC) # ROC curves and AUC
library(ResourceSelection) # Hosmer-Lemeshow test

Step 1 continued: Import the Dataset, Report the number of rows and columns, Select variables, and Save as working data frame.

#Import the dataset — update the path if needed
 brfss_full <- read_xpt("C:/Users/Alyss/OneDrive/Desktop/epi553/data/LLCP2023.XPT") 

glimpse(brfss_full)

## Rows: 433,323
## Columns: 350
## $ `_STATE`   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ FMONTH     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ IDATE      <chr> "03012023", "01062023", "03082023", "03062023", "01062023",…
## $ IMONTH     <chr> "03", "01", "03", "03", "01", "01", "03", "01", "03", "01",…
## $ IDAY       <chr> "01", "06", "08", "06", "06", "09", "21", "06", "15", "08",…
## $ IYEAR      <chr> "2023", "2023", "2023", "2023", "2023", "2023", "2023", "20…
## $ DISPCODE   <dbl> 1100, 1100, 1100, 1100, 1100, 1100, 1100, 1100, 1100, 1100,…
## $ SEQNO      <chr> "2023000001", "2023000002", "2023000003", "2023000004", "20…
## $ `_PSU`     <dbl> 2.023e+09, 2.023e+09, 2.023e+09, 2.023e+09, 2.023e+09, 2.02…
## $ CTELENM1   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ PVTRESD1   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ COLGHOUS   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ STATERE1   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ CELPHON1   <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ LADULT1    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ NUMADULT   <dbl> 2, 1, 1, 2, 1, 2, 1, 1, 1, 2, 2, 4, 2, 2, 2, 1, 1, 2, 1, 1,…
## $ RESPSLC1   <dbl> 1, NA, NA, 1, NA, 1, NA, NA, NA, 2, 1, 1, 1, 1, 1, NA, NA, …
## $ LANDSEX2   <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 1, 2, 1,…
## $ LNDSXBRT   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ SAFETIME   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CTELNUM1   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CELLFON5   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CADULT1    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CELLSEX2   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CELSXBRT   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ PVTRESD3   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CCLGHOUS   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CSTATE1    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ LANDLINE   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ HHADULT    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ SEXVAR     <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 1, 2, 1,…
## $ GENHLTH    <dbl> 2, 2, 4, 2, 4, 3, 4, 4, 3, 3, 3, 2, 2, 3, 3, 4, 5, 3, 3, 3,…
## $ PHYSHLTH   <dbl> 88, 88, 6, 2, 88, 2, 8, 1, 5, 88, 88, 2, 88, 88, 88, 4, 30,…
## $ MENTHLTH   <dbl> 88, 88, 2, 88, 88, 3, 77, 88, 88, 88, 88, 88, 88, 88, 88, 8…
## $ POORHLTH   <dbl> NA, NA, 1, 88, NA, 88, 88, 1, 88, NA, NA, 88, NA, NA, NA, 1…
## $ PRIMINS1   <dbl> 3, 3, 3, 3, 3, 7, 3, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
## $ PERSDOC3   <dbl> 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1,…
## $ MEDCOST1   <dbl> 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ CHECKUP1   <dbl> 2, 2, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ EXERANY2   <dbl> 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 1, 2, 2, 1, 7, 2, 2, 1,…
## $ EXRACT12   <dbl> NA, 1, 1, 1, 1, 1, NA, NA, NA, 1, 1, NA, 1, NA, NA, 1, NA, …
## $ EXEROFT1   <dbl> NA, 106, 205, 103, 102, 212, NA, NA, NA, 230, 105, NA, 105,…
## $ EXERHMM1   <dbl> NA, 30, 15, 30, 45, 45, NA, NA, NA, 30, 30, NA, 45, NA, NA,…
## $ EXRACT22   <dbl> NA, 88, 88, 88, 8, 88, NA, NA, NA, 88, 77, NA, 3, NA, NA, 8…
## $ EXEROFT2   <dbl> NA, NA, NA, NA, 107, NA, NA, NA, NA, NA, NA, NA, 105, NA, N…
## $ EXERHMM2   <dbl> NA, NA, NA, NA, 100, NA, NA, NA, NA, NA, NA, NA, 45, NA, NA…
## $ STRENGTH   <dbl> 888, 888, 205, 888, 888, 203, 777, 888, 888, 888, 105, 888,…
## $ BPHIGH6    <dbl> 1, 1, 1, 3, 1, 1, 3, 1, 1, 1, 1, 1, 3, 3, 1, 1, 1, 1, 1, 1,…
## $ BPMEDS1    <dbl> 1, 2, 1, NA, 1, 1, NA, 1, 1, 1, 1, 1, NA, NA, 1, 1, 1, 1, 1…
## $ CHOLCHK3   <dbl> 3, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3,…
## $ TOLDHI3    <dbl> 2, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 1,…
## $ CHOLMED3   <dbl> 2, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 2, 1,…
## $ CVDINFR4   <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ CVDCRHD4   <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ CVDSTRK3   <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2,…
## $ ASTHMA3    <dbl> 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2,…
## $ ASTHNOW    <dbl> NA, NA, 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ CHCSCNC1   <dbl> 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ CHCOCNC1   <dbl> 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 1,…
## $ CHCCOPD3   <dbl> 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2,…
## $ ADDEPEV3   <dbl> 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ CHCKDNY2   <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 7, 1, 2, 2,…
## $ HAVARTH4   <dbl> 2, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1,…
## $ DIABETE4   <dbl> 1, 3, 3, 3, 1, 3, 3, 1, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1,…
## $ DIABAGE4   <dbl> 57, NA, NA, NA, 68, NA, NA, 98, NA, 60, NA, NA, NA, NA, NA,…
## $ MARITAL    <dbl> 1, 2, 3, 1, 3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 3, 3, 1, 3, 3,…
## $ EDUCA      <dbl> 5, 5, 4, 5, 5, 5, 4, 5, 5, 4, 5, 4, 6, 5, 5, 4, 6, 6, 5, 4,…
## $ RENTHOM1   <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ NUMHHOL4   <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1,…
## $ NUMPHON4   <dbl> NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ CPDEMO1C   <dbl> 1, 1, 8, 1, 3, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ VETERAN3   <dbl> 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 1, 2, 2,…
## $ EMPLOY1    <dbl> 7, 7, 7, 7, 8, 7, 8, 7, 7, 7, 4, 1, 7, 7, 7, 5, 7, 7, 7, 7,…
## $ CHILDREN   <dbl> 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88,…
## $ INCOME3    <dbl> 99, 99, 2, 99, 7, 7, 6, 99, 6, 7, 7, 99, 9, 9, 9, 3, 77, 99…
## $ PREGNANT   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ WEIGHT2    <dbl> 172, 132, 130, 170, 170, 165, 180, 135, 195, 160, 300, 145,…
## $ HEIGHT3    <dbl> 503, 409, 504, 506, 508, 502, 600, 411, 504, 510, 511, 505,…
## $ DEAF       <dbl> 2, 1, 7, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2,…
## $ BLIND      <dbl> 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ DECIDE     <dbl> 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 7,…
## $ DIFFWALK   <dbl> 1, 2, 1, 1, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1, 1, 1, 2, 1,…
## $ DIFFDRES   <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 7, 2, 2, 2,…
## $ DIFFALON   <dbl> 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2,…
## $ FALL12MN   <dbl> 88, 88, 88, 88, 3, 88, 8, 88, 88, 88, 88, 88, 88, 1, 88, 1,…
## $ FALLINJ5   <dbl> NA, NA, NA, NA, 88, NA, 88, NA, NA, NA, NA, NA, NA, 1, NA, …
## $ SMOKE100   <dbl> 2, 2, 1, 2, 2, 2, 1, 2, 2, 1, 2, 2, 1, 1, 1, 2, 2, 2, 1, 1,…
## $ SMOKDAY2   <dbl> NA, NA, 3, NA, NA, NA, 3, NA, NA, 3, NA, NA, 3, 3, 3, NA, N…
## $ USENOW3    <dbl> 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
## $ ECIGNOW2   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 4, 1, 1, 1, 4, 1, 1, 1, 1, 4, 1,…
## $ ALCDAY4    <dbl> 888, 888, 888, 888, 202, 205, 888, 888, 888, 888, 888, 888,…
## $ AVEDRNK3   <dbl> NA, NA, NA, NA, 1, 2, NA, NA, NA, NA, NA, NA, 1, 2, NA, NA,…
## $ DRNK3GE5   <dbl> NA, NA, NA, NA, 88, 88, NA, NA, NA, NA, NA, NA, 88, 88, NA,…
## $ MAXDRNKS   <dbl> NA, NA, NA, NA, 1, 2, NA, NA, NA, NA, NA, NA, 1, 3, NA, NA,…
## $ FLUSHOT7   <dbl> 2, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1,…
## $ FLSHTMY3   <dbl> NA, 92023, 112022, 102022, NA, NA, 32023, 102023, 777777, 1…
## $ PNEUVAC4   <dbl> 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 2, 1, 1,…
## $ SHINGLE2   <dbl> 2, 2, 1, 2, 2, 2, 2, 1, 1, 2, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2,…
## $ HIVTST7    <dbl> 2, 2, 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 7, 2, 2, 2,…
## $ HIVTSTD3   <dbl> NA, NA, NA, 777777, NA, 777777, 92022, NA, NA, NA, NA, NA, …
## $ SEATBELT   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ DRNKDRI2   <dbl> NA, NA, NA, NA, 88, 88, NA, NA, NA, NA, NA, NA, 88, 88, NA,…
## $ COVIDPO1   <dbl> 2, 2, 2, 2, 2, 1, 2, 1, 1, 2, 1, 1, 2, 1, 1, 2, 2, 2, 2, 1,…
## $ COVIDSM1   <dbl> NA, NA, NA, NA, NA, 2, NA, 2, 2, NA, 1, 2, NA, 2, 2, NA, NA…
## $ COVIDACT   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 3, NA, NA, NA, NA, …
## $ PDIABTS1   <dbl> NA, 2, 1, 1, NA, 1, 1, NA, 1, NA, 1, 1, 1, 1, 1, 1, 7, 1, 8…
## $ PREDIAB2   <dbl> NA, 3, 3, 3, NA, 3, 3, NA, 1, NA, 3, 3, 1, 3, 3, 3, 3, 3, 3…
## $ DIABTYPE   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ INSULIN1   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CHKHEMO3   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ EYEEXAM1   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ DIABEYE1   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ DIABEDU1   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ FEETSORE   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ ARTHEXER   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ ARTHEDU    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ LMTJOIN3   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ ARTHDIS2   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ JOINPAI2   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ LCSFIRST   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ LCSLAST    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ LCSNUMCG   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ LCSCTSC1   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ LCSSCNCR   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ LCSCTWHN   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ HADMAM     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ HOWLONG    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CERVSCRN   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CRVCLCNC   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CRVCLPAP   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CRVCLHPV   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ HADHYST2   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ PSATEST1   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ PSATIME1   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ PCPSARS2   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ PSASUGS1   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ PCSTALK2   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ HADSIGM4   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ COLNSIGM   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ COLNTES1   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ SIGMTES1   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ LASTSIG4   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ COLNCNCR   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ VIRCOLO1   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ VCLNTES2   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ SMALSTOL   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ STOLTEST   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ STOOLDN2   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ BLDSTFIT   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ SDNATES1   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CNCRDIFF   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CNCRAGE    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CNCRTYP2   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CSRVTRT3   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CSRVDOC1   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CSRVSUM    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CSRVRTRN   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CSRVINST   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CSRVINSR   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CSRVDEIN   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CSRVCLIN   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CSRVPAIN   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CSRVCTL2   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ INDORTAN   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ NUMBURN3   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ SUNPRTCT   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ WKDAYOUT   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ WKENDOUT   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CIMEMLO1   <dbl> 2, 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, NA, 2, 2, 2, 1, 2, 2…
## $ CDWORRY    <dbl> NA, NA, 2, NA, 1, 1, NA, NA, NA, NA, NA, NA, 7, NA, NA, NA,…
## $ CDDISCU1   <dbl> NA, NA, 1, NA, 2, 2, NA, NA, NA, NA, NA, NA, 2, NA, NA, NA,…
## $ CDHOUS1    <dbl> NA, NA, 2, NA, 2, 1, NA, NA, NA, NA, NA, NA, 2, NA, NA, NA,…
## $ CDSOCIA1   <dbl> NA, NA, 2, NA, 2, 2, NA, NA, NA, NA, NA, NA, 2, NA, NA, NA,…
## $ CAREGIV1   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CRGVREL4   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CRGVLNG1   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CRGVHRS1   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CRGVPRB3   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CRGVALZD   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CRGVPER1   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CRGVHOU1   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CRGVEXPT   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ LASTSMK2   <dbl> NA, NA, 7, NA, NA, NA, 7, NA, NA, 7, NA, NA, 7, NA, 7, NA, …
## $ STOPSMK2   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ MENTCIGS   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ MENTECIG   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ HEATTBCO   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ FIREARM5   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ GUNLOAD    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ LOADULK2   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ HASYMP1    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ HASYMP2    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ HASYMP3    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ HASYMP4    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ HASYMP5    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ HASYMP6    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ STRSYMP1   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ STRSYMP2   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ STRSYMP3   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ STRSYMP4   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ STRSYMP5   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ STRSYMP6   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ FIRSTAID   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ ASPIRIN    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ BIRTHSEX   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ SOMALE     <dbl> NA, NA, NA, NA, NA, NA, 2, NA, NA, 2, 2, NA, NA, 2, NA, NA,…
## $ SOFEMALE   <dbl> 2, 2, 2, 2, 2, 2, NA, 2, 2, NA, NA, 2, 2, NA, 2, 2, 2, NA, …
## $ TRNSGNDR   <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ MARIJAN1   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ MARJSMOK   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ MARJEAT    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ MARJVAPE   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ MARJDAB    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ MARJOTHR   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ USEMRJN4   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ ACEDEPRS   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ ACEDRINK   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ ACEDRUGS   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ ACEPRISN   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ ACEDIVRC   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ ACEPUNCH   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ ACEHURT1   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ ACESWEAR   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ ACETOUCH   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ ACETTHEM   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ ACEHVSEX   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ ACEADSAF   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ ACEADNED   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ IMFVPLA4   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ HPVADVC4   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ HPVADSHT   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ TETANUS1   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ COVIDVA1   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ COVACGE1   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ COVIDNU2   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ LSATISFY   <dbl> 2, 1, 2, 1, 2, 2, 2, 1, 2, 2, 2, 1, 1, NA, 1, 2, 3, 1, 4, 2…
## $ EMTSUPRT   <dbl> 1, 2, 4, 1, 2, 2, 1, 1, 1, 3, 1, 1, 2, NA, 1, 1, 2, 1, 4, 2…
## $ SDLONELY   <dbl> 5, 3, 3, 3, 2, 3, 4, 5, 3, 3, 4, 5, 4, NA, 4, 3, 4, 5, 1, 2…
## $ SDHEMPLY   <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, NA, 2, 2, 2, 2, 2, 2…
## $ FOODSTMP   <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, NA, 2, 2, 2, 2, 2, 2…
## $ SDHFOOD1   <dbl> 5, 5, 5, 5, 4, 5, 5, 5, 5, 5, 4, 5, 5, NA, 5, 5, 5, 5, 5, 5…
## $ SDHBILLS   <dbl> 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, NA, 2, 2, 2, 2, 2, 2…
## $ SDHUTILS   <dbl> 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, NA, 2, 2, 2, 2, 2, 2…
## $ SDHTRNSP   <dbl> 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, NA, 2, 2, 2, 2, 2, 2…
## $ SDHSTRE1   <dbl> 5, 5, 3, 5, 2, 3, 5, 5, 4, 4, 4, 4, 4, NA, 4, 5, 4, 5, 4, 4…
## $ RRCLASS3   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ RRCOGNT2   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ RRTREAT    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ RRATWRK2   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ RRHCARE4   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ RRPHYSM2   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ RCSGEND1   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ RCSXBRTH   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ RCSRLTN2   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CASTHDX2   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CASTHNO2   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ QSTVER     <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,…
## $ QSTLANG    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ `_METSTAT` <dbl> 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 2,…
## $ `_URBSTAT` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ MSCODE     <dbl> 1, 1, 3, 2, 5, 1, 2, 5, 3, 1, 1, 1, 1, 5, 1, 2, 5, 5, 5, 1,…
## $ `_STSTR`   <dbl> 11011, 11012, 11011, 11011, 11011, 11011, 11011, 11012, 110…
## $ `_STRWT`   <dbl> 42.79158, 170.42939, 42.79158, 42.79158, 42.79158, 42.79158…
## $ `_RAWRAKE` <dbl> 2, 1, 1, 2, 1, 2, 1, 1, 1, 2, 2, 4, 2, 2, 2, 1, 1, 2, 1, 1,…
## $ `_WT2RAKE` <dbl> 85.58316, 170.42939, 42.79158, 85.58316, 42.79158, 85.58316…
## $ `_IMPRACE` <dbl> 1, 1, 2, 1, 1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1,…
## $ `_CHISPNC` <dbl> 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,…
## $ `_CRACE1`  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ CAGEG      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ `_CLLCPWT` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ `_DUALUSE` <dbl> 1, 1, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ `_DUALCOR` <dbl> 0.5026388, 0.5026388, NA, 0.5026388, 0.5026388, 0.5026388, …
## $ `_LLCPWT2` <dbl> 941.1640, 1874.2239, 1151.6032, 941.1640, 470.5820, 941.164…
## $ `_LLCPWT`  <dbl> 605.4279, 1121.9927, 600.9633, 605.4279, 281.7110, 1060.335…
## $ `_RFHLTH`  <dbl> 1, 1, 2, 1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1,…
## $ `_PHYS14D` <dbl> 1, 1, 2, 2, 1, 2, 2, 2, 2, 1, 1, 2, 1, 1, 1, 2, 3, 3, 1, 1,…
## $ `_MENT14D` <dbl> 1, 1, 2, 1, 1, 2, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 3, 1,…
## $ `_HLTHPL1` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ `_HCVU653` <dbl> 9, 9, 9, 9, 9, 1, 9, 9, 9, 9, 1, 9, 9, 9, 9, 9, 1, 9, 9, 9,…
## $ `_TOTINDA` <dbl> 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 1, 2, 2, 1, 9, 2, 2, 1,…
## $ METVL12_   <dbl> NA, 35, 35, 35, 35, 35, NA, NA, NA, 35, 35, NA, 35, NA, NA,…
## $ METVL22_   <dbl> NA, NA, NA, NA, 33, NA, NA, NA, NA, NA, NA, NA, 45, NA, NA,…
## $ MAXVO21_   <dbl> 1840, 1803, 1322, 1914, 1988, 2506, 1545, 1914, 1655, 1875,…
## $ FC601_     <dbl> 315, 309, 227, 328, 341, 430, 265, 328, 284, 321, 472, 328,…
## $ ACTIN13_   <dbl> NA, 2, 2, 2, 2, 1, NA, NA, NA, 2, 1, NA, 1, NA, NA, 2, NA, …
## $ ACTIN23_   <dbl> NA, NA, NA, NA, 1, NA, NA, NA, NA, NA, NA, NA, 2, NA, NA, N…
## $ PADUR1_    <dbl> NA, 30, 15, 30, 45, 45, NA, NA, NA, 30, 30, NA, 45, NA, NA,…
## $ PADUR2_    <dbl> NA, NA, NA, NA, 60, NA, NA, NA, NA, NA, NA, NA, 45, NA, NA,…
## $ PAFREQ1_   <dbl> NA, 6000, 1167, 3000, 2000, 2800, NA, NA, NA, 7000, 5000, N…
## $ PAFREQ2_   <dbl> NA, NA, NA, NA, 7000, NA, NA, NA, NA, NA, NA, NA, 5000, NA,…
## $ `_MINAC12` <dbl> NA, 180, 18, 90, 90, 126, NA, NA, NA, 210, 150, NA, 225, NA…
## $ `_MINAC22` <dbl> NA, 0, 0, 0, 420, 0, NA, NA, NA, 0, NA, NA, 225, NA, NA, 0,…
## $ STRFREQ_   <dbl> 0, 0, 1167, 0, 0, 700, NA, 0, 0, 0, 5000, 0, 0, 2000, 2000,…
## $ PAMISS3_   <dbl> 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 9, 0, 0, 1,…
## $ PAMIN13_   <dbl> NA, 360, 36, 180, 180, 126, NA, NA, NA, 420, 150, NA, 225, …
## $ PAMIN23_   <dbl> NA, NA, NA, NA, 420, NA, NA, NA, NA, NA, NA, NA, 450, NA, N…
## $ PA3MIN_    <dbl> NA, 360, 36, 180, 600, 126, NA, NA, NA, 420, 150, NA, 675, …
## $ PAVIG13_   <dbl> NA, 180, 18, 90, 90, 0, NA, NA, NA, 210, 0, NA, 0, NA, NA, …
## $ PAVIG23_   <dbl> NA, NA, NA, NA, 0, NA, NA, NA, NA, NA, NA, NA, 225, NA, NA,…
## $ PA3VIGM_   <dbl> NA, 180, 18, 90, 90, 0, NA, NA, NA, 210, 0, NA, 225, NA, NA…
## $ `_PACAT3`  <dbl> 4, 1, 9, 9, 1, 9, 4, 4, 4, 1, 9, 4, 1, 4, 4, 9, 9, 4, 4, 9,…
## $ `_PAINDX3` <dbl> 2, 1, 9, 1, 1, 9, 2, 2, 2, 1, 1, 2, 1, 2, 2, 1, 9, 2, 2, 1,…
## $ `_PA150R4` <dbl> 3, 1, 9, 1, 1, 9, 3, 3, 3, 1, 1, 3, 1, 3, 3, 1, 9, 3, 3, 1,…
## $ `_PA300R4` <dbl> 3, 1, 9, 9, 1, 9, 3, 3, 3, 1, 9, 3, 1, 3, 3, 9, 9, 3, 3, 9,…
## $ `_PA30023` <dbl> 2, 1, 9, 9, 1, 9, 2, 2, 2, 1, 9, 2, 1, 2, 2, 9, 9, 2, 2, 9,…
## $ `_PASTRNG` <dbl> 2, 2, 2, 2, 2, 2, 9, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2,…
## $ `_PAREC3`  <dbl> 4, 2, 9, 2, 2, 9, 9, 4, 4, 2, 1, 4, 2, 3, 3, 2, 9, 4, 4, 2,…
## $ `_PASTAE3` <dbl> 2, 2, 9, 2, 2, 9, 9, 2, 2, 2, 1, 2, 2, 2, 2, 2, 9, 2, 2, 2,…
## $ `_RFHYPE6` <dbl> 2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2,…
## $ `_CHOLCH3` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ `_RFCHOL3` <dbl> 1, 2, 2, 1, 1, 2, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2,…
## $ `_MICHD`   <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ `_LTASTH1` <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1,…
## $ `_CASTHM1` <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 9, 1, 1, 1,…
## $ `_ASTHMS1` <dbl> 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 9, 3, 3, 3,…
## $ `_DRDXAR2` <dbl> 2, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1,…
## $ `_MRACE1`  <dbl> 1, 1, 2, 1, 1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1,…
## $ `_HISPANC` <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ `_RACE`    <dbl> 1, 1, 2, 1, 1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1,…
## $ `_RACEG21` <dbl> 1, 1, 2, 1, 1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1,…
## $ `_RACEGR3` <dbl> 1, 1, 2, 1, 1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1,…
## $ `_RACEPRV` <dbl> 1, 1, 2, 1, 1, 2, 1, 2, 2, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1,…
## $ `_SEX`     <dbl> 2, 2, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 1, 2, 1,…
## $ `_AGEG5YR` <dbl> 13, 13, 13, 12, 12, 9, 13, 12, 13, 12, 8, 12, 10, 12, 12, 1…
## $ `_AGE65YR` <dbl> 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2,…
## $ `_AGE80`   <dbl> 80, 80, 80, 78, 76, 62, 80, 78, 80, 75, 59, 78, 68, 79, 79,…
## $ `_AGE_G`   <dbl> 6, 6, 6, 6, 6, 5, 6, 6, 6, 6, 5, 6, 6, 6, 6, 6, 5, 6, 6, 6,…
## $ HTIN4      <dbl> 63, 57, 64, 66, 68, 62, 72, 59, 64, 70, 71, 65, 68, 73, 62,…
## $ HTM4       <dbl> 160, 145, 163, 168, 173, 157, 183, 150, 163, 178, 180, 165,…
## $ WTKG3      <dbl> 7802, 5987, 5897, 7711, 7711, 7484, 8165, 6123, 8845, 7257,…
## $ `_BMI5`    <dbl> 3047, 2856, 2231, 2744, 2585, 3018, 2441, 2727, 3347, 2296,…
## $ `_BMI5CAT` <dbl> 4, 3, 2, 3, 3, 4, 2, 3, 4, 2, 4, 2, 3, 2, 3, 3, 4, 3, 3, 3,…
## $ `_RFBMI5`  <dbl> 2, 2, 1, 2, 2, 2, 1, 2, 2, 1, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2,…
## $ `_CHLDCNT` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ `_EDUCAG`  <dbl> 3, 3, 2, 3, 3, 3, 2, 3, 3, 2, 3, 2, 4, 3, 3, 2, 4, 4, 3, 2,…
## $ `_INCOMG1` <dbl> 9, 9, 1, 9, 5, 5, 4, 9, 4, 5, 5, 9, 6, 6, 6, 2, 9, 9, 3, 4,…
## $ `_SMOKER3` <dbl> 4, 4, 3, 4, 4, 4, 3, 4, 4, 3, 4, 4, 3, 3, 3, 4, 4, 4, 1, 3,…
## $ `_RFSMOK3` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ `_CURECI2` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ DRNKANY6   <dbl> 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 1,…
## $ DROCDY4_   <dbl> 0, 0, 0, 0, 7, 17, 0, 0, 0, 0, 0, 0, 29, 100, 0, 0, 0, 0, 3…
## $ `_RFBING6` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ `_DRNKWK2` <dbl> 0, 0, 0, 0, 47, 233, 0, 0, 0, 0, 0, 0, 200, 1400, 0, 0, 0, …
## $ `_RFDRHV8` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ `_FLSHOT7` <dbl> 2, 1, 1, 1, 2, NA, 1, 1, 1, 1, NA, 1, 1, 1, 1, 2, NA, 1, 2,…
## $ `_PNEUMO3` <dbl> 2, 1, 1, 1, 1, NA, 1, 1, 1, 1, NA, 2, 1, 1, 2, 1, NA, 2, 1,…
## $ `_AIDTST4` <dbl> 2, 2, 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 9, 2, 2, 2,…
## $ `_RFSEAT2` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ `_RFSEAT3` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ `_DRNKDRV` <dbl> 9, 9, 9, 9, 2, 2, 9, 9, 9, 9, 9, 9, 2, 2, 9, 9, 9, 9, 2, 2,…

cat("Number of Rows:", nrow(brfss_full), "\n")

## Number of Rows: 433323

cat("Number of Columns:", ncol(brfss_full), "\n")

## Number of Columns: 350

# Select outcome + chosen predictors
brfss_ms <- brfss_full %>%
  select(PHYSHLTH, MENTHLTH, `_BMI5`, SEXVAR, EXERANY2, ADDEPEV3, `_AGEG5YR`, `_INCOMG1`, EDUCA, `_SMOKER3`, GENHLTH,  MARITAL)

# Save as a working data frame
saveRDS(brfss_ms,
  "C:/Users/Alyss/OneDrive/Desktop/epi553/data/brfss_ms_2023"
)

Step 1 Question(s):

Justify your chosen predictors for the maximum model.

I chose to include all of the predictor variables because leaving some of the variables out may undermine the maximum model. However, having all the variables in the model may also reduce the expected association between the outcome and chosen predictors.

Step 2 (4 points): Recode variables

# ----- Recode and Clean --------------------
brfss_ms <- brfss_full %>%
  mutate(
    # Outcome: physically unhealthy days in past 30 (88 = none; 77/99 = DK/refused = NA)
    physhlth_days = case_when(
      PHYSHLTH == 88                 ~ 0, 
      PHYSHLTH >= 1 & PHYSHLTH <= 30 ~ as.numeric(PHYSHLTH), 
      TRUE                           ~ NA_real_
    ), 
    
    # Candidate Predictors:
    
    # Mentally unhealthy days in past 30 (88 = none; 77/99 = DK/refused = NA)
    menthlth_days = case_when(
      MENTHLTH == 88                ~ 0, 
      MENTHLTH >= 1 & MENTHLTH <=30 ~ as.numeric(MENTHLTH), 
      TRUE                          ~ NA_real_
    ), 
    
    # BMI: Divide _bmi5 by 100, 9999 to NA
    bmi = ifelse(`_BMI5` > 0, `_BMI5` / 100, NA_real_
    ), 
    
    # Sex: 1 = "Yes", 2 = "No", 7, 9 to NA
    sex = factor(SEXVAR, levels = c(1, 2), labels = c("Male", "Female")), # Reference layer = "Male"
    
    # Exercise: 1 = "Yes", 2 = "No", 7, 9 to NA
    exercise = factor(case_when(
      EXERANY2 == 1 ~ "Yes", 
      EXERANY2 == 2 ~ "No", 
      TRUE          ~ NA_character_
    ), levels = c("No", "Yes")), # Reference layer = "No"
    
    # Depression: 1 = "Yes", 2 = "No", 7, 9 to NA
    depression = factor(case_when(
      ADDEPEV3 == 1 ~ "Yes", 
      ADDEPEV3 == 2 ~ "No", 
      TRUE          ~ NA_character_
    ), levels = c("No", "Yes")), # Reference layer = "No"
    
    # Age Group: Collapse: 1-3 = "18-34", 4-6 = "35-49", 7-9 = 50-64,
    #  7-9 = "50-64", 10-13 = "65+", 14 to NA
    age_group = factor(case_when(
      `_AGEG5YR` %in% c(1, 2, 3)        ~ "18-34", 
      `_AGEG5YR` %in% c(4, 5, 6)        ~ "35-49", 
      `_AGEG5YR` %in% c(7, 8, 9)        ~ "50-64", 
      `_AGEG5YR` %in% c(10, 11, 12, 13) ~ "65+", 
      TRUE                              ~ NA_character_
    ), levels = c("18-34", "35-49", "50-64", "65+")), # Reference layer = "18-34"
    
    # Income: Collapse: 1-2 = "Less than 25k", 3-4 = "25-49k", 5-6 =     # "50-99k", 7 = "100k+", 9 to NA 
    income = factor(case_when(
      INCOME3 %in% c(1, 2) ~ "Less than $25K", 
      INCOME3 %in% c(3, 4) ~ "$25K-$49K",
      INCOME3 %in% c(5, 6) ~ "$50K-$99K", 
      INCOME3 == 7         ~ "$100K+", 
      TRUE                 ~ NA_character_
    ), levels = c("Less than $25K", "$25K-$49K", "$50K-$99K", "$100K+")), # Reference layer = "Less that $25K"
    
    # Education: Collapse: 1-3 = "Less than HS", 4 = "HS/GED", 5 = "Some college"    
    # 6= "college graduate",  9 to NA
    education = factor(case_when(
      EDUCA %in% c(1, 2, 3) ~ "Less than HS", 
      EDUCA == 4            ~ "HS/GED",
      EDUCA == 5            ~ "Some college",
      EDUCA == 6            ~ "College Graduate",
      TRUE                  ~ NA_character_
    ), levels = c("Less than HS", "HS/GED", "Some college","College Graduate")), # Reference layer = "Less than HS"
    
    # Smoking: Collapse: 1-2 = "Current", 3 = "Former", 4 = "Never",     # 9 to NA
    smoking = factor(case_when(
      `_SMOKER3` %in% c(1, 2) ~ "Current", 
      `_SMOKER3` == 3         ~ "Former", 
      `_SMOKER3` == 4         ~ "Never", 
      TRUE                    ~ NA_character_
    ), levels = c("Never", "Former", "Current")), # Reference layer = "Never"
    
    # Gen Health: Collapse: 1-2 = "Excellent/Very Good", 3 = "Good", 4-5 = "Fair/Poor",     # 7 & 9 to NA
    gen_health = factor(case_when(
     GENHLTH %in% c(1, 2) ~ "Excellent/Very Good", 
     GENHLTH == 3         ~ "Good", 
     GENHLTH %in% c(4, 5) ~ "Fair/Poor", 
     TRUE                 ~ NA_character_
    ), levels = c("Excellent/Very Good", "good", "Fair/Poor")), # Reference layer = "Excellent/Very Good
    
    # marital: Collapse: 1 and 6 = "Married/Partnered, 2-4 = "Previously married" 5 = "Never married", 9 to NA
    marital = factor(case_when(
      MARITAL %in% c(1, 6)    ~ "Married/Partnered", 
      MARITAL %in% c(2, 3, 4) ~ "Previously married", 
      MARITAL == 5            ~ "Never married", 
      TRUE                    ~ NA_character_
    ), levels = c("Married/Partnered", "Previously married", "Never married")) # Reference layer = "Married/Partnered"
    
  )

brfss_cleaned <- brfss_ms |>
  select(physhlth_days, menthlth_days, sex, exercise, depression, age_group, income, education, smoking, gen_health, marital, bmi)

Step 3 (3 points): Create the analytic dataset and report a descriptive summary

# Report the number and percentage of missing observations with physhlth_days, menthlth_days, and marital.

# Physhlth_days:
cat("Number of missing physically unhealthy days:", sum(is.na(brfss_ms$physhlth_days)), "\n")

## Number of missing physically unhealthy days: 10785

cat("Percentage of missing physically unhealthy days:", round(mean(is.na(brfss_ms$physhlth_days) * 100), 2), "%", "\n")

## Percentage of missing physically unhealthy days: 2.49 %

# Menthlth_days:
cat("Number of missing mentally unhealthy days:", sum(is.na(brfss_ms$menthlth_days)), "\n")

## Number of missing mentally unhealthy days: 8108

cat("Percentage of missing mentally unhealthy days:", round(mean(is.na(brfss_ms$menthlth_days) * 100), 2), "%", "\n")

## Percentage of missing mentally unhealthy days: 1.87 %

# marital:
cat("Number of missing marital status:", sum(is.na(brfss_ms$marital)), "\n")

## Number of missing marital status: 4289

cat("Number of missing marital status:", round(mean(is.na(brfss_ms$marital) * 100), 2), "%", "\n")

## Number of missing marital status: 0.99 %

# Remove missing rows and draw a random sample 
  set.seed(1220)
  brfss_analytic <- brfss_cleaned |>
    drop_na() |>
    slice_sample(n = 8000)
 
# Report final analytic sample:
  tibble(Metric = c("Observations", "Variables"),
       Value  = c(nrow(brfss_analytic), ncol(brfss_analytic))) |>
  kable(caption = " Final Analytic Sample") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Final Analytic Sample
Metric	Value
Observations	8000
Variables	12

# Descriptive summary table:
  brfss_analytic|>
  select(physhlth_days, menthlth_days, bmi, sex, exercise, depression, age_group, income, education, smoking, gen_health, marital) |>
  tbl_summary(
    label = list(
      physhlth_days ~ "Physically unhealthy days (past 30)", 
      menthlth_days ~ "Mentally unhealthy days (past 30)", 
      bmi           ~ "Body mass index (x100)", 
      sex           ~ "Sex", 
      exercise      ~ "Exercise (past 30)", 
      depression    ~ "Ever told depressive disorder", 
      age_group     ~ "Age group", 
      income        ~ "Annual household income", 
      education     ~ "Education level", 
      smoking       ~ "Smoking status",
      gen_health    ~ "General health status", 
      marital       ~ "Marital status"
    ),
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) |>
  add_n() |>
  bold_labels() |>
  italicize_levels() |>
  modify_caption("**Table 1. Descriptive Statistics — BRFSS 2023 Analytic Sample (n = 8,000)**") |>
  as_flex_table()

**Table 1. Descriptive Statistics — BRFSS 2023 Analytic Sample (n = 8,000)**
Characteristic	N	N = 8,0001
Physically unhealthy days (past 30)	8,000	6.7 (10.8)
Mentally unhealthy days (past 30)	8,000	5.7 (9.6)
Body mass index (x100)	8,000	28.8 (7.2)
Sex	8,000
Male		3,568 (45%)
Female		4,432 (55%)
Exercise (past 30)	8,000	5,457 (68%)
Ever told depressive disorder	8,000	1,980 (25%)
Age group	8,000
18-34		1,382 (17%)
35-49		1,242 (16%)
50-64		1,825 (23%)
65+		3,551 (44%)
Annual household income	8,000
Less than $25K		845 (11%)
$25K-$49K		1,208 (15%)
$50K-$99K		3,472 (43%)
$100K+		2,475 (31%)
Education level	8,000
Less than HS		657 (8.2%)
HS/GED		2,491 (31%)
Some college		2,387 (30%)
College Graduate		2,465 (31%)
Smoking status	8,000
Never		4,412 (55%)
Former		2,354 (29%)
Current		1,234 (15%)
General health status	8,000
Excellent/Very Good		4,832 (60%)
good		0 (0%)
Fair/Poor		3,168 (40%)
Marital status	8,000
Married/Partnered		3,390 (42%)
Previously married		2,907 (36%)
Never married		1,703 (21%)
1Mean (SD); n (%)

Part 1: Model Selection for Linear Regression (35 points)

Your goal is to identify the best linear regression model for predicting physhlth_days (continuous) from your chosen set of at least 8 predictors

Step 1: Fit the Maximum Model (5 points)

# Fit a linear regression model containing all of your selected predictors
max_mod <- lm(physhlth_days ~ menthlth_days + bmi + sex + exercise + depression + age_group + income + education + smoking + gen_health + marital, data = brfss_analytic)

# Display the model summary
summary(max_mod)

## 
## Call:
## lm(formula = physhlth_days ~ menthlth_days + bmi + sex + exercise + 
##     depression + age_group + income + education + smoking + gen_health + 
##     marital, data = brfss_analytic)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.678  -3.536  -0.897   1.969  33.057 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                1.692847   0.663701   2.551  0.01077 *  
## menthlth_days              0.237382   0.011468  20.700  < 2e-16 ***
## bmi                       -0.030980   0.013460  -2.302  0.02138 *  
## sexFemale                  0.005791   0.191493   0.030  0.97588    
## exerciseYes               -2.411171   0.214279 -11.252  < 2e-16 ***
## depressionYes              0.435939   0.246559   1.768  0.07708 .  
## age_group35-49             0.340697   0.337654   1.009  0.31300    
## age_group50-64             2.180699   0.325420   6.701 2.21e-11 ***
## age_group65+               2.005078   0.311673   6.433 1.32e-10 ***
## income$25K-$49K           -0.519132   0.373446  -1.390  0.16453    
## income$50K-$99K           -1.037815   0.332208  -3.124  0.00179 ** 
## income$100K+              -0.957936   0.360613  -2.656  0.00791 ** 
## educationHS/GED            0.695658   0.369776   1.881  0.05997 .  
## educationSome college      1.200302   0.375900   3.193  0.00141 ** 
## educationCollege Graduate  1.639963   0.387662   4.230 2.36e-05 ***
## smokingFormer              0.931882   0.219014   4.255 2.12e-05 ***
## smokingCurrent             0.616435   0.280716   2.196  0.02812 *  
## gen_healthFair/Poor       10.342537   0.225570  45.851  < 2e-16 ***
## maritalPreviously married  0.146707   0.221509   0.662  0.50779    
## maritalNever married      -0.383313   0.270588  -1.417  0.15664    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.239 on 7980 degrees of freedom
## Multiple R-squared:  0.4142, Adjusted R-squared:  0.4128 
## F-statistic: 296.9 on 19 and 7980 DF,  p-value: < 2.2e-16

# Report and interpret: R-squared, Adjusted R-squared, AIC (using AIC()), and BIC(using BIC())

# R-squared
cat("R squared:", summary(max_mod)$r.squared)

## R squared: 0.4141759

# Adjusted r-squared
cat("Adjusted R squared:", summary(max_mod)$adj.r.squared)

## Adjusted R squared: 0.4127811

# AIC
cat("AIC:", AIC(max_mod))

## AIC: 56467.91

# BIC
cat("BIC:", BIC(max_mod))

## BIC: 56614.64

Step 2: Best subsets regression (10 points)

# Use leaps::regsubsets() to perform best subsets regression on your maximum model. 
best_subsets <- regsubsets(physhlth_days ~ menthlth_days + sex + exercise + depression + age_group + income + education + smoking + gen_health + marital, data = brfss_analytic,
                           
# Set nvmax equal to the total number of parameters in your maximum model
nvmax = 10, 
method = "exhaustive"
)

## Reordering variables and trying again:

best_summary <- summary(best_subsets)

# Create a summary table or plot showing Adjusted R-squared and BIC across model sizes (number of predictors)
subset_metrics <- tibble(
  p = 1:length(best_summary$adjr2),
  `Adj. R²` = best_summary$adjr2,
  BIC = best_summary$bic,
  Cp = best_summary$cp
)

p1 <- ggplot(subset_metrics, aes(x = p, y = `Adj. R²`)) +
  geom_line(linewidth = 1, color = "steelblue") +
  geom_point(size = 3, color = "steelblue") +
  geom_vline(xintercept = which.max(best_summary$adjr2),
             linetype = "dashed", color = "tomato") +
  labs(title = "Adjusted R² by Model Size", x = "Number of Variables", y = "Adjusted R²") +
  theme_minimal(base_size = 12)

p2 <- ggplot(subset_metrics, aes(x = p, y = BIC)) +
  geom_line(linewidth = 1, color = "steelblue") +
  geom_point(size = 3, color = "steelblue") +
  geom_vline(xintercept = which.min(best_summary$bic),
             linetype = "dashed", color = "tomato") +
  labs(title = "BIC by Model Size", x = "Number of Variables", y = "BIC") +
  theme_minimal(base_size = 12)

gridExtra::grid.arrange(p1, p2, ncol = 2)

# Identify the model size that maximizes Adjusted R-squared
cat("Best model by Adj. R²:", which.max(best_summary$adjr2), "variables\n")

## Best model by Adj. R²: 11 variables

# Identify the model size that minimizes BIC
cat("Best model by BIC:", which.min(best_summary$bic), "variables\n")

## Best model by BIC: 7 variables

Step 2 Question(s): In 2-3 sentences, describe what you observe: Do the two criteria agree on the best model size? If not, which criterion favors a simpler model?

The two criteria do not agree on the same best model size. BIC favors a simpler model of 7 variables

Step 3: Stepwise Selection Methods (10 points)

Using the step() function, perform the following three selection procedures. For each, report which variables are in the final model.

# Backward elimination: Start from the maximum model (direction = "backward")
cat("=== BACKWARD ELIMINATION ===\n\n")

## === BACKWARD ELIMINATION ===

# Step 1: Maximum model
mod_back <- max_mod
cat("Step 1: Maximum model\n")

## Step 1: Maximum model

cat("Variables:", paste(names(coef(mod_back))[-1], collapse = ", "), "\n")

## Variables: menthlth_days, bmi, sexFemale, exerciseYes, depressionYes, age_group35-49, age_group50-64, age_group65+, income$25K-$49K, income$50K-$99K, income$100K+, educationHS/GED, educationSome college, educationCollege Graduate, smokingFormer, smokingCurrent, gen_healthFair/Poor, maritalPreviously married, maritalNever married

# Show p-values for the maximum model
pvals <- tidy(mod_back) |>
  filter(term != "(Intercept)") |>
  arrange(desc(p.value)) |>
  dplyr::select(term, estimate, p.value) |>
  mutate(across(where(is.numeric), \(x) round(x, 4)))

pvals |>
  head(5) |>
  kable(caption = "Maximum Model: Variables Sorted by p-value (Highest First)") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Maximum Model: Variables Sorted by p-value (Highest First)
term	estimate	p.value
sexFemale	0.0058	0.9759
maritalPreviously married	0.1467	0.5078
age_group35-49	0.3407	0.3130
income$25K-$49K	-0.5191	0.1645
maritalNever married	-0.3833	0.1566

# Automated backward elimination using AIC
mod_backward <- step(max_mod, direction = "backward", trace = 1)

## Start:  AIC=33762.89
## physhlth_days ~ menthlth_days + bmi + sex + exercise + depression + 
##     age_group + income + education + smoking + gen_health + marital
## 
##                 Df Sum of Sq    RSS   AIC
## - sex            1         0 541749 33761
## - marital        2       229 541979 33762
## <none>                       541749 33763
## - depression     1       212 541962 33764
## - bmi            1       360 542109 33766
## - income         3       753 542502 33768
## - smoking        2      1292 543042 33778
## - education      3      1652 543401 33781
## - age_group      3      4862 546611 33828
## - exercise       1      8596 550345 33887
## - menthlth_days  1     29090 570840 34179
## - gen_health     1    142720 684470 35632
## 
## Step:  AIC=33760.9
## physhlth_days ~ menthlth_days + bmi + exercise + depression + 
##     age_group + income + education + smoking + gen_health + marital
## 
##                 Df Sum of Sq    RSS   AIC
## - marital        2       234 541983 33760
## <none>                       541749 33761
## - depression     1       216 541965 33762
## - bmi            1       360 542109 33764
## - income         3       754 542503 33766
## - smoking        2      1306 543055 33776
## - education      3      1657 543406 33779
## - age_group      3      4863 546612 33826
## - exercise       1      8598 550348 33885
## - menthlth_days  1     29096 570846 34177
## - gen_health     1    143361 685110 35637
## 
## Step:  AIC=33760.34
## physhlth_days ~ menthlth_days + bmi + exercise + depression + 
##     age_group + income + education + smoking + gen_health
## 
##                 Df Sum of Sq    RSS   AIC
## <none>                       541983 33760
## - depression     1       235 542218 33762
## - bmi            1       381 542364 33764
## - income         3       750 542733 33765
## - smoking        2      1360 543343 33776
## - education      3      1627 543609 33778
## - age_group      3      6919 548902 33856
## - exercise       1      8674 550657 33885
## - menthlth_days  1     29210 571193 34178
## - gen_health     1    143522 685505 35638

tidy(mod_backward, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Backward Elimination Result (AIC-based)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Backward Elimination Result (AIC-based)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	1.5011	0.6331	2.3709	0.0178	0.2600	2.7421
menthlth_days	0.2377	0.0115	20.7423	0.0000	0.2152	0.2601
bmi	-0.0319	0.0134	-2.3693	0.0178	-0.0582	-0.0055
exerciseYes	-2.4211	0.2142	-11.3032	0.0000	-2.8410	-2.0012
depressionYes	0.4550	0.2445	1.8607	0.0628	-0.0243	0.9344
age_group35-49	0.4816	0.3286	1.4655	0.1428	-0.1626	1.1258
age_group50-64	2.3893	0.3052	7.8295	0.0000	1.7911	2.9875
age_group65+	2.2623	0.2790	8.1085	0.0000	1.7154	2.8093
income$25K-$49K	-0.4910	0.3726	-1.3178	0.1876	-1.2215	0.2394
income$50K-$99K	-1.0089	0.3276	-3.0792	0.0021	-1.6511	-0.3666
income$100K+	-0.9260	0.3506	-2.6412	0.0083	-1.6133	-0.2387
educationHS/GED	0.6674	0.3684	1.8114	0.0701	-0.0549	1.3896
educationSome college	1.1806	0.3745	3.1526	0.0016	0.4465	1.9147
educationCollege Graduate	1.6086	0.3858	4.1696	0.0000	0.8523	2.3648
smokingFormer	0.9503	0.2176	4.3664	0.0000	0.5237	1.3770
smokingCurrent	0.6207	0.2795	2.2209	0.0264	0.0729	1.1685
gen_healthFair/Poor	10.3429	0.2250	45.9779	0.0000	9.9019	10.7839

# Forward selection: Start from an intercept-only model with the maximum model as the upper scope (direction = "forward")
# Automated forward selection using AIC
mod_null <- lm(physhlth_days ~ 1, data = brfss_analytic)

mod_forward <- step(mod_null,
                    scope = list(lower = mod_null, upper = max_mod),
                    direction = "forward", trace = 1)

## Start:  AIC=38002.78
## physhlth_days ~ 1
## 
##                 Df Sum of Sq    RSS   AIC
## + gen_health     1    323958 600806 34555
## + menthlth_days  1    133222 791542 36760
## + exercise       1     81869 842895 37263
## + depression     1     53920 870844 37524
## + income         3     40748 884016 37648
## + smoking        2     27766 896999 37763
## + age_group      3     26625 898139 37775
## + bmi            1     16692 908073 37859
## + marital        2     12591 912173 37897
## + education      3      9140 915625 37929
## <none>                       924764 38003
## + sex            1        19 924746 38005
## 
## Step:  AIC=34554.65
## physhlth_days ~ gen_health
## 
##                 Df Sum of Sq    RSS   AIC
## + menthlth_days  1     36406 564400 34057
## + exercise       1     11781 589026 34398
## + depression     1      8886 591921 34437
## + smoking        2      3722 597084 34509
## + age_group      3      3675 597131 34512
## + marital        2      2621 598185 34524
## + income         3      1749 599057 34537
## + education      3      1198 599608 34545
## + sex            1       442 600364 34551
## <none>                       600806 34555
## + bmi            1        65 600742 34556
## 
## Step:  AIC=34056.57
## physhlth_days ~ gen_health + menthlth_days
## 
##              Df Sum of Sq    RSS   AIC
## + age_group   3   10334.3 554066 33915
## + exercise    1    9615.9 554784 33921
## + marital     2    3816.9 560583 34006
## + smoking     2    2786.2 561614 34021
## + education   3     826.4 563573 34051
## + income      3     659.0 563741 34053
## + depression  1     264.9 564135 34055
## + bmi         1     213.3 564187 34056
## <none>                    564400 34057
## + sex         1     117.2 564283 34057
## 
## Step:  AIC=33914.73
## physhlth_days ~ gen_health + menthlth_days + age_group
## 
##              Df Sum of Sq    RSS   AIC
## + exercise    1    7933.6 546132 33801
## + smoking     2    1391.2 552674 33899
## + income      3     733.2 553332 33910
## + depression  1     378.1 553687 33911
## + marital     2     421.1 553644 33913
## + education   3     547.2 553518 33913
## <none>                    554066 33915
## + bmi         1     125.8 553940 33915
## + sex         1      18.6 554047 33916
## 
## Step:  AIC=33801.35
## physhlth_days ~ gen_health + menthlth_days + age_group + exercise
## 
##              Df Sum of Sq    RSS   AIC
## + smoking     2   1267.24 544865 33787
## + education   3   1244.40 544888 33789
## + bmi         1    449.59 545682 33797
## + depression  1    406.26 545726 33797
## + income      3    530.56 545601 33800
## + marital     2    330.88 545801 33801
## <none>                    546132 33801
## + sex         1      8.31 546124 33803
## 
## Step:  AIC=33786.77
## physhlth_days ~ gen_health + menthlth_days + age_group + exercise + 
##     smoking
## 
##              Df Sum of Sq    RSS   AIC
## + education   3   1493.33 543371 33771
## + bmi         1    391.64 544473 33783
## + depression  1    337.27 544527 33784
## + income      3    519.54 544345 33785
## <none>                    544865 33787
## + marital     2    257.88 544607 33787
## + sex         1     39.84 544825 33788
## 
## Step:  AIC=33770.81
## physhlth_days ~ gen_health + menthlth_days + age_group + exercise + 
##     smoking + education
## 
##              Df Sum of Sq    RSS   AIC
## + income      3    803.73 542568 33765
## + bmi         1    352.21 543019 33768
## + depression  1    251.55 543120 33769
## + marital     2    281.95 543089 33771
## <none>                    543371 33771
## + sex         1     22.39 543349 33772
## 
## Step:  AIC=33764.97
## physhlth_days ~ gen_health + menthlth_days + age_group + exercise + 
##     smoking + education + income
## 
##              Df Sum of Sq    RSS   AIC
## + bmi         1    349.65 542218 33762
## + depression  1    203.60 542364 33764
## + marital     2    273.20 542294 33765
## <none>                    542568 33765
## + sex         1     11.80 542556 33767
## 
## Step:  AIC=33761.81
## physhlth_days ~ gen_health + menthlth_days + age_group + exercise + 
##     smoking + education + income + bmi
## 
##              Df Sum of Sq    RSS   AIC
## + depression  1   235.067 541983 33760
## <none>                    542218 33762
## + marital     2   252.803 541965 33762
## + sex         1    15.235 542203 33764
## 
## Step:  AIC=33760.34
## physhlth_days ~ gen_health + menthlth_days + age_group + exercise + 
##     smoking + education + income + bmi + depression
## 
##           Df Sum of Sq    RSS   AIC
## <none>                 541983 33760
## + marital  2   233.592 541749 33761
## + sex      1     4.262 541979 33762

tidy(mod_forward, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Forward Selection Result (AIC-based)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Forward Selection Result (AIC-based)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	1.5011	0.6331	2.3709	0.0178	0.2600	2.7421
gen_healthFair/Poor	10.3429	0.2250	45.9779	0.0000	9.9019	10.7839
menthlth_days	0.2377	0.0115	20.7423	0.0000	0.2152	0.2601
age_group35-49	0.4816	0.3286	1.4655	0.1428	-0.1626	1.1258
age_group50-64	2.3893	0.3052	7.8295	0.0000	1.7911	2.9875
age_group65+	2.2623	0.2790	8.1085	0.0000	1.7154	2.8093
exerciseYes	-2.4211	0.2142	-11.3032	0.0000	-2.8410	-2.0012
smokingFormer	0.9503	0.2176	4.3664	0.0000	0.5237	1.3770
smokingCurrent	0.6207	0.2795	2.2209	0.0264	0.0729	1.1685
educationHS/GED	0.6674	0.3684	1.8114	0.0701	-0.0549	1.3896
educationSome college	1.1806	0.3745	3.1526	0.0016	0.4465	1.9147
educationCollege Graduate	1.6086	0.3858	4.1696	0.0000	0.8523	2.3648
income$25K-$49K	-0.4910	0.3726	-1.3178	0.1876	-1.2215	0.2394
income$50K-$99K	-1.0089	0.3276	-3.0792	0.0021	-1.6511	-0.3666
income$100K+	-0.9260	0.3506	-2.6412	0.0083	-1.6133	-0.2387
bmi	-0.0319	0.0134	-2.3693	0.0178	-0.0582	-0.0055
depressionYes	0.4550	0.2445	1.8607	0.0628	-0.0243	0.9344

# Stepwise selection: Start from the intercept-only model with both directions allowed (direction = "both")
mod_stepwise <- step(mod_null,
                     scope = list(lower = mod_null, upper = max_mod),
                     direction = "both", trace = 1)

## Start:  AIC=38002.78
## physhlth_days ~ 1
## 
##                 Df Sum of Sq    RSS   AIC
## + gen_health     1    323958 600806 34555
## + menthlth_days  1    133222 791542 36760
## + exercise       1     81869 842895 37263
## + depression     1     53920 870844 37524
## + income         3     40748 884016 37648
## + smoking        2     27766 896999 37763
## + age_group      3     26625 898139 37775
## + bmi            1     16692 908073 37859
## + marital        2     12591 912173 37897
## + education      3      9140 915625 37929
## <none>                       924764 38003
## + sex            1        19 924746 38005
## 
## Step:  AIC=34554.65
## physhlth_days ~ gen_health
## 
##                 Df Sum of Sq    RSS   AIC
## + menthlth_days  1     36406 564400 34057
## + exercise       1     11781 589026 34398
## + depression     1      8886 591921 34437
## + smoking        2      3722 597084 34509
## + age_group      3      3675 597131 34512
## + marital        2      2621 598185 34524
## + income         3      1749 599057 34537
## + education      3      1198 599608 34545
## + sex            1       442 600364 34551
## <none>                       600806 34555
## + bmi            1        65 600742 34556
## - gen_health     1    323958 924764 38003
## 
## Step:  AIC=34056.57
## physhlth_days ~ gen_health + menthlth_days
## 
##                 Df Sum of Sq    RSS   AIC
## + age_group      3     10334 554066 33915
## + exercise       1      9616 554784 33921
## + marital        2      3817 560583 34006
## + smoking        2      2786 561614 34021
## + education      3       826 563573 34051
## + income         3       659 563741 34053
## + depression     1       265 564135 34055
## + bmi            1       213 564187 34056
## <none>                       564400 34057
## + sex            1       117 564283 34057
## - menthlth_days  1     36406 600806 34555
## - gen_health     1    227142 791542 36760
## 
## Step:  AIC=33914.73
## physhlth_days ~ gen_health + menthlth_days + age_group
## 
##                 Df Sum of Sq    RSS   AIC
## + exercise       1      7934 546132 33801
## + smoking        2      1391 552674 33899
## + income         3       733 553332 33910
## + depression     1       378 553687 33911
## + marital        2       421 553644 33913
## + education      3       547 553518 33913
## <none>                       554066 33915
## + bmi            1       126 553940 33915
## + sex            1        19 554047 33916
## - age_group      3     10334 564400 34057
## - menthlth_days  1     43066 597131 34512
## - gen_health     1    193638 747704 36311
## 
## Step:  AIC=33801.35
## physhlth_days ~ gen_health + menthlth_days + age_group + exercise
## 
##                 Df Sum of Sq    RSS   AIC
## + smoking        2      1267 544865 33787
## + education      3      1244 544888 33789
## + bmi            1       450 545682 33797
## + depression     1       406 545726 33797
## + income         3       531 545601 33800
## + marital        2       331 545801 33801
## <none>                       546132 33801
## + sex            1         8 546124 33803
## - exercise       1      7934 554066 33915
## - age_group      3      8652 554784 33921
## - menthlth_days  1     39979 586111 34365
## - gen_health     1    160531 706663 35861
## 
## Step:  AIC=33786.77
## physhlth_days ~ gen_health + menthlth_days + age_group + exercise + 
##     smoking
## 
##                 Df Sum of Sq    RSS   AIC
## + education      3      1493 543371 33771
## + bmi            1       392 544473 33783
## + depression     1       337 544527 33784
## + income         3       520 544345 33785
## <none>                       544865 33787
## + marital        2       258 544607 33787
## + sex            1        40 544825 33788
## - smoking        2      1267 546132 33801
## - age_group      3      7456 552320 33889
## - exercise       1      7810 552674 33899
## - menthlth_days  1     38578 583442 34332
## - gen_health     1    155892 700757 35798
## 
## Step:  AIC=33770.81
## physhlth_days ~ gen_health + menthlth_days + age_group + exercise + 
##     smoking + education
## 
##                 Df Sum of Sq    RSS   AIC
## + income         3       804 542568 33765
## + bmi            1       352 543019 33768
## + depression     1       252 543120 33769
## + marital        2       282 543089 33771
## <none>                       543371 33771
## + sex            1        22 543349 33772
## - education      3      1493 544865 33787
## - smoking        2      1516 544888 33789
## - age_group      3      6980 550351 33867
## - exercise       1      8583 551954 33894
## - menthlth_days  1     37572 580943 34304
## - gen_health     1    156409 699781 35793
## 
## Step:  AIC=33764.97
## physhlth_days ~ gen_health + menthlth_days + age_group + exercise + 
##     smoking + education + income
## 
##                 Df Sum of Sq    RSS   AIC
## + bmi            1       350 542218 33762
## + depression     1       204 542364 33764
## + marital        2       273 542294 33765
## <none>                       542568 33765
## + sex            1        12 542556 33767
## - income         3       804 543371 33771
## - smoking        2      1506 544074 33783
## - education      3      1778 544345 33785
## - age_group      3      6971 549539 33861
## - exercise       1      8367 550935 33885
## - menthlth_days  1     36579 579147 34285
## - gen_health     1    148236 690804 35695
## 
## Step:  AIC=33761.81
## physhlth_days ~ gen_health + menthlth_days + age_group + exercise + 
##     smoking + education + income + bmi
## 
##                 Df Sum of Sq    RSS   AIC
## + depression     1       235 541983 33760
## <none>                       542218 33762
## + marital        2       253 541965 33762
## + sex            1        15 542203 33764
## - bmi            1       350 542568 33765
## - income         3       801 543019 33768
## - smoking        2      1432 543650 33779
## - education      3      1736 543954 33781
## - age_group      3      6838 549056 33856
## - exercise       1      8645 550863 33886
## - menthlth_days  1     36695 578913 34284
## - gen_health     1    145932 688150 35667
## 
## Step:  AIC=33760.34
## physhlth_days ~ gen_health + menthlth_days + age_group + exercise + 
##     smoking + education + income + bmi + depression
## 
##                 Df Sum of Sq    RSS   AIC
## <none>                       541983 33760
## + marital        2       234 541749 33761
## - depression     1       235 542218 33762
## + sex            1         4 541979 33762
## - bmi            1       381 542364 33764
## - income         3       750 542733 33765
## - smoking        2      1360 543343 33776
## - education      3      1627 543609 33778
## - age_group      3      6919 548902 33856
## - exercise       1      8674 550657 33885
## - menthlth_days  1     29210 571193 34178
## - gen_health     1    143522 685505 35638

tidy(mod_stepwise, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Stepwise Selection Result (AIC-based)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Stepwise Selection Result (AIC-based)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	1.5011	0.6331	2.3709	0.0178	0.2600	2.7421
gen_healthFair/Poor	10.3429	0.2250	45.9779	0.0000	9.9019	10.7839
menthlth_days	0.2377	0.0115	20.7423	0.0000	0.2152	0.2601
age_group35-49	0.4816	0.3286	1.4655	0.1428	-0.1626	1.1258
age_group50-64	2.3893	0.3052	7.8295	0.0000	1.7911	2.9875
age_group65+	2.2623	0.2790	8.1085	0.0000	1.7154	2.8093
exerciseYes	-2.4211	0.2142	-11.3032	0.0000	-2.8410	-2.0012
smokingFormer	0.9503	0.2176	4.3664	0.0000	0.5237	1.3770
smokingCurrent	0.6207	0.2795	2.2209	0.0264	0.0729	1.1685
educationHS/GED	0.6674	0.3684	1.8114	0.0701	-0.0549	1.3896
educationSome college	1.1806	0.3745	3.1526	0.0016	0.4465	1.9147
educationCollege Graduate	1.6086	0.3858	4.1696	0.0000	0.8523	2.3648
income$25K-$49K	-0.4910	0.3726	-1.3178	0.1876	-1.2215	0.2394
income$50K-$99K	-1.0089	0.3276	-3.0792	0.0021	-1.6511	-0.3666
income$100K+	-0.9260	0.3506	-2.6412	0.0083	-1.6133	-0.2387
bmi	-0.0319	0.0134	-2.3693	0.0178	-0.0582	-0.0055
depressionYes	0.4550	0.2445	1.8607	0.0628	-0.0243	0.9344

method_comparison <- tribble(
  ~Method, ~`Variables selected`, ~`Adj. R²`, ~AIC, ~BIC,
  "Maximum model",
    length(coef(max_mod)) - 1,
    round(glance(max_mod)$adj.r.squared, 4),
    round(AIC(max_mod), 1),
    round(BIC(max_mod), 1),
  "Backward (AIC)",
    length(coef(mod_backward)) - 1,
    round(glance(mod_backward)$adj.r.squared, 4),
    round(AIC(mod_backward), 1),
    round(BIC(mod_backward), 1),
  "Forward (AIC)",
    length(coef(mod_forward)) - 1,
    round(glance(mod_forward)$adj.r.squared, 4),
    round(AIC(mod_forward), 1),
    round(BIC(mod_forward), 1),
  "Stepwise (AIC)",
    length(coef(mod_stepwise)) - 1,
    round(glance(mod_stepwise)$adj.r.squared, 4),
    round(AIC(mod_stepwise), 1),
    round(BIC(mod_stepwise), 1)
)

method_comparison |>
  kable(caption = "Comparison of Variable Selection Methods") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Comparison of Variable Selection Methods
Method	Variables selected	Adj. R²	AIC	BIC
Maximum model	19	0.4128	56467.9	56614.6
Backward (AIC)	16	0.4127	56465.4	56591.1
Forward (AIC)	16	0.4127	56465.4	56591.1
Stepwise (AIC)	16	0.4127	56465.4	56591.1

Step 3 Question(s): In 3-5 sentences, compare the results:

• Do the three methods agree on the same final model?

• Which variables (if any) were excluded by all three methods?

• Which variables were retained by all three methods

The three models do agree on the same final model. The three models all excluded genhealth (Excellent/VeryGood, and Good), sexMale, exerciseNo, depressionNo, age_group18-34, incomeLess than 25K, educationLess than HS, smokingNever, and marital. The variables that were retained were gen_healthFair/Poor, menthlth_days, age_group35-49, age_group50-64, age_group65+, exerciseYes, smokingFormer, smokingCurrent, educationHS/GED, educationSome college, educationCollege Graduate, income25K− 49K, income50K−99K, income$100K+, bmi, and depressionYes.

Step 4: Final Model Selection and Interaction (5 points)

# Choose your final model. State which method(s) guided your choice and why.

fin_mod <- lm(physhlth_days ~ menthlth_days + bmi + exercise + depression + age_group + income + education + smoking + gen_health, data = brfss_analytic)

# Display the coefficient table for your final model using tidy() with confidence intervals
tidy(fin_mod, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Coefficient Table",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Coefficient Table
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	1.5011	0.6331	2.3709	0.0178	0.2600	2.7421
menthlth_days	0.2377	0.0115	20.7423	0.0000	0.2152	0.2601
bmi	-0.0319	0.0134	-2.3693	0.0178	-0.0582	-0.0055
exerciseYes	-2.4211	0.2142	-11.3032	0.0000	-2.8410	-2.0012
depressionYes	0.4550	0.2445	1.8607	0.0628	-0.0243	0.9344
age_group35-49	0.4816	0.3286	1.4655	0.1428	-0.1626	1.1258
age_group50-64	2.3893	0.3052	7.8295	0.0000	1.7911	2.9875
age_group65+	2.2623	0.2790	8.1085	0.0000	1.7154	2.8093
income$25K-$49K	-0.4910	0.3726	-1.3178	0.1876	-1.2215	0.2394
income$50K-$99K	-1.0089	0.3276	-3.0792	0.0021	-1.6511	-0.3666
income$100K+	-0.9260	0.3506	-2.6412	0.0083	-1.6133	-0.2387
educationHS/GED	0.6674	0.3684	1.8114	0.0701	-0.0549	1.3896
educationSome college	1.1806	0.3745	3.1526	0.0016	0.4465	1.9147
educationCollege Graduate	1.6086	0.3858	4.1696	0.0000	0.8523	2.3648
smokingFormer	0.9503	0.2176	4.3664	0.0000	0.5237	1.3770
smokingCurrent	0.6207	0.2795	2.2209	0.0264	0.0729	1.1685
gen_healthFair/Poor	10.3429	0.2250	45.9779	0.0000	9.9019	10.7839

Step 4 Question(s): Choose your final model. State which method(s) guided your choice and why. I chose to use the model from the stepwise, backward, and forward elimination. All three came up with the same variables.

Interpret the coefficients for at least three predictors in plain language, including units and “holding all other variables constant” language. Include at least one continuous and one categorical predictor

1. Exercise: Compared to those who do not exercise, those who exercised reported 2.4211 less physically unhealthy days when holding all other variables constant.

2. BMI: Each additional kg/m^2 decrease in BMI, the number of days being physically unhealthy decreases by 0.0319 days, holding all other variables constant.

3. Income: Each additional one-unit increase in the income category is associated with 0.4910 less physically unhealthy days when all other variables are held constant.

Step 5: LINE Assumption Check (5 points)

# Using your final linear model, produce the four base R diagnostic plots:
par(mfrow = c(2, 2))
plot(fin_mod, which = 1:4, col = adjustcolor("steelblue", alpha.f = 0.3), pch = 16)

Step 2 Question(s): Interpret each panel in 1-2 sentences: 1. Residuals vs. Fitted: Is there evidence of non-linearity or non-constant variance? The points show scatter and no pattern; there is evidence of non-linearity.

Normal Q-Q: Do the residuals approximately follow a normal distribution? Where do they deviate? Residuals follow normal distribution near the middle but has very large tails that deviate.
Scale-Location: Is the spread of residuals roughly constant across fitted values? The spread of residuals is roughly constant across fitted values but is showing increasing variance.
Residuals vs. Leverage: Are there any highly influential observations (large Cook’s distance)? There are three influential outliers.

Write a brief conclusion (2-3 sentences): Are the LINE assumptions reasonably satisfied? What violations, if any, do you observe The LINE assumptions are reasonably satisfied. There is the increasing variance in the Scale - Location test and the deviations in the tails in the Q-Q plot but the sample size is large so it is robust to minor violations.

Part 2: Logistic Regression (40 points)

You will now reframe the research question using a binary outcome. The CDC defines frequent physical distress as reporting 14 or more physically unhealthy days in the past 30 days.

Step 1: Create the binary outcome (5 points)

# Create a new variable frequent_distress

brfss_logistic <- brfss_analytic |>
  mutate(
    fpd = factor(
      ifelse(physhlth_days >= 14, 1, 0), 
      levels = c(0, 1), 
      labels = c("No", "Yes") # Reference level = No
    )
  )

# Report the prevalence of fpd (count and percentage)
fpd_table_count <- table(brfss_logistic$fpd)
fpd_table_percent <- prop.table(table(brfss_logistic$fpd))

print(fpd_table_count)

## 
##   No  Yes 
## 6225 1775

print(fpd_table_percent)

## 
##       No      Yes 
## 0.778125 0.221875

Step 1 Question(s): In 1-2 sentences, comment on the balance of the outcome.

The balance of the outcome is unbalanced. More people report having no frequent physical distress than having frequent physical distress.

Step 2: Simple (Unadjusted) Logistic Regression (10 points)

Choose one predictor from your final linear model that you expect to have a strong association with physical distress. State why you chose it.

The predictor that I think will have a strong association with physical distress is gen_health becuase if an individual has poor general health, they will be more likely to report frequent physical distress.

# Fit a simple logistic regression.
mod_simple <- glm(fpd ~ gen_health, data = brfss_logistic, family = binomial)

# Display the model summary.
summary(mod_simple)

## 
## Call:
## glm(formula = fpd ~ gen_health, family = binomial, data = brfss_logistic)
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -3.24062    0.07556  -42.89   <2e-16 ***
## gen_healthFair/Poor  3.25198    0.08350   38.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8468.3  on 7999  degrees of freedom
## Residual deviance: 5942.3  on 7998  degrees of freedom
## AIC: 5946.3
## 
## Number of Fisher Scoring iterations: 6

mod_simple |>
  tbl_regression(
    exponentiate = TRUE,
    label = list(gen_health ~ "General Health Status")
  ) |>
  bold_labels() |>
  bold_p()

Characteristic	OR	95% CI	p-value
General Health Status
Excellent/Very Good	—	—
Fair/Poor	25.8	22.0, 30.5	<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

# Calculate and report the odds ratio and 95% confidence interval:
exp(coef(mod_simple)) # Odds Ratio

##         (Intercept) gen_healthFair/Poor 
##          0.03913978         25.84144427

exp(confint(mod_simple)) # 95% CI for OR

##                           2.5 %      97.5 %
## (Intercept)          0.03363503  0.04523803
## gen_healthFair/Poor 21.99933051 30.52350547

Step 2 Question(s) Interpret the odds ratio in plain language.

Compared to those with Excellent/Very Good health, those with fair/poor general health had 25.84 times the odds of frequent physical distress, 95% CI (21.99 - 30.52).

State whether the association is statistically significant at alpha = 0.05 and how you know (Wald test p-value or CI excluding 1.0)

The association is statistically significant. The p-value is less than the significance level alpha 0.05. The CI does not contain 1, or null.

Step 3: Multiple Logistic Regression (10 points)

# Fit a multiple logistic regression using the same predictors as your final linear model from Part 1
mod_logistic <- glm(fpd ~ menthlth_days + bmi + exercise + depression + age_group + income + education + smoking + gen_health,
 data = brfss_logistic, family = binomial)


# Display the model summary
summary(mod_logistic)

## 
## Call:
## glm(formula = fpd ~ menthlth_days + bmi + exercise + depression + 
##     age_group + income + education + smoking + gen_health, family = binomial, 
##     data = brfss_logistic)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -3.670512   0.231633 -15.846  < 2e-16 ***
## menthlth_days              0.052463   0.003652  14.366  < 2e-16 ***
## bmi                       -0.007115   0.004372  -1.627  0.10367    
## exerciseYes               -0.620071   0.070653  -8.776  < 2e-16 ***
## depressionYes              0.185182   0.082534   2.244  0.02485 *  
## age_group35-49             0.445399   0.144505   3.082  0.00205 ** 
## age_group50-64             0.992891   0.132630   7.486 7.09e-14 ***
## age_group65+               1.032203   0.129293   7.983 1.42e-15 ***
## income$25K-$49K           -0.135541   0.116829  -1.160  0.24598    
## income$50K-$99K           -0.253128   0.105518  -2.399  0.01644 *  
## income$100K+              -0.327357   0.120282  -2.722  0.00650 ** 
## educationHS/GED            0.035708   0.116430   0.307  0.75908    
## educationSome college      0.176716   0.119875   1.474  0.14044    
## educationCollege Graduate  0.306018   0.127536   2.399  0.01642 *  
## smokingFormer              0.321935   0.078293   4.112 3.92e-05 ***
## smokingCurrent             0.248723   0.096242   2.584  0.00976 ** 
## gen_healthFair/Poor        2.722715   0.090305  30.150  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8468.3  on 7999  degrees of freedom
## Residual deviance: 5419.3  on 7983  degrees of freedom
## AIC: 5453.3
## 
## Number of Fisher Scoring iterations: 6

mod_logistic |>
  tbl_regression(
    exponentiate = TRUE,
    label = list(
      menthlth_days ~ "Mentally 
unhealthy days (past 30)",
      bmi           ~ "Body Mass Index", 
      exercise      ~ "Exercise",
      depression    ~ "Ever told depressive disorder", 
      age_group     ~ "Age Group",
      income        ~ "Household Income",
      education     ~ "Education Level",
      smoking       ~ "Smoking status",
      gen_health    ~ "general health status"
    )
  ) |>
  bold_labels() |>
  bold_p()

Characteristic	OR	95% CI	p-value
Mentally unhealthy days (past 30)	1.05	1.05, 1.06	<0.001
Body Mass Index	0.99	0.98, 1.00	0.10
Exercise
No	—	—
Yes	0.54	0.47, 0.62	<0.001
Ever told depressive disorder
No	—	—
Yes	1.20	1.02, 1.41	0.025
Age Group
18-34	—	—
35-49	1.56	1.18, 2.08	0.002
50-64	2.70	2.09, 3.51	<0.001
65+	2.81	2.18, 3.63	<0.001
Household Income
Less than $25K	—	—
$25K-$49K	0.87	0.69, 1.10	0.2
$50K-$99K	0.78	0.63, 0.95	0.016
$100K+	0.72	0.57, 0.91	0.006
Education Level
Less than HS	—	—
HS/GED	1.04	0.83, 1.30	0.8
Some college	1.19	0.94, 1.51	0.14
College Graduate	1.36	1.06, 1.74	0.016
Smoking status
Never	—	—
Former	1.38	1.18, 1.61	<0.001
Current	1.28	1.06, 1.55	0.010
general health status
Excellent/Very Good	—	—
Fair/Poor	15.2	12.8, 18.2	<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

# Calculate and display a table of adjusted odds ratios with 95% confidence intervals for all predictors:
tidy(mod_logistic, conf.int = TRUE, exponentiate = TRUE) |>
kable(digits = 3)

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	0.025	0.232	-15.846	0.000	0.016	0.040
menthlth_days	1.054	0.004	14.366	0.000	1.046	1.061
bmi	0.993	0.004	-1.627	0.104	0.984	1.001
exerciseYes	0.538	0.071	-8.776	0.000	0.468	0.618
depressionYes	1.203	0.083	2.244	0.025	1.023	1.414
age_group35-49	1.561	0.145	3.082	0.002	1.178	2.075
age_group50-64	2.699	0.133	7.486	0.000	2.086	3.509
age_group65+	2.807	0.129	7.983	0.000	2.185	3.628
income$25K-$49K	0.873	0.117	-1.160	0.246	0.694	1.098
income$50K-$99K	0.776	0.106	-2.399	0.016	0.631	0.955
income$100K+	0.721	0.120	-2.722	0.006	0.569	0.913
educationHS/GED	1.036	0.116	0.307	0.759	0.825	1.303
educationSome college	1.193	0.120	1.474	0.140	0.944	1.510
educationCollege Graduate	1.358	0.128	2.399	0.016	1.058	1.745
smokingFormer	1.380	0.078	4.112	0.000	1.183	1.609
smokingCurrent	1.282	0.096	2.584	0.010	1.062	1.548
gen_healthFair/Poor	15.222	0.090	30.150	0.000	12.782	18.214

Step 3 Question(s)

Interpret the adjusted odds ratios for at least three predictors in plain language, using “holding all other variables constant” language. Include at least one continuous and one categorical predictor.

Mentally Unhealthy Days: For every one day increase in mentally unhealthy days, the odds of frequent physical distress are multiplied by 1.05, holding all other variables constant.

Exercise: Compared to those who were aged 18-34, those who were aged 65+ had 2.81 times the odds of frequent physical distress, holding all other variables constant.

Age Group: Compared to those who did not exercise, those who did had 0.54 times the odds of frequent physical distress, holding all other variables constant.

Step 4: Likelihood Ratio Test (5 points)

Choose a group of related predictors to test collectively (for example, all income levels, all education levels, or all smoking levels) All income levels.

# Fit a reduced model that excludes income.
reduced_mod <- glm(
  fpd ~ menthlth_days + bmi + exercise + depression + age_group + education + smoking + gen_health,
 data = brfss_logistic, family = binomial
)

# Conduct a likelihood ratio test comparing the reduced and full models:
test_results <- anova(reduced_mod, mod_logistic, test = "Chisq")

print(test_results)

## Analysis of Deviance Table
## 
## Model 1: fpd ~ menthlth_days + bmi + exercise + depression + age_group + 
##     education + smoking + gen_health
## Model 2: fpd ~ menthlth_days + bmi + exercise + depression + age_group + 
##     income + education + smoking + gen_health
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
## 1      7986     5428.1                       
## 2      7983     5419.3  3   8.7572   0.0327 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step 4 Question(s) State the hypotheses: Null Hypothesis (H0): The income levels does not significantly improve the model.

Alternative Hypothesis (H1): The income levels do significantly improve the model.

Report test statistic, df, and p-value. Test Statistic: 5419.3

Degree of Freedom: 3

P-value: 0.0327

Conclude at alpha = 0.05: Does this group of predictors significantly improve the model?

The p-value is less than 0.05 so we can reject the null hypothesis. The income level does significantly improve the model.

Step 5 ROC Curve and AUC (5 points)

# Use the pROC package to generate a ROC curve for your multiple logistic regression model:
roc_obj <-roc(brfss_logistic$fpd, fitted(mod_logistic))
plot(roc_obj, main = "ROC Curve: Frequent Physical Distress Model",
 print.auc = TRUE)

# Report the AUC with 95% confidence interval:
  auc(roc_obj)

## Area under the curve: 0.8827

ci.auc(roc_obj)

## 95% CI: 0.8741-0.8913 (DeLong)

Step 5 Question(s) Interpret the AUC: What does this value tell you about the model’s ability to discriminate between individuals with and without frequent physical distress? Use the following guide: o 0.5 = no discrimination (coin flip) o 0.5-0.7 = poor discrimination o 0.7-0.8 = acceptable discrimination o 0.8-0.9 = excellent discrimination o 0.9+ = outstanding discrimination

The AUC equals 0.883 so the model is excellent at being able to discriminate between individuals with and without frequent physical distress.

Step 6: Hosmer-Lerneshow Goodness-of-fit Test (5 points)

# Conduct the Hosmer-Lemeshow test using the ResourceSelection package:
hoslem.test(mod_logistic$y, fitted(mod_logistic), g = 10)

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  mod_logistic$y, fitted(mod_logistic)
## X-squared = 9.6182, df = 8, p-value = 0.2929

Step 6 Question(s) State the null hypothesis (the model fits the data well) and alternative hypothesis (the model does not fit the data well) Null Hypothesis (H0): The model fits the data well.

Alternative Hypothesis (H1): The model does not fit the data well.

Report the test statistic, degrees of freedom, and p-value Test Estimate: 9.6182 df: 8 p-value: 0.2929

Interpret: At alpha = 0.05, is there evidence of poor model fit? The p-value is greater than 0.05 so we fail to reject the null hypothesis. The model does fit the data.

In 1-2 sentences, discuss how the Hosmer-Lemeshow result complements the ROC/AUC finding.

A non-significant Hosmer-Lemeshow test tells us the model fits the table well and the ROC/AUC test tells us that our model is good at separating between classes. A model that fits the data well should also be good at separating between classes.

Part 3: Reflection (15 points)

Write 250-400 words in continuous prose:

The results from this lab suggest that some of the factors were significantly associated with physical health burden among U.S adults. The predictors with the strongest relationship with physical health burden in both the linear and logistic models was general health status. Predictors that were significant in the linear regression model but not the logistic regression model were bmi, those aged 35-49. Predictors that were significant in the logistic regression model but not in the linear regression model were those who were ever told they have a depressive disorder, and those who education status was labeled as “some college”. Due to the data being survey data from a cross-sectional study, there are some key limitations. Both the exposure and outcome data was collected at the same time so we can not determine causality, or incidence. Incidence relies on potential changes over time but this study’s data was collected at one point in time so we can only measure the prevalence. Cross-sectional studies are at risk for selection and/or recall bias. Potential confounders not measured in this model are disability and marital status. Marital status was removed from the final model but those who were widowed, or previously married may have poorer physical health. Disability status may be a potential confounder because having a disability may result in physical health decline.

The linear regression model tells us how much the outcome changes by one-unit increases in a predictor while the logistic regression model tells us the odds of the outcome increasing by a one-unit increase in a predictor. Linear regression should be used when your outcome is continuous and logistic regression should be used when your outcome is binary or categorical. The model selection criteria for linear regression is adjusted R² while logistic regression uses AUC. Adjusted R^2 only increases in a model when the new predictor improves the model by more than chance alone. AUC is used to tell how well the model can separate between individuals with or without the outcome.

I did not use AI for this assignment.

EPI 553 — Homework 4: Model Selection, Logistic Regression

Alyssa Ingle

Due: Friday, April 24, 2026 at 11:59 PM ET