EPI553_HW04_Azamatova

#Part 0: Data Acquisition and Preparation

##Setup and Data

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.1     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(haven)
library(broom)
library(knitr)
library(kableExtra)

## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(broom)
library(car)

## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

library(gtsummary)
library(leaps)
library(pROC)

## Warning: package 'pROC' was built under R version 4.5.3

## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

library(ResourceSelection)

## Warning: package 'ResourceSelection' was built under R version 4.5.3

## ResourceSelection 0.3-6   2023-06-27

library(MASS)

## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:gtsummary':
## 
##     select
## 
## The following object is masked from 'package:dplyr':
## 
##     select

1 Loading Survey BRFSS 2023 Data

# Load raw HCW_database data
BRFSS_2023 <- read_xpt(
  "C:/Users/AA843241/OneDrive - University at Albany - SUNY/Desktop/EPI553/hw04/LLCP2023.XPT")

sleep_data <-read_xpt("C:/Users/AA843241/OneDrive - University at Albany - SUNY/Desktop/EPI553/hw04/SLQ_L.XPT")
  
# Display dataset dimensions
head(BRFSS_2023)

## # A tibble: 6 × 350
##   `_STATE` FMONTH IDATE    IMONTH IDAY  IYEAR DISPCODE SEQNO     `_PSU` CTELENM1
##      <dbl>  <dbl> <chr>    <chr>  <chr> <chr>    <dbl> <chr>      <dbl>    <dbl>
## 1        1      1 03012023 03     01    2023      1100 20230000… 2.02e9        1
## 2        1      1 01062023 01     06    2023      1100 20230000… 2.02e9        1
## 3        1      1 03082023 03     08    2023      1100 20230000… 2.02e9        1
## 4        1      1 03062023 03     06    2023      1100 20230000… 2.02e9        1
## 5        1      1 01062023 01     06    2023      1100 20230000… 2.02e9        1
## 6        1      1 01092023 01     09    2023      1100 20230000… 2.02e9        1
## # ℹ 340 more variables: PVTRESD1 <dbl>, COLGHOUS <dbl>, STATERE1 <dbl>,
## #   CELPHON1 <dbl>, LADULT1 <dbl>, NUMADULT <dbl>, RESPSLC1 <dbl>,
## #   LANDSEX2 <dbl>, LNDSXBRT <dbl>, SAFETIME <dbl>, CTELNUM1 <dbl>,
## #   CELLFON5 <dbl>, CADULT1 <dbl>, CELLSEX2 <dbl>, CELSXBRT <dbl>,
## #   PVTRESD3 <dbl>, CCLGHOUS <dbl>, CSTATE1 <dbl>, LANDLINE <dbl>,
## #   HHADULT <dbl>, SEXVAR <dbl>, GENHLTH <dbl>, PHYSHLTH <dbl>, MENTHLTH <dbl>,
## #   POORHLTH <dbl>, PRIMINS1 <dbl>, PERSDOC3 <dbl>, MEDCOST1 <dbl>, …

head(sleep_data)

## # A tibble: 6 × 7
##     SEQN SLQ300 SLQ310 SLD012 SLQ320 SLQ330 SLD013
##    <dbl> <chr>  <chr>   <dbl> <chr>  <chr>   <dbl>
## 1 130378 21:30  07:00     9.5 00:00  09:00       9
## 2 130379 21:00  06:00     9   21:00  06:00       9
## 3 130380 00:00  08:00     8   00:00  09:00       9
## 4 130384 21:30  05:00     7.5 23:00  07:00       8
## 5 130385 22:05  06:15     8   22:05  06:15       8
## 6 130386 00:00  07:30     7.5 00:00  08:00       8

2 Create subset with 13 variables

BRFSS_raw0 <- BRFSS_2023 %>%
  dplyr::select(SEQNO, PHYSHLTH, MENTHLTH,`_BMI5`, SEXVAR,
         EXERANY2,ADDEPEV3,`_AGEG5YR`,`_INCOMG1`,EDUCA,`_SMOKER3`, GENHLTH, MARITAL)


BRFSS_raw <- BRFSS_raw0 %>%
  mutate (SEQN = SEQNO)

head(BRFSS_raw)

## # A tibble: 6 × 14
##   SEQNO PHYSHLTH MENTHLTH `_BMI5` SEXVAR EXERANY2 ADDEPEV3 `_AGEG5YR` `_INCOMG1`
##   <chr>    <dbl>    <dbl>   <dbl>  <dbl>    <dbl>    <dbl>      <dbl>      <dbl>
## 1 2023…       88       88    3047      2        2        2         13          9
## 2 2023…       88       88    2856      2        1        1         13          9
## 3 2023…        6        2    2231      2        1        2         13          1
## 4 2023…        2       88    2744      2        1        1         12          9
## 5 2023…       88       88    2585      2        1        1         12          5
## 6 2023…        2        3    3018      2        1        2          9          5
## # ℹ 5 more variables: EDUCA <dbl>, `_SMOKER3` <dbl>, GENHLTH <dbl>,
## #   MARITAL <dbl>, SEQN <chr>

head(sleep_data, 30)

## # A tibble: 30 × 7
##      SEQN SLQ300 SLQ310 SLD012 SLQ320 SLQ330 SLD013
##     <dbl> <chr>  <chr>   <dbl> <chr>  <chr>   <dbl>
##  1 130378 21:30  07:00     9.5 00:00  09:00       9
##  2 130379 21:00  06:00     9   21:00  06:00       9
##  3 130380 00:00  08:00     8   00:00  09:00       9
##  4 130384 21:30  05:00     7.5 23:00  07:00       8
##  5 130385 22:05  06:15     8   22:05  06:15       8
##  6 130386 00:00  07:30     7.5 00:00  08:00       8
##  7 130387 04:00  07:00     3   03:00  08:00       5
##  8 130388 21:00  06:30     9.5 23:00  09:00      10
##  9 130389 23:00  07:00     8   00:00  08:00       8
## 10 130390 22:00  06:00     8   00:00  08:00       8
## # ℹ 20 more rows

##Merge datasets to create SLEPTIM1

###Select and Recode all variables. Store all categorical variables as labeled factors

brfss_clean <- BRFSS_raw %>%
  mutate(
    # Outcome:
    physhlth_days = case_when(
      PHYSHLTH == 88                    ~ 0,
      PHYSHLTH == 77                    ~ NA_real_,
      PHYSHLTH == 99                    ~ NA_real_,
      PHYSHLTH >= 1 & PHYSHLTH <= 30   ~ as.numeric(PHYSHLTH),
      TRUE                             ~ NA_real_),
  
      #_________PREDICTORS_________________________________
   
     #MENTHLTH 
    menthlth_days = case_when(
      MENTHLTH == 88                    ~ 0,
      MENTHLTH == 77                    ~ NA_real_,
      MENTHLTH == 99                   ~ NA_real_,
      MENTHLTH >= 1 & MENTHLTH <= 30    ~ as.numeric(MENTHLTH),
      TRUE                             ~ NA_real_),
    
    #BMI
    bmi = case_when(
  `_BMI5` == 9999 ~ NA_real_, TRUE ~ `_BMI5` / 100),
    
     sex = factor(case_when(
      SEXVAR == 1 ~ "Male",
      SEXVAR == 2 ~ "Female",
      TRUE ~ NA_character_), 
      levels = c("Male", "Female")),
  

#EXERANY2: 1 = "Yes", 2 = "No"; values 7 and 9 should be recoded to NA.  
    exercise = factor(case_when(
      EXERANY2 == 1 ~ "Yes",
      EXERANY2 == 2 ~ "No",
      EXERANY2 == 7 ~ NA_character_,
      EXERANY2 == 9 ~ NA_character_, 
      TRUE          ~ NA_character_ ), 
      levels = c("No", "Yes")),
    
#ADDEPEV3
depression= factor(case_when(
  ADDEPEV3 == 1 ~ "Yes",
  ADDEPEV3 == 2 ~ "No",
  ADDEPEV3 %in% c(7,9)~ NA_character_, TRUE ~ NA_character_),
  levels = c("No", "Yes")),

# AGEG5YR 
#Collapse: 1-3 = "18-34", 4-6 = "35-49",  7-9 = "50-64", 10-13 ="65+"; 14 to NA
    age_group = factor(case_when(
      `_AGEG5YR` %in% c(1, 2,3)  ~ "18-34", 
      `_AGEG5YR` %in% c(4, 5,6) ~  "35-49", 
      `_AGEG5YR` %in% c(7,8,9)  ~ "50-64",
      `_AGEG5YR` %in% c(10, 11,12, 13) ~ "65+",
      `_AGEG5YR` == 14 ~ NA_character_,  TRUE ~NA_character_),
      levels = c("18-34", "35-49", "50-64","65+")),
    
  #_INCOMG1:
#Collapse: 1-2 = "Less than $25K", 3-4  = "$25K-$49K", 5-6 = "$50K-$99K", 7  = "$100K+"; 9 to NA

      income = factor(case_when(
        `_INCOMG1`%in% c(1,2) ~ "Less than $25,000",
        `_INCOMG1` %in% c(3, 4) ~ "$25,000 to $49,000",
        `_INCOMG1`%in% c(5,6) ~ "$50,000 to $99,000",
        `_INCOMG1`== 7 ~  "$100,000 +",       
        `_INCOMG1`== 9 ~ NA_character_, TRUE ~NA_character_), 
        levels = c("Less than $25,000", #ref
                  "$25,000 to $49,000","$50,000 to $99,000",
                "$100,000 +")),

#EDUCA:  
    education = factor(case_when(
      EDUCA %in% c(1, 2, 3)     ~ "Less than HS",
      EDUCA == 4             ~ "HS diploma or GED",
      EDUCA == 5             ~ "Some college",
      EDUCA == 6             ~ "College graduate",
      EDUCA == 9             ~ NA_character_,
      TRUE                   ~ NA_character_), 
      levels = c("Less than HS", #ref
                 "HS diploma or GED", 
                 "Some college","College graduate")),
      
#_SMOKER3: 1 = "Current daily smoker", 2 = "Current some-day smoker", 3 = "Former smoker", 4 = "Never smoker"; value 9 should be recoded NA
   
   smoking = factor(case_when(
     `_SMOKER3` %in% c(1, 2) ~ "Current",
     `_SMOKER3` == 3 ~ "Former",
     `_SMOKER3` == 4 ~ "Never",
     `_SMOKER3` == 9 ~ NA_character_, TRUE ~ NA_character_),
     levels = c("Never", "Former","Current")),

#GENHLTH
  gen_health=factor(case_when(
    GENHLTH %in% c(1,2) ~ "Excellent/Very Good",
    GENHLTH == 3 ~ "Good",
    GENHLTH %in% c(4,5) ~ "Fair/Poor",
    GENHLTH %in% c(7,9) ~ NA_character_, TRUE ~ NA_character_),
  levels = c("Excellent/Very Good", "Good", "Fair/Poor")),
  
  #MARITAL
  marital = factor(case_when(
    MARITAL %in% c(1,6) ~ "Married/Partnered",
    MARITAL %in% c(2,3,4) ~ "Previously married",
    MARITAL == 5 ~ "Never married",
    MARITAL == 9 ~ NA_character_, 
    TRUE ~ NA_character_), 
    levels = c("Married/Partnered", #ref
               "Previously married", "Never married"))
      )
  
  brfss_clean %>%
  dplyr::select(physhlth_days, menthlth_days, bmi, sex, exercise, depression, age_group, income, education, smoking, gen_health, marital) %>%
  summarise(across(everything(), list(
    n_missing  = ~ sum(is.na(.)),
    pct_missing = ~ round(mean(is.na(.)) * 100, 1)
  ))) %>%
  pivot_longer(
    everything(),
    names_to      = c("Variable", ".value"),
    names_pattern = "(.+)_(n_missing|pct_missing)"
  ) %>%
  arrange(desc(pct_missing)) %>%
  kable(
    caption   = "Missing Observations BRFSS 2023",
    col.names = c("Variable", "N Missing", "% Missing")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Missing Observations BRFSS 2023
Variable	N Missing	% Missing
income	86623	20.0
bmi	40535	9.4
smoking	23062	5.3
physhlth_days	10785	2.5
menthlth_days	8108	1.9
age_group	7779	1.8
marital	4289	1.0
depression	2587	0.6
education	2325	0.5
exercise	1251	0.3
gen_health	1262	0.3
sex	0	0.0

##Create the analytic dataset & Report the final analytic n.

brfss_clean |>
 filter(
    !is.na(physhlth_days),
    !is.na(menthlth_days),
    !is.na(bmi),
    !is.na(sex),
    !is.na(exercise),
    !is.na(depression),
    !is.na(age_group),
    !is.na(income),
    !is.na(education),
    !is.na(smoking), 
    !is.na(gen_health),
    !is.na(marital))

## # A tibble: 308,228 × 26
##    SEQNO      PHYSHLTH MENTHLTH `_BMI5` SEXVAR EXERANY2 ADDEPEV3 `_AGEG5YR`
##    <chr>         <dbl>    <dbl>   <dbl>  <dbl>    <dbl>    <dbl>      <dbl>
##  1 2023000003        6        2    2231      2        1        2         13
##  2 2023000005       88       88    2585      2        1        1         12
##  3 2023000006        2        3    3018      2        1        2          9
##  4 2023000009        5       88    3347      2        2        2         13
##  5 2023000010       88       88    2296      1        1        2         12
##  6 2023000011       88       88    4184      1        1        2          8
##  7 2023000013       88       88    2965      2        1        2         10
##  8 2023000014       88       88    2375      1        2        2         12
##  9 2023000015       88       88    2597      2        2        2         12
## 10 2023000016        4       88    2819      2        1        2         13
## # ℹ 308,218 more rows
## # ℹ 18 more variables: `_INCOMG1` <dbl>, EDUCA <dbl>, `_SMOKER3` <dbl>,
## #   GENHLTH <dbl>, MARITAL <dbl>, SEQN <chr>, physhlth_days <dbl>,
## #   menthlth_days <dbl>, bmi <dbl>, sex <fct>, exercise <fct>,
## #   depression <fct>, age_group <fct>, income <fct>, education <fct>,
## #   smoking <fct>, gen_health <fct>, marital <fct>

# Reproducible random sample
set.seed(1220)
brfss_analytic <- brfss_clean |>
  drop_na() |>
 slice_sample(n = 8000)

# Save for lab activity
saveRDS(brfss_analytic,
  "C:/Users/AA843241/OneDrive - University at Albany - SUNY/Desktop/EPI553/HW04/brfss_clean.rds")

tibble(Metric = c("Observations", "Variables"),
       Value  = c(nrow(brfss_analytic), ncol(brfss_analytic))) |>
  kable(caption = "Analytic Dataset Dimensions") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Analytic Dataset Dimensions
Metric	Value
Observations	8000
Variables	26

##Produce a descriptive statistics table using gtsummary::tbl_summary()

brfss_analytic |>
  dplyr::select(menthlth_days, physhlth_days, bmi, sex, exercise,depression, 
 age_group, income, education, smoking, gen_health, marital) |>
  tbl_summary(
    label = list(
      menthlth_days ~ "Mentally unhealthy days (past 30)",
      physhlth_days ~ "Physically unhealthy days (past 30)",
      bmi           ~ "Body Mass Index",
      sex           ~ "Sex",
      exercise      ~ "Any physical activity (past 30 days)",
      depression    ~"Ever told depressive disorder",
      age_group     ~ "Age Group",
      income        ~ "Annual household income level",
      education     ~ "Education level",
      smoking    ~ "Smoking status",
      gen_health    ~  "General health status",
      marital       ~   "Marital status"
    ),
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) |>
  add_n() |>
  bold_labels() |>
  italicize_levels() |>
  modify_caption("**Table 1. Descriptive Statistics -- BRFSS 2020 (n = 5,000)**") |>
  as_flex_table()

**Table 1. Descriptive Statistics -- BRFSS 2020 (n = 5,000)**
Characteristic	N	N = 8,0001
Mentally unhealthy days (past 30)	8,000	4.3 (8.2)
Physically unhealthy days (past 30)	8,000	4.3 (8.7)
Body Mass Index	8,000	28.7 (6.5)
Sex	8,000
Male		3,971 (50%)
Female		4,029 (50%)
Any physical activity (past 30 days)	8,000	6,157 (77%)
Ever told depressive disorder	8,000	1,776 (22%)
Age Group	8,000
18-34		1,294 (16%)
35-49		1,657 (21%)
50-64		2,109 (26%)
65+		2,940 (37%)
Annual household income level	8,000
Less than $25,000		1,090 (14%)
$25,000 to $49,000		1,931 (24%)
$50,000 to $99,000		4,297 (54%)
$100,000 +		682 (8.5%)
Education level	8,000
Less than HS		391 (4.9%)
HS diploma or GED		1,887 (24%)
Some college		2,115 (26%)
College graduate		3,607 (45%)
Smoking status	8,000
Never		4,808 (60%)
Former		2,230 (28%)
Current		962 (12%)
General health status	8,000
Excellent/Very Good		3,926 (49%)
Good		2,648 (33%)
Fair/Poor		1,426 (18%)
Marital status	8,000
Married/Partnered		4,582 (57%)
Previously married		2,050 (26%)
Never married		1,368 (17%)
1Mean (SD); n (%)

#Part 1: Model Selection for Linear Regression (35 points) * >Research question**: What individual, behavioral, and health factors are associated with physical health burden among U.S. adults? How do findings compare when the outcome is modeled as the number of physically unhealthy days (continuous) versus as an indicator of frequent physical distress (binary: 14 or more days in the past 30)

##Step 1: Step 1: Fit the Maximum Model (5 points)

2.1 Fit a multiple linear regression model with all of your selected predictor. Display the full summary() output.

model_max <- lm(physhlth_days ~ menthlth_days + bmi + sex + exercise + depression + 
 age_group+ income+ education+ smoking+ gen_health+ marital, 
                data=brfss_analytic)

summary(model_max)

## 
## Call:
## lm(formula = physhlth_days ~ menthlth_days + bmi + sex + exercise + 
##     depression + age_group + income + education + smoking + gen_health + 
##     marital, data = brfss_analytic)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.795  -2.839  -1.145   0.690  31.047 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 2.20132    0.62746   3.508 0.000453 ***
## menthlth_days               0.18362    0.01135  16.172  < 2e-16 ***
## bmi                        -0.01877    0.01287  -1.459 0.144638    
## sexFemale                   0.23449    0.16389   1.431 0.152538    
## exerciseYes                -1.93289    0.20408  -9.471  < 2e-16 ***
## depressionYes               0.77536    0.21528   3.602 0.000318 ***
## age_group35-49              0.76935    0.28365   2.712 0.006696 ** 
## age_group50-64              1.40108    0.27874   5.026 5.11e-07 ***
## age_group65+                1.72725    0.27852   6.202 5.87e-10 ***
## income$25,000 to $49,000   -1.08440    0.27627  -3.925 8.74e-05 ***
## income$50,000 to $99,000   -1.52385    0.27607  -5.520 3.50e-08 ***
## income$100,000 +           -1.31580    0.39821  -3.304 0.000956 ***
## educationHS diploma or GED  0.29371    0.40089   0.733 0.463806    
## educationSome college       1.00198    0.40273   2.488 0.012867 *  
## educationCollege graduate   1.05262    0.40445   2.603 0.009269 ** 
## smokingFormer               0.43777    0.18776   2.332 0.019749 *  
## smokingCurrent              0.46414    0.26617   1.744 0.081241 .  
## gen_healthGood              1.40982    0.18635   7.565 4.30e-14 ***
## gen_healthFair/Poor        10.06509    0.25239  39.879  < 2e-16 ***
## maritalPreviously married  -0.26701    0.20679  -1.291 0.196680    
## maritalNever married       -0.41244    0.24899  -1.656 0.097666 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.112 on 7979 degrees of freedom
## Multiple R-squared:  0.3258, Adjusted R-squared:  0.3242 
## F-statistic: 192.8 on 20 and 7979 DF,  p-value: < 2.2e-16

2.2 Report and interpret: R-squared, Adjusted R-squared, AIC (using AIC()), and BIC (using BIC()

glance(model_max) |>
  dplyr::select(r.squared, adj.r.squared, sigma, AIC, BIC, df.residual) |>
  mutate(across(everything(), \(x) round(x, 3))) |>
  kable(caption = "Maximum Model: Fit Statistics") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Maximum Model: Fit Statistics
r.squared	adj.r.squared	sigma	AIC	BIC	df.residual
0.326	0.324	7.112	54115.05	54268.77	7979

Interpretation: The maximum model explains approximately 32.4% of the variability in poor physical health days after accounting for all predictors in the model (Adj. R² = 0.324). This means that around 67% of the variation in the outcome is explained by other predictors, which are not in this model. The residual standard error (sigma = 7.112), this means that on average model’s predicted outcome and the observed real outcome differ by 7.1 days of poor physical health. The AIC and BIC on their own are not very informative and should be used for comparison with the other models. In general, lower AIC indicated a better between models fit and complexity,while BIC is even more strict to the noise, and tends to favor simpler model with less unnecessary predictors.

##Step 2: Best Subsets Regression (10 points)

###Use leaps::regsubsets() to perform best subsets regression on your maximum model

best_subsets <- regsubsets(
  physhlth_days ~ menthlth_days + bmi + sex + exercise + depression + age_group + income + education + smoking + gen_health + marital,
  data=brfss_analytic,
  #Set nvmax equal to the total number of parameters in your maximum model
  nvmax = 20,      
  method = "exhaustive"
)

best_summary <- summary(best_subsets)

#Identify the model size that minimizes BIC
cat("Best model by BIC:", which.min(best_summary$bic), "variables\n")

## Best model by BIC: 9 variables

best_bic_idx <- which.min(best_summary$bic)
best_vars <- names(which(best_summary$which[best_bic_idx, -1]))
cat("\nVariables in BIC-best model:\n")

## 
## Variables in BIC-best model:

cat(paste(" ", best_vars), sep = "\n")

##   menthlth_days
##   exerciseYes
##   depressionYes
##   age_group35-49
##   age_group50-64
##   age_group65+
##   income$50,000 to $99,000
##   gen_healthGood
##   gen_healthFair/Poor

#Identify the model size that maximizes Adjusted R-squared.
cat("Best model by Adj. R^2:", which.max(best_summary$adjr2), "variables\n")

## Best model by Adj. R^2: 19 variables

best_adjr2_idx <- which.max (best_summary$adjr2)
best_vars1 <-names(which(best_summary$which[best_adjr2_idx, -1]))
cat("\nVariables in Adj. R^2-best model:\n")

## 
## Variables in Adj. R^2-best model:

cat(paste(" ", best_vars1), sep = "\n")

##   menthlth_days
##   bmi
##   sexFemale
##   exerciseYes
##   depressionYes
##   age_group35-49
##   age_group50-64
##   age_group65+
##   income$25,000 to $49,000
##   income$50,000 to $99,000
##   income$100,000 +
##   educationSome college
##   educationCollege graduate
##   smokingFormer
##   smokingCurrent
##   gen_healthGood
##   gen_healthFair/Poor
##   maritalPreviously married
##   maritalNever married

Note: regsubsets() treats each dummy variable from a factor as a separate predictor. This means some levels of a categorical variable could be included while others are excluded. This is a known limitation. You will compare these results with stepwise methods (Step 3), which handle factors as complete units.

###Create a summary table or plot showing Adjusted R-squared and BIC across model sizes (number of predictors).

subset_metrics <- tibble(
  p = 1:length(best_summary$adjr2),
  `Adj. R^2` = best_summary$adjr2,
  BIC = best_summary$bic,
  Cp = best_summary$cp
)

p1 <- ggplot(subset_metrics, aes(x = p, y = `Adj. R^2`)) +
  geom_line(linewidth = 1, color = "steelblue") +
  geom_point(size = 3, color = "steelblue") +
  geom_vline(xintercept = which.max(best_summary$adjr2),
             linetype = "dashed", color = "tomato") +
  labs(title = "Adjusted R^2 by Model Size", x = "Number of Variables", y = "Adjusted R^2") +
  theme_minimal(base_size = 12)

p2 <- ggplot(subset_metrics, aes(x = p, y = BIC)) +
  geom_line(linewidth = 1, color = "steelblue") +
  geom_point(size = 3, color = "steelblue") +
  geom_vline(xintercept = which.min(best_summary$bic),
             linetype = "dashed", color = "tomato") +
  labs(title = "BIC by Model Size", x = "Number of Variables", y = "BIC") +
  theme_minimal(base_size = 12)

gridExtra::grid.arrange(p1, p2, ncol = 2)

>Interpretation: The two criteria do not agree on the best model size. While Adjusted R² suggests that 19 variables are best for the model, BIC suggests that only 9 parameters are needed: menthlth_days, exerciseYes, depressionYes, age_group35-49, age_group50-64, age_group65+, income$50,000 to $99,000, gen_healthGood, gen_healthFair/Poor

##Step 3: Stepwise Selection Methods (10 points) >Using the step() function, perform the following three selection procedures. For each, report which variables are in the final model.

###Backward elimination: Start from the maximum model (direction = “backward”)

# Automated backward elimination using AIC
mod_backward <- step(model_max, direction = "backward", trace = 1)

## Start:  AIC=31410.04
## physhlth_days ~ menthlth_days + bmi + sex + exercise + depression + 
##     age_group + income + education + smoking + gen_health + marital
## 
##                 Df Sum of Sq    RSS   AIC
## - marital        2       180 403789 31410
## <none>                       403609 31410
## - sex            1       104 403712 31410
## - bmi            1       108 403716 31410
## - smoking        2       347 403956 31413
## - depression     1       656 404265 31421
## - education      3       876 404485 31421
## - income         3      1550 405158 31435
## - age_group      3      2235 405844 31448
## - exercise       1      4537 408146 31497
## - menthlth_days  1     13230 416839 31666
## - gen_health     2     84670 488279 32930
## 
## Step:  AIC=31409.61
## physhlth_days ~ menthlth_days + bmi + sex + exercise + depression + 
##     age_group + income + education + smoking + gen_health
## 
##                 Df Sum of Sq    RSS   AIC
## - bmi            1        98 403887 31410
## <none>                       403789 31410
## - sex            1       102 403891 31410
## - smoking        2       345 404134 31412
## - depression     1       637 404427 31420
## - education      3       875 404665 31421
## - income         3      1388 405177 31431
## - age_group      3      3048 406837 31464
## - exercise       1      4538 408327 31497
## - menthlth_days  1     13195 416984 31665
## - gen_health     2     84707 488496 32929
## 
## Step:  AIC=31409.55
## physhlth_days ~ menthlth_days + sex + exercise + depression + 
##     age_group + income + education + smoking + gen_health
## 
##                 Df Sum of Sq    RSS   AIC
## <none>                       403887 31410
## - sex            1       103 403990 31410
## - smoking        2       357 404244 31413
## - depression     1       612 404499 31420
## - education      3       875 404762 31421
## - income         3      1394 405281 31431
## - age_group      3      3049 406936 31464
## - exercise       1      4451 408338 31495
## - menthlth_days  1     13238 417125 31666
## - gen_health     2     85645 489532 32944

tidy(mod_backward, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Backward Elimination Result (AIC-based)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Backward Elimination Result (AIC-based)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	1.3853	0.4874	2.8425	0.0045	0.4300	2.3407
menthlth_days	0.1836	0.0114	16.1748	0.0000	0.1614	0.2059
sexFemale	0.2329	0.1633	1.4262	0.1539	-0.0872	0.5530
exerciseYes	-1.9048	0.2031	-9.3793	0.0000	-2.3029	-1.5067
depressionYes	0.7471	0.2149	3.4769	0.0005	0.3259	1.1683
age_group35-49	0.8374	0.2703	3.0980	0.0020	0.3075	1.3672
age_group50-64	1.4778	0.2583	5.7206	0.0000	0.9714	1.9841
age_group65+	1.8366	0.2513	7.3099	0.0000	1.3441	2.3292
income$25,000 to $49,000	-1.0324	0.2749	-3.7557	0.0002	-1.5713	-0.4936
income$50,000 to $99,000	-1.3884	0.2658	-5.2239	0.0000	-1.9094	-0.8674
income$100,000 +	-1.1122	0.3849	-2.8897	0.0039	-1.8666	-0.3577
educationHS diploma or GED	0.2576	0.4006	0.6431	0.5202	-0.5276	1.0428
educationSome college	0.9636	0.4025	2.3943	0.0167	0.1747	1.7525
educationCollege graduate	1.0310	0.4043	2.5502	0.0108	0.2385	1.8236
smokingFormer	0.4399	0.1876	2.3442	0.0191	0.0720	0.8077
smokingCurrent	0.4776	0.2647	1.8038	0.0713	-0.0414	0.9965
gen_healthGood	1.3636	0.1839	7.4161	0.0000	1.0032	1.7241
gen_healthFair/Poor	10.0009	0.2486	40.2245	0.0000	9.5136	10.4883

###Forward selection: Start from an intercept-only model with the maximum model as the upper scope (direction = “forward”)

# Automated forward selection using AIC
mod_null <- lm(physhlth_days ~ 1, data = brfss_analytic)

mod_forward <- step(mod_null,
                    scope = list(lower = mod_null, upper = model_max),
                    direction = "forward", trace = 1)

## Start:  AIC=34524.38
## physhlth_days ~ 1
## 
##                 Df Sum of Sq    RSS   AIC
## + gen_health     2    164022 434665 31967
## + menthlth_days  1     60303 538384 33677
## + exercise       1     34542 564145 34051
## + income         3     28842 569845 34135
## + depression     1     20750 577937 34244
## + smoking        2     12790 585897 34356
## + education      3      8593 590094 34415
## + marital        2      7321 591366 34430
## + bmi            1      6253 592434 34442
## + age_group      3      6527 592160 34443
## + sex            1      1649 597038 34504
## <none>                       598687 34524
## 
## Step:  AIC=31967.07
## physhlth_days ~ gen_health
## 
##                 Df Sum of Sq    RSS   AIC
## + menthlth_days  1   17940.1 416725 31632
## + exercise       1    6466.7 428198 31849
## + depression     1    6070.1 428595 31857
## + income         3    3231.0 431434 31913
## + smoking        2    1776.3 432889 31938
## + age_group      3    1535.7 433129 31945
## + sex            1    1254.2 433411 31946
## + marital        2     830.3 433835 31956
## + education      3     415.5 434250 31965
## <none>                       434665 31967
## + bmi            1       0.2 434665 31969
## 
## Step:  AIC=31631.88
## physhlth_days ~ gen_health + menthlth_days
## 
##              Df Sum of Sq    RSS   AIC
## + exercise    1    5820.8 410904 31521
## + age_group   3    4661.5 412064 31548
## + income      3    1987.0 414738 31600
## + marital     2    1402.5 415322 31609
## + smoking     2    1034.6 415690 31616
## + depression  1     738.1 415987 31620
## + sex         1     605.2 416120 31622
## + education   3     318.9 416406 31632
## <none>                    416725 31632
## + bmi         1       3.6 416721 31634
## 
## Step:  AIC=31521.35
## physhlth_days ~ gen_health + menthlth_days + exercise
## 
##              Df Sum of Sq    RSS   AIC
## + age_group   3    3720.5 407184 31455
## + income      3    1198.4 409706 31504
## + marital     2    1064.8 409839 31505
## + depression  1     739.1 410165 31509
## + smoking     2     706.2 410198 31512
## + education   3     686.9 410217 31514
## + sex         1     449.3 410455 31515
## <none>                    410904 31521
## + bmi         1      82.9 410821 31522
## 
## Step:  AIC=31454.58
## physhlth_days ~ gen_health + menthlth_days + exercise + age_group
## 
##              Df Sum of Sq    RSS   AIC
## + income      3   1162.69 406021 31438
## + depression  1    932.70 406251 31438
## + education   3    501.69 406682 31451
## + sex         1    261.03 406923 31452
## + smoking     2    360.51 406823 31452
## <none>                    407184 31455
## + bmi         1     83.59 407100 31455
## + marital     2     40.11 407144 31458
## 
## Step:  AIC=31437.71
## physhlth_days ~ gen_health + menthlth_days + exercise + age_group + 
##     income
## 
##              Df Sum of Sq    RSS   AIC
## + depression  1    861.57 405159 31423
## + education   3    953.62 405067 31425
## + smoking     2    305.02 405716 31436
## + sex         1    203.01 405818 31436
## <none>                    406021 31438
## + bmi         1     74.91 405946 31438
## + marital     2    153.70 405867 31439
## 
## Step:  AIC=31422.71
## physhlth_days ~ gen_health + menthlth_days + exercise + age_group + 
##     income + depression
## 
##             Df Sum of Sq    RSS   AIC
## + education  3    837.21 404322 31412
## + smoking    2    259.11 404900 31422
## + sex        1    110.16 405049 31423
## + bmi        1    105.12 405054 31423
## <none>                   405159 31423
## + marital    2    169.72 404990 31423
## 
## Step:  AIC=31412.16
## physhlth_days ~ gen_health + menthlth_days + exercise + age_group + 
##     income + depression + education
## 
##           Df Sum of Sq    RSS   AIC
## + smoking  2    332.18 403990 31410
## + bmi      1    110.19 404212 31412
## <none>                 404322 31412
## + sex      1     77.87 404244 31413
## + marital  2    168.26 404154 31413
## 
## Step:  AIC=31409.59
## physhlth_days ~ gen_health + menthlth_days + exercise + age_group + 
##     income + depression + education + smoking
## 
##           Df Sum of Sq    RSS   AIC
## + sex      1    102.92 403887 31410
## <none>                 403990 31410
## + bmi      1     98.80 403891 31410
## + marital  2    169.54 403820 31410
## 
## Step:  AIC=31409.55
## physhlth_days ~ gen_health + menthlth_days + exercise + age_group + 
##     income + depression + education + smoking + sex
## 
##           Df Sum of Sq    RSS   AIC
## <none>                 403887 31410
## + bmi      1    97.861 403789 31410
## + marital  2   170.601 403716 31410

tidy(mod_forward, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Forward Selection Result (AIC-based)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Forward Selection Result (AIC-based)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	1.3853	0.4874	2.8425	0.0045	0.4300	2.3407
gen_healthGood	1.3636	0.1839	7.4161	0.0000	1.0032	1.7241
gen_healthFair/Poor	10.0009	0.2486	40.2245	0.0000	9.5136	10.4883
menthlth_days	0.1836	0.0114	16.1748	0.0000	0.1614	0.2059
exerciseYes	-1.9048	0.2031	-9.3793	0.0000	-2.3029	-1.5067
age_group35-49	0.8374	0.2703	3.0980	0.0020	0.3075	1.3672
age_group50-64	1.4778	0.2583	5.7206	0.0000	0.9714	1.9841
age_group65+	1.8366	0.2513	7.3099	0.0000	1.3441	2.3292
income$25,000 to $49,000	-1.0324	0.2749	-3.7557	0.0002	-1.5713	-0.4936
income$50,000 to $99,000	-1.3884	0.2658	-5.2239	0.0000	-1.9094	-0.8674
income$100,000 +	-1.1122	0.3849	-2.8897	0.0039	-1.8666	-0.3577
depressionYes	0.7471	0.2149	3.4769	0.0005	0.3259	1.1683
educationHS diploma or GED	0.2576	0.4006	0.6431	0.5202	-0.5276	1.0428
educationSome college	0.9636	0.4025	2.3943	0.0167	0.1747	1.7525
educationCollege graduate	1.0310	0.4043	2.5502	0.0108	0.2385	1.8236
smokingFormer	0.4399	0.1876	2.3442	0.0191	0.0720	0.8077
smokingCurrent	0.4776	0.2647	1.8038	0.0713	-0.0414	0.9965
sexFemale	0.2329	0.1633	1.4262	0.1539	-0.0872	0.5530

###Stepwise selection: Start from the intercept-only model with both directions allowed (direction = “both”)

mod_stepwise <- step(mod_null,
                     scope = list(lower = mod_null, upper = model_max),
                     direction = "both", trace = 1)

## Start:  AIC=34524.38
## physhlth_days ~ 1
## 
##                 Df Sum of Sq    RSS   AIC
## + gen_health     2    164022 434665 31967
## + menthlth_days  1     60303 538384 33677
## + exercise       1     34542 564145 34051
## + income         3     28842 569845 34135
## + depression     1     20750 577937 34244
## + smoking        2     12790 585897 34356
## + education      3      8593 590094 34415
## + marital        2      7321 591366 34430
## + bmi            1      6253 592434 34442
## + age_group      3      6527 592160 34443
## + sex            1      1649 597038 34504
## <none>                       598687 34524
## 
## Step:  AIC=31967.07
## physhlth_days ~ gen_health
## 
##                 Df Sum of Sq    RSS   AIC
## + menthlth_days  1     17940 416725 31632
## + exercise       1      6467 428198 31849
## + depression     1      6070 428595 31857
## + income         3      3231 431434 31913
## + smoking        2      1776 432889 31938
## + age_group      3      1536 433129 31945
## + sex            1      1254 433411 31946
## + marital        2       830 433835 31956
## + education      3       416 434250 31965
## <none>                       434665 31967
## + bmi            1         0 434665 31969
## - gen_health     2    164022 598687 34524
## 
## Step:  AIC=31631.88
## physhlth_days ~ gen_health + menthlth_days
## 
##                 Df Sum of Sq    RSS   AIC
## + exercise       1      5821 410904 31521
## + age_group      3      4661 412064 31548
## + income         3      1987 414738 31600
## + marital        2      1402 415322 31609
## + smoking        2      1035 415690 31616
## + depression     1       738 415987 31620
## + sex            1       605 416120 31622
## + education      3       319 416406 31632
## <none>                       416725 31632
## + bmi            1         4 416721 31634
## - menthlth_days  1     17940 434665 31967
## - gen_health     2    121660 538384 33677
## 
## Step:  AIC=31521.35
## physhlth_days ~ gen_health + menthlth_days + exercise
## 
##                 Df Sum of Sq    RSS   AIC
## + age_group      3      3720 407184 31455
## + income         3      1198 409706 31504
## + marital        2      1065 409839 31505
## + depression     1       739 410165 31509
## + smoking        2       706 410198 31512
## + education      3       687 410217 31514
## + sex            1       449 410455 31515
## <none>                       410904 31521
## + bmi            1        83 410821 31522
## - exercise       1      5821 416725 31632
## - menthlth_days  1     17294 428198 31849
## - gen_health     2    101805 512709 33288
## 
## Step:  AIC=31454.58
## physhlth_days ~ gen_health + menthlth_days + exercise + age_group
## 
##                 Df Sum of Sq    RSS   AIC
## + income         3      1163 406021 31438
## + depression     1       933 406251 31438
## + education      3       502 406682 31451
## + sex            1       261 406923 31451
## + smoking        2       361 406823 31451
## <none>                       407184 31455
## + bmi            1        84 407100 31455
## + marital        2        40 407144 31458
## - age_group      3      3720 410904 31521
## - exercise       1      4880 412064 31548
## - menthlth_days  1     19957 427141 31835
## - gen_health     2     94875 502059 33126
## 
## Step:  AIC=31437.71
## physhlth_days ~ gen_health + menthlth_days + exercise + age_group + 
##     income
## 
##                 Df Sum of Sq    RSS   AIC
## + depression     1       862 405159 31423
## + education      3       954 405067 31425
## + smoking        2       305 405716 31436
## + sex            1       203 405818 31436
## <none>                       406021 31438
## + bmi            1        75 405946 31438
## + marital        2       154 405867 31439
## - income         3      1163 407184 31455
## - age_group      3      3685 409706 31504
## - exercise       1      4214 410235 31518
## - menthlth_days  1     18945 424966 31801
## - gen_health     2     87509 493530 32995
## 
## Step:  AIC=31422.71
## physhlth_days ~ gen_health + menthlth_days + exercise + age_group + 
##     income + depression
## 
##                 Df Sum of Sq    RSS   AIC
## + education      3       837 404322 31412
## + smoking        2       259 404900 31422
## + sex            1       110 405049 31423
## + bmi            1       105 405054 31423
## <none>                       405159 31423
## + marital        2       170 404990 31423
## - depression     1       862 406021 31438
## - income         3      1092 406251 31438
## - age_group      3      3871 409030 31493
## - exercise       1      4219 409378 31504
## - menthlth_days  1     13674 418834 31686
## - gen_health     2     86610 491770 32969
## 
## Step:  AIC=31412.16
## physhlth_days ~ gen_health + menthlth_days + exercise + age_group + 
##     income + depression + education
## 
##                 Df Sum of Sq    RSS   AIC
## + smoking        2       332 403990 31410
## + bmi            1       110 404212 31412
## <none>                       404322 31412
## + sex            1        78 404244 31413
## + marital        2       168 404154 31413
## - education      3       837 405159 31423
## - depression     1       745 405067 31425
## - income         3      1495 405818 31436
## - age_group      3      3558 407880 31476
## - exercise       1      4604 408926 31501
## - menthlth_days  1     13607 417929 31675
## - gen_health     2     86813 491135 32964
## 
## Step:  AIC=31409.59
## physhlth_days ~ gen_health + menthlth_days + exercise + age_group + 
##     income + depression + education + smoking
## 
##                 Df Sum of Sq    RSS   AIC
## + sex            1       103 403887 31410
## <none>                       403990 31410
## + bmi            1        99 403891 31410
## + marital        2       170 403820 31410
## - smoking        2       332 404322 31412
## - depression     1       691 404681 31421
## - education      3       910 404900 31422
## - income         3      1457 405447 31432
## - age_group      3      3178 407168 31466
## - exercise       1      4502 408492 31496
## - menthlth_days  1     13347 417337 31668
## - gen_health     2     85552 489542 32942
## 
## Step:  AIC=31409.55
## physhlth_days ~ gen_health + menthlth_days + exercise + age_group + 
##     income + depression + education + smoking + sex
## 
##                 Df Sum of Sq    RSS   AIC
## <none>                       403887 31410
## - sex            1       103 403990 31410
## + bmi            1        98 403789 31410
## + marital        2       171 403716 31410
## - smoking        2       357 404244 31413
## - depression     1       612 404499 31420
## - education      3       875 404762 31421
## - income         3      1394 405281 31431
## - age_group      3      3049 406936 31464
## - exercise       1      4451 408338 31495
## - menthlth_days  1     13238 417125 31666
## - gen_health     2     85645 489532 32944

tidy(mod_stepwise, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Stepwise Selection Result (AIC-based)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Stepwise Selection Result (AIC-based)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	1.3853	0.4874	2.8425	0.0045	0.4300	2.3407
gen_healthGood	1.3636	0.1839	7.4161	0.0000	1.0032	1.7241
gen_healthFair/Poor	10.0009	0.2486	40.2245	0.0000	9.5136	10.4883
menthlth_days	0.1836	0.0114	16.1748	0.0000	0.1614	0.2059
exerciseYes	-1.9048	0.2031	-9.3793	0.0000	-2.3029	-1.5067
age_group35-49	0.8374	0.2703	3.0980	0.0020	0.3075	1.3672
age_group50-64	1.4778	0.2583	5.7206	0.0000	0.9714	1.9841
age_group65+	1.8366	0.2513	7.3099	0.0000	1.3441	2.3292
income$25,000 to $49,000	-1.0324	0.2749	-3.7557	0.0002	-1.5713	-0.4936
income$50,000 to $99,000	-1.3884	0.2658	-5.2239	0.0000	-1.9094	-0.8674
income$100,000 +	-1.1122	0.3849	-2.8897	0.0039	-1.8666	-0.3577
depressionYes	0.7471	0.2149	3.4769	0.0005	0.3259	1.1683
educationHS diploma or GED	0.2576	0.4006	0.6431	0.5202	-0.5276	1.0428
educationSome college	0.9636	0.4025	2.3943	0.0167	0.1747	1.7525
educationCollege graduate	1.0310	0.4043	2.5502	0.0108	0.2385	1.8236
smokingFormer	0.4399	0.1876	2.3442	0.0191	0.0720	0.8077
smokingCurrent	0.4776	0.2647	1.8038	0.0713	-0.0414	0.9965
sexFemale	0.2329	0.1633	1.4262	0.1539	-0.0872	0.5530

###In 3-5 sentences, compare the results:

• Do the three methods agree on the same final model? >Yes, basically all three methods agree on the same final model.

• Which variables (if any) were excluded by all three methods? > Bmi, sleep_hrs, sex,education,and smoking were excluded by all three methods.

• Which variables were retained by all three methods? > menthlth_days, sex, exercise, depression, all age groups, all income groups, all education groups, smoking status, and gen_health status were retained by all three methods.

##Step 4: Final Model Selection and Interpretation (5 points)

###Choose your final model. State which method(s) guided your choice and why. Conclusion:Although the AIC model includes conceptually important predictors of physically unhealthy days (like income, marital status, and education), I decided to go with the BIC model to avoid overfitting. BIC penalizes model complexity more strongly, so it gives a simpler, more stable model that is more likely to generalize well, even if it leaves out some variables that seem important in theory.

###Display the coefficient table for your final model using tidy() with confidence intervals

mod_fin <- lm(physhlth_days ~ menthlth_days + exercise + depression + age_group+ income+ gen_health+ marital, data=brfss_analytic)

tidy(mod_fin, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "AIC-based model: Physical health days(30days) and 7 predictors",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

AIC-based model: Physical health days(30days) and 7 predictors
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	2.3868	0.3976	6.0033	0.0000	1.6074	3.1661
menthlth_days	0.1863	0.0113	16.4431	0.0000	0.1641	0.2085
exerciseYes	-1.8268	0.2006	-9.1086	0.0000	-2.2200	-1.4337
depressionYes	0.8845	0.2126	4.1600	0.0000	0.4677	1.3012
age_group35-49	0.8898	0.2805	3.1722	0.0015	0.3399	1.4396
age_group50-64	1.5153	0.2759	5.4931	0.0000	0.9746	2.0561
age_group65+	1.9082	0.2747	6.9461	0.0000	1.3697	2.4467
income$25,000 to $49,000	-0.9724	0.2747	-3.5397	0.0004	-1.5110	-0.4339
income$50,000 to $99,000	-1.2940	0.2651	-4.8821	0.0000	-1.8136	-0.7744
income$100,000 +	-1.0141	0.3815	-2.6580	0.0079	-1.7620	-0.2662
gen_healthGood	1.3250	0.1834	7.2252	0.0000	0.9655	1.6844
gen_healthFair/Poor	9.9619	0.2467	40.3886	0.0000	9.4784	10.4454
maritalPreviously married	-0.2057	0.2059	-0.9990	0.3178	-0.6093	0.1979
maritalNever married	-0.4314	0.2487	-1.7343	0.0829	-0.9189	0.0562

###Interpret the coefficients for at least three predictors in plain language, including units and “holding all other variables constant” language. Include at least one continuous and one categorical predictor >Interpretation: For every day increase of poor mental health, there are 0.2 more days of poor physical health on average, hooding all other variables constant. Compared to individuals aged 18–34, those in the 50–64 age group have, on average, 1.52 more days of poor physical health, holding all other variables constant. Compared to individuals who report excellent or very good general health, those reporting fair or poor health have, on average, 9.96 more days of poor physical health, holding all other variables constant.

##Step 5: LINE Assumption Check (5 points)

###Using your final linear model, produce the four base R diagnostic plots:

par(mfrow = c(2, 2))
plot(mod_fin,  col = adjustcolor("steelblue", alpha.f = 0.3), pch = 16)

par(mfrow = c(1, 1))

###Interpret each panel in 1-2 sentences: 9. Residuals vs. Fitted: The plot shows a clear pattern rather than a random scatter around zero, suggesting potential non-linearity and non-constant variance. This is not unexpected, as the outcome (poor physical health days) is bounded between 0 and 30 and is right-skewed; however, the large sample size (n = 5,000) provides some robustness to these violations.

10. Normal Q-Q: The residuals do not closely follow the diagonal line and display an S-shaped pattern, indicating deviations from normality, particularly in the tails.

11. Scale-Location: The red line is not horizontal and shows an increasing trend, suggesting that the variance of residuals increases with fitted values (heteroscedasticity).

12. Residuals vs. Leverage: There are a few observations with relatively higher leverage and larger residuals (like the labeled points), so they could be somewhat influential. However, most points are clustered with low leverage and residuals around zero, with no clear pattern indicating major model misspecification. Cook’s distance lines are not exceeded by most observations, so none of these points are strongly influential. Overall, the model does not seem to be driven by a few extreme observations.

###Write a brief conclusion (2-3 sentences): Are the LINE assumptions reasonably satisfied? What violations, if any, do you observe? Conclusion: The LINE assumptions are only partially satisfied. There is evidence of non-linearity and heteroscedasticity, as well as deviations from normality in the residuals. While a few observations show higher leverage, no single point appears to be strongly influential, and the large sample size provides some robustness to these violations.

#Part 2: Logistic Regression (40 points)

You will now reframe the research question using a binary outcome. The CDC defines frequent physical distress as reporting 14 or more physically unhealthy days in the past 30 days.

##Step 1: Create the Binary Outcome (5 points) ###Create a new variable frequent_distress: • 1 if physhlth_days >= 14 • 0 if physhlth_days < 14 • Store as a factor with levels “No” (reference) and “Yes”

brfss_log <- brfss_analytic |>
  mutate(
    frequent_distress= factor(case_when(
      physhlth_days >= 14 ~ 1,
      physhlth_days < 14 ~ 0, 
      TRUE ~ NA_real_), levels = c(0,1), labels = c("No", "Yes"))
  )

brfss_log %>%
  count(frequent_distress) %>%
  mutate(percent = round(n / sum(n) * 100, 2)) %>%
  kable(
    col.names = c("Frequent physical distress", "Count", "Percent (%)"),
    caption = "Distribution of frequent physical distress (Binary)"
  ) %>%
  kable_styling(full_width = FALSE)

Distribution of frequent physical distress (Binary)
Frequent physical distress	Count	Percent (%)
No	6948	86.85
Yes	1052	13.15

###Report the prevalence of frequent physical distress (count and percentage in each category) > There is an imbalance in proportion of people who’s considered having physically unhealthy days and who don’t with more than 1:6 ratio

##Step 2: Simple (Unadjusted) Logistic Regression (10 points) • Choose one predictor from your final linear model that you expect to have a strong association with physical distress. State why you chose it. >General health status was identified as the strongest predictor of poor physical health days, with individuals reporting poor or fair health experiencing more days of poor physical health, as demonstrated in the Part 1 linear regression models.

• Fit a simple logistic regression: mod_simple <- glm(frequent_distress ~ your_predictor, data = brfss_analytic, family = binomial)

mod_simple <- glm(frequent_distress ~ gen_health, data = brfss_log, family = binomial)

#Display the model summary
summary(mod_simple)

## 
## Call:
## glm(formula = frequent_distress ~ gen_health, family = binomial, 
##     data = brfss_log)
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -3.39831    0.09021  -37.67   <2e-16 ***
## gen_healthGood       1.14179    0.11198   10.20   <2e-16 ***
## gen_healthFair/Poor  3.28880    0.10465   31.43   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6227.7  on 7999  degrees of freedom
## Residual deviance: 4754.1  on 7997  degrees of freedom
## AIC: 4760.1
## 
## Number of Fisher Scoring iterations: 6

• Calculate and report the odds ratio and 95% confidence interval:

exp(coef(mod_simple)) # Odds ratio

##         (Intercept)      gen_healthGood gen_healthFair/Poor 
##          0.03342985          3.13235705         26.81066762

exp(confint(mod_simple)) # 95% CI for OR

## Waiting for profiling to be done...

##                           2.5 %      97.5 %
## (Intercept)          0.02787183  0.03970661
## gen_healthGood       2.52057996  3.91109619
## gen_healthFair/Poor 21.91471664 33.03776143

• Interpret the odds ratio in plain language. For a continuous predictor: Interpretation:Compared to individuals who reported excellent/very good status of general health, individuals in Good general health status group have 3.13 times the odds of frequent physical distress, 95% CI [2.52-3.91]. Compared to reference group (excellent/very good), individuals in fair/poor general health status group have 26.81 times the odds of frequent physical distress, 95% CI [21.91-33.04].

• State whether the association is statistically significant at alpha = 0.05 and how you know (Wald test p-value or CI excluding 1.0) >The association for general health status and frequent physical distress across all categories is statistically significant at at alpha = 0.05, as CI 95% across groups do no include null value (aOR=1);

##Step 3: Multiple Logistic Regression (10 points) • Fit a multiple logistic regression using the same predictors as your final linear model from Part 1:

mod_logistic <- glm(frequent_distress ~ menthlth_days + exercise + depression + age_group+ income+ gen_health+ marital,  data = brfss_log, family = binomial)

#Display the model summary
summary(mod_logistic)

## 
## Call:
## glm(formula = frequent_distress ~ menthlth_days + exercise + 
##     depression + age_group + income + gen_health + marital, family = binomial, 
##     data = brfss_log)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -3.490873   0.208953 -16.706  < 2e-16 ***
## menthlth_days              0.045827   0.004326  10.594  < 2e-16 ***
## exerciseYes               -0.558746   0.083322  -6.706 2.00e-11 ***
## depressionYes              0.316261   0.094413   3.350 0.000809 ***
## age_group35-49             0.560087   0.160650   3.486 0.000490 ***
## age_group50-64             0.808137   0.153370   5.269 1.37e-07 ***
## age_group65+               1.022920   0.153914   6.646 3.01e-11 ***
## income$25,000 to $49,000  -0.176534   0.110401  -1.599 0.109815    
## income$50,000 to $99,000  -0.441539   0.111672  -3.954 7.69e-05 ***
## income$100,000 +          -0.477479   0.219215  -2.178 0.029396 *  
## gen_healthGood             0.874252   0.115220   7.588 3.26e-14 ***
## gen_healthFair/Poor        2.663291   0.112226  23.731  < 2e-16 ***
## maritalPreviously married -0.093772   0.094969  -0.987 0.323446    
## maritalNever married      -0.149673   0.127202  -1.177 0.239336    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6227.7  on 7999  degrees of freedom
## Residual deviance: 4441.0  on 7986  degrees of freedom
## AIC: 4469
## 
## Number of Fisher Scoring iterations: 6

• Calculate and display a table of adjusted odds ratios with 95% confidence intervals for all predictors:

tidy(mod_logistic, conf.int = TRUE, exponentiate = TRUE) |>
 kable(digits = 3)

term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	0.030	0.209	-16.706	0.000	0.020	0.046
menthlth_days	1.047	0.004	10.594	0.000	1.038	1.056
exerciseYes	0.572	0.083	-6.706	0.000	0.486	0.674
depressionYes	1.372	0.094	3.350	0.001	1.139	1.650
age_group35-49	1.751	0.161	3.486	0.000	1.281	2.406
age_group50-64	2.244	0.153	5.269	0.000	1.667	3.042
age_group65+	2.781	0.154	6.646	0.000	2.065	3.777
income$25,000 to $49,000	0.838	0.110	-1.599	0.110	0.675	1.041
income$50,000 to $99,000	0.643	0.112	-3.954	0.000	0.517	0.801
income$100,000 +	0.620	0.219	-2.178	0.029	0.398	0.943
gen_healthGood	2.397	0.115	7.588	0.000	1.916	3.011
gen_healthFair/Poor	14.343	0.112	23.731	0.000	11.545	17.930
maritalPreviously married	0.910	0.095	-0.987	0.323	0.755	1.096
maritalNever married	0.861	0.127	-1.177	0.239	0.670	1.103

• Interpret the adjusted odds ratios for at least three predictors in plain language, using “holding all other variables constant” language. Include at least one continuous and one categorical predictor. Intereptation: For every 1 day increase of poor mental health, the odds of frequent physical distress are multiplied by 1.05, 95% CI [1.04-1.06], holding all other variables constant. Compared to individuals of 18-34 years of age, individuals of 50-64 years of age have 2.24 times the odds of frequent physical distress, 95% CI [1.67-3.04], holding all other variables constant. Compared to individuals in excellent/very good general health status, individuals in fair/good general health status group have 14.34 times the odds of having frequent physical distress, CI 95%[11.55-17.93], holding all other variables constant.

##Step 4: Likelihood Ratio Test (5 points) • Choose a group of related predictors to test collectively (for example, all income levels, all education levels, or all smoking levels) • Fit a reduced model that excludes this group

mod_reduced <- glm(frequent_distress ~ menthlth_days + exercise  + depression + age_group+  gen_health+ marital,  data = brfss_log, family = binomial)

#Conduct a likelihood ratio test comparing the reduced and full models:
anova(mod_reduced, mod_logistic, test = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: frequent_distress ~ menthlth_days + exercise + depression + age_group + 
##     gen_health + marital
## Model 2: frequent_distress ~ menthlth_days + exercise + depression + age_group + 
##     income + gen_health + marital
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1      7989     4458.1                          
## 2      7986     4441.0  3   17.149 0.0006587 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

• State the null and alternative hypotheses

H0: The income variable doesn’t significantly improve the model’s fit; H1: The income variable significantly improves the model’s fit;

• Report the test statistic (deviance), degrees of freedom, and p-value Conclusion: The change in deviance = 17.15 with 3 degrees of freedom (for each income category) significantly improve the model’s fit at alpha = 0.05 (p-value < 0.001), we reject the null hypothesis .

##Step 5: ROC Curve and AUC (5 points) • Use the pROC package to generate a ROC curve for your multiple logistic regression model:

roc_obj <- roc(brfss_log$frequent_distress, fitted(mod_logistic))

## Setting levels: control = No, case = Yes

## Setting direction: controls < cases

plot(roc_obj, main = "ROC Curve: Frequent Physical Distress Model",
 print.auc = TRUE)

• Report the AUC with 95% confidence interval:

auc(roc_obj)

## Area under the curve: 0.853

ci.auc(roc_obj)

## 95% CI: 0.8399-0.866 (DeLong)

###Interpret the AUC: What does this value tell you about the model’s ability to discriminate between individuals with and without frequent physical distress? Use the following guide: • 0.5 = no discrimination (coin flip) • 0.5-0.7 = poor discrimination • 0.7-0.8 = acceptable discrimination • 0.8-0.9 = excellent discrimination • 0.9+ = outstanding discrimination

Interpretation: The area under the curve (AUC) is 0.853 (95% CI: 0.839–0.866), indicating excellent discrimination. This means that there is 85.3% probability that the model is able to distinguish between individuals with and without frequent physical distress.The 95% Confidence interval suggests that the model has high precision.

##Step 6: Hosmer-Lemeshow Goodness-of-Fit Test (5 points)

#Conduct the Hosmer-Lemeshow test using the ResourceSelection package:
#Report the test statistic, degrees of freedom, and p-value
hoslem.test(mod_logistic$y, fitted(mod_logistic), g = 10)

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  mod_logistic$y, fitted(mod_logistic)
## X-squared = 7.0796, df = 8, p-value = 0.5281

###State the null hypothesis (the model fits the data well) and alternative hypothesis (the model does not fit the data well) H0: The model fits the data well H1: The model does not fit the data well

###Interpret: At alpha = 0.05, is there evidence of poor model fit?. In 1-2 sentences, discuss how the Hosmer-Lemeshow result complements the ROC/AUC Interpretation: The Hosmer–Lemeshow test (Chi-squared = 7.99, df = 8, p = 0.434) is not significant at alpha = 0.05, indicating that the model fits the data well. > The non-significant Hosmer–Lemeshow test (p = 0.434) suggests that the model fits the data well and predicted probabilities align with observed outcomes. Together with the high AUC (0.853), this means the model not only discriminates well between individuals with and without frequent physical distress, but also produces reliable predictions.

#Part 3: Reflection (15 points)

Write 250-400 words in continuous prose (no bullet points). ###A. Statistical Insights (5 points):

From the multivariable linear regression of maximum model we conclude that mental health days, exercise, depression, age_group50-64, age group 65+, all categories of income, education some college and education college graduate, general health statuses and marital statuses have statistically significant associations with different directions and strength with poor physical health days. In our final multivariate linear model,in comparison with individuals in excellent/ very good general health status group, individuals with fair/poor general health status have on average almost 10 more days of poor physical health, holding all other variables in constant (p<0.0001). Surely, in comarison with younger individuals (18-34 y.o.), participants in age group 65+ have on average 1.7 more days of poor physical health, holding all other variables in constant (p<0.0001). Whereas, all income categories have protective factor. For example, in comparison with individuals with less than 25k annual income, individuals with 100k+ annual income have on average 2.3 less days of poor physical health (p<0.0001), holding all other variables in constant. For the binary outcome that was coded off of physical health days (>14 days), in multivariable logistic regression, those strongest predictors all remained statistically significant at alpha = 0.05. In comparison with excellent/very good general health status group participants, participants in fair/poor general health status have 13.7 times the odds of frequent mental distress. In comparison with younger individuals in reference group, individuals 65 years and older have 2 times the odds of frequent mental distress. Individuals with 100k + annual income have 56% less the odds of frequent mental distress compared to those with income less than 25K (aOR=0.44). There were some changes in statistical significance across models. In the linear regression, depression and marital status categories were significantly associated with the number of physically unhealthy days. However, in the logistic regression, frequent distress (physhlth_days>14 days), these variables were no longer significant (p>0.05), likely because dichotomizing the outcome reduces information and power, which can make it harder to detect associations. It is also important to note that the data come from a cross-sectional survey, which has some limitations. Since the outcome and predictors were measured at the same point of time, temporality cannot be established, and therefore causal inferences cannot be made. Additionally, the final linear regression model explained only about 30% of the variability in physically unhealthy days, meaning that 70% remains unexplained. This suggests that other important factors not included in the model may contribute to the outcome. For example, chronic conditions and multimorbidity could play a significant role, as prior literature suggests their impact on self-rated health and physically unhealthy days. Factors such as recent injuries or trauma (e.g., within the past 30 days) may also act as important contributors, as well as the acute infections that have their seasonality (influenza, CoViD-19 etc.)

###B. Linear versus Logistic Regression (5 points):

The linear regression model estimates the average change in the outcome for a one-unit increase of a continuous predictor, or the difference in mean outcomes between categories, holding other variables constant. Whereas, the logistic regression model estimates the odds of the outcome occurring across different predictors or categories, calculated as adjusted odds ratios. So, linear regression focuses on changes in the outcome itself, while logistic regression focuses on the likelihood (odds) of developing the outcome. The linear regression model tells you the estimated difference in the outcome in its original units (days, kg/m², mmHg, years, etc.), meaning you can directly interpret how much the outcome changes across predictors or categories. In contrast, the logistic regression model tells you the odds of the outcome occurring across different predictors or groups, expressed as odds ratios, which reflect the strength of association but not the actual change in outcome units. The choice of the regression analysis type depends on the outcome variable. If the outcome is continuous, it is preferable to use linear regression and keep it as continuous, because this keeps the full information in the data and allows interpretation in the original units. Dichotomizing a continuous variable reduces the information and statistical power, and as shown in our model, some predictors may become statistically insignificant. Logistic regression is more appropriate when the outcome is truly binary or when there is a clinically meaningful cut-off. For model selection in linear regression, we typically use Adjusted R² rather than R², because R² will always stay the same or increase when adding variables, even if they do not improve the model. Adjusted R² penalizes unnecessary complexity, so it can decrease if added predictors introduce noise and do not meaningfully improve model fit. Another useful criterion is BIC, which also penalizes model complexity more strongly and helps avoid overfitting.

In logistic regression, model performance is often evaluated using AUC (Area Under the ROC Curve), which measures the model’s ability to correctly distinguish between the two outcome classes (event vs non-event). The interpretation is: 0.5 = no discrimination (random guess) 0.5–0.7 = poor discrimination 0.7–0.8 = acceptable discrimination 0.8–0.9 = excellent discrimination 0.9 = outstanding discrimination

So, linear regression focuses on explaining variance and penalizing complexity, while logistic regression focuses on discrimination between outcome groups.

###C. AI Transparency (5 points): Describe any parts of this assignment where you sought AI assistance. I used Chatgpt for paraphrasing purposes in part3. “can i say that dichotomizing outcome reduces the sensitivity of changes in estimates for predictors and outcome”.

The end of the assignment4

EPI553_HW04_Azamatova_Aigerim

Azamatova Aigerim

2026-04-24

1 Loading Survey BRFSS 2023 Data

2 Create subset with 13 variables

2.1 Fit a multiple linear regression model with all of your selected predictor. Display the full summary() output.

2.2 Report and interpret: R-squared, Adjusted R-squared, AIC (using AIC()), and BIC (using BIC()