#Assignment 4

Part 0: Preparation

Step 1

In this initial chunk I am loading the packages that I will need for this assignment. The full data set has 433323 observations and 9 selected variables. I selected physical health days (outcome variable) as well as the following predictors mental health days, bmi, depression, sex, age by age group, smoking status, general health and marital status. These variables were selected in particular to see if they have any impact on physical unhealthy days.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.1     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(haven)
## Warning: package 'haven' was built under R version 4.5.2
library(janitor)
## Warning: package 'janitor' was built under R version 4.5.2
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(knitr)
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.5.2
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(broom)
library(gtsummary)
## Warning: package 'gtsummary' was built under R version 4.5.2
library(GGally)
## Warning: package 'GGally' was built under R version 4.5.2
library(car)
## Warning: package 'car' was built under R version 4.5.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.5.2
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
library(ggeffects)
## Warning: package 'ggeffects' was built under R version 4.5.2
library(plotly)
## Warning: package 'plotly' was built under R version 4.5.2
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(leaps)
## Warning: package 'leaps' was built under R version 4.5.2
library(MASS)
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:plotly':
## 
##     select
## 
## The following object is masked from 'package:gtsummary':
## 
##     select
## 
## The following object is masked from 'package:dplyr':
## 
##     select
library(ResourceSelection)  # for Hosmer-Lemeshow
## Warning: package 'ResourceSelection' was built under R version 4.5.3
## ResourceSelection 0.3-6   2023-06-27
library(pROC)               # for ROC/AUC
## Warning: package 'pROC' was built under R version 4.5.3
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
library(performance)        # for model performance metrics
## Warning: package 'performance' was built under R version 4.5.3
library(sjPlot)
## Warning: package 'sjPlot' was built under R version 4.5.3
## 
## Attaching package: 'sjPlot'
## 
## The following object is masked from 'package:ggplot2':
## 
##     set_theme
library(modelsummary)
## Warning: package 'modelsummary' was built under R version 4.5.3
library(dplyr)

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))
## Setting theme "Compact"
## Setting theme "Compact"
brfss_23 <- read_xpt(
  "C:/Users/ariel/OneDrive/DrPH - Albany/4. Spring Semester 2026/HEPI 553/HEPI_553/data/LLCP2023.XPT") %>% 
  clean_names()

brfss_23 <- brfss_23 %>% 
  dplyr::select(physhlth, menthlth, bmi5, addepev3, sexvar, ageg5yr, smoker3, genhlth, marital)

Step 2

In this section I recoded the categorical and binary variables to factor variables and ensured that they have meaningful labes to ensure easy readiablity for the reader.I also made sure that the reference macthed the folloing list for the categorical variables.

• sex: “Male” as reference • depression: “No” as reference • age_group: “18-34” as reference • smoking: “Never” as reference • gen_health: “Excellent/Very Good” as reference • marital: “Married/Partnered” as reference

brfss_23_clean <- brfss_23 |>
  mutate(
    
  #Outcome: physically unhealthy days in past 30
    physhlth_days = case_when(
      physhlth == 88 ~ 0,
      physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
      TRUE  ~ NA_real_ ),
 
  ## Selected Predictors for model
  
  #Mentally unhealthy days in past 30
    menthlth_days = case_when(
      menthlth == 88 ~ 0,
      menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
      TRUE  ~ NA_real_),   

  #BMI
    bmi = ifelse(bmi5 > 0, bmi5 / 100, NA_real_),
    
  #Sex
    sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),

  #Depression
    depression = factor(addepev3, levels = c(2, 1), labels = c("No", "Yes")),

  #Age groups (13-level raw BRFSS variable Age Group)
    # 1 +2 + 3 = 18 to 34
    # 4 +5 +6= 35 - 39
    # 7 +8+9= 50 - 64
    # 10 +11 +12 +13 = 65+
    # 14 = Refused
    age_group = factor(case_when(
      ageg5yr %in% c(1, 2, 3) ~ "18-34",
      ageg5yr %in% c(4, 5, 6) ~ "35-49",
      ageg5yr %in% c(7, 8, 9) ~ "50-64",
      ageg5yr %in% c(10, 11, 12, 13) ~ "65+",
      TRUE ~ NA_character_), 
    levels = c("18-34", "35-49","50-64", "65+")),

  #Smoker (4-cat raw BRFSS variable)
    # 1 + 2 = Current 
    # 3 = Former 
    # 4 = Never 
    # 9 = Refused 
    smoking = factor(case_when(
      smoker3 %in% c(1, 2) ~ "Current",
      smoker3 == 3 ~ "Former",
      smoker3 == 4 ~ "Never",
       TRUE                 ~ NA_character_), 
      levels = c("Current","Former","Never")), 
      
  #General Health
    gen_health = case_when(
      genhlth %in% c(1, 2) ~ "Excellent/VG",
      genhlth == 3         ~ "Good",
      genhlth %in% c(4, 5) ~ "Fair/Poor",
      TRUE                 ~ NA_character_),

  #Marital Status
    marital = case_when(
      marital %in% c(1, 6) ~ "Married/Partnered",
      marital %in% c(2, 3, 4) ~ "Previously Married",
      marital == 5 ~ "Never married",
      TRUE ~ NA_character_ ))

Step 3

In this next chunk, I will be looking at the number and percent of missing data for the outcome and predictors. I will also present the final analytic samples.

# Percent Missing 
brfss_23_clean %>%
  dplyr::select(physhlth_days, menthlth_days, bmi, sex, depression,age_group,smoking,gen_health, marital) %>%
  summarise(across(everything(), ~ sum(is.na(.) | . %in% c(77, 88, 99)) / n() * 100)) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Pct_Missing_or_DK") %>%
  mutate(Pct_Missing_or_DK = round(Pct_Missing_or_DK, 1)) %>%
  arrange(desc(Pct_Missing_or_DK)) %>%
  kable(
    caption = "Table 1. Missing or Don't Know/Refused (%) — BRFSS 2023 Full Sample",
    col.names = c("Variable", "% Missing / DK / Refused")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 1. Missing or Don’t Know/Refused (%) — BRFSS 2023 Full Sample
Variable % Missing / DK / Refused
bmi 9.4
smoking 5.3
physhlth_days 2.5
menthlth_days 1.9
age_group 1.8
marital 1.0
depression 0.6
gen_health 0.3
sex 0.0

The number of missing for physical unhealthy days was XXX (2.5%) in full dataset. BMI had the most amount of missing in the dataset with a count of XXX (9.4%) missing. Sex had no missing values in the dataset.

The next step I will be dropping all the missing data from the final analytic dataset for all the variables in the dataset and creating a sample dataset of 8.000. I saved this dataset in a working directory and produced a table of the final analytic sample and number of predictors in the dataset.

# Reproducible random sample
set.seed(1220)
brfss_23_hw4 <- brfss_23_clean |>
  dplyr::select(physhlth_days, menthlth_days, bmi, sex, depression, age_group, smoking , gen_health, marital) |>
  tidyr::drop_na(physhlth_days,physhlth_days, menthlth_days, bmi, sex, depression, age_group, smoking, gen_health, marital) |> 
  dplyr::slice_sample(n = 8000)

# Save for review
saveRDS(brfss_23_hw4,
  "C:/Users/ariel/OneDrive/DrPH - Albany/4. Spring Semester 2026/HEPI 553/HEPI_553/data/brfss_hw4_2023.rds")

tibble(Metric = c("Observations", "Variables"),
       Value  = c(nrow(brfss_23_hw4), ncol(brfss_23_hw4))) |>
  kable(caption = "Table 2: Analytic Dataset Dimensions") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Table 2: Analytic Dataset Dimensions
Metric Value
Observations 8000
Variables 9

The next chunk will produce Table 3, this table contains all the variables in dataset. The count and standard deviation is reported for continuos variables and count and percentages are reported for binary and categorical variables.

#Descriptive Summary Table 

brfss_23_hw4 %>%
  dplyr::select(physhlth_days, menthlth_days, bmi, sex, depression, age_group, smoking , gen_health, marital) %>%
  tbl_summary(
    label = list(
      physhlth_days ~ "Physically unhealthy days (past 30)",
      menthlth_days ~ "Mentally unhealthy days (past 30)",
      bmi           ~ "BMI (kg/m^2)",  
      sex           ~ "Sex",      
      depression    ~ "Depression (Yes)",
      age_group     ~ "Age (Age Groups)",
      smoking       ~ "Smoking Status",      
      gen_health    ~ "General Health Status",
      marital       ~ "Marital Status" 
    ),
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) %>%
  add_n() %>%
  bold_labels() %>%
  modify_caption("**Table 3: Descriptive Statistics — BRFSS 2023 Analytic Sample (n = 8,000)**")
Table 3: Descriptive Statistics — BRFSS 2023 Analytic Sample (n = 8,000)
Characteristic N N = 8,0001
Physically unhealthy days (past 30) 8,000 4.4 (8.7)
Mentally unhealthy days (past 30) 8,000 4.3 (8.2)
BMI (kg/m^2) 8,000 28.5 (6.5)
Sex 8,000
    Male
3,881 (49%)
    Female
4,119 (51%)
Depression (Yes) 8,000 1,607 (20%)
Age (Age Groups) 8,000
    18-34
1,284 (16%)
    35-49
1,570 (20%)
    50-64
2,048 (26%)
    65+
3,098 (39%)
Smoking Status 8,000
    Current
884 (11%)
    Former
2,183 (27%)
    Never
4,933 (62%)
General Health Status 8,000
    Excellent/VG
3,927 (49%)
    Fair/Poor
1,387 (17%)
    Good
2,686 (34%)
Marital Status 8,000
    Married/Partnered
4,620 (58%)
    Never married
1,355 (17%)
    Previously Married
2,025 (25%)
1 Mean (SD); n (%)

Part 1: Model Selection for Linear Regression (35 points)

In this next part I will be conducted a full multi-linear regression with physical unhealthy days and the predictors listed above.

Step 1: Fit the Maximum Model

The chunk will run the full multi-variable linear regression and report the significant varirables in Table 4. This chuck will also report out the R squared, adjusted R squared, AIC and BIC for the full model in Table 5.

# Full Linear Model: Full multivariable model
mlr_m1 <- lm(physhlth_days ~ menthlth_days + bmi + depression + sex + age_group + smoking + gen_health + marital,
         data = brfss_23_hw4)

summary(mlr_m1)
## 
## Call:
## lm(formula = physhlth_days ~ menthlth_days + bmi + depression + 
##     sex + age_group + smoking + gen_health + marital, data = brfss_23_hw4)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.9985  -2.7171  -1.3056   0.5627  30.0197 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -0.009531   0.483882  -0.020  0.98429    
## menthlth_days              0.191592   0.011450  16.732  < 2e-16 ***
## bmi                        0.033258   0.012774   2.604  0.00924 ** 
## depressionYes              1.145953   0.225315   5.086 3.74e-07 ***
## sexFemale                  0.078565   0.165641   0.474  0.63530    
## age_group35-49             0.580859   0.286844   2.025  0.04290 *  
## age_group50-64             1.679050   0.281832   5.958 2.67e-09 ***
## age_group65+               1.852247   0.276606   6.696 2.28e-11 ***
## smokingFormer             -1.433778   0.289619  -4.951 7.55e-07 ***
## smokingNever              -1.412889   0.268650  -5.259 1.48e-07 ***
## gen_healthFair/Poor       10.751555   0.246041  43.698  < 2e-16 ***
## gen_healthGood             1.299468   0.184367   7.048 1.96e-12 ***
## maritalNever married      -0.164972   0.243975  -0.676  0.49894    
## maritalPreviously Married  0.215368   0.199590   1.079  0.28060    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.126 on 7986 degrees of freedom
## Multiple R-squared:  0.3313, Adjusted R-squared:  0.3302 
## F-statistic: 304.4 on 13 and 7986 DF,  p-value: < 2.2e-16
tidy(mlr_m1, conf.int = TRUE) %>%
  mutate(
    term     = dplyr::recode(term,
      "(Intercept)"   = "Intercept",
      "menthlth_days" = "Mental health days",
      "bmi"           = "BMI (kg/m^2)",
      "depression"    = "Depression: Yes (ref = No)",
      "sexFemale"     = "Sex: Female (ref = Male)",
      
      "age_group35-49"    = "Age Group: 35-49 (ref = <25)",
      "age_group50-64"    = "Age Group: 50-64 (ref = <25)",
      "age_group65+"      = "Age Group: 65+ (ref = <25)",
      
      "smokingFormer"     = "Smoking: Former (ref = Current)",
      "smokingNever"      = "Smoking: Never (ref = Current)",
      
      "gen_healthFair/Poor"    = "General Health Status: Fair/Poor (ref = Excellent/VG)",
      "gen_healthGood"         = "General Health Status: Good (ref = Excellent/VG)",
      
      "maritalNever married "    = "Marital Status: Never Married (ref = Married/Partnered)",
      "maritalPreviously Married"    = "Marital Status: Good (ref = Married/Partnered)",
    ),
    across(where(is.numeric), ~ round(., 4))
  ) %>%
  kable(
    caption  = "Table 4. Multiple Linear Regression: Phyically Unhealthy Days ~ Multiple Predictors (BRFSS 2023, n = 8,000)",
    col.names = c("Term", "Estimate (β̂)", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE) %>%
  row_spec(c(2, 3, 4,6,7,8,10,11,12,14), background = "#EBF5FB")  # highlight key predictors
Table 4. Multiple Linear Regression: Phyically Unhealthy Days ~ Multiple Predictors (BRFSS 2023, n = 8,000)
Term Estimate (β̂
Std. Erro
Intercept -0.0095 0.4839 -0.0197 0.9843 -0.9581 0.9390
Mental health days 0.1916 0.0115 16.7324 0.0000 0.1691 0.2140
BMI (kg/m^2) 0.0333 0.0128 2.6036 0.0092 0.0082 0.0583
depressionYes 1.1460 0.2253 5.0860 0.0000 0.7043 1.5876
Sex: Female (ref = Male) 0.0786 0.1656 0.4743 0.6353 -0.2461 0.4033
Age Group: 35-49 (ref = <25) 0.5809 0.2868 2.0250 0.0429 0.0186 1.1431
Age Group: 50-64 (ref = <25) 1.6790 0.2818 5.9576 0.0000 1.1266 2.2315
Age Group: 65+ (ref = <25) 1.8522 0.2766 6.6963 0.0000 1.3100 2.3945
Smoking: Former (ref = Current) -1.4338 0.2896 -4.9506 0.0000 -2.0015 -0.8660
Smoking: Never (ref = Current) -1.4129 0.2687 -5.2592 0.0000 -1.9395 -0.8863
General Health Status: Fair/Poor (ref = Excellent/VG) 10.7516 0.2460 43.6982 0.0000 10.2693 11.2339
General Health Status: Good (ref = Excellent/VG) 1.2995 0.1844 7.0483 0.0000 0.9381 1.6609
maritalNever married -0.1650 0.2440 -0.6762 0.4989 -0.6432 0.3133
Marital Status: Good (ref = Married/Partnered) 0.2154 0.1996 1.0791 0.2806 -0.1759 0.6066
glance(mlr_m1) |>
  dplyr::select(r.squared, adj.r.squared, sigma, AIC, BIC, df.residual) |>
  mutate(across(everything(), \(x) round(x, 3))) |>
  kable(caption = " Table 5: Maximum Model - Fit Statistics") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Table 5: Maximum Model - Fit Statistics
r.squared adj.r.squared sigma AIC BIC df.residual
0.331 0.33 7.126 54138.96 54243.77 7986

Step 2 : Best Subsets Regression

# Prepare a model matrix (need numeric predictors for leaps)
# Use the formula interface approach
best_subsets_hw4 <- regsubsets(
physhlth_days ~ menthlth_days + bmi + depression + sex + age_group + smoking + gen_health + marital,
  data = brfss_23_hw4,
  nvmax = 14,      # maximum number of variables to consider
  method = "exhaustive"
)

best_summary_hw4 <- summary(best_subsets_hw4)

subset_metrics_hw4 <- tibble(
  p = 1:length(best_summary_hw4$adjr2),
  `Adj. R^2` = best_summary_hw4$adjr2,
  BIC = best_summary_hw4$bic,
  Cp = best_summary_hw4$cp
)

p1_hw <- ggplot(subset_metrics_hw4, aes(x = p, y = `Adj. R^2`)) +
  geom_line(linewidth = 1, color = "steelblue") +
  geom_point(size = 3, color = "steelblue") +
  geom_vline(xintercept = which.max(best_summary_hw4$adjr2),
             linetype = "dashed", color = "tomato") +
  labs(title = "Adjusted R^2 by Model Size", x = "Number of Variables", y = "Adjusted R^2") +
  theme_minimal(base_size = 12)

p2_hw <- ggplot(subset_metrics_hw4, aes(x = p, y = BIC)) +
  geom_line(linewidth = 1, color = "steelblue") +
  geom_point(size = 3, color = "steelblue") +
  geom_vline(xintercept = which.min(best_summary_hw4$bic),
             linetype = "dashed", color = "tomato") +
  labs(title = "BIC by Model Size", x = "Number of Variables", y = "BIC") +
  theme_minimal(base_size = 12)

gridExtra::grid.arrange(p1_hw, p2_hw, ncol = 2)

Interpretation of Adjusted R^2 and BIC:

The Adjusted R^2 suggest that 11 variables is optimal for this model whereas the BIC suggest that 9 variables optimize the model. The two criteria in this case do not agree and the BIC has a simpler model with less predictors.

Step 3: Selection Methods - Backward, Forwardand Stepwise

In the following chunk I will be performing a backward elimination.

# Automated backward elimination using AIC
mod_backward_hw <- step(mlr_m1, direction = "backward", trace = 1)
## Start:  AIC=31433.95
## physhlth_days ~ menthlth_days + bmi + depression + sex + age_group + 
##     smoking + gen_health + marital
## 
##                 Df Sum of Sq    RSS   AIC
## - marital        2        99 405625 31432
## - sex            1        11 405537 31432
## <none>                       405526 31434
## - bmi            1       344 405870 31439
## - depression     1      1314 406839 31458
## - smoking        2      1502 407028 31460
## - age_group      3      3200 408726 31491
## - menthlth_days  1     14217 419743 31708
## - gen_health     2    102526 508052 33233
## 
## Step:  AIC=31431.9
## physhlth_days ~ menthlth_days + bmi + depression + sex + age_group + 
##     smoking + gen_health
## 
##                 Df Sum of Sq    RSS   AIC
## - sex            1        24 405648 31430
## <none>                       405625 31432
## - bmi            1       335 405959 31436
## - depression     1      1333 406958 31456
## - smoking        2      1544 407169 31458
## - age_group      3      4352 409977 31511
## - menthlth_days  1     14281 419906 31707
## - gen_health     2    103643 509268 33248
## 
## Step:  AIC=31430.36
## physhlth_days ~ menthlth_days + bmi + depression + age_group + 
##     smoking + gen_health
## 
##                 Df Sum of Sq    RSS   AIC
## <none>                       405648 31430
## - bmi            1       329 405977 31435
## - depression     1      1426 407074 31456
## - smoking        2      1532 407180 31457
## - age_group      3      4485 410133 31512
## - menthlth_days  1     14349 419998 31706
## - gen_health     2    103620 509269 33246
tidy(mod_backward_hw, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Backward Elimination Result (AIC-based)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Backward Elimination Result (AIC-based)
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) -0.0420 0.4584 -0.0917 0.9270 -0.9406 0.8566
menthlth_days 0.1919 0.0114 16.8108 0.0000 0.1696 0.2143
bmi 0.0325 0.0128 2.5441 0.0110 0.0074 0.0575
depressionYes 1.1784 0.2224 5.2991 0.0000 0.7425 1.6142
age_group35-49 0.6765 0.2718 2.4894 0.0128 0.1438 1.2093
age_group50-64 1.8089 0.2592 6.9799 0.0000 1.3009 2.3170
age_group65+ 2.0193 0.2468 8.1808 0.0000 1.5354 2.5031
smokingFormer -1.4382 0.2889 -4.9776 0.0000 -2.0045 -0.8718
smokingNever -1.4223 0.2670 -5.3278 0.0000 -1.9456 -0.8990
gen_healthFair/Poor 10.7668 0.2449 43.9633 0.0000 10.2867 11.2468
gen_healthGood 1.3053 0.1841 7.0897 0.0000 0.9444 1.6662

The backward elimination removed sex from the final model.

In the following chunk I will be performing a forward selection.

# Automated forward selection using AIC
mod_null_hw <- lm(physhlth_days ~ 1, data = brfss_23_hw4)

mod_forward_hw <- step(mod_null_hw,
                    scope = list(lower = mod_null_hw, upper = mlr_m1),
                    direction = "forward", trace = 1)
## Start:  AIC=34627.33
## physhlth_days ~ 1
## 
##                 Df Sum of Sq    RSS   AIC
## + gen_health     2    173555 432886 31934
## + menthlth_days  1     67344 539097 33688
## + depression     1     28283 578158 34247
## + smoking        2     13891 592550 34446
## + bmi            1     11255 595186 34479
## + marital        2      7369 599072 34534
## + age_group      3      5377 601064 34562
## + sex            1      1093 605348 34615
## <none>                       606441 34627
## 
## Step:  AIC=31934.26
## physhlth_days ~ gen_health
## 
##                 Df Sum of Sq    RSS   AIC
## + menthlth_days  1   19322.2 413564 31571
## + depression     1    8323.4 424562 31781
## + smoking        2    2603.2 430283 31890
## + age_group      3    1623.0 431263 31910
## + marital        2     927.7 431958 31921
## + sex            1     705.5 432180 31923
## + bmi            1     588.8 432297 31925
## <none>                       432886 31934
## 
## Step:  AIC=31570.96
## physhlth_days ~ gen_health + menthlth_days
## 
##              Df Sum of Sq    RSS   AIC
## + age_group   3    4600.7 408963 31488
## + smoking     2    1692.5 411871 31542
## + depression  1    1484.6 412079 31544
## + marital     2    1572.8 411991 31545
## + bmi         1     346.5 413217 31566
## + sex         1     235.3 413328 31568
## <none>                    413564 31571
## 
## Step:  AIC=31487.47
## physhlth_days ~ gen_health + menthlth_days + age_group
## 
##              Df Sum of Sq    RSS   AIC
## + depression  1   1545.56 407417 31459
## + smoking     2   1484.64 407478 31462
## + bmi         1    302.80 408660 31484
## <none>                    408963 31488
## + marital     2    186.28 408777 31488
## + sex         1     75.57 408887 31488
## 
## Step:  AIC=31459.17
## physhlth_days ~ gen_health + menthlth_days + age_group + depression
## 
##           Df Sum of Sq    RSS   AIC
## + smoking  2   1440.33 405977 31435
## + bmi      1    237.36 407180 31457
## <none>                 407417 31459
## + marital  2    135.15 407282 31461
## + sex      1      7.63 407410 31461
## 
## Step:  AIC=31434.84
## physhlth_days ~ gen_health + menthlth_days + age_group + depression + 
##     smoking
## 
##           Df Sum of Sq    RSS   AIC
## + bmi      1    328.65 405648 31430
## <none>                 405977 31435
## + sex      1     17.66 405959 31437
## + marital  2     99.04 405878 31437
## 
## Step:  AIC=31430.36
## physhlth_days ~ gen_health + menthlth_days + age_group + depression + 
##     smoking + bmi
## 
##           Df Sum of Sq    RSS   AIC
## <none>                 405648 31430
## + sex      1    23.519 405625 31432
## + marital  2   111.120 405537 31432
tidy(mod_forward_hw, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Forward Selection Result (AIC-based)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Forward Selection Result (AIC-based)
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) -0.0420 0.4584 -0.0917 0.9270 -0.9406 0.8566
gen_healthFair/Poor 10.7668 0.2449 43.9633 0.0000 10.2867 11.2468
gen_healthGood 1.3053 0.1841 7.0897 0.0000 0.9444 1.6662
menthlth_days 0.1919 0.0114 16.8108 0.0000 0.1696 0.2143
age_group35-49 0.6765 0.2718 2.4894 0.0128 0.1438 1.2093
age_group50-64 1.8089 0.2592 6.9799 0.0000 1.3009 2.3170
age_group65+ 2.0193 0.2468 8.1808 0.0000 1.5354 2.5031
depressionYes 1.1784 0.2224 5.2991 0.0000 0.7425 1.6142
smokingFormer -1.4382 0.2889 -4.9776 0.0000 -2.0045 -0.8718
smokingNever -1.4223 0.2670 -5.3278 0.0000 -1.9456 -0.8990
bmi 0.0325 0.0128 2.5441 0.0110 0.0074 0.0575

The forward selection process also removed sex from the model.

Finally, this chunk will perform a stepwise selection.

#Stepwise Selection
mod_stepwise_hw <- step(mod_null_hw,
                     scope = list(lower = mod_null_hw, upper = mlr_m1),
                     direction = "both", trace = 1)
## Start:  AIC=34627.33
## physhlth_days ~ 1
## 
##                 Df Sum of Sq    RSS   AIC
## + gen_health     2    173555 432886 31934
## + menthlth_days  1     67344 539097 33688
## + depression     1     28283 578158 34247
## + smoking        2     13891 592550 34446
## + bmi            1     11255 595186 34479
## + marital        2      7369 599072 34534
## + age_group      3      5377 601064 34562
## + sex            1      1093 605348 34615
## <none>                       606441 34627
## 
## Step:  AIC=31934.26
## physhlth_days ~ gen_health
## 
##                 Df Sum of Sq    RSS   AIC
## + menthlth_days  1     19322 413564 31571
## + depression     1      8323 424562 31781
## + smoking        2      2603 430283 31890
## + age_group      3      1623 431263 31910
## + marital        2       928 431958 31921
## + sex            1       706 432180 31923
## + bmi            1       589 432297 31925
## <none>                       432886 31934
## - gen_health     2    173555 606441 34627
## 
## Step:  AIC=31570.96
## physhlth_days ~ gen_health + menthlth_days
## 
##                 Df Sum of Sq    RSS   AIC
## + age_group      3      4601 408963 31487
## + smoking        2      1692 411871 31542
## + depression     1      1485 412079 31544
## + marital        2      1573 411991 31544
## + bmi            1       347 413217 31566
## + sex            1       235 413328 31568
## <none>                       413564 31571
## - menthlth_days  1     19322 432886 31934
## - gen_health     2    125533 539097 33688
## 
## Step:  AIC=31487.47
## physhlth_days ~ gen_health + menthlth_days + age_group
## 
##                 Df Sum of Sq    RSS   AIC
## + depression     1      1546 407417 31459
## + smoking        2      1485 407478 31462
## + bmi            1       303 408660 31484
## <none>                       408963 31487
## + marital        2       186 408777 31488
## + sex            1        76 408887 31488
## - age_group      3      4601 413564 31571
## - menthlth_days  1     22300 431263 31910
## - gen_health     2    115603 524566 33475
## 
## Step:  AIC=31459.17
## physhlth_days ~ gen_health + menthlth_days + age_group + depression
## 
##                 Df Sum of Sq    RSS   AIC
## + smoking        2      1440 405977 31435
## + bmi            1       237 407180 31457
## <none>                       407417 31459
## + marital        2       135 407282 31461
## + sex            1         8 407410 31461
## - depression     1      1546 408963 31487
## - age_group      3      4662 412079 31544
## - menthlth_days  1     14996 422413 31746
## - gen_health     2    113276 520694 33418
## 
## Step:  AIC=31434.84
## physhlth_days ~ gen_health + menthlth_days + age_group + depression + 
##     smoking
## 
##                 Df Sum of Sq    RSS   AIC
## + bmi            1       329 405648 31430
## <none>                       405977 31435
## + sex            1        18 405959 31436
## + marital        2        99 405878 31437
## - smoking        2      1440 407417 31459
## - depression     1      1501 407478 31462
## - age_group      3      4497 410474 31517
## - menthlth_days  1     14416 420393 31712
## - gen_health     2    108818 514795 33331
## 
## Step:  AIC=31430.36
## physhlth_days ~ gen_health + menthlth_days + age_group + depression + 
##     smoking + bmi
## 
##                 Df Sum of Sq    RSS   AIC
## <none>                       405648 31430
## + sex            1        24 405625 31432
## + marital        2       111 405537 31432
## - bmi            1       329 405977 31435
## - depression     1      1426 407074 31456
## - smoking        2      1532 407180 31457
## - age_group      3      4485 410133 31512
## - menthlth_days  1     14349 419998 31706
## - gen_health     2    103620 509269 33246
tidy(mod_stepwise_hw, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Stepwise Selection Result (AIC-based)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Stepwise Selection Result (AIC-based)
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) -0.0420 0.4584 -0.0917 0.9270 -0.9406 0.8566
gen_healthFair/Poor 10.7668 0.2449 43.9633 0.0000 10.2867 11.2468
gen_healthGood 1.3053 0.1841 7.0897 0.0000 0.9444 1.6662
menthlth_days 0.1919 0.0114 16.8108 0.0000 0.1696 0.2143
age_group35-49 0.6765 0.2718 2.4894 0.0128 0.1438 1.2093
age_group50-64 1.8089 0.2592 6.9799 0.0000 1.3009 2.3170
age_group65+ 2.0193 0.2468 8.1808 0.0000 1.5354 2.5031
depressionYes 1.1784 0.2224 5.2991 0.0000 0.7425 1.6142
smokingFormer -1.4382 0.2889 -4.9776 0.0000 -2.0045 -0.8718
smokingNever -1.4223 0.2670 -5.3278 0.0000 -1.9456 -0.8990
bmi 0.0325 0.0128 2.5441 0.0110 0.0074 0.0575

The stepwise removed sex from the model as well.

The following chunk will compare the three model selection is best.

method_comparison_hw <- tribble(
  ~Method, ~`Variables selected`, ~`Adj. R^2`, ~AIC, ~BIC,
  "Maximum model",
    length(coef(mlr_m1)) - 1,
    round(glance(mlr_m1)$adj.r.squared, 4),
    round(AIC(mlr_m1), 1),
    round(BIC(mlr_m1), 1),
  "Backward (AIC)",
    length(coef(mod_backward_hw)) - 1,
    round(glance(mod_backward_hw)$adj.r.squared, 4),
    round(AIC(mod_backward_hw), 1),
    round(BIC(mod_backward_hw), 1),
  "Forward (AIC)",
    length(coef(mod_forward_hw)) - 1,
    round(glance(mod_forward_hw)$adj.r.squared, 4),
    round(AIC(mod_forward_hw), 1),
    round(BIC(mod_forward_hw), 1),
  "Stepwise (AIC)",
    length(coef(mod_stepwise_hw)) - 1,
    round(glance(mod_stepwise_hw)$adj.r.squared, 4),
    round(AIC(mod_stepwise_hw), 1),
    round(BIC(mod_stepwise_hw), 1)
)

method_comparison_hw |>
  kable(caption = "Table 6: Comparison of Variable Selection Methods") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Table 6: Comparison of Variable Selection Methods
Method Variables selected Adj. R^2 AIC BIC
Maximum model 13 0.3302 54139.0 54243.8
Backward (AIC) 10 0.3303 54135.4 54219.2
Forward (AIC) 10 0.3303 54135.4 54219.2
Stepwise (AIC) 10 0.3303 54135.4 54219.2

Interpretation: All three automated methods selected the same model with 9 predictor terms (Adjusted R² = 0.346, AIC = 56139.8, BIC = 56237.7). This model has a lower AIC and BIC than the maximum model (AIC = 56141.4, BIC = 56246.2), confirming that removing sex improved parsimony without sacrificing fit. General Health Status, unhealthy mental health days, age group, marital status, depression, bmi and smoking status all held in all three models. The modest improvement in BIC (8.5 points lower) is more notable than the AIC improvement (1.6 points lower), consistent with BIC’s stronger preference for simpler models.In practice, the maximum model and the selected model would produce very similar predictions, but the selected model is preferred for its efficiency.

Step 4: Final Model Selection and Interpretation

# Full Linear Model: Full multivariable model
mlr_m1 <- lm(physhlth_days ~ menthlth_days + bmi + depression + sex + age_group + smoking + gen_health + marital,
         data = brfss_23_hw4)

summary(mlr_m1)
## 
## Call:
## lm(formula = physhlth_days ~ menthlth_days + bmi + depression + 
##     sex + age_group + smoking + gen_health + marital, data = brfss_23_hw4)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.9985  -2.7171  -1.3056   0.5627  30.0197 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -0.009531   0.483882  -0.020  0.98429    
## menthlth_days              0.191592   0.011450  16.732  < 2e-16 ***
## bmi                        0.033258   0.012774   2.604  0.00924 ** 
## depressionYes              1.145953   0.225315   5.086 3.74e-07 ***
## sexFemale                  0.078565   0.165641   0.474  0.63530    
## age_group35-49             0.580859   0.286844   2.025  0.04290 *  
## age_group50-64             1.679050   0.281832   5.958 2.67e-09 ***
## age_group65+               1.852247   0.276606   6.696 2.28e-11 ***
## smokingFormer             -1.433778   0.289619  -4.951 7.55e-07 ***
## smokingNever              -1.412889   0.268650  -5.259 1.48e-07 ***
## gen_healthFair/Poor       10.751555   0.246041  43.698  < 2e-16 ***
## gen_healthGood             1.299468   0.184367   7.048 1.96e-12 ***
## maritalNever married      -0.164972   0.243975  -0.676  0.49894    
## maritalPreviously Married  0.215368   0.199590   1.079  0.28060    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.126 on 7986 degrees of freedom
## Multiple R-squared:  0.3313, Adjusted R-squared:  0.3302 
## F-statistic: 304.4 on 13 and 7986 DF,  p-value: < 2.2e-16
tidy(mlr_m1, conf.int = TRUE) %>%
  mutate(
    term     = dplyr::recode(term,
      "(Intercept)"   = "Intercept",
      "menthlth_days" = "Mental health days",
      "bmi"           = "BMI (kg/m^2)",
      "depression"    = "Depression: Yes (ref = No)",
      "sexFemale"     = "Sex: Female (ref = Male)",
      
      "age_group35-49"    = "Age Group: 35-49 (ref = <25)",
      "age_group50-64"    = "Age Group: 50-64 (ref = <25)",
      "age_group65+"      = "Age Group: 65+ (ref = <25)",
      
      "smokingFormer"     = "Smoking: Former (ref = Current)",
      "smokingNever"      = "Smoking: Never (ref = Current)",
      
      "gen_healthFair/Poor"    = "General Health Status: Fair/Poor (ref = Excellent/VG)",
      "gen_healthGood"         = "General Health Status: Good (ref = Excellent/VG)",
      
      "maritalNever married "    = "Marital Status: Never Married (ref = Married/Partnered)",
      "maritalPreviously Married"    = "Marital Status: Good (ref = Married/Partnered)",
    ),
    across(where(is.numeric), ~ round(., 4))
  ) %>%
  kable(
    caption  = "Table 7. Multiple Linear Regression: Phyically Unhealthy Days ~ Multiple Predictors (BRFSS 2023, n = 8,000)",
    col.names = c("Term", "Estimate (β̂)", "Std. Error", "t-statistic",
                  "p-value", "95% CI Lower", "95% CI Upper")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE) %>%
  row_spec(c(2, 3, 4,6,7,8,10,11,12,14), background = "#EBF5FB")  # highlight key predictors
Table 7. Multiple Linear Regression: Phyically Unhealthy Days ~ Multiple Predictors (BRFSS 2023, n = 8,000)
Term Estimate (β̂
Std. Erro
Intercept -0.0095 0.4839 -0.0197 0.9843 -0.9581 0.9390
Mental health days 0.1916 0.0115 16.7324 0.0000 0.1691 0.2140
BMI (kg/m^2) 0.0333 0.0128 2.6036 0.0092 0.0082 0.0583
depressionYes 1.1460 0.2253 5.0860 0.0000 0.7043 1.5876
Sex: Female (ref = Male) 0.0786 0.1656 0.4743 0.6353 -0.2461 0.4033
Age Group: 35-49 (ref = <25) 0.5809 0.2868 2.0250 0.0429 0.0186 1.1431
Age Group: 50-64 (ref = <25) 1.6790 0.2818 5.9576 0.0000 1.1266 2.2315
Age Group: 65+ (ref = <25) 1.8522 0.2766 6.6963 0.0000 1.3100 2.3945
Smoking: Former (ref = Current) -1.4338 0.2896 -4.9506 0.0000 -2.0015 -0.8660
Smoking: Never (ref = Current) -1.4129 0.2687 -5.2592 0.0000 -1.9395 -0.8863
General Health Status: Fair/Poor (ref = Excellent/VG) 10.7516 0.2460 43.6982 0.0000 10.2693 11.2339
General Health Status: Good (ref = Excellent/VG) 1.2995 0.1844 7.0483 0.0000 0.9381 1.6609
maritalNever married -0.1650 0.2440 -0.6762 0.4989 -0.6432 0.3133
Marital Status: Good (ref = Married/Partnered) 0.2154 0.1996 1.0791 0.2806 -0.1759 0.6066
glance(mlr_m1) |>
  dplyr::select(r.squared, adj.r.squared, sigma, AIC, BIC, df.residual) |>
  mutate(across(everything(), \(x) round(x, 3))) |>
  kable(caption = "Maximum Model: Fit Statistics") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Maximum Model: Fit Statistics
r.squared adj.r.squared sigma AIC BIC df.residual
0.331 0.33 7.126 54138.96 54243.77 7986

Interpreting Each Term

Mental Unhealthy Days: Each additional unhealthy mental health day is associated with an estimated .2519 increase in physically unhealthy days on average, adjusting for all other covariates (95% CI: .2292 to .2746).

Depression: Yes: Compared to those without depression (the reference group), those with depression report an estimated .9862 more physically unhealthy days on average, holding all other variables constant.

General Health Status: Fair/Poor Compared to those with Excellent/VG (the reference group), those with fair/poor general health status report an estimated 9.8970 more physically unhealthy days on average, holding all other variables constant.

Step 5: LINE Assumption Check

par(mfrow = c(2, 2))
plot(mlr_m1, which = 1:4, col = adjustcolor("steelblue", alpha.f = 0.3), pch = 16)

par(mfrow = c(1, 1))

Residuals vs. Fitted: Is there evidence of non-linearity or non-constant variance? Normal Q-Q: Do the residuals approximately follow a normal distribution? Where do they deviate? Scale-Location: Is the spread of residuals roughly constant across fitted values? Residuals vs. Leverage: Are there any highly influential observations (large Cook’s distance)?

Part 2: Logistic Regression

Step 1: Create the Binary Outcome

brfss_23_hw4 <- brfss_23_hw4 |>
  mutate(
fpd = factor(
      ifelse(physhlth_days >= 14, 1, 0),
      levels = c(0, 1),
      labels = c("No", "Yes")
    ))
brfss_23_hw4 %>%
  count(fpd) %>%
  mutate(percent = n / sum(n) * 100)
## # A tibble: 2 × 3
##   fpd       n percent
##   <fct> <int>   <dbl>
## 1 No     6924    86.6
## 2 Yes    1076    13.4
brfss_23_hw4 %>%
  dplyr::select(fpd) %>%
  tbl_summary(
    label = list(fpd ~ "Frequent Physical Distress"), 
    statistic = list(all_categorical() ~ "{n} ({p}%)")
  )
Characteristic N = 8,0001
Frequent Physical Distress 1,076 (13%)
1 n (%)

1,504 respondents (19%) reported having physically unhealthy days in the last 30 days.

Step 2: Simple (Unadjusted) Logistic Regression

I chose to view general health status as one predictor to look at more closely because the association was rather high in the multi linear model compared to the other predictors.

mod_simple_hw <- glm(fpd ~ gen_health, data = brfss_23_hw4,
                    family = binomial(link = "logit"))

tidy(mod_simple_hw, conf.int = TRUE, exponentiate = TRUE) |>
  kable(digits = 3, caption = "Table 8: Simple Logistic Regression: FPD ~ Exercise (Log-Odds Scale)") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Table 8: Simple Logistic Regression: FPD ~ Exercise (Log-Odds Scale)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 0.035 0.089 -37.934 0 0.029 0.041
gen_healthFair/Poor 28.135 0.104 32.225 0 23.042 34.589
gen_healthGood 3.055 0.110 10.141 0 2.467 3.800

Interpretation Compared to those with Excellent/very good general health status, those with Fair/poor general health status has the odds of frequent physical distress are multiplied by 23.064, 95% CI [19.013, 28.202], this is statistically significant at a 95% confidence interval and the CI does not cross 1.

Compared to those with Excellent/very good general health status, those with Good general health status has the the odds of frequent physical distress are multiplied by 2.627, 95% CI [2.120, 3.272], this is statistically significant at a 95% confidence interval and the CI does not cross 1.

Step 3: Multiple Logistic Regression

mod_logistic_hw <- glm(fpd ~ menthlth_days + bmi + depression + sex + age_group + smoking + gen_health + marital,
         data = brfss_23_hw4,
                    family = binomial(link = "logit"))

tidy(mod_logistic_hw, conf.int = TRUE, exponentiate = TRUE) |>
  kable(digits = 3, caption = "Table 9: Simple Logistic Regression:FMD ~ all Predictors (Log-Odds Scale)") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Table 9: Simple Logistic Regression:FMD ~ all Predictors (Log-Odds Scale)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 0.016 0.233 -17.684 0.000 0.010 0.026
menthlth_days 1.050 0.004 11.364 0.000 1.042 1.059
bmi 1.010 0.005 1.807 0.071 0.999 1.020
depressionYes 1.395 0.096 3.458 0.001 1.154 1.683
sexFemale 1.063 0.081 0.751 0.453 0.907 1.247
age_group35-49 1.448 0.160 2.308 0.021 1.059 1.986
age_group50-64 2.357 0.152 5.630 0.000 1.754 3.188
age_group65+ 2.426 0.152 5.847 0.000 1.809 3.278
smokingFormer 0.639 0.121 -3.709 0.000 0.504 0.810
smokingNever 0.638 0.112 -3.999 0.000 0.513 0.796
gen_healthFair/Poor 17.350 0.110 25.844 0.000 14.014 21.610
gen_healthGood 2.441 0.114 7.858 0.000 1.958 3.057
maritalNever married 0.931 0.123 -0.583 0.560 0.730 1.182
maritalPreviously Married 1.068 0.090 0.722 0.470 0.894 1.274

Mental Unhealthy Days: For every one-unit increase in mentally unhealthy days, the odds of frequent physical distress are multiplied by 1.060, 95% CI [1.052, 1.068] holding all other variables constant..

Depression: Yes: Compared to those without depression (the reference group), the odds of frequent physical distress are multiplied by 1.194, 95% CI [1.007, 1.413] holding all other variables constant.

General Health Status: Fair/Poor Compared to those with Excellent/VG (the reference group), those with fair/poor general health the odds of frequent physical distress are multiplied by 13.311, 95% CI [10.856, 16.438] holding all other variables constant.

Step 4: Likelihood Ratio Test

#SImple Logistic Regression - income
mod_nosmoking <- glm(fpd ~ menthlth_days + bmi + depression + sex + age_group  + gen_health   + marital, data = brfss_23_hw4,
                    family = binomial(link = "logit"))

tidy(mod_nosmoking, conf.int = TRUE, exponentiate = TRUE) |>
  kable(digits = 3, caption = " Table 10: Simple Logistic Regression: FPD ~ all but smoking (Log-Odds Scale)") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Table 10: Simple Logistic Regression: FPD ~ all but smoking (Log-Odds Scale)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 0.011 0.214 -21.052 0.000 0.007 0.017
menthlth_days 1.051 0.004 11.594 0.000 1.042 1.060
bmi 1.007 0.005 1.350 0.177 0.997 1.018
depressionYes 1.415 0.096 3.621 0.000 1.171 1.706
sexFemale 1.046 0.080 0.555 0.579 0.893 1.224
age_group35-49 1.539 0.159 2.704 0.007 1.128 2.108
age_group50-64 2.518 0.151 6.104 0.000 1.878 3.398
age_group65+ 2.472 0.151 5.993 0.000 1.845 3.337
gen_healthFair/Poor 18.210 0.110 26.476 0.000 14.732 22.645
gen_healthGood 2.532 0.113 8.221 0.000 2.033 3.167
maritalNever married 0.960 0.122 -0.333 0.739 0.754 1.218
maritalPreviously Married 1.099 0.090 1.047 0.295 0.921 1.310
# Likelihood ratio test
lr_test_hw <- anova(mod_nosmoking, mod_logistic_hw, test = "Chisq")

lr_test_hw |>
  kable(digits = 3, caption = " Table 11: Likelihood Ratio Test: Full Model vs. Null Smoking") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Table 11: Likelihood Ratio Test: Full Model vs. Null Smoking
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
7988 4513.487 NA NA NA
7986 4496.368 2 17.119 0
glance(mod_logistic_hw) |>
  dplyr::select(null.deviance, deviance, AIC, BIC, nobs) |>
  kable(digits = 1, caption = "Model Fit Statistics") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Model Fit Statistics
null.deviance deviance AIC BIC nobs
6317.6 4496.4 4524.4 4622.2 8000

Step 5: ROC Curve and AUC

roc_obj_hw <- roc(
  response = brfss_23_hw4$fpd,
  predictor = fitted(mod_logistic_hw),
  levels = c("No", "Yes"),
  direction = "<"
)

auc_value <- auc(roc_obj_hw)

ggroc(roc_obj_hw, color = "steelblue", linewidth = 1.2) +
  geom_abline(slope = 1, intercept = 1, linetype = "dashed", color = "red") +
  labs(title = "Table 12: ROC Curve for Frequent Physical Distress Model",
       subtitle = paste0("AUC = ", round(auc_value, 3)),
       x = "Specificity", y = "Sensitivity") +
  theme_minimal()

This model is able to distinguish between individuals with and without physical distress days by having an AUC of 0.851 with excellent discrimination.

Step 6: Hosmer-Lemeshow Goodness-of-Fit Test

hl_test_hw <- hoslem.test(
  x = as.numeric(brfss_23_hw4$fpd) - 1,
  y = fitted(mod_logistic_hw),
  g = 10
)

hl_test_hw
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  as.numeric(brfss_23_hw4$fpd) - 1, fitted(mod_logistic_hw)
## X-squared = 5.8446, df = 8, p-value = 0.6646
brfss_23_hw4 |>
  mutate(pred_prob_hw = fitted(mod_logistic_hw),
         obs_outcome_hw = as.numeric(fpd) - 1,
         decile_hw = ntile(pred_prob_hw, 10)) |>
  group_by(decile_hw) |>
  summarise(
    mean_pred = mean(pred_prob_hw),
    mean_obs  = mean(obs_outcome_hw),
    .groups = "drop"
  ) |>
  ggplot(aes(x = mean_pred, y = mean_obs)) +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  geom_point(size = 3, color = "steelblue") +
  geom_line(color = "steelblue") +
  labs(title = " Table 13: Calibration Plot: Observed vs. Predicted Probability of FPD",
       subtitle = "Points should fall on the dashed line for perfect calibration",
       x = "Mean Predicted Probability (within decile)",
       y = "Observed Proportion (within decile)") +
  theme_minimal()

Interpretation: A small p-value (< 0.05) suggests that the model does not fit well in some regions of predicted probability. With large samples (like ours), the Hosmer-Lemeshow test can be over-powered and detect trivial misfit. Always pair it with a calibration plot. The calibration plot is well calibrated as the pointed tend to fall along the dashed line.

Part 3: Reflection

A. Statistical Insights:

The results from this analysis suggest that menthlth_days, bmi, sex, depression, age_group, smoking status , general health status, and marital status have a significant statistcal burden among U.S adults. General health status had the strongest association in both the linear and logistic models.The key limitations for cross-sectional survey data is that temporality can not be determined. Exercise and income are two potential confounders.

B. Linear versus Logistic Regression:

The linear regression model tells us what variables are correlated with physical distress days and identify who is likely to report more physical distress days. The logistic regression is used for binary ouctomes such as those who experiecne more than 14 physical disress days in 30 days, yes or no. The logistic regression tell us the probability of the event occuring. The model selection criteria differ between the two frameworks R squared measures the variances whereas AUC measures the level of discrimination of how well the model is able to distinguish between those with and without the outcome.

C. AI Transparency I did not use AI for this assignment.