#Research Question

What individual, behavioral, and health factors are associated with physical health burden among U.S. adults? How do findings compare when the outcome is modeled as the number of physically unhealthy days (continuous) versus as an indicator of frequent physical distress (binary: 14 or more days in the past 30)?

library(tidyverse)
library(haven)
library(broom)
library(kableExtra)
library(car)
library(gtsummary)
library(pROC)
library(ResourceSelection)

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

Part 0. Data Preparation

Preparing the Dataset

brfss_raw<- read_xpt("/Users/jingjunyang/Desktop/EPI553 Project/LLCP2023.XPT ")

Select only the 9 Variables

brfss_full<-brfss_raw[,c("PHYSHLTH","MENTHLTH","_BMI5","SEXVAR","EXERANY2","_AGEG5YR","_INCOMG1","EDUCA","_SMOKER3")]

Justify your choices briefly (2-3 sentences) in Part 0.

To examine individual, behavioral, and health factors associated with physical health burden in U.S. adults, I selected mentally unhealthy days, BMI, sex assigned at birth, physical activity, income, education, and smoking status as candidate predictors. These variables represent key biological, behavioral, and social determinants of health that are commonly associated with physical health outcomes in epidemiologic research. Including this range of factors allows for a more comprehensive assessment of potential contributors to physical health burden.

Save as RDS

saveRDS(brfss_full,"brfss_full.rds")

Rename and Recode Variables

brfss_recode <-brfss_full|>
  rename(
    menthlth_days = MENTHLTH,
    physhlth_days = PHYSHLTH,
    bmi =`_BMI5`,
    sex = SEXVAR,
    exercise = EXERANY2,
    age_group =  `_AGEG5YR`,
    income = `_INCOMG1`,
    education =  EDUCA,
    smoking = `_SMOKER3`
  ) |>
  mutate(
    # outcome: mentally unhealthy days in past 30
  menthlth_days = case_when(
    menthlth_days == 88 ~ 0,
    menthlth_days %in% c(77,99) ~ NA_real_,
    menthlth_days >=1 & menthlth_days<= 30 ~as.numeric(menthlth_days),
    TRUE  ~NA_real_
  ),
   # Physically unhealthy days in past 30 
   physhlth_days = case_when(
    physhlth_days == 88 ~ 0,
    physhlth_days %in% c(77,99) ~ NA_real_,
    physhlth_days >=1 & physhlth_days<=30 ~as.numeric(physhlth_days),
    TRUE  ~NA_real_
  ),
  
  # BMI
  bmi = case_when(
   bmi == 9999 ~NA_real_,
   TRUE     ~as.numeric(bmi)/100
  ),
  # Sex
  sex = factor(sex,
               levels = c(1,2),
               labels = c("Male","Female")),
 
  # Any exercise in the past 30 days
   exercise = case_when(
     exercise  %in% c(7,9) ~NA_real_, 
     TRUE                  ~ as.numeric(exercise)
     ),
     exercise = factor(exercise ,
           levels = c(2,1),
           labels = c("No", "Yes")),

  # Age group in 5-year categories (13 levels)
  age_group = case_when(
    age_group %in% 1:3 ~ "18-34",
    age_group %in% 4:6 ~ "35-49",
    age_group %in% 7:9 ~ "50-64",
    age_group %in% 10:13 ~"65+",
    TRUE ~ NA_character_
  ),
  age_group = factor (age_group,
                      levels = c("18-34","35-49", "50-64", "65+")),
    
#Annual household Income (7 levels)
  income = case_when(
    income %in% c(1,2) ~ "Less than $25,000",
    income %in% c(3,4) ~  "$25,000 ~ $49,000",
    income %in% c(5,6) ~ "$50,000 ~ $99,000",
    income == 7      ~ "$100,000+",
    income == 9      ~ NA_character_,
    TRUE ~ NA_character_
  ),
income = factor(income,
                levels = c("Less than $25,000","$25,000 ~ $49,000", "$50,000 ~ $99,000", "$100,000+" )
),


# Highest level of education completed
  education = case_when(
    education %in% 1:3 ~ "Less than HS",
    education == 4    ~ "HS/GED",
    education == 5    ~ "Some College",
    education == 6    ~ "College graduate",
    education == 9    ~ NA_character_,
    TRUE              ~ NA_character_),
    education = factor(education,
                       levels = c("Less than HS",
                                  "HS/GED",
                                  "Some College",
                                  "College graduate")),
    
# Smoking status 4 levels
  smoking = case_when(
    smoking %in% c(1,2) ~ "Current",
    smoking == 3     ~ "Former",
    smoking == 4     ~ "Never",
    smoking == 9      ~ NA_character_,
    TRUE              ~ NA_character_
    ),
    smoking = factor(smoking, 
           levels = c("Never",
                      "Former",
                      "Current")
  ))

Missingness report

vars <- c("physhlth_days","menthlth_days", "bmi", "sex", "exercise", "age_group", "income","education","smoking")

missing_summary <- data.frame(
  Variable = vars,
  Missing_Count = colSums(is.na(brfss_recode[vars])),
  Missing_Percent = round(colSums(is.na(brfss_recode[vars])) / nrow(brfss_recode) * 100, 2)
)
print(missing_summary)

##                    Variable Missing_Count Missing_Percent
## physhlth_days physhlth_days         10785            2.49
## menthlth_days menthlth_days          8108            1.87
## bmi                     bmi         40535            9.35
## sex                     sex             0            0.00
## exercise           exercise          1251            0.29
## age_group         age_group          7779            1.80
## income               income         86623           19.99
## education         education          2325            0.54
## smoking             smoking         23062            5.32

Analytic Sample: remove missing values and draw a random sample of 8,000 observations:

set.seed(1220)
brfss_analytic <-brfss_recode |>
  drop_na(menthlth_days,physhlth_days,bmi,sex,exercise, age_group,income,education,smoking) |>
  slice_sample(n=8000)

tibble(Metric = c("Observations", "Variables"),
       Value  = c(nrow(brfss_analytic), ncol(brfss_analytic))) |>
  kable(caption = "Analytic Dataset Dimensions") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Analytic Dataset Dimensions
Metric	Value
Observations	8000
Variables	9

Sample Size: 8000

Descriptive table

brfss_analytic |>
  select( physhlth_days, menthlth_days, bmi, sex , exercise, age_group, income, education, smoking) |>
  tbl_summary(
    label= list(
      physhlth_days ~  "Physically unhealthy days in past 30",
      menthlth_days ~ "Mentally unhealthy days in past 30",
      bmi           ~   "Body mass index",
      exercise      ~   "Any exercise in past 30 days",
      age_group     ~   "Age group in 5-year categories (13 levels)",
      income        ~   "Annual household income (7 levels)",
      education     ~   "Highest level of education completed",
      smoking       ~   "Smoking status (4 levels)"
      ),

    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~1,
    missing = "no"
  ) |>
  add_n() |>
  bold_labels() |> 
  italicize_levels() |>
  modify_caption("**Table 1. Descriptive Statistics  - BRFSS 2023 (n=8,000)**" ) |>
  as_flex_table()

**Table 1. Descriptive Statistics - BRFSS 2023 (n=8,000)**
Characteristic	N	N = 8,0001
Physically unhealthy days in past 30	8,000	4.4 (8.7)
Mentally unhealthy days in past 30	8,000	4.3 (8.1)
Body mass index	8,000	28.7 (6.5)
sex	8,000
Male		3,936 (49%)
Female		4,064 (51%)
Any exercise in past 30 days	8,000	6,240 (78%)
Age group in 5-year categories (13 levels)	8,000
18-34		1,277 (16%)
35-49		1,665 (21%)
50-64		2,055 (26%)
65+		3,003 (38%)
Annual household income (7 levels)	8,000
Less than $25,000		1,048 (13%)
$25,000 ~ $49,000		1,938 (24%)
$50,000 ~ $99,000		4,376 (55%)
$100,000+		638 (8.0%)
Highest level of education completed	8,000
Less than HS		376 (4.7%)
HS/GED		1,827 (23%)
Some College		2,138 (27%)
College graduate		3,659 (46%)
Smoking status (4 levels)	8,000
Never		4,812 (60%)
Former		2,262 (28%)
Current		926 (12%)
1Mean (SD); n (%)

Part 1: Model Selection for Linear Regression

Step 1: Fit the Maximum Model (5 points) • Fit a linear regression model containing all of your selected predictors 5 • Display the model summary • Report and interpret: R-squared, Adjusted R-squared, AIC (using AIC()), and BIC (using BIC())

mod_max <- lm(physhlth_days ~ menthlth_days + bmi + exercise + age_group +
                income + education + smoking,
              data = brfss_analytic)

tidy(mod_max, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Maximum Model: All Candidate Predictors",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Maximum Model: All Candidate Predictors
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	5.8000	0.6530	8.8821	0.0000	4.5200	7.0800
menthlth_days	0.3042	0.0114	26.7352	0.0000	0.2819	0.3265
bmi	0.0480	0.0140	3.4339	0.0006	0.0206	0.0754
exerciseYes	-3.4505	0.2253	-15.3127	0.0000	-3.8922	-3.0088
age_group35-49	1.3477	0.2981	4.5215	0.0000	0.7634	1.9320
age_group50-64	2.6452	0.2870	9.2171	0.0000	2.0826	3.2077
age_group65+	2.8828	0.2740	10.5221	0.0000	2.3457	3.4198
income$25,000 ~ $49,000	-2.6241	0.3073	-8.5398	0.0000	-3.2264	-2.0217
income$50,000 ~ $99,000	-3.2385	0.2952	-10.9701	0.0000	-3.8172	-2.6598
income$100,000+	-3.8271	0.4303	-8.8949	0.0000	-4.6705	-2.9837
educationHS/GED	-1.1978	0.4508	-2.6567	0.0079	-2.0815	-0.3140
educationSome College	-0.7178	0.4525	-1.5863	0.1127	-1.6048	0.1692
educationCollege graduate	-1.2319	0.4531	-2.7188	0.0066	-2.1201	-0.3437
smokingFormer	0.6768	0.2052	3.2977	0.0010	0.2745	1.0791
smokingCurrent	1.1503	0.2964	3.8807	0.0001	0.5692	1.7313

glance(mod_max) |>
  dplyr::select(r.squared, adj.r.squared, sigma, AIC, BIC, df.residual) |>
  mutate(across(everything(), \(x) round(x, 3))) |>
  kable(caption = "Maximum Model: Fit Statistics") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Maximum Model: Fit Statistics
r.squared	adj.r.squared	sigma	AIC	BIC	df.residual
0.184	0.183	7.858	55705.45	55817.24	7985

Interpretation: The maximum model explains approximately 18.4% of the variance in physically unhealthy days ((R²= 0.184, Adjusted R² = 0.183). The strongest predictors are exercise (exerciser report 3.45 fewer physically unhealthy days ) and income (people who have income 100k and above report 3.82 fewer physically unhealthy days than people who have income less than $25,000). Age, education and smoking status have association with physicaly unhealthy days.

AIC=55705.45 AIC balances model fit against complexity using the penalty. BIC=55817.24 BIC applies a stricter complexity penalty than AIC.

Step 2: Best Subsets Regression (10 points) • Use leaps::regsubsets() to perform best subsets regression on your maximum model

• Set nvmax equal to the total number of parameters in your maximum model

• Note: regsubsets() treats each dummy variable from a factor as a separate predictor. This means some levels of a categorical variable could be included while others are excluded. This is a known limitation. You will compare these results with stepwise methods (Step 3), which handle factors as complete units. • Create a summary table or plot showing Adjusted R-squared and BIC across model sizes (number of predictors) • Identify the model size that maximizes Adjusted R-squared • Identify the model size that minimizes BIC • In 2-3 sentences, describe what you observe: Do the two criteria agree on the best model size? If not, which criterion favors a simpler model

library(leaps)

best_subsets <- regsubsets(
  physhlth_days ~ menthlth_days+ bmi+ sex +
    exercise + age_group + income + education + smoking,
  data = brfss_analytic,
  nvmax = 8,      # maximum number of variables to consider
  method = "exhaustive"
)

best_summary <- summary(best_subsets)

subset_metrics <- tibble(
  p = 1:length(best_summary$adjr2),
  `Adj. R²` = best_summary$adjr2,
  BIC = best_summary$bic,
  Cp = best_summary$cp
)

p1 <- ggplot(subset_metrics, aes(x = p, y = `Adj. R²`)) +
  geom_line(linewidth = 1, color = "steelblue") +
  geom_point(size = 3, color = "steelblue") +
  geom_vline(xintercept = which.max(best_summary$adjr2),
             linetype = "dashed", color = "tomato") +
  labs(title = "Adjusted R² by Model Size", x = "Number of Variables", y = "Adjusted R²") +
  theme_minimal(base_size = 12)

p2 <- ggplot(subset_metrics, aes(x = p, y = BIC)) +
  geom_line(linewidth = 1, color = "steelblue") +
  geom_point(size = 3, color = "steelblue") +
  geom_vline(xintercept = which.min(best_summary$bic),
             linetype = "dashed", color = "tomato") +
  labs(title = "BIC by Model Size", x = "Number of Variables", y = "BIC") +
  theme_minimal(base_size = 12)

gridExtra::grid.arrange(p1, p2, ncol = 2)

Best Subsets: Adjusted R² and BIC by Model Size

cat("Best model by Adj. R²:", which.max(best_summary$adjr2), "variables\n")

## Best model by Adj. R²: 8 variables

cat("Best model by BIC:", which.min(best_summary$bic), "variables\n")

## Best model by BIC: 8 variables

# Show which variables are in the BIC-best model
best_bic_idx <- which.min(best_summary$bic)
best_vars <- names(which(best_summary$which[best_bic_idx, -1]))
cat("\nVariables in BIC-best model:\n")

## 
## Variables in BIC-best model:

cat(paste(" ", best_vars), sep = "\n")

##   menthlth_days
##   exerciseYes
##   age_group35-49
##   age_group50-64
##   age_group65+
##   income$25,000 ~ $49,000
##   income$50,000 ~ $99,000
##   income$100,000+

Interpretation: In this case, both criteria agree. Adjusted R^2 is maximized and BIC is minimized at the same model size of 8 variable. The BIC-best model retains menthlth_days, exerciseYes, all three age group dummies and all three income dummies, while excluding bmi, sex, education and smoking. Those predictors did not contirbute sufficient explanatory power to offset the BIC complexity penalty.

Step 3: Stepwise Selection Methods (10 points) Using the step() function, perform the following three selection procedures. For each, report which variables are in the final model.

• Backward elimination: Start from the maximum model (direction = “backward”)

• Forward selection: Start from an intercept-only model with the maximum model as the upper scope (direction = “forward”)

• Stepwise selection: Start from the intercept-only model with both directions allowed (direction = “both”) In 3-5 sentences, compare the results:

• Do the three methods agree on the same final model?

• Which variables (if any) were excluded by all three methods?

• Which variables were retained by all three methods?

Backward elimination:

# Step 1: Maximum model
mod_back <- mod_max
cat("Step 1: Maximum model\n")

## Step 1: Maximum model

cat("Variables:", paste(names(coef(mod_back))[-1], collapse = ", "), "\n")

## Variables: menthlth_days, bmi, exerciseYes, age_group35-49, age_group50-64, age_group65+, income$25,000 ~ $49,000, income$50,000 ~ $99,000, income$100,000+, educationHS/GED, educationSome College, educationCollege graduate, smokingFormer, smokingCurrent

# Show p-values for the maximum model
pvals <- tidy(mod_back) |>
  filter(term != "(Intercept)") |>
  arrange(desc(p.value)) |>
  dplyr::select(term, estimate, p.value) |>
  mutate(across(where(is.numeric), \(x) round(x, 4)))

pvals |>
  head(5) |>
  kable(caption = "Maximum Model: Variables Sorted by p-value (Highest First)") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Maximum Model: Variables Sorted by p-value (Highest First)
term	estimate	p.value
educationSome College	-0.7178	0.1127
educationHS/GED	-1.1978	0.0079
educationCollege graduate	-1.2319	0.0066
smokingFormer	0.6768	0.0010
bmi	0.0480	0.0006

# Automated backward elimination using AIC
mod_backward <- step(mod_max, direction = "backward", trace = 1)

## Start:  AIC=33000.43
## physhlth_days ~ menthlth_days + bmi + exercise + age_group + 
##     income + education + smoking
## 
##                 Df Sum of Sq    RSS   AIC
## <none>                       493117 33000
## - education      3       745 493862 33007
## - bmi            1       728 493845 33010
## - smoking        2      1295 494412 33017
## - income         3      7909 501026 33122
## - age_group      3      8302 501418 33128
## - exercise       1     14480 507597 33230
## - menthlth_days  1     44141 537257 33684

tidy(mod_backward, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Backward Elimination Result (AIC-based)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Backward Elimination Result (AIC-based)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	5.8000	0.6530	8.8821	0.0000	4.5200	7.0800
menthlth_days	0.3042	0.0114	26.7352	0.0000	0.2819	0.3265
bmi	0.0480	0.0140	3.4339	0.0006	0.0206	0.0754
exerciseYes	-3.4505	0.2253	-15.3127	0.0000	-3.8922	-3.0088
age_group35-49	1.3477	0.2981	4.5215	0.0000	0.7634	1.9320
age_group50-64	2.6452	0.2870	9.2171	0.0000	2.0826	3.2077
age_group65+	2.8828	0.2740	10.5221	0.0000	2.3457	3.4198
income$25,000 ~ $49,000	-2.6241	0.3073	-8.5398	0.0000	-3.2264	-2.0217
income$50,000 ~ $99,000	-3.2385	0.2952	-10.9701	0.0000	-3.8172	-2.6598
income$100,000+	-3.8271	0.4303	-8.8949	0.0000	-4.6705	-2.9837
educationHS/GED	-1.1978	0.4508	-2.6567	0.0079	-2.0815	-0.3140
educationSome College	-0.7178	0.4525	-1.5863	0.1127	-1.6048	0.1692
educationCollege graduate	-1.2319	0.4531	-2.7188	0.0066	-2.1201	-0.3437
smokingFormer	0.6768	0.2052	3.2977	0.0010	0.2745	1.0791
smokingCurrent	1.1503	0.2964	3.8807	0.0001	0.5692	1.7313

Interpretation: The backward elimination is identical to the maximum model, retaining all 14 terms. Backward elimination(AIC based) kept all predictors including education, even thought education_some college has a non-significant p-value: 0.113.

• Forward selection: Start from an intercept-only model with the maximum model as the upper scope (direction = “forward”)

# Automated forward selection using AIC
mod_null <- lm(physhlth_days ~ 1, data = brfss_analytic)

mod_forward <- step(mod_null,
                    scope = list(lower = mod_null, upper = mod_max),
                    direction = "forward", trace = 1)

## Start:  AIC=34603.45
## physhlth_days ~ 1
## 
##                 Df Sum of Sq    RSS   AIC
## + menthlth_days  1     59071 545563 33783
## + exercise       1     37706 566927 34090
## + income         3     32443 572191 34168
## + smoking        2     12018 592616 34447
## + education      3     11873 592761 34451
## + age_group      3      5515 599119 34536
## + bmi            1      4732 599901 34543
## <none>                       604634 34603
## 
## Step:  AIC=33783.01
## physhlth_days ~ menthlth_days
## 
##             Df Sum of Sq    RSS   AIC
## + exercise   1   28346.8 517216 33358
## + income     3   19686.5 525876 33495
## + age_group  3   14455.2 531108 33574
## + education  3    8240.9 537322 33667
## + smoking    2    6070.4 539492 33697
## + bmi        1    2821.0 542742 33744
## <none>                   545563 33783
## 
## Step:  AIC=33358.15
## physhlth_days ~ menthlth_days + exercise
## 
##             Df Sum of Sq    RSS   AIC
## + income     3   11764.1 505452 33180
## + age_group  3    9921.3 507295 33209
## + smoking    2    3952.9 513263 33301
## + education  3    3409.0 513807 33311
## + bmi        1     753.5 516463 33348
## <none>                   517216 33358
## 
## Step:  AIC=33180.09
## physhlth_days ~ menthlth_days + exercise + income
## 
##             Df Sum of Sq    RSS   AIC
## + age_group  3    9400.2 496052 33036
## + smoking    2    2483.5 502968 33145
## + bmi        1     781.6 504670 33170
## + education  3     937.1 504515 33171
## <none>                   505452 33180
## 
## Step:  AIC=33035.91
## physhlth_days ~ menthlth_days + exercise + income + age_group
## 
##             Df Sum of Sq    RSS   AIC
## + smoking    2   1427.13 494625 33017
## + education  3   1013.78 495038 33026
## + bmi        1    666.76 495385 33027
## <none>                   496052 33036
## 
## Step:  AIC=33016.86
## physhlth_days ~ menthlth_days + exercise + income + age_group + 
##     smoking
## 
##             Df Sum of Sq    RSS   AIC
## + bmi        1    762.78 493862 33007
## + education  3    779.94 493845 33010
## <none>                   494625 33017
## 
## Step:  AIC=33006.52
## physhlth_days ~ menthlth_days + exercise + income + age_group + 
##     smoking + bmi
## 
##             Df Sum of Sq    RSS   AIC
## + education  3    745.36 493117 33000
## <none>                   493862 33007
## 
## Step:  AIC=33000.43
## physhlth_days ~ menthlth_days + exercise + income + age_group + 
##     smoking + bmi + education

tidy(mod_forward, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Forward Selection Result (AIC-based)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Forward Selection Result (AIC-based)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	5.8000	0.6530	8.8821	0.0000	4.5200	7.0800
menthlth_days	0.3042	0.0114	26.7352	0.0000	0.2819	0.3265
exerciseYes	-3.4505	0.2253	-15.3127	0.0000	-3.8922	-3.0088
income$25,000 ~ $49,000	-2.6241	0.3073	-8.5398	0.0000	-3.2264	-2.0217
income$50,000 ~ $99,000	-3.2385	0.2952	-10.9701	0.0000	-3.8172	-2.6598
income$100,000+	-3.8271	0.4303	-8.8949	0.0000	-4.6705	-2.9837
age_group35-49	1.3477	0.2981	4.5215	0.0000	0.7634	1.9320
age_group50-64	2.6452	0.2870	9.2171	0.0000	2.0826	3.2077
age_group65+	2.8828	0.2740	10.5221	0.0000	2.3457	3.4198
smokingFormer	0.6768	0.2052	3.2977	0.0010	0.2745	1.0791
smokingCurrent	1.1503	0.2964	3.8807	0.0001	0.5692	1.7313
bmi	0.0480	0.0140	3.4339	0.0006	0.0206	0.0754
educationHS/GED	-1.1978	0.4508	-2.6567	0.0079	-2.0815	-0.3140
educationSome College	-0.7178	0.4525	-1.5863	0.1127	-1.6048	0.1692
educationCollege graduate	-1.2319	0.4531	-2.7188	0.0066	-2.1201	-0.3437

Interpretation: Forward selection arrived at the same final model as backward elimination, including the same 14 predictor terms. The order of entry is informative: menthlth_days entered first (the strongest predictor), followed by exercise, income, age_group, smoking, bmi and education This ordering reflects each variable’s marginal contribution given the variables already in the model. The convergence of forward and backward methods on the same model increases our confidence in this particular subset, though this convergence is not guaranteed in general.

• Stepwise selection: Start from the intercept-only model with both directions allowed (direction = “both”) In 3-5 sentences, compare the results:

mod_stepwise <- step(mod_null,
                     scope = list(lower = mod_null, upper = mod_max),
                     direction = "both", trace = 1)

## Start:  AIC=34603.45
## physhlth_days ~ 1
## 
##                 Df Sum of Sq    RSS   AIC
## + menthlth_days  1     59071 545563 33783
## + exercise       1     37706 566927 34090
## + income         3     32443 572191 34168
## + smoking        2     12018 592616 34447
## + education      3     11873 592761 34451
## + age_group      3      5515 599119 34536
## + bmi            1      4732 599901 34543
## <none>                       604634 34603
## 
## Step:  AIC=33783.01
## physhlth_days ~ menthlth_days
## 
##                 Df Sum of Sq    RSS   AIC
## + exercise       1     28347 517216 33358
## + income         3     19686 525876 33495
## + age_group      3     14455 531108 33574
## + education      3      8241 537322 33667
## + smoking        2      6070 539492 33697
## + bmi            1      2821 542742 33744
## <none>                       545563 33783
## - menthlth_days  1     59071 604634 34603
## 
## Step:  AIC=33358.15
## physhlth_days ~ menthlth_days + exercise
## 
##                 Df Sum of Sq    RSS   AIC
## + income         3     11764 505452 33180
## + age_group      3      9921 507295 33209
## + smoking        2      3953 513263 33301
## + education      3      3409 513807 33311
## + bmi            1       754 516463 33348
## <none>                       517216 33358
## - exercise       1     28347 545563 33783
## - menthlth_days  1     49711 566927 34090
## 
## Step:  AIC=33180.09
## physhlth_days ~ menthlth_days + exercise + income
## 
##                 Df Sum of Sq    RSS   AIC
## + age_group      3      9400 496052 33036
## + smoking        2      2484 502968 33145
## + bmi            1       782 504670 33170
## + education      3       937 504515 33171
## <none>                       505452 33180
## - income         3     11764 517216 33358
## - exercise       1     20424 525876 33495
## - menthlth_days  1     41532 546984 33810
## 
## Step:  AIC=33035.91
## physhlth_days ~ menthlth_days + exercise + income + age_group
## 
##                 Df Sum of Sq    RSS   AIC
## + smoking        2      1427 494625 33017
## + education      3      1014 495038 33026
## + bmi            1       667 495385 33027
## <none>                       496052 33036
## - age_group      3      9400 505452 33180
## - income         3     11243 507295 33209
## - exercise       1     17199 513251 33307
## - menthlth_days  1     47177 543229 33761
## 
## Step:  AIC=33016.86
## physhlth_days ~ menthlth_days + exercise + income + age_group + 
##     smoking
## 
##                 Df Sum of Sq    RSS   AIC
## + bmi            1       763 493862 33007
## + education      3       780 493845 33010
## <none>                       494625 33017
## - smoking        2      1427 496052 33036
## - age_group      3      8344 502968 33145
## - income         3      9897 504522 33169
## - exercise       1     16705 511329 33281
## - menthlth_days  1     44789 539414 33708
## 
## Step:  AIC=33006.52
## physhlth_days ~ menthlth_days + exercise + income + age_group + 
##     smoking + bmi
## 
##                 Df Sum of Sq    RSS   AIC
## + education      3       745 493117 33000
## <none>                       493862 33007
## - bmi            1       763 494625 33017
## - smoking        2      1523 495385 33027
## - age_group      3      8245 502107 33133
## - income         3      9781 503643 33157
## - exercise       1     15215 509077 33247
## - menthlth_days  1     44206 538068 33690
## 
## Step:  AIC=33000.43
## physhlth_days ~ menthlth_days + exercise + income + age_group + 
##     smoking + bmi + education
## 
##                 Df Sum of Sq    RSS   AIC
## <none>                       493117 33000
## - education      3       745 493862 33007
## - bmi            1       728 493845 33010
## - smoking        2      1295 494412 33017
## - income         3      7909 501026 33122
## - age_group      3      8302 501418 33128
## - exercise       1     14480 507597 33230
## - menthlth_days  1     44141 537257 33684

tidy(mod_stepwise, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Stepwise Selection Result (AIC-based)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Stepwise Selection Result (AIC-based)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	5.8000	0.6530	8.8821	0.0000	4.5200	7.0800
menthlth_days	0.3042	0.0114	26.7352	0.0000	0.2819	0.3265
exerciseYes	-3.4505	0.2253	-15.3127	0.0000	-3.8922	-3.0088
income$25,000 ~ $49,000	-2.6241	0.3073	-8.5398	0.0000	-3.2264	-2.0217
income$50,000 ~ $99,000	-3.2385	0.2952	-10.9701	0.0000	-3.8172	-2.6598
income$100,000+	-3.8271	0.4303	-8.8949	0.0000	-4.6705	-2.9837
age_group35-49	1.3477	0.2981	4.5215	0.0000	0.7634	1.9320
age_group50-64	2.6452	0.2870	9.2171	0.0000	2.0826	3.2077
age_group65+	2.8828	0.2740	10.5221	0.0000	2.3457	3.4198
smokingFormer	0.6768	0.2052	3.2977	0.0010	0.2745	1.0791
smokingCurrent	1.1503	0.2964	3.8807	0.0001	0.5692	1.7313
bmi	0.0480	0.0140	3.4339	0.0006	0.0206	0.0754
educationHS/GED	-1.1978	0.4508	-2.6567	0.0079	-2.0815	-0.3140
educationSome College	-0.7178	0.4525	-1.5863	0.1127	-1.6048	0.1692
educationCollege graduate	-1.2319	0.4531	-2.7188	0.0066	-2.1201	-0.3437

Interpretation: The stepwise procedure, which allows both addition and removal at each step, also converges on the identical model. This three-way agreement (backward = forward = stepwise) is reassuring but should not be taken as proof that this is the “correct” model. All three methods optimize the same criterion (AIC) on the same data.

# Comparing all three models
method_comparison <- tribble(
  ~Method, ~`Variables selected`, ~`Adj. R²`, ~AIC, ~BIC,
  "Maximum model",
    length(coef(mod_max)) - 1,
    round(glance(mod_max)$adj.r.squared, 4),
    round(AIC(mod_max), 1),
    round(BIC(mod_max), 1),
  "Backward (AIC)",
    length(coef(mod_backward)) - 1,
    round(glance(mod_backward)$adj.r.squared, 4),
    round(AIC(mod_backward), 1),
    round(BIC(mod_backward), 1),
  "Forward (AIC)",
    length(coef(mod_forward)) - 1,
    round(glance(mod_forward)$adj.r.squared, 4),
    round(AIC(mod_forward), 1),
    round(BIC(mod_forward), 1),
  "Stepwise (AIC)",
    length(coef(mod_stepwise)) - 1,
    round(glance(mod_stepwise)$adj.r.squared, 4),
    round(AIC(mod_stepwise), 1),
    round(BIC(mod_stepwise), 1)
)

method_comparison |>
  kable(caption = "Comparison of Variable Selection Methods") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Comparison of Variable Selection Methods
Method	Variables selected	Adj. R²	AIC	BIC
Maximum model	14	0.183	55705.4	55817.2
Backward (AIC)	14	0.183	55705.4	55817.2
Forward (AIC)	14	0.183	55705.4	55817.2
Stepwise (AIC)	14	0.183	55705.4	55817.2

Interpretation All three AIC-based stepwise methods backward elimination, forward selection, and stepwise selection agreed on the same final model, retaining all 14 predictor terms from the maximum model. No variables were excluded by any method, meaning every predictor (menthlth_days, bmi, exercise, age_group, income, education, and smoking) was contribute sufficiently to justify its inclusion under the AIC penalty criterion.

Step 4: Final Model Selection and Interpretation (5 points) • Choose your final model. State which method(s) guided your choice and why. • Display the coefficient table for your final model using tidy() with confidence intervals • Interpret the coefficients for at least three predictors in plain language, including units and “holding all other variables constant” language. Include at least one continuous and one categorical predictor.

tidy(mod_max, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Final Model: Coefficient table",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Final Model: Coefficient table
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	5.8000	0.6530	8.8821	0.0000	4.5200	7.0800
menthlth_days	0.3042	0.0114	26.7352	0.0000	0.2819	0.3265
bmi	0.0480	0.0140	3.4339	0.0006	0.0206	0.0754
exerciseYes	-3.4505	0.2253	-15.3127	0.0000	-3.8922	-3.0088
age_group35-49	1.3477	0.2981	4.5215	0.0000	0.7634	1.9320
age_group50-64	2.6452	0.2870	9.2171	0.0000	2.0826	3.2077
age_group65+	2.8828	0.2740	10.5221	0.0000	2.3457	3.4198
income$25,000 ~ $49,000	-2.6241	0.3073	-8.5398	0.0000	-3.2264	-2.0217
income$50,000 ~ $99,000	-3.2385	0.2952	-10.9701	0.0000	-3.8172	-2.6598
income$100,000+	-3.8271	0.4303	-8.8949	0.0000	-4.6705	-2.9837
educationHS/GED	-1.1978	0.4508	-2.6567	0.0079	-2.0815	-0.3140
educationSome College	-0.7178	0.4525	-1.5863	0.1127	-1.6048	0.1692
educationCollege graduate	-1.2319	0.4531	-2.7188	0.0066	-2.1201	-0.3437
smokingFormer	0.6768	0.2052	3.2977	0.0010	0.2745	1.0791
smokingCurrent	1.1503	0.2964	3.8807	0.0001	0.5692	1.7313

Interpretation: The maximum model is selected as the final model. All three stepwise methods agreed backward elimination, forward selection and stepwise selection all ended up with the same full model. When three different approaches starting from different points and working in different directions all have the same answer, that is a good sign the predictors is well suppoeted.

Menthlth_days: Holding all other variable constant, each additional mentally unhealthy days reported in the past 30 days is associated with 0.304 more poor physical health days (95%CI: 0.2819, 0.3265, p<0.001). This is the strongest estimated predictor in the model.

Bmi: Holding all other variables constant, each one-unit increase in BMI is associated with 0.0480 additional poor physical health days (95%CI: 0.0206, 0.0754 p<0.001). This effect is small but with evidence linking higher BMI to worse physical conditions.

Smoking: Holding all other variables constant, both former and current smokers report more poor physical health days compared to never smokers. Holding all other variable constant, former smoker reported 0.6768 more days of physical unhealthy days compared to never smoker (95%CI: 0.5692, 1.7313; p<0.001).

Step 5: LINE Assumption Check (5 points)

Using your final linear model, produce the four base R diagnostic plots: par(mfrow = c(2, 2)) plot(your_final_model) par(mfrow = c(1, 1))

par(mfrow = c(2, 2)) 
plot(mod_max)

Interpret each panel in 1-2 sentences: 1. Residuals vs. Fitted: Is there evidence of non-linearity or non-constant variance? There is no clear evidence of non-linearity, as the smoothed line stays relatively flat across fitted values. However, the spread of residuals appears slightly uneven.

Normal Q-Q: Do the residuals approximately follow a normal distribution? Where do they deviate?

The residuals do not perfectly follow a normal distribution. They align well in the middle range but drift away from the diagonal line in the upper tail, suggesting the distribution is somewhat skewed.

Scale-Location: Is the spread of residuals roughly constant across fitted values?

The spread of residuals is not fully constant across fitted values. The upward trend in the smoothed line suggests that variance increases as fitted values get larger, which is a sign of heteroscedasticity.

Residuals vs. Leverage: Are there any highly influential observations (large Cook’s distance)?

There are no highly influential observations in this model. No data points come close to exceeding Cook’s distance thresholds, so no individual observations are unduly driving the results.

Write a brief conclusion (2-3 sentences): Are the LINE assumptions reasonably satisfied? What violations, if any, do you observe?

The LINE assumptions are mostly met, with linearity holding reasonably well and no concerning influential observations. The two main violations are mild heteroscedasticity where variance grows slightly with larger fitted values and a departure from normality in the upper tail of the residuals. Given the large sample size (n = 8,000), these violations are unlikely to affect the reliability of the model’s estimates or conclusions.

Part 2.Logistic Regression

You will now reframe the research question using a binary outcome. The CDC defines frequent physical distress as reporting 14 or more physically unhealthy days in the past 30 days. Step 1: Create the Binary Outcome (5 points) • Create a new variable frequent_distress: o 1 if physhlth_days >= 14 o 0 if physhlth_days < 14 • Store as a factor with levels “No” (reference) and “Yes” • Report the prevalence of frequent physical distress (count and percentage in each category) • In 1-2 sentences, comment on the balance of the outcome

brfss_new <-brfss_analytic|>
  mutate(
    frequent_distress = ifelse(physhlth_days >=14,  1,  0), 
    frequent_distress = factor(frequent_distress, 
                               levels = c(0, 1),
                               labels = c("No", "Yes"))
  )
brfss_new |>
  count(frequent_distress) |>
  mutate(percent = n/ sum(n) * 100) |>
  mutate(percent = round(percent, 1)) |>
  kable(
    caption = "Prevalence of frequent physical distress",
    col.names = c("Frequency Distress", "Count", "Percentage")
  ) |>
  
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Prevalence of frequent physical distress
Frequency Distress	Count	Percentage
No	6914	86.4
Yes	1086	13.6

Interpretation: 13.6% of people reported frequent physical distress while 86.4% did not report frequent physical distress. The outcome is not balanced with a large proportion of individuals in the “No” category.

Step 2: Simple (Unadjusted) Logistic Regression (10 points) • Choose one predictor from your final linear model that you expect to have a strong association with physical distress. State why you chose it. 6 • Fit a simple logistic regression: mod_simple <- glm(frequent_distress ~ your_predictor, data = brfss_analytic, family = binomial) • Display the model summary • Calculate and report the odds ratio and 95% confidence interval: exp(coef(mod_simple)) # Odds ratio exp(confint(mod_simple)) # 95% CI for OR • Interpret the odds ratio in plain language. For a continuous predictor: “For every one-unit increase in [predictor], the odds of frequent physical distress are multiplied by [OR], 95% CI [lower, upper].” For a categorical predictor: “Compared to [reference group], [comparison group] has [OR] times the odds of frequent physical distress, 95% CI [lower, upper].” • State whether the association is statistically significant at alpha = 0.05 and how you know (Wald test p-value or CI excluding 1.0)

mod_simple <- glm(frequent_distress ~ exercise, data = brfss_new,
                    family = binomial(link = "logit"))

tidy(mod_simple, conf.int = TRUE, exponentiate = FALSE) |>
  kable(digits = 3, caption = "Simple Logistic Regression: Frequent physical distress ~ Exercise (Log-Odds Scale)") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Simple Logistic Regression: Frequent physical distress ~ Exercise (Log-Odds Scale)
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-0.933	0.053	-17.613	0	-1.037	-0.830
exerciseYes	-1.329	0.068	-19.423	0	-1.463	-1.194

tidy(mod_simple, conf.int = TRUE, exponentiate = TRUE) |>
  kable(digits = 3,
        caption = "Simple Logistic Regression: frequent physical distress ~ Exercise (Odds Ratio Scale)") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Simple Logistic Regression: frequent physical distress ~ Exercise (Odds Ratio Scale)
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	0.394	0.053	-17.613	0	0.354	0.436
exerciseYes	0.265	0.068	-19.423	0	0.232	0.303

mod_simple |>
  tbl_regression(
    exponentiate = TRUE,
    label = list(exercise ~ "Any exercise in past 30 days")
  ) |>
  bold_labels() |>
  bold_p()

Characteristic	OR	95% CI	p-value
Any exercise in past 30 days
No	—	—
Yes	0.26	0.23, 0.30	<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Interpretation: Exercise was selected because it was the strongest predictor producing the largest coefficient -3.45 days among all predictors. It has a clear relationship with physical health in public health literature.

Log-odds scale: The coefficient for exerciseYes is -1.329 meaning exercise is associated with lower log-odds of frequent physical distress. Odds ratio for exercise Yes is 0.265 (95%CI: 0.23, 0.3; p<0.001). Compared to people who did not exercise in the past 30 days, those who did exercise have 0.265 times the odds of experiencing frequent physical distress. In other words, exerciser have roughly 73% lower odds of frequent physical distress than non-exercisers, holding other variables constant. The aossocicaiton is strong and statistically significant.

Step 3: Multiple Logistic Regression (10 points) • Fit a multiple logistic regression using the same predictors as your final linear model from Part 1: mod_logistic <- glm(frequent_distress ~ [same predictors as final linear model], data = brfss_analytic, family = binomial) • Display the model summary • Calculate and display a table of adjusted odds ratios with 95% confidence intervals for all predictors: tidy(mod_logistic, conf.int = TRUE, exponentiate = TRUE) |> kable(digits = 3) • Interpret the adjusted odds ratios for at least three predictors in plain language, using “holding all other variables constant” language. Include at least one continuous and one categorical predictor.

mod_logistic <-glm(
  frequent_distress ~  menthlth_days + bmi +  sex + exercise + age_group + income + education + smoking,
  data = brfss_new,
  family= binomial)
summary(mod_logistic)

## 
## Call:
## glm(formula = frequent_distress ~ menthlth_days + bmi + sex + 
##     exercise + age_group + income + education + smoking, family = binomial, 
##     data = brfss_new)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -2.255697   0.243962  -9.246  < 2e-16 ***
## menthlth_days              0.068198   0.003664  18.611  < 2e-16 ***
## bmi                        0.014348   0.005155   2.783 0.005380 ** 
## sexFemale                  0.014039   0.073171   0.192 0.847851    
## exerciseYes               -0.885038   0.076998 -11.494  < 2e-16 ***
## age_group35-49             0.762689   0.147703   5.164 2.42e-07 ***
## age_group50-64             1.213299   0.139855   8.675  < 2e-16 ***
## age_group65+               1.291576   0.137163   9.416  < 2e-16 ***
## income$25,000 ~ $49,000   -0.589866   0.101555  -5.808 6.31e-09 ***
## income$50,000 ~ $99,000   -0.853635   0.100733  -8.474  < 2e-16 ***
## income$100,000+           -1.498863   0.230273  -6.509 7.56e-11 ***
## educationHS/GED           -0.336043   0.149017  -2.255 0.024129 *  
## educationSome College     -0.148256   0.149889  -0.989 0.322613    
## educationCollege graduate -0.494821   0.155068  -3.191 0.001418 ** 
## smokingFormer              0.296136   0.081651   3.627 0.000287 ***
## smokingCurrent             0.358801   0.105610   3.397 0.000680 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6354.8  on 7999  degrees of freedom
## Residual deviance: 5280.5  on 7984  degrees of freedom
## AIC: 5312.5
## 
## Number of Fisher Scoring iterations: 6

tidy(mod_logistic, conf.int = TRUE, exponentiate = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 3))) |>
  kable(
    caption = "Adjusted Odds Ratios for Frequent Physical Distress",
    digits = 3
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Adjusted Odds Ratios for Frequent Physical Distress
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	0.105	0.244	-9.246	0.000	0.065	0.168
menthlth_days	1.071	0.004	18.611	0.000	1.063	1.078
bmi	1.014	0.005	2.783	0.005	1.004	1.025
sexFemale	1.014	0.073	0.192	0.848	0.879	1.171
exerciseYes	0.413	0.077	-11.494	0.000	0.355	0.480
age_group35-49	2.144	0.148	5.164	0.000	1.610	2.875
age_group50-64	3.365	0.140	8.675	0.000	2.569	4.448
age_group65+	3.639	0.137	9.416	0.000	2.795	4.787
income$25,000 ~ $49,000	0.554	0.102	-5.808	0.000	0.454	0.677
income$50,000 ~ $99,000	0.426	0.101	-8.474	0.000	0.350	0.519
income$100,000+	0.223	0.230	-6.509	0.000	0.139	0.345
educationHS/GED	0.715	0.149	-2.255	0.024	0.535	0.960
educationSome College	0.862	0.150	-0.989	0.323	0.645	1.160
educationCollege graduate	0.610	0.155	-3.191	0.001	0.451	0.829
smokingFormer	1.345	0.082	3.627	0.000	1.145	1.577
smokingCurrent	1.432	0.106	3.397	0.001	1.162	1.758

Interpretation: menthlth_days: holding all other variables constant, for every one additional mentally unhealthy days, the odds of frequenty physical distress are 1.071 times higher (95%CI: 1.063, 1.078) statistically significant.

Sexfemale: After adjusted for all other variables, there is no meaningful differences in the odds of frequenty physical distress between males and female. (95%CI: 0.879, 1.171) The CI crosses 1.0, therefore, the assocaition is not statistically significant.

Exercise Yes: holding all other variables constant, compared to people who do not exercise, people who did exercise have 0.413 times the odds of frequent physical unhealthy days about 59% lower odds. The association is statistically significant (95%CI:0.355, 0.480; p<0.001 ).

Step 4: Likelihood Ratio Test (5 points) • Choose a group of related predictors to test collectively (for example, all income levels, all education levels, or all smoking levels) • Fit a reduced model that excludes this group • Conduct a likelihood ratio test comparing the reduced and full models: anova(mod_reduced, mod_logistic, test = “Chisq”) 7 • State the null and alternative hypotheses • Report the test statistic (deviance), degrees of freedom, and p-value • Conclude at alpha = 0.05: Does this group of predictors significantly improve the model?

mod_reduced <- glm(
  frequent_distress ~ menthlth_days + bmi + age_group + income + sex + smoking + exercise,
  data = brfss_new,
  family = binomial
)

anova(mod_reduced, mod_logistic, test = "LRT") |>
  kable(digits = 3,
        caption = "LR Test: Does removing education improve the model? ") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

LR Test: Does removing education improve the model?
Resid. Df	Resid. Dev	Df	Deviance	Pr(>Chi)
7987	5299.505	NA	NA	NA
7984	5280.523	3	18.982	0

Interpretaiton: Null Hyptohesis: Education level is not assocaited with frequent physical distress ( education coefficient = 0 ).

Alternative Hypothesis: Education level is associated with frequent physical distress ( at least one education coefficients ≠ 0 ).

Chi-square(deviance) = 18.982 df = 3 p-value <0.001 We can reject the null hypothesis since p-value <0.001. Education level significantly improve the model.

Step 5: ROC Curve and AUC (5 points) • Use the pROC package to generate a ROC curve for your multiple logistic regression

roc_obj <- roc(
  response = brfss_new$frequent_distress,
  predictor = fitted(mod_logistic),
  levels = c("No", "Yes"),
  direction = "<"
)

auc_value <- auc(roc_obj)

ggroc(roc_obj, color = "steelblue", linewidth = 1.2) +
  geom_abline(slope = 1, intercept = 1, linetype = "dashed", color = "red") +
  labs(title = "ROC Curve for Frequent Physical Distress Model",
       subtitle = paste0("AUC = ", round(auc_value, 3)),
       x = "Specificity", y = "Sensitivity") +
  theme_minimal()

auc(roc_obj)

## Area under the curve: 0.7774

ci.auc(roc_obj)

## 95% CI: 0.7619-0.7928 (DeLong)

• Report the AUC with 95% confidence interval: Area under the curve: 0.7774 95% CI: 0.7619-0.7928 (DeLong)

An AUC is approximately 0.7774 indicates acceptable to excellent discrimination, meaning the model can distinguish between individuals with and without frequent physical distress reasonably well.

Step 6: Hosmer-Lemeshow Goodness-of-Fit Test (5 points) • Conduct the Hosmer-Lemeshow test using the ResourceSelection package: hoslem.test(mod_logistic$y, fitted(mod_logistic), g = 10) • State the null hypothesis (the model fits the data well) and alternative hypothesis (the model does not fit the data well) • Report the test statistic, degrees of freedom, and p-value • Interpret: At alpha = 0.05, is there evidence of poor model fit? • In 1-2 sentences, discuss how the Hosmer-Lemeshow result complements the ROC/AUC finding

hl_test <- hoslem.test(
  x = as.numeric(brfss_new$frequent_distress) - 1,
  y = fitted(mod_logistic),
  g = 10
)

hl_test

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  as.numeric(brfss_new$frequent_distress) - 1, fitted(mod_logistic)
## X-squared = 11.097, df = 8, p-value = 0.1963

Null Hypothesis: the model fits the data well. Alternative Hypothesis: The model does not fit the data well. Chi-squared test = 11.097 Df=8 p-value = 0.1963 Since p-value > 0.05, we failed to reject the null hypothesis. There is no statistically significant evidence of poor model fit. The ROC/AUC AUC=0.737 told us the model does an acceptable job sperating people with and without frequent physical distress. The Hosmer Lemeshow test further confirmed that the model’s predicted probabilities are also well calibrated to the actual observed rates.

Part 3: Reflection (15 points) Write 250-400 words in continuous prose (no bullet points). A. Statistical Insights (5 points): What do your results suggest about the factors associated with physical health burden among U.S. adults? Which predictors were the strongest in both the linear and logistic models? Were any predictors significant in one model but not the other? What are the key limitations of using cross-sectional survey data for this type of analysis? Name at least two specific potential confounders not measured in your model.

The results suggest that physical health burden among U.S. adults is shaped by behavioral, socioeconomic, and demographic factors. In both the linear and logistic models, exercise and mental health days are the strongest predictors. People who exercised reported roughly 3.45 fewer poor physical health days in the linear model and had about 59% lower odds of frequent physical distress in the logistic model. Mental health days showed a consistent positive association across both frameworks, with each additional poor mental health day linked to more poor physical health days and higher odds of frequent distress. Income and age group were also strong and consistent across both models, showing clear gradients in the expected direction.The key limitations of using cross-sectional survey data is lacking of causality.

B. Linear versus Logistic Regression (5 points): Compare what the linear regression model tells you versus what the logistic regression model tells you about the same research question. What information does each approach provide that the other does not? In what situations would you prefer one approach over the other? How do the model selection criteria differ between the two frameworks (for example, R-squared versus AUC)?

Linear regression estimated how many additional poor physical health days were associated with each predictor, making results easy to interpret in days. Logistic regression instead examined whether someone had 14 or more poor physical health days and reported associations as odds ratios. Linear regression captures the full range of physical health burden, while logistic regression identifies factors linked to a high-burden group that may be more relevant for screening and intervention. Linear regression is preferred when the outcome is continuous, while logistic regression is more useful when a meaningful cutoff is important. Model evaluation also differs: linear regression uses R² and Adjusted R² to assess explained variance, while logistic regression uses AUC to measure discrimination and the Hosmer-Lemeshow test to assess calibration. In this analysis, the logistic model showed acceptable discrimination (AUC = 0.777), and the linear model explained a modest amount of variation (R² = 0.184), indicating both models identified meaningful associations.

C. AI Transparency (5 points): Describe any parts of this assignment where you sought AI assistance. Include the specific prompts you used and how you verified the correctness of the output. If you did not use AI, state this explicitly

I used Claude as an AI assistant for debugging during the data preparation and recoding section, where I encountered syntax errors. Claude helped me identify where these errors occurred and explained why the corrected syntax was right, which helped me learn from the mistakes rather than just fixing them blindly. Prompts “debugging the code”.

Assignment 4: Model Selection and Introduction to Logistic Regression

EPI 553 — Principles of Statistical Inference II (Spring 2026)

Jingjun Yang

April 24, 2026