Research Question

What individual, behavioral, and health factors are associated with physical health burden among U.S. adults? How do findings compare when the outcome is modeled as the number of physically unhealthy days (continuous) versus as an indicator of frequent physical distress (binary: 14 or more days in the past 30)?

Part 0. Data Preparation

0.1 Data Loading

Source: https://www.cdc.gov/brfss/annual_data/annual_2023.html

If there is already a file named “working_data.rds” in the data folder in the directory, then read the rds cache in. If there is not, read from the XPT. Write the rds cache only if it already exists, to save processing time.

filename <- "LLCP2023.XPT"
filename <- paste0("data/", filename)

df <- if (file.exists("data/working_data.rds")) {
  print("Loading working data from cache: data/working_data.rds")
  readRDS("data/working_data.rds")
} else {
  print(paste0("Loading full XPT from ", filename))
  read_xpt(filename)
}

## [1] "Loading working data from cache: data/working_data.rds"

if (file.exists("data/working_data.rds")) {
  print("nah it's cool don't rewrite")
  } else {
    saveRDS(df, "data/working_data.rds")
  print("Writing rds")
  }

## [1] "nah it's cool don't rewrite"

n_start <- nrow(df)

There are currently 433323 observations in the dataset.

0.2 Variable Recoding

0.2.1 Variable List

Outcome: physhlth_days Predictors: - 2 Continuous: menthlth_days, bmi - 1 Binary: depression - 2 Categorical: income, age_group - Any other 3: exercise (binary), sex (binary), smoking (categorical)

I included the binary variable depression because I believe that having a covariate for a depression diagnosis has the potential to inform interpretation of the exposure, in terms of the relationship between mental and physical health with more specifics. I included income and age_group because physical health is known to be associated negatively with both age and low socioeconomic status, and these covariates capture these. I included exercise and smoking to give insight into behavioral factors with a known relationship to physical health, and sex for further insight.

Raw Variable Name	Description	Your Recoded Name
PHYSHLTH	Physically unhealthy days in past 30	`physhlth_days`
MENTHLTH	Mentally unhealthy days in past 30 (outcome)	`menthlth_days`
_BMI5	Body mass index × 100 (divide by 100)	`bmi`
ADDEPEV3	Ever told depressive disorder (1=Yes, 2 = No)	`depression`
_INCOMG1	Annual household income (7 levels)	`income`
_AGEG5YR	Age group in 5-year categories (13 levels)	`age_group`
EXERANY2	Any exercise in past 30 days (1 = Yes, 2 = No)	`exercise`
SEXVAR	Sex (1 = Male, 2 = Female)	`sex`
_SMOKER3	Smoking status (4 levels)	`smoking`

0.2.2 Recode variables

df <- df |>
  dplyr::select(MENTHLTH, PHYSHLTH, `_BMI5`, ADDEPEV3,`_INCOMG1`, `_AGEG5YR`, EXERANY2,   SEXVAR, `_SMOKER3`) |>
  mutate(
  ##Recode PHYSHLTH to physhlth_days
  physhlth_days =   case_when(
                      PHYSHLTH == 88 ~ 0,
                      PHYSHLTH == 99 ~ NA,
                      PHYSHLTH == 77 ~ NA,
                      TRUE ~ PHYSHLTH), 
  ##Recode MENTHLTH to menthlth_days
  menthlth_days = case_when(
                      MENTHLTH == 88 ~ 0,
                      MENTHLTH == 99 ~ NA,
                      MENTHLTH == 77 ~ NA,
                      TRUE ~ MENTHLTH), 
  ##Recode __BMI5 to bmi - divide by 100
  bmi = case_when(`_BMI5` == 9999 ~ NA,
                  TRUE ~ `_BMI5` /100),
  ## Recode ADDEPEV3 as factor
depression = factor(
  ifelse(ADDEPEV3 %in% c(7, 9), NA, ADDEPEV3),
  levels = c(1, 2),
  labels = c("Yes", "No")
),
  
## Recode age group - Collapse
 age_group = factor(
  case_when(
   `_AGEG5YR`== 14 ~ NA,
   `_AGEG5YR` %in% c(1,2,3) ~ "18-34",
   `_AGEG5YR` %in% c(4,5,6) ~ "35-49",
   `_AGEG5YR` %in% c(7,8,9) ~ "50-64",
   `_AGEG5YR` %in% c(10,11,12,13) ~ "65+",
  ),
  levels = c(
    "18-34",
    "35-49",
    "50-64",
    "65+"
  )
),

## Recode Income - Collapse
 income = factor(
  case_when(
   `_INCOMG1`== 9 ~ NA,
   `_INCOMG1` %in% c(1,2) ~ "Less than $25K",
   `_INCOMG1` %in% c(3,4) ~ "$25K - $49K",
   `_INCOMG1` %in% c(5,6) ~ "$50K - $99K",
   `_INCOMG1` == 7 ~ "$100K+",
  ),
  levels = c(
    "Less than $25K",
    "$25K - $49K",
    "$50K - $99K",
    "$100K+"
  )
),

## Recode exerany2 as factor
exercise = factor(
  ifelse(EXERANY2 %in% c(7, 9), NA, EXERANY2),
  levels = c(1, 2),
  labels = c("Yes", "No")
),
## Recode sex as factor
  sex = factor(
        ifelse(SEXVAR %in% c(7, 9), NA, SEXVAR),
        levels = c(1, 2),
        labels = c("Male", "Female")
),


## Recode Income - Collapse
 smoking = factor(
  case_when(
   `_SMOKER3`== 9 ~ NA,
   `_SMOKER3` %in% c(1,2) ~ "Current smoker",
   `_SMOKER3` %in% c(3) ~ "Former smoker",
   `_SMOKER3` %in% c(4) ~ "Never smoked"
  ),
  levels = c(
    "Never smoked",
    "Former smoker",
    "Current smoker"
  )
)) |>
  
## Adjust reference categories - could've been done earlier if I'd read further in the instructionslol
  mutate(
    ##  Binary - no as reference
    exercise = fct_relevel(exercise, "No"),
    depression = fct_relevel(depression, "No"),

    ## Smoking - Never smoked
    smoking = fct_relevel(smoking, "Never smoked"),
  )


#QA - Print top 3 rows, output for new var
(df[1:3, (ncol(df)-9):ncol(df)]) %>% 
  kable(caption = "3 row sample of recoded variables"
)

3 row sample of recoded variables
_SMOKER3	physhlth_days	menthlth_days	bmi	depression	age_group	income	exercise	sex	smoking
4	0	0	30.47	No	65+	NA	No	Female	Never smoked
4	0	0	28.56	Yes	65+	NA	Yes	Female	Never smoked
3	6	2	22.31	No	65+	Less than $25K	Yes	Female	Former smoker

0.3 Analytical Sample

After recoding, create your analytic dataset by removing observations with missing values on all nine analysis variables, then drawing a random sample of 8,000 observations. Report the number and percentage of missing observations for menthlth_days and at least two other key variables, before removing any missing values

Using set.seed(1220) ensures that your results are reproducible and comparable across submissions (this is critical)

0.3.1 - Helper Function

This helper function was written for my final project. Adjusting it so it looks at everything in an input list, rather than just by prefix.

missingness <- function(w_data, colsearch){
w_data %>%
  summarize(
    across(all_of(colsearch),
        ~sum(is.na(.x)), .names = "{.col}")) %>%
  pivot_longer(cols = everything(), 
               names_to = "variable", 
               values_to = "nmissing") %>%
  mutate(
      run = unique(w_data$v_year),
      n = nrow(w_data),
      pct_missing = nmissing / n * 100
    )
}

0.3.2 - Missing by Variable

Call helper function on the list

#list of columns to search in
search <- c("physhlth_days", "menthlth_days", "bmi", "depression", "income", "age_group", "sex", "exercise", "smoking")

#
a_df <- df

missing <- missingness(a_df, search)

missing %>% kable()

variable	nmissing	n	pct_missing
physhlth_days	10785	433323	2.4889055
menthlth_days	8108	433323	1.8711215
bmi	40535	433323	9.3544538
depression	2587	433323	0.5970142
income	86623	433323	19.9903998
age_group	7779	433323	1.7951967
sex	0	433323	0.0000000
exercise	1251	433323	0.2886992
smoking	23062	433323	5.3221269

0.3.3 - Complete case exclusion and slice by sample

Call helper function on the list

## Drop if missing on any values in this list.
a_df <- a_df |>
 drop_na(search)

n_end <- nrow(a_df)

set.seed(1220)

a_df <- a_df |> slice_sample(n=8000)

The starting dataset had 433323 observations, and with exclusions of any variables for complete case analysis, the dataset had 310071 observations. A sample of 8000 observations will be used for the analysis.

0.4 Descriptive statistics table

table1 <- a_df %>%
  tbl_summary(
    include = search,
    statistic = all_continuous() ~ "{mean} ({sd})",
    label= list(physhlth_days ~ "Physically unhealthy days in past 30 days",
      menthlth_days ~ "Mentally unhealthy days in past 30 days",
                bmi ~ "Body mass index",
         depression ~ "Ever told have depressive disorder",
             income ~ "Anual household income",
          age_group ~ "Age (years)",
          exercise  ~ "Exercise in past 50 days",
                sex ~ "Assigned sex at birth",
            smoking ~ "Smoking status"
                ))|>
  modify_caption(caption= "Table 1. Characteristics of sample of 2023 BRFSS particpants (n=8000)")

as_flex_table(table1)

Table 1. Characteristics of sample of 2023 BRFSS particpants (n=8000)
Characteristic	N = 8,0001
Physically unhealthy days in past 30 days	4 (9)
Mentally unhealthy days in past 30 days	5 (8)
Body mass index	29 (7)
Ever told have depressive disorder	1,677 (21%)
Anual household income
Less than $25K	1,134 (14%)
$25K - $49K	1,838 (23%)
$50K - $99K	4,398 (55%)
$100K+	630 (7.9%)
Age (years)
18-34	1,288 (16%)
35-49	1,651 (21%)
50-64	2,169 (27%)
65+	2,892 (36%)
Assigned sex at birth
Male	4,050 (51%)
Female	3,950 (49%)
Exercise in past 50 days	6,173 (77%)
Smoking status
Never smoked	4,795 (60%)
Former smoker	2,280 (29%)
Current smoker	925 (12%)
1Mean (SD); n (%)

Part 1. Model Selection for Linear Regression

1.1 Fit the maximum model

mod_max <- lm(physhlth_days ~  menthlth_days + bmi + depression + income + 
                age_group + exercise + sex + smoking, data = a_df)

tidy(mod_max, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Maximum Model: All Candidate Predictors",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Maximum Model: All Candidate Predictors
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	3.5446	0.5684	6.2359	0.0000	2.4304	4.6589
menthlth_days	0.2761	0.0120	22.9780	0.0000	0.2526	0.2997
bmi	0.0743	0.0140	5.3174	0.0000	0.0469	0.1017
depressionYes	1.3497	0.2452	5.5038	0.0000	0.8690	1.8304
income$25K - $49K	-1.9927	0.3023	-6.5908	0.0000	-2.5854	-1.4000
income$50K - $99K	-3.0688	0.2744	-11.1847	0.0000	-3.6066	-2.5310
income$100K+	-3.3754	0.4112	-8.2091	0.0000	-4.1814	-2.5694
age_group35-49	0.9924	0.3003	3.3051	0.0010	0.4038	1.5810
age_group50-64	2.4057	0.2863	8.4014	0.0000	1.8444	2.9671
age_group65+	2.8829	0.2778	10.3757	0.0000	2.3382	3.4275
exerciseYes	-3.2309	0.2236	-14.4527	0.0000	-3.6691	-2.7927
sexFemale	-0.1406	0.1814	-0.7749	0.4384	-0.4962	0.2151
smokingFormer smoker	0.7184	0.2076	3.4608	0.0005	0.3115	1.1252
smokingCurrent smoker	1.1817	0.2962	3.9896	0.0001	0.6011	1.7622

1.2 Display the model summary

summary_modmax <- glance(mod_max)
glance(mod_max) |>
  dplyr::select(r.squared, adj.r.squared, sigma, AIC, BIC, df.residual) |>
  mutate(across(everything(), \(x) round(x, 3))) |>
  kable(caption = "Maximum Model: Fit Statistics") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Maximum Model: Fit Statistics
r.squared	adj.r.squared	sigma	AIC	BIC	df.residual
0.188	0.187	7.93	55849.88	55954.69	7986

The maximum model’s $R^2$ is 0.188 and Adjusted $R^2$ is 0.187, indicating that the model explains approximately 18.8% of the variance in physically unhealthy days. Because $R^2$ will always increase with additional predictors added to the model, both AIC and BIC penalize complexity more than Adjusted $R^2$, as a way to balance both goodness of fit and complexity.

1.3 Best Subsets Regression

##get nvmax

summary(mod_max)$df[1]

## [1] 14

best_subsets <- regsubsets(physhlth_days ~  menthlth_days + bmi + depression + income + 
                age_group + exercise + sex + smoking,
  data = a_df,
  nvmax = 14,      # maximum number of variables to consider
  method = "exhaustive"
)

best_summary <- summary(best_subsets)

1.4 Summary Plot of Adjusted R2 and BIC by Model Size

subset_metrics <- tibble(
  p = 1:length(best_summary$adjr2),
  `Adj. R²` = best_summary$adjr2,
  BIC = best_summary$bic,
  Cp = best_summary$cp
)

p1 <- ggplot(subset_metrics, aes(x = p, y = `Adj. R²`)) +
  geom_line(linewidth = 1, color = "steelblue") +
  geom_point(size = 3, color = "steelblue") +
  geom_vline(xintercept = which.max(best_summary$adjr2),
             linetype = "dashed", color = "tomato") +
  labs(title = "Adjusted R² by Model Size", x = "Number of Variables", y = "Adjusted R²") +
  theme_minimal(base_size = 12)

p2 <- ggplot(subset_metrics, aes(x = p, y = BIC)) +
  geom_line(linewidth = 1, color = "steelblue") +
  geom_point(size = 3, color = "steelblue") +
  geom_vline(xintercept = which.min(best_summary$bic),
             linetype = "dashed", color = "tomato") +
  labs(title = "BIC by Model Size", x = "Number of Variables", y = "BIC") +
  theme_minimal(base_size = 12)

gridExtra::grid.arrange(p1, p2, ncol = 2)

Best Subsets: Adjusted R² and BIC by Model Size

adjr_max <- which.max(best_summary[["adjr2"]])
bic_min <- which.min(best_summary[["bic"]])

The model that maximizes Adjusted $R^2$ has 13 parameters, and the model that minimizes BIC has 13 parameters (including the intercept).

Note: In addition to including the intercept, the above counts reflects all parameters that are included in the model, including levels of factors that would normally be counted as a single predictor.

In this case, the two criteria do agree on the best model size - each output similar parameter numbers, at 12, and highlight the same model. BIC would typically favor a simpler model, as it penalizes additional parameters more than Adjusted $R^2$.

1.4 Stepwise Selection Methods

1.4.1 Backward Elimination

# Automated backward elimination using AIC
mod_backward <- step(mod_max, direction = "backward", trace = 1)

## Start:  AIC=33144.86
## physhlth_days ~ menthlth_days + bmi + depression + income + age_group + 
##     exercise + sex + smoking
## 
##                 Df Sum of Sq    RSS   AIC
## - sex            1        38 502263 33143
## <none>                       502226 33145
## - smoking        2      1420 503646 33163
## - bmi            1      1778 504004 33171
## - depression     1      1905 504131 33173
## - income         3      8458 510684 33272
## - age_group      3      8554 510779 33274
## - exercise       1     13136 515362 33349
## - menthlth_days  1     33204 535430 33655
## 
## Step:  AIC=33143.46
## physhlth_days ~ menthlth_days + bmi + depression + income + age_group + 
##     exercise + smoking
## 
##                 Df Sum of Sq    RSS   AIC
## <none>                       502263 33143
## - smoking        2      1492 503755 33163
## - bmi            1      1798 504062 33170
## - depression     1      1868 504132 33171
## - income         3      8420 510684 33270
## - age_group      3      8516 510779 33272
## - exercise       1     13104 515367 33348
## - menthlth_days  1     33167 535430 33653

tidy(mod_backward, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Backward Elimination Result (AIC-based)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Backward Elimination Result (AIC-based)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	3.4662	0.5593	6.1971	0.0000	2.3698	4.5626
menthlth_days	0.2758	0.0120	22.9655	0.0000	0.2523	0.2994
bmi	0.0747	0.0140	5.3476	0.0000	0.0473	0.1021
depressionYes	1.3278	0.2436	5.4509	0.0000	0.8503	1.8052
income$25K - $49K	-1.9909	0.3023	-6.5853	0.0000	-2.5836	-1.3983
income$50K - $99K	-3.0593	0.2741	-11.1615	0.0000	-3.5965	-2.5220
income$100K+	-3.3533	0.4102	-8.1752	0.0000	-4.1574	-2.5493
age_group35-49	0.9756	0.2995	3.2577	0.0011	0.3885	1.5626
age_group50-64	2.3878	0.2854	8.3663	0.0000	1.8283	2.9473
age_group65+	2.8659	0.2770	10.3471	0.0000	2.3230	3.4089
exerciseYes	-3.2252	0.2234	-14.4353	0.0000	-3.6631	-2.7872
smokingFormer smoker	0.7359	0.2063	3.5667	0.0004	0.3315	1.1404
smokingCurrent smoker	1.1982	0.2954	4.0559	0.0001	0.6191	1.7772

Backward elimination removed sex from the model and retained all other variables.

1.4.2 Forward Selection

# Automated forward selection using AIC
mod_null <- lm(physhlth_days ~ 1, data = a_df)

mod_forward <- step(mod_null,
                    scope = list(lower = mod_null, upper = mod_max),
                    direction = "forward", trace = 1)

## Start:  AIC=34785.58
## physhlth_days ~ 1
## 
##                 Df Sum of Sq    RSS   AIC
## + menthlth_days  1     62274 556283 33939
## + exercise       1     38144 580413 34278
## + income         3     30840 587717 34382
## + depression     1     25569 592988 34450
## + smoking        2     11940 606617 34634
## + bmi            1      8532 610025 34676
## + age_group      3      6839 611718 34703
## + sex            1      1128 617429 34773
## <none>                       618557 34786
## 
## Step:  AIC=33938.69
## physhlth_days ~ menthlth_days
## 
##              Df Sum of Sq    RSS   AIC
## + exercise    1   27643.6 528639 33533
## + income      3   19576.0 536707 33658
## + age_group   3   15837.8 540445 33714
## + smoking     2    6562.1 549721 33848
## + bmi         1    5267.5 551016 33865
## + depression  1    3240.2 553043 33894
## + sex         1     168.4 556115 33938
## <none>                    556283 33939
## 
## Step:  AIC=33532.92
## physhlth_days ~ menthlth_days + exercise
## 
##              Df Sum of Sq    RSS   AIC
## + income      3   11314.2 517325 33366
## + age_group   3   10584.9 518054 33377
## + smoking     2    4154.0 524485 33474
## + depression  1    2683.3 525956 33494
## + bmi         1    2249.6 526390 33501
## <none>                    528639 33533
## + sex         1      22.7 528617 33535
## 
## Step:  AIC=33365.84
## physhlth_days ~ menthlth_days + exercise + income
## 
##              Df Sum of Sq    RSS   AIC
## + age_group   3    9534.8 507790 33223
## + smoking     2    2756.9 514568 33327
## + depression  1    2143.4 515182 33335
## + bmi         1    2123.3 515202 33335
## <none>                    517325 33366
## + sex         1       2.0 517323 33368
## 
## Step:  AIC=33223.02
## physhlth_days ~ menthlth_days + exercise + income + age_group
## 
##              Df Sum of Sq    RSS   AIC
## + depression  1   2336.90 505454 33188
## + bmi         1   1944.56 505846 33194
## + smoking     2   1614.19 506176 33202
## <none>                    507790 33223
## + sex         1     43.68 507747 33224
## 
## Step:  AIC=33188.12
## physhlth_days ~ menthlth_days + exercise + income + age_group + 
##     depression
## 
##           Df Sum of Sq    RSS   AIC
## + bmi      1   1698.57 503755 33163
## + smoking  2   1391.86 504062 33170
## + sex      1    139.23 505314 33188
## <none>                 505454 33188
## 
## Step:  AIC=33163.19
## physhlth_days ~ menthlth_days + exercise + income + age_group + 
##     depression + bmi
## 
##           Df Sum of Sq    RSS   AIC
## + smoking  2   1491.58 502263 33143
## <none>                 503755 33163
## + sex      1    108.85 503646 33163
## 
## Step:  AIC=33143.46
## physhlth_days ~ menthlth_days + exercise + income + age_group + 
##     depression + bmi + smoking
## 
##        Df Sum of Sq    RSS   AIC
## <none>              502263 33143
## + sex   1    37.758 502226 33145

tidy(mod_forward, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Forward Selection Result (AIC-based)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Forward Selection Result (AIC-based)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	3.4662	0.5593	6.1971	0.0000	2.3698	4.5626
menthlth_days	0.2758	0.0120	22.9655	0.0000	0.2523	0.2994
exerciseYes	-3.2252	0.2234	-14.4353	0.0000	-3.6631	-2.7872
income$25K - $49K	-1.9909	0.3023	-6.5853	0.0000	-2.5836	-1.3983
income$50K - $99K	-3.0593	0.2741	-11.1615	0.0000	-3.5965	-2.5220
income$100K+	-3.3533	0.4102	-8.1752	0.0000	-4.1574	-2.5493
age_group35-49	0.9756	0.2995	3.2577	0.0011	0.3885	1.5626
age_group50-64	2.3878	0.2854	8.3663	0.0000	1.8283	2.9473
age_group65+	2.8659	0.2770	10.3471	0.0000	2.3230	3.4089
depressionYes	1.3278	0.2436	5.4509	0.0000	0.8503	1.8052
bmi	0.0747	0.0140	5.3476	0.0000	0.0473	0.1021
smokingFormer smoker	0.7359	0.2063	3.5667	0.0004	0.3315	1.1404
smokingCurrent smoker	1.1982	0.2954	4.0559	0.0001	0.6191	1.7772

Forward Selection has added everything except for sex - this means it has converged on the same outcome as Backward Selection.

1.4.3 Stepwise Selection

mod_stepwise <- step(mod_null,
                     scope = list(lower = mod_null, upper = mod_max))

## Start:  AIC=34785.58
## physhlth_days ~ 1
## 
##                 Df Sum of Sq    RSS   AIC
## + menthlth_days  1     62274 556283 33939
## + exercise       1     38144 580413 34278
## + income         3     30840 587717 34382
## + depression     1     25569 592988 34450
## + smoking        2     11940 606617 34634
## + bmi            1      8532 610025 34676
## + age_group      3      6839 611718 34703
## + sex            1      1128 617429 34773
## <none>                       618557 34786
## 
## Step:  AIC=33938.69
## physhlth_days ~ menthlth_days
## 
##                 Df Sum of Sq    RSS   AIC
## + exercise       1     27644 528639 33533
## + income         3     19576 536707 33658
## + age_group      3     15838 540445 33714
## + smoking        2      6562 549721 33848
## + bmi            1      5267 551016 33865
## + depression     1      3240 553043 33894
## + sex            1       168 556115 33938
## <none>                       556283 33939
## - menthlth_days  1     62274 618557 34786
## 
## Step:  AIC=33532.92
## physhlth_days ~ menthlth_days + exercise
## 
##                 Df Sum of Sq    RSS   AIC
## + income         3     11314 517325 33366
## + age_group      3     10585 518054 33377
## + smoking        2      4154 524485 33474
## + depression     1      2683 525956 33494
## + bmi            1      2250 526390 33501
## <none>                       528639 33533
## + sex            1        23 528617 33535
## - exercise       1     27644 556283 33939
## - menthlth_days  1     51773 580413 34278
## 
## Step:  AIC=33365.84
## physhlth_days ~ menthlth_days + exercise + income
## 
##                 Df Sum of Sq    RSS   AIC
## + age_group      3      9535 507790 33223
## + smoking        2      2757 514568 33327
## + depression     1      2143 515182 33335
## + bmi            1      2123 515202 33335
## <none>                       517325 33366
## + sex            1         2 517323 33368
## - income         3     11314 528639 33533
## - exercise       1     19382 536707 33658
## - menthlth_days  1     45040 562365 34032
## 
## Step:  AIC=33223.02
## physhlth_days ~ menthlth_days + exercise + income + age_group
## 
##                 Df Sum of Sq    RSS   AIC
## + depression     1      2337 505454 33188
## + bmi            1      1945 505846 33194
## + smoking        2      1614 506176 33202
## <none>                       507790 33223
## + sex            1        44 507747 33224
## - age_group      3      9535 517325 33366
## - income         3     10264 518054 33377
## - exercise       1     15763 523553 33466
## - menthlth_days  1     51103 558894 33988
## 
## Step:  AIC=33188.12
## physhlth_days ~ menthlth_days + exercise + income + age_group + 
##     depression
## 
##                 Df Sum of Sq    RSS   AIC
## + bmi            1      1699 503755 33163
## + smoking        2      1392 504062 33170
## + sex            1       139 505314 33188
## <none>                       505454 33188
## - depression     1      2337 507790 33223
## - income         3      9698 515151 33334
## - age_group      3      9728 515182 33335
## - exercise       1     15494 520947 33428
## - menthlth_days  1     34670 540124 33717
## 
## Step:  AIC=33163.19
## physhlth_days ~ menthlth_days + exercise + income + age_group + 
##     depression + bmi
## 
##                 Df Sum of Sq    RSS   AIC
## + smoking        2      1492 502263 33143
## <none>                       503755 33163
## + sex            1       109 503646 33163
## - bmi            1      1699 505454 33188
## - depression     1      2091 505846 33194
## - age_group      3      9562 513317 33308
## - income         3      9569 513324 33308
## - exercise       1     13747 517502 33377
## - menthlth_days  1     34268 538023 33688
## 
## Step:  AIC=33143.46
## physhlth_days ~ menthlth_days + exercise + income + age_group + 
##     depression + bmi + smoking
## 
##                 Df Sum of Sq    RSS   AIC
## <none>                       502263 33143
## + sex            1        38 502226 33145
## - smoking        2      1492 503755 33163
## - bmi            1      1798 504062 33170
## - depression     1      1868 504132 33171
## - income         3      8420 510684 33270
## - age_group      3      8516 510779 33272
## - exercise       1     13104 515367 33348
## - menthlth_days  1     33167 535430 33653

tidy(mod_stepwise, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Stepwise Selection Result (AIC-based)",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Stepwise Selection Result (AIC-based)
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	3.4662	0.5593	6.1971	0.0000	2.3698	4.5626
menthlth_days	0.2758	0.0120	22.9655	0.0000	0.2523	0.2994
exerciseYes	-3.2252	0.2234	-14.4353	0.0000	-3.6631	-2.7872
income$25K - $49K	-1.9909	0.3023	-6.5853	0.0000	-2.5836	-1.3983
income$50K - $99K	-3.0593	0.2741	-11.1615	0.0000	-3.5965	-2.5220
income$100K+	-3.3533	0.4102	-8.1752	0.0000	-4.1574	-2.5493
age_group35-49	0.9756	0.2995	3.2577	0.0011	0.3885	1.5626
age_group50-64	2.3878	0.2854	8.3663	0.0000	1.8283	2.9473
age_group65+	2.8659	0.2770	10.3471	0.0000	2.3230	3.4089
depressionYes	1.3278	0.2436	5.4509	0.0000	0.8503	1.8052
bmi	0.0747	0.0140	5.3476	0.0000	0.0473	0.1021
smokingFormer smoker	0.7359	0.2063	3.5667	0.0004	0.3315	1.1404
smokingCurrent smoker	1.1982	0.2954	4.0559	0.0001	0.6191	1.7772

Stepwise selection, which can both add and remove variables, converged on the same model that the forward selection and backward elimination methods of variable selection also landed on. These models all incorporate mentally unhealthy days, exercise, income, age, depression, BMI, and smoking as predictors. They likewise also all remove sex. It makes sense for these models to converge, as they are optimizing based off of the same metric - AIC.

1.5 Final Model Selection and Interpretation

The final model that I have chosen is the same model that all of the different stepwise selection methods have converged on - a model that includes exercise, income, age, depression, bmi, and smoking, and excludes sex. While these automated methods do not identify the best model, they removed sex for a reason, and that is because it explains very little of the variance in Physically Unhealthy Days, and is not statistically significant in the full model.

# Automated backward elimination using AIC
mod_final <- mod_stepwise

tidy(mod_final, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Final Model",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Final Model
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	3.4662	0.5593	6.1971	0.0000	2.3698	4.5626
menthlth_days	0.2758	0.0120	22.9655	0.0000	0.2523	0.2994
exerciseYes	-3.2252	0.2234	-14.4353	0.0000	-3.6631	-2.7872
income$25K - $49K	-1.9909	0.3023	-6.5853	0.0000	-2.5836	-1.3983
income$50K - $99K	-3.0593	0.2741	-11.1615	0.0000	-3.5965	-2.5220
income$100K+	-3.3533	0.4102	-8.1752	0.0000	-4.1574	-2.5493
age_group35-49	0.9756	0.2995	3.2577	0.0011	0.3885	1.5626
age_group50-64	2.3878	0.2854	8.3663	0.0000	1.8283	2.9473
age_group65+	2.8659	0.2770	10.3471	0.0000	2.3230	3.4089
depressionYes	1.3278	0.2436	5.4509	0.0000	0.8503	1.8052
bmi	0.0747	0.0140	5.3476	0.0000	0.0473	0.1021
smokingFormer smoker	0.7359	0.2063	3.5667	0.0004	0.3315	1.1404
smokingCurrent smoker	1.1982	0.2954	4.0559	0.0001	0.6191	1.7772

This model indicates that for every additional day of a participant reporting poor mental health, holding all variables constant, they will report 0.276 additional physically unhealthy days.
This model predicts that exercise is a protective factor, and that participants who do report exercising will report 3.22 fewer physically unhealthy days, holding all other variables constant.
Higher income levels are associated with fewer physically unhealthy days. Holding all other variables constant, this model predicts that having an income of greater than $100,000 a year is a protective factor, resulting in 3.35 fewer days reporting poor physical health.

1.6 LINE Assumption Check

par(mfrow = c(2, 2))
plot(mod_final)

par(mfrow = c(1, 1))

This is model is not great.

1.6.1 Residuals vs. Fitted

The residuals vs. fitted plot is used to look for both non-linearity and heteroscedasticity at once. The loess line on this diagnostic plot curves downwards in the center, and does not remain flat at zero - this indicates that the linear model does not fully capture the relationship - a transformation of the outcome or polynomial terms may be better suited for this model. Because the outcome is bounded between 0-30, it has a clear diagonal pattern, but it is difficult to identify heteroscedasticity from this plot.

1.6.2 Q-Q Plot

The Q-Q plot indicates a severe departure from normality - the points display both an S-curve shape and a wide deviation above normal in the upper tail. This indicates both a right skew, and the potential need for a different model family entirely - it is closer in function to count data (although it is bounded), so poisson or negative binomial may be a better fit. This sees the same issue that the residuals vs. fitted plot highlighted - the outcome of physically unhealthy days is bounded by 0 and 30, and most people are reporting 0 physically unhealthy days. It also deviates from normal at the bottom of the plot, downwards, though less substantially than the upper extremity.

1.6.3 Scale Location Plot

The scale-location plot is diagnostic of the same issues that the other plots have made clear, the bounded point data is not a good fit for a linear model. The loess line slopes upward from left to right, so the spread of residuals does increase systematically with fitted values, which may be an indication of heteroscedasticity. It does, however, also deviate substantially on the left, indicating that the bounding of the data causes an issue with fit on both sides, and may be more of a matter of poor model family fit or need for transformation than heteroscedasticity.

1.6.4 Residuals vs. Leverage

The leverage plot has called out a few high-leverage observations, but nothing has been placed outside of Cook’s distance, indicating that there are not obvious outliers with high leverage.

1.6.5 Summary of LINE assumption diagnostics

The major violation for this model is that the model is not the correct family for the outcome - because it is bounded and count-like, and all of the diagnostic plots reflect LINE assumption violations that are downstream of that issue. This results in violations within many of the LINE assumptions. It appears that the relationship may not be linear, per residuals vs. fitted, and with many loess-line deviations in the residual diagnostic plots. The Q-Q plot indicates that the residual errors do not follow a normal distribution, and all of the tests for homoscedasticity indicate that there may be deviation but it is less conclusive than the general poor model family fit.

Part 2. Logistic Regression

2.1 - Create Binary Variable for Logistic Outcome

a_df <- a_df |> mutate(
  frequent_distress = factor(
      ifelse(physhlth_days >= 14, 1, 0),
  levels = c(0, 1),
  labels = c( "No", "Yes")
))


## i know how to do this I've done it a million times i'm just tired
## this is the worst I've ever done it

a_df |> mutate(
      total = n()) |>
  group_by(frequent_distress, total) |>
  summarize(count = n()) |>
  mutate(pct = 100* (count/total))|>
  kable()

frequent_distress	total	count	pct
No	8000	6910	86.375
Yes	8000	1090	13.625

Most participants in this survey do not report frequent physical distress, with a prevalence of only about 13.6%.

2.2 - Unadjusted Logistic Regression

mod_simple <- glm(frequent_distress ~ age_group,
                  data = a_df, family=binomial)

summary(mod_simple)

## 
## Call:
## glm(formula = frequent_distress ~ age_group, family = binomial, 
##     data = a_df)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -2.5649     0.1082 -23.707  < 2e-16 ***
## age_group35-49   0.4580     0.1341   3.416 0.000635 ***
## age_group50-64   0.9270     0.1228   7.547 4.46e-14 ***
## age_group65+     0.9151     0.1194   7.662 1.82e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6369.6  on 7999  degrees of freedom
## Residual deviance: 6276.9  on 7996  degrees of freedom
## AIC: 6284.9
## 
## Number of Fisher Scoring iterations: 5

I have picked age as the primary exposure for the unadjusted model, because aging brings physical challenges that may heighten vulnerability to injury and pain.

tidy(mod_simple, conf.int = TRUE, exponentiate = TRUE) |>
  kable(digits = 3,
        caption = "Simple Logistic Regression: FPD ~ Age Group") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Simple Logistic Regression: FPD ~ Age Group
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	0.077	0.108	-23.707	0.000	0.062	0.095
age_group35-49	1.581	0.134	3.416	0.001	1.219	2.063
age_group50-64	2.527	0.123	7.547	0.000	1.995	3.231
age_group65+	2.497	0.119	7.662	0.000	1.986	3.173

Compared to the reference group of people in the age group 18-34, people in the age group 35-49 have 1.6 times the odds of frequent physical distress, 95% CI 1.2 to 2.1.

Compared to the reference group of 18-34, people in the age group of 50-64 have 2.5x the odds of reporting frequent physical distress, 95% CI 2.0 - 3.2.

Compared to the reference group of 18-34, people in the age group 65+ have 2.5 x the odds of reporting frequent physical distress, 95% CI 1.9 - 3.2.

The association between age group and frequent physical distress is statistically significant at alpha = 0.05, this is evident because for all groups in comparison to the reference group of 18-34, all 95% confidence intervals are greater than 1, not crossing 1.

2.3 - Multiple Logistic Regression

mod_logistic <- glm(frequent_distress ~  menthlth_days + bmi + depression + income + age_group + exercise + smoking,
                    data = a_df,
                    family=binomial)

tidy(mod_logistic, conf.int = TRUE, exponentiate=TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 3))) |>
  kable(
    caption = "Maximum Model: All Candidate Predictors",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Maximum Model: All Candidate Predictors
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	0.058	0.219	-12.988	0.000	0.038	0.089
menthlth_days	1.066	0.004	16.630	0.000	1.058	1.074
bmi	1.024	0.005	4.712	0.000	1.014	1.034
depressionYes	1.366	0.088	3.558	0.000	1.149	1.620
income$25K - $49K	0.667	0.100	-4.057	0.000	0.549	0.811
income$50K - $99K	0.417	0.094	-9.324	0.000	0.347	0.502
income$100K+	0.317	0.199	-5.781	0.000	0.212	0.462
age_group35-49	1.639	0.144	3.421	0.001	1.238	2.183
age_group50-64	2.692	0.134	7.380	0.000	2.078	3.518
age_group65+	3.079	0.132	8.501	0.000	2.386	4.010
exerciseYes	0.442	0.077	-10.659	0.000	0.380	0.514
smokingFormer smoker	1.309	0.082	3.300	0.001	1.115	1.536
smokingCurrent smoker	1.494	0.105	3.834	0.000	1.215	1.832

Holding all other variables constant, for each additional day reporting poor mental health, the odds of reporting frequent physical distress are multiplied 1.066x.
Holding all other variables constant, people reporting a previous diagnosis of a depressive disorder have 1.366 times the odds of reporting frequent physical distress.
Income is a protective factor - compared to people with an income below $25k, people reporting an income of $25k - 49k have 0.66 times the odds of reporting frequent physical distress, holding all other variables constant. Compared to the same group and holding all other variables constant, people with an income of $50k to $99k have 0.42 times the odds of experiencing frequent physical distress. And people reporting an income of $100k+ have 0.32 times the odds of reporting frequent physical distress compared to people with an income below $25k, holding all other variables constant.

2.4 - Likelihood Ratio Test

I am testing the effect that the Smoking variable has on the model, first by fitting a reduced model that does not have this group, and then testing both the full and reduced model with the Likelihood Ratio Test.

mod_reduced <- glm(
  frequent_distress ~  menthlth_days + bmi + depression + income + age_group + exercise, 
  data = a_df,
  family = binomial
)

anova(mod_reduced, mod_logistic, test = "LRT") |>
  kable(digits = 3,
        caption = "LR Test: Does adding smoking status improve the model?") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

LR Test: Does adding smoking status improve the model?
Resid. Df	Resid. Dev	Df	Deviance	Pr(>Chi)
7989	5300.187	NA	NA	NA
7987	5280.669	2	19.518	0

The $H_0$ states that the predictors that were dropped for the reduced model do not have a statistically singificant effect on the model, and the alternate hypothesis states that they do. The LR test gives the statistic on 2 degrees of freedom. The deviance is 19.5, and the P-value is below 0.001, meaning that at an alpha of 0.05, we can reject the null hypothesis. This indicates that smoking does significantly improve the model, so this variable will be retained.

2.5 - ROC Curve and AUC

roc_obj <- roc(
  response = a_df$frequent_distress,
  predictor = fitted(mod_logistic),
  levels = c("No", "Yes"),
  direction = "<"
)

auc_value <- auc(roc_obj)

ci.auc(roc_obj)

## 95% CI: 0.7688-0.7991 (DeLong)

ggroc(roc_obj, color = "cyan", linewidth = 1.2) +
  geom_abline(slope = 1, intercept = 1, linetype = "dashed", color = "coral") +
  labs(title = "ROC Curve for Frequent Physical Distress Model",
       subtitle = paste0("AUC = ", round(auc_value, 3)),
       x = "Specificity", y = "Sensitivity") +
  theme_classic()

The model has an AUC of .784, and a CI of 0.7688 - 0.7991, indicating that it has acceptable discrimination, distinguishing between individuals with and without frequent physical distress well.

2.6 - Hosmer-Lemeshow Goodness-of-Fit Test

hl_test <- hoslem.test(
  x = as.numeric(a_df$frequent_distress) - 1,
  y = fitted(mod_logistic),
  g = 10
)

hl_test

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  as.numeric(a_df$frequent_distress) - 1, fitted(mod_logistic)
## X-squared = 6.6248, df = 8, p-value = 0.5776

The null hypothesis states that the model fits the data well, and the alternate hypothesis is that it does. This test has 8 degrees of freedom, an X^2 of 6.624, and the p-value is 0.5776. At an alpha of 0.05, we do not have sufficient evidence to reject the null hypothesis, indicating that there is not evidence of poor model fit.

Both ROC and the Hosmer Lemeshow test assess validity, but Hosmer-Lemeshow is a measure of calibration, while and ROC is a measure of discrimination. Calibration is a measure of how well predicted probabilities match observed rates, while discrimination is the separation of events from non-events, these are complementary measures, and ideally we look to have both high, but models can have high calibration and low discrimination, and vice-versa. In this case, we find strong calibration and acceptable discrimination.

Part 3. Reflection

Our results indicate that physical health burden in the United States is associated with a variety of factors – physical, mental, behavioral, and socioeconomic factors all played a statistically significant role in both models. Both the linear and logistic models supported using the same predictors, with statistical and significance for all of those included in the final model. In both, income level, age group, and exercise status were all the strongest predictors of physical distress.

While this model has added response to the question, “Have you ever been told you have a depressive disorder?” as a predictor, most chronic conditions that may result in days of poor physical health have not been reflected in this analysis – these conditions are likely unmeasured confounders affecting both our outcome and our exposures. Additionally, while some elements of Socioeconomic Status are reflected with our measure of reported household income, this is a proxy for a large number of unmeasured effects, which are likely to further confound the outcome. Further, this analysis is entirely reflective of the results of a cross-sectional survey, where data is taken at a single point in time. This means that there is no way to infer causality, as there is no way to confirm whether an exposure occurred before an outcome.

Linear regression models can tell the magnitude of an effect of a predictor on the outcome, while logistic regressions measure the change in probability of an outcome. Linear regression is useful in being easy to interpret and convey the magnitude of effect, but by assuming that there is an effect uniformly across the predictor and outcome, it can add confusion that lowers interpretability. Logistic regression is more useful when you are looking for classification into categories. Linear regression model selection looks for residuals and numeric differences between observations and the prediction, whereas logistic selection frameworks look for accuracy and precision of the classification.

I did not use AI for this assignment.

Assignment 4: Multiple Linear Regression, Tests of Hypotheses, and Interaction Analysis

EPI 553: Principles of Statistical Inference II

Alex Mossawir

2026-04-24