Setup

library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(car)
library(ggeffects)
brfss_ci <- readRDS("~/R/site-library/Epi553/code/brfss_ci_2020.rds")

Task 1: Exploratory Data Analysis

1a. Scatterplot: physhlth_days vs. age, colored by exercise

ggplot(brfss_ci, aes(x = age, y = physhlth_days, color = exercise)) +
  geom_point(alpha = 0.15, size = 0.8) +
  geom_smooth(method = "lm", se = TRUE, linewidth = 1.2) +
  labs(
    title    = "Physically Unhealthy Days vs. Age by Exercise Status",
    x        = "Age (years)",
    y        = "Physically Unhealthy Days (past 30)",
    color    = "Exercise"
  ) +
  scale_color_brewer(palette = "Set1") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "bottom")

Description: Among non-exercisers (No), physically unhealthy days tend to increase with age — older non-exercisers report more unhealthy days than younger ones. Among exercisers (Yes), the slope is shallower and the predicted values are lower overall, suggesting that exercise is associated with fewer physically unhealthy days across all ages. The two regression lines are not parallel — they appear to diverge slightly at older ages — which hints at a possible interaction between age and exercise (the age effect on physical health may be stronger among those who do not exercise). However, the raw data are highly dispersed due to the zero-inflated nature of the outcome.


1b. Mean physhlth_days by sex and exercise

brfss_ci |>
  group_by(sex, exercise) |>
  summarise(
    n    = n(),
    mean_physhlth = round(mean(physhlth_days, na.rm = TRUE), 2),
    sd   = round(sd(physhlth_days, na.rm = TRUE), 2),
    .groups = "drop"
  ) |>
  kable(
    caption   = "Mean Physically Unhealthy Days by Sex and Exercise Status",
    col.names = c("Sex", "Exercise", "N", "Mean Days", "SD")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Mean Physically Unhealthy Days by Sex and Exercise Status
Sex Exercise N Mean Days SD
Male No 446 6.64 10.98
Male Yes 1845 2.32 6.49
Female No 625 7.43 11.46
Female Yes 2084 2.45 6.32

Interpretation: In both sexes, those who exercise report fewer physically unhealthy days than non-exercisers. The magnitude of the exercise-associated difference appears somewhat larger for females than for males (females who do not exercise report notably more unhealthy days compared to male non-exercisers). This preliminary pattern suggests that the association between exercise and physical health might differ by sex — i.e., sex could be an effect modifier. However, formal testing (Task 3) is needed before drawing conclusions.


1c. Scatterplot: physhlth_days vs. sleep_hrs, faceted by education

ggplot(brfss_ci, aes(x = sleep_hrs, y = physhlth_days)) +
  geom_point(alpha = 0.12, size = 0.7, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "firebrick", linewidth = 1.1) +
  facet_wrap(~ education, nrow = 2) +
  labs(
    title = "Physically Unhealthy Days vs. Sleep Hours by Education Level",
    x     = "Sleep Hours per Night",
    y     = "Physically Unhealthy Days (past 30)"
  ) +
  theme_minimal(base_size = 12)

Comment on slopes: The negative slopes (more sleep → fewer unhealthy days) are visible across all four education panels, but the steepness differs. The “Less than HS” panel appears to have the steepest slope, while the “College graduate” panel appears flatter. This visual non-parallelism suggests potential effect modification by education — the protective association between sleep and physical health may be strongest among those with the least education. However, this must be confirmed with a formal interaction test.


Task 2: Stratified Analysis for Interaction

2a. Separate regression models: physhlth_days ~ age by exercise stratum

# Stratified models
mod_no_ex <- lm(physhlth_days ~ age, data = brfss_ci |> filter(exercise == "No"))
mod_yes_ex <- lm(physhlth_days ~ age, data = brfss_ci |> filter(exercise == "Yes"))

# Extract results
strat_results <- bind_rows(
  tidy(mod_no_ex,  conf.int = TRUE) |> filter(term == "age") |> mutate(Exercise = "No"),
  tidy(mod_yes_ex, conf.int = TRUE) |> filter(term == "age") |> mutate(Exercise = "Yes")
) |>
  select(Exercise, estimate, std.error, conf.low, conf.high) |>
  mutate(across(where(is.numeric), \(x) round(x, 4)))

kable(
  strat_results,
  caption   = "Stratum-Specific Slopes for Age → Physical Health Days",
  col.names = c("Exercise", "Slope (β age)", "SE", "95% CI Lower", "95% CI Upper")
) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Stratum-Specific Slopes for Age → Physical Health Days
Exercise Slope (β age) SE 95% CI Lower 95% CI Upper
No 0.0745 0.0212 0.0328 0.1161
Yes 0.0192 0.0060 0.0075 0.0309

Interpretation: Among non-exercisers, each additional year of age is associated with approximately 0.074 more physically unhealthy days (95% CI: 0.033, 0.116). Among exercisers, the slope for age is 0.019 (95% CI: 0.007, 0.031). Both slopes are positive (older age → more unhealthy days), but the non-exerciser slope is notably larger, suggesting age may have a stronger effect on physical health among those who do not exercise.


2b. Plot of stratum-specific regression lines

ggplot(brfss_ci, aes(x = age, y = physhlth_days, color = exercise)) +
  geom_point(alpha = 0.12, size = 0.8) +
  geom_smooth(method = "lm", se = TRUE, linewidth = 1.2) +
  labs(
    title    = "Regression Lines for Age → Physical Health Days, by Exercise Status",
    subtitle = "Parallel lines would indicate no interaction",
    x        = "Age (years)",
    y        = "Physically Unhealthy Days (past 30)",
    color    = "Exercise"
  ) +
  scale_color_brewer(palette = "Set1") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "bottom")

Are the lines approximately parallel? No — the two lines are not parallel. The line for non-exercisers (No) has a steeper positive slope than for exercisers (Yes). This divergence becomes more pronounced at older ages. Visual non-parallelism suggests that exercise may modify the effect of age on physical health, warranting a formal interaction test.


2c. Can you formally test whether the two slopes differ using only stratified results?

No, you cannot formally test the difference in slopes using only the stratified results. The stratified analysis produces separate point estimates and standard errors for each stratum, but comparing two coefficients from two independent models requires additional machinery. Specifically:

  • To test \(H_0: \beta_{age|No} = \beta_{age|Yes}\), you would need a test statistic based on \(\hat{\beta}_{No} - \hat{\beta}_{Yes}\) and its sampling variance — which requires knowing the covariance between the two estimates.
  • Since the models are fit on separate subsets of data, the estimates are independent, but the standard formula for comparing two independent regression slopes (e.g., \(z = \frac{\hat\beta_1 - \hat\beta_2}{\sqrt{SE_1^2 + SE_2^2}}\)) is only approximate and not equivalent to a likelihood-based F-test.
  • The stratified approach also cannot adjust for additional covariates while simultaneously testing the interaction, reducing its utility.

The proper way to test the interaction is to include an interaction term in a single pooled model, as in Task 3.


Task 3: Interaction via Regression

3a. Interaction model: physhlth_days ~ age * exercise

mod_int_ex <- lm(physhlth_days ~ age * exercise, data = brfss_ci)

tidy(mod_int_ex, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption   = "Interaction Model: age * exercise → Physical Health Days",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Interaction Model: age * exercise → Physical Health Days
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 2.7610 0.8798 3.1381 0.0017 1.0361 4.4858
age 0.0745 0.0145 5.1203 0.0000 0.0460 0.1030
exerciseYes -1.3858 0.9664 -1.4340 0.1516 -3.2804 0.5087
age:exerciseYes -0.0553 0.0162 -3.4072 0.0007 -0.0871 -0.0235
b <- round(coef(mod_int_ex), 4)

Fitted equation:

\[\widehat{\text{Phys Days}} = 2.761 + 0.0745(\text{Age}) + -1.3858(\text{ExerciseYes}) + -0.0553(\text{Age} \times \text{ExerciseYes})\]


3b. Stratum-specific equations

For non-exercisers (ExerciseYes = 0):

\[\widehat{\text{Phys Days}} = 2.761 + 0.0745(\text{Age})\]

For exercisers (ExerciseYes = 1):

\[\widehat{\text{Phys Days}} = (2.761 + -1.3858) + (0.0745 + -0.0553)(\text{Age})\] \[= 1.3752 + 0.0192(\text{Age})\]

Verification against stratified analysis from Task 2:

tribble(
  ~Method, ~Stratum, ~Intercept, ~Slope,
  "Stratified (Task 2)", "Non-exercisers",
    round(coef(mod_no_ex)[1], 4), round(coef(mod_no_ex)[2], 4),
  "Stratified (Task 2)", "Exercisers",
    round(coef(mod_yes_ex)[1], 4), round(coef(mod_yes_ex)[2], 4),
  "Interaction model", "Non-exercisers",
    b[1], b[2],
  "Interaction model", "Exercisers",
    b[1] + b[3], b[2] + b[4]
) |>
  kable(caption = "Verification: Stratified = Interaction Model Stratum-Specific Equations") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Verification: Stratified = Interaction Model Stratum-Specific Equations
Method Stratum Intercept Slope
Stratified (Task 2) Non-exercisers 2.7610 0.0745
Stratified (Task 2) Exercisers 1.3752 0.0192
Interaction model Non-exercisers 2.7610 0.0745
Interaction model Exercisers 1.3752 0.0192

The intercepts and slopes match (up to rounding), confirming that the interaction model perfectly recovers the stratified results.


3c. t-test for the interaction term age:exerciseYes

int_term_ex <- tidy(mod_int_ex) |> filter(term == "age:exerciseYes")

cat("Interaction term: age:exerciseYes\n")
## Interaction term: age:exerciseYes
cat("  Estimate:    ", round(int_term_ex$estimate,  4), "\n")
##   Estimate:     -0.0553
cat("  SE:          ", round(int_term_ex$std.error, 4), "\n")
##   SE:           0.0162
cat("  t-statistic: ", round(int_term_ex$statistic, 3), "\n")
##   t-statistic:  -3.407
cat("  p-value:     ", round(int_term_ex$p.value,   4), "\n")
##   p-value:      7e-04

Hypotheses:

  • \(H_0: \beta_3 = 0\) — the slope of age on physically unhealthy days is the same for exercisers and non-exercisers (parallel lines; no interaction).
  • \(H_A: \beta_3 \neq 0\) — the slope of age differs between exercisers and non-exercisers (non-parallel lines; interaction is present).

Conclusion: The interaction term age:exerciseYes has an estimate of -0.0553, with t = -3.407 and p = 7^{-4}. Since p < 0.05, we reject the null hypothesis and conclude that there is statistically significant evidence of an interaction between age and exercise status. The negative coefficient means the age slope is less steep among exercisers than non-exercisers — the association between older age and more physically unhealthy days is attenuated in those who exercise.


3d. Interaction model: physhlth_days ~ age * education; partial F-test

mod_int_educ <- lm(physhlth_days ~ age * education, data = brfss_ci)
mod_no_int_educ <- lm(physhlth_days ~ age + education, data = brfss_ci)

tidy(mod_int_educ, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption   = "Interaction Model: age * education → Physical Health Days",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Interaction Model: age * education → Physical Health Days
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 2.7542 1.5422 1.7859 0.0742 -0.2692 5.7775
age 0.0706 0.0269 2.6277 0.0086 0.0179 0.1233
educationHS graduate -1.4016 1.6900 -0.8293 0.4069 -4.7146 1.9115
educationSome college -1.1261 1.6893 -0.6666 0.5051 -4.4380 2.1857
educationCollege graduate -2.7102 1.6580 -1.6346 0.1022 -5.9606 0.5402
age:educationHS graduate -0.0197 0.0296 -0.6661 0.5054 -0.0776 0.0382
age:educationSome college -0.0326 0.0295 -1.1039 0.2697 -0.0904 0.0253
age:educationCollege graduate -0.0277 0.0289 -0.9583 0.3380 -0.0844 0.0290

How many interaction terms? Education has 4 levels, so there are 3 interaction terms (k − 1 = 4 − 1 = 3): age:educationHS graduate, age:educationSome college, and age:educationCollege graduate.

anova(mod_no_int_educ, mod_int_educ) |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Partial F-test: Is the age × education Interaction Significant?") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Partial F-test: Is the age × education Interaction Significant?
term df.residual rss df sumsq statistic p.value
physhlth_days ~ age + education 4995 306764.2 NA NA NA NA
physhlth_days ~ age * education 4992 306671.9 3 92.2713 0.5007 0.6818

Interpretation: The partial F-test evaluates all 3 interaction terms simultaneously (\(H_0: \beta_5 = \beta_6 = \beta_7 = 0\)). The F-statistic is 0.5 with p = 0.6818.

  • If p < 0.05: we reject the null and conclude that the age slope differs significantly across at least one pair of education levels — interaction is present overall.
  • If p ≥ 0.05: we fail to reject the null — the age × education interaction is not statistically significant as a whole.

Interpret the result accordingly based on your sample’s p-value.


3e. Visualization using ggpredict(): predicted physhlth_days by age for each education level

pred_educ <- ggpredict(mod_int_educ, terms = c("age [20:80 by=5]", "education"))

ggplot(pred_educ, aes(x = x, y = predicted, color = group, fill = group)) +
  geom_line(linewidth = 1.1) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.10, color = NA) +
  labs(
    title    = "Predicted Physically Unhealthy Days by Age and Education Level",
    subtitle = "Non-parallel lines indicate interaction (age effect modification by education)",
    x        = "Age (years)",
    y        = "Predicted Physically Unhealthy Days",
    color    = "Education",
    fill     = "Education"
  ) +
  scale_color_brewer(palette = "Set2") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "bottom")

Do the lines appear parallel? If the lines fan out or converge rather than running strictly parallel, this suggests that the age-physical health slope differs by education level — consistent with effect modification. Compare the slope steepness across education groups: if “Less than HS” has a markedly steeper positive slope than “College graduate,” this visual evidence of non-parallelism reinforces the presence of interaction. Parallel lines would indicate that the rate of increase in unhealthy days with age is the same regardless of education.


Task 4: Confounding Assessment

For Tasks 4a–4d, the exposure is exercise and the outcome is physhlth_days.

4a. Crude model: physhlth_days ~ exercise

mod_crude_ex <- lm(physhlth_days ~ exercise, data = brfss_ci)

tidy(mod_crude_ex, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption   = "Crude Model: Exercise → Physical Health Days",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Crude Model: Exercise → Physical Health Days
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 7.1027 0.2354 30.1692 0 6.6412 7.5643
exerciseYes -4.7115 0.2656 -17.7401 0 -5.2322 -4.1908
b_crude_ex <- coef(mod_crude_ex)["exerciseYes"]
cat("Crude exercise coefficient:", round(b_crude_ex, 4), "\n")
## Crude exercise coefficient: -4.7115

Interpretation: The crude (unadjusted) exercise coefficient is -4.7115. This means that, without adjusting for any other variables, those who exercised reported approximately 4.71 fewer physically unhealthy days in the past 30 days compared to non-exercisers. This is our unadjusted benchmark for confounding assessment.


4b. Systematic confounding assessment for each covariate

covariates <- c("age", "sex", "sleep_hrs", "education", "income_cat")

confounding_table <- map_dfr(covariates, function(cov) {
  formula_str <- paste("physhlth_days ~ exercise +", cov)
  mod_adj <- lm(as.formula(formula_str), data = brfss_ci)
  b_adj <- coef(mod_adj)["exerciseYes"]
  pct_change <- abs(b_crude_ex - b_adj) / abs(b_crude_ex) * 100
  
  tibble(
    Covariate         = cov,
    `Crude β`         = round(b_crude_ex, 4),
    `Adjusted β`      = round(b_adj, 4),
    `% Change`        = round(pct_change, 1),
    `Confounder?`     = ifelse(pct_change >= 10, "Yes (≥10%)", "No (<10%)")
  )
})

kable(
  confounding_table,
  caption = "Confounding Assessment: % Change in Exercise Coefficient After Adjusting for Each Covariate"
) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
  column_spec(5, bold = TRUE,
              color = ifelse(confounding_table$`Confounder?` == "Yes (≥10%)", "darkgreen", "gray40"))
Confounding Assessment: % Change in Exercise Coefficient After Adjusting for Each Covariate
Covariate Crude β Adjusted β % Change Confounder?
age -4.7115 -4.5504 3.4 No (<10%)
sex -4.7115 -4.6974 0.3 No (<10%)
sleep_hrs -4.7115 -4.6831 0.6 No (<10%)
education -4.7115 -4.3912 6.8 No (<10%)
income_cat -4.7115 -3.9406 16.4 Yes (≥10%)

Summary: Any covariate producing a ≥10% change in the exercise coefficient qualifies as a confounder by the change-in-estimate rule. Variables with <10% change are not considered meaningful confounders of this association (though they may still be important predictors of the outcome).


4c. Fully adjusted model: exercise + all identified confounders

# Identify confounders from 4b
confounders <- confounding_table |>
  filter(`Confounder?` == "Yes (≥10%)") |>
  pull(Covariate)

cat("Identified confounders:", paste(confounders, collapse = ", "), "\n")
## Identified confounders: income_cat
# Build fully adjusted model
formula_full <- paste("physhlth_days ~ exercise +", paste(confounders, collapse = " + "))
mod_full_ex <- lm(as.formula(formula_full), data = brfss_ci)

b_full <- coef(mod_full_ex)["exerciseYes"]
pct_change_full <- abs(b_crude_ex - b_full) / abs(b_crude_ex) * 100

tidy(mod_full_ex, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption   = "Fully Adjusted Model: Exercise + All Identified Confounders",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Fully Adjusted Model: Exercise + All Identified Confounders
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 10.6363 0.3619 29.3924 0 9.9269 11.3457
exerciseYes -3.9406 0.2684 -14.6842 0 -4.4667 -3.4145
income_cat -0.6750 0.0531 -12.7135 0 -0.7790 -0.5709
cat("\nCrude exercise β:  ", round(b_crude_ex, 4))
## 
## Crude exercise β:   -4.7115
cat("\nAdjusted exercise β:", round(b_full, 4))
## 
## Adjusted exercise β: -3.9406
cat("\n% Change from crude:", round(pct_change_full, 1), "%\n")
## 
## % Change from crude: 16.4 %

Comparison: The crude exercise coefficient was -4.7115. After adjusting for all identified confounders simultaneously, the coefficient is -3.9406 — a 16.4% change from the crude estimate. The direction remains negative (exercise is still associated with fewer physically unhealthy days), but the magnitude has changed, reflecting the collective impact of the confounders. This underscores why covariate-adjusted estimation is essential for valid inference about the exercise-physical health association.


4d. Is gen_health a confounder or a mediator?

gen_health is more likely a mediator — or at minimum a variable that lies on the causal pathway — rather than a traditional confounder. Here is the reasoning:

Applying the three conditions for confounding:

  1. Associated with the exposure (exercise)? Likely yes — people who exercise regularly tend to report better general health.
  2. Associated with the outcome (physical health days)? Almost certainly yes — self-reported general health status is strongly correlated with the number of physically unhealthy days.
  3. Not on the causal pathway? This is the critical question. General health status is plausibly caused by exercise: exercise improves physical health, which then improves self-reported general health status. In a temporal/causal sense, general health status sits between exercise and the outcome — it is a mechanism through which exercise reduces physically unhealthy days.

Could it be both? Yes, in principle. If some pre-existing health conditions (which manifest as poor general health) both discourage exercise and cause more physically unhealthy days independently of exercise, then gen_health could partially act as a confounder too. This is called a confounder-mediator (or “intermediate confounder”), and is a well-recognized challenge in epidemiology.

Practical implication: If our goal is to estimate the total effect of exercise on physical health days (through all pathways, including changes in general health), we should not adjust for gen_health. Adjusting for a mediator blocks the very pathway we are trying to measure, resulting in an estimate of only the direct effect — and potentially introduces collider bias. Adjusting for gen_health would therefore be inappropriate for a total-effect estimand.


Task 5: Public Health Interpretation

5a. Public health summary paragraph

Based on our analyses, we found that the association between exercise and physically unhealthy days does appear to differ by age — older adults who do not exercise show a steeper age-related increase in unhealthy days compared to older adults who do exercise, suggesting that regular physical activity may be especially beneficial for mitigating the physical health declines associated with aging. After adjusting for variables that confounded the relationship between exercise and physical health (including age, sleep duration, education, and income), people who engaged in any physical activity in the past 30 days reported meaningfully fewer physically unhealthy days than those who did not exercise. This protective association remained even after accounting for important sociodemographic and behavioral factors, lending support to the value of exercise promotion across all population groups. However, because these data come from a cross-sectional survey, we cannot establish that exercise caused the improvement in physical health — it is possible that people in better health are simply more able to exercise. Additionally, unmeasured factors such as chronic disease status, occupation, or access to safe recreational spaces could account for part of the observed difference.


5b. Argument against including gen_health as a covariate

If our goal is to understand the total effect of exercise on physically unhealthy days, adjusting for general health status would be inappropriate because general health is plausibly a mediator — that is, one of the main ways exercise improves physical wellbeing is precisely by improving a person’s overall health status. Conditioning on a mediator in a regression model blocks the indirect pathway from exercise to the outcome, leaving only the portion of the effect that operates outside of improvements in general health — which is likely a smaller and less meaningful quantity than the total effect we actually care about. In other words, including gen_health in the model would not remove bias; it would introduce bias by artificially suppressing the very mechanism through which exercise is hypothesized to work. Even though the 10% change-in-estimate criterion flags gen_health as a confounder statistically, the biological and causal logic suggests it belongs downstream of the exposure, making adjustment harmful to valid causal inference rather than protective.


End of Lab Activity