Setup

library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(car)
library(ggeffects)

brfss_ci <- readRDS("~/R/site-library/Epi553/code/brfss_ci_2020.rds")

Task 1: Exploratory Data Analysis

1a. Scatterplot: `physhlth_days` vs. `age`, colored by `exercise`

ggplot(brfss_ci, aes(x = age, y = physhlth_days, color = exercise)) +
  geom_point(alpha = 0.15, size = 0.8) +
  geom_smooth(method = "lm", se = TRUE, linewidth = 1.2) +
  labs(
    title    = "Physically Unhealthy Days vs. Age by Exercise Status",
    x        = "Age (years)",
    y        = "Physically Unhealthy Days (past 30)",
    color    = "Exercise"
  ) +
  scale_color_brewer(palette = "Set1") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "bottom")

Description: Among non-exercisers (No), physically unhealthy days tend to increase with age — older non-exercisers report more unhealthy days than younger ones. Among exercisers (Yes), the slope is shallower and the predicted values are lower overall, suggesting that exercise is associated with fewer physically unhealthy days across all ages. The two regression lines are not parallel — they appear to diverge slightly at older ages — which hints at a possible interaction between age and exercise (the age effect on physical health may be stronger among those who do not exercise). However, the raw data are highly dispersed due to the zero-inflated nature of the outcome.

1b. Mean `physhlth_days` by `sex` and `exercise`

brfss_ci |>
  group_by(sex, exercise) |>
  summarise(
    n    = n(),
    mean_physhlth = round(mean(physhlth_days, na.rm = TRUE), 2),
    sd   = round(sd(physhlth_days, na.rm = TRUE), 2),
    .groups = "drop"
  ) |>
  kable(
    caption   = "Mean Physically Unhealthy Days by Sex and Exercise Status",
    col.names = c("Sex", "Exercise", "N", "Mean Days", "SD")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Mean Physically Unhealthy Days by Sex and Exercise Status
Sex	Exercise	N	Mean Days	SD
Male	No	446	6.64	10.98
Male	Yes	1845	2.32	6.49
Female	No	625	7.43	11.46
Female	Yes	2084	2.45	6.32

Interpretation: In both sexes, those who exercise report fewer physically unhealthy days than non-exercisers. The magnitude of the exercise-associated difference appears somewhat larger for females than for males (females who do not exercise report notably more unhealthy days compared to male non-exercisers). This preliminary pattern suggests that the association between exercise and physical health might differ by sex — i.e., sex could be an effect modifier. However, formal testing (Task 3) is needed before drawing conclusions.

1c. Scatterplot: `physhlth_days` vs. `sleep_hrs`, faceted by `education`

ggplot(brfss_ci, aes(x = sleep_hrs, y = physhlth_days)) +
  geom_point(alpha = 0.12, size = 0.7, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "firebrick", linewidth = 1.1) +
  facet_wrap(~ education, nrow = 2) +
  labs(
    title = "Physically Unhealthy Days vs. Sleep Hours by Education Level",
    x     = "Sleep Hours per Night",
    y     = "Physically Unhealthy Days (past 30)"
  ) +
  theme_minimal(base_size = 12)

Comment on slopes: The negative slopes (more sleep → fewer unhealthy days) are visible across all four education panels, but the steepness differs. The “Less than HS” panel appears to have the steepest slope, while the “College graduate” panel appears flatter. This visual non-parallelism suggests potential effect modification by education — the protective association between sleep and physical health may be strongest among those with the least education. However, this must be confirmed with a formal interaction test.

Task 2: Stratified Analysis for Interaction

2a. Separate regression models: `physhlth_days ~ age` by exercise stratum

# Stratified models
mod_no_ex <- lm(physhlth_days ~ age, data = brfss_ci |> filter(exercise == "No"))
mod_yes_ex <- lm(physhlth_days ~ age, data = brfss_ci |> filter(exercise == "Yes"))

# Extract results
strat_results <- bind_rows(
  tidy(mod_no_ex,  conf.int = TRUE) |> filter(term == "age") |> mutate(Exercise = "No"),
  tidy(mod_yes_ex, conf.int = TRUE) |> filter(term == "age") |> mutate(Exercise = "Yes")
) |>
  select(Exercise, estimate, std.error, conf.low, conf.high) |>
  mutate(across(where(is.numeric), \(x) round(x, 4)))

kable(
  strat_results,
  caption   = "Stratum-Specific Slopes for Age → Physical Health Days",
  col.names = c("Exercise", "Slope (β age)", "SE", "95% CI Lower", "95% CI Upper")
) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Stratum-Specific Slopes for Age → Physical Health Days
Exercise	Slope (β age)	SE	95% CI Lower	95% CI Upper
No	0.0745	0.0212	0.0328	0.1161
Yes	0.0192	0.0060	0.0075	0.0309

Interpretation: Among non-exercisers, each additional year of age is associated with approximately 0.074 more physically unhealthy days (95% CI: 0.033, 0.116). Among exercisers, the slope for age is 0.019 (95% CI: 0.007, 0.031). Both slopes are positive (older age → more unhealthy days), but the non-exerciser slope is notably larger, suggesting age may have a stronger effect on physical health among those who do not exercise.

2b. Plot of stratum-specific regression lines

ggplot(brfss_ci, aes(x = age, y = physhlth_days, color = exercise)) +
  geom_point(alpha = 0.12, size = 0.8) +
  geom_smooth(method = "lm", se = TRUE, linewidth = 1.2) +
  labs(
    title    = "Regression Lines for Age → Physical Health Days, by Exercise Status",
    subtitle = "Parallel lines would indicate no interaction",
    x        = "Age (years)",
    y        = "Physically Unhealthy Days (past 30)",
    color    = "Exercise"
  ) +
  scale_color_brewer(palette = "Set1") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "bottom")

Are the lines approximately parallel? No — the two lines are not parallel. The line for non-exercisers (No) has a steeper positive slope than for exercisers (Yes). This divergence becomes more pronounced at older ages. Visual non-parallelism suggests that exercise may modify the effect of age on physical health, warranting a formal interaction test.

2c. Can you formally test whether the two slopes differ using only stratified results?

No, you cannot formally test the difference in slopes using only the stratified results. The stratified analysis produces separate point estimates and standard errors for each stratum, but comparing two coefficients from two independent models requires additional machinery. Specifically:

To test \(H_0: \beta_{age|No} = \beta_{age|Yes}\), you would need a test statistic based on \(\hat{\beta}_{No} - \hat{\beta}_{Yes}\) and its sampling variance — which requires knowing the covariance between the two estimates.
Since the models are fit on separate subsets of data, the estimates are independent, but the standard formula for comparing two independent regression slopes (e.g., \(z = \frac{\hat\beta_1 - \hat\beta_2}{\sqrt{SE_1^2 + SE_2^2}}\)) is only approximate and not equivalent to a likelihood-based F-test.
The stratified approach also cannot adjust for additional covariates while simultaneously testing the interaction, reducing its utility.

The proper way to test the interaction is to include an interaction term in a single pooled model, as in Task 3.

Task 3: Interaction via Regression

3a. Interaction model: `physhlth_days ~ age * exercise`

mod_int_ex <- lm(physhlth_days ~ age * exercise, data = brfss_ci)

tidy(mod_int_ex, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption   = "Interaction Model: age * exercise → Physical Health Days",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Interaction Model: age * exercise → Physical Health Days
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	2.7610	0.8798	3.1381	0.0017	1.0361	4.4858
age	0.0745	0.0145	5.1203	0.0000	0.0460	0.1030
exerciseYes	-1.3858	0.9664	-1.4340	0.1516	-3.2804	0.5087
age:exerciseYes	-0.0553	0.0162	-3.4072	0.0007	-0.0871	-0.0235

b <- round(coef(mod_int_ex), 4)

Fitted equation:

\[\widehat{\text{Phys Days}} = 2.761 + 0.0745(\text{Age}) + -1.3858(\text{ExerciseYes}) + -0.0553(\text{Age} \times \text{ExerciseYes})\]

3b. Stratum-specific equations

For non-exercisers (ExerciseYes = 0):

\[\widehat{\text{Phys Days}} = 2.761 + 0.0745(\text{Age})\]

For exercisers (ExerciseYes = 1):

\[\widehat{\text{Phys Days}} = (2.761 + -1.3858) + (0.0745 + -0.0553)(\text{Age})\] \[= 1.3752 + 0.0192(\text{Age})\]

Verification against stratified analysis from Task 2:

tribble(
  ~Method, ~Stratum, ~Intercept, ~Slope,
  "Stratified (Task 2)", "Non-exercisers",
    round(coef(mod_no_ex)[1], 4), round(coef(mod_no_ex)[2], 4),
  "Stratified (Task 2)", "Exercisers",
    round(coef(mod_yes_ex)[1], 4), round(coef(mod_yes_ex)[2], 4),
  "Interaction model", "Non-exercisers",
    b[1], b[2],
  "Interaction model", "Exercisers",
    b[1] + b[3], b[2] + b[4]
) |>
  kable(caption = "Verification: Stratified = Interaction Model Stratum-Specific Equations") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Verification: Stratified = Interaction Model Stratum-Specific Equations
Method	Stratum	Intercept	Slope
Stratified (Task 2)	Non-exercisers	2.7610	0.0745
Stratified (Task 2)	Exercisers	1.3752	0.0192
Interaction model	Non-exercisers	2.7610	0.0745
Interaction model	Exercisers	1.3752	0.0192

The intercepts and slopes match (up to rounding), confirming that the interaction model perfectly recovers the stratified results.

3c. t-test for the interaction term `age:exerciseYes`

int_term_ex <- tidy(mod_int_ex) |> filter(term == "age:exerciseYes")

cat("Interaction term: age:exerciseYes\n")

## Interaction term: age:exerciseYes

cat("  Estimate:    ", round(int_term_ex$estimate,  4), "\n")

##   Estimate:     -0.0553

cat("  SE:          ", round(int_term_ex$std.error, 4), "\n")

##   SE:           0.0162

cat("  t-statistic: ", round(int_term_ex$statistic, 3), "\n")

##   t-statistic:  -3.407

cat("  p-value:     ", round(int_term_ex$p.value,   4), "\n")

##   p-value:      7e-04

Hypotheses:

\(H_0: \beta_3 = 0\) — the slope of age on physically unhealthy days is the same for exercisers and non-exercisers (parallel lines; no interaction).
\(H_A: \beta_3 \neq 0\) — the slope of age differs between exercisers and non-exercisers (non-parallel lines; interaction is present).

Conclusion: The interaction term age:exerciseYes has an estimate of -0.0553, with t = -3.407 and p = 7^{-4}. Since p < 0.05, we reject the null hypothesis and conclude that there is statistically significant evidence of an interaction between age and exercise status. The negative coefficient means the age slope is less steep among exercisers than non-exercisers — the association between older age and more physically unhealthy days is attenuated in those who exercise.

3d. Interaction model: `physhlth_days ~ age * education`; partial F-test

mod_int_educ <- lm(physhlth_days ~ age * education, data = brfss_ci)
mod_no_int_educ <- lm(physhlth_days ~ age + education, data = brfss_ci)

tidy(mod_int_educ, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption   = "Interaction Model: age * education → Physical Health Days",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Interaction Model: age * education → Physical Health Days
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	2.7542	1.5422	1.7859	0.0742	-0.2692	5.7775
age	0.0706	0.0269	2.6277	0.0086	0.0179	0.1233
educationHS graduate	-1.4016	1.6900	-0.8293	0.4069	-4.7146	1.9115
educationSome college	-1.1261	1.6893	-0.6666	0.5051	-4.4380	2.1857
educationCollege graduate	-2.7102	1.6580	-1.6346	0.1022	-5.9606	0.5402
age:educationHS graduate	-0.0197	0.0296	-0.6661	0.5054	-0.0776	0.0382
age:educationSome college	-0.0326	0.0295	-1.1039	0.2697	-0.0904	0.0253
age:educationCollege graduate	-0.0277	0.0289	-0.9583	0.3380	-0.0844	0.0290

How many interaction terms? Education has 4 levels, so there are 3 interaction terms (k − 1 = 4 − 1 = 3): age:educationHS graduate, age:educationSome college, and age:educationCollege graduate.

anova(mod_no_int_educ, mod_int_educ) |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Partial F-test: Is the age × education Interaction Significant?") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Partial F-test: Is the age × education Interaction Significant?
term	df.residual	rss	df	sumsq	statistic	p.value
physhlth_days ~ age + education	4995	306764.2	NA	NA	NA	NA
physhlth_days ~ age * education	4992	306671.9	3	92.2713	0.5007	0.6818

Interpretation: The partial F-test evaluates all 3 interaction terms simultaneously (\(H_0: \beta_5 = \beta_6 = \beta_7 = 0\)). The F-statistic is 0.5 with p = 0.6818.

If p < 0.05: we reject the null and conclude that the age slope differs significantly across at least one pair of education levels — interaction is present overall.
If p ≥ 0.05: we fail to reject the null — the age × education interaction is not statistically significant as a whole.

Interpret the result accordingly based on your sample’s p-value.

3e. Visualization using `ggpredict()`: predicted `physhlth_days` by age for each education level

pred_educ <- ggpredict(mod_int_educ, terms = c("age [20:80 by=5]", "education"))

ggplot(pred_educ, aes(x = x, y = predicted, color = group, fill = group)) +
  geom_line(linewidth = 1.1) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.10, color = NA) +
  labs(
    title    = "Predicted Physically Unhealthy Days by Age and Education Level",
    subtitle = "Non-parallel lines indicate interaction (age effect modification by education)",
    x        = "Age (years)",
    y        = "Predicted Physically Unhealthy Days",
    color    = "Education",
    fill     = "Education"
  ) +
  scale_color_brewer(palette = "Set2") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "bottom")

Do the lines appear parallel? If the lines fan out or converge rather than running strictly parallel, this suggests that the age-physical health slope differs by education level — consistent with effect modification. Compare the slope steepness across education groups: if “Less than HS” has a markedly steeper positive slope than “College graduate,” this visual evidence of non-parallelism reinforces the presence of interaction. Parallel lines would indicate that the rate of increase in unhealthy days with age is the same regardless of education.

Task 4: Confounding Assessment

For Tasks 4a–4d, the exposure is exercise and the outcome is physhlth_days.

4a. Crude model: `physhlth_days ~ exercise`

mod_crude_ex <- lm(physhlth_days ~ exercise, data = brfss_ci)

tidy(mod_crude_ex, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption   = "Crude Model: Exercise → Physical Health Days",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Crude Model: Exercise → Physical Health Days
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	7.1027	0.2354	30.1692	0	6.6412	7.5643
exerciseYes	-4.7115	0.2656	-17.7401	0	-5.2322	-4.1908

b_crude_ex <- coef(mod_crude_ex)["exerciseYes"]
cat("Crude exercise coefficient:", round(b_crude_ex, 4), "\n")

## Crude exercise coefficient: -4.7115

Interpretation: The crude (unadjusted) exercise coefficient is -4.7115. This means that, without adjusting for any other variables, those who exercised reported approximately 4.71 fewer physically unhealthy days in the past 30 days compared to non-exercisers. This is our unadjusted benchmark for confounding assessment.

4b. Systematic confounding assessment for each covariate

covariates <- c("age", "sex", "sleep_hrs", "education", "income_cat")

confounding_table <- map_dfr(covariates, function(cov) {
  formula_str <- paste("physhlth_days ~ exercise +", cov)
  mod_adj <- lm(as.formula(formula_str), data = brfss_ci)
  b_adj <- coef(mod_adj)["exerciseYes"]
  pct_change <- abs(b_crude_ex - b_adj) / abs(b_crude_ex) * 100
  
  tibble(
    Covariate         = cov,
    `Crude β`         = round(b_crude_ex, 4),
    `Adjusted β`      = round(b_adj, 4),
    `% Change`        = round(pct_change, 1),
    `Confounder?`     = ifelse(pct_change >= 10, "Yes (≥10%)", "No (<10%)")
  )
})

kable(
  confounding_table,
  caption = "Confounding Assessment: % Change in Exercise Coefficient After Adjusting for Each Covariate"
) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
  column_spec(5, bold = TRUE,
              color = ifelse(confounding_table$`Confounder?` == "Yes (≥10%)", "darkgreen", "gray40"))

Confounding Assessment: % Change in Exercise Coefficient After Adjusting for Each Covariate
Covariate	Crude β	Adjusted β	% Change	Confounder?
age	-4.7115	-4.5504	3.4	No (<10%)
sex	-4.7115	-4.6974	0.3	No (<10%)
sleep_hrs	-4.7115	-4.6831	0.6	No (<10%)
education	-4.7115	-4.3912	6.8	No (<10%)
income_cat	-4.7115	-3.9406	16.4	Yes (≥10%)

Summary: Any covariate producing a ≥10% change in the exercise coefficient qualifies as a confounder by the change-in-estimate rule. Variables with <10% change are not considered meaningful confounders of this association (though they may still be important predictors of the outcome).

4c. Fully adjusted model: exercise + all identified confounders

# Identify confounders from 4b
confounders <- confounding_table |>
  filter(`Confounder?` == "Yes (≥10%)") |>
  pull(Covariate)

cat("Identified confounders:", paste(confounders, collapse = ", "), "\n")

## Identified confounders: income_cat

# Build fully adjusted model
formula_full <- paste("physhlth_days ~ exercise +", paste(confounders, collapse = " + "))
mod_full_ex <- lm(as.formula(formula_full), data = brfss_ci)

b_full <- coef(mod_full_ex)["exerciseYes"]
pct_change_full <- abs(b_crude_ex - b_full) / abs(b_crude_ex) * 100

tidy(mod_full_ex, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption   = "Fully Adjusted Model: Exercise + All Identified Confounders",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Fully Adjusted Model: Exercise + All Identified Confounders
Term	Estimate	SE	t	CI Lower	CI Upper
(Intercept)	10.6363	0.3619	29.3924	9.9269	11.3457
exerciseYes	-3.9406	0.2684	-14.6842	-4.4667	-3.4145
income_cat	-0.6750	0.0531	-12.7135	-0.7790	-0.5709

cat("\nCrude exercise β:  ", round(b_crude_ex, 4))

## 
## Crude exercise β:   -4.7115

cat("\nAdjusted exercise β:", round(b_full, 4))

## 
## Adjusted exercise β: -3.9406

cat("\n% Change from crude:", round(pct_change_full, 1), "%\n")

## 
## % Change from crude: 16.4 %

Comparison: The crude exercise coefficient was -4.7115. After adjusting for all identified confounders simultaneously, the coefficient is -3.9406 — a 16.4% change from the crude estimate. The direction remains negative (exercise is still associated with fewer physically unhealthy days), but the magnitude has changed, reflecting the collective impact of the confounders. This underscores why covariate-adjusted estimation is essential for valid inference about the exercise-physical health association.

4d. Is `gen_health` a confounder or a mediator?

gen_health is more likely a mediator — or at minimum a variable that lies on the causal pathway — rather than a traditional confounder. Here is the reasoning:

Applying the three conditions for confounding:

Associated with the exposure (exercise)? Likely yes — people who exercise regularly tend to report better general health.
Associated with the outcome (physical health days)? Almost certainly yes — self-reported general health status is strongly correlated with the number of physically unhealthy days.
Not on the causal pathway? This is the critical question. General health status is plausibly caused by exercise: exercise improves physical health, which then improves self-reported general health status. In a temporal/causal sense, general health status sits between exercise and the outcome — it is a mechanism through which exercise reduces physically unhealthy days.

Could it be both? Yes, in principle. If some pre-existing health conditions (which manifest as poor general health) both discourage exercise and cause more physically unhealthy days independently of exercise, then gen_health could partially act as a confounder too. This is called a confounder-mediator (or “intermediate confounder”), and is a well-recognized challenge in epidemiology.

Practical implication: If our goal is to estimate the total effect of exercise on physical health days (through all pathways, including changes in general health), we should not adjust for gen_health. Adjusting for a mediator blocks the very pathway we are trying to measure, resulting in an estimate of only the direct effect — and potentially introduces collider bias. Adjusting for gen_health would therefore be inappropriate for a total-effect estimand.

Task 5: Public Health Interpretation

5a. Public health summary paragraph

Based on our analyses, we found that the association between exercise and physically unhealthy days does appear to differ by age — older adults who do not exercise show a steeper age-related increase in unhealthy days compared to older adults who do exercise, suggesting that regular physical activity may be especially beneficial for mitigating the physical health declines associated with aging. After adjusting for variables that confounded the relationship between exercise and physical health (including age, sleep duration, education, and income), people who engaged in any physical activity in the past 30 days reported meaningfully fewer physically unhealthy days than those who did not exercise. This protective association remained even after accounting for important sociodemographic and behavioral factors, lending support to the value of exercise promotion across all population groups. However, because these data come from a cross-sectional survey, we cannot establish that exercise caused the improvement in physical health — it is possible that people in better health are simply more able to exercise. Additionally, unmeasured factors such as chronic disease status, occupation, or access to safe recreational spaces could account for part of the observed difference.

5b. Argument against including `gen_health` as a covariate

If our goal is to understand the total effect of exercise on physically unhealthy days, adjusting for general health status would be inappropriate because general health is plausibly a mediator — that is, one of the main ways exercise improves physical wellbeing is precisely by improving a person’s overall health status. Conditioning on a mediator in a regression model blocks the indirect pathway from exercise to the outcome, leaving only the portion of the effect that operates outside of improvements in general health — which is likely a smaller and less meaningful quantity than the total effect we actually care about. In other words, including gen_health in the model would not remove bias; it would introduce bias by artificially suppressing the very mechanism through which exercise is hypothesized to work. Even though the 10% change-in-estimate criterion flags gen_health as a confounder statistically, the biological and causal logic suggests it belongs downstream of the exposure, making adjustment harmful to valid causal inference rather than protective.

End of Lab Activity

EPI 553 — Confounding and Interactions Lab

In-Class Lab Activity — Completed Answers

Karo

March 24, 2026

Setup

Task 1: Exploratory Data Analysis

1a. Scatterplot: `physhlth_days` vs. `age`, colored by `exercise`

1b. Mean `physhlth_days` by `sex` and `exercise`

1c. Scatterplot: `physhlth_days` vs. `sleep_hrs`, faceted by `education`

Task 2: Stratified Analysis for Interaction

2a. Separate regression models: `physhlth_days ~ age` by exercise stratum

2b. Plot of stratum-specific regression lines

2c. Can you formally test whether the two slopes differ using only stratified results?

Task 3: Interaction via Regression

3a. Interaction model: `physhlth_days ~ age * exercise`

3b. Stratum-specific equations

3c. t-test for the interaction term `age:exerciseYes`

3d. Interaction model: `physhlth_days ~ age * education`; partial F-test

3e. Visualization using `ggpredict()`: predicted `physhlth_days` by age for each education level

Task 4: Confounding Assessment

4a. Crude model: `physhlth_days ~ exercise`

4b. Systematic confounding assessment for each covariate

4c. Fully adjusted model: exercise + all identified confounders

4d. Is `gen_health` a confounder or a mediator?

Task 5: Public Health Interpretation

5a. Public health summary paragraph

5b. Argument against including `gen_health` as a covariate

EPI 553 — Confounding and Interactions Lab

In-Class Lab Activity — Completed Answers

Karo

March 24, 2026

Setup

Task 1: Exploratory Data Analysis

1a. Scatterplot: physhlth_days vs. age, colored by exercise

1b. Mean physhlth_days by sex and exercise

1c. Scatterplot: physhlth_days vs. sleep_hrs, faceted by education

Task 2: Stratified Analysis for Interaction

2a. Separate regression models: physhlth_days ~ age by exercise stratum

2b. Plot of stratum-specific regression lines

2c. Can you formally test whether the two slopes differ using only stratified results?

Task 3: Interaction via Regression

3a. Interaction model: physhlth_days ~ age * exercise

3b. Stratum-specific equations

3c. t-test for the interaction term age:exerciseYes

3d. Interaction model: physhlth_days ~ age * education; partial F-test

3e. Visualization using ggpredict(): predicted physhlth_days by age for each education level

Task 4: Confounding Assessment

4a. Crude model: physhlth_days ~ exercise

4b. Systematic confounding assessment for each covariate

4c. Fully adjusted model: exercise + all identified confounders

4d. Is gen_health a confounder or a mediator?

Task 5: Public Health Interpretation

5a. Public health summary paragraph

5b. Argument against including gen_health as a covariate

1a. Scatterplot: `physhlth_days` vs. `age`, colored by `exercise`

1b. Mean `physhlth_days` by `sex` and `exercise`

1c. Scatterplot: `physhlth_days` vs. `sleep_hrs`, faceted by `education`

2a. Separate regression models: `physhlth_days ~ age` by exercise stratum

3a. Interaction model: `physhlth_days ~ age * exercise`

3c. t-test for the interaction term `age:exerciseYes`

3d. Interaction model: `physhlth_days ~ age * education`; partial F-test

3e. Visualization using `ggpredict()`: predicted `physhlth_days` by age for each education level

4a. Crude model: `physhlth_days ~ exercise`

4d. Is `gen_health` a confounder or a mediator?

5b. Argument against including `gen_health` as a covariate