EPI 553 — Confounding and Interactions Lab Due: End of class, March 24, 2026
In this lab, assess interaction and confounding in the BRFSS 2020 dataset. Work through each task systematically.
Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents.
| Variable | Description | Type |
|---|---|---|
physhlth_days |
Physically unhealthy days in past 30 | Continuous (0–30) |
sleep_hrs |
Sleep hours per night | Continuous (1–14) |
menthlth_days |
Mentally unhealthy days in past 30 | Continuous (0–30) |
age |
Age in years (capped at 80) | Continuous |
sex |
Sex (Male/Female) | Factor |
education |
Education level (4 categories) | Factor |
exercise |
Any physical activity (Yes/No) | Factor |
gen_health |
General health status (5 categories) | Factor |
income_cat |
Household income (1–8 ordinal) | Numeric |
# Load the dataset
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(car)
library(ggeffects)
library(ggplot2)
brfss_ci <- readRDS(
"/Users/nataliasmall/Downloads/EPI 553/brfss_ci_2020.rds"
)1a. (5 pts) Create a scatterplot of
physhlth_days (y-axis) vs. age (x-axis),
colored by exercise status. Add separate regression lines
for each group. Describe the pattern you observe.
ggplot(brfss_ci, aes(x = age, y = physhlth_days, color = exercise)) +
geom_jitter(alpha = 0.08, width = 0.3, height = 0.3, size = 0.8) +
geom_smooth(method = "lm", se = TRUE, linewidth = 1.2) +
labs(
title = "Association Between Age and Physical Health Days, Stratified by Exercise Status",
subtitle = "BRFSS 2020 (n = 5,000)",
x = "Age in years (capped at 80)",
y = "Physically Unhealthy Days (Past 30)",
color = "Exercise Status"
) +
theme_minimal(base_size = 13) +
scale_color_brewer(palette = "Set1")
Interpretation: The observed pattern reveals that the
age-physical health association is positive in both groups. Both
regression lines slope upward, confirming the positive association
between age and physically unhealthy days for both exercise status
categories. The “No” exercise status category regression line suggest a
stronger protective association of age compared to the “Yes” line. Note,
that the “Yes” line only indicates a slight positive association,
compared to the “No” line. The confidence bands do not overlap, which
hints that the difference in slopes may be statistically significant.
The scattered points also reveal the highly skewed nature of the
outcome: most respondents cluster near zero unhealthy days regardless of
age in years, with a long tail of individuals reporting many unhealthy
days.
1b. (5 pts) Compute the mean
physhlth_days for each combination of sex and
exercise. Present the results in a table. Does it appear
that the association between exercise and physical health days might
differ by sex?
# Compute observed group means
physhlth_means <- brfss_ci |>
summarise(mean_physhlth = mean(physhlth_days), .by = c(sex, exercise))
# Present the results in a table
physhlth_means |>
kable(
caption = "Mean Physically Unhealthy Days by Sex and Exercise Status",
col.names = c("Sex", "Exercise", "Mean Days")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Sex | Exercise | Mean Days |
|---|---|---|
| Female | No | 7.430400 |
| Male | Yes | 2.323577 |
| Female | Yes | 2.451056 |
| Male | No | 6.643498 |
Interpretation: It appears that the association between exercise and physical health days might not differ by sex. However, individuals, regardless of sex, have similar average physically unhealthy days when grouped by their exercise status.
1c. (5 pts) Create a scatterplot of
physhlth_days vs. sleep_hrs, faceted by
education level. Comment on whether the slopes appear
similar or different across education groups.
ggplot(brfss_ci, aes(x = sleep_hrs, y = physhlth_days, color = education)) +
geom_jitter(alpha = 0.08, width = 0.3, height = 0.3, size = 0.8) +
geom_smooth(method = "lm", se = TRUE, linewidth = 1.2) +
labs(
title = "Association Between Sleep hours and Physical Health Days, Faceted by Education level",
subtitle = "BRFSS 2020 (n = 5,000)",
x = "Sleep hours per night",
y = "Physically Unhealthy Days (Past 30)",
color = "Education level"
) +
theme_minimal(base_size = 13) +
scale_color_brewer(palette = "Set1")
Interpretation: The slopes appear similar across
education groups. The observed pattern shows that the confidence bands
overlap, which hints that the difference in slopes may not be
statistically significant.
2a. (5 pts) Fit separate simple linear regression
models of physhlth_days ~ age for exercisers and
non-exercisers. Report the slope, SE, and 95% CI for age in each
stratum.
# Fit separate models for exercisers and non-exercisers
mod_exer <- lm(physhlth_days ~ age, data = brfss_ci |> filter(exercise == "Yes"))
mod_non_excer <- lm(physhlth_days ~ age, data = brfss_ci |> filter(exercise == "No"))
# Compare coefficients
bind_rows(
tidy(mod_exer, conf.int = TRUE) |> mutate(Stratum = "Exercisers"),
tidy(mod_non_excer, conf.int = TRUE) |> mutate(Stratum = "Non-Exercisers")
) |>
filter(term == "age") |>
select(Stratum, estimate, std.error, conf.low, conf.high, p.value) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Stratified Analysis: Physically unhealthy days ~ Age, by Exercise Status",
col.names = c("Stratum", "Estimate", "SE", "CI Lower", "CI Upper", "p-value")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Stratum | Estimate | SE | CI Lower | CI Upper | p-value |
|---|---|---|---|---|---|
| Exercisers | 0.0192 | 0.0060 | 0.0075 | 0.0309 | 0.0013 |
| Non-Exercisers | 0.0745 | 0.0212 | 0.0328 | 0.1161 | 0.0005 |
Age by Exercisers (Yes) Slope: 0.0192 Age by Exercisers (Yes) SE: 0.0060 Age by Exercisers (Yes) 95% CI: [0.0075, 0.0309]
Age by Non-Exercisers (No) Slope: 0.0745 Age by Non-Exercisers (No) SE: 0.0212 Age by Non-Exercisers (No) 95% CI: [0.0328, 0.1161]
2b. (5 pts) Create a single plot showing the two fitted regression lines (one per exercise group) overlaid on the data. Are the lines approximately parallel?
ggplot(brfss_ci, aes(x = age, y = physhlth_days, color = exercise)) +
geom_jitter(alpha = 0.08, width = 0.3, height = 0.3, size = 0.8) +
geom_smooth(method = "lm", se = FALSE, linewidth = 1.2) +
labs(
title = "Association Between Age and Physical Health Days, Stratified by Exercise Status",
subtitle = "BRFSS 2020 (n = 5,000)",
x = "Age in years (capped at 80)",
y = "Physically Unhealthy Days (Past 30)",
color = "Exercise Status"
) +
theme_minimal(base_size = 13) +
scale_color_brewer(palette = "Set1")
Interpretation: The lines are not approximately
parallel. The non-exercisers (exercise status = No) regression line
shows a sharper positive incline compared to the exercisers (exercise
status = yes) regression line showing a very slight positive
incline.
2c. (5 pts) Can you formally test whether the two slopes are different using only the stratified results? Explain why or why not. You cannot formally test whether the two slopes are different using only the stratified results. While the stratified approach is intuitive and visually compelling, and we can see the slopes are different, we cannot compute a p-value for the difference without additional machinery.
3a. (5 pts) Fit the interaction model:
physhlth_days ~ age * exercise. Write out the fitted
equation.
# Model with interaction
mod_int <- lm(physhlth_days ~ age * exercise, data = brfss_ci)
tidy(mod_int, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Interaction Model: Age * Exercise Status → Physical Health Days",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 2.7610 | 0.8798 | 3.1381 | 0.0017 | 1.0361 | 4.4858 |
| age | 0.0745 | 0.0145 | 5.1203 | 0.0000 | 0.0460 | 0.1030 |
| exerciseYes | -1.3858 | 0.9664 | -1.4340 | 0.1516 | -3.2804 | 0.5087 |
| age:exerciseYes | -0.0553 | 0.0162 | -3.4072 | 0.0007 | -0.0871 | -0.0235 |
Fitted equation: Phys Days = 2.7610 + 0.0745 (Age) + -1.3858 (Exercise:Yes) + -0.0553 (Age × Exercise:Yes)
3b. (5 pts) Using the fitted equation, derive the stratum-specific equations for exercisers and non-exercisers. Verify that these match the stratified analysis from Task 2.
For Non-Exercisers (Exercise = No) Phys Days = 2.7610 + 0.0745 (Age) For Exercisers (Exercise = Yes) Phys Days = (2.7610 + -1.3858) + (0.0745 + -0.0553)(Age) = 1.3752 + 0.0192(Age)
These match the stratified analysis from Task 2.
3c. (5 pts) Conduct the t-test for the interaction
term (age:exerciseYes). State the null and alternative
hypotheses, report the test statistic and p-value, and state your
conclusion about whether interaction is present.
int_term <- tidy(mod_int) |> filter(term == "age:exerciseYes")
cat("Interaction term (age:exerciseYes):\n")## Interaction term (age:exerciseYes):
## Estimate: -0.0553
## t-statistic: -3.407
## p-value: 7e-04
Null Hypothesis: 𝐻0:𝛽3=0 (slopes are equal, lines are parallel, no interaction) Alternative Hypothesis: 𝐻𝐴:𝛽3≠0 (slopes differ, interaction is present) t-statistic: -3.407 p-value: 7e-04 (<0.001) Conclusion: Since p-value 7e-04 is <0.05 we reject the null hypothesis, indicating that slopes significantly differ and interaction is present. There is statistically significant evidence that the relationship between age and physical health does change depending on exercise status.
3d. (5 pts) Now fit a model with an interaction
between age and education: physhlth_days ~ age * education.
How many interaction terms are produced? Use a partial F-test to test
whether the age \(\times\) education
interaction as a whole is significant.
# Model with an interaction between age and education: physhlth_days ~ age * education
mod_edu <- lm(physhlth_days ~ age * education, data = brfss_ci)
# Conduct partial F-test
anova(mod_edu) |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Test for Coincidence: Does Age × Education Affect the Age-Physical Health Relationship at All?") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| age | 1 | 2830.1358 | 2830.1358 | 46.0689 | 0.0000 |
| education | 3 | 5780.0958 | 1926.6986 | 31.3628 | 0.0000 |
| age:education | 3 | 92.2713 | 30.7571 | 0.5007 | 0.6818 |
| Residuals | 4992 | 306671.8963 | 61.4327 | NA | NA |
Interpretation: 3 interaction terms are produced. The age × education interaction as a whole is not significant as, p-value 0.6818 > 0.05.
3e. (5 pts) Create a visualization using
ggpredict() showing the predicted
physhlth_days by age for each education level. Do the lines
appear parallel?
pred_int <- ggpredict(mod_edu, terms = c("age", "education"))
ggplot(pred_int, aes(x = x, y = predicted, color = group, fill = group)) +
geom_line(linewidth = 1.2) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.12, color = NA) +
labs(
title = "Predicted Physical Health Days by Age and Education Level",
subtitle = "From interaction model: age * education",
x = "Age in years (capped at 80)",
y = "Predicted Physically Unhealthy Days",
color = "Education",
fill = "Education"
) +
theme_minimal(base_size = 13) +
scale_color_brewer(palette = "Set1")
Interpretation: The lines appear relatively
parallel.
For this task, the exposure is exercise
and the outcome is physhlth_days.
4a. (5 pts) Fit the crude model:
physhlth_days ~ exercise. Report the exercise coefficient.
This is the unadjusted estimate.
mod_crude <- lm(physhlth_days ~ exercise, data = brfss_ci)
tidy(mod_crude, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Crude Model: Physical Health Days ~ Exercise Status",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 7.1027 | 0.2354 | 30.1692 | 0 | 6.6412 | 7.5643 |
| exerciseYes | -4.7115 | 0.2656 | -17.7401 | 0 | -5.2322 | -4.1908 |
Coefficient: The exercise coefficient is -4.7115.
4b. (10 pts) Systematically assess whether each of
the following is a confounder of the exercise-physical health
association: age, sex, sleep_hrs,
education, and income_cat. For each:
physhlth_days ~ exercise + [covariate]Present your results in a single summary table.
# Crude model
b_crude_val <- coef(mod_crude)["exerciseYes"]
# One-at-a-time adjusted models
confounders <- list(
"Age" = lm(physhlth_days ~ exercise + age, data = brfss_ci),
"Sex" = lm(physhlth_days ~ exercise + sex, data = brfss_ci),
"Sleep Hours" = lm(physhlth_days ~ exercise + sleep_hrs, data = brfss_ci),
"Education" = lm(physhlth_days ~ exercise + education, data = brfss_ci),
"Income" = lm(physhlth_days ~ exercise + income_cat, data = brfss_ci)
)
conf_table <- map_dfr(names(confounders), \(name) {
mod <- confounders[[name]]
b_adj_val <- coef(mod)["exerciseYes"]
tibble(
Covariate = name,
`Crude β (exercise)` = round(b_crude_val, 4),
`Adjusted β (exercise)` = round(b_adj_val, 4),
`% Change` = round(abs(b_crude_val - b_adj_val) / abs(b_crude_val) * 100, 1),
Confounder = ifelse(abs(b_crude_val - b_adj_val) / abs(b_crude_val) * 100 > 10,
"Yes (>10%)", "No")
)
})
conf_table |>
kable(caption = "Systematic Confounding Assessment: One-at-a-Time Addition") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
column_spec(5, bold = TRUE)| Covariate | Crude β (exercise) | Adjusted β (exercise) | % Change | Confounder |
|---|---|---|---|---|
| Age | -4.7115 | -4.5504 | 3.4 | No |
| Sex | -4.7115 | -4.6974 | 0.3 | No |
| Sleep Hours | -4.7115 | -4.6831 | 0.6 | No |
| Education | -4.7115 | -4.3912 | 6.8 | No |
| Income | -4.7115 | -3.9406 | 16.4 | Yes (>10%) |
4c. (5 pts) Fit a fully adjusted model including exercise and all identified confounders. Report the exercise coefficient and compare it to the crude estimate. How much did the estimate change overall?
# Step 1: Test for interaction between age and exercise
mod_final_int <- lm(physhlth_days ~ exercise * age + sex + sleep_hrs + education + income_cat,
data = brfss_ci)
# Check interaction term
int_pval <- tidy(mod_final_int) |>
filter(term == "exerciseYes:age") |>
pull(p.value)
cat("Interaction p-value (age x exercise):", round(int_pval, 4), "\n")## Interaction p-value (age x exercise): 4e-04
# p-value < 0.05
# Table
tidy(mod_final_int, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Final Model: Exercise Status → Physical Health Days, Adjusted for Confounders",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 9.5496 | 1.1017 | 8.6681 | 0.0000 | 7.3898 | 11.7094 |
| exerciseYes | -0.4797 | 0.9525 | -0.5037 | 0.6145 | -2.3470 | 1.3876 |
| age | 0.0785 | 0.0143 | 5.4826 | 0.0000 | 0.0505 | 0.1066 |
| sexFemale | 0.0092 | 0.2169 | 0.0426 | 0.9660 | -0.4159 | 0.4344 |
| sleep_hrs | -0.4096 | 0.0795 | -5.1521 | 0.0000 | -0.5655 | -0.2538 |
| educationHS graduate | -1.0692 | 0.5083 | -2.1034 | 0.0355 | -2.0658 | -0.0726 |
| educationSome college | -0.7924 | 0.5145 | -1.5401 | 0.1236 | -1.8012 | 0.2163 |
| educationCollege graduate | -1.0013 | 0.5222 | -1.9175 | 0.0552 | -2.0251 | 0.0224 |
| income_cat | -0.6311 | 0.0589 | -10.7067 | 0.0000 | -0.7467 | -0.5156 |
| exerciseYes:age | -0.0567 | 0.0159 | -3.5588 | 0.0004 | -0.0880 | -0.0255 |
# Compare Crude-Final
b_final <- coef(mod_final_int)["exerciseYes"]
tribble(
~Model, ~`Exercise β`, ~`% Change from Crude`,
"Crude", round(b_crude_val, 4), "—",
"Unadjusted (exercise only)", round(coef(mod_crude)["exerciseYes"], 4),
paste0(round(abs(b_crude_val - coef(mod_crude)["exerciseYes"]) / abs(b_crude_val) * 100, 1), "%"),
"Fully Adjusted (all confounders)", round(b_final, 4),
paste0(round(abs(b_crude_val - b_final) / abs(b_crude_val) * 100, 1), "%")
) |>
kable(caption = "Sleep Coefficient: Crude vs. Progressively Adjusted Models") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Model | Exercise β | % Change from Crude |
|---|---|---|
| Crude | -4.7115 | — |
| Unadjusted (exercise only) | -4.7115 | 0% |
| Fully Adjusted (all confounders) | -0.4797 | 89.8% |
Interpretation: From the fully adjusted model, including all confounders, the exercise coefficient is -0.4797 Compared to the crude estimate, -4.7115, it changed 89.8%.
4d. (5 pts) Is gen_health a confounder
or a mediator of the exercise-physical health relationship? Could it be
both? Explain your reasoning with reference to the three conditions for
confounding and the concept of the causal pathway.
Interpretation: Gen Health could be a confounder of the exercise-physical health relationship. People who exercise generally have better general health, affirming association with exposure. Additionally, general health is a strong predictor of physically unhealthy days, confirming association with outcome. But, exercise directly influences general health, indicating that it is on the causal pathway. Thus, gen health would be better identified as a mediator of the exercise-physical health relationship. Exercise improves health status over time.
5a. (10 pts) Based on your analyses, write a 4–5 sentence paragraph for a public health audience summarizing:
Do not use statistical jargon.
Summary: The association between exercise and physical health showed substantial difference by subgroup. While individuals benefit from exercise, the protective effect increases as people get older. Age, sleep hours, education level, and income are variables that confounded the exercise-physical health association. Moreover, after accounting for confounding factors the direction and approximate magnitude of the adjusted exercise effect remained positive, however, the initial estimate was reduced by 89.8%. Since the data used for this analysis was collected at a single point in time, it cannot be definitively proven that exercise causes better health, and can we cannot rule out other potential unmeasured lifestyle factors that might influence these results.
5b. (10 pts) A colleague suggests including
gen_health as a covariate in the final model because it
changes the exercise coefficient by more than 10%. You disagree. Write a
3–4 sentence argument explaining why adjusting for
general health may not be appropriate if the goal is to estimate the
total effect of exercise on physical health days. Use the concept of
mediation in your argument.
Adjusting for general health may not be appropriate if the goal is to estimate the total effect of exercise on physical health days because general health acts as a mediator in the exercise-physical health relationship. While it does change the exercise coefficient by more than 10%, general health is on the causal pathway; exercise improves overall health, which directly reduces physically unhealthy days. Including general health as a covariate in the final model will lead to an over-adjusted estimate, which we don’t want.
End of Lab Activity