EPI 553 — Confounding and Interactions Lab Due: End of class, March 24, 2026
In this lab, you will assess interaction and confounding in the BRFSS 2020 dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.
Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.
Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents.
| Variable | Description | Type |
|---|---|---|
physhlth_days |
Physically unhealthy days in past 30 | Continuous (0–30) |
sleep_hrs |
Sleep hours per night | Continuous (1–14) |
menthlth_days |
Mentally unhealthy days in past 30 | Continuous (0–30) |
age |
Age in years (capped at 80) | Continuous |
sex |
Sex (Male/Female) | Factor |
education |
Education level (4 categories) | Factor |
exercise |
Any physical activity (Yes/No) | Factor |
gen_health |
General health status (5 categories) | Factor |
income_cat |
Household income (1–8 ordinal) | Numeric |
# Load the dataset
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(ggplot2)
library(car)
library(ggeffects)
library(haven)
library(janitor)
library(gtsummary)
library(GGally)
library(plotly)
library(lmtest)
options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))brfss_ci <- readRDS(
"C:/Users/joshm/Documents/UAlbany/Spring 2026/EPI 553/Labs/brfss_ci.rds")
#clean_names()1a. (5 pts) Create a scatterplot of
physhlth_days (y-axis) vs. age (x-axis),
colored by exercise status. Add separate regression lines
for each group. Describe the pattern you observe.
ggplot(brfss_ci, aes(x = age, y = physhlth_days, color = exercise)) +
geom_jitter(alpha = 0.08, width = 0.3, height = 0.3, size = 0.8) +
geom_smooth(method = "lm", se = TRUE, linewidth = 1.2) +
labs(
title = "Association Between Age and Physical Health Days, Stratified by Exercise status",
subtitle = "BRFSS 2020",
x = "Age in years",
y = "Physically Unhealthy Days (Past 30)",
color = "Exercise status"
) +
theme_minimal(base_size = 13) +
scale_color_brewer(palette = "Set1")Age is positively associated with physically unhealthy days in both Exercise status groups. However, the slope is steeper for the No Exercise group, suggesting that Exercise modifies the Age-Physically Unhealthy relationship.
1b. (5 pts) Compute the mean
physhlth_days for each combination of sex and
exercise. Present the results in a table. Does it appear
that the association between exercise and physical health days might
differ by sex?
group_means <- brfss_ci |>
summarise(mean_days = mean(physhlth_days), .by = c(exercise, sex))
group_means %>%
kable(caption = "Physical Health Days by Sex & Exercise",
align = "lrrrrrrrr") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)| exercise | sex | mean_days |
|---|---|---|
| No | Female | 7.430400 |
| Yes | Male | 2.323577 |
| Yes | Female | 2.451056 |
| No | Male | 6.643498 |
It appears the relationship between Physical health days and Exercise might slightly differ by sex – males and females who exercised had roughly the same number of poor physical health days, but among the non-exercisers, females had much more physically unhealthy days than males.
1c. (5 pts) Create a scatterplot of
physhlth_days vs. sleep_hrs, faceted by
education level. Comment on whether the slopes appear
similar or different across education groups.
ggplot(brfss_ci, aes(x = sleep_hrs, y = physhlth_days, color = education)) +
geom_jitter(alpha = 0.08, width = 0.3, height = 0.3, size = 0.8) +
geom_smooth(method = "lm", se = TRUE, linewidth = 1.2) +
labs(
title = "Association Between Sleep and Physical Health Days, Stratified by Education Level",
subtitle = "BRFSS 2020",
x = "Sleep (hrs)",
y = "Physically Unhealthy Days (Past 30)",
color = "Education level"
) +
theme_minimal(base_size = 13) +
scale_color_brewer(palette = "Set1")The slopes are different across education groups – the slope becomes less steep with increasing education level.
2a. (5 pts) Fit separate simple linear regression
models of physhlth_days ~ age for exercisers and
non-exercisers. Report the slope, SE, and 95% CI for age in each
stratum.
exercise_brfss <- brfss_ci %>%
filter(
exercise == "Yes")
no_exercise_brfss <- brfss_ci %>%
filter(
exercise == "No")
exer <- lm(physhlth_days ~ age, data = exercise_brfss)
tidy(exer, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Physically Unhealthy Days vs. Age, Exercisers",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 1.3752 | 0.3327 | 4.1327 | 0.0000 | 0.7228 | 2.0275 |
| age | 0.0192 | 0.0060 | 3.2079 | 0.0013 | 0.0075 | 0.0309 |
no_exer <- lm(physhlth_days ~ age, data = no_exercise_brfss)
tidy(no_exer, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Physically Unhealthy Days vs. Age, Non-Exercisers",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 2.7610 | 1.2837 | 2.1508 | 0.0317 | 0.2421 | 5.2798 |
| age | 0.0745 | 0.0212 | 3.5094 | 0.0005 | 0.0328 | 0.1161 |
2b. (5 pts) Create a single plot showing the two fitted regression lines (one per exercise group) overlaid on the data. Are the lines approximately parallel?
ggplot(brfss_ci, aes(x = age, y = physhlth_days, color = exercise)) +
geom_jitter(alpha = 0.08, width = 0.3, height = 0.3, size = 0.8) +
geom_smooth(method = "lm", se = FALSE, linewidth = 1.2) +
labs(
title = "Association Between Age and Physical Health Days, Stratified by Exercise status",
subtitle = "BRFSS 2020",
x = "Age in years",
y = "Physically Unhealthy Days (Past 30)",
color = "Exercise status"
) +
theme_minimal(base_size = 13) +
scale_color_brewer(palette = "Set1")The lines are not parallel. The non-exercise group has a steeper slope.
2c. (5 pts) Can you formally test whether the two slopes are different using only the stratified results? Explain why or why not.
No, even though it is visually apparent that the slopes are different, we cannot determine if that difference is statistically significant through the stratified results alone.
3a. (5 pts) Fit the interaction model:
physhlth_days ~ age * exercise. Write out the fitted
equation.
inter <- lm(physhlth_days ~ age * exercise, data = brfss_ci)
tidy(inter, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Physically Unhealthy Days vs. Age * Exercise",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 2.7610 | 0.8798 | 3.1381 | 0.0017 | 1.0361 | 4.4858 |
| age | 0.0745 | 0.0145 | 5.1203 | 0.0000 | 0.0460 | 0.1030 |
| exerciseYes | -1.3858 | 0.9664 | -1.4340 | 0.1516 | -3.2804 | 0.5087 |
| age:exerciseYes | -0.0553 | 0.0162 | -3.4072 | 0.0007 | -0.0871 | -0.0235 |
Physhlth_days = 2.7610 + (0.0745 * age) + (-1.3858 * ExerciseYes) + (-0.0553 * (ExerciseYes * Age))
3b. (5 pts) Using the fitted equation, derive the stratum-specific equations for exercisers and non-exercisers. Verify that these match the stratified analysis from Task 2.
No Exercise:
Physhlth_days = 2.7610 + (0.0745 * age)
Exercise:
Physhlth_days = (2.7610-1.3858) + (0.0745 * age) + (-0.0553 * Age))
or
Physhlth_days = 1.3752 + (0.0192 * age)
These match the stratified analysis from Task 2.
3c. (5 pts) Conduct the t-test for the interaction
term (age:exerciseYes). State the null and alternative
hypotheses, report the test statistic and p-value, and state your
conclusion about whether interaction is present.
int_term <- tidy(inter) |> filter(term == "age:exerciseYes")
cat("Interaction term (age:exerciseYes):\n")## Interaction term (age:exerciseYes):
## Estimate: -0.0553
## t-statistic: -3.407
## p-value: 7e-04
Null hypothesis: The slopes are equal, no interaction. Alternative hypothesis: The slopes are not equal, there is an interaction. T statistic is -3.407, p value is 7e-04. Since p<0.05, I reject the null hypothesis and include that interaction is present.
3d. (5 pts) Now fit a model with an interaction
between age and education: physhlth_days ~ age * education.
How many interaction terms are produced? Use a partial F-test to test
whether the age \(\times\) education
interaction as a whole is significant.
inter2 <- lm(physhlth_days ~ age * education, data = brfss_ci)
tidy(inter2, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Physically Unhealthy Days vs. Age * Education",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 2.7542 | 1.5422 | 1.7859 | 0.0742 | -0.2692 | 5.7775 |
| age | 0.0706 | 0.0269 | 2.6277 | 0.0086 | 0.0179 | 0.1233 |
| educationHS graduate | -1.4016 | 1.6900 | -0.8293 | 0.4069 | -4.7146 | 1.9115 |
| educationSome college | -1.1261 | 1.6893 | -0.6666 | 0.5051 | -4.4380 | 2.1857 |
| educationCollege graduate | -2.7102 | 1.6580 | -1.6346 | 0.1022 | -5.9606 | 0.5402 |
| age:educationHS graduate | -0.0197 | 0.0296 | -0.6661 | 0.5054 | -0.0776 | 0.0382 |
| age:educationSome college | -0.0326 | 0.0295 | -1.1039 | 0.2697 | -0.0904 | 0.0253 |
| age:educationCollege graduate | -0.0277 | 0.0289 | -0.9583 | 0.3380 | -0.0844 | 0.0290 |
#F-test
inter3 <- lm(physhlth_days ~ age + education, data = brfss_ci)
anova(inter3, inter2) |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Partial F-test: Is the Age x Education Interaction Significant?") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | df.residual | rss | df | sumsq | statistic | p.value |
|---|---|---|---|---|---|---|
| physhlth_days ~ age + education | 4995 | 306764.2 | NA | NA | NA | NA |
| physhlth_days ~ age * education | 4992 | 306671.9 | 3 | 92.2713 | 0.5007 | 0.6818 |
3 interaction terms are produced. p > 0.05 so the interaction is not statistically significant.
3e. (5 pts) Create a visualization using
ggpredict() showing the predicted
physhlth_days by age for each education level. Do the lines
appear parallel?
pred_int <- ggpredict(inter2, terms = c("age", "education"))
ggplot(pred_int, aes(x = x, y = predicted, color = group, fill = group)) +
geom_line(linewidth = 1.2) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.12, color = NA) +
labs(
title = "Predicted Physical Health Days by Age & Education",
x = "Age in years",
y = "Predicted Physically Unhealthy Days",
color = "Education",
fill = "Education"
) +
theme_minimal(base_size = 13) +
scale_color_brewer(palette = "Set1")The lines do not appear parallel.
For this task, the exposure is exercise
and the outcome is physhlth_days.
4a. (5 pts) Fit the crude model:
physhlth_days ~ exercise. Report the exercise coefficient.
This is the unadjusted estimate.
mod_crude <- lm(physhlth_days ~ exercise, data = brfss_ci)
tidy(mod_crude, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Crude Model",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 7.1027 | 0.2354 | 30.1692 | 0 | 6.6412 | 7.5643 |
| exerciseYes | -4.7115 | 0.2656 | -17.7401 | 0 | -5.2322 | -4.1908 |
The exercise coefficient is -4.7115.
4b. (10 pts) Systematically assess whether each of
the following is a confounder of the exercise-physical health
association: age, sex, sleep_hrs,
education, and income_cat. For each:
physhlth_days ~ exercise + [covariate]## exerciseYes
## -4.711514
mod_adj_age <- lm(physhlth_days ~ exercise + age, data = brfss_ci)
mod_adj_sex <- lm(physhlth_days ~ exercise + sex, data = brfss_ci)
mod_adj_sleep_hrs <- lm(physhlth_days ~ exercise + sleep_hrs, data = brfss_ci)
mod_adj_education <- lm(physhlth_days ~ exercise + education, data = brfss_ci)
mod_adj_income_cat <- lm(physhlth_days ~ exercise + income_cat, data = brfss_ci)
age_adj <- coef(mod_adj_age)["exerciseYes"]
age_adj## exerciseYes
## -4.550392
## exerciseYes
## -4.697375
## exerciseYes
## -4.683112
## exerciseYes
## -4.391222
## exerciseYes
## -3.940615
## exerciseYes
## 3.419746
## exerciseYes
## 0.3000968
## exerciseYes
## 0.6028321
## exerciseYes
## 6.798071
## exerciseYes
## 16.36203
confounders <- list(
"Age" = lm(physhlth_days ~ exercise + age, data = brfss_ci),
"Sex" = lm(physhlth_days ~ exercise + sex, data = brfss_ci),
"Sleep" = lm(physhlth_days ~ exercise + sleep_hrs, data = brfss_ci),
"Education" = lm(physhlth_days ~ exercise + education, data = brfss_ci),
"Income" = lm(physhlth_days ~ exercise + income_cat, data = brfss_ci))
conf_table <- map_dfr(names(confounders), \(name) {
mod <- confounders[[name]]
b_adj_val <- coef(mod)["exerciseYes"]
tibble(
Covariate = name,
`Crude β (exercise)` = round(b_crude, 4),
`Adjusted β (exercise)` = round(b_adj_val, 4),
`% Change` = round(abs(b_crude - b_adj_val) / abs(b_crude) * 100, 1),
Confounder = ifelse(abs(b_crude - b_adj_val) / abs(b_crude) * 100 > 10,
"Yes (>10%)", "No")
)
})
conf_table## # A tibble: 5 × 5
## Covariate `Crude β (exercise)` `Adjusted β (exercise)` `% Change` Confounder
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 Age -4.71 -4.55 3.4 No
## 2 Sex -4.71 -4.70 0.3 No
## 3 Sleep -4.71 -4.68 0.6 No
## 4 Education -4.71 -4.39 6.8 No
## 5 Income -4.71 -3.94 16.4 Yes (>10%)
conf_table |>
kable(caption = "Systematic Confounding Assessment: One-at-a-Time Addition") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Covariate | Crude β (exercise) | Adjusted β (exercise) | % Change | Confounder |
|---|---|---|---|---|
| Age | -4.7115 | -4.5504 | 3.4 | No |
| Sex | -4.7115 | -4.6974 | 0.3 | No |
| Sleep | -4.7115 | -4.6831 | 0.6 | No |
| Education | -4.7115 | -4.3912 | 6.8 | No |
| Income | -4.7115 | -3.9406 | 16.4 | Yes (>10%) |
Income is a confounder of the exercise - physical health relationship as the percent change between the crude and adjusted coefficients is >10%. The other variables age, sex, sleep, and education are not confounders as the percent change between the crude and adjusted coefficients is <10% for each.
Present your results in a single summary table.
4c. (5 pts) Fit a fully adjusted model including exercise and all identified confounders. Report the exercise coefficient and compare it to the crude estimate. How much did the estimate change overall?
full_model <- lm(physhlth_days ~ exercise * age + sex + sleep_hrs + education + income_cat, data = brfss_ci)
tidy(full_model, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Full Model",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 9.5496 | 1.1017 | 8.6681 | 0.0000 | 7.3898 | 11.7094 |
| exerciseYes | -0.4797 | 0.9525 | -0.5037 | 0.6145 | -2.3470 | 1.3876 |
| age | 0.0785 | 0.0143 | 5.4826 | 0.0000 | 0.0505 | 0.1066 |
| sexFemale | 0.0092 | 0.2169 | 0.0426 | 0.9660 | -0.4159 | 0.4344 |
| sleep_hrs | -0.4096 | 0.0795 | -5.1521 | 0.0000 | -0.5655 | -0.2538 |
| educationHS graduate | -1.0692 | 0.5083 | -2.1034 | 0.0355 | -2.0658 | -0.0726 |
| educationSome college | -0.7924 | 0.5145 | -1.5401 | 0.1236 | -1.8012 | 0.2163 |
| educationCollege graduate | -1.0013 | 0.5222 | -1.9175 | 0.0552 | -2.0251 | 0.0224 |
| income_cat | -0.6311 | 0.0589 | -10.7067 | 0.0000 | -0.7467 | -0.5156 |
| exerciseYes:age | -0.0567 | 0.0159 | -3.5588 | 0.0004 | -0.0880 | -0.0255 |
b_crude_full = coef(full_model)["exerciseYes"]
change <- abs(b_crude - b_crude_full) / abs(b_crude) * 100
change## exerciseYes
## 89.81761
The exercise coefficient is -0.4797, compared to the crude estimate of -4.7115. The estimate decreased by almost 90%.
4d. (5 pts) Is gen_health a confounder
or a mediator of the exercise-physical health relationship? Could it be
both? Explain your reasoning with reference to the three conditions for
confounding and the concept of the causal pathway.
Gen health is likely a confounder of the exercise-physical health relationship. It is associated with the exposure of exercise and the outcome of physical health. Individuals with good general health will be more likely to engage in exercise, and individuals with good general health are also more likely to have good physical health. It is not likely that exercise elicits an effect on physical health via general health – exercise likely has an effect on physical health directly. Thus it is unlikely that gen health lies on the causal pathway between exercise and physical health.
It cannot be both a confounder or a mediator because a mediator lies on the causal pathway, but a confounder cannot lie on the causal pathway.
5a. (10 pts) Based on your analyses, write a 4–5 sentence paragraph for a public health audience summarizing:
The association between exercise and physical health prior to adjusting for other variables was -4.7115, so exercising was associated with 4.7115 LESS physically unhealthy days per month. There was a significant interaction between age and exercise, so age affected the physical health-exercise relationship. Income was a confounder of the exercise-physical health relationship – it was associated with both exercise and physical health and therefore served to distort the true relationship between exercise and physical health. After adjusting for other variables, the association between exercise and physical health was -0.4797. Thus, a negative relationship was still observed, but with a much lower magnitude after adjusting for other variables. Because BRFSS is a cross-sectional study with data collected at a single point in time, we cannot infer a causal relationship between any variables, and there may be some additional confounders that remain unknown and unaccounted for.
Do not use statistical jargon.
5b. (10 pts) A colleague suggests including
gen_health as a covariate in the final model because it
changes the exercise coefficient by more than 10%. You disagree. Write a
3–4 sentence argument explaining why adjusting for
general health may not be appropriate if the goal is to estimate the
total effect of exercise on physical health days. Use the concept of
mediation in your argument.
Removing gen health may not be appropriate if it is a mediating variable. Because the research aim is to determine the total effect of exercise on physically unhealthy days, the study should account for the potential indirect pathway of exercise -> gen health -> physical health. By removing gen health, the relationship between exercise and physical health days might be underestimated.
End of Lab Activity