In previous lectures, we built multiple linear regression models that included several predictors. We interpreted each coefficient as the expected change in \(Y\) for a one-unit increase in \(X_j\), “holding all other predictors constant.” But we did not examine two critical methodological questions:
Confounding: Does the estimated effect of our exposure change when we add or remove covariates? If so, those covariates are confounders, and we need them in the model to get an unbiased estimate.
Interaction (Effect Modification): Is the effect of our exposure the same for everyone, or does it differ across subgroups? If the effect of sleep on physical health is different for men and women, we say that sex modifies the effect of sleep.
These are the two most important methodological concepts in associative (etiologic) modeling, the type of modeling most common in epidemiology.
Research question for today:
Is the association between sleep duration and physically unhealthy days modified by sex or education? And which variables confound this association?
Before we dive in, it is important to revisit the distinction between the two primary goals of regression modeling, because confounding and interaction are only relevant in one of them.
| Goal | Question | What matters most |
|---|---|---|
| Predictive | Which set of variables best predicts \(Y\)? | Overall model accuracy (\(R^2\), RMSE, out-of-sample prediction) |
| Associative | What is the effect of a specific exposure on \(Y\), after adjusting for confounders? | Accuracy and interpretability of a specific \(\hat{\beta}\) |
In predictive modeling, we want the combination of variables that minimizes prediction error. We do not care whether individual coefficients are “correct” or interpretable, only whether the model predicts well.
In associative modeling, we care deeply about one (or a few) specific coefficients. We want to ensure that \(\hat{\beta}_{\text{exposure}}\) reflects the true relationship, free from confounding bias. This is the setting where confounding and interaction become critical.
In epidemiology, we are almost always doing associative modeling. We have a specific exposure of interest (e.g., sleep) and want to estimate its effect on a health outcome (e.g., physical health days) while controlling for confounders.
We should never extrapolate predictions from a statistical model beyond the range of the observed data. Extrapolation assumes that the relationship continues unchanged into regions where we have no data, which is often false.
A famous example: NASA engineers extrapolated O-ring performance data to predict behavior at temperatures colder than any previously tested. The resulting decision to launch the Space Shuttle Challenger in cold weather led to the 1986 disaster.
Rule of thumb: Your model is only valid within the range of the data used to build it.
library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(GGally)
library(car)
library(ggeffects)
library(plotly)
library(lmtest)
options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))We continue with the BRFSS 2020 dataset. For this lecture we shift our outcome to physically unhealthy days and examine sleep as the primary exposure, with sex, education, age, exercise, and general health as potential modifiers and confounders.
brfss_full <- read_xpt(
"/Users/mm992584/Library/CloudStorage/OneDrive-UniversityatAlbany-SUNY/Spring 2024/Papers/BRFSS Project/Final Analysis/BRFSS Data 2013 - 2023/LLCP2020.XPT"
) |>
clean_names()brfss_ci <- brfss_full |>
mutate(
# Outcome: physically unhealthy days in past 30
physhlth_days = case_when(
physhlth == 88 ~ 0,
physhlth >= 1 & physhlth <= 30 ~ as.numeric(physhlth),
TRUE ~ NA_real_
),
# Primary exposure: sleep hours
sleep_hrs = case_when(
sleptim1 >= 1 & sleptim1 <= 14 ~ as.numeric(sleptim1),
TRUE ~ NA_real_
),
# Mentally unhealthy days (covariate)
menthlth_days = case_when(
menthlth == 88 ~ 0,
menthlth >= 1 & menthlth <= 30 ~ as.numeric(menthlth),
TRUE ~ NA_real_
),
# Age
age = age80,
# Sex
sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
# Education (4-level)
education = factor(case_when(
educa %in% c(1, 2, 3) ~ "Less than HS",
educa == 4 ~ "HS graduate",
educa == 5 ~ "Some college",
educa == 6 ~ "College graduate",
TRUE ~ NA_character_
), levels = c("Less than HS", "HS graduate", "Some college", "College graduate")),
# Exercise in past 30 days
exercise = factor(case_when(
exerany2 == 1 ~ "Yes",
exerany2 == 2 ~ "No",
TRUE ~ NA_character_
), levels = c("No", "Yes")),
# General health status
gen_health = factor(case_when(
genhlth == 1 ~ "Excellent",
genhlth == 2 ~ "Very good",
genhlth == 3 ~ "Good",
genhlth == 4 ~ "Fair",
genhlth == 5 ~ "Poor",
TRUE ~ NA_character_
), levels = c("Excellent", "Very good", "Good", "Fair", "Poor")),
# Income category (ordinal 1-8)
income_cat = case_when(
income2 %in% 1:8 ~ as.numeric(income2),
TRUE ~ NA_real_
)
) |>
filter(
!is.na(physhlth_days),
!is.na(sleep_hrs),
!is.na(menthlth_days),
!is.na(age), age >= 18,
!is.na(sex),
!is.na(education),
!is.na(exercise),
!is.na(gen_health),
!is.na(income_cat)
)
# Reproducible random sample
set.seed(1220)
brfss_ci <- brfss_ci |>
select(physhlth_days, sleep_hrs, menthlth_days, age, sex,
education, exercise, gen_health, income_cat) |>
slice_sample(n = 5000)
# Save for lab activity
saveRDS(brfss_ci,
"/Users/mm992584/Library/CloudStorage/OneDrive-UniversityatAlbany-SUNY/Spring 2026/Epi 553/Lectures Notes/Confounding Interaction/brfss_ci_2020.rds")
tibble(Metric = c("Observations", "Variables"),
Value = c(nrow(brfss_ci), ncol(brfss_ci))) |>
kable(caption = "Analytic Dataset Dimensions") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Metric | Value |
|---|---|
| Observations | 5000 |
| Variables | 9 |
brfss_ci |>
select(physhlth_days, sleep_hrs, age, sex, education, exercise, gen_health) |>
tbl_summary(
label = list(
physhlth_days ~ "Physically unhealthy days (past 30)",
sleep_hrs ~ "Sleep (hours/night)",
age ~ "Age (years)",
sex ~ "Sex",
education ~ "Education level",
exercise ~ "Any physical activity (past 30 days)",
gen_health ~ "General health status"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1,
missing = "no"
) |>
add_n() |>
bold_labels() |>
italicize_levels() |>
modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 (n = 5,000)**") |>
as_flex_table()Characteristic | N | N = 5,0001 |
|---|---|---|
Physically unhealthy days (past 30) | 5,000 | 3.4 (7.9) |
Sleep (hours/night) | 5,000 | 7.1 (1.4) |
Age (years) | 5,000 | 54.1 (17.0) |
Sex | 5,000 | |
Male | 2,291 (46%) | |
Female | 2,709 (54%) | |
Education level | 5,000 | |
Less than HS | 276 (5.5%) | |
HS graduate | 1,263 (25%) | |
Some college | 1,378 (28%) | |
College graduate | 2,083 (42%) | |
Any physical activity (past 30 days) | 5,000 | 3,929 (79%) |
General health status | 5,000 | |
Excellent | 1,031 (21%) | |
Very good | 1,762 (35%) | |
Good | 1,507 (30%) | |
Fair | 524 (10%) | |
Poor | 176 (3.5%) | |
1Mean (SD); n (%) | ||
Interaction (also called effect modification) is present when the relationship between an exposure and an outcome is different at different levels of a third variable. In regression terms, the slope of the exposure-outcome relationship changes depending on the value of the modifier.
For example, if the effect of sleep on physical health is stronger for women than for men, then sex modifies the effect of sleep. The two variables (sleep and sex) have a multiplicative, not merely additive, effect on the outcome.
This is fundamentally different from confounding:
| Concept | Question | Implication |
|---|---|---|
| Confounding | Is the crude estimate of the exposure effect biased by a third variable? | Must adjust for the confounder to get a valid estimate |
| Interaction | Does the effect of the exposure differ across subgroups? | Must report stratum-specific effects, not a single overall estimate |
Critical point: Always assess interaction before confounding. If interaction is present, a single “adjusted” coefficient for the exposure is misleading because the effect is not the same for everyone. You must stratify or include interaction terms.
Consider a model with a continuous exposure \(X_1\) (sleep hours) and a dichotomous variable \(X_2\) (sex, where Male = 1 and Female = 0):
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \varepsilon\]
The term \(\beta_3 X_1 X_2\) is the interaction term. Let’s see what happens when we plug in the values for each group:
For males (\(X_2 = 1\)): \[E(Y | X_1, \text{Male}) = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) X_1\]
For females (\(X_2 = 0\)): \[E(Y | X_1, \text{Female}) = \beta_0 + \beta_1 X_1\]
The key insight:
If \(\beta_3 = 0\), the slopes are equal and the lines are parallel (no interaction). If \(\beta_3 \neq 0\), the lines are not parallel (interaction is present).
# Create synthetic data to illustrate the concept
set.seed(1220)
n <- 200
concept_data <- tibble(
x1 = runif(n, 0, 10),
group = rep(c("Group A", "Group B"), each = n / 2)
)
# No interaction: same slope, different intercepts
no_int <- concept_data |>
mutate(
y = ifelse(group == "Group A", 2 + 1.5 * x1, 5 + 1.5 * x1) + rnorm(n, 0, 2)
)
# Interaction: different slopes
with_int <- concept_data |>
mutate(
y = ifelse(group == "Group A", 2 + 1.5 * x1, 5 + 0.3 * x1) + rnorm(n, 0, 2)
)
p1 <- ggplot(no_int, aes(x = x1, y = y, color = group)) +
geom_point(alpha = 0.3, size = 1) +
geom_smooth(method = "lm", se = FALSE, linewidth = 1.2) +
labs(title = "No Interaction", subtitle = "Parallel lines: same slope",
x = expression(X[1]), y = "Y") +
theme_minimal(base_size = 12) +
scale_color_brewer(palette = "Set1") +
theme(legend.position = "bottom")
p2 <- ggplot(with_int, aes(x = x1, y = y, color = group)) +
geom_point(alpha = 0.3, size = 1) +
geom_smooth(method = "lm", se = FALSE, linewidth = 1.2) +
labs(title = "Interaction Present", subtitle = "Non-parallel lines: different slopes",
x = expression(X[1]), y = "Y") +
theme_minimal(base_size = 12) +
scale_color_brewer(palette = "Set1") +
theme(legend.position = "bottom")
gridExtra::grid.arrange(p1, p2, ncol = 2)No Interaction (Parallel Lines) vs. Interaction (Non-Parallel Lines)
One way to assess interaction is to split the data by levels of the potential modifier and fit separate regression models in each stratum. If the slopes differ meaningfully, interaction may be present.
Let’s test whether the association between sleep and physical health days differs by sex.
# Fit separate models for males and females
mod_male <- lm(physhlth_days ~ sleep_hrs, data = brfss_ci |> filter(sex == "Male"))
mod_female <- lm(physhlth_days ~ sleep_hrs, data = brfss_ci |> filter(sex == "Female"))
# Compare coefficients
bind_rows(
tidy(mod_male, conf.int = TRUE) |> mutate(Stratum = "Male"),
tidy(mod_female, conf.int = TRUE) |> mutate(Stratum = "Female")
) |>
filter(term == "sleep_hrs") |>
select(Stratum, estimate, std.error, conf.low, conf.high, p.value) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Stratified Analysis: Sleep → Physical Health Days, by Sex",
col.names = c("Stratum", "Estimate", "SE", "CI Lower", "CI Upper", "p-value")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Stratum | Estimate | SE | CI Lower | CI Upper | p-value |
|---|---|---|---|---|---|
| Male | -0.5243 | 0.1193 | -0.7583 | -0.2903 | 0.0000 |
| Female | -0.3752 | 0.1144 | -0.5996 | -0.1508 | 0.0011 |
ggplot(brfss_ci, aes(x = sleep_hrs, y = physhlth_days, color = sex)) +
geom_jitter(alpha = 0.08, width = 0.3, height = 0.3, size = 0.8) +
geom_smooth(method = "lm", se = TRUE, linewidth = 1.2) +
labs(
title = "Association Between Sleep and Physical Health Days, Stratified by Sex",
subtitle = "Are the slopes parallel or different?",
x = "Sleep Hours per Night",
y = "Physically Unhealthy Days (Past 30)",
color = "Sex"
) +
theme_minimal(base_size = 13) +
scale_color_brewer(palette = "Set1")Stratified Regression: Sleep vs. Physical Health Days by Sex
The stratified approach is intuitive and visually compelling, but it has important limitations:
The solution is to use a single regression model with an interaction term.
R provides two operators for specifying interactions:
X1:X2 — includes only
the interaction term \(X_1 \times
X_2\)X1*X2 — shorthand for
X1 + X2 + X1:X2 (main effects plus
interaction)Rule: Any model with an interaction term must also include the main effects that comprise the interaction. Always use
*or explicitly include both main effects with:.
# Model without interaction (additive model)
mod_add <- lm(physhlth_days ~ sleep_hrs + sex, data = brfss_ci)
# Model with interaction
mod_int <- lm(physhlth_days ~ sleep_hrs * sex, data = brfss_ci)
# Equivalent to: lm(physhlth_days ~ sleep_hrs + sex + sleep_hrs:sex, data = brfss_ci)
tidy(mod_int, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Interaction Model: Sleep * Sex → Physical Health Days",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 6.8388 | 0.8723 | 7.8397 | 0.0000 | 5.1287 | 8.5490 |
| sleep_hrs | -0.5243 | 0.1222 | -4.2898 | 0.0000 | -0.7639 | -0.2847 |
| sexFemale | -0.5786 | 1.1906 | -0.4860 | 0.6270 | -2.9127 | 1.7555 |
| sleep_hrs:sexFemale | 0.1491 | 0.1659 | 0.8986 | 0.3689 | -0.1762 | 0.4744 |
The fitted model is:
\[\widehat{\text{Phys Days}} = 6.839 + -0.524(\text{Sleep}) + -0.579(\text{Female}) + 0.149(\text{Sleep} \times \text{Female})\]
For males (Female = 0): \[\widehat{\text{Phys Days}} = 6.839 + -0.524(\text{Sleep})\]
For females (Female = 1): \[\widehat{\text{Phys Days}} = (6.839 + -0.579) + (-0.524 + 0.149)(\text{Sleep}) = 6.26 + -0.375(\text{Sleep})\]
Interpretation of each coefficient:
The formal test for interaction is simply the t-test for the interaction coefficient:
\[H_0: \beta_3 = 0 \quad \text{(slopes are equal, lines are parallel, no interaction)}\] \[H_A: \beta_3 \neq 0 \quad \text{(slopes differ, interaction is present)}\]
int_term <- tidy(mod_int) |> filter(term == "sleep_hrs:sexFemale")
cat("Interaction term (sleep_hrs:sexFemale):\n")## Interaction term (sleep_hrs:sexFemale):
## Estimate: 0.1491
## t-statistic: 0.899
## p-value: 0.3689
A stronger test asks whether sex has any effect on the relationship (neither intercept nor slope differs):
\[H_0: \beta_2 = \beta_3 = 0 \quad \text{(the two lines are identical)}\] \[H_A: \text{At least one } \neq 0 \quad \text{(the lines differ in intercept and/or slope)}\]
This is a partial F-test comparing the interaction model to a model with only sleep:
mod_sleep_only <- lm(physhlth_days ~ sleep_hrs, data = brfss_ci)
anova(mod_sleep_only, mod_int) |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Test for Coincidence: Does Sex Affect the Sleep-Physical Health Relationship at All?") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | df.residual | rss | df | sumsq | statistic | p.value |
|---|---|---|---|---|---|---|
| physhlth_days ~ sleep_hrs | 4998 | 313611.5 | NA | NA | NA | NA |
| physhlth_days ~ sleep_hrs * sex | 4996 | 313284.5 | 2 | 327.0047 | 2.6074 | 0.0738 |
The interaction model recovers exactly the same stratum-specific equations as the stratified analysis:
tribble(
~Method, ~Stratum, ~Intercept, ~Slope,
"Stratified", "Male",
round(coef(mod_male)[1], 3), round(coef(mod_male)[2], 3),
"Stratified", "Female",
round(coef(mod_female)[1], 3), round(coef(mod_female)[2], 3),
"Interaction model", "Male",
round(b_int[1], 3), round(b_int[2], 3),
"Interaction model", "Female",
round(b_int[1] + b_int[3], 3), round(b_int[2] + b_int[4], 3)
) |>
kable(caption = "Verification: Stratified Analysis = Interaction Model") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Method | Stratum | Intercept | Slope |
|---|---|---|---|
| Stratified | Male | 6.839 | -0.524 |
| Stratified | Female | 6.260 | -0.375 |
| Interaction model | Male | 6.839 | -0.524 |
| Interaction model | Female | 6.260 | -0.375 |
The interaction model and the stratified analysis give identical stratum-specific equations. The advantage of the interaction model is that it provides a formal test of the difference between slopes and allows adjustment for additional covariates.
pred_int <- ggpredict(mod_int, terms = c("sleep_hrs [3:12]", "sex"))
ggplot(pred_int, aes(x = x, y = predicted, color = group, fill = group)) +
geom_line(linewidth = 1.2) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.12, color = NA) +
labs(
title = "Predicted Physical Health Days by Sleep Duration and Sex",
subtitle = "From interaction model: sleep_hrs * sex",
x = "Sleep Hours per Night",
y = "Predicted Physically Unhealthy Days",
color = "Sex",
fill = "Sex"
) +
theme_minimal(base_size = 13) +
scale_color_brewer(palette = "Set1")Interaction Model: Predicted Physical Health Days by Sleep and Sex
When the modifier has \(k > 2\) categories (such as education with 4 levels), the interaction between a continuous exposure and the categorical modifier requires \(k - 1\) interaction terms, one for each dummy variable.
For sleep \(\times\) education (with “Less than HS” as reference):
\[Y = \beta_0 + \beta_1(\text{Sleep}) + \beta_2(\text{HS grad}) + \beta_3(\text{Some college}) + \beta_4(\text{College grad}) + \beta_5(\text{Sleep} \times \text{HS grad}) + \beta_6(\text{Sleep} \times \text{Some college}) + \beta_7(\text{Sleep} \times \text{College grad}) + \varepsilon\]
Each interaction coefficient \(\beta_5, \beta_6, \beta_7\) represents the difference in the sleep slope between that education group and the reference group.
mod_int_educ <- lm(physhlth_days ~ sleep_hrs * education, data = brfss_ci)
tidy(mod_int_educ, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Interaction Model: Sleep * Education → Physical Health Days",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 11.9692 | 2.0406 | 5.8655 | 0.0000 | 7.9687 | 15.9697 |
| sleep_hrs | -0.7877 | 0.2919 | -2.6986 | 0.0070 | -1.3599 | -0.2155 |
| educationHS graduate | -4.2823 | 2.3006 | -1.8613 | 0.0628 | -8.7925 | 0.2280 |
| educationSome college | -5.1543 | 2.3032 | -2.2379 | 0.0253 | -9.6695 | -0.6391 |
| educationCollege graduate | -8.8728 | 2.3020 | -3.8545 | 0.0001 | -13.3857 | -4.3600 |
| sleep_hrs:educationHS graduate | 0.2713 | 0.3274 | 0.8287 | 0.4073 | -0.3705 | 0.9131 |
| sleep_hrs:educationSome college | 0.3386 | 0.3280 | 1.0322 | 0.3020 | -0.3045 | 0.9816 |
| sleep_hrs:educationCollege graduate | 0.6889 | 0.3268 | 2.1079 | 0.0351 | 0.0482 | 1.3297 |
Individual t-tests for each interaction term may be non-significant, yet the interaction as a whole could be meaningful. To test whether sleep \(\times\) education is significant overall, we use a partial F-test comparing the model with and without the interaction terms:
\[H_0: \beta_5 = \beta_6 = \beta_7 = 0 \quad \text{(the sleep slope is the same across all education levels)}\]
mod_no_int_educ <- lm(physhlth_days ~ sleep_hrs + education, data = brfss_ci)
anova(mod_no_int_educ, mod_int_educ) |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Partial F-test: Is the Sleep x Education Interaction Significant?") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | df.residual | rss | df | sumsq | statistic | p.value |
|---|---|---|---|---|---|---|
| physhlth_days ~ sleep_hrs + education | 4995 | 308368.1 | NA | NA | NA | NA |
| physhlth_days ~ sleep_hrs * education | 4992 | 307957.2 | 3 | 410.9464 | 2.2205 | 0.0836 |
pred_int_educ <- ggpredict(mod_int_educ, terms = c("sleep_hrs [3:12]", "education"))
ggplot(pred_int_educ, aes(x = x, y = predicted, color = group, fill = group)) +
geom_line(linewidth = 1.1) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.1, color = NA) +
labs(
title = "Predicted Physical Health Days by Sleep and Education",
subtitle = "Are the lines parallel? Non-parallel lines indicate interaction.",
x = "Sleep Hours per Night",
y = "Predicted Physically Unhealthy Days",
color = "Education",
fill = "Education"
) +
theme_minimal(base_size = 13) +
scale_color_brewer(palette = "Set2")Interaction: Sleep Effect by Education Level
Interaction is not limited to categorical modifiers. We can also examine whether the effect of one continuous predictor changes across values of another continuous predictor. For example, does the sleep-physical health association differ by age?
\[Y = \beta_0 + \beta_1(\text{Sleep}) + \beta_2(\text{Age}) + \beta_3(\text{Sleep} \times \text{Age}) + \varepsilon\]
Here, \(\beta_3\) tells us how the slope of sleep changes for each one-year increase in age:
mod_int_cont <- lm(physhlth_days ~ sleep_hrs * age, data = brfss_ci)
tidy(mod_int_cont, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Continuous Interaction: Sleep * Age → Physical Health Days",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 9.6494 | 1.9752 | 4.8852 | 0.0000 | 5.7771 | 13.5217 |
| sleep_hrs | -1.2597 | 0.2758 | -4.5668 | 0.0000 | -1.8005 | -0.7189 |
| age | -0.0490 | 0.0351 | -1.3948 | 0.1631 | -0.1178 | 0.0199 |
| sleep_hrs:age | 0.0138 | 0.0048 | 2.8444 | 0.0045 | 0.0043 | 0.0232 |
With continuous modifiers, we visualize the interaction by plotting the predicted relationship at specific values of the modifier (e.g., age = 30, 50, 70):
pred_cont <- ggpredict(mod_int_cont, terms = c("sleep_hrs [3:12]", "age [30, 50, 70]"))
ggplot(pred_cont, aes(x = x, y = predicted, color = group, fill = group)) +
geom_line(linewidth = 1.2) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.12, color = NA) +
labs(
title = "Predicted Physical Health Days by Sleep, at Different Ages",
subtitle = "Does the sleep-health relationship change with age?",
x = "Sleep Hours per Night",
y = "Predicted Physically Unhealthy Days",
color = "Age",
fill = "Age"
) +
theme_minimal(base_size = 13) +
scale_color_brewer(palette = "Dark2")Interaction: Sleep Effect at Different Ages
If the lines are approximately parallel, age does not modify the sleep effect. If they fan out or converge, the interaction is meaningful.
Confounding exists when the estimated association between an exposure and an outcome is distorted because of a third variable that is related to both. When confounding is present, ignoring the confounder leads to a biased estimate of the exposure effect.
For a variable to be a confounder, it must satisfy three conditions:
Confounding Structure: The Confounder Affects Both Exposure and Outcome
Age is a classic confounder of the sleep-physical health relationship:
The standard approach in epidemiology is to compare the crude (unadjusted) estimate to the adjusted estimate:
Important: This is a rule of thumb, not a rigid cutoff. The 10% threshold is conventional, not absolute. Some researchers use 5% or 15% depending on context.
# Crude model: sleep → physical health days
mod_crude <- lm(physhlth_days ~ sleep_hrs, data = brfss_ci)
# Adjusted model: adding age
mod_adj_age <- lm(physhlth_days ~ sleep_hrs + age, data = brfss_ci)
# Extract sleep coefficients
b_crude <- coef(mod_crude)["sleep_hrs"]
b_adj <- coef(mod_adj_age)["sleep_hrs"]
pct_change <- abs(b_crude - b_adj) / abs(b_crude) * 100
tribble(
~Model, ~`Sleep β`, ~SE, ~`95% CI`,
"Crude (sleep only)",
round(b_crude, 4),
round(tidy(mod_crude) |> filter(term == "sleep_hrs") |> pull(std.error), 4),
paste0("(", round(confint(mod_crude)["sleep_hrs", 1], 4), ", ",
round(confint(mod_crude)["sleep_hrs", 2], 4), ")"),
"Adjusted (+ age)",
round(b_adj, 4),
round(tidy(mod_adj_age) |> filter(term == "sleep_hrs") |> pull(std.error), 4),
paste0("(", round(confint(mod_adj_age)["sleep_hrs", 1], 4), ", ",
round(confint(mod_adj_age)["sleep_hrs", 2], 4), ")")
) |>
kable(caption = "Confounding Assessment: Does Age Confound the Sleep-Physical Health Association?") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Model | Sleep β | SE | 95% CI |
|---|---|---|---|
| Crude (sleep only) | -0.4381 | 0.0827 | (-0.6002, -0.2761) |
| Adjusted (+ age) | -0.5112 | 0.0828 | (-0.6736, -0.3489) |
##
## Percent change in sleep coefficient when adding age: 16.7 %
cat("10% rule threshold: |", round(b_adj, 4), "| should fall within (",
round(b_crude * 0.9, 4), ",", round(b_crude * 1.1, 4), ")\n")## 10% rule threshold: | -0.5112 | should fall within ( -0.3943 , -0.4819 )
In practice, we assess multiple potential confounders. Let’s evaluate age, exercise, education, and general health:
# Crude model
b_crude_val <- coef(mod_crude)["sleep_hrs"]
# One-at-a-time adjusted models
confounders <- list(
"Age" = lm(physhlth_days ~ sleep_hrs + age, data = brfss_ci),
"Exercise" = lm(physhlth_days ~ sleep_hrs + exercise, data = brfss_ci),
"Education" = lm(physhlth_days ~ sleep_hrs + education, data = brfss_ci),
"General health" = lm(physhlth_days ~ sleep_hrs + gen_health, data = brfss_ci),
"Sex" = lm(physhlth_days ~ sleep_hrs + sex, data = brfss_ci),
"Income" = lm(physhlth_days ~ sleep_hrs + income_cat, data = brfss_ci)
)
conf_table <- map_dfr(names(confounders), \(name) {
mod <- confounders[[name]]
b_adj_val <- coef(mod)["sleep_hrs"]
tibble(
Covariate = name,
`Crude β (sleep)` = round(b_crude_val, 4),
`Adjusted β (sleep)` = round(b_adj_val, 4),
`% Change` = round(abs(b_crude_val - b_adj_val) / abs(b_crude_val) * 100, 1),
Confounder = ifelse(abs(b_crude_val - b_adj_val) / abs(b_crude_val) * 100 > 10,
"Yes (>10%)", "No")
)
})
conf_table |>
kable(caption = "Systematic Confounding Assessment: One-at-a-Time Addition") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
column_spec(5, bold = TRUE)| Covariate | Crude β (sleep) | Adjusted β (sleep) | % Change | Confounder |
|---|---|---|---|---|
| Age | -0.4381 | -0.5112 | 16.7 | Yes (>10%) |
| Exercise | -0.4381 | -0.4083 | 6.8 | No |
| Education | -0.4381 | -0.3872 | 11.6 | Yes (>10%) |
| General health | -0.4381 | -0.1948 | 55.5 | Yes (>10%) |
| Sex | -0.4381 | -0.4434 | 1.2 | No |
| Income | -0.4381 | -0.3739 | 14.7 | Yes (>10%) |
Identifying candidate confounders is not purely a statistical exercise. The three conditions for confounding (associated with exposure, associated with outcome, not on the causal pathway) require substantive knowledge from the literature and your understanding of the causal structure.
Caution about missing data: When you add a covariate to the model, observations with missing values on that covariate are dropped. This changes the analytic sample, which could alter \(\hat{\beta}\) for reasons unrelated to confounding. Always ensure that the crude and adjusted models are fit on the same set of observations.
# Verify our dataset has no missing values (we filtered earlier)
cat("Missing values in analytic dataset:", sum(!complete.cases(brfss_ci)), "\n")## Missing values in analytic dataset: 0
## All models are fit on the same n = 5000 observations.
Confounders vs. mediators: A variable on the causal pathway between exposure and outcome is a mediator, not a confounder. For example, if sleep affects exercise habits, which in turn affect physical health, then exercise is a mediator. Adjusting for a mediator would attenuate the exposure effect, but this attenuation is not bias; it reflects the removal of an indirect pathway. Whether to adjust depends on your research question.
General health status is a tricky case. It could be a confounder (poor overall health causes both poor sleep and more physically unhealthy days) or a mediator (poor sleep leads to poor general health, which leads to more physically unhealthy days). The direction depends on the assumed causal structure and should be guided by subject-matter knowledge.
The standard epidemiological approach to model building follows this order:
The reason for this order is that confounding assessment assumes a single exposure effect. If the effect actually varies across subgroups (interaction), then a single “adjusted” coefficient is misleading. Reporting an average effect when the true effect differs by sex, for example, could mask important public health heterogeneity.
| Step | Action |
|---|---|
| 1 | Specify the exposure-outcome relationship of interest |
| 2 | Identify potential effect modifiers (from literature, biological plausibility) |
| 3 | Test for interaction using interaction terms or stratified analysis |
| 4 | If interaction is present: report stratum-specific effects; assess confounding within strata |
| 5 | If no interaction: assess confounding in the full sample using the 10% change-in-estimate rule |
Let’s build a final model that accounts for both potential interaction and confounding in our BRFSS data:
# Step 1: Test for interaction between sleep and sex
mod_final_int <- lm(physhlth_days ~ sleep_hrs * sex + age + education + exercise,
data = brfss_ci)
# Check interaction term
int_pval <- tidy(mod_final_int) |>
filter(term == "sleep_hrs:sexFemale") |>
pull(p.value)
cat("Interaction p-value (sleep x sex):", round(int_pval, 4), "\n")## Interaction p-value (sleep x sex): 0.0417
# Step 2: If interaction is not significant, fit additive model with confounders
mod_final <- lm(physhlth_days ~ sleep_hrs + sex + age + education + exercise,
data = brfss_ci)
tidy(mod_final, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Final Model: Sleep → Physical Health Days, Adjusted for Confounders",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 9.9181 | 0.7955 | 12.4676 | 0.0000 | 8.3585 | 11.4776 |
| sleep_hrs | -0.4318 | 0.0805 | -5.3662 | 0.0000 | -0.5895 | -0.2740 |
| sexFemale | 0.2862 | 0.2178 | 1.3140 | 0.1889 | -0.1408 | 0.7132 |
| age | 0.0355 | 0.0065 | 5.4878 | 0.0000 | 0.0228 | 0.0482 |
| educationHS graduate | -1.9241 | 0.5085 | -3.7838 | 0.0002 | -2.9209 | -0.9272 |
| educationSome college | -2.0609 | 0.5065 | -4.0688 | 0.0000 | -3.0539 | -1.0679 |
| educationCollege graduate | -2.9504 | 0.4955 | -5.9548 | 0.0000 | -3.9217 | -1.9791 |
| exerciseYes | -4.1551 | 0.2711 | -15.3273 | 0.0000 | -4.6866 | -3.6236 |
b_final <- coef(mod_final)["sleep_hrs"]
tribble(
~Model, ~`Sleep β`, ~`% Change from Crude`,
"Crude", round(b_crude_val, 4), "—",
"Adjusted (age only)", round(coef(mod_adj_age)["sleep_hrs"], 4),
paste0(round(abs(b_crude_val - coef(mod_adj_age)["sleep_hrs"]) / abs(b_crude_val) * 100, 1), "%"),
"Final (all confounders)", round(b_final, 4),
paste0(round(abs(b_crude_val - b_final) / abs(b_crude_val) * 100, 1), "%")
) |>
kable(caption = "Sleep Coefficient: Crude vs. Progressively Adjusted Models") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Model | Sleep β | % Change from Crude |
|---|---|---|
| Crude | -0.4381 | — |
| Adjusted (age only) | -0.5112 | 16.7% |
| Final (all confounders) | -0.4318 | 1.5% |
| Concept | Key Point |
|---|---|
| Predictive vs. associative | Confounding and interaction matter for associative (etiologic) modeling |
| Interaction | The effect of the exposure differs across levels of a modifier |
| Stratified analysis | Fit separate models per stratum; informative but no formal test |
| Interaction term | \(X_1 \times X_2\) in the model; t-test on \(\beta_3\) tests parallelism |
: vs. * in R |
: = interaction only; * = main effects +
interaction |
| Must include main effects | Never include \(X_1 X_2\) without also including \(X_1\) and \(X_2\) |
| Partial F-test for interaction | Tests all \(k - 1\) interaction terms simultaneously |
| Continuous \(\times\) continuous | The slope of one predictor changes linearly with the other |
| Confounding | A third variable distorts the exposure-outcome association |
| Three conditions | Associated with exposure, associated with outcome, not on causal pathway |
| 10% change-in-estimate | Compare crude and adjusted \(\hat{\beta}\); >10% change = confounder |
| Interaction before confounding | Assess interaction first; if present, stratify before assessing confounding |
| Mediator vs. confounder | Adjusting for a mediator removes an indirect effect, not bias |
EPI 553 — Confounding and Interactions Lab Due: End of class, March 24, 2026
In this lab, you will assess interaction and confounding in the BRFSS 2020 dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.
Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.
Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents.
| Variable | Description | Type |
|---|---|---|
physhlth_days |
Physically unhealthy days in past 30 | Continuous (0–30) |
sleep_hrs |
Sleep hours per night | Continuous (1–14) |
menthlth_days |
Mentally unhealthy days in past 30 | Continuous (0–30) |
age |
Age in years (capped at 80) | Continuous |
sex |
Sex (Male/Female) | Factor |
education |
Education level (4 categories) | Factor |
exercise |
Any physical activity (Yes/No) | Factor |
gen_health |
General health status (5 categories) | Factor |
income_cat |
Household income (1–8 ordinal) | Numeric |
# Load the dataset
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(car)
library(ggeffects)
brfss_ci <- readRDS(
"/Users/mm992584/Library/CloudStorage/OneDrive-UniversityatAlbany-SUNY/Spring 2026/Epi 553/Lectures Notes/Confounding Interaction/brfss_ci_2020.rds"
)1a. (5 pts) Create a scatterplot of
physhlth_days (y-axis) vs. age (x-axis),
colored by exercise status. Add separate regression lines
for each group. Describe the pattern you observe.
1b. (5 pts) Compute the mean
physhlth_days for each combination of sex and
exercise. Present the results in a table. Does it appear
that the association between exercise and physical health days might
differ by sex?
1c. (5 pts) Create a scatterplot of
physhlth_days vs. sleep_hrs, faceted by
education level. Comment on whether the slopes appear
similar or different across education groups.
2a. (5 pts) Fit separate simple linear regression
models of physhlth_days ~ age for exercisers and
non-exercisers. Report the slope, SE, and 95% CI for age in each
stratum.
2b. (5 pts) Create a single plot showing the two fitted regression lines (one per exercise group) overlaid on the data. Are the lines approximately parallel?
2c. (5 pts) Can you formally test whether the two slopes are different using only the stratified results? Explain why or why not.
3a. (5 pts) Fit the interaction model:
physhlth_days ~ age * exercise. Write out the fitted
equation.
3b. (5 pts) Using the fitted equation, derive the stratum-specific equations for exercisers and non-exercisers. Verify that these match the stratified analysis from Task 2.
3c. (5 pts) Conduct the t-test for the interaction
term (age:exerciseYes). State the null and alternative
hypotheses, report the test statistic and p-value, and state your
conclusion about whether interaction is present.
3d. (5 pts) Now fit a model with an interaction
between age and education: physhlth_days ~ age * education.
How many interaction terms are produced? Use a partial F-test to test
whether the age \(\times\) education
interaction as a whole is significant.
3e. (5 pts) Create a visualization using
ggpredict() showing the predicted
physhlth_days by age for each education level. Do the lines
appear parallel?
For this task, the exposure is exercise
and the outcome is physhlth_days.
4a. (5 pts) Fit the crude model:
physhlth_days ~ exercise. Report the exercise coefficient.
This is the unadjusted estimate.
4b. (10 pts) Systematically assess whether each of
the following is a confounder of the exercise-physical health
association: age, sex, sleep_hrs,
education, and income_cat. For each:
physhlth_days ~ exercise + [covariate]Present your results in a single summary table.
4c. (5 pts) Fit a fully adjusted model including exercise and all identified confounders. Report the exercise coefficient and compare it to the crude estimate. How much did the estimate change overall?
4d. (5 pts) Is gen_health a confounder
or a mediator of the exercise-physical health relationship? Could it be
both? Explain your reasoning with reference to the three conditions for
confounding and the concept of the causal pathway.
5a. (10 pts) Based on your analyses, write a 4–5 sentence paragraph for a public health audience summarizing:
Do not use statistical jargon.
5b. (10 pts) A colleague suggests including
gen_health as a covariate in the final model because it
changes the exercise coefficient by more than 10%. You disagree. Write a
3–4 sentence argument explaining why adjusting for
general health may not be appropriate if the goal is to estimate the
total effect of exercise on physical health days. Use the concept of
mediation in your argument.
End of Lab Activity