In previous lectures, we built multiple linear regression models that included several predictors. We interpreted each coefficient as the expected change in \(Y\) for a one-unit increase in \(X_j\), “holding all other predictors constant.” But we did not examine two critical methodological questions:
Confounding: Does the estimated effect of our exposure change when we add or remove covariates? If so, those covariates are confounders, and we need them in the model to get an unbiased estimate.
Interaction (Effect Modification): Is the effect of our exposure the same for everyone, or does it differ across subgroups? If the effect of sleep on physical health is different for men and women, we say that sex modifies the effect of sleep.
These are the two most important methodological concepts in associative (etiologic) modeling, the type of modeling most common in epidemiology.
Research question for today:
Is the association between sleep duration and physically unhealthy days modified by sex or education? And which variables confound this association?
Before we dive in, it is important to revisit the distinction between the two primary goals of regression modeling, because confounding and interaction are only relevant in one of them.
| Goal | Question | What matters most |
|---|---|---|
| Predictive | Which set of variables best predicts \(Y\)? | Overall model accuracy (\(R^2\), RMSE, out-of-sample prediction) |
| Associative | What is the effect of a specific exposure on \(Y\), after adjusting for confounders? | Accuracy and interpretability of a specific \(\hat{\beta}\) |
In predictive modeling, we want the combination of variables that minimizes prediction error. We do not care whether individual coefficients are “correct” or interpretable, only whether the model predicts well.
In associative modeling, we care deeply about one (or a few) specific coefficients. We want to ensure that \(\hat{\beta}_{\text{exposure}}\) reflects the true relationship, free from confounding bias. This is the setting where confounding and interaction become critical.
In epidemiology, we are almost always doing associative modeling. We have a specific exposure of interest (e.g., sleep) and want to estimate its effect on a health outcome (e.g., physical health days) while controlling for confounders.
We should never extrapolate predictions from a statistical model beyond the range of the observed data. Extrapolation assumes that the relationship continues unchanged into regions where we have no data, which is often false.
A famous example: NASA engineers extrapolated O-ring performance data to predict behavior at temperatures colder than any previously tested. The resulting decision to launch the Space Shuttle Challenger in cold weather led to the 1986 disaster.
Rule of thumb: Your model is only valid within the range of the data used to build it.
# Install any missing packages automatically
required_pkgs <- c("tidyverse", "knitr", "kableExtra", "broom",
"gtsummary", "GGally", "car", "ggeffects",
"plotly", "lmtest", "gridExtra")
new_pkgs <- required_pkgs[!(required_pkgs %in% rownames(installed.packages()))]
if (length(new_pkgs)) install.packages(new_pkgs, repos = "https://cran.rstudio.com/")
library(tidyverse)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(GGally)
library(car)
library(ggeffects)
library(plotly)
library(lmtest)
library(gridExtra)
options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))We continue with the BRFSS 2020 dataset. For this lecture we shift our outcome to physically unhealthy days and examine sleep as the primary exposure, with sex, education, age, exercise, and general health as potential modifiers and confounders.
brfss_ci <- readRDS("/Users/samriddhi/Desktop/DrPH/HEPI_553_Muntasir Musum/brfss_ci.rds")
cat("Dataset loaded successfully.\n")## Dataset loaded successfully.
## Rows: 5000 | Columns: 9
## Variables: physhlth_days, sleep_hrs, menthlth_days, age, sex, education, exercise, gen_health, income_cat
## Missing values: 0
# The RDS file already contains the fully cleaned and recoded analytic sample.
# All variable construction (physhlth_days, sleep_hrs, sex, education, etc.)
# was completed by the instructor when the dataset was originally prepared.
# This chunk confirms the dataset is structured correctly before analysis.
tibble(
Variable = names(brfss_ci),
Type = map_chr(brfss_ci, ~ class(.x)[1]),
N_nonmissing = map_int(brfss_ci, ~ sum(!is.na(.x)))
) |>
kable(caption = "Analytic Dataset: Variable Overview") |>
kable_styling(bootstrap_options = "striped", full_width = FALSE)| Variable | Type | N_nonmissing |
|---|---|---|
| physhlth_days | numeric | 5000 |
| sleep_hrs | numeric | 5000 |
| menthlth_days | numeric | 5000 |
| age | numeric | 5000 |
| sex | factor | 5000 |
| education | factor | 5000 |
| exercise | factor | 5000 |
| gen_health | factor | 5000 |
| income_cat | numeric | 5000 |
brfss_ci |>
select(physhlth_days, sleep_hrs, age, sex, education, exercise, gen_health) |>
tbl_summary(
label = list(
physhlth_days ~ "Physically unhealthy days (past 30)",
sleep_hrs ~ "Sleep (hours/night)",
age ~ "Age (years)",
sex ~ "Sex",
education ~ "Education level",
exercise ~ "Any physical activity (past 30 days)",
gen_health ~ "General health status"
),
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} ({p}%)"
),
digits = all_continuous() ~ 1,
missing = "no"
) |>
add_n() |>
bold_labels() |>
italicize_levels() |>
modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 (n = 5,000)**") |>
as_flex_table()Characteristic | N | N = 5,0001 |
|---|---|---|
Physically unhealthy days (past 30) | 5,000 | 3.4 (7.9) |
Sleep (hours/night) | 5,000 | 7.1 (1.4) |
Age (years) | 5,000 | 54.1 (17.0) |
Sex | 5,000 | |
Male | 2,291 (46%) | |
Female | 2,709 (54%) | |
Education level | 5,000 | |
Less than HS | 276 (5.5%) | |
HS graduate | 1,263 (25%) | |
Some college | 1,378 (28%) | |
College graduate | 2,083 (42%) | |
Any physical activity (past 30 days) | 5,000 | 3,929 (79%) |
General health status | 5,000 | |
Excellent | 1,031 (21%) | |
Very good | 1,762 (35%) | |
Good | 1,507 (30%) | |
Fair | 524 (10%) | |
Poor | 176 (3.5%) | |
1Mean (SD); n (%) | ||
Interpretation: The analytic sample includes 5,000 randomly selected BRFSS 2020 respondents. On average, participants reported 3.4 physically unhealthy days in the past 30 (SD = 7.9), indicating a right-skewed distribution where most respondents report few or no unhealthy days. Mean sleep duration was 7.1 hours per night (SD = 1.4), which aligns with the CDC-recommended range of 7 or more hours. The sample skews female (54.2%) and is well-educated, with 41.7% holding a college degree. The majority (78.6%) reported engaging in physical activity in the past 30 days. These sample characteristics should be kept in mind when interpreting results, as the sample may not be representative of the general U.S. population.
Interaction (also called effect modification) is present when the relationship between an exposure and an outcome is different at different levels of a third variable. In regression terms, the slope of the exposure-outcome relationship changes depending on the value of the modifier.
For example, if the effect of sleep on physical health is stronger for women than for men, then sex modifies the effect of sleep. The two variables (sleep and sex) have a multiplicative, not merely additive, effect on the outcome.
This is fundamentally different from confounding:
| Concept | Question | Implication |
|---|---|---|
| Confounding | Is the crude estimate of the exposure effect biased by a third variable? | Must adjust for the confounder to get a valid estimate |
| Interaction | Does the effect of the exposure differ across subgroups? | Must report stratum-specific effects, not a single overall estimate |
Critical point: Always assess interaction before confounding. If interaction is present, a single “adjusted” coefficient for the exposure is misleading because the effect is not the same for everyone. You must stratify or include interaction terms.
Consider a model with a continuous exposure \(X_1\) (sleep hours) and a dichotomous variable \(X_2\) (sex, where Male = 1 and Female = 0):
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \varepsilon\]
The term \(\beta_3 X_1 X_2\) is the interaction term. Let’s see what happens when we plug in the values for each group:
For males (\(X_2 = 1\)): \[E(Y | X_1, \text{Male}) = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) X_1\]
For females (\(X_2 = 0\)): \[E(Y | X_1, \text{Female}) = \beta_0 + \beta_1 X_1\]
The key insight:
If \(\beta_3 = 0\), the slopes are equal and the lines are parallel (no interaction). If \(\beta_3 \neq 0\), the lines are not parallel (interaction is present).
# Create synthetic data to illustrate the concept
set.seed(1220)
n <- 200
concept_data <- tibble(
x1 = runif(n, 0, 10),
group = rep(c("Group A", "Group B"), each = n / 2)
)
# No interaction: same slope, different intercepts
no_int <- concept_data |>
mutate(
y = ifelse(group == "Group A", 2 + 1.5 * x1, 5 + 1.5 * x1) + rnorm(n, 0, 2)
)
# Interaction: different slopes
with_int <- concept_data |>
mutate(
y = ifelse(group == "Group A", 2 + 1.5 * x1, 5 + 0.3 * x1) + rnorm(n, 0, 2)
)
p1 <- ggplot(no_int, aes(x = x1, y = y, color = group)) +
geom_point(alpha = 0.3, size = 1) +
geom_smooth(method = "lm", se = FALSE, linewidth = 1.2) +
labs(title = "No Interaction", subtitle = "Parallel lines: same slope",
x = expression(X[1]), y = "Y") +
theme_minimal(base_size = 12) +
scale_color_brewer(palette = "Set1") +
theme(legend.position = "bottom")
p2 <- ggplot(with_int, aes(x = x1, y = y, color = group)) +
geom_point(alpha = 0.3, size = 1) +
geom_smooth(method = "lm", se = FALSE, linewidth = 1.2) +
labs(title = "Interaction Present", subtitle = "Non-parallel lines: different slopes",
x = expression(X[1]), y = "Y") +
theme_minimal(base_size = 12) +
scale_color_brewer(palette = "Set1") +
theme(legend.position = "bottom")
gridExtra::grid.arrange(p1, p2, ncol = 2)No Interaction (Parallel Lines) vs. Interaction (Non-Parallel Lines)
Interpretation: The left panel illustrates no interaction: both groups have the same slope (1.5), so the lines are perfectly parallel. The group variable shifts the intercept (Group B starts higher) but does not change the rate at which \(Y\) increases with \(X_1\). In this scenario, a single slope coefficient adequately describes the \(X_1\)-\(Y\) relationship for both groups. The right panel illustrates interaction: Group A has a steep slope (1.5) while Group B has a shallow slope (0.3). The non-parallel lines mean that the effect of \(X_1\) on \(Y\) depends on which group you belong to, and a single pooled slope would misrepresent the relationship for both groups. This is the visual test for interaction: parallel lines = no interaction; non-parallel lines = interaction.
One way to assess interaction is to split the data by levels of the potential modifier and fit separate regression models in each stratum. If the slopes differ meaningfully, interaction may be present.
Let’s test whether the association between sleep and physical health days differs by sex.
# Fit separate models for males and females
mod_male <- lm(physhlth_days ~ sleep_hrs, data = brfss_ci |> filter(sex == "Male"))
mod_female <- lm(physhlth_days ~ sleep_hrs, data = brfss_ci |> filter(sex == "Female"))
# Compare coefficients
bind_rows(
tidy(mod_male, conf.int = TRUE) |> mutate(Stratum = "Male"),
tidy(mod_female, conf.int = TRUE) |> mutate(Stratum = "Female")
) |>
filter(term == "sleep_hrs") |>
select(Stratum, estimate, std.error, conf.low, conf.high, p.value) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Stratified Analysis: Sleep → Physical Health Days, by Sex",
col.names = c("Stratum", "Estimate", "SE", "CI Lower", "CI Upper", "p-value")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Stratum | Estimate | SE | CI Lower | CI Upper | p-value |
|---|---|---|---|---|---|
| Male | -0.5243 | 0.1193 | -0.7583 | -0.2903 | 0.0000 |
| Female | -0.3752 | 0.1144 | -0.5996 | -0.1508 | 0.0011 |
Interpretation: The stratified analysis reveals that the sleep-physical health association is negative in both groups: among males, each additional hour of sleep is associated with approximately 0.52 fewer physically unhealthy days, while among females the association is weaker at approximately 0.38 fewer days per additional hour. Both estimates are statistically significant. The slopes appear to differ in magnitude, suggesting that the protective association of sleep with physical health may be somewhat stronger for males than females. However, before concluding that interaction is present, we need a formal statistical test, which we will conduct using an interaction term in Section 3.
ggplot(brfss_ci, aes(x = sleep_hrs, y = physhlth_days, color = sex)) +
geom_jitter(alpha = 0.08, width = 0.3, height = 0.3, size = 0.8) +
geom_smooth(method = "lm", se = TRUE, linewidth = 1.2) +
labs(
title = "Association Between Sleep and Physical Health Days, Stratified by Sex",
subtitle = "Are the slopes parallel or different?",
x = "Sleep Hours per Night",
y = "Physically Unhealthy Days (Past 30)",
color = "Sex"
) +
theme_minimal(base_size = 13) +
scale_color_brewer(palette = "Set1")Stratified Regression: Sleep vs. Physical Health Days by Sex
Interpretation: The plot shows both regression lines sloping downward, confirming the negative association between sleep and physically unhealthy days for both sexes. The male regression line (steeper slope) suggests a stronger protective association of sleep compared to the female line. However, the confidence bands overlap substantially, which hints that the difference in slopes may not be statistically significant. The jittered points also reveal the highly skewed nature of the outcome: most respondents cluster near zero unhealthy days regardless of sleep duration, with a long tail of individuals reporting many unhealthy days.
The stratified approach is intuitive and visually compelling, but it has important limitations:
The solution is to use a single regression model with an interaction term.
R provides two operators for specifying interactions:
X1:X2 — includes only
the interaction term \(X_1 \times
X_2\)X1*X2 — shorthand for
X1 + X2 + X1:X2 (main effects plus
interaction)Rule: Any model with an interaction term must also include the main effects that comprise the interaction. Always use
*or explicitly include both main effects with:.
# Model without interaction (additive model)
mod_add <- lm(physhlth_days ~ sleep_hrs + sex, data = brfss_ci)
# Model with interaction
mod_int <- lm(physhlth_days ~ sleep_hrs * sex, data = brfss_ci)
# Equivalent to: lm(physhlth_days ~ sleep_hrs + sex + sleep_hrs:sex, data = brfss_ci)
tidy(mod_int, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Interaction Model: Sleep * Sex → Physical Health Days",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 6.8388 | 0.8723 | 7.8397 | 0.0000 | 5.1287 | 8.5490 |
| sleep_hrs | -0.5243 | 0.1222 | -4.2898 | 0.0000 | -0.7639 | -0.2847 |
| sexFemale | -0.5786 | 1.1906 | -0.4860 | 0.6270 | -2.9127 | 1.7555 |
| sleep_hrs:sexFemale | 0.1491 | 0.1659 | 0.8986 | 0.3689 | -0.1762 | 0.4744 |
The fitted model is:
\[\widehat{\text{Phys Days}} = 6.839 + -0.524(\text{Sleep}) + -0.579(\text{Female}) + 0.149(\text{Sleep} \times \text{Female})\]
For males (Female = 0): \[\widehat{\text{Phys Days}} = 6.839 + -0.524(\text{Sleep})\]
For females (Female = 1): \[\widehat{\text{Phys Days}} = (6.839 + -0.579) + (-0.524 + 0.149)(\text{Sleep}) = 6.26 + -0.375(\text{Sleep})\]
Interpretation of each coefficient:
The formal test for interaction is simply the t-test for the interaction coefficient:
\[H_0: \beta_3 = 0 \quad \text{(slopes are equal, lines are parallel, no interaction)}\] \[H_A: \beta_3 \neq 0 \quad \text{(slopes differ, interaction is present)}\]
int_term <- tidy(mod_int) |> filter(term == "sleep_hrs:sexFemale")
cat("Interaction term (sleep_hrs:sexFemale):\n")## Interaction term (sleep_hrs:sexFemale):
## Estimate: 0.1491
## t-statistic: 0.899
## p-value: 0.3689
Interpretation: The interaction term
(sleep_hrs:sexFemale) has a coefficient of 0.1491, with a
t-statistic of 0.899 and p-value of 0.3689. Since the p-value exceeds
the conventional \(\alpha = 0.05\)
threshold, we fail to reject the null hypothesis that the slopes are
equal. In other words, there is no statistically significant
evidence that the association between sleep and physically unhealthy
days differs by sex. The positive sign of the interaction
coefficient does suggest the sleep effect is slightly less protective
for females, but this difference is not distinguishable from chance in
this sample. Therefore, a single (pooled) sleep coefficient may be
appropriate for both sexes.
A stronger test asks whether sex has any effect on the relationship (neither intercept nor slope differs):
\[H_0: \beta_2 = \beta_3 = 0 \quad \text{(the two lines are identical)}\] \[H_A: \text{At least one } \neq 0 \quad \text{(the lines differ in intercept and/or slope)}\]
This is a partial F-test comparing the interaction model to a model with only sleep:
mod_sleep_only <- lm(physhlth_days ~ sleep_hrs, data = brfss_ci)
anova(mod_sleep_only, mod_int) |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Test for Coincidence: Does Sex Affect the Sleep-Physical Health Relationship at All?") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | df.residual | rss | df | sumsq | statistic | p.value |
|---|---|---|---|---|---|---|
| physhlth_days ~ sleep_hrs | 4998 | 313611.5 | NA | NA | NA | NA |
| physhlth_days ~ sleep_hrs * sex | 4996 | 313284.5 | 2 | 327.0047 | 2.6074 | 0.0738 |
Interpretation: The partial F-test evaluates whether sex contributes anything to the model (either through a different intercept, a different slope, or both). The F-statistic is 2.61 with a p-value of 0.0738. Since p > 0.05, we fail to reject the null hypothesis that the two sex-specific regression lines are identical. This means that, in this unadjusted model, sex does not significantly modify either the baseline level or the slope of the sleep-physical health relationship. The test for coincidence is more comprehensive than the parallelism test (Section 3.3), because it simultaneously evaluates both the intercept difference (\(\beta_2\)) and the slope difference (\(\beta_3\)).
The interaction model recovers exactly the same stratum-specific equations as the stratified analysis:
tribble(
~Method, ~Stratum, ~Intercept, ~Slope,
"Stratified", "Male",
round(coef(mod_male)[1], 3), round(coef(mod_male)[2], 3),
"Stratified", "Female",
round(coef(mod_female)[1], 3), round(coef(mod_female)[2], 3),
"Interaction model", "Male",
round(b_int[1], 3), round(b_int[2], 3),
"Interaction model", "Female",
round(b_int[1] + b_int[3], 3), round(b_int[2] + b_int[4], 3)
) |>
kable(caption = "Verification: Stratified Analysis = Interaction Model") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Method | Stratum | Intercept | Slope |
|---|---|---|---|
| Stratified | Male | 6.839 | -0.524 |
| Stratified | Female | 6.260 | -0.375 |
| Interaction model | Male | 6.839 | -0.524 |
| Interaction model | Female | 6.260 | -0.375 |
The interaction model and the stratified analysis give identical stratum-specific equations. The advantage of the interaction model is that it provides a formal test of the difference between slopes and allows adjustment for additional covariates.
pred_int <- ggpredict(mod_int, terms = c("sleep_hrs [3:12]", "sex"))
ggplot(pred_int, aes(x = x, y = predicted, color = group, fill = group)) +
geom_line(linewidth = 1.2) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.12, color = NA) +
labs(
title = "Predicted Physical Health Days by Sleep Duration and Sex",
subtitle = "From interaction model: sleep_hrs * sex",
x = "Sleep Hours per Night",
y = "Predicted Physically Unhealthy Days",
color = "Sex",
fill = "Sex"
) +
theme_minimal(base_size = 13) +
scale_color_brewer(palette = "Set1")Interaction Model: Predicted Physical Health Days by Sleep and Sex
Interpretation: The predicted values plot confirms what the formal test indicated: the regression lines for males and females are nearly parallel, with broadly overlapping confidence bands. Both groups show a decline in predicted physically unhealthy days as sleep duration increases from 3 to 12 hours. The female line sits slightly below the male line across most of the sleep range, but the difference is modest and not statistically significant. This visualization reinforces the conclusion that sex does not meaningfully modify the sleep-physical health association in this sample.
When the modifier has \(k > 2\) categories (such as education with 4 levels), the interaction between a continuous exposure and the categorical modifier requires \(k - 1\) interaction terms, one for each dummy variable.
For sleep \(\times\) education (with “Less than HS” as reference):
\[Y = \beta_0 + \beta_1(\text{Sleep}) + \beta_2(\text{HS grad}) + \beta_3(\text{Some college}) + \beta_4(\text{College grad}) + \beta_5(\text{Sleep} \times \text{HS grad}) + \beta_6(\text{Sleep} \times \text{Some college}) + \beta_7(\text{Sleep} \times \text{College grad}) + \varepsilon\]
Each interaction coefficient \(\beta_5, \beta_6, \beta_7\) represents the difference in the sleep slope between that education group and the reference group.
mod_int_educ <- lm(physhlth_days ~ sleep_hrs * education, data = brfss_ci)
tidy(mod_int_educ, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Interaction Model: Sleep * Education → Physical Health Days",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 11.9692 | 2.0406 | 5.8655 | 0.0000 | 7.9687 | 15.9697 |
| sleep_hrs | -0.7877 | 0.2919 | -2.6986 | 0.0070 | -1.3599 | -0.2155 |
| educationHS graduate | -4.2823 | 2.3006 | -1.8613 | 0.0628 | -8.7925 | 0.2280 |
| educationSome college | -5.1543 | 2.3032 | -2.2379 | 0.0253 | -9.6695 | -0.6391 |
| educationCollege graduate | -8.8728 | 2.3020 | -3.8545 | 0.0001 | -13.3857 | -4.3600 |
| sleep_hrs:educationHS graduate | 0.2713 | 0.3274 | 0.8287 | 0.4073 | -0.3705 | 0.9131 |
| sleep_hrs:educationSome college | 0.3386 | 0.3280 | 1.0322 | 0.3020 | -0.3045 | 0.9816 |
| sleep_hrs:educationCollege graduate | 0.6889 | 0.3268 | 2.1079 | 0.0351 | 0.0482 | 1.3297 |
Interpretation: This model produces three interaction terms, one for each education level compared to the reference category (“Less than HS”). The reference group slope for sleep is -0.788, meaning that among those with less than a high school education, each additional hour of sleep is associated with approximately 0.79 fewer physically unhealthy days. The interaction terms for “HS graduate” (0.271) and “Some college” (0.339) are positive but not individually significant (p > 0.05), indicating their slopes do not significantly differ from the reference group. However, the interaction term for “College graduate” (0.689, p = 0.035) is statistically significant, suggesting that the protective effect of sleep is notably weaker among college graduates (slope \(\approx\) -0.099) compared to those with less than a high school education. This pattern makes substantive sense: individuals with lower education may experience more physically demanding occupations or fewer health resources, amplifying the consequences of insufficient sleep.
Individual t-tests for each interaction term may be non-significant, yet the interaction as a whole could be meaningful. To test whether sleep \(\times\) education is significant overall, we use a partial F-test comparing the model with and without the interaction terms:
\[H_0: \beta_5 = \beta_6 = \beta_7 = 0 \quad \text{(the sleep slope is the same across all education levels)}\]
mod_no_int_educ <- lm(physhlth_days ~ sleep_hrs + education, data = brfss_ci)
anova(mod_no_int_educ, mod_int_educ) |>
tidy() |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(caption = "Partial F-test: Is the Sleep x Education Interaction Significant?") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| term | df.residual | rss | df | sumsq | statistic | p.value |
|---|---|---|---|---|---|---|
| physhlth_days ~ sleep_hrs + education | 4995 | 308368.1 | NA | NA | NA | NA |
| physhlth_days ~ sleep_hrs * education | 4992 | 307957.2 | 3 | 410.9464 | 2.2205 | 0.0836 |
Interpretation: The partial F-test evaluates all three interaction terms simultaneously (\(H_0: \beta_5 = \beta_6 = \beta_7 = 0\)). The F-statistic is 2.22 with a p-value of 0.0836. At \(\alpha = 0.05\), the overall interaction is not statistically significant (p > 0.05), meaning we do not have sufficient evidence to conclude that the sleep slope differs across education levels as a whole. Note the discrepancy: the individual t-test for the “College graduate” interaction term was significant (p = 0.035), but the omnibus F-test was not. This can occur when one group drives the difference while others do not. In practice, if there is strong a priori reason to expect education-based effect modification, this borderline result may warrant further investigation in a larger sample.
pred_int_educ <- ggpredict(mod_int_educ, terms = c("sleep_hrs [3:12]", "education"))
ggplot(pred_int_educ, aes(x = x, y = predicted, color = group, fill = group)) +
geom_line(linewidth = 1.1) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.1, color = NA) +
labs(
title = "Predicted Physical Health Days by Sleep and Education",
subtitle = "Are the lines parallel? Non-parallel lines indicate interaction.",
x = "Sleep Hours per Night",
y = "Predicted Physically Unhealthy Days",
color = "Education",
fill = "Education"
) +
theme_minimal(base_size = 13) +
scale_color_brewer(palette = "Set2")Interaction: Sleep Effect by Education Level
Interpretation: The plot shows four regression lines, one for each education level. The “Less than HS” group has the steepest downward slope, while the “College graduate” line is noticeably flatter. The “HS graduate” and “Some college” lines fall in between. This visual pattern aligns with the regression results: the protective association between sleep and physical health is strongest among those with the least education and weakest among college graduates. At low sleep durations (3-4 hours), the education groups show the widest disparity in predicted unhealthy days, with the “Less than HS” group predicted to have the most. As sleep increases, the lines converge somewhat. Despite the borderline omnibus F-test, the visual evidence of non-parallel lines, particularly for college graduates, reinforces the potential for education-based effect modification.
Interaction is not limited to categorical modifiers. We can also examine whether the effect of one continuous predictor changes across values of another continuous predictor. For example, does the sleep-physical health association differ by age?
\[Y = \beta_0 + \beta_1(\text{Sleep}) + \beta_2(\text{Age}) + \beta_3(\text{Sleep} \times \text{Age}) + \varepsilon\]
Here, \(\beta_3\) tells us how the slope of sleep changes for each one-year increase in age:
mod_int_cont <- lm(physhlth_days ~ sleep_hrs * age, data = brfss_ci)
tidy(mod_int_cont, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Continuous Interaction: Sleep * Age → Physical Health Days",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 9.6494 | 1.9752 | 4.8852 | 0.0000 | 5.7771 | 13.5217 |
| sleep_hrs | -1.2597 | 0.2758 | -4.5668 | 0.0000 | -1.8005 | -0.7189 |
| age | -0.0490 | 0.0351 | -1.3948 | 0.1631 | -0.1178 | 0.0199 |
| sleep_hrs:age | 0.0138 | 0.0048 | 2.8444 | 0.0045 | 0.0043 | 0.0232 |
Interpretation: The interaction term
sleep_hrs:age has a coefficient of 0.0138 (p = 0.0045),
which is statistically significant. This positive coefficient means that
the protective effect of sleep weakens as age
increases. Specifically, for each one-year increase in age, the slope of
sleep becomes 0.014 units less negative (closer to zero). To illustrate:
the predicted slope of sleep for a 30-year-old is \(-1.26 + 0.014 \times 30 = -0.847\), while
for a 70-year-old it is \(-1.26 + 0.014 \times
70 = -0.297\). In substantive terms, sleep appears to have a
stronger protective association with physical health among younger
adults than among older adults.
With continuous modifiers, we visualize the interaction by plotting the predicted relationship at specific values of the modifier (e.g., age = 30, 50, 70):
pred_cont <- ggpredict(mod_int_cont, terms = c("sleep_hrs [3:12]", "age [30, 50, 70]"))
ggplot(pred_cont, aes(x = x, y = predicted, color = group, fill = group)) +
geom_line(linewidth = 1.2) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.12, color = NA) +
labs(
title = "Predicted Physical Health Days by Sleep, at Different Ages",
subtitle = "Does the sleep-health relationship change with age?",
x = "Sleep Hours per Night",
y = "Predicted Physically Unhealthy Days",
color = "Age",
fill = "Age"
) +
theme_minimal(base_size = 13) +
scale_color_brewer(palette = "Dark2")Interaction: Sleep Effect at Different Ages
If the lines are approximately parallel, age does not modify the sleep effect. If they fan out or converge, the interaction is meaningful.
Interpretation: The three age-specific lines clearly fan out at lower sleep durations and converge at higher sleep durations, confirming the statistically significant interaction. The line for 30-year-olds has the steepest negative slope, indicating that younger adults experience the largest reduction in physically unhealthy days per additional hour of sleep. The line for 70-year-olds is notably flatter, suggesting that sleep duration has a weaker association with physical health among older adults. One possible explanation is that older adults accumulate physically unhealthy days from age-related conditions (e.g., arthritis, cardiovascular disease) that are less modifiable by sleep alone, whereas younger adults’ physical health may be more responsive to behavioral factors like sleep.
Confounding exists when the estimated association between an exposure and an outcome is distorted because of a third variable that is related to both. When confounding is present, ignoring the confounder leads to a biased estimate of the exposure effect.
For a variable to be a confounder, it must satisfy three conditions:
Confounding Structure: The Confounder Affects Both Exposure and Outcome
Age is a classic confounder of the sleep-physical health relationship:
The standard approach in epidemiology is to compare the crude (unadjusted) estimate to the adjusted estimate:
Important: This is a rule of thumb, not a rigid cutoff. The 10% threshold is conventional, not absolute. Some researchers use 5% or 15% depending on context.
# Crude model: sleep → physical health days
mod_crude <- lm(physhlth_days ~ sleep_hrs, data = brfss_ci)
# Adjusted model: adding age
mod_adj_age <- lm(physhlth_days ~ sleep_hrs + age, data = brfss_ci)
# Extract sleep coefficients
b_crude <- coef(mod_crude)["sleep_hrs"]
b_adj <- coef(mod_adj_age)["sleep_hrs"]
pct_change <- abs(b_crude - b_adj) / abs(b_crude) * 100
tribble(
~Model, ~`Sleep β`, ~SE, ~`95% CI`,
"Crude (sleep only)",
round(b_crude, 4),
round(tidy(mod_crude) |> filter(term == "sleep_hrs") |> pull(std.error), 4),
paste0("(", round(confint(mod_crude)["sleep_hrs", 1], 4), ", ",
round(confint(mod_crude)["sleep_hrs", 2], 4), ")"),
"Adjusted (+ age)",
round(b_adj, 4),
round(tidy(mod_adj_age) |> filter(term == "sleep_hrs") |> pull(std.error), 4),
paste0("(", round(confint(mod_adj_age)["sleep_hrs", 1], 4), ", ",
round(confint(mod_adj_age)["sleep_hrs", 2], 4), ")")
) |>
kable(caption = "Confounding Assessment: Does Age Confound the Sleep-Physical Health Association?") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Model | Sleep β | SE | 95% CI |
|---|---|---|---|
| Crude (sleep only) | -0.4381 | 0.0827 | (-0.6002, -0.2761) |
| Adjusted (+ age) | -0.5112 | 0.0828 | (-0.6736, -0.3489) |
##
## Percent change in sleep coefficient when adding age: 16.7 %
cat("10% rule threshold: |", round(b_adj, 4), "| should fall within (",
round(b_crude * 0.9, 4), ",", round(b_crude * 1.1, 4), ")\n")## 10% rule threshold: | -0.5112 | should fall within ( -0.3943 , -0.4819 )
Interpretation: The crude (unadjusted) estimate for sleep is -0.4381, meaning that without controlling for any covariates, each additional hour of sleep is associated with approximately 0.44 fewer physically unhealthy days. After adjusting for age, the sleep coefficient changes to -0.5112, a 16.7% change. Because this exceeds the 10% threshold, age is a confounder of the sleep-physical health association. Notice that the adjusted estimate is more negative than the crude estimate. This means that confounding by age was attenuating the sleep effect (making it look weaker than it truly is). This occurs because age is positively associated with both the exposure (older adults sleep less) and the outcome (older adults have more unhealthy days), creating a positive confounding bias that partially masks the negative sleep-health association. After removing this bias by adjusting for age, the estimated protective effect of sleep becomes stronger.
In practice, we assess multiple potential confounders. Let’s evaluate age, exercise, education, and general health:
# Crude model
b_crude_val <- coef(mod_crude)["sleep_hrs"]
# One-at-a-time adjusted models
confounders <- list(
"Age" = lm(physhlth_days ~ sleep_hrs + age, data = brfss_ci),
"Exercise" = lm(physhlth_days ~ sleep_hrs + exercise, data = brfss_ci),
"Education" = lm(physhlth_days ~ sleep_hrs + education, data = brfss_ci),
"General health" = lm(physhlth_days ~ sleep_hrs + gen_health, data = brfss_ci),
"Sex" = lm(physhlth_days ~ sleep_hrs + sex, data = brfss_ci),
"Income" = lm(physhlth_days ~ sleep_hrs + income_cat, data = brfss_ci)
)
conf_table <- map_dfr(names(confounders), \(name) {
mod <- confounders[[name]]
b_adj_val <- coef(mod)["sleep_hrs"]
tibble(
Covariate = name,
`Crude β (sleep)` = round(b_crude_val, 4),
`Adjusted β (sleep)` = round(b_adj_val, 4),
`% Change` = round(abs(b_crude_val - b_adj_val) / abs(b_crude_val) * 100, 1),
Confounder = ifelse(abs(b_crude_val - b_adj_val) / abs(b_crude_val) * 100 > 10,
"Yes (>10%)", "No")
)
})
conf_table |>
kable(caption = "Systematic Confounding Assessment: One-at-a-Time Addition") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
column_spec(5, bold = TRUE)| Covariate | Crude β (sleep) | Adjusted β (sleep) | % Change | Confounder |
|---|---|---|---|---|
| Age | -0.4381 | -0.5112 | 16.7 | Yes (>10%) |
| Exercise | -0.4381 | -0.4083 | 6.8 | No |
| Education | -0.4381 | -0.3872 | 11.6 | Yes (>10%) |
| General health | -0.4381 | -0.1948 | 55.5 | Yes (>10%) |
| Sex | -0.4381 | -0.4434 | 1.2 | No |
| Income | -0.4381 | -0.3739 | 14.7 | Yes (>10%) |
Interpretation: The systematic assessment reveals four variables that meet the 10% change-in-estimate criterion for confounding:
Two variables did not meet the confounding threshold: exercise (6.8%) and sex (1.2%). Their inclusion in the model would not meaningfully alter the estimated sleep effect.
Identifying candidate confounders is not purely a statistical exercise. The three conditions for confounding (associated with exposure, associated with outcome, not on the causal pathway) require substantive knowledge from the literature and your understanding of the causal structure.
Caution about missing data: When you add a covariate to the model, observations with missing values on that covariate are dropped. This changes the analytic sample, which could alter \(\hat{\beta}\) for reasons unrelated to confounding. Always ensure that the crude and adjusted models are fit on the same set of observations.
# Verify our dataset has no missing values (we filtered earlier)
cat("Missing values in analytic dataset:", sum(!complete.cases(brfss_ci)), "\n")## Missing values in analytic dataset: 0
## All models are fit on the same n = 5000 observations.
Confounders vs. mediators: A variable on the causal pathway between exposure and outcome is a mediator, not a confounder. For example, if sleep affects exercise habits, which in turn affect physical health, then exercise is a mediator. Adjusting for a mediator would attenuate the exposure effect, but this attenuation is not bias; it reflects the removal of an indirect pathway. Whether to adjust depends on your research question.
General health status is a tricky case. It could be a confounder (poor overall health causes both poor sleep and more physically unhealthy days) or a mediator (poor sleep leads to poor general health, which leads to more physically unhealthy days). The direction depends on the assumed causal structure and should be guided by subject-matter knowledge.
The standard epidemiological approach to model building follows this order:
The reason for this order is that confounding assessment assumes a single exposure effect. If the effect actually varies across subgroups (interaction), then a single “adjusted” coefficient is misleading. Reporting an average effect when the true effect differs by sex, for example, could mask important public health heterogeneity.
| Step | Action |
|---|---|
| 1 | Specify the exposure-outcome relationship of interest |
| 2 | Identify potential effect modifiers (from literature, biological plausibility) |
| 3 | Test for interaction using interaction terms or stratified analysis |
| 4 | If interaction is present: report stratum-specific effects; assess confounding within strata |
| 5 | If no interaction: assess confounding in the full sample using the 10% change-in-estimate rule |
Let’s build a final model that accounts for both potential interaction and confounding in our BRFSS data:
# Step 1: Test for interaction between sleep and sex
mod_final_int <- lm(physhlth_days ~ sleep_hrs * sex + age + education + exercise,
data = brfss_ci)
# Check interaction term
int_pval <- tidy(mod_final_int) |>
filter(term == "sleep_hrs:sexFemale") |>
pull(p.value)
cat("Interaction p-value (sleep x sex):", round(int_pval, 4), "\n")## Interaction p-value (sleep x sex): 0.0417
Note on the interaction result: In this adjusted model, the sleep \(\times\) sex interaction p-value is 0.0417. Depending on the significance level used, this result may be considered borderline. In the unadjusted model (Section 3.3), the interaction was clearly non-significant (p = 0.37). The change in the p-value after adjusting for confounders illustrates an important lesson: interaction results can shift when covariates are added, because confounders can mask or amplify subgroup differences. For this lecture, we proceed with the additive (no interaction) model, but in practice a borderline interaction like this warrants careful consideration and possibly stratified reporting.
# Step 2: If interaction is not significant, fit additive model with confounders
mod_final <- lm(physhlth_days ~ sleep_hrs + sex + age + education + exercise,
data = brfss_ci)
tidy(mod_final, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Final Model: Sleep → Physical Health Days, Adjusted for Confounders",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 9.9181 | 0.7955 | 12.4676 | 0.0000 | 8.3585 | 11.4776 |
| sleep_hrs | -0.4318 | 0.0805 | -5.3662 | 0.0000 | -0.5895 | -0.2740 |
| sexFemale | 0.2862 | 0.2178 | 1.3140 | 0.1889 | -0.1408 | 0.7132 |
| age | 0.0355 | 0.0065 | 5.4878 | 0.0000 | 0.0228 | 0.0482 |
| educationHS graduate | -1.9241 | 0.5085 | -3.7838 | 0.0002 | -2.9209 | -0.9272 |
| educationSome college | -2.0609 | 0.5065 | -4.0688 | 0.0000 | -3.0539 | -1.0679 |
| educationCollege graduate | -2.9504 | 0.4955 | -5.9548 | 0.0000 | -3.9217 | -1.9791 |
| exerciseYes | -4.1551 | 0.2711 | -15.3273 | 0.0000 | -4.6866 | -3.6236 |
Interpretation of the final model: After adjusting for sex, age, education, and exercise, each additional hour of sleep is associated with 0.43 fewer physically unhealthy days (95% CI: -0.59 to -0.27, p < 0.001). Additional notable findings from the adjusted model:
b_final <- coef(mod_final)["sleep_hrs"]
tribble(
~Model, ~`Sleep β`, ~`% Change from Crude`,
"Crude", round(b_crude_val, 4), "—",
"Adjusted (age only)", round(coef(mod_adj_age)["sleep_hrs"], 4),
paste0(round(abs(b_crude_val - coef(mod_adj_age)["sleep_hrs"]) / abs(b_crude_val) * 100, 1), "%"),
"Final (all confounders)", round(b_final, 4),
paste0(round(abs(b_crude_val - b_final) / abs(b_crude_val) * 100, 1), "%")
) |>
kable(caption = "Sleep Coefficient: Crude vs. Progressively Adjusted Models") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Model | Sleep β | % Change from Crude |
|---|---|---|
| Crude | -0.4381 | — |
| Adjusted (age only) | -0.5112 | 16.7% |
| Final (all confounders) | -0.4318 | 1.5% |
Interpretation: This table tracks how the sleep coefficient changes as we progressively add confounders. The crude estimate (-0.4381) represents the unadjusted association. Adding age alone moved the estimate to -0.5112 (a 16.7% change), confirming age as an important confounder. In the final model with all confounders (age, sex, education, and exercise), the sleep coefficient is -0.4318, only a 1.5% change from the crude estimate. This small overall change may seem surprising given that individual confounders produced larger shifts. The reason is that the confounders act in opposing directions: age adjustment makes the sleep effect stronger (more negative), while education adjustment makes it weaker (less negative). When all confounders are included simultaneously, these opposing adjustments partially cancel out. This underscores why it is important to assess confounders both individually and jointly.
| Concept | Key Point |
|---|---|
| Predictive vs. associative | Confounding and interaction matter for associative (etiologic) modeling |
| Interaction | The effect of the exposure differs across levels of a modifier |
| Stratified analysis | Fit separate models per stratum; informative but no formal test |
| Interaction term | \(X_1 \times X_2\) in the model; t-test on \(\beta_3\) tests parallelism |
: vs. * in R |
: = interaction only; * = main effects +
interaction |
| Must include main effects | Never include \(X_1 X_2\) without also including \(X_1\) and \(X_2\) |
| Partial F-test for interaction | Tests all \(k - 1\) interaction terms simultaneously |
| Continuous \(\times\) continuous | The slope of one predictor changes linearly with the other |
| Confounding | A third variable distorts the exposure-outcome association |
| Three conditions | Associated with exposure, associated with outcome, not on causal pathway |
| 10% change-in-estimate | Compare crude and adjusted \(\hat{\beta}\); >10% change = confounder |
| Interaction before confounding | Assess interaction first; if present, stratify before assessing confounding |
| Mediator vs. confounder | Adjusting for a mediator removes an indirect effect, not bias |
EPI 553 — Confounding and Interactions Lab Due: End of class, March 24, 2026
In this lab, you will assess interaction and confounding in the BRFSS 2020 dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.
Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.
Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents.
| Variable | Description | Type |
|---|---|---|
physhlth_days |
Physically unhealthy days in past 30 | Continuous (0–30) |
sleep_hrs |
Sleep hours per night | Continuous (1–14) |
menthlth_days |
Mentally unhealthy days in past 30 | Continuous (0–30) |
age |
Age in years (capped at 80) | Continuous |
sex |
Sex (Male/Female) | Factor |
education |
Education level (4 categories) | Factor |
exercise |
Any physical activity (Yes/No) | Factor |
gen_health |
General health status (5 categories) | Factor |
income_cat |
Household income (1–8 ordinal) | Numeric |
brfss_ci <- readRDS('/Users/samriddhi/Desktop/DrPH/HEPI_553_Muntasir Musum/brfss_ci.rds')
cat("Rows:", nrow(brfss_ci), "| Columns:", ncol(brfss_ci), "\n")## Rows: 5000 | Columns: 9
## Variables: physhlth_days, sleep_hrs, menthlth_days, age, sex, education, exercise, gen_health, income_cat
## Missing values: 0
ggplot(brfss_ci, aes(x = age, y = physhlth_days, color = exercise)) +
geom_jitter(alpha = 0.08, width = 0.4, height = 0, size = 0.7) +
geom_smooth(method = "lm", se = TRUE, linewidth = 1.3) +
scale_color_brewer(palette = "Set1", name = "Exercise\n(past 30 days)") +
scale_y_continuous(limits = c(-0.5, 30.5), breaks = seq(0, 30, 5)) +
coord_cartesian(ylim = c(0, 30)) +
theme_minimal(base_size = 13) +
labs(
title = "Physically Unhealthy Days vs. Age, by Exercise Status",
subtitle = "BRFSS 2020 — n = 5,000",
x = "Age (years)",
y = "Physically Unhealthy Days (Past 30)"
)Figure 1a. Physically unhealthy days vs. age by exercise status with separate regression lines.
Interpretation:
The left figure indicates no interaction, as the regression lines are parallel, showing that age has a similar association with physically unhealthy days across exercise groups. Non-exercisers consistently report more unhealthy days, but the increase with age occurs at the same rate.
The right figure indicates an interaction, as the lines are not parallel. Physically unhealthy days rise more quickly with age among non-exercisers, suggesting that the effect of age varies by exercise status.
mean_table <- brfss_ci %>%
group_by(sex, exercise) %>%
summarise(
n = n(),
Mean_phys_days = round(mean(physhlth_days), 2),
SD = round(sd(physhlth_days), 2),
.groups = "drop"
)
mean_table %>%
kable(
caption = "Table 1b. Mean Physically Unhealthy Days by Sex and Exercise Status",
col.names = c("Sex", "Exercise", "n", "Mean Days (past 30)", "SD")
) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "bordered"),
full_width = FALSE
) %>%
row_spec(0, bold = TRUE)| Sex | Exercise | n | Mean Days (past 30) | SD |
|---|---|---|---|---|
| Male | No | 446 | 6.64 | 10.98 |
| Male | Yes | 1845 | 2.32 | 6.49 |
| Female | No | 625 | 7.43 | 11.46 |
| Female | Yes | 2084 | 2.45 | 6.32 |
Interpretation:
AAcross both sexes, individuals who exercise report fewer physically unhealthy days than those who do not exercise. It’s evident that females report slightly more unhealthy days than males, when compared within the exercise groups. The difference between exercisers and non-exercisers appears somewhat larger among females (7.4 vs. 2.5) than males (6.6 vs. 2.3), suggesting that the association between exercise and physical health may differ by sex.
ggplot(brfss_ci, aes(x = sleep_hrs, y = physhlth_days)) +
geom_jitter(alpha = 0.10, width = 0.25, height = 0.25,
size = 0.7, color = "#2C7BB6") +
geom_smooth(method = "lm", se = TRUE, color = "#D7191C", linewidth = 1.2) +
facet_wrap(~ education, ncol = 2) +
theme_minimal(base_size = 12) +
labs(
title = "Sleep Duration vs. Physically Unhealthy Days by Education Level",
subtitle = "Separate regression lines per education group — BRFSS 2020",
x = "Sleep Hours per Night",
y = "Physically Unhealthy Days (Past 30)"
)Figure 1c. Sleep duration vs. physically unhealthy days, faceted by education level.
Interpretation:
Across all the four education groups, the slopes is negative. This indicates that more is the sleep duration lesser is the physically unhealthy days. The slopes appear generally similar across panels, although the association may be slightly weaker among college graduates. Overall, the pattern is consistent, indicating week evidence that relationship between sleep and physical health differs substantially by education level.
mod_no_ex <- lm(physhlth_days ~ age, data = brfss_ci %>% filter(exercise == "No"))
mod_yes_ex <- lm(physhlth_days ~ age, data = brfss_ci %>% filter(exercise == "Yes"))
bind_rows(
tidy(mod_no_ex, conf.int = TRUE) %>% mutate(Stratum = "No Exercise"),
tidy(mod_yes_ex, conf.int = TRUE) %>% mutate(Stratum = "Exercised")
) %>%
filter(term == "age") %>%
select(Stratum, estimate, std.error, conf.low, conf.high, p.value) %>%
mutate(across(where(is.numeric), ~ round(.x, 4))) %>%
kable(
caption = "Table 2a. Stratified Regression: Age → Physical Health Days by Exercise Status",
col.names = c("Stratum", "Slope (Age)", "SE", "95% CI Lower", "95% CI Upper", "p-value")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)| Stratum | Slope (Age) | SE | 95% CI Lower | 95% CI Upper | p-value |
|---|---|---|---|---|---|
| No Exercise | 0.0745 | 0.0212 | 0.0328 | 0.1161 | 0.0005 |
| Exercised | 0.0192 | 0.0060 | 0.0075 | 0.0309 | 0.0013 |
Interpretation:
ggplot(brfss_ci, aes(x = age, y = physhlth_days, color = exercise)) +
geom_jitter(alpha = 0.07, width = 0.5, height = 0.4, size = 0.8) +
geom_smooth(method = "lm", se = TRUE, linewidth = 1.3) +
scale_color_brewer(palette = "Set1", name = "Exercise\n(past 30 days)") +
theme_minimal(base_size = 13) +
labs(
title = "Stratified Regression: Age vs. Physically Unhealthy Days by Exercise Status",
subtitle = "Are the lines parallel? Non-parallel lines suggest interaction.",
x = "Age (years)",
y = "Physically Unhealthy Days (Past 30)"
)Figure 2b. Stratified regression: age vs. physically unhealthy days by exercise status.
Interpretation:
The regression lines are not parallel. The slope is steeper, sloping upwards among non-exercisers compared to exercisers, indicating that physically unhealthy days increase more rapidly with age in those who do not exercise. This suggests that the association between age and physical health differs by exercise status, consistent with effect modification.
No, we cannot formally determine if the two slopes differ using only the stratified results. Although we can descriptively compare the slope estimates and their confidence intervals, this does not constitute a formal statistical test. To test whether the slopes are significantly different, we need to fit a single regression model that includes an interaction term between age and exercise (age × exercise) and assess the significance of that interaction.
A regression model with an interaction term
(physhlth_days ~ age * exercise), which directly estimates
the slope difference (\(\hat{\beta}_3\)) along with its standard
error and t-statistic
A partial F-test comparing models with and without the interaction term.
The stratified approach also has additional practical limitations: it reduces statistical power by splitting the sample; it cannot simultaneously adjust for other covariates while testing the interaction; and it provides no single omnibus test when the modifier has more than two levels.
mod_int_ex <- lm(physhlth_days ~ age * exercise, data = brfss_ci)
tidy(mod_int_ex, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(.x, 4))) %>%
kable(
caption = "Table 3a. Interaction Model: physhlth_days ~ age * exercise",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 2.7610 | 0.8798 | 3.1381 | 0.0017 | 1.0361 | 4.4858 |
| age | 0.0745 | 0.0145 | 5.1203 | 0.0000 | 0.0460 | 0.1030 |
| exerciseYes | -1.3858 | 0.9664 | -1.4340 | 0.1516 | -3.2804 | 0.5087 |
| age:exerciseYes | -0.0553 | 0.0162 | -3.4072 | 0.0007 | -0.0871 | -0.0235 |
Fitted equation:
\[\widehat{\text{Phys Days}} = 2.761 + 0.0745(\text{Age}) + -1.3858(\text{Exercise=Yes}) + -0.0553(\text{Age} \times \text{Exercise=Yes})\]
where Exercise = Yes is coded 1 and Exercise = No (reference) is coded 0.
For Non-Exercisers (Exercise = No = 0):
\[\widehat{\text{Phys Days}}_{\text{No Exercise}} = 2.761 + 0.0745(\text{Age})\]
For Exercisers (Exercise = Yes = 1):
\[\widehat{\text{Phys Days}}_{\text{Exercised}} = (2.761 + -1.3858) + (0.0745 + -0.0553)(\text{Age}) = 1.3752 + 0.0192(\text{Age})\]
tribble(
~Method, ~Stratum, ~Intercept, ~`Age Slope`,
"Stratified", "No Exercise", round(coef(mod_no_ex)[1], 4), round(coef(mod_no_ex)[2], 4),
"Stratified", "Exercised", round(coef(mod_yes_ex)[1], 4), round(coef(mod_yes_ex)[2], 4),
"Interaction model", "No Exercise", b[1], b[2],
"Interaction model", "Exercised", b[1] + b[3], b[2] + b[4]
) %>%
kable(caption = "Table 3b. Verification: Stratified Analysis = Interaction Model") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE) %>%
pack_rows("Stratified analysis", 1, 2) %>%
pack_rows("Interaction model (derived)", 3, 4)| Method | Stratum | Intercept | Age Slope |
|---|---|---|---|
| Stratified analysis | |||
| Stratified | No Exercise | 2.7610 | 0.0745 |
| Stratified | Exercised | 1.3752 | 0.0192 |
| Interaction model (derived) | |||
| Interaction model | No Exercise | 2.7610 | 0.0745 |
| Interaction model | Exercised | 1.3752 | 0.0192 |
Interpretation:
The interaction model recovers exactly the same stratum-specific intercepts and slopes as the separately fitted stratified models, confirming the mathematical equivalence of the two approaches. The advantage of the interaction model is that it provides a formal test of whether the slope difference is statistically meaningful — which the stratified approach alone cannot supply.
int_term_ex <- tidy(mod_int_ex) %>% filter(term == "age:exerciseYes")
cat("Interaction term: age:exerciseYes\n")## Interaction term: age:exerciseYes
## Estimate: -0.0553
## SE: 0.0162
## t-statistic: -3.407
## p-value: 7e-04
Hypotheses:
\[H_0: \beta_{\text{age:exerciseYes}} = 0 \quad \text{(slopes are equal across exercise groups — no interaction)}\] \[H_A: \beta_{\text{age:exerciseYes}} \neq 0 \quad \text{(slopes differ — interaction is present)}\] H₀: β₃ = 0 (no interaction; slopes are equal) H₁: β₃ ≠ 0 (interaction present; slopes differ)
The interaction term (age × exercise) has a t-statistic of −3.407 and a p-value of 7x10^-4.
Conclusion:
Since the p-value is less than 0.05, we will reject null hypothesis and conclude that there is evidence of interaction. This indicates that the association between age and physically unhealthy days differs by exercise status, with a weaker increase among exercisers compared to non-exercisers.
mod_no_int_educ2 <- lm(physhlth_days ~ age + education, data = brfss_ci)
mod_int_educ2 <- lm(physhlth_days ~ age * education, data = brfss_ci)
tidy(mod_int_educ2, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(.x, 4))) %>%
kable(
caption = "Table 3d. Interaction Model: physhlth_days ~ age * education",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 2.7542 | 1.5422 | 1.7859 | 0.0742 | -0.2692 | 5.7775 |
| age | 0.0706 | 0.0269 | 2.6277 | 0.0086 | 0.0179 | 0.1233 |
| educationHS graduate | -1.4016 | 1.6900 | -0.8293 | 0.4069 | -4.7146 | 1.9115 |
| educationSome college | -1.1261 | 1.6893 | -0.6666 | 0.5051 | -4.4380 | 2.1857 |
| educationCollege graduate | -2.7102 | 1.6580 | -1.6346 | 0.1022 | -5.9606 | 0.5402 |
| age:educationHS graduate | -0.0197 | 0.0296 | -0.6661 | 0.5054 | -0.0776 | 0.0382 |
| age:educationSome college | -0.0326 | 0.0295 | -1.1039 | 0.2697 | -0.0904 | 0.0253 |
| age:educationCollege graduate | -0.0277 | 0.0289 | -0.9583 | 0.3380 | -0.0844 | 0.0290 |
int_terms_educ <- tidy(mod_int_educ2) %>% filter(str_detect(term, "age:education"))
cat("\nNumber of interaction terms produced:", nrow(int_terms_educ), "\n")##
## Number of interaction terms produced: 3
## Terms: age:educationHS graduate, age:educationSome college, age:educationCollege graduate
ftest_educ <- anova(mod_no_int_educ2, mod_int_educ2)
ftest_educ %>%
tidy() %>%
mutate(across(where(is.numeric), ~ round(.x, 4))) %>%
kable(caption = "Table 3d. Partial F-test: Is the Age x Education Interaction Significant?") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)| term | df.residual | rss | df | sumsq | statistic | p.value |
|---|---|---|---|---|---|---|
| physhlth_days ~ age + education | 4995 | 306764.2 | NA | NA | NA | NA |
| physhlth_days ~ age * education | 4992 | 306671.9 | 3 | 92.2713 | 0.5007 | 0.6818 |
f_stat_educ <- round(ftest_educ$F[2], 3)
f_pval_educ <- round(ftest_educ$`Pr(>F)`[2], 4)
cat("\nF-statistic:", f_stat_educ, "| p-value:", f_pval_educ, "\n")##
## F-statistic: 0.501 | p-value: 0.6818
Number of interaction terms:
There are 3 interaction terms, corresponding to the number of education levels minus one (reference group).
Hypotheses:
H₀: All age × education interaction terms = 0 (no interaction) H₁: At least one interaction term ≠ 0
The partial F-test yields F = 0.501 with a p-value of 0.682.
Conclusion:
Since the p-value is more than 0.05, we fail to reject the null hypothesis. There is no evidence that the association between age and physically unhealthy days differs across education levels, suggesting no meaningful interaction between age and education.
pred_educ <- ggpredict(mod_int_educ2, terms = c("age [18:80]", "education"))
ggplot(pred_educ, aes(x = x, y = predicted, color = group, fill = group)) +
geom_line(linewidth = 1.2) +
geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.10, color = NA) +
scale_color_brewer(palette = "Set2", name = "Education") +
scale_fill_brewer(palette = "Set2", name = "Education") +
theme_minimal(base_size = 13) +
labs(
title = "Predicted Physically Unhealthy Days by Age and Education Level",
subtitle = "From interaction model: physhlth_days ~ age * education",
x = "Age (years)",
y = "Predicted Physically Unhealthy Days (Past 30)"
)Figure 3e. Predicted physically unhealthy days by age for each education level (interaction model).
Interpretation:
The lines are roughly parallel across education groups, suggesting that the association between age and physically unhealthy days is similar regardless of education level. Although the overall predicted number of unhealthy days varies by education, the rate of increase with age is consistent across groups. This aligns with the earlier finding of no significant interaction between age and education.
For Tasks 4a–4d, the exposure is
exercise and the outcome is
physhlth_days.
mod_crude_ex <- lm(physhlth_days ~ exercise, data = brfss_ci)
tidy(mod_crude_ex, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(.x, 4))) %>%
kable(
caption = "Table 4a. Crude Model: physhlth_days ~ exercise",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 7.1027 | 0.2354 | 30.1692 | 0 | 6.6412 | 7.5643 |
| exerciseYes | -4.7115 | 0.2656 | -17.7401 | 0 | -5.2322 | -4.1908 |
b_crude_ex <- coef(mod_crude_ex)["exerciseYes"]
cat("\nCrude exercise coefficient:", round(b_crude_ex, 4), "\n")##
## Crude exercise coefficient: -4.7115
Interpretation:
The crude model shows that individuals who exercised in the past 30 days reported 4.71 fewer physically unhealthy days on average compared to non-exercisers (95% CI: -5.23 to -4.19, p < 0.001). This unadjusted estimate is the baseline against which we compare adjusted coefficients in Task 4b to identify confounders.
b_crude_val_ex <- coef(mod_crude_ex)["exerciseYes"]
confounders_ex <- list(
"Age" = lm(physhlth_days ~ exercise + age, data = brfss_ci),
"Sex" = lm(physhlth_days ~ exercise + sex, data = brfss_ci),
"Sleep hrs" = lm(physhlth_days ~ exercise + sleep_hrs, data = brfss_ci),
"Education" = lm(physhlth_days ~ exercise + education, data = brfss_ci),
"Income cat" = lm(physhlth_days ~ exercise + income_cat, data = brfss_ci)
)
conf_table_ex <- map_dfr(names(confounders_ex), function(name) {
mod <- confounders_ex[[name]]
b_adj_val <- coef(mod)["exerciseYes"]
pct_chg <- abs(b_crude_val_ex - b_adj_val) / abs(b_crude_val_ex) * 100
tibble(
Covariate = name,
`Crude β (exercise)` = round(b_crude_val_ex, 4),
`Adjusted β` = round(b_adj_val, 4),
`% Change` = round(pct_chg, 1),
`Confounder (>10%)?` = ifelse(pct_chg > 10, "Yes", "No")
)
})
conf_table_ex %>%
kable(caption = "Table 4b. Systematic Confounding Assessment: Exercise → Physical Health Days") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE) %>%
column_spec(5, bold = TRUE,
color = ifelse(conf_table_ex$`Confounder (>10%)?` == "Yes", "white", "black"),
background = ifelse(conf_table_ex$`Confounder (>10%)?` == "Yes", "#C0392B", "#27AE60"))| Covariate | Crude β (exercise) | Adjusted β | % Change | Confounder (>10%)? |
|---|---|---|---|---|
| Age | -4.7115 | -4.5504 | 3.4 | No |
| Sex | -4.7115 | -4.6974 | 0.3 | No |
| Sleep hrs | -4.7115 | -4.6831 | 0.6 | No |
| Education | -4.7115 | -4.3912 | 6.8 | No |
| Income cat | -4.7115 | -3.9406 | 16.4 | Yes |
Interpretation:
Applying the 10% change-in-estimate criterion, only the income category qualifies as a confounder of the relationship between exercise and physically unhealthy days, causing a 16.4% change in the exercise coefficient. The other variables (age, sex, sleep hours, and education) resulted in less than a 10% change and thus do not meet the confounding criterion. These findings indicate that income plays an important role in the relationship between exercise and physical health.
# Identify confounders from the table
var_map <- c("Age"="age","Sex"="sex","Sleep hrs"="sleep_hrs",
"Education"="education","Income cat"="income_cat")
identified <- conf_table_ex %>%
filter(`Confounder (>10%)?` == "Yes") %>%
pull(Covariate)
conf_vars_final <- var_map[identified]
cat("Confounders identified (>10%):", paste(identified, collapse=", "), "\n")## Confounders identified (>10%): Income cat
final_formula <- as.formula(
paste("physhlth_days ~ exercise +", paste(conf_vars_final, collapse=" + "))
)
cat("Final model formula:", deparse(final_formula), "\n\n")## Final model formula: physhlth_days ~ exercise + income_cat
mod_final_ex <- lm(final_formula, data = brfss_ci)
tidy(mod_final_ex, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~ round(.x, 4))) %>%
kable(
caption = "Table 4c. Fully Adjusted Model: exercise → physhlth_days",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)| Term | Estimate | SE | t | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 10.6363 | 0.3619 | 29.3924 | 0 | 9.9269 | 11.3457 |
| exerciseYes | -3.9406 | 0.2684 | -14.6842 | 0 | -4.4667 | -3.4145 |
| income_cat | -0.6750 | 0.0531 | -12.7135 | 0 | -0.7790 | -0.5709 |
b_final_ex <- coef(mod_final_ex)["exerciseYes"]
pct_overall <- abs(b_crude_val_ex - b_final_ex) / abs(b_crude_val_ex) * 100
tribble(
~Model, ~`Exercise β`, ~`% Change from Crude`,
"Crude", round(b_crude_val_ex, 4), "---",
"Fully adjusted", round(b_final_ex, 4), paste0(round(pct_overall, 1), "%")
) %>%
kable(caption = "Table 4c. Crude vs. Adjusted Exercise Coefficient") %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
row_spec(0, bold = TRUE)| Model | Exercise β | % Change from Crude |
|---|---|---|
| Crude | -4.7115 | — |
| Fully adjusted | -3.9406 | 16.4% |
Interpretation: After adjusting for all identified confounders, exercisers reported 3.94 fewer physically unhealthy days than non-exercisers (95% CI: -4.47 to -3.41, p < 0.001). This represents a 16.4% change from the crude estimate of -4.7115. The exercise coefficient attenuated (became less negative) after adjustment, indicating that the crude association was partially inflated by confounding — the adjusted estimate reflects a more accurate, if somewhat smaller, exercise benefit. The association remains statistically significant and clinically meaningful, supporting a genuine protective relationship between exercise and physical health.
mod_genhlth <- lm(physhlth_days ~ exercise + gen_health, data = brfss_ci)
b_genhlth <- coef(mod_genhlth)["exerciseYes"]
pct_genhlth <- abs(b_crude_val_ex - b_genhlth) / abs(b_crude_val_ex) * 100
cat("Exercise coefficient adjusted for gen_health:", round(b_genhlth, 4), "\n")## Exercise coefficient adjusted for gen_health: -1.6596
## % change from crude: 64.8 %
Discussion: General health status produces a 64.8% change in the exercise coefficient — far above the 10% threshold — but whether it is a confounder or a mediator depends on the assumed causal structure, not the statistical criterion alone.
For gen_health to be a confounder, it must affect
exercise behavior and physical health independently of the exercise →
health pathway. This is plausible: poor general health may limit a
person’s ability to exercise (reverse causation) and simultaneously
cause more physically unhealthy days.
For gen_health to be a mediator, the pathway would be:
exercise → better general health → fewer physically unhealthy days.
Under this model, general health is how exercise exerts its effect, and
adjusting for it would block the very mechanism we want to estimate,
producing over-adjustment bias.
It could be both simultaneously a confounder-mediator which requires
causal inference methods beyond standard regression. For the goal of
estimating the total effect of exercise, gen_health should
not be adjusted for, because doing so would remove both the indirect
(mediated) effect and any genuine confounding in a way that cannot be
cleanly separated. Its large coefficient change reflects mediation more
than confounding bias.
In the 2020 BRFSS survey, adults who engaged in regular physical activity reported significantly fewer physically unhealthy days per month. Our analysis indicated that individuals who exercised experienced approximately r round(abs(b_final_ex), 1) fewer physically unhealthy days compared to non-exercisers, even after adjusting for age, education, income, and other relevant factors. The benefits of exercise were observed across demographic groups; however, the data suggest that the increase in physically unhealthy days with age may be greater among non-exercisers, highlighting the importance of staying active as one gets older for maintaining physical health. Background factors such as age, income, and education were identified as confounders that influenced the initial comparison between exercisers and non-exercisers, but the exercise-related reduction in unhealthy days remained statistically significant after adjustment. These results should be interpreted cautiously: as this is a cross-sectional survey, we cannot establish causality—healthier individuals may be more likely to exercise. Additionally, unmeasured factors, including access to safe recreational areas, chronic health conditions, or work-related physical demands, could also affect these findings.
If our goal is to estimate the total effect of exercise on physically unhealthy days, including general health status as a covariate would be inappropriate because general health likely acts as a mediator—lying on the causal pathway between exercise and the outcome. Regular exercise improves overall general health, which then reduces physically unhealthy days; controlling for this intermediate variable in a regression model would block part of the effect of exercise, leading to an underestimate of the true total effect, a phenomenon known as over-adjustment bias. The 10% change-in-estimate rule is intended as a guideline for identifying confounders—variables that distort the exposure-outcome relationship from outside the causal pathway—and using it for a mediator can produce a misleading result that appears to indicate confounding but actually reflects mediation. Decisions about including variables in an adjusted model should be based on the assumed causal structure: if it is biologically and temporally plausible that exercise influences general health, which then affects physical wellbeing, general health should be treated as a mediator and excluded from models estimating the total effect of exercise.
End of Lab Activity