In-Class Lab Activity

EPI 553 — Confounding and Interactions Lab Due: End of class, March 24, 2026

Instructions

In this lab, assess interaction and confounding in the BRFSS 2020 dataset. Work through each task systematically.

Data for the Lab

Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents.

Variable	Description	Type
`physhlth_days`	Physically unhealthy days in past 30	Continuous (0–30)
`sleep_hrs`	Sleep hours per night	Continuous (1–14)
`menthlth_days`	Mentally unhealthy days in past 30	Continuous (0–30)
`age`	Age in years (capped at 80)	Continuous
`sex`	Sex (Male/Female)	Factor
`education`	Education level (4 categories)	Factor
`exercise`	Any physical activity (Yes/No)	Factor
`gen_health`	General health status (5 categories)	Factor
`income_cat`	Household income (1–8 ordinal)	Numeric

# Load the dataset
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(car)
library(ggeffects)
library(ggplot2)

brfss_ci <- readRDS(
  "/Users/nataliasmall/Downloads/EPI 553/brfss_ci_2020.rds"
)

Task 1: Exploratory Data Analysis (15 points)

1a. (5 pts) Create a scatterplot of physhlth_days (y-axis) vs. age (x-axis), colored by exercise status. Add separate regression lines for each group. Describe the pattern you observe.

ggplot(brfss_ci, aes(x = age, y = physhlth_days, color = exercise)) +
  geom_jitter(alpha = 0.08, width = 0.3, height = 0.3, size = 0.8) +
  geom_smooth(method = "lm", se = TRUE, linewidth = 1.2) +
  labs(
    title    = "Association Between Age and Physical Health Days, Stratified by Exercise Status",
    subtitle = "BRFSS 2020 (n = 5,000)",
    x        = "Age in years (capped at 80)",
    y        = "Physically Unhealthy Days (Past 30)",
    color    = "Exercise Status"
  ) +
  theme_minimal(base_size = 13) +
  scale_color_brewer(palette = "Set1")

Interpretation: The observed pattern reveals that the age-physical health association is positive in both groups. Both regression lines slope upward, confirming the positive association between age and physically unhealthy days for both exercise status categories. The “No” exercise status category regression line suggest a stronger protective association of age compared to the “Yes” line. Note, that the “Yes” line only indicates a slight positive association, compared to the “No” line. The confidence bands do not overlap, which hints that the difference in slopes may be statistically significant. The scattered points also reveal the highly skewed nature of the outcome: most respondents cluster near zero unhealthy days regardless of age in years, with a long tail of individuals reporting many unhealthy days.

1b. (5 pts) Compute the mean physhlth_days for each combination of sex and exercise. Present the results in a table. Does it appear that the association between exercise and physical health days might differ by sex?

# Compute observed group means
physhlth_means <- brfss_ci |>
  summarise(mean_physhlth = mean(physhlth_days), .by = c(sex, exercise))

# Present the results in a table
physhlth_means |>
  kable(
    caption = "Mean Physically Unhealthy Days by Sex and Exercise Status",
    col.names = c("Sex", "Exercise", "Mean Days")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Mean Physically Unhealthy Days by Sex and Exercise Status
Sex	Exercise	Mean Days
Female	No	7.430400
Male	Yes	2.323577
Female	Yes	2.451056
Male	No	6.643498

Interpretation: It appears that the association between exercise and physical health days might not differ by sex. However, individuals, regardless of sex, have similar average physically unhealthy days when grouped by their exercise status.

1c. (5 pts) Create a scatterplot of physhlth_days vs. sleep_hrs, faceted by education level. Comment on whether the slopes appear similar or different across education groups.

ggplot(brfss_ci, aes(x = sleep_hrs, y = physhlth_days, color = education)) +
  geom_jitter(alpha = 0.08, width = 0.3, height = 0.3, size = 0.8) +
  geom_smooth(method = "lm", se = TRUE, linewidth = 1.2) +
  labs(
    title    = "Association Between Sleep hours and Physical Health Days, Faceted by Education level",
    subtitle = "BRFSS 2020 (n = 5,000)",
    x        = "Sleep hours per night",
    y        = "Physically Unhealthy Days (Past 30)",
    color    = "Education level"
  ) +
  theme_minimal(base_size = 13) +
  scale_color_brewer(palette = "Set1")

Interpretation: The slopes appear similar across education groups. The observed pattern shows that the confidence bands overlap, which hints that the difference in slopes may not be statistically significant.

Task 2: Stratified Analysis for Interaction (15 points)

2a. (5 pts) Fit separate simple linear regression models of physhlth_days ~ age for exercisers and non-exercisers. Report the slope, SE, and 95% CI for age in each stratum.

# Fit separate models for exercisers and non-exercisers
mod_exer <- lm(physhlth_days ~ age, data = brfss_ci |> filter(exercise == "Yes"))
mod_non_excer <- lm(physhlth_days ~ age, data = brfss_ci |> filter(exercise == "No"))

# Compare coefficients
bind_rows(
  tidy(mod_exer, conf.int = TRUE) |> mutate(Stratum = "Exercisers"),
  tidy(mod_non_excer, conf.int = TRUE) |> mutate(Stratum = "Non-Exercisers")
) |>
  filter(term == "age") |>
  select(Stratum, estimate, std.error, conf.low, conf.high, p.value) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Stratified Analysis: Physically unhealthy days ~ Age, by Exercise Status",
    col.names = c("Stratum", "Estimate", "SE", "CI Lower", "CI Upper", "p-value")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Stratified Analysis: Physically unhealthy days ~ Age, by Exercise Status
Stratum	Estimate	SE	CI Lower	CI Upper	p-value
Exercisers	0.0192	0.0060	0.0075	0.0309	0.0013
Non-Exercisers	0.0745	0.0212	0.0328	0.1161	0.0005

Age by Exercisers (Yes) Slope: 0.0192 Age by Exercisers (Yes) SE: 0.0060 Age by Exercisers (Yes) 95% CI: [0.0075, 0.0309]

Age by Non-Exercisers (No) Slope: 0.0745 Age by Non-Exercisers (No) SE: 0.0212 Age by Non-Exercisers (No) 95% CI: [0.0328, 0.1161]

2b. (5 pts) Create a single plot showing the two fitted regression lines (one per exercise group) overlaid on the data. Are the lines approximately parallel?

ggplot(brfss_ci, aes(x = age, y = physhlth_days, color = exercise)) +
  geom_jitter(alpha = 0.08, width = 0.3, height = 0.3, size = 0.8) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1.2) +
  labs(
    title    = "Association Between Age and Physical Health Days, Stratified by Exercise Status",
    subtitle = "BRFSS 2020 (n = 5,000)",
    x        = "Age in years (capped at 80)",
    y        = "Physically Unhealthy Days (Past 30)",
    color    = "Exercise Status"
  ) +
  theme_minimal(base_size = 13) +
  scale_color_brewer(palette = "Set1")

Interpretation: The lines are not approximately parallel. The non-exercisers (exercise status = No) regression line shows a sharper positive incline compared to the exercisers (exercise status = yes) regression line showing a very slight positive incline.

2c. (5 pts) Can you formally test whether the two slopes are different using only the stratified results? Explain why or why not. You cannot formally test whether the two slopes are different using only the stratified results. While the stratified approach is intuitive and visually compelling, and we can see the slopes are different, we cannot compute a p-value for the difference without additional machinery.

Task 3: Interaction via Regression (25 points)

3a. (5 pts) Fit the interaction model: physhlth_days ~ age * exercise. Write out the fitted equation.

# Model with interaction
mod_int <- lm(physhlth_days ~ age * exercise, data = brfss_ci)

tidy(mod_int, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Interaction Model: Age * Exercise Status → Physical Health Days",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Interaction Model: Age * Exercise Status → Physical Health Days
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	2.7610	0.8798	3.1381	0.0017	1.0361	4.4858
age	0.0745	0.0145	5.1203	0.0000	0.0460	0.1030
exerciseYes	-1.3858	0.9664	-1.4340	0.1516	-3.2804	0.5087
age:exerciseYes	-0.0553	0.0162	-3.4072	0.0007	-0.0871	-0.0235

Fitted equation: Phys Days = 2.7610 + 0.0745 (Age) + -1.3858 (Exercise:Yes) + -0.0553 (Age × Exercise:Yes)

3b. (5 pts) Using the fitted equation, derive the stratum-specific equations for exercisers and non-exercisers. Verify that these match the stratified analysis from Task 2.

For Non-Exercisers (Exercise = No) Phys Days = 2.7610 + 0.0745 (Age) For Exercisers (Exercise = Yes) Phys Days = (2.7610 + -1.3858) + (0.0745 + -0.0553)(Age) = 1.3752 + 0.0192(Age)

These match the stratified analysis from Task 2.

3c. (5 pts) Conduct the t-test for the interaction term (age:exerciseYes). State the null and alternative hypotheses, report the test statistic and p-value, and state your conclusion about whether interaction is present.

int_term <- tidy(mod_int) |> filter(term == "age:exerciseYes")
cat("Interaction term (age:exerciseYes):\n")

## Interaction term (age:exerciseYes):

cat("  Estimate:", round(int_term$estimate, 4), "\n")

##   Estimate: -0.0553

cat("  t-statistic:", round(int_term$statistic, 3), "\n")

##   t-statistic: -3.407

cat("  p-value:", round(int_term$p.value, 4), "\n")

##   p-value: 7e-04

Null Hypothesis: 𝐻0:𝛽3=0 (slopes are equal, lines are parallel, no interaction) Alternative Hypothesis: 𝐻𝐴:𝛽3≠0 (slopes differ, interaction is present) t-statistic: -3.407 p-value: 7e-04 (<0.001) Conclusion: Since p-value 7e-04 is <0.05 we reject the null hypothesis, indicating that slopes significantly differ and interaction is present. There is statistically significant evidence that the relationship between age and physical health does change depending on exercise status.

3d. (5 pts) Now fit a model with an interaction between age and education: physhlth_days ~ age * education. How many interaction terms are produced? Use a partial F-test to test whether the age \(\times\) education interaction as a whole is significant.

# Model with an interaction between age and education: physhlth_days ~ age * education
mod_edu <- lm(physhlth_days ~ age * education, data = brfss_ci)

# Conduct partial F-test
anova(mod_edu) |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Test for Coincidence: Does Age × Education Affect the Age-Physical Health Relationship at All?") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Test for Coincidence: Does Age × Education Affect the Age-Physical Health Relationship at All?
term	df	sumsq	meansq	statistic	p.value
age	1	2830.1358	2830.1358	46.0689	0.0000
education	3	5780.0958	1926.6986	31.3628	0.0000
age:education	3	92.2713	30.7571	0.5007	0.6818
Residuals	4992	306671.8963	61.4327	NA	NA

Interpretation: 3 interaction terms are produced. The age × education interaction as a whole is not significant as, p-value 0.6818 > 0.05.

3e. (5 pts) Create a visualization using ggpredict() showing the predicted physhlth_days by age for each education level. Do the lines appear parallel?

pred_int <- ggpredict(mod_edu, terms = c("age", "education"))

ggplot(pred_int, aes(x = x, y = predicted, color = group, fill = group)) +
  geom_line(linewidth = 1.2) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.12, color = NA) +
  labs(
    title    = "Predicted Physical Health Days by Age and Education Level",
    subtitle = "From interaction model: age * education",
    x        = "Age in years (capped at 80)",
    y        = "Predicted Physically Unhealthy Days",
    color    = "Education",
    fill     = "Education"
  ) +
  theme_minimal(base_size = 13) +
  scale_color_brewer(palette = "Set1")

Interpretation: The lines appear relatively parallel.

Task 4: Confounding Assessment (25 points)

For this task, the exposure is exercise and the outcome is physhlth_days.

4a. (5 pts) Fit the crude model: physhlth_days ~ exercise. Report the exercise coefficient. This is the unadjusted estimate.

mod_crude <- lm(physhlth_days ~ exercise, data = brfss_ci)

tidy(mod_crude, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Crude Model: Physical Health Days ~ Exercise Status",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Crude Model: Physical Health Days ~ Exercise Status
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	7.1027	0.2354	30.1692	0	6.6412	7.5643
exerciseYes	-4.7115	0.2656	-17.7401	0	-5.2322	-4.1908

Coefficient: The exercise coefficient is -4.7115.

4b. (10 pts) Systematically assess whether each of the following is a confounder of the exercise-physical health association: age, sex, sleep_hrs, education, and income_cat. For each:

Fit the model physhlth_days ~ exercise + [covariate]
Report the adjusted exercise coefficient
Compute the percent change from the crude estimate
Apply the 10% rule to determine if the variable is a confounder

Present your results in a single summary table.

# Crude model
b_crude_val <- coef(mod_crude)["exerciseYes"]

# One-at-a-time adjusted models
confounders <- list(
  "Age"             = lm(physhlth_days ~ exercise + age, data = brfss_ci),
  "Sex"             = lm(physhlth_days ~ exercise + sex, data = brfss_ci),
  "Sleep Hours"     = lm(physhlth_days ~ exercise + sleep_hrs, data = brfss_ci),
  "Education"       = lm(physhlth_days ~ exercise + education, data = brfss_ci),
  "Income"          = lm(physhlth_days ~ exercise + income_cat, data = brfss_ci)
)

conf_table <- map_dfr(names(confounders), \(name) {
  mod <- confounders[[name]]
  b_adj_val <- coef(mod)["exerciseYes"]
  tibble(
    Covariate = name,
    `Crude β (exercise)` = round(b_crude_val, 4),
    `Adjusted β (exercise)` = round(b_adj_val, 4),
    `% Change` = round(abs(b_crude_val - b_adj_val) / abs(b_crude_val) * 100, 1),
    Confounder = ifelse(abs(b_crude_val - b_adj_val) / abs(b_crude_val) * 100 > 10,
                        "Yes (>10%)", "No")
  )
})

conf_table |>
  kable(caption = "Systematic Confounding Assessment: One-at-a-Time Addition") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
  column_spec(5, bold = TRUE)

Systematic Confounding Assessment: One-at-a-Time Addition
Covariate	Crude β (exercise)	Adjusted β (exercise)	% Change	Confounder
Age	-4.7115	-4.5504	3.4	No
Sex	-4.7115	-4.6974	0.3	No
Sleep Hours	-4.7115	-4.6831	0.6	No
Education	-4.7115	-4.3912	6.8	No
Income	-4.7115	-3.9406	16.4	Yes (>10%)

4c. (5 pts) Fit a fully adjusted model including exercise and all identified confounders. Report the exercise coefficient and compare it to the crude estimate. How much did the estimate change overall?

# Step 1: Test for interaction between age and exercise
mod_final_int <- lm(physhlth_days ~ exercise * age + sex + sleep_hrs + education + income_cat,
                data = brfss_ci)

# Check interaction term
int_pval <- tidy(mod_final_int) |>
  filter(term == "exerciseYes:age") |>
  pull(p.value)

cat("Interaction p-value (age x exercise):", round(int_pval, 4), "\n")

## Interaction p-value (age x exercise): 4e-04

# p-value < 0.05

# Table
tidy(mod_final_int, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Final Model: Exercise Status → Physical Health Days, Adjusted for Confounders",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Final Model: Exercise Status → Physical Health Days, Adjusted for Confounders
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	9.5496	1.1017	8.6681	0.0000	7.3898	11.7094
exerciseYes	-0.4797	0.9525	-0.5037	0.6145	-2.3470	1.3876
age	0.0785	0.0143	5.4826	0.0000	0.0505	0.1066
sexFemale	0.0092	0.2169	0.0426	0.9660	-0.4159	0.4344
sleep_hrs	-0.4096	0.0795	-5.1521	0.0000	-0.5655	-0.2538
educationHS graduate	-1.0692	0.5083	-2.1034	0.0355	-2.0658	-0.0726
educationSome college	-0.7924	0.5145	-1.5401	0.1236	-1.8012	0.2163
educationCollege graduate	-1.0013	0.5222	-1.9175	0.0552	-2.0251	0.0224
income_cat	-0.6311	0.0589	-10.7067	0.0000	-0.7467	-0.5156
exerciseYes:age	-0.0567	0.0159	-3.5588	0.0004	-0.0880	-0.0255

# Compare Crude-Final
b_final <- coef(mod_final_int)["exerciseYes"]

tribble(
  ~Model, ~`Exercise β`, ~`% Change from Crude`,
  "Crude", round(b_crude_val, 4), "—",
  "Unadjusted (exercise only)", round(coef(mod_crude)["exerciseYes"], 4),
    paste0(round(abs(b_crude_val - coef(mod_crude)["exerciseYes"]) / abs(b_crude_val) * 100, 1), "%"),
  "Fully Adjusted (all confounders)", round(b_final, 4),
    paste0(round(abs(b_crude_val - b_final) / abs(b_crude_val) * 100, 1), "%")
) |>
  kable(caption = "Sleep Coefficient: Crude vs. Progressively Adjusted Models") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Sleep Coefficient: Crude vs. Progressively Adjusted Models
Model	Exercise β	% Change from Crude
Crude	-4.7115	—
Unadjusted (exercise only)	-4.7115	0%
Fully Adjusted (all confounders)	-0.4797	89.8%

Interpretation: From the fully adjusted model, including all confounders, the exercise coefficient is -0.4797 Compared to the crude estimate, -4.7115, it changed 89.8%.

4d. (5 pts) Is gen_health a confounder or a mediator of the exercise-physical health relationship? Could it be both? Explain your reasoning with reference to the three conditions for confounding and the concept of the causal pathway.

Interpretation: Gen Health could be a confounder of the exercise-physical health relationship. People who exercise generally have better general health, affirming association with exposure. Additionally, general health is a strong predictor of physically unhealthy days, confirming association with outcome. But, exercise directly influences general health, indicating that it is on the causal pathway. Thus, gen health would be better identified as a mediator of the exercise-physical health relationship. Exercise improves health status over time.

Task 5: Public Health Interpretation (20 points)

5a. (10 pts) Based on your analyses, write a 4–5 sentence paragraph for a public health audience summarizing:

Whether the association between exercise and physical health differs by any subgroup (interaction assessment)
Which variables confounded the exercise-physical health association
The direction and approximate magnitude of the adjusted exercise effect
Appropriate caveats about cross-sectional data and potential unmeasured confounding

Do not use statistical jargon.

Summary: The association between exercise and physical health showed substantial difference by subgroup. While individuals benefit from exercise, the protective effect increases as people get older. Age, sleep hours, education level, and income are variables that confounded the exercise-physical health association. Moreover, after accounting for confounding factors the direction and approximate magnitude of the adjusted exercise effect remained positive, however, the initial estimate was reduced by 89.8%. Since the data used for this analysis was collected at a single point in time, it cannot be definitively proven that exercise causes better health, and can we cannot rule out other potential unmeasured lifestyle factors that might influence these results.

5b. (10 pts) A colleague suggests including gen_health as a covariate in the final model because it changes the exercise coefficient by more than 10%. You disagree. Write a 3–4 sentence argument explaining why adjusting for general health may not be appropriate if the goal is to estimate the total effect of exercise on physical health days. Use the concept of mediation in your argument.

Adjusting for general health may not be appropriate if the goal is to estimate the total effect of exercise on physical health days because general health acts as a mediator in the exercise-physical health relationship. While it does change the exercise coefficient by more than 10%, general health is on the causal pathway; exercise improves overall health, which directly reduces physically unhealthy days. Including general health as a covariate in the final model will lead to an over-adjusted estimate, which we don’t want.

End of Lab Activity

Confounding and Interactions in Regression

EPI 553 — Principles of Statistical Inference II (Spring 2026)

Natalia Small

March 24, 2026