Part 2: In-Class Lab Activity

EPI 553 — Confounding and Interactions Lab

Instructions

In this lab, you will assess interaction and confounding in the BRFSS 2020 dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.

Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.

Data for the Lab

Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents.

Variable	Description	Type
`physhlth_days`	Physically unhealthy days in past 30	Continuous (0–30)
`sleep_hrs`	Sleep hours per night	Continuous (1–14)
`menthlth_days`	Mentally unhealthy days in past 30	Continuous (0–30)
`age`	Age in years (capped at 80)	Continuous
`sex`	Sex (Male/Female)	Factor
`education`	Education level (4 categories)	Factor
`exercise`	Any physical activity (Yes/No)	Factor
`gen_health`	General health status (5 categories)	Factor
`income_cat`	Household income (1–8 ordinal)	Numeric

# Load the dataset
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(car)
library(ggeffects)

brfss_ci <- readRDS(
  "~/Desktop/EPI553/data/brfss_ci_2020.rds"
)

Task 1: Exploratory Data Analysis (15 points)

1a. (5 pts) Create a scatterplot of physhlth_days (y-axis) vs. age (x-axis), colored by exercise status. Add separate regression lines for each group. Describe the pattern you observe.

There is a slight positive relationship indicating that as age increases so does physically unhealthy days. Points are scattered so there is not a perfect relationship.

# Create scatterplot
ggplot(brfss_ci, aes(x = age, y = physhlth_days, color= exercise)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  labs(
    title = "Age vs Physically Unhealthy Days",
    subtitle = "BRFSS Data",
    x = "Age (years)",
    y = "Physically Unhealthy Days in the past 30"
  ) +
  theme_minimal()

1b. (5 pts) Compute the mean physhlth_days for each combination of sex and exercise. Present the results in a table. Does it appear that the association between exercise and physical health days might differ by sex?

For both male and female, those who exercised had a lower mean of unhealthy physical health days. Those who do not exercise had a higher mean. The means for males and females who do not exercise vary slightly with females having a higher mean at 7.43 in comparison to males at 6.64.

brfss_ci %>%
  group_by(sex, exercise) %>%
  summarise(
    mean_physhlth_days= mean(physhlth_days, na.rm= TRUE),
    .groups= "drop"
  )

## # A tibble: 4 × 3
##   sex    exercise mean_physhlth_days
##   <fct>  <fct>                 <dbl>
## 1 Male   No                     6.64
## 2 Male   Yes                    2.32
## 3 Female No                     7.43
## 4 Female Yes                    2.45

1c. (5 pts) Create a scatterplot of physhlth_days vs. sleep_hrs, faceted by education level. Comment on whether the slopes appear similar or different across education groups.

All scatterplots show a negative relationship between hours of sleep each night and physically unhealthy days. The college graduate group has a flatter slope while the less than high school group is the steeper.

# Create scatterplot
ggplot(brfss_ci, aes(x = sleep_hrs, y = physhlth_days)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  facet_wrap(~ education) +
  labs(
    title = "Sleep Hours per Night vs Physically Unhealthy Days",
    subtitle = "BRFSS Data",
    x = "Sleep (Hours per Night)",
    y = "Physically Unhealthy Days in the past 30"
  ) +
  theme_minimal()

Task 2: Stratified Analysis for Interaction (15 points)

2a. (5 pts) Fit separate simple linear regression models of physhlth_days ~ age for exercisers and non-exercisers. Report the slope, SE, and 95% CI for age in each stratum.

Exercisers Slope= 0.019194 Std. error= 0.005984 95% CI= [0.0075,0.0309]

Non-exercisers Slope= 0.0745 Std. error= 0.0212 95% CI= [0.1161, 0.0005]

# Fit separate models for exercisers and nonexercisers
mod_yes <- lm(physhlth_days ~ age, data = brfss_ci |> filter(exercise == "Yes"))
mod_no <- lm(physhlth_days ~ age, data = brfss_ci |> filter(exercise == "No"))

# Compare coefficients
bind_rows(
  tidy(mod_yes, conf.int = TRUE) |> mutate(Stratum = "Exerciser"),
  tidy(mod_no, conf.int = TRUE) |> mutate(Stratum = "Nonexerciser")
) |>
  filter(term == "age") |>
  select(Stratum, estimate, std.error, conf.low, conf.high, p.value) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Stratified Analysis: Age → Physical Health Days, by Exercise",
    col.names = c("Stratum", "Estimate", "SE", "CI Lower", "CI Upper", "p-value")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Stratified Analysis: Age → Physical Health Days, by Exercise
Stratum	Estimate	SE	CI Lower	CI Upper	p-value
Exerciser	0.0192	0.0060	0.0075	0.0309	0.0013
Nonexerciser	0.0745	0.0212	0.0328	0.1161	0.0005

2b. (5 pts) Create a single plot showing the two fitted regression lines (one per exercise group) overlaid on the data. Are the lines approximately parallel?

The lines are not parallel. Both lines do slope upward slightly, confirming a positive relationship between age and unhealthy physical days for exercisers and nonexercisers.

ggplot(brfss_ci, aes(x = age, y = physhlth_days, color = exercise)) +
  geom_jitter(alpha = 0.08, width = 0.3, height = 0.3, size = 0.8) +
  geom_smooth(method = "lm", se = TRUE, linewidth = 1.2) +
  labs(
    title    = "Association Between Age and Physical Health Days, Stratified by Exercise",
    subtitle = "Are the slopes parallel or different?",
    x        = "Age in Years",
    y        = "Physically Unhealthy Days (Past 30)",
    color    = "Exercise"
  ) +
  theme_minimal(base_size = 13) +
  scale_color_brewer(palette = "Set1")

2c. (5 pts) Can you formally test whether the two slopes are different using only the stratified results? Explain why or why not.

No you cannot formally test whether the two slopes are different using only stratified results. We can use plots to visualize if the slopes are relatively parallel but no p-value can be calculated for the difference.

Task 3: Interaction via Regression (25 points)

3a. (5 pts) Fit the interaction model: physhlth_days ~ age * exercise. Write out the fitted equation.

Fitted Equation Physhlth_days= 2.7610 + 0.0745 (age) + -1.3858 (Yes) + -0.0553 (age x Yes)

#Model without interaction (additive model)
mod_add <- lm(physhlth_days ~ age + exercise, data = brfss_ci)
# Model with interaction
mod_int <- lm(physhlth_days ~ age * exercise, data = brfss_ci)

tidy(mod_int, conf.int = TRUE) |>
mutate(across(where(is.numeric), \(x) round(x, 4))) |>
kable(
caption = "Interaction Model: Age * Exercise → Physical Health Days",
col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Uppe
r")
) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Interaction Model: Age * Exercise → Physical Health Days
Term	Estimate	SE	t	p-value	CI Lower	CI Uppe
(Intercept)	2.7610	0.8798	3.1381	0.0017	1.0361	4.4858
age	0.0745	0.0145	5.1203	0.0000	0.0460	0.1030
exerciseYes	-1.3858	0.9664	-1.4340	0.1516	-3.2804	0.5087
age:exerciseYes	-0.0553	0.0162	-3.4072	0.0007	-0.0871	-0.0235

3b. (5 pts) Using the fitted equation, derive the stratum-specific equations for exercisers and non-exercisers. Verify that these match the stratified analysis from Task 2.

Exercisers Physhlth_days= 2.7610 + 0.0745 (age) + -1.3858 (Yes) + -0.0553 (age x Yes) Physhlth_days= 1.3752 + 0.0192(age)

Non-exercisers Physhlth_days= 2.7610 + 0.0745(age)

3c. (5 pts) Conduct the t-test for the interaction term (age:exerciseYes). State the null and alternative hypotheses, report the test statistic and p-value, and state your conclusion about whether interaction is present.

H0= B=0 (slopes are equal, lines are parallel, no interaction) H1= B≠0 (slopes differ, interaction is present) Test statistic= -3.407 p-value= 7e-04

The interaction term (age:exerciseYes) has a coefficient of -0.0553, t-stat of -3.407, p-value of 7e-04 (0.0007). Since the p-value is less than 0.05, we reject the null hypothesis. There is statistically significant evidence that the association between age and physically unhealthy days differ by exercise status, indicating the presence of an interaction.

int_term <- tidy(mod_int) |> filter(term == "age:exerciseYes")
cat("Interaction term (age:exerciseYes):\n")

## Interaction term (age:exerciseYes):

cat("estimate:", round(int_term$estimate, 4), "\n")

## estimate: -0.0553

cat("t-statistic:", round(int_term$statistic, 3), "\n")

## t-statistic: -3.407

cat("p-value:", round(int_term$p.value,4), "\n")

## p-value: 7e-04

3d. (5 pts) Now fit a model with an interaction between age and education: physhlth_days ~ age * education. How many interaction terms are produced? Use a partial F-test to test whether the age \(\times\) education interaction as a whole is significant.

There are 3 interaction terms produced. No significance since p-value is 0.6818 which is more than 0.05

mod_add_edu <- lm(physhlth_days ~ age + education, data = brfss_ci)
mod_int_edu <- lm(physhlth_days ~ age * education, data = brfss_ci)

anova(mod_add_edu, mod_int_edu) |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Test for Coincidence: Does Education Affect the Age-Physical Health Relationship at All?") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Test for Coincidence: Does Education Affect the Age-Physical Health Relationship at All?
term	df.residual	rss	df	sumsq	statistic	p.value
physhlth_days ~ age + education	4995	306764.2	NA	NA	NA	NA
physhlth_days ~ age * education	4992	306671.9	3	92.2713	0.5007	0.6818

3e. (5 pts) Create a visualization using ggpredict() showing the predicted physhlth_days by age for each education level. Do the lines appear parallel?

All four are not parallel with eachother. However, some college and college graduate seem to be close to parallel and HS graduate and Less than high school seem to be parallel.

pred_int <- ggpredict(mod_int_edu, terms = c("age", "education"))

ggplot(pred_int, aes(x = x, y = predicted, color = group, fill = group)) +
  geom_line(linewidth = 1.1) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.1, color = NA) +
  labs(
    title    = "Predicted Physical Health Days by Age and Education",
    subtitle = "Are the lines parallel? Non-parallel lines indicate interaction.",
    x        = "Age in years",
    y        = "Predicted Physically Unhealthy Days",
    color    = "Education",
    fill     = "Education"
  ) +
  theme_minimal(base_size = 13) +
  scale_color_brewer(palette = "Set2")

Task 4: Confounding Assessment (25 points)

For this task, the exposure is exercise and the outcome is physhlth_days.

4a. (5 pts) Fit the crude model: physhlth_days ~ exercise. Report the exercise coefficient. This is the unadjusted estimate.

exercise coefficient= -4.7115

# Crude model: exercise → physical health days
mod_crude <- lm(physhlth_days ~ exercise, data = brfss_ci)

# Adjusted model: adding age
mod_adj_age <- lm(physhlth_days ~ exercise + age, data = brfss_ci)

# Extract exercise coefficients
b_crude <- coef(mod_crude)["exerciseYes"]
b_adj   <- coef(mod_adj_age)["exerciseYes"]
pct_change <- abs(b_crude - b_adj) / abs(b_crude) * 100

tribble(
  ~Model, ~`Exercise β`, ~SE, ~`95% CI`,
  "Crude (exerciseYes)",
    round(b_crude, 4),
    round(tidy(mod_crude) |> filter(term == "exerciseYes") |> pull(std.error), 4),
    paste0("(", round(confint(mod_crude)["exerciseYes", 1], 4), ", ",
           round(confint(mod_crude)["exerciseYes", 2], 4), ")"),
  "Adjusted (+ age)",
    round(b_adj, 4),
    round(tidy(mod_adj_age) |> filter(term == "exerciseYes") |> pull(std.error), 4),
    paste0("(", round(confint(mod_adj_age)["exerciseYes", 1], 4), ", ",
           round(confint(mod_adj_age)["exerciseYes", 2], 4), ")")
) |>
  kable(caption = "Confounding Assessment: Does Age Confound the Exercise-Physical Health Association?") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Confounding Assessment: Does Age Confound the Exercise-Physical Health Association?
Model	Exercise β	SE	95% CI
Crude (exerciseYes)	-4.7115	0.2656	(-5.2322, -4.1908)
Adjusted (+ age)	-4.5504	0.2673	(-5.0744, -4.0264)

4b. (10 pts) Systematically assess whether each of the following is a confounder of the exercise-physical health association: age, sex, sleep_hrs, education, and income_cat. For each:

Fit the model physhlth_days ~ exercise + [covariate]
Report the adjusted exercise coefficient
Compute the percent change from the crude estimate
Apply the 10% rule to determine if the variable is a confounder

Present your results in a single summary table.

# Crude model
b_crude_val <- coef(mod_crude)["exerciseYes"]

# One-at-a-time adjusted models
confounders <- list(
  "Age"             = lm(physhlth_days ~ exercise + age, data = brfss_ci),
  "Sex"             = lm(physhlth_days ~ exercise + sex, data = brfss_ci),
  "Sleep_hrs"        = lm(physhlth_days ~ exercise + sleep_hrs, data = brfss_ci),
  "Education"       = lm(physhlth_days ~ exercise + education, data = brfss_ci),
  "Income"          = lm(physhlth_days ~ exercise + income_cat, data = brfss_ci)
)

conf_table <- map_dfr(names(confounders), \(name) {
  mod <- confounders[[name]]
  b_adj_val <- coef(mod)["exerciseYes"]
  tibble(
    Covariate = name,
    `Crude β (exerciseYes)` = round(b_crude_val, 4),
    `Adjusted β (exerciseYes)` = round(b_adj_val, 4),
    `% Change` = round(abs(b_crude_val - b_adj_val) / abs(b_crude_val) * 100, 1),
    Confounder = ifelse(abs(b_crude_val - b_adj_val) / abs(b_crude_val) * 100 > 10,
                        "Yes (>10%)", "No")
  )
})

conf_table |>
  kable(caption = "Systematic Confounding Assessment: One-at-a-Time Addition") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
  column_spec(5, bold = TRUE)

Systematic Confounding Assessment: One-at-a-Time Addition
Covariate	Crude β (exerciseYes)	Adjusted β (exerciseYes)	% Change	Confounder
Age	-4.7115	-4.5504	3.4	No
Sex	-4.7115	-4.6974	0.3	No
Sleep_hrs	-4.7115	-4.6831	0.6	No
Education	-4.7115	-4.3912	6.8	No
Income	-4.7115	-3.9406	16.4	Yes (>10%)

4c. (5 pts) Fit a fully adjusted model including exercise and all identified confounders. Report the exercise coefficient and compare it to the crude estimate. How much did the estimate change overall?

The exercise coefficient is -3.7297 and the crude estimate is -4.7115. The estimate changed 20.84% after adjusting for confounders.

# fully adjusted model 
mod_full <- lm(physhlth_days ~ exercise + age + sex + sleep_hrs + education + income_cat,
               data = brfss_ci)
# coefficients 
b_crude <- coef(mod_crude)["exerciseYes"]
b_full  <- coef(mod_full)["exerciseYes"]

# Percent change in exercise coefficients
pct_change <- abs(b_crude - b_full) / abs(b_crude) * 100

# results
cat("Crude β:", round(b_crude, 4), "\n")

## Crude β: -4.7115

cat("Adjusted β:", round(b_full, 4), "\n")

## Adjusted β: -3.7297

cat("Percent change:", round(pct_change, 2), "%\n")

## Percent change: 20.84 %

4d. (5 pts) Is gen_health a confounder or a mediator of the exercise-physical health relationship? Could it be both? Explain your reasoning with reference to the three conditions for confounding and the concept of the causal pathway.

gen_health can be a confounder or a mediator of exercise-physical health relationship. A confounder exists when the estimated association between an outcome and exposure is affected by a third variable. For a variable to be a confounder it has to be associated with both the exposure and outcome and not on the causal pathway. Being on the causal pathway would make it a mediator. Gen_health could be a confounder that causes poor exercise habits and more unhealthy days but it could be a mediator if the individual has poor exercise habits causing poor general health that leads to poor physical health.

Task 5: Public Health Interpretation (20 points)

5a. (10 pts) Based on your analyses, write a 4–5 sentence paragraph for a public health audience summarizing:

Whether the association between exercise and physical health differs by any subgroup (interaction assessment)
Which variables confounded the exercise-physical health association
The direction and approximate magnitude of the adjusted exercise effect
Appropriate caveats about cross-sectional data and potential unmeasured confounding

Do not use statistical jargon.

5b. (10 pts) A colleague suggests including gen_health as a covariate in the final model because it changes the exercise coefficient by more than 10%. You disagree. Write a 3–4 sentence argument explaining why adjusting for general health may not be appropriate if the goal is to estimate the total effect of exercise on physical health days. Use the concept of mediation in your argument.

End of Lab Activity

Confounding and Interactions in Regression

EPI 553 — Principles of Statistical Inference II (Spring 2026)

Muntasir Masum

March 24, 2026