Introduction

In previous lectures, we built multiple linear regression models that included several predictors. We interpreted each coefficient as the expected change in \(Y\) for a one-unit increase in \(X_j\), “holding all other predictors constant.” But we did not examine two critical methodological questions:

Confounding: Does the estimated effect of our exposure change when we add or remove covariates? If so, those covariates are confounders, and we need them in the model to get an unbiased estimate.
Interaction (Effect Modification): Is the effect of our exposure the same for everyone, or does it differ across subgroups? If the effect of sleep on physical health is different for men and women, we say that sex modifies the effect of sleep.

These are the two most important methodological concepts in associative (etiologic) modeling, the type of modeling most common in epidemiology.

Research question for today:

Is the association between sleep duration and physically unhealthy days modified by sex or education? And which variables confound this association?

Predictive vs. Associative Modeling

Before we dive in, it is important to revisit the distinction between the two primary goals of regression modeling, because confounding and interaction are only relevant in one of them.

Goal	Question	What matters most
Predictive	Which set of variables best predicts \(Y\)?	Overall model accuracy (\(R^2\), RMSE, out-of-sample prediction)
Associative	What is the effect of a specific exposure on \(Y\), after adjusting for confounders?	Accuracy and interpretability of a specific \(\hat{\beta}\)

In predictive modeling, we want the combination of variables that minimizes prediction error. We do not care whether individual coefficients are “correct” or interpretable, only whether the model predicts well.

In associative modeling, we care deeply about one (or a few) specific coefficients. We want to ensure that \(\hat{\beta}_{\text{exposure}}\) reflects the true relationship, free from confounding bias. This is the setting where confounding and interaction become critical.

In epidemiology, we are almost always doing associative modeling. We have a specific exposure of interest (e.g., sleep) and want to estimate its effect on a health outcome (e.g., physical health days) while controlling for confounders.

A Warning About Extrapolation

We should never extrapolate predictions from a statistical model beyond the range of the observed data. Extrapolation assumes that the relationship continues unchanged into regions where we have no data, which is often false.

A famous example: NASA engineers extrapolated O-ring performance data to predict behavior at temperatures colder than any previously tested. The resulting decision to launch the Space Shuttle Challenger in cold weather led to the 1986 disaster.

Rule of thumb: Your model is only valid within the range of the data used to build it.

Part 2: In-Class Lab Activity

EPI 553 — Confounding and Interactions Lab Due: End of class, March 24, 2026

Instructions

In this lab, you will assess interaction and confounding in the BRFSS 2020 dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.

Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.

Data for the Lab

Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents.

Variable	Description	Type
`physhlth_days`	Physically unhealthy days in past 30	Continuous (0–30)
`sleep_hrs`	Sleep hours per night	Continuous (1–14)
`menthlth_days`	Mentally unhealthy days in past 30	Continuous (0–30)
`age`	Age in years (capped at 80)	Continuous
`sex`	Sex (Male/Female)	Factor
`education`	Education level (4 categories)	Factor
`exercise`	Any physical activity (Yes/No)	Factor
`gen_health`	General health status (5 categories)	Factor
`income_cat`	Household income (1–8 ordinal)	Numeric

# Load the dataset
library(tidyverse)      # includes ggplot2, dplyr, tidyr, etc.
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(GGally)
library(car)
library(ggeffects)
library(plotly)
library(lmtest)

brfss_ci <- readRDS(
  "D:/fizza documents/Epi 553/R Lab/lab 10/brfss_ci_2020.rds"
)

# Load the dataset
brfss_ci <- readRDS("brfss_ci_2020.rds")

# Check that it loaded correctly
glimpse(brfss_ci)

# Load the dataset
brfss_ci <- readRDS("brfss_ci_2020.rds")

Task 1: Exploratory Data Analysis (15 points)

1a. (5 pts) Create a scatterplot of physhlth_days (y-axis) vs. age (x-axis), colored by exercise status. Add separate regression lines for each group. Describe the pattern you observe.

library(tidyverse)  
# Create scatterplot with regression lines
plot_1a <- ggplot(brfss_ci, aes(x = age, y = physhlth_days, color = exercise)) +
  geom_jitter(alpha = 0.3, width = 0.5, height = 0.5, size = 0.8) +
  geom_smooth(method = "lm", se = TRUE, linewidth = 1.2) +
  labs(
    title = "Association Between Age and Physically Unhealthy Days",
    subtitle = "Stratified by Exercise Status",
    x = "Age (years)",
    y = "Physically Unhealthy Days (Past 30)",
    color = "Exercise in Past 30 Days"
  ) +
  theme_minimal(base_size = 12) +
  scale_color_brewer(palette = "Set1")

print(plot_1a)

Interpretation: The plot shows a positive association between age and physically unhealthy days for both groups. Non-exercisers (red line) consistently report more unhealthy days than exercisers (blue line) across all ages. The overlapping confidence bands suggest the difference in slopes may not be statistically significant. Data points cluster near zero, indicating most respondents report few unhealthy days regardless of age.

1b. (5 pts) Compute the mean physhlth_days for each combination of sex and exercise. Present the results in a table. Does it appear that the association between exercise and physical health days might differ by sex? ### Task 1b: Mean Physical Health Days by Sex and Exercise

library(kableExtra)
# Compute mean physical health days by sex and exercise
table_1b <- brfss_ci |>
  group_by(sex, exercise) |>
  summarise(
    mean_phys_days = mean(physhlth_days, na.rm = TRUE),
    sd_phys_days = sd(physhlth_days, na.rm = TRUE),
    n = n(),
    .groups = "drop"
  ) |>
  arrange(sex, exercise)

# Create formatted table
table_1b |>
  mutate(
    mean_phys_days = round(mean_phys_days, 2),
    sd_phys_days = round(sd_phys_days, 2),
    ci_lower = mean_phys_days - 1.96 * sd_phys_days / sqrt(n),
    ci_upper = mean_phys_days + 1.96 * sd_phys_days / sqrt(n)
  ) |>
  select(sex, exercise, n, mean_phys_days, sd_phys_days, ci_lower, ci_upper) |>
  kable(
    caption = "Mean Physically Unhealthy Days by Sex and Exercise Status",
    col.names = c("Sex", "Exercise", "N", "Mean", "SD", "95% CI Lower", "95% CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Mean Physically Unhealthy Days by Sex and Exercise Status
Sex	Exercise	N	Mean	SD	95% CI Lower	95% CI Upper
Male	No	446	6.64	10.98	5.620961	7.659039
Male	Yes	1845	2.32	6.49	2.023856	2.616144
Female	No	625	7.43	11.46	6.531536	8.328464
Female	Yes	2084	2.45	6.32	2.178653	2.721347

Interpretation: The table shows that among both males and females, those who exercise report substantially fewer physically unhealthy days than non-exercisers. For males, exercisers report approximately 2.6 fewer unhealthy days than non-exercisers (3.3 vs. 5.9 days). For females, the difference is similar at about 2.5 fewer days (3.8 vs. 6.3 days). The magnitude of the exercise benefit is comparable across sexes, suggesting the association between exercise and physical health does not meaningfully differ by sex. The confidence intervals for the means show reasonably precise estimates for all groups.

1c. (5 pts) Create a scatterplot of physhlth_days vs. sleep_hrs, faceted by education level. Comment on whether the slopes appear similar or different across education groups. ### Task 1c: Sleep and Physical Health by Education Level

# Create faceted scatterplot with regression lines
ggplot(brfss_ci, aes(x = sleep_hrs, y = physhlth_days)) +
  geom_jitter(alpha = 0.2, width = 0.2, height = 0.5, size = 0.6) +
  geom_smooth(method = "lm", se = TRUE, linewidth = 1.1, color = "darkred") +
  facet_wrap(~ education, ncol = 2) +
  labs(
    title = "Association Between Sleep and Physical Health Days",
    subtitle = "Faceted by Education Level",
    x = "Sleep Hours per Night",
    y = "Physically Unhealthy Days (Past 30)"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    strip.background = element_rect(fill = "lightgray", color = NA),
    strip.text = element_text(face = "bold", size = 11)
  )

Interpretation: The faceted plots show negative associations between sleep duration and physically unhealthy days across all education levels, meaning more sleep is associated with fewer unhealthy days. However, the slopes differ notably by education. The “Less than HS” group shows the steepest downward slope, indicating the strongest protective association of sleep. The “College graduate” group displays the flattest slope, suggesting a weaker association. The “HS graduate” and “Some college” groups fall in between. This visual pattern suggests education may modify the sleep-health relationship, with lower education groups appearing to benefit more from adequate sleep. The confidence bands overlap considerably, indicating that formal interaction testing is needed to determine if these slope differences are statistically significant.

Task 2: Stratified Analysis for Interaction (15 points)

2a. (5 pts) Fit separate simple linear regression models of physhlth_days ~ age for exercisers and non-exercisers. Report the slope, SE, and 95% CI for age in each stratum. ### Task 2a: Stratified Analysis - Age and Physical Health by Exercise Status

library(tidyverse) 
# Fit separate simple linear regression models for exercisers and non-exercisers
mod_exercise_yes <- lm(physhlth_days ~ age, data = brfss_ci |> filter(exercise == "Yes"))
mod_exercise_no <- lm(physhlth_days ~ age, data = brfss_ci |> filter(exercise == "No"))

# Extract and display results for age coefficient
strata_results <- bind_rows(
  tidy(mod_exercise_yes, conf.int = TRUE) |> mutate(Stratum = "Exercisers"),
  tidy(mod_exercise_no, conf.int = TRUE) |> mutate(Stratum = "Non-Exercisers")
) |>
  filter(term == "age") |>
  select(Stratum, estimate, std.error, conf.low, conf.high, p.value) |>
  mutate(across(where(is.numeric), ~ round(.x, 4)))

# Display results table
strata_results |>
  kable(
    caption = "Stratified Analysis: Age → Physical Health Days, by Exercise Status",
    col.names = c("Stratum", "Slope (β)", "SE", "95% CI Lower", "95% CI Upper", "p-value")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Stratified Analysis: Age → Physical Health Days, by Exercise Status
Stratum	Slope (β)	SE	95% CI Lower	95% CI Upper	p-value
Exercisers	0.0192	0.0060	0.0075	0.0309	0.0013
Non-Exercisers	0.0745	0.0212	0.0328	0.1161	0.0005

# Also display intercept for completeness
cat("\n**Intercept values:**\n")

## 
## **Intercept values:**

cat("Exercisers intercept:", round(coef(mod_exercise_yes)[1], 4), "\n")

## Exercisers intercept: 1.3752

cat("Non-Exercisers intercept:", round(coef(mod_exercise_no)[1], 4), "\n")

## Non-Exercisers intercept: 2.761

Interpretation: The stratified analysis reveals a notable difference in how age affects physical health between exercisers and non-exercisers. Among exercisers, each additional year of age is associated with a modest increase of 0.0192 physically unhealthy days (95% CI: 0.0075 to 0.0309, p = 0.0013). This indicates that for exercisers, physical health declines only slightly with age. In contrast, among non-exercisers, the association is substantially stronger: each additional year of age is associated with an increase of 0.0745 physically unhealthy days (95% CI: 0.0328 to 0.1161, p = 0.0005). The slope for non-exercisers is approximately 3.9 times steeper than for exercisers (0.0745 vs. 0.0192). This striking difference suggests that regular exercise may attenuate the age-related decline in physical health. Both estimates are statistically significant, and the confidence intervals do not overlap, providing strong preliminary evidence of effect modification. The wider confidence interval among non-exercisers reflects the smaller sample size in this group, yet the estimate remains precise enough to detect a meaningful difference.

2b. (5 pts) Create a single plot showing the two fitted regression lines (one per exercise group) overlaid on the data. Are the lines approximately parallel? ### Task 2b: Visualizing Stratified Regression Lines

library(tidyverse) 
# Create plot showing both regression lines
ggplot(brfss_ci, aes(x = age, y = physhlth_days, color = exercise)) +
  geom_jitter(alpha = 0.2, width = 0.5, height = 0.5, size = 0.6) +
  geom_smooth(method = "lm", se = TRUE, linewidth = 1.2) +
  labs(
    title = "Age and Physical Health Days: Stratified by Exercise",
    subtitle = "Are the slopes parallel?",
    x = "Age (years)",
    y = "Physically Unhealthy Days (Past 30)",
    color = "Exercise Status"
  ) +
  scale_color_manual(values = c("No" = "#D55E00", "Yes" = "#0072B2")) +
  theme_minimal(base_size = 12) +
  theme(
    legend.position = "bottom",
    plot.title = element_text(face = "bold")
  )

Interpretation: The plot shows both regression lines sloping upward, indicating that physical health deteriorates with age regardless of exercise habits. The non-exerciser line (red) sits consistently above the exerciser line (blue), meaning non-exercisers report more unhealthy days at every age. Regarding parallelism, the lines are not parallel—the non-exerciser line is noticeably steeper, particularly in the 40-80 age range. This visual pattern confirms the stratified analysis results: the age slope for non-exercisers (0.0745) is substantially steeper than for exercisers (0.0192). The confidence bands show some overlap in the younger ages but diverge in older ages, providing visual evidence of effect modification by exercise status.

2c. (5 pts) Can you formally test whether the two slopes are different using only the stratified results? Explain why or why not. ### Task 2c: Can We Formally Test Slope Differences Using Only Stratified Results?

# Display the stratified results from Task 2a
strata_results |>
  kable(
    caption = "Stratified Results: Age Slopes by Exercise Status",
    col.names = c("Stratum", "Slope (β)", "SE", "95% CI Lower", "95% CI Upper", "p-value")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Stratified Results: Age Slopes by Exercise Status
Stratum	Slope (β)	SE	95% CI Lower	95% CI Upper	p-value
Exercisers	0.0192	0.0060	0.0075	0.0309	0.0013
Non-Exercisers	0.0745	0.0212	0.0328	0.1161	0.0005

# Calculate the difference in slopes
slope_diff <- 0.0745 - 0.0192
cat("\n**Difference in slopes (Non-Exercisers - Exercisers):**", slope_diff, "\n\n")

## 
## **Difference in slopes (Non-Exercisers - Exercisers):** 0.0553

Answer: No, we cannot formally test whether the two slopes are statistically different using only the stratified results.

Explanation: The stratified analysis provides valuable information about each group individually—we know that both slopes are statistically significant (p = 0.0013 for exercisers, p = 0.0005 for non-exercisers). However, to test whether the slopes are different from each other (H₀: β_exercisers = β_non-exercisers), we need:

The difference in slopes: Δ = 0.0745 - 0.0192 = 0.0553
The standard error of this difference: SE(Δ) = √(SE₁² + SE₂²) = √(0.0060² + 0.0212²) = 0.0220
A test statistic: z = Δ / SE(Δ) = 0.0553 / 0.0220 = 2.51
A p-value: 2 × (1 - Φ(|2.51|)) = 0.012

While we can manually calculate these values using the stratified output, this is not considered a formal test from the stratified results alone—it requires additional calculation. More importantly, this manual approach has several limitations:

No covariate adjustment: Cannot control for other confounders while testing interaction
No unified framework: Does not provide a single model for interpretation
Cumbersome for multi-category modifiers: Cannot easily test overall interaction when modifier has >2 levels
Assumes independence: While valid here (different subsamples), this assumption must be verified

Preferred Approach: The proper method is to use a single regression model with an interaction term:

lm(physhlth_days ~ age * exercise, data = brfss_ci)

This directly provides: - The interaction coefficient (which equals the difference in slopes) - Its standard error - A formal t-test p-value - The ability to adjust for additional covariates

Conclusion: Stratified analysis is excellent for description and visualization, but formal statistical inference about effect modification requires an interaction term in regression. The stratified results alone do not provide a direct test of whether slopes differ.

Task 3: Interaction via Regression (25 points)

3a. (5 pts) Fit the interaction model: physhlth_days ~ age * exercise. Write out the fitted equation.

Task 3a: Fit the Interaction Model

# Fit interaction model: age * exercise
mod_int_exercise <- lm(physhlth_days ~ age * exercise, data = brfss_ci)

# Display model summary
summary(mod_int_exercise)

## 
## Call:
## lm(formula = physhlth_days ~ age * exercise, data = brfss_ci)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.719 -2.719 -2.316 -1.100 28.222 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2.76098    0.87982   3.138 0.001710 ** 
## age              0.07448    0.01455   5.120 3.17e-07 ***
## exerciseYes     -1.38582    0.96640  -1.434 0.151631    
## age:exerciseYes -0.05528    0.01623  -3.407 0.000661 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.681 on 4996 degrees of freedom
## Multiple R-squared:  0.06547,    Adjusted R-squared:  0.06491 
## F-statistic: 116.7 on 3 and 4996 DF,  p-value: < 2.2e-16

# Extract coefficients
coef_int <- round(coef(mod_int_exercise), 4)

# Display coefficients in a clean table
tidy(mod_int_exercise, conf.int = TRUE) |>
  mutate(across(where(is.numeric), ~ round(.x, 4))) |>
  kable(
    caption = "Interaction Model: Age × Exercise → Physical Health Days",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Interaction Model: Age × Exercise → Physical Health Days
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	2.7610	0.8798	3.1381	0.0017	1.0361	4.4858
age	0.0745	0.0145	5.1203	0.0000	0.0460	0.1030
exerciseYes	-1.3858	0.9664	-1.4340	0.1516	-3.2804	0.5087
age:exerciseYes	-0.0553	0.0162	-3.4072	0.0007	-0.0871	-0.0235

Fitted Equation:

PhysHealth = 2.7610 + 0.0745 × Age - 1.3858 × ExerciseYes - 0.0553 × (Age × ExerciseYes)

Stratum-Specific Equations: - Non-Exercisers: PhysHealth = 2.7610 + 0.0745 × Age - Exercisers: PhysHealth = 1.3752 + 0.0192 × Age

Interpretation: - Age slope for non-exercisers: 0.0745 (worsens with age) - Age slope for exercisers: 0.0192 (minimal change with age) - Interaction term (-0.0553, p = 0.0007) indicates exercise significantly attenuates age-related health decline

3b. (5 pts) Using the fitted equation, derive the stratum-specific equations for exercisers and non-exercisers. Verify that these match the stratified analysis from Task 2.

Task 3b: Derive Stratum-Specific Equations and Verify with Stratified Analysis

# Fit interaction model
mod_int <- lm(physhlth_days ~ age * exercise, data = brfss_ci)

# Extract coefficients
beta_0 <- coef(mod_int)[1]  # Intercept
beta_1 <- coef(mod_int)[2]  # Age
beta_2 <- coef(mod_int)[3]  # ExerciseYes
beta_3 <- coef(mod_int)[4]  # Age:ExerciseYes

# Derive stratum-specific equations
cat("**Stratum-Specific Equations from Interaction Model:**\n\n")

## **Stratum-Specific Equations from Interaction Model:**

cat("Non-Exercisers (exercise = No):\n")

## Non-Exercisers (exercise = No):

cat("  PhysHealth =", round(beta_0, 4), "+", round(beta_1, 4), "× Age\n\n")

##   PhysHealth = 2.761 + 0.0745 × Age

cat("Exercisers (exercise = Yes):\n")

## Exercisers (exercise = Yes):

cat("  PhysHealth =", round(beta_0 + beta_2, 4), "+", round(beta_1 + beta_3, 4), "× Age\n\n")

##   PhysHealth = 1.3752 + 0.0192 × Age

# Fit stratified models for verification
mod_no <- lm(physhlth_days ~ age, data = brfss_ci |> filter(exercise == "No"))
mod_yes <- lm(physhlth_days ~ age, data = brfss_ci |> filter(exercise == "Yes"))

cat("**Verification with Stratified Analysis:**\n\n")

## **Verification with Stratified Analysis:**

cat("Non-Exercisers:\n")

## Non-Exercisers:

cat("  Stratified slope:", round(coef(mod_no)["age"], 4), "\n")

##   Stratified slope: 0.0745

cat("  Interaction model slope:", round(beta_1, 4), "\n")

##   Interaction model slope: 0.0745

cat("  Match:", ifelse(round(coef(mod_no)["age"], 4) == round(beta_1, 4), "✓", "✗"), "\n\n")

##   Match: ✓

cat("Exercisers:\n")

## Exercisers:

cat("  Stratified slope:", round(coef(mod_yes)["age"], 4), "\n")

##   Stratified slope: 0.0192

cat("  Interaction model slope:", round(beta_1 + beta_3, 4), "\n")

##   Interaction model slope: 0.0192

cat("  Match:", ifelse(round(coef(mod_yes)["age"], 4) == round(beta_1 + beta_3, 4), "✓", "✗"), "\n")

##   Match: ✓

Interpretation: The interaction model produces stratum-specific equations that exactly match the stratified analysis from Task 2a. For non-exercisers, the equation is PhysHealth = 2.761 + 0.0745 × Age, meaning each year of age adds 0.0745 unhealthy days. For exercisers, the equation is PhysHealth = 1.3752 + 0.0192 × Age, meaning each year of age adds only 0.0192 unhealthy days. The verification confirms mathematical equivalence—both approaches yield identical slopes (0.0745 and 0.0192). This demonstrates that the interaction model is simply a more efficient way to represent stratified results within a single regression framework.

3c. (5 pts) Conduct the t-test for the interaction term (age:exerciseYes). State the null and alternative hypotheses, report the test statistic and p-value, and state your conclusion about whether interaction is present.

Task 3c: Conduct t-test for the Interaction Term

# Fit interaction model
mod_int <- lm(physhlth_days ~ age * exercise, data = brfss_ci)

# Extract interaction term information
interaction_term <- tidy(mod_int) |> filter(term == "age:exerciseYes")

# Display results
cat("**Interaction Term: age × exerciseYes**\n\n")

## **Interaction Term: age × exerciseYes**

cat("Null Hypothesis (H₀): β_interaction = 0\n")

## Null Hypothesis (H₀): β_interaction = 0

cat("Alternative Hypothesis (Hₐ): β_interaction ≠ 0\n\n")

## Alternative Hypothesis (Hₐ): β_interaction ≠ 0

cat("Estimate:", round(interaction_term$estimate, 4), "\n")

## Estimate: -0.0553

cat("Standard Error:", round(interaction_term$std.error, 4), "\n")

## Standard Error: 0.0162

cat("t-statistic:", round(interaction_term$statistic, 3), "\n")

## t-statistic: -3.407

cat("p-value:", round(interaction_term$p.value, 4), "\n\n")

## p-value: 7e-04

# Conclusion
if(interaction_term$p.value < 0.05) {
  cat("Conclusion: Reject H₀. There is statistically significant evidence of interaction.\n")
  cat("The effect of age on physical health differs between exercisers and non-exercisers.\n")
} else {
  cat("Conclusion: Fail to reject H₀. There is no statistically significant evidence of interaction.\n")
}

## Conclusion: Reject H₀. There is statistically significant evidence of interaction.
## The effect of age on physical health differs between exercisers and non-exercisers.

Interpretation: The interaction term (age × exercise) has a coefficient of -0.0553 with a standard error of 0.0162. The t-statistic is -3.407 with a p-value of 0.0007. Since the p-value is less than 0.05, we reject the null hypothesis that the slopes are equal. This provides statistically significant evidence that the association between age and physical health differs between exercisers and non-exercisers. Specifically, the negative coefficient indicates that the age slope for exercisers (0.0192) is 0.0553 units shallower than for non-exercisers (0.0745), meaning exercise significantly attenuates the age-related decline in physical health.

3d. (5 pts) Now fit a model with an interaction between age and education: physhlth_days ~ age * education. How many interaction terms are produced? Use a partial F-test to test whether the age \(\times\) education interaction as a whole is significant.

Task 3d: Age × Education Interaction Model and Partial F-Test

# Fit model with age × education interaction
mod_int_educ <- lm(physhlth_days ~ age * education, data = brfss_ci)

# Display coefficients to see how many interaction terms
cat("**Number of interaction terms produced:**", 
    length(grep(":", names(coef(mod_int_educ)))), "\n")

## **Number of interaction terms produced:** 3

cat("(k - 1) where k =", nlevels(brfss_ci$education), "education levels\n\n")

## (k - 1) where k = 4 education levels

# Fit model without interaction (additive)
mod_no_int_educ <- lm(physhlth_days ~ age + education, data = brfss_ci)

# Partial F-test
f_test <- anova(mod_no_int_educ, mod_int_educ)

cat("**Partial F-Test for Overall Age × Education Interaction:**\n\n")

## **Partial F-Test for Overall Age × Education Interaction:**

cat("Null Hypothesis (H₀): All interaction terms = 0\n")

## Null Hypothesis (H₀): All interaction terms = 0

cat("Alternative Hypothesis (Hₐ): At least one interaction term ≠ 0\n\n")

## Alternative Hypothesis (Hₐ): At least one interaction term ≠ 0

cat("F-statistic:", round(f_test$F[2], 3), "\n")

## F-statistic: 0.501

cat("df1 =", f_test$Df[2], "| df2 =", f_test$Res.Df[2], "\n")

## df1 = 3 | df2 = 4992

cat("p-value:", round(f_test$`Pr(>F)`[2], 4), "\n\n")

## p-value: 0.6818

if(f_test$`Pr(>F)`[2] < 0.05) {
  cat("Conclusion: Reject H₀. The age × education interaction is statistically significant overall.\n")
  cat("The effect of age on physical health differs across education levels.\n")
} else {
  cat("Conclusion: Fail to reject H₀. The age × education interaction is not statistically significant overall.\n")
}

## Conclusion: Fail to reject H₀. The age × education interaction is not statistically significant overall.

Interpretation: The model produces 3 interaction terms (education has 4 levels, so k-1 = 3), one for each education level compared to the reference category (“Less than HS”). The partial F-test evaluates whether all three interaction terms are simultaneously zero. The F-statistic is 0.501 with 3 and 4992 degrees of freedom, yielding a p-value of 0.6818. Since p > 0.05, we fail to reject the null hypothesis. This means there is no statistically significant evidence that the association between age and physical health differs across education levels. In other words, the age slopes are similar regardless of education level.

3e. (5 pts) Create a visualization using ggpredict() showing the predicted physhlth_days by age for each education level. Do the lines appear parallel?

Task 3e: Visualization of Age × Education Interaction

# Load ggeffects if not already loaded
library(ggeffects)

# Fit interaction model
mod_int_educ <- lm(physhlth_days ~ age * education, data = brfss_ci)

# Generate predicted values for ages 20-80 by education level
pred_educ_age <- ggpredict(mod_int_educ, terms = c("age [20:80 by=5]", "education"))

# Create plot
ggplot(pred_educ_age, aes(x = x, y = predicted, color = group, fill = group)) +
  geom_line(linewidth = 1.2) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.15, color = NA) +
  labs(
    title = "Predicted Physical Health Days by Age and Education",
    subtitle = "Do the lines appear parallel?",
    x = "Age (years)",
    y = "Predicted Physically Unhealthy Days",
    color = "Education Level",
    fill = "Education Level"
  ) +
  theme_minimal(base_size = 12) +
  scale_color_brewer(palette = "Dark2") +
  scale_fill_brewer(palette = "Dark2") +
  theme(legend.position = "bottom")

Interpretation: The lines appear approximately parallel. All four education groups show upward sloping lines as age increases, meaning physical health declines with age regardless of education. The “Less than HS” group has the highest predicted unhealthy days across all ages, followed by “HS graduate,” then “Some college,” with “College graduate” consistently lowest. However, the gaps between groups remain relatively constant across the age range—the lines do not fan out or converge. This visual pattern confirms the partial F-test result (p = 0.6818): there is no evidence that education modifies the age-physical health relationship. The parallel lines indicate that while education is associated with better health (lower intercepts), it does not change the rate at which health declines with age. —

Task 4: Confounding Assessment (25 points)

For this task, the exposure is exercise and the outcome is physhlth_days.

4a. (5 pts) Fit the crude model: physhlth_days ~ exercise. Report the exercise coefficient. This is the unadjusted estimate. ### Task 4a: Crude Model - Exercise → Physical Health Days

# Fit crude model: exercise → physical health days
mod_crude_exercise <- lm(physhlth_days ~ exercise, data = brfss_ci)

# Extract exercise coefficient
coef_crude <- coef(mod_crude_exercise)["exerciseYes"]

# Display results
tidy(mod_crude_exercise, conf.int = TRUE) |>
  filter(term == "exerciseYes") |>
  mutate(across(where(is.numeric), ~ round(.x, 4))) |>
  kable(
    caption = "Crude Model: Exercise → Physical Health Days",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Crude Model: Exercise → Physical Health Days
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
exerciseYes	-4.7115	0.2656	-17.7401	0	-5.2322	-4.1908

cat("\n**Interpretation:**\n")

## 
## **Interpretation:**

cat("The crude (unadjusted) estimate shows that non-exercisers report", 
    abs(round(coef_crude, 2)), "more physically unhealthy days than exercisers.\n")

## The crude (unadjusted) estimate shows that non-exercisers report 4.71 more physically unhealthy days than exercisers.

cat("95% CI:", round(confint(mod_crude_exercise)["exerciseYes", 1], 2), "to", 
    round(confint(mod_crude_exercise)["exerciseYes", 2], 2), "\n")

## 95% CI: -5.23 to -4.19

Interpretation: The crude (unadjusted) model shows that non-exercisers report 4.71 more physically unhealthy days per month compared to exercisers (95% CI: -5.23 to -4.19). This large, statistically significant difference represents the total observed association between exercise and physical health before adjusting for any potential confounders. However, this estimate may be biased by variables such as age, education, and income that are associated with both exercise habits and health outcomes. The next steps will assess whether these factors confound this association.

4b. (10 pts) Systematically assess whether each of the following is a confounder of the exercise-physical health association: age, sex, sleep_hrs, education, and income_cat. For each:

Fit the model physhlth_days ~ exercise + [covariate]
Report the adjusted exercise coefficient
Compute the percent change from the crude estimate
Apply the 10% rule to determine if the variable is a confounder

Present your results in a single summary table.

Task 4b: Systematic Confounding Assessment

# Crude coefficient from Task 4a
b_crude <- coef(mod_crude_exercise)["exerciseYes"]

# List of potential confounders
candidates <- c("age", "sex", "sleep_hrs", "education", "income_cat")

# Initialize empty dataframe
confounding_results <- data.frame()

for(cov in candidates) {
  # Build formula
  formula_str <- paste("physhlth_days ~ exercise +", cov)
  mod_adj <- lm(as.formula(formula_str), data = brfss_ci)
  
  # Extract adjusted coefficient
  b_adj <- coef(mod_adj)["exerciseYes"]
  se_adj <- summary(mod_adj)$coefficients["exerciseYes", "Std. Error"]
  ci_lower <- confint(mod_adj)["exerciseYes", 1]
  ci_upper <- confint(mod_adj)["exerciseYes", 2]
  
  # Calculate percent change
  pct_change <- abs(b_crude - b_adj) / abs(b_crude) * 100
  
  # Determine if confounder (10% rule)
  is_confounder <- ifelse(pct_change > 10, "Yes (>10%)", "No")
  
  # Add to results
  confounding_results <- rbind(confounding_results, data.frame(
    Covariate = cov,
    Crude_β = round(b_crude, 4),
    Adjusted_β = round(b_adj, 4),
    SE = round(se_adj, 4),
    CI_Lower = round(ci_lower, 2),
    CI_Upper = round(ci_upper, 2),
    Pct_Change = round(pct_change, 1),
    Confounder = is_confounder
  ))
}

# Display results table
confounding_results |>
  kable(
    caption = "Confounding Assessment: One-at-a-Time Adjustment for Exercise → Physical Health",
    col.names = c("Covariate", "Crude β", "Adjusted β", "SE", "95% CI Lower", "95% CI Upper", "% Change", "Confounder")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Confounding Assessment: One-at-a-Time Adjustment for Exercise → Physical Health
	Covariate	Crude β	Adjusted β	SE	95% CI Lower	95% CI Upper	% Change	Confounder
exerciseYes	age	-4.7115	-4.5504	0.2673	-5.07	-4.03	3.4	No
exerciseYes1	sex	-4.7115	-4.6974	0.2658	-5.22	-4.18	0.3	No
exerciseYes2	sleep_hrs	-4.7115	-4.6831	0.2650	-5.20	-4.16	0.6	No
exerciseYes3	education	-4.7115	-4.3912	0.2696	-4.92	-3.86	6.8	No
exerciseYes4	income_cat	-4.7115	-3.9406	0.2684	-4.47	-3.41	16.4	Yes (>10%)

Interpretation: Using the 10% change-in-estimate rule, only income meets the threshold for confounding. Adjusting for income changes the exercise coefficient from -4.71 to -3.94, a 16.4% change, indicating income confounds the exercise-physical health association. Age (3.4%), sex (0.3%), sleep_hrs (0.6%), and education (6.8%) all produce less than 10% change, meaning they are not substantial confounders in this relationship. The crude estimate (-4.71) overestimates the exercise effect compared to the income-adjusted estimate (-3.94), suggesting that higher-income individuals are more likely to exercise and have better health, partially explaining the crude association.

4c. (5 pts) Fit a fully adjusted model including exercise and all identified confounders. Report the exercise coefficient and compare it to the crude estimate. How much did the estimate change overall?

Task 4c: Fully Adjusted Model

# Fit fully adjusted model with identified confounders (income only, plus age and education for comparison)
mod_full <- lm(physhlth_days ~ exercise + age + education + income_cat, data = brfss_ci)

# Extract exercise coefficient
b_full <- coef(mod_full)["exerciseYes"]

# Crude coefficient from Task 4a
b_crude <- -4.7115

# Calculate overall change
overall_pct <- abs(b_crude - b_full) / abs(b_crude) * 100

# Display results
tidy(mod_full, conf.int = TRUE) |>
  filter(term == "exerciseYes") |>
  mutate(across(where(is.numeric), ~ round(.x, 4))) |>
  kable(
    caption = "Fully Adjusted Model: Exercise → Physical Health Days",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Fully Adjusted Model: Exercise → Physical Health Days
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
exerciseYes	-3.7587	0.2715	-13.8448	0	-4.2909	-3.2264

# Comparison table
comparison <- data.frame(
  Model = c("Crude", "Fully Adjusted"),
  Exercise_β = c(round(b_crude, 4), round(b_full, 4)),
  CI_Lower = c(-5.23, round(confint(mod_full)["exerciseYes", 1], 2)),
  CI_Upper = c(-4.19, round(confint(mod_full)["exerciseYes", 2], 2))
)

comparison |>
  kable(
    caption = "Comparison: Crude vs. Fully Adjusted Exercise Coefficients",
    col.names = c("Model", "Exercise β", "95% CI Lower", "95% CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Comparison: Crude vs. Fully Adjusted Exercise Coefficients
	Model	Exercise β	95% CI Lower	95% CI Upper
	Crude	-4.7115	-5.23	-4.19
exerciseYes	Fully Adjusted	-3.7587	-4.29	-3.23

cat("\n**Overall percent change from crude to fully adjusted model:**", round(overall_pct, 1), "%\n")

## 
## **Overall percent change from crude to fully adjusted model:** 20.2 %

Interpretation: The fully adjusted model, controlling for age, education, and income, shows that non-exercisers report 3.76 more physically unhealthy days than exercisers (95% CI: -4.28 to -3.24). Compared to the crude estimate (-4.71), this represents a 20.2% reduction in the coefficient magnitude. This substantial change indicates that the crude association was inflated by confounding, primarily driven by income (which alone caused a 16.4% change). After adjusting for these confounders, the exercise effect remains statistically significant and clinically meaningful, suggesting that regular physical activity is associated with approximately 3.8 fewer physically unhealthy days per month, independent of age, education, and income.

4d. (5 pts) Is gen_health a confounder or a mediator of the exercise-physical health relationship? Could it be both? Explain your reasoning with reference to the three conditions for confounding and the concept of the causal pathway.

Task 4d: Is gen_health a Confounder or Mediator?

# Check if gen_health is associated with exercise (exposure)
table_gen_health <- table(brfss_ci$gen_health, brfss_ci$exercise)
chi_test <- chisq.test(table_gen_health)

cat("**1. Association with Exposure (Exercise):**\n")

## **1. Association with Exposure (Exercise):**

cat("Chi-square test p-value:", round(chi_test$p.value, 4), "\n")

## Chi-square test p-value: 0

cat("Conclusion: General health is strongly associated with exercise status (p < 0.001).\n\n")

## Conclusion: General health is strongly associated with exercise status (p < 0.001).

# Check if gen_health is associated with physical health (outcome)
mod_outcome <- lm(physhlth_days ~ gen_health, data = brfss_ci)
anova_outcome <- anova(mod_outcome)

cat("**2. Association with Outcome (Physical Health):**\n")

## **2. Association with Outcome (Physical Health):**

cat("F-test p-value:", round(anova_outcome$`Pr(>F)`[1], 4), "\n")

## F-test p-value: 0

cat("Conclusion: General health is strongly associated with physical health days (p < 0.001).\n\n")

## Conclusion: General health is strongly associated with physical health days (p < 0.001).

# Check how adjusting for gen_health changes exercise coefficient
mod_crude <- lm(physhlth_days ~ exercise, data = brfss_ci)
mod_adj_gen <- lm(physhlth_days ~ exercise + gen_health, data = brfss_ci)

b_crude <- coef(mod_crude)["exerciseYes"]
b_adj <- coef(mod_adj_gen)["exerciseYes"]
pct_change <- abs(b_crude - b_adj) / abs(b_crude) * 100

cat("**3. Effect of Adjusting for gen_health:**\n")

## **3. Effect of Adjusting for gen_health:**

cat("Crude exercise coefficient:", round(b_crude, 4), "\n")

## Crude exercise coefficient: -4.7115

cat("Adjusted (gen_health) coefficient:", round(b_adj, 4), "\n")

## Adjusted (gen_health) coefficient: -1.6596

cat("Percent change:", round(pct_change, 1), "%\n\n")

## Percent change: 64.8 %

cat("**4. Is gen_health a Confounder or Mediator?**\n")

## **4. Is gen_health a Confounder or Mediator?**

cat("gen_health meets the first two conditions for confounding:\n")

## gen_health meets the first two conditions for confounding:

cat("  - Associated with exercise (exposure) ✓\n")

##   - Associated with exercise (exposure) ✓

cat("  - Associated with physical health (outcome) ✓\n")

##   - Associated with physical health (outcome) ✓

cat("However, the third condition requires that gen_health is NOT on the causal pathway.\n\n")

## However, the third condition requires that gen_health is NOT on the causal pathway.

cat("Two possible scenarios:\n")

## Two possible scenarios:

cat("  - **As a Confounder:** Poor general health causes both reduced exercise AND worse physical health.\n")

##   - **As a Confounder:** Poor general health causes both reduced exercise AND worse physical health.

cat("    In this case, adjusting is appropriate.\n")

##     In this case, adjusting is appropriate.

cat("  - **As a Mediator:** Exercise improves general health, which then improves physical health.\n")

##   - **As a Mediator:** Exercise improves general health, which then improves physical health.

cat("    In this case, adjusting would remove an indirect pathway and underestimate exercise's total effect.\n\n")

##     In this case, adjusting would remove an indirect pathway and underestimate exercise's total effect.

cat("**Conclusion:** With cross-sectional data, we cannot distinguish between these scenarios.\n")

## **Conclusion:** With cross-sectional data, we cannot distinguish between these scenarios.

cat("The decision depends on the research question. For estimating the total effect of exercise,\n")

## The decision depends on the research question. For estimating the total effect of exercise,

cat("gen_health should NOT be adjusted. For estimating the direct effect (through pathways other\n")

## gen_health should NOT be adjusted. For estimating the direct effect (through pathways other

cat("than general health), adjusting may be appropriate.\n")

## than general health), adjusting may be appropriate.

Interpretation: General health (gen_health) is strongly associated with both exercise (p < 0.001) and physical health days (p < 0.001), meeting the first two conditions for confounding. However, adjusting for gen_health changes the exercise coefficient from -4.71 to -1.66—a 64.8% reduction. This large change raises the question: is gen_health a confounder or a mediator?

If gen_health is a confounder: Poor health causes both reduced exercise and worse physical health. Adjusting would remove bias and give the correct estimate.
If gen_health is a mediator: Exercise improves general health, which then improves physical health. Adjusting would remove this indirect pathway, underestimating the total benefit of exercise.

With cross-sectional data, we cannot determine causality. For estimating the total effect of exercise on physical health, gen_health should NOT be adjusted. The decision depends on the research question—if interested in direct effects only, adjusting may be appropriate.

Task 5: Public Health Interpretation (20 points)

5a. (10 pts) Based on your analyses, write a 4–5 sentence paragraph for a public health audience summarizing:

Whether the association between exercise and physical health differs by any subgroup (interaction assessment)
Which variables confounded the exercise-physical health association
The direction and approximate magnitude of the adjusted exercise effect
Appropriate caveats about cross-sectional data and potential unmeasured confounding

Do not use statistical jargon.

Task 5a: Public Health Interpretation

Public Health Summary:

After adjusting for age, education, and income, adults who do not exercise report approximately 3.8 more physically unhealthy days per month compared to regular exercisers (95% CI: -4.4667141 to -3.414515). This protective effect of exercise was consistent across all education levels (p for interaction = 0.68), meaning everyone benefits similarly regardless of educational background. Additionally, exercise significantly modified the relationship between age and physical health (p for interaction = 0.0007)—non-exercisers experienced much steeper age-related health decline than exercisers.

Limitations: This cross-sectional study cannot prove causation. Healthier individuals may be more likely to exercise, and unmeasured factors like diet and healthcare access may influence results. Despite these limitations, the findings reinforce public health recommendations that regular physical activity is associated with better physical health across all population subgroups.

5b. (10 pts) A colleague suggests including gen_health as a covariate in the final model because it changes the exercise coefficient by more than 10%. You disagree. Write a 3–4 sentence argument explaining why adjusting for general health may not be appropriate if the goal is to estimate the total effect of exercise on physical health days. Use the concept of mediation in your argument.

Adjusting for general health would likely underestimate the true benefit of exercise because gen_health is better understood as a mediator rather than a confounder. Exercise improves general health (better fitness, weight control, cardiovascular function), and improved general health leads to fewer physically unhealthy days. By including gen_health in the model, we would block this indirect pathway and only capture the direct effects of exercise that operate through other mechanisms.

The 64.8% reduction in the exercise coefficient when adding gen_health reflects this mediation, not confounding bias. Since our goal is to estimate the total effect of exercise on physical health (the full public health benefit), adjusting for a mediator would give a misleading estimate. Unless the research question specifically asks for the direct effect of exercise independent of general health, gen_health should be excluded from the final model.

End of Lab Activity

Confounding and Interactions in Regression

EPI 553 — Principles of Statistical Inference II (Spring 2026)

Fizza Zaheer

March 28, 2026