Part 2: In-Class Lab Activity

EPI 553 — Confounding and Interactions Lab Due: End of class, March 24, 2026

Instructions

In this lab, you will assess interaction and confounding in the BRFSS 2020 dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.

Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.

Data for the Lab

Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents.

Variable	Description	Type
`physhlth_days`	Physically unhealthy days in past 30	Continuous (0–30)
`sleep_hrs`	Sleep hours per night	Continuous (1–14)
`menthlth_days`	Mentally unhealthy days in past 30	Continuous (0–30)
`age`	Age in years (capped at 80)	Continuous
`sex`	Sex (Male/Female)	Factor
`education`	Education level (4 categories)	Factor
`exercise`	Any physical activity (Yes/No)	Factor
`gen_health`	General health status (5 categories)	Factor
`income_cat`	Household income (1–8 ordinal)	Numeric

# Load the dataset
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(car)
library(ggeffects)
library (ggplot2)

brfss_ci <- readRDS(
  "C:/Users/suruc/OneDrive/Desktop/R/EPI553_Rclass/brfss_ci_2020.rds"
)

# Quick data check
dim(brfss_ci)

## [1] 5000    9

head(brfss_ci)

## # A tibble: 6 × 9
##   physhlth_days sleep_hrs menthlth_days   age sex    education        exercise
##           <dbl>     <dbl>         <dbl> <dbl> <fct>  <fct>            <fct>   
## 1             0         8            30    77 Female HS graduate      No      
## 2             0         8             0    65 Male   College graduate Yes     
## 3             0         8             4    41 Female College graduate Yes     
## 4             0         8             0    71 Male   Less than HS     Yes     
## 5            30         5             0    56 Male   Some college     No      
## 6             0         6             4    58 Female HS graduate      Yes     
## # ℹ 2 more variables: gen_health <fct>, income_cat <dbl>

Task 1: Exploratory Data Analysis (15 points)

1a. (5 pts) Create a scatterplot of physhlth_days (y-axis) vs. age (x-axis), colored by exercise status. Add separate regression lines for each group. Describe the pattern you observe.

TASK 1a: Age vs. Physical Health, by Exercise

# Create scatterplot
p1a <- ggplot(brfss_ci, aes(x = age, y = physhlth_days, color = exercise)) +
  # Add jitter to reduce overplotting (data are discrete)
  geom_jitter(alpha = 0.15, width = 0.5, height = 0.5, size = 1) +
  
  # Fit separate regression lines for each exercise group
  geom_smooth(method = "lm", se = TRUE, linewidth = 1.1) +
  
  labs(
    title = "Physical Health Days by Age, Stratified by Exercise Status",
    subtitle = "Do the slopes differ between exercisers and non-exercisers?",
    x = "Age (years)",
    y = "Physically Unhealthy Days (past 30)",
    color = "Physical Activity"
  ) +
  theme_minimal(base_size = 13) +
  scale_color_brewer(palette = "Set1") +
  theme(legend.position = "bottom")

print(p1a)

INTERPRETATION:

Non‑exercisers (red line) consistently report more physically unhealthy days at nearly every age.Exercisers (blue line) have fewer unhealthy days overall, and their slope is flatter.

1b. (5 pts) Compute the mean physhlth_days for each combination of sex and exercise. Present the results in a table. Does it appear that the association between exercise and physical health days might differ by sex?

# Calculate means and counts
mean_sex_exercise <- brfss_ci %>%
  group_by(sex, exercise) %>%
  summarize(
    n = n(),
    Mean_Physhlth = round(mean(physhlth_days), 2),
    SD_Physhlth = round(sd(physhlth_days), 2)
  )

# Display in table format
mean_sex_exercise %>%
  kable(
    caption = "Mean Physical Health Days by Sex and Exercise Status",
    col.names = c("Sex", "Exercise", "N", "Mean Unhealthy Days", "SD")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Mean Physical Health Days by Sex and Exercise Status
Sex	Exercise	N	Mean Unhealthy Days	SD
Male	No	446	6.64	10.98
Male	Yes	1845	2.32	6.49
Female	No	625	7.43	11.46
Female	Yes	2084	2.45	6.32

Interpretation Based on the table, we can see that exercisers of both sexes report far fewer unhealthy days than non-exercisers and the gap is slightly larger for men (difference of about 4 days) than for women (difference of about 5 days). The association might differ by sex.

1c. (5 pts) Create a scatterplot of physhlth_days vs. sleep_hrs, faceted by education level. Comment on whether the slopes appear similar or different across education groups.

p1b <- ggplot(brfss_ci, aes(x = sleep_hrs, y = physhlth_days, color = education)) +
  geom_jitter(alpha = 0.15, width = 0.2, height = 0.3, size = 0.9) +
  geom_smooth(method = "lm", se = TRUE, linewidth = 1) +
  
  # FACETING: separate panels for each education level
  facet_wrap(~education, nrow = 2) +
  labs(
    title = "Sleep vs. Physical Health Days, by Education Level",
    subtitle = "Are the regression slopes similar or different across education groups?",
    x = "Sleep Hours per Night",
    y = "Physically Unhealthy Days (past 30)",
    color = "Education"
  ) +
  theme_minimal(base_size = 12) +
  scale_color_brewer(palette = "Set2") +
  theme(legend.position = "none")

print(p1b)

Interpretation The relationship between sleep and physical health is consistently negative across all education levels, but the strength of the association differs slightly. People with lower education levels show a stronger drop in unhealthy days as sleep increases, while those with higher education levels show a modest decline.Highest education group, college graduate has the weaker association of sleep with physically unhealthy days.

Task 2: Stratified Analysis for Interaction (15 points)

2a. (5 pts) Fit separate simple linear regression models of physhlth_days ~ age for exercisers and non-exercisers. Report the slope, SE, and 95% CI for age in each stratum.

Research question: Is the age-physical health relationship different for people who exercise vs. don’t?

# Model for EXERCISERS
mod_exerciser <- lm(
  physhlth_days ~ age,
  data = brfss_ci %>% filter(exercise == "Yes")
)

# Model for NON-EXERCISERS
mod_nonexerciser <- lm(
  physhlth_days ~ age,
  data = brfss_ci %>% filter(exercise == "No")
)

# Extract and compare results
stratified_model <- bind_rows(
  tidy(mod_exerciser, conf.int = TRUE) %>%
    filter(term == "age") %>%
    mutate(Stratum = "Exercisers"),
  tidy(mod_nonexerciser, conf.int = TRUE) %>%
    filter(term == "age") %>%
    mutate(Stratum = "Non-exercisers")
) %>%
  select(Stratum, estimate, std.error, conf.low, conf.high, p.value) %>%
  mutate(
    across(where(is.numeric), ~round(., 4))
  )

# Display results
stratified_model %>%
  kable(
    caption = "Stratified Analysis: Age Effect on Physical Health Days, by Exercise Status",
    col.names = c("Stratum", "Estimate (β)", "SE", "CI Lower", "CI Upper", "p-value")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Stratified Analysis: Age Effect on Physical Health Days, by Exercise Status
Stratum	Estimate (β)	SE	CI Lower	CI Upper	p-value
Exercisers	0.0192	0.0060	0.0075	0.0309	0.0013
Non-exercisers	0.0745	0.0212	0.0328	0.1161	0.0005

Interpretation:

Exercisers - Slope (β): 0.0192 - Standard Error: 0.0060 - 95% CI: 0.0075 to 0.0309

Non‑Exercisers - Slope (β): 0.0745 - Standard Error: 0.0212 - 95% CI: 0.0328 to 0.1161

Age is associated with more physically unhealthy days in both groups, but the increase is much more among non‑exercisers. The confidence intervals do not overlap much, which could be meaningful difference in how age relates to physical health depending on exercise status.

2b. (5 pts) Create a single plot showing the two fitted regression lines (one per exercise group) overlaid on the data. Are the lines approximately parallel?

p2b <- ggplot(brfss_ci, aes(x = age, y = physhlth_days, color = exercise)) +
  geom_jitter(alpha = 0.1, width = 0.5, height = 0.3, size = 0.8) +
  
  # Fit regression lines using the stratified models
  geom_smooth(method = "lm", se = TRUE, linewidth = 1.2, alpha = 0.2) +
  labs(
    title = "Stratified Regression Lines: Age vs. Physical Health Days",
    x = "Age (years)",
    y = "Physically Unhealthy Days (past 30)",
    color = "Exercise Status"
  ) +
  theme_minimal(base_size = 13) +
  scale_color_manual(values = c("No" = "#e41a1c", "Yes" = "#377eb8")) +
  theme(legend.position = "bottom")

print(p2b)

Based on this,

Exercisers - Slope (β): 0.0192 - Standard Error: 0.0060 - 95% CI: 0.0075 to 0.0309

Non‑Exercisers - Slope (β): 0.0745 - Standard Error: 0.0212 - 95% CI: 0.0328 to 0.1161

The slope isn’t exactly parallel.

2c. (5 pts) Can you formally test whether the two slopes are different using only the stratified results? Explain why or why not.

No, even though the stratified approach is intuitive and visually compelling, but it has important limitations:

No formal test: We can see the slopes are different, but we cannot compute a p-value for the difference without additional machinery
Reduced power: Each stratum has fewer observations than the full dataset
Multiple comparisons: With \(k\) strata, we would need \(k\) separate models
Cannot adjust for covariates easily while testing the interaction

Hence, the solution is to use a single regression model with an interaction term.

Task 3: Interaction via Regression (25 points)

3a. (5 pts) Fit the interaction model: physhlth_days ~ age * exercise. Write out the fitted equation.

**Model specification: physhlth_days ~ age * exercise This expands to: physhlth_days ~ age + exercise + age:exercise

** WHY age * exercise (not just age:exercise)? The interaction term (age:exercise) only tests if slopes differ We also need the main effects to estimate intercepts.***

# Model without interaction (additive model)
mod_add <- lm(physhlth_days ~ age + exercise, data = brfss_ci)

# Model with interaction
mod_int <- lm(physhlth_days ~ age * exercise, data = brfss_ci)
# Equivalent to: lm(physhlth_days ~ age + exercise + age:exercise, data = brfss_ci)

tidy(mod_int, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Interaction Model: Age * Exercise → Physical Health Days",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Interaction Model: Age * Exercise → Physical Health Days
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	2.7610	0.8798	3.1381	0.0017	1.0361	4.4858
age	0.0745	0.0145	5.1203	0.0000	0.0460	0.1030
exerciseYes	-1.3858	0.9664	-1.4340	0.1516	-3.2804	0.5087
age:exerciseYes	-0.0553	0.0162	-3.4072	0.0007	-0.0871	-0.0235

b_int <- round(coef(mod_int), 3)

The Fitted Equation is:

\[\widehat{\text{Phys Days}} = 2.761 + 0.0745 (\text{age}) - 1.386(\text{Exercise}) - 0.055 (\text{Age} \times \text{Exercise})\]

3b. (5 pts) Using the fitted equation, derive the stratum-specific equations for exercisers and non-exercisers. Verify that these match the stratified analysis from Task 2.

For exercisers (Nonexercisers = 0): \[\widehat{\text{Phys Days}} = 2.761 + 0.0745 (\text{age}) - 1.386(\text{Exercise}) - 0.055 (\text{Age} \times \text{1})\]

For nonexercisers (exercisers = 1): \[\widehat{\text{Phys Days}} = 2.761 + 0.0745 (\text{age}) - 1.386(\text{0}) - 0.055 (\text{Age} \times \text{0})\]

Stratum-specific equations

#Exercisers

\[\widehat{\text{Phys Days}} = 1.375 + 0.0192 (\text{age})\]

Hence, the Slope for age: 0.0192

#Nonexercisers

\[\widehat{\text{Phys Days}} = 2.761 + 0.0745 (\text{age})\]

Hence, the Slope for age: 0.0745

These slopes—0.0745 for non-exercisers and 0.0192 for exercisers—match the stratified analysis from Task 2, confirming that the interaction model and the stratified models are consistent.

3c. (5 pts) Conduct the t-test for the interaction term (age:exerciseYes). State the null and alternative hypotheses, report the test statistic and p-value, and state your conclusion about whether interaction is present.

The formal test for interaction is simply the t-test for the interaction coefficient:

\[H_0: \beta_3 = 0 \quad \text{(slopes are equal, lines are parallel, no interaction)}\] \[H_A: \beta_3 \neq 0 \quad \text{(slopes differ, interaction is present)}\]

int_term <- tidy(mod_int) |> filter(term == "age:exerciseYes")
cat("Interaction term (age:exerciseYes):\n")

## Interaction term (age:exerciseYes):

cat("  Estimate:", round(int_term$estimate, 4), "\n")

##   Estimate: -0.0553

cat("  t-statistic:", round(int_term$statistic, 3), "\n")

##   t-statistic: -3.407

cat("  p-value:", round(int_term$p.value, 4), "\n")

##   p-value: 7e-04

Interpretation: The interaction term (age:exerciseYes) has an estimated coefficient of –0.0553, with a t‑statistic of –3.407 and a p‑value of 0.0007. Because the p‑value is below the conventional α = 0.05 threshold, there is strong evidence that the age slopes differ for exercisers and non‑exercisers. The negative sign indicates that the increase in physically unhealthy days with age is substantially smaller for exercisers than for non‑exercisers. In other words, the age‑related rise in poor physical health is much steeper among people who do not exercise.

3d. (5 pts) Now fit a model with an interaction between age and education: physhlth_days ~ age * education. How many interaction terms are produced? Use a partial F-test to test whether the age \(\times\) education interaction as a whole is significant.

Model specification:

physhlth_days ~ age * education

This expands to: physhlth_days ~ age + education + age:education

# Model without interaction (additive model)
mod_add_edu <- lm(physhlth_days ~ age + education, data = brfss_ci)

# Model with interaction
mod_int_edu <- lm(physhlth_days ~ age * education, data = brfss_ci)
# Equivalent to: lm(physhlth_days ~ age + education + age:education, data = brfss_ci)

tidy(mod_int_edu, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Interaction Model: Age * Education → Physical Health Days",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Interaction Model: Age * Education → Physical Health Days
Term	Estimate	SE	t	p-value	CI Lower	CI Upper
(Intercept)	2.7542	1.5422	1.7859	0.0742	-0.2692	5.7775
age	0.0706	0.0269	2.6277	0.0086	0.0179	0.1233
educationHS graduate	-1.4016	1.6900	-0.8293	0.4069	-4.7146	1.9115
educationSome college	-1.1261	1.6893	-0.6666	0.5051	-4.4380	2.1857
educationCollege graduate	-2.7102	1.6580	-1.6346	0.1022	-5.9606	0.5402
age:educationHS graduate	-0.0197	0.0296	-0.6661	0.5054	-0.0776	0.0382
age:educationSome college	-0.0326	0.0295	-1.1039	0.2697	-0.0904	0.0253
age:educationCollege graduate	-0.0277	0.0289	-0.9583	0.3380	-0.0844	0.0290

b_int <- round(coef(mod_int_edu), 3)

The Fitted Equation is:

\[\widehat{\text{Phys Days}} = 2.754+ 0.0706 (\text{age}) - 1.40(\text{educationHS graduate}) - 1.13 (\text{educationSome college} - 2.71 (\text{educationcollege graduate}) - 0.02 (\text{age} \times \text{educationHS graduate}) - 0.033 (\text{age} \times \text{educationSome college}) - 0.028(\text{age} \times \text{educationCollege graduate}) + \varepsilon\] #Interpretation:

This model includes three interaction terms, each comparing an education group to the reference category (“Less than HS”). The reference-group slope for age is 0.0706, meaning that among adults with less than a high school education, each additional year of age is associated with about 0.07 more physically unhealthy days per month. The interaction terms for “HS graduate” (–0.0197), “Some college” (–0.0326), and “College graduate” (–0.0277) are all negative and have p‑values greater than 0.05, indicating that none of these slopes differ meaningfully from the reference group. Because none of the interaction terms are statistically distinguishable from zero, the age–health relationship appears similar across all education levels, with no evidence that aging affects physical health differently for more educated versus less educated adults. This pattern is consistent with the idea that age-related increases in physically unhealthy days occur broadly across the population, regardless of educational attainment.

Testing the Overall Interaction with a Partial F-Test

Individual t-tests for each interaction term may be non-significant, yet the interaction as a whole could be meaningful. To test whether age \(\times\) education is significant overall, we use a partial F-test comparing the model with and without the interaction terms:

\[H_0: \beta_5 = \beta_6 = \beta_7 = 0 \quad \text{(the age slope is the same across all education levels)}\]

mod_no_int_edu <- lm(physhlth_days ~ age + education, data = brfss_ci)

anova(mod_no_int_edu, mod_int_edu) |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Partial F-test: Is the age x Education Interaction Significant?") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Partial F-test: Is the age x Education Interaction Significant?
term	df.residual	rss	df	sumsq	statistic	p.value
physhlth_days ~ age + education	4995	306764.2	NA	NA	NA	NA
physhlth_days ~ age * education	4992	306671.9	3	92.2713	0.5007	0.6818

Interpretation: The partial F‑test evaluates all three age × education interaction terms simultaneously. The F‑statistic is 0.50 with a p‑value of 0.6818, indicating that the overall interaction does not improve model fit at the conventional p =0.05 threshold. In practical terms, this means there is no evidence that the age slope differs across education levels as a whole. Although the individual interaction coefficients were all negative, none were statistically distinguishable from zero, and the omnibus test confirms that these small differences do not collectively matter. This pattern suggests that the relationship between age and physically unhealthy days is fairly consistent across education groups, with no meaningful effect modification detected in this sample.

3e. (5 pts) Create a visualization using ggpredict() showing the predicted physhlth_days by age for each education level. Do the lines appear parallel?

pred_int_edu <- ggpredict(mod_int_edu, terms = c("age [3:12]", "education"))

ggplot(pred_int_edu, aes(x = x, y = predicted, color = group, fill = group)) +
  geom_line(linewidth = 1.1) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.1, color = NA) +
  labs(
    title    = "Predicted Physical Health Days by Age and Education",
    subtitle = "Are the lines parallel? Non-parallel lines indicate interaction.",
    x        = "Sleep Hours per Night",
    y        = "Predicted Physically Unhealthy Days",
    color    = "Education",
    fill     = "Education"
  ) +
  theme_minimal(base_size = 13) +
  scale_color_brewer(palette = "Set2")

Interaction: Age Effect by Education Level

#Interpretation The predicted lines plot shows four predicted regression lines, one for each education group, showing how physically unhealthy days change with age. All lines slope upward, indicating that older adults tend to report more physically unhealthy days, but the slopes are nearly identical across education levels. The “Less than HS” and “HS graduate” groups rise at almost the same rate, and the “Some college” and “College graduate” lines follow a very similar line as well.

Task 4: Confounding Assessment (25 points)

For this task, the exposure is exercise and the outcome is physhlth_days.

4a. (5 pts) Fit the crude model: physhlth_days ~ exercise. Report the exercise coefficient. This is the unadjusted estimate.

# Crude model: Exercise → physical health days
mod_crude <- lm(physhlth_days ~ exercise, data = brfss_ci)

# Display results
tidy(mod_crude, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~round(., 4))) %>%
  kable(
    caption = "Crude Model: physhlth_days ~ exercise",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  print()

## <table class="table table-striped table-hover" style="width: auto !important; margin-left: auto; margin-right: auto;">
## <caption>Crude Model: physhlth_days ~ exercise</caption>
##  <thead>
##   <tr>
##    <th style="text-align:left;"> Term </th>
##    <th style="text-align:right;"> Estimate </th>
##    <th style="text-align:right;"> SE </th>
##    <th style="text-align:right;"> t </th>
##    <th style="text-align:right;"> p-value </th>
##    <th style="text-align:right;"> CI Lower </th>
##    <th style="text-align:right;"> CI Upper </th>
##   </tr>
##  </thead>
## <tbody>
##   <tr>
##    <td style="text-align:left;"> (Intercept) </td>
##    <td style="text-align:right;"> 7.1027 </td>
##    <td style="text-align:right;"> 0.2354 </td>
##    <td style="text-align:right;"> 30.1692 </td>
##    <td style="text-align:right;"> 0 </td>
##    <td style="text-align:right;"> 6.6412 </td>
##    <td style="text-align:right;"> 7.5643 </td>
##   </tr>
##   <tr>
##    <td style="text-align:left;"> exerciseYes </td>
##    <td style="text-align:right;"> -4.7115 </td>
##    <td style="text-align:right;"> 0.2656 </td>
##    <td style="text-align:right;"> -17.7401 </td>
##    <td style="text-align:right;"> 0 </td>
##    <td style="text-align:right;"> -5.2322 </td>
##    <td style="text-align:right;"> -4.1908 </td>
##   </tr>
## </tbody>
## </table>

# Extract the exercise coefficient (this is our CRUDE unadjusted estimate)

b_crude <- coef(mod_crude)["exerciseYes"]

Interpretation - Non‑exercisers have an average of 7.10 physically unhealthy days per month. - Exercisers have 4.71 fewer unhealthy days than non‑exercisers (β = –4.7115, p < 0.001). - The confidence interval (–5.23 to –4.19) shows this is a large, precise, and highly statistically significant difference.

People who report exercising have substantially fewer physically unhealthy days than those who do not. This is a crude association, suggesting that exercise is linked to better physical health before considering any other factors.

4b. (10 pts) Systematically assess whether each of the following is a confounder of the exercise-physical health association: age, sex, sleep_hrs, education, and income_cat. For each:

Fit the model physhlth_days ~ exercise + [covariate]
Report the adjusted exercise coefficient
Compute the percent change from the crude estimate
Apply the 10% rule to determine if the variable is a confounder

Present your results in a single summary table.

Age, sex, sleep_hrs, education and income_cat

b_crude <- coef(mod_crude)["exerciseYes"]

confounders_ex <- list(
  "Age"        = lm(physhlth_days ~ exercise + age,        data = brfss_ci),
  "Sex"        = lm(physhlth_days ~ exercise + sex,        data = brfss_ci),
  "Sleep hrs"  = lm(physhlth_days ~ exercise + sleep_hrs,  data = brfss_ci),
  "Education"  = lm(physhlth_days ~ exercise + education,  data = brfss_ci),
  "Income cat" = lm(physhlth_days ~ exercise + income_cat, data = brfss_ci)
)

conf_table_ex <- map_dfr(names(confounders_ex), function(name) {
  mod       <- confounders_ex[[name]]
  b_adj_val <- coef(mod)["exerciseYes"]
  pct_chg   <- abs(b_crude - b_adj_val) / abs(b_crude) * 100
  tibble(
    Covariate              = name,
    `Crude β (exercise)`   = round(b_crude, 4),
    `Adjusted β`           = round(b_adj_val,       4),
    `% Change`             = round(pct_chg,          1),
    `Confounder (>10%)?`   = ifelse(pct_chg > 10, "Yes", "No")
  )
})

conf_table_ex %>%
  kable(caption = "Table 4b. Systematic Confounding Assessment: Exercise → Physical Health Days") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE) %>%
  column_spec(5, bold = TRUE,
              color      = ifelse(conf_table_ex$`Confounder (>10%)?` == "Yes", "white", "black"),
              background = ifelse(conf_table_ex$`Confounder (>10%)?` == "Yes", "#C0392B", "#27AE60"))

Table 4b. Systematic Confounding Assessment: Exercise → Physical Health Days
Covariate	Crude β (exercise)	Adjusted β	% Change	Confounder (>10%)?
Age	-4.7115	-4.5504	3.4	No
Sex	-4.7115	-4.6974	0.3	No
Sleep hrs	-4.7115	-4.6831	0.6	No
Education	-4.7115	-4.3912	6.8	No
Income cat	-4.7115	-3.9406	16.4	Yes

# Step 1: Crude model (already done in 4a)

# Step 2: Fit models with each potential confounder
mod_adj_age <- lm(physhlth_days ~ exercise + age, data = brfss_ci)
mod_adj_sex <- lm(physhlth_days ~ exercise + sex, data = brfss_ci)
mod_adj_sleep <- lm(physhlth_days ~ exercise + sleep_hrs, data = brfss_ci)
mod_adj_edu <- lm(physhlth_days ~ exercise + education, data = brfss_ci)
mod_adj_income <- lm(physhlth_days ~ exercise + income_cat, data = brfss_ci)

# Step 3: Extract exercise coefficient from EACH model
# (NOT the covariate coefficient - we want the exerciseYes coefficient)

b_adj_age <- coef(mod_adj_age)["exerciseYes"]
b_adj_sex <- coef(mod_adj_sex)["exerciseYes"]
b_adj_sleep <- coef(mod_adj_sleep)["exerciseYes"]
b_adj_edu <- coef(mod_adj_edu)["exerciseYes"]
b_adj_income <- coef(mod_adj_income)["exerciseYes"]

# Step 4: Calculate percent change for EACH covariate
pct_change_age <- abs(b_crude - b_adj_age) / abs(b_crude) * 100
pct_change_sex <- abs(b_crude - b_adj_sex) / abs(b_crude) * 100
pct_change_sleep <- abs(b_crude - b_adj_sleep) / abs(b_crude) * 100
pct_change_edu <- abs(b_crude - b_adj_edu) / abs(b_crude) * 100
pct_change_income <- abs(b_crude - b_adj_income) / abs(b_crude) * 100

# Step 5: Create confounding assessment table

confounding_table <- tribble(
  ~Covariate, ~"Crude β", ~"Adjusted β", ~"% Change", ~"Confounder (>10%)?",
  
  # Age
  "Age",
  round(b_crude, 4),
  round(b_adj_age, 4),
  round(pct_change_age, 1),
  ifelse(pct_change_age > 10, "YES", "NO"),
  
  # Sex
  "Sex",
  round(b_crude, 4),
  round(b_adj_sex, 4),
  round(pct_change_sex, 1),
  ifelse(pct_change_sex > 10, "YES", "NO"),
  
  # Sleep
  "Sleep Hours",
  round(b_crude, 4),
  round(b_adj_sleep, 4),
  round(pct_change_sleep, 1),
  ifelse(pct_change_sleep > 10, "YES", "NO"),
  
  # Education
  "Education",
  round(b_crude, 4),
  round(b_adj_edu, 4),
  round(pct_change_edu, 1),
  ifelse(pct_change_edu > 10, "YES", "NO"),
  
  # Income
  "Income",
  round(b_crude, 4),
  round(b_adj_income, 4),
  round(pct_change_income, 1),
  ifelse(pct_change_income > 10, "YES", "NO")
)

# Step 6: Display the table
confounding_table %>%
  kable(
    caption = "Confounding Assessment: Do Covariates confound the exercise- physical health relationship?",
    col.names = c("Covariate", "Crude β (Exercise)", "Adjusted β (Exercise)", 
                  "% Change from Crude", "Meets 10% Threshold?")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Confounding Assessment: Do Covariates confound the exercise- physical health relationship?
Covariate	Crude β (Exercise)	Adjusted β (Exercise)	% Change from Crude	Meets 10% Threshold?
Age	-4.7115	-4.5504	3.4	NO
Sex	-4.7115	-4.6974	0.3	NO
Sleep Hours	-4.7115	-4.6831	0.6	NO
Education	-4.7115	-4.3912	6.8	NO
Income	-4.7115	-3.9406	16.4	YES

Most covariates (age, sex, sleep, education) change the exercise coefficient by less than 10%, meaning they do not confound the exercise–health relationship, except Income. Adjusting for income reduces the exercise coefficient from –4.71 to –3.94, a 16.4% shift, which exceeds the 10% threshold.

Income appears to explain part of the crude association between exercise and physical health. - People with higher incomes may be more likely to exercise, and - They may also have better baseline physical health, independent of exercise. Because income is related to both the exposure (exercise) and the outcome (physical health days), it acts as a confounder.

#Interpretation

Exercise is strongly associated with fewer physically unhealthy days in the crude model. After evaluating potential confounders, income emerges as the only variable that meaningfully alters this association.

4c. (5 pts) Fit a fully adjusted model including exercise and all identified confounders. Report the exercise coefficient and compare it to the crude estimate. How much did the estimate change overall?

# Identify confounders from the table
var_map <- c("Age"="age","Sex"="sex","Sleep hrs"="sleep_hrs",
             "Education"="education","Income cat"="income_cat")

identified <- conf_table_ex %>%
  filter(`Confounder (>10%)?` == "Yes") %>%
  pull(Covariate)

conf_vars_final <- var_map[identified]
cat("Confounders identified (>10%):", paste(identified, collapse=", "), "\n")

## Confounders identified (>10%): Income cat

final_formula <- as.formula(
  paste("physhlth_days ~ exercise +", paste(conf_vars_final, collapse=" + "))
)
cat("Final model formula:", deparse(final_formula), "\n\n")

## Final model formula: physhlth_days ~ exercise + income_cat

mod_final_ex <- lm(final_formula, data = brfss_ci)

tidy(mod_final_ex, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(.x, 4))) %>%
  kable(
    caption   = "Table 4c. Fully Adjusted Model: exercise → physhlth_days",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE)

Table 4c. Fully Adjusted Model: exercise → physhlth_days
Term	Estimate	SE	t	CI Lower	CI Upper
(Intercept)	10.6363	0.3619	29.3924	9.9269	11.3457
exerciseYes	-3.9406	0.2684	-14.6842	-4.4667	-3.4145
income_cat	-0.6750	0.0531	-12.7135	-0.7790	-0.5709

b_final_ex  <- coef(mod_final_ex)["exerciseYes"]
pct_overall <- abs(b_crude - b_final_ex) / abs(b_crude) * 100

tribble(
  ~Model,            ~`Exercise β`,             ~`% Change from Crude`,
  "Crude",            round(b_crude, 4),  "---",
  "Fully adjusted",   round(b_final_ex,     4),  paste0(round(pct_overall, 1), "%")
) %>%
  kable(caption = "Table 4c. Crude vs. Adjusted Exercise Coefficient") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
  row_spec(0, bold = TRUE)

Table 4c. Crude vs. Adjusted Exercise Coefficient
Model	Exercise β	% Change from Crude
Crude	-4.7115	—
Fully adjusted	-3.9406	16.4%

Interpretation: After full adjustment, exercisers reported 3.94 fewer physically unhealthy days per month than non‑exercisers, reflecting a 16.4% change from the crude estimate of –4.71 days. This decrease indicates that part of the crude association was inflated by confounding, primarily income, yet the adjusted coefficient remains large, negative, and highly statistically significant.

4d. (5 pts) Is gen_health a confounder or a mediator of the exercise-physical health relationship? Could it be both? Explain your reasoning with reference to the three conditions for confounding and the concept of the causal pathway.

mod_genhlth <- lm(physhlth_days ~ exercise + gen_health, data = brfss_ci)
b_genhlth   <- coef(mod_genhlth)["exerciseYes"]
pct_genhlth <- abs(b_crude - b_genhlth) / abs(b_crude) * 100

cat("Exercise coefficient adjusted for gen_health:", round(b_genhlth, 4), "\n")

## Exercise coefficient adjusted for gen_health: -1.6596

cat("% change from crude:", round(pct_genhlth, 1), "%\n")

## % change from crude: 64.8 %

Adjusting for general health changes the exercise coefficient from –4.71 to –1.66, a 64.8% change. That’s a large change, so statistically, gen_health behaves like a strong confounder or mediator (or both) of the exercise–physical health association.

However, a variable is a confounder if it: 1. Is associated with the exposure (exercise) People in better general health are more likely to exercise regularly. 2. Is associated with the outcome (physical unhealthy days), independent of the exposure General health is strongly related to physically unhealthy days even after accounting for exercise. 3. Is not on the causal pathway between exposure and outcome It must not be a consequence of exercise. If we think of gen_health as a baseline health status measured before or independent of exercise, then it satisfies all three and acts as a confounder, meaning healthier people both exercise more and have fewer unhealthy days.

A variable is a mediator if exercise improves general health, which in turn reduces unhealthy days. In that causal story, gen_health lies on the pathway from exercise to physical health.So it can be part both. Because the data is cross‑sectional, we cannot definitively distinguish confounding from mediation, but adjusting for general health likely remove baseline differences.

Task 5: Public Health Interpretation (20 points)

5a. (10 pts) Based on your analyses, write a 4–5 sentence paragraph for a public health audience summarizing:

Whether the association between exercise and physical health differs by any subgroup (interaction assessment)
Which variables confounded the exercise-physical health association
The direction and approximate magnitude of the adjusted exercise effect
Appropriate caveats about cross-sectional data and potential unmeasured confounding

Do not use statistical jargon.

Adults who reported exercising had about 3.9 fewer physically unhealthy days per month than those who did not, even after accounting for age, sex, sleep, education, and income. This benefit was consistent across groups, and none of the subgroup analyses showed meaningful differences in the exercise–health relationship. Income was the only factor that changed the exercise effect, suggesting that part of crude difference was due to socioeconomic conditions rather than exercise alone. Even so, the adjusted difference remained large, indicating that regular physical activity is strongly linked to better physical health. Because these data are cross‑sectional, we cannot be sure whether exercise improves health or whether healthier people are more likely to exercise, and unmeasured factors such as chronic illness or access to safe places to be active (for females) may also influence the results.

5b. (10 pts) A colleague suggests including gen_health as a covariate in the final model because it changes the exercise coefficient by more than 10%. You disagree. Write a 3–4 sentence argument explaining why adjusting for general health may not be appropriate if the goal is to estimate the total effect of exercise on physical health days. Use the concept of mediation in your argument.

General health should not be included in the final model because it is likely part of the pathway through which exercise affects physical health. Regular activity can improve a person’s overall health, which then leads to fewer unhealthy days; adjusting for general health would remove this pathway and make the exercise effect look smaller than it truly is. The large change in the coefficient after adjusting for general health reflects this mediation, not confounding. If the goal is to estimate the total effect of exercise, general health must be omitted so that the model does not control away part of the exercise benefit.

End of Lab Activity

Confounding and Interactions in Regression

EPI 553 — Principles of Statistical Inference II (Spring 2026)

Muntasir Masum

March 24, 2026