Introduction

In previous lectures, we built multiple linear regression models that included several predictors. We interpreted each coefficient as the expected change in \(Y\) for a one-unit increase in \(X_j\), “holding all other predictors constant.” But we did not examine two critical methodological questions:

  1. Confounding: Does the estimated effect of our exposure change when we add or remove covariates? If so, those covariates are confounders, and we need them in the model to get an unbiased estimate.

  2. Interaction (Effect Modification): Is the effect of our exposure the same for everyone, or does it differ across subgroups? If the effect of sleep on physical health is different for men and women, we say that sex modifies the effect of sleep.

These are the two most important methodological concepts in associative (etiologic) modeling, the type of modeling most common in epidemiology.

Research question for today:

Is the association between sleep duration and physically unhealthy days modified by sex or education? And which variables confound this association?


Predictive vs. Associative Modeling

Before we dive in, it is important to revisit the distinction between the two primary goals of regression modeling, because confounding and interaction are only relevant in one of them.

Goal | Question | What matters most |

|******|*********-|******************-| | Predictive | Which set of variables best predicts \(Y\)? | Overall model accuracy (\(R^2\), RMSE, out-of-sample prediction) | | Associative | What is the effect of a specific exposure on \(Y\), after adjusting for confounders? | Accuracy and interpretability of a specific \(\hat{\beta}\) |

In predictive modeling, we want the combination of variables that minimizes prediction error. We do not care whether individual coefficients are “correct” or interpretable, only whether the model predicts well.

In associative modeling, we care deeply about one (or a few) specific coefficients. We want to ensure that \(\hat{\beta}_{\text{exposure}}\) reflects the true relationship, free from confounding bias. This is the setting where confounding and interaction become critical.

In epidemiology, we are almost always doing associative modeling. We have a specific exposure of interest (e.g., sleep) and want to estimate its effect on a health outcome (e.g., physical health days) while controlling for confounders.

A Warning About Extrapolation

We should never extrapolate predictions from a statistical model beyond the range of the observed data. Extrapolation assumes that the relationship continues unchanged into regions where we have no data, which is often false.

A famous example: NASA engineers extrapolated O-ring performance data to predict behavior at temperatures colder than any previously tested. The resulting decision to launch the Space Shuttle Challenger in cold weather led to the 1986 disaster.

Rule of thumb: Your model is only valid within the range of the data used to build it.


Setup and Data

library(tidyverse)
library(haven)
library(janitor)
library(knitr)
library(kableExtra)
library(broom)
library(gtsummary)
library(GGally)
library(car)
library(ggeffects)
library(plotly)
library(lmtest)

options(gtsummary.use_ftExtra = TRUE)
set_gtsummary_theme(theme_gtsummary_compact(set_theme = TRUE))

Preparing the Dataset

We continue with the BRFSS 2020 dataset. For this lecture we shift our outcome to physically unhealthy days and examine sleep as the primary exposure, with sex, education, age, exercise, and general health as potential modifiers and confounders.

brfss_full <- read_xpt(
  "C:/Users/AA843241/OneDrive - University at Albany - SUNY/Desktop/EPI553/lab11/LLCP2020.XPT"
) |>
  clean_names()
brfss_ci <- brfss_full |>
  mutate(
    # Outcome: physically unhealthy days in past 30
    physhlth_days = case_when(
      physhlth == 88                    ~ 0,
      physhlth >= 1 & physhlth <= 30   ~ as.numeric(physhlth),
      TRUE                             ~ NA_real_
    ),
    # Primary exposure: sleep hours
    sleep_hrs = case_when(
      sleptim1 >= 1 & sleptim1 <= 14   ~ as.numeric(sleptim1),
      TRUE                             ~ NA_real_
    ),
    # Mentally unhealthy days (covariate)
    menthlth_days = case_when(
      menthlth == 88                    ~ 0,
      menthlth >= 1 & menthlth <= 30   ~ as.numeric(menthlth),
      TRUE                             ~ NA_real_
    ),
    # Age
    age = age80,
    # Sex
    sex = factor(sexvar, levels = c(1, 2), labels = c("Male", "Female")),
    # Education (4-level)
    education = factor(case_when(
      educa %in% c(1, 2, 3) ~ "Less than HS",
      educa == 4             ~ "HS graduate",
      educa == 5             ~ "Some college",
      educa == 6             ~ "College graduate",
      TRUE                   ~ NA_character_
    ), levels = c("Less than HS", "HS graduate", "Some college", "College graduate")),
    # Exercise in past 30 days
    exercise = factor(case_when(
      exerany2 == 1 ~ "Yes",
      exerany2 == 2 ~ "No",
      TRUE          ~ NA_character_
    ), levels = c("No", "Yes")),
    # General health status
    gen_health = factor(case_when(
      genhlth == 1 ~ "Excellent",
      genhlth == 2 ~ "Very good",
      genhlth == 3 ~ "Good",
      genhlth == 4 ~ "Fair",
      genhlth == 5 ~ "Poor",
      TRUE         ~ NA_character_
    ), levels = c("Excellent", "Very good", "Good", "Fair", "Poor")),
    # Income category (ordinal 1-8)
    income_cat = case_when(
      income2 %in% 1:8 ~ as.numeric(income2),
      TRUE             ~ NA_real_
    )
  ) |>
  filter(
    !is.na(physhlth_days),
    !is.na(sleep_hrs),
    !is.na(menthlth_days),
    !is.na(age), age >= 18,
    !is.na(sex),
    !is.na(education),
    !is.na(exercise),
    !is.na(gen_health),
    !is.na(income_cat)
  )

# Reproducible random sample
set.seed(1220)
brfss_ci <- brfss_ci |>
  select(physhlth_days, sleep_hrs, menthlth_days, age, sex,
         education, exercise, gen_health, income_cat) |>
  slice_sample(n = 5000)

# Save for lab activity
saveRDS(brfss_ci,
  "C:/Users/AA843241/OneDrive - University at Albany - SUNY/Desktop/EPI553/lab11/brfss_ci_2020.rds")

tibble(Metric = c("Observations", "Variables"),
       Value  = c(nrow(brfss_ci), ncol(brfss_ci))) |>
  kable(caption = "Analytic Dataset Dimensions") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Analytic Dataset Dimensions
Metric Value
Observations 5000
Variables 9

Descriptive Statistics

brfss_ci |>
  select(physhlth_days, sleep_hrs, age, sex, education, exercise, gen_health) |>
  tbl_summary(
    label = list(
      physhlth_days ~ "Physically unhealthy days (past 30)",
      sleep_hrs     ~ "Sleep (hours/night)",
      age           ~ "Age (years)",
      sex           ~ "Sex",
      education     ~ "Education level",
      exercise      ~ "Any physical activity (past 30 days)",
      gen_health    ~ "General health status"
    ),
    statistic = list(
      all_continuous()  ~ "{mean} ({sd})",
      all_categorical() ~ "{n} ({p}%)"
    ),
    digits = all_continuous() ~ 1,
    missing = "no"
  ) |>
  add_n() |>
  bold_labels() |>
  italicize_levels() |>
  modify_caption("**Table 1. Descriptive Statistics — BRFSS 2020 (n = 5,000)**") |>
  as_flex_table()
**Table 1. Descriptive Statistics — BRFSS 2020 (n = 5,000)**

Characteristic

N

N = 5,0001

Physically unhealthy days (past 30)

5,000

3.4 (7.9)

Sleep (hours/night)

5,000

7.1 (1.4)

Age (years)

5,000

54.1 (17.0)

Sex

5,000

Male

2,291 (46%)

Female

2,709 (54%)

Education level

5,000

Less than HS

276 (5.5%)

HS graduate

1,263 (25%)

Some college

1,378 (28%)

College graduate

2,083 (42%)

Any physical activity (past 30 days)

5,000

3,929 (79%)

General health status

5,000

Excellent

1,031 (21%)

Very good

1,762 (35%)

Good

1,507 (30%)

Fair

524 (10%)

Poor

176 (3.5%)

1Mean (SD); n (%)

Part 1: Guided Practice — Confounding and Interactions


1. Interaction (Effect Modification)

1.1 What Is Interaction?

Interaction (also called effect modification) is present when the relationship between an exposure and an outcome is different at different levels of a third variable. In regression terms, the slope of the exposure-outcome relationship changes depending on the value of the modifier.

For example, if the effect of sleep on physical health is stronger for women than for men, then sex modifies the effect of sleep. The two variables (sleep and sex) have a multiplicative, not merely additive, effect on the outcome.

This is fundamentally different from confounding:

Concept | Question | Implication |

|*********|*********-|************-| | Confounding | Is the crude estimate of the exposure effect biased by a third variable? | Must adjust for the confounder to get a valid estimate | | Interaction | Does the effect of the exposure differ across subgroups? | Must report stratum-specific effects, not a single overall estimate |

Critical point: Always assess interaction before confounding. If interaction is present, a single “adjusted” coefficient for the exposure is misleading because the effect is not the same for everyone. You must stratify or include interaction terms.

1.2 Interaction Between a Continuous and a Dichotomous Variable

Consider a model with a continuous exposure \(X_1\) (sleep hours) and a dichotomous variable \(X_2\) (sex, where Male = 1 and Female = 0):

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \varepsilon\]

The term \(\beta_3 X_1 X_2\) is the interaction term. Let’s see what happens when we plug in the values for each group:

For males (\(X_2 = 1\)): \[E(Y | X_1, \text{Male}) = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) X_1\]

For females (\(X_2 = 0\)): \[E(Y | X_1, \text{Female}) = \beta_0 + \beta_1 X_1\]

The key insight:

  • The intercept for males is \(\beta_0 + \beta_2\), while for females it is \(\beta_0\)
  • The slope for males is \(\beta_1 + \beta_3\), while for females it is \(\beta_1\)
  • \(\beta_3\) is the difference in slopes between the two groups

If \(\beta_3 = 0\), the slopes are equal and the lines are parallel (no interaction). If \(\beta_3 \neq 0\), the lines are not parallel (interaction is present).

# Create synthetic data to illustrate the concept
set.seed(1220)
n <- 200
concept_data <- tibble(
  x1 = runif(n, 0, 10),
  group = rep(c("Group A", "Group B"), each = n / 2)
)

# No interaction: same slope, different intercepts
no_int <- concept_data |>
  mutate(
    y = ifelse(group == "Group A", 2 + 1.5 * x1, 5 + 1.5 * x1) + rnorm(n, 0, 2)
  )

# Interaction: different slopes
with_int <- concept_data |>
  mutate(
    y = ifelse(group == "Group A", 2 + 1.5 * x1, 5 + 0.3 * x1) + rnorm(n, 0, 2)
  )

p1 <- ggplot(no_int, aes(x = x1, y = y, color = group)) +
  geom_point(alpha = 0.3, size = 1) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1.2) +
  labs(title = "No Interaction", subtitle = "Parallel lines: same slope",
       x = expression(X[1]), y = "Y") +
  theme_minimal(base_size = 12) +
  scale_color_brewer(palette = "Set1") +
  theme(legend.position = "bottom")

p2 <- ggplot(with_int, aes(x = x1, y = y, color = group)) +
  geom_point(alpha = 0.3, size = 1) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1.2) +
  labs(title = "Interaction Present", subtitle = "Non-parallel lines: different slopes",
       x = expression(X[1]), y = "Y") +
  theme_minimal(base_size = 12) +
  scale_color_brewer(palette = "Set1") +
  theme(legend.position = "bottom")

gridExtra::grid.arrange(p1, p2, ncol = 2)
No Interaction (Parallel Lines) vs. Interaction (Non-Parallel Lines)

No Interaction (Parallel Lines) vs. Interaction (Non-Parallel Lines)

Interpretation: The left panel illustrates no interaction: both groups have the same slope (1.5), so the lines are perfectly parallel. The group variable shifts the intercept (Group B starts higher) but does not change the rate at which \(Y\) increases with \(X_1\). In this scenario, a single slope coefficient adequately describes the \(X_1\)-\(Y\) relationship for both groups. The right panel illustrates interaction: Group A has a steep slope (1.5) while Group B has a shallow slope (0.3). The non-parallel lines mean that the effect of \(X_1\) on \(Y\) depends on which group you belong to, and a single pooled slope would misrepresent the relationship for both groups. This is the visual test for interaction: parallel lines = no interaction; non-parallel lines = interaction.


2.2 Limitations of Stratified Analysis

The stratified approach is intuitive and visually compelling, but it has important limitations:

  1. No formal test: We can see the slopes are different, but we cannot compute a p-value for the difference without additional machinery
  2. Reduced power: Each stratum has fewer observations than the full dataset
  3. Multiple comparisons: With \(k\) strata, we would need \(k\) separate models
  4. Cannot adjust for covariates easily while testing the interaction

The solution is to use a single regression model with an interaction term.


3. Testing for Interaction: Regression with Interaction Terms

3.1 The Interaction Model in R

R provides two operators for specifying interactions:

  • X1:X2 — includes only the interaction term \(X_1 \times X_2\)
  • X1*X2 — shorthand for X1 + X2 + X1:X2 (main effects plus interaction)

Rule: Any model with an interaction term must also include the main effects that comprise the interaction. Always use * or explicitly include both main effects with :.

3.3 Testing for Interaction (Parallelism)

The formal test for interaction is simply the t-test for the interaction coefficient:

\[H_0: \beta_3 = 0 \quad \text{(slopes are equal, lines are parallel, no interaction)}\] \[H_A: \beta_3 \neq 0 \quad \text{(slopes differ, interaction is present)}\]

3.4 Testing for Coincidence

A stronger test asks whether sex has any effect on the relationship (neither intercept nor slope differs):

\[H_0: \beta_2 = \beta_3 = 0 \quad \text{(the two lines are identical)}\] \[H_A: \text{At least one } \neq 0 \quad \text{(the lines differ in intercept and/or slope)}\]

4. Interactions with Categorical Variables (k > 2)

4.1 Multiple Interaction Terms

When the modifier has \(k > 2\) categories (such as education with 4 levels), the interaction between a continuous exposure and the categorical modifier requires \(k - 1\) interaction terms, one for each dummy variable.

For sleep \(\times\) education (with “Less than HS” as reference):

\[Y = \beta_0 + \beta_1(\text{Sleep}) + \beta_2(\text{HS grad}) + \beta_3(\text{Some college}) + \beta_4(\text{College grad}) + \beta_5(\text{Sleep} \times \text{HS grad}) + \beta_6(\text{Sleep} \times \text{Some college}) + \beta_7(\text{Sleep} \times \text{College grad}) + \varepsilon\]

Each interaction coefficient \(\beta_5, \beta_6, \beta_7\) represents the difference in the sleep slope between that education group and the reference group.

4.2 Testing the Overall Interaction with a Partial F-Test

Individual t-tests for each interaction term may be non-significant, yet the interaction as a whole could be meaningful. To test whether sleep \(\times\) education is significant overall, we use a partial F-test comparing the model with and without the interaction terms:

\[H_0: \beta_5 = \beta_6 = \beta_7 = 0 \quad \text{(the sleep slope is the same across all education levels)}\]


5. Continuous × Continuous Interactions

5.1 When Both Variables Are Continuous

Interaction is not limited to categorical modifiers. We can also examine whether the effect of one continuous predictor changes across values of another continuous predictor. For example, does the sleep-physical health association differ by age?

\[Y = \beta_0 + \beta_1(\text{Sleep}) + \beta_2(\text{Age}) + \beta_3(\text{Sleep} \times \text{Age}) + \varepsilon\]

Here, \(\beta_3\) tells us how the slope of sleep changes for each one-year increase in age:

  • The slope of sleep for a person of age \(a\) is: \(\beta_1 + \beta_3 \cdot a\)
  • If \(\beta_3 < 0\), the protective effect of sleep weakens (becomes less negative) as age increases
  • If \(\beta_3 > 0\), the protective effect of sleep strengthens as age increases

5.2 Visualizing Continuous Interactions

With continuous modifiers, we visualize the interaction by plotting the predicted relationship at specific values of the modifier (e.g., age = 30, 50, 70):

If the lines are approximately parallel, age does not modify the sleep effect. If they fan out or converge, the interaction is meaningful.


6. Confounding

6.1 What Is Confounding?

Confounding exists when the estimated association between an exposure and an outcome is distorted because of a third variable that is related to both. When confounding is present, ignoring the confounder leads to a biased estimate of the exposure effect.

For a variable to be a confounder, it must satisfy three conditions:

  1. Associated with the exposure (sleep hours)
  2. Associated with the outcome (physical health days), even in the absence of the exposure
  3. Not on the causal pathway from exposure to outcome (i.e., not a mediator)
Confounding Structure: The Confounder Affects Both Exposure and Outcome

Confounding Structure: The Confounder Affects Both Exposure and Outcome

Age is a classic confounder of the sleep-physical health relationship:

  • Older adults tend to sleep fewer hours (associated with exposure)
  • Older adults have more physically unhealthy days (associated with outcome)
  • Age is not caused by sleep or physical health (not on the causal pathway)

6.2 Detecting Confounding: The 10% Change-in-Estimate Rule

The standard approach in epidemiology is to compare the crude (unadjusted) estimate to the adjusted estimate:

  1. Fit the crude model: \(Y = \beta_0 + \beta_1 X_1\)
  2. Fit the adjusted model: \(Y = \beta_0^* + \beta_1^* X_1 + \beta_2^* X_2\)
  3. Compute the percent change: \(\frac{|\beta_1 - \beta_1^*|}{|\beta_1|} \times 100\%\)
  4. If the change exceeds 10%, \(X_2\) is a confounder and should be included in the model

Important: This is a rule of thumb, not a rigid cutoff. The 10% threshold is conventional, not absolute. Some researchers use 5% or 15% depending on context.

6.5 Important Caveats About Confounding Assessment

Identifying candidate confounders is not purely a statistical exercise. The three conditions for confounding (associated with exposure, associated with outcome, not on the causal pathway) require substantive knowledge from the literature and your understanding of the causal structure.

Caution about missing data: When you add a covariate to the model, observations with missing values on that covariate are dropped. This changes the analytic sample, which could alter \(\hat{\beta}\) for reasons unrelated to confounding. Always ensure that the crude and adjusted models are fit on the same set of observations.

# Verify our dataset has no missing values (we filtered earlier)
cat("Missing values in analytic dataset:", sum(!complete.cases(brfss_ci)), "\n")
## Missing values in analytic dataset: 0
cat("All models are fit on the same n =", nrow(brfss_ci), "observations.\n")
## All models are fit on the same n = 5000 observations.

Confounders vs. mediators: A variable on the causal pathway between exposure and outcome is a mediator, not a confounder. For example, if sleep affects exercise habits, which in turn affect physical health, then exercise is a mediator. Adjusting for a mediator would attenuate the exposure effect, but this attenuation is not bias; it reflects the removal of an indirect pathway. Whether to adjust depends on your research question.

General health status is a tricky case. It could be a confounder (poor overall health causes both poor sleep and more physically unhealthy days) or a mediator (poor sleep leads to poor general health, which leads to more physically unhealthy days). The direction depends on the assumed causal structure and should be guided by subject-matter knowledge.


7. Order of Operations: Interaction Before Confounding

7.1 Why Interaction Comes First

The standard epidemiological approach to model building follows this order:

  1. Assess interaction first — test whether the exposure-outcome relationship differs across subgroups
  2. If interaction is present — stratify by the modifier and assess confounding within each stratum
  3. If no interaction — assess confounding in the pooled (unstratified) data

The reason for this order is that confounding assessment assumes a single exposure effect. If the effect actually varies across subgroups (interaction), then a single “adjusted” coefficient is misleading. Reporting an average effect when the true effect differs by sex, for example, could mask important public health heterogeneity.

7.2 A Practical Workflow

Workflow: Interaction Before Confounding
Step Action
1 Specify the exposure-outcome relationship of interest
2 Identify potential effect modifiers (from literature, biological plausibility)
3 Test for interaction using interaction terms or stratified analysis
4 If interaction is present: report stratum-specific effects; assess confounding within strata
5 If no interaction: assess confounding in the full sample using the 10% change-in-estimate rule

Summary of Key Concepts

Concept | Key Point |

|*********|*********–| | Predictive vs. associative | Confounding and interaction matter for associative (etiologic) modeling | | Interaction | The effect of the exposure differs across levels of a modifier | | Stratified analysis | Fit separate models per stratum; informative but no formal test | | Interaction term | \(X_1 \times X_2\) in the model; t-test on \(\beta_3\) tests parallelism | | : vs. * in R | : = interaction only; * = main effects + interaction | | Must include main effects | Never include \(X_1 X_2\) without also including \(X_1\) and \(X_2\) | | Partial F-test for interaction | Tests all \(k - 1\) interaction terms simultaneously | | Continuous \(\times\) continuous | The slope of one predictor changes linearly with the other | | Confounding | A third variable distorts the exposure-outcome association | | Three conditions | Associated with exposure, associated with outcome, not on causal pathway | | 10% change-in-estimate | Compare crude and adjusted \(\hat{\beta}\); >10% change = confounder | | Interaction before confounding | Assess interaction first; if present, stratify before assessing confounding | | Mediator vs. confounder | Adjusting for a mediator removes an indirect effect, not bias |



Part 2: In-Class Lab Activity

EPI 553 — Confounding and Interactions Lab Due: End of class, March 24, 2026


Instructions

In this lab, you will assess interaction and confounding in the BRFSS 2020 dataset. Work through each task systematically. You may discuss concepts with classmates, but your written answers and R code must be your own.

Submission: Knit your .Rmd to HTML and upload to Brightspace by end of class.


Data for the Lab

Use the saved analytic dataset from today’s lecture. It contains 5,000 randomly sampled BRFSS 2020 respondents.

Variable | Description | Type |

|*********-|************-|******| | physhlth_days | Physically unhealthy days in past 30 | Continuous (0–30) | | sleep_hrs | Sleep hours per night | Continuous (1–14) | | menthlth_days | Mentally unhealthy days in past 30 | Continuous (0–30) | | age | Age in years (capped at 80) | Continuous | | sex | Sex (Male/Female) | Factor | | education | Education level (4 categories) | Factor | | exercise | Any physical activity (Yes/No) | Factor | | gen_health | General health status (5 categories) | Factor | | income_cat | Household income (1–8 ordinal) | Numeric |

# Load the dataset
library(tidyverse)
library(broom)
library(knitr)
library(kableExtra)
library(car)
library(ggeffects)

brfss_ci <- readRDS(
  "C:/Users/AA843241/OneDrive - University at Albany - SUNY/Desktop/EPI553/lab11/brfss_ci_2020.rds"
)

Task 1: Exploratory Data Analysis (15 points)

1a. (5 pts) Create a scatterplot of physhlth_days (y-axis) vs. age (x-axis), colored by exercise status. Add separate regression lines for each group. Describe the pattern you observe.

ggplot(brfss_ci, aes(x = age, y = physhlth_days, color = exercise)) +  geom_point(alpha = 0.3, size = 1.5) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(
    title = "Physical Health Days vs. Age by Exercise Status",
    x = "Age",
    y = "Poor Physical Health Days",
    color = "Exercise"
  ) +
  theme_minimal(base_size = 12) +
  scale_color_brewer(palette = "Set1") +
  theme(legend.position = "bottom")

>Interpretation: The panel above illustrates interaction: Exercise group “No” has a steeper positive slope, while the “Yes” group has a near-flat slope. These non-parallel lines indicate that the association between age and poor physical health days is stronger among non-exercisers — a single pooled slope would misrepresent the relationship for both groups.

1b. (5 pts) Compute the mean physhlth_days for each combination of sex and exercise. Present the results in a table. Does it appear that the association between exercise and physical health days might differ by sex?

mean_phys_exer_sex <- aggregate(physhlth_days ~ exercise + sex, data = brfss_ci, FUN=mean) 

print(mean_phys_exer_sex)
##   exercise    sex physhlth_days
## 1       No   Male      6.643498
## 2      Yes   Male      2.323577
## 3       No Female      7.430400
## 4      Yes Female      2.451056

Interpretation: There seems to be a difference of association between groups of exercise and physical health days by sex. Therefore sex - interaction.

1c. (5 pts) Create a scatterplot of physhlth_days vs. sleep_hrs, faceted by education level. Comment on whether the slopes appear similar or different across education groups.

ggplot(brfss_ci, aes(x = sleep_hrs, y = physhlth_days, color = education)) +  geom_point(alpha = 0.3, size = 1.5) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(
    title = "Physical Health Days vs. Sleep Hours by Education category",
    x = "Sleep hours",
    y = "Poor Physical Health Days",
    color = "Education"
  ) +
  theme_minimal(base_size = 12) +
  scale_color_brewer(palette = "Set1") +
  theme(legend.position = "bottom")

>Interpretation: The panel above illustrates interaction: Education group “College Graduate” has a near-flat slope, while for other education groups it is steeper at different digrees. These non-parallel lines indicate that the association between sleep hours and poor physical health days is weaker among college graduates — a single pooled slope would misrepresent the relationship for these groups.


Task 2: Stratified Analysis for Interaction (15 points)

2a. (5 pts) Fit separate simple linear regression models of physhlth_days ~ age for exercisers and non-exercisers. Report the slope, SE, and 95% CI for age in each stratum.

# Fit separate models for exerciseYes and exerciseNo
mod_yes_exer <- lm(physhlth_days ~ age, data = brfss_ci |> filter(exercise == "Yes"))
mod_no_exer <- lm(physhlth_days ~ age, data = brfss_ci |> filter(exercise == "No"))


bind_rows(
  tidy(mod_yes_exer, conf.int = TRUE) |> mutate(Stratum = "exerciseYes"),
  tidy(mod_no_exer, conf.int = TRUE) |> mutate(Stratum = "exerciseNo")
) |>
  filter(term == "age") |>
  select(Stratum, estimate, std.error, conf.low, conf.high, p.value) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Stratified Analysis: age → Physical Health Days, by exercise",
    col.names = c("Stratum", "Estimate", "SE", "CI Lower", "CI Upper", "p-value")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Stratified Analysis: age → Physical Health Days, by exercise
Stratum Estimate SE CI Lower CI Upper p-value
exerciseYes 0.0192 0.0060 0.0075 0.0309 0.0013
exerciseNo 0.0745 0.0212 0.0328 0.1161 0.0005

Interpretation: The stratified analysis reveals that the age-physical health association is positive in both groups: among not exercising people, each additional year of age is associated with approximately 0.0745 more physically unhealthy days, while among exercising people the association is weaker at approximately 0.0192 more days per additional year of age. Both estimates are statistically significant. The slopes appear to differ in magnitude, suggesting that the association of age with physical health may be somewhat stronger for non exercising people.

2b. (5 pts) Create a single plot showing the two fitted regression lines (one per exercise group) overlaid on the data. Are the lines approximately parallel?

ggplot(brfss_ci, aes(x = age, y = physhlth_days, color = exercise)) +
  geom_jitter(alpha = 0.08, width = 0.3, height = 0.3, size = 0.8) +
  geom_smooth(method = "lm", se = TRUE, linewidth = 1.2) +
  labs(
    title    = "Association Between Age and Physical Health Days, Stratified by exercise",
    subtitle = "Are the slopes parallel or different?",
    x        = "Age",
    y        = "Physically Unhealthy Days (Past 30)",
    color    = "Exercise"
  ) +
  theme_minimal(base_size = 13) +
  scale_color_brewer(palette = "Set1")

>Interpretation: The plot shows both regression lines sloping upward, confirming the positive association between age and physically unhealthy days for both exercise groups. But The non-exercising group’s regression line (steeper slope) suggests a stronger association of age compared to the exercising group’s line.The confidence bands do not overlap, therefore the difference in slopes statistically significant (<0.05). However, the jittered points reveal the highly skewed nature of the outcome: most respondents cluster near zero unhealthy days regardless of age, with a long tail of individuals reporting many unhealthy days.

2c. (5 pts) Can you formally test whether the two slopes are different using only the stratified results? Explain why or why not.

No, there are limitations of stratified approach: 1. No formal test: We can see the slopes are different, but we cannot compute a p-value for the difference without additional machinery 2. Reduced power: Each stratum has fewer observations than the full dataset 3. Multiple comparisons: With \(k\) strata, we would need \(k\) separate models 4. Cannot adjust for covariates easily while testing the interaction

The solution is to use a single regression model with an interaction term.


Task 3: Interaction via Regression (25 points)

3a. (5 pts) Fit the interaction model: physhlth_days ~ age * exercise. Write out the fitted equation.

# Model without interaction (additive model)
mod_phys_age_ex <- lm(physhlth_days ~ age + exercise, data = brfss_ci)

# Model with interaction
mod_phys_age_ex_int <- lm(physhlth_days ~ age * exercise, data = brfss_ci)

# Equivalent to: lm(physhlth_days ~ age * exercise, data = brfss_ci)

tidy(mod_phys_age_ex_int, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Interaction Model: Age * exercise → Physical Health Days",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Interaction Model: Age * exercise → Physical Health Days
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 2.7610 0.8798 3.1381 0.0017 1.0361 4.4858
age 0.0745 0.0145 5.1203 0.0000 0.0460 0.1030
exerciseYes -1.3858 0.9664 -1.4340 0.1516 -3.2804 0.5087
age:exerciseYes -0.0553 0.0162 -3.4072 0.0007 -0.0871 -0.0235

\[{The\ fitted\ model:}\widehat{\text{Phys Days}} = 2.76 + 0.07\cdot(\text{Age}) -1.38\cdot (\text{ExerciseYes}) -0.05\cdot(\text{Age} \times \text{ExerciseYes})\] \[{Exercise-No}:\widehat{\text{Phys Days}} = 2.76 + 0.07\cdot(\text{Age})\]

\[{Exercise-Yes:}\widehat{\text{Phys Days}} = (2.76 -1.38) + (0.07 -0.05)\cdot(\text{Age}) = 1.38 + 0.02\cdot(\text{Age})\]

3b. (5 pts) Using the fitted equation, derive the stratum-specific equations for exercisers and non-exercisers. Verify that these match the stratified analysis from Task 2.

# Compare both models
tidy(mod_phys_age_ex_int) |> mutate(model = "With Interaction") |>
  bind_rows(tidy(mod_yes_exer) |> mutate(model = "Stratum-specific:Yes"),
            tidy(mod_no_exer) |> mutate(model = "Stratum-specific:No")
            ) |>
  select(model, term, estimate, std.error, p.value) |>
  mutate(across(where(is.numeric), ~ round(., 4))) |>
  kable(caption = "Model Comparison: With vs Without Interaction") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Model Comparison: With vs Without Interaction
model term estimate std.error p.value
With Interaction (Intercept) 2.7610 0.8798 0.0017
With Interaction age 0.0745 0.0145 0.0000
With Interaction exerciseYes -1.3858 0.9664 0.1516
With Interaction age:exerciseYes -0.0553 0.0162 0.0007
Stratum-specific:Yes (Intercept) 1.3752 0.3327 0.0000
Stratum-specific:Yes age 0.0192 0.0060 0.0013
Stratum-specific:No (Intercept) 2.7610 1.2837 0.0317
Stratum-specific:No age 0.0745 0.0212 0.0005

\[{Stratum:Exercise-No}:\widehat{\text{Phys Days}} = 2.76 + 0.07\cdot(\text{Age})\] \[{Stratum:Exercise-Yes}:\widehat{\text{Phys Days}}= 1.3752 + 0.0192 \cdot(\text{Age}) = 1.38 + 0.02\cdot (\text{Age})\] 3c. (5 pts) Conduct the t-test for the interaction term (age:exerciseYes). State the null and alternative hypotheses, report the test statistic and p-value, and state your conclusion about whether interaction is present.

\[H_0: \beta_3 = 0 \quad \text{(slopes are equal, lines are parallel, no interaction)}\] \[H_A: \beta_3 \neq 0 \quad \text{(slopes differ, interaction is present)}\]

int_term <- tidy(mod_phys_age_ex_int) |> filter(term == "age:exerciseYes")
cat("Interaction term (age:exerciseYes):\n")
## Interaction term (age:exerciseYes):
cat("  Estimate:", round(int_term$estimate, 4), "\n")
##   Estimate: -0.0553
cat("  t-statistic:", round(int_term$statistic, 3), "\n")
##   t-statistic: -3.407
cat("  p-value:", round(int_term$p.value, 4), "\n")
##   p-value: 7e-04

Interpretation: The interaction term (age:exerciseYes) has a coefficient of -0.0553, with a t-statistic of -3.407 and p-value of <0.001. Since the p-value lower than \(\alpha = 0.05\) threshold, we reject the null hypothesis that the slopes are equal. In other words, there is statistically significant evidence that the association between age and physically unhealthy days differs by exercise group. Non-exercisers accumulate 0.07 poor health days per year of age, compared to only 0.02 for exercisers. The negative interaction coefficient captures this difference, indicating that the association between age and poor physical health days is significantly weaker among those who exercise.

3d. (5 pts) Now fit a model with an interaction between age and education: physhlth_days ~ age * education. How many interaction terms are produced? Use a partial F-test to test whether the age \(\times\) education interaction as a whole is significant.

mod_phys_age_edu_int <- lm(physhlth_days ~ age * education, data = brfss_ci)

# Equivalent to: lm(physhlth_days ~ age * education, data = brfss_ci)
tidy(mod_phys_age_edu_int, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Interaction Model: Sleep * Education → Physical Health Days (less than high school reference group",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Interaction Model: Sleep * Education → Physical Health Days (less than high school reference group
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 2.7542 1.5422 1.7859 0.0742 -0.2692 5.7775
age 0.0706 0.0269 2.6277 0.0086 0.0179 0.1233
educationHS graduate -1.4016 1.6900 -0.8293 0.4069 -4.7146 1.9115
educationSome college -1.1261 1.6893 -0.6666 0.5051 -4.4380 2.1857
educationCollege graduate -2.7102 1.6580 -1.6346 0.1022 -5.9606 0.5402
age:educationHS graduate -0.0197 0.0296 -0.6661 0.5054 -0.0776 0.0382
age:educationSome college -0.0326 0.0295 -1.1039 0.2697 -0.0904 0.0253
age:educationCollege graduate -0.0277 0.0289 -0.9583 0.3380 -0.0844 0.0290

Interpretation: This model produces three interaction terms, one for each education level compared to the reference category (“Less than HS”). The reference group slope for age is 0.0706, meaning that among those with less than a high school education, each additional year of age is associated with approximately 0.0706 more physically unhealthy days. The interaction terms for “HS graduate” (-0.0197), “Some college” (-0.0326), and “College graduate” (-0.0277) are negative but not individually significant (p > 0.05), indicating their slopes do not significantly differ from the reference group. Suggesting that the negative effect of age is not weaker among other education categories compared to those with less than a high school education.

mod_age_only <- lm(physhlth_days ~ age, data = brfss_ci)

anova(mod_age_only, mod_phys_age_edu_int) |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Test for Coincidence: Does Education Affect the Age-Physical Health Relationship at All?") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Test for Coincidence: Does Education Affect the Age-Physical Health Relationship at All?
term df.residual rss df sumsq statistic p.value
physhlth_days ~ age 4998 312544.3 NA NA NA NA
physhlth_days ~ age * education 4992 306671.9 6 5872.367 15.9317 0

Interpretation: The partial F-test evaluates whether education contributes anything to the model (either through a different intercept, a different slope, or both). The F-statistic is 15.93 with a p-value <0.001. Since p < 0.05, we reject the null hypothesis that education adds nothing to the model (i.e., that all education coefficients are jointly zero). This means that, in this unadjusted model, education significantly modifies either the baseline level or the slope of the age-physical health relationship.

3e. (5 pts) Create a visualization using ggpredict() showing the predicted physhlth_days by age for each education level. Do the lines appear parallel?

pred_int_age_edu <- ggpredict(mod_phys_age_edu_int, terms = c("age", "education"))

ggplot(pred_int_age_edu, aes(x = x, y = predicted, color = group, fill = group)) +
  geom_line(linewidth = 1.2) +
  geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = 0.12, color = NA) +
  labs(
    title    = "Predicted Physical Health Days by Age and Education",
    subtitle = "From interaction model: Age * Education",
    x        = "Age in years",
    y        = "Predicted Physically Unhealthy Days (last 30 days)",
    color    = "Education",
    fill     = "Education"
  ) +
  theme_minimal(base_size = 13) +
  scale_color_brewer(palette = "Set1")

> The lines are approximately parallel, age does not modify the education effect.

Interpretation: The 4 education-specific lines clearly approximately parallel among almost all education categories, confirming the statistically non-significant interaction. The Confidence interval for less than HS is overlapping with all other education categories, indicating that difference is not-statistically significant. Overall, Confidence intervals of all education categories are overlapping between each other in some way.
***

Task 4: Confounding Assessment (25 points)

For this task, the exposure.is exercise and the outcome is physhlth_days.

4a. (5 pts) Fit the crude model: physhlth_days ~ exercise. Report the exercise coefficient. This is the unadjusted estimate.

# Crude model: exercise → physical health days
mod_crude_exe   <- lm(physhlth_days ~ exercise, data = brfss_ci)
# Adjusted model: adding sex
mod_adj_exe_sex <- lm(physhlth_days ~ exercise + sex, data = brfss_ci)
# Extract exercise coefficients
mod_crude_ex <- coef(mod_crude_exe)["exerciseYes"]
mod_adj_sex  <- coef(mod_adj_exe_sex)["exerciseYes"]
pct_change   <- abs(mod_crude_ex - mod_adj_sex) / abs(mod_crude_ex) * 100

tribble(
  ~Model, ~`Exercise B`, ~SE, ~`95% CI`,
  "Crude (Exercise only)",
    round(mod_crude_ex, 4),
    round(tidy(mod_crude_exe) |> filter(term == "exerciseYes") |> pull(std.error), 4),
    paste0("(", round(confint(mod_crude_exe)["exerciseYes", 1], 4), ", ",
                round(confint(mod_crude_exe)["exerciseYes", 2], 4), ")"),
  "Adjusted (+ Sex)",
    round(mod_adj_sex, 4),
    round(tidy(mod_adj_exe_sex) |> filter(term == "exerciseYes") |> pull(std.error), 4),
    paste0("(", round(confint(mod_adj_exe_sex)["exerciseYes", 1], 4), ", ",
                round(confint(mod_adj_exe_sex)["exerciseYes", 2], 4), ")")
) |>
  kable(caption = "Confounding Assessment: Does Sex Confound the Exercise–Physical Health Association?") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Confounding Assessment: Does Sex Confound the Exercise–Physical Health Association?
Model Exercise B SE 95% CI
Crude (Exercise only) -4.7115 0.2656 (-5.2322, -4.1908)
Adjusted (+ Sex) -4.6974 0.2658 (-5.2185, -4.1762)
cat("\nPercent change in exercise coefficient when adding Sex:", round(pct_change, 1), "%\n")
## 
## Percent change in exercise coefficient when adding Sex: 0.3 %
cat("10% rule: crude =", round(mod_crude_ex, 4),
    "| adjusted =", round(mod_adj_sex, 4),
    "| change =", round(pct_change, 1), "%\n")
## 10% rule: crude = -4.7115 | adjusted = -4.6974 | change = 0.3 %

Interpretation: The crude (unadjusted) estimate for exercise is -4.7115, meaning that without controlling for any covariates, any physical activity is associated with approximately 4.71 fewer physically unhealthy days.After adjusting for Sex, the exercise coefficient changes to -4.6974 , a 0.3 % change. Because this does not exceed the 10% threshold, Sex is not a confounder of the exercise-physical health association.

4b. (10 pts) Systematically assess whether each of the following is a confounder of the exercise-physical health association: age, sex, sleep_hrs, education, and income_cat. For each:

  • Fit the model physhlth_days ~ exercise + [covariate]
  • Report the adjusted exercise coefficient
  • Compute the percent change from the crude estimate
  • Apply the 10% rule to determine if the variable is a confounder
# Crude model
mod_crude_val <- coef(mod_crude_exe)["exerciseYes"]

# One-at-a-time adjusted models
confounders <- list(
  "Age"             = lm(physhlth_days ~ exercise + age, data = brfss_ci),
  "Sleep hours"     = lm(physhlth_days ~ exercise + sleep_hrs, data = brfss_ci),
  "Education"       = lm(physhlth_days ~ exercise + education, data = brfss_ci),
  "General health"  = lm(physhlth_days ~ exercise + gen_health, data = brfss_ci),
  "Sex"             = lm(physhlth_days ~ exercise + sex, data = brfss_ci),
  "Income"          = lm(physhlth_days ~ exercise + income_cat, data = brfss_ci)
)

conf_table <- map_dfr(names(confounders), \(name) {
  mod <- confounders[[name]]
  mod_adj_val <- coef(mod)["exerciseYes"]
  tibble(
    Covariate = name,
    `Crude B (exerciseYes)` = round(mod_crude_val, 4),
    `Adjusted B exerciseYes)` = round(mod_adj_val, 4),
    `% Change` = round(abs(mod_crude_val - mod_adj_val) / abs(mod_crude_val) * 100, 1),
    Confounder = ifelse(abs(mod_crude_val - mod_adj_val) / abs(mod_crude_val) * 100 > 10,
                        "Yes (>10%)", "No")
  )
})

conf_table |>
  kable(caption = "Systematic Confounding Assessment: One-at-a-Time Addition") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) |>
  column_spec(5, bold = TRUE)
Systematic Confounding Assessment: One-at-a-Time Addition
Covariate Crude B (exerciseYes) Adjusted B exerciseYes) % Change Confounder
Age -4.7115 -4.5504 3.4 No
Sleep hours -4.7115 -4.6831 0.6 No
Education -4.7115 -4.3912 6.8 No
General health -4.7115 -1.6596 64.8 Yes (>10%)
Sex -4.7115 -4.6974 0.3 No
Income -4.7115 -3.9406 16.4 Yes (>10%)

Present your results in a single summary table. >Interpretation: The systematic assessment reveals only two variables that meet the 10% change-in-estimate criterion for confounding: - General health (64.8% change): This variable produced the largest change by far, cutting the exercise coefficient by more than half. However, as we discuss in Section 6.5, this extreme attenuation raises the question of whether general health is a confounder or a mediator.Because general health is likely lays in causal pathaway between exercise and physical health days. - Income (16.4% change): Income confounds the association, likely because higher-income individuals tend to exercise more regularly and have better access to healthcare, both of which relate to physical health. 4 variables did not meet the confounding threshold: age (3.4%), sleep hours (0.6%), education (6.8%), and sex (0.3%). Their inclusion in the model would not meaningfully alter the estimated exercise effect.

4c. (5 pts) Fit a fully adjusted model including exercise and all identified confounders. Report the exercise coefficient and compare it to the crude estimate. How much did the estimate change overall?

#  If interaction is not significant, fit additive model with confounders
mod_adj_final <- lm(physhlth_days ~ exercise + sleep_hrs + sex + age + education,
                data = brfss_ci)

tidy(mod_adj_final, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Final Model: Exercise → Physical Health Days, Adjusted for Confounders",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Final Model: Exercise → Physical Health Days, Adjusted for Confounders
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 9.9181 0.7955 12.4676 0.0000 8.3585 11.4776
exerciseYes -4.1551 0.2711 -15.3273 0.0000 -4.6866 -3.6236
sleep_hrs -0.4318 0.0805 -5.3662 0.0000 -0.5895 -0.2740
sexFemale 0.2862 0.2178 1.3140 0.1889 -0.1408 0.7132
age 0.0355 0.0065 5.4878 0.0000 0.0228 0.0482
educationHS graduate -1.9241 0.5085 -3.7838 0.0002 -2.9209 -0.9272
educationSome college -2.0609 0.5065 -4.0688 0.0000 -3.0539 -1.0679
educationCollege graduate -2.9504 0.4955 -5.9548 0.0000 -3.9217 -1.9791

Interpretation of the final model: After adjusting for sleep_hours, sex, age, and education, any physical activity in past 30 days (exercise) is associated with -4.15 fewer physically unhealthy days (95% CI: -4.69 to -3.62, p < 0.001). Additional notable findings from the adjusted model: - Age has a only small positive association: each additional year of age is associated with 0.035 more unhealthy days. - Education shows a graded protective pattern: compared to those with less than a high school education, college graduates report approximately -2.95 fewer unhealthy days. - Sleep has also negative association: each additional hour of sleep is associated with 0.4318 fewer unhealthy days. - Sex is not significantly associated with physical health days after adjustment p= 0.71.

mod_final <- coef(mod_adj_final)["exerciseYes"]

tribble(
  ~Model, ~`Exercise B`, ~`% Change from Crude`,
  "Crude",
    round(mod_crude_val, 4), "—",
  "Adjusted (sex only)",
    round(coef(mod_adj_exe_sex)["exerciseYes"], 4),
    paste0(round(abs(mod_crude_val - coef(mod_adj_exe_sex)["exerciseYes"]) / abs(mod_crude_val) * 100, 1), "%"),
  "Final (all confounders)",
    round(mod_final, 4),
    paste0(round(abs(mod_crude_val - mod_final) / abs(mod_crude_val) * 100, 1), "%")
) |>
  kable(caption = "Exercise Coefficient: Crude vs. Progressively Adjusted Models") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Exercise Coefficient: Crude vs. Progressively Adjusted Models
Model Exercise B % Change from Crude
Crude -4.7115
Adjusted (sex only) -4.6974 0.3%
Final (all confounders) -4.1551 11.8%

Interpretation: This table tracks how the exercise coefficient changes as we progressively add confounders. The crude estimate -4.7115 represents the unadjusted association. Adding sexFemale alone moved the estimate to -4.6974 (0.3%change), stating that sexFemale as unmportant confounder. In the final model with all confounders (age, sex, education, and sleep hours), the exercise coefficient is -4.1551, with a 11.8% change from the crude estimate. This small overall change may seem surprising given that individual confounders produced larger shifts. The reason is that the confounders act in opposing directions: age adjustment makes the exercise effect weaker (less negative), while education and sleep hours adjustment makes it stronger (more negative). When all confounders are included simultaneously, these opposing adjustments partially cancel out. This underscores why it is important to assess confounders both individually and jointly.

4d. (5 pts) Is gen_health a confounder or a mediator of the exercise-physical health relationship? Could it be both? Explain your reasoning with reference to the three conditions for confounding and the concept of the causal pathway.

brfss_ci <- brfss_ci |>
  mutate(
    cat_health = factor(
      case_when(
        gen_health %in% c("Excellent", "Very Good", "Good") ~ "Better",
        gen_health %in% c("Fair", "Poor")                   ~ "Worse",
        TRUE ~ NA_character_
      ),
      levels = c("Worse", "Better")
    )
  )
# Fit separate models for better and worse health status
mod_better <- lm(physhlth_days ~ exercise, data = brfss_ci |> filter(cat_health == "Better"))
mod_worse  <- lm(physhlth_days ~ exercise, data = brfss_ci |> filter(cat_health == "Worse"))

bind_rows(
  tidy(mod_better, conf.int = TRUE) |> mutate(Stratum = "Better gen_health"),
  tidy(mod_worse,  conf.int = TRUE) |> mutate(Stratum = "Worse gen_health")
) |>
  filter(term == "exerciseYes") |>
  select(Stratum, estimate, std.error, conf.low, conf.high, p.value) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Stratified Analysis: Exercise → Physical Health Days, by General Health",
    col.names = c("Stratum", "Estimate", "SE", "CI Lower", "CI Upper", "p-value")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Stratified Analysis: Exercise → Physical Health Days, by General Health
Stratum Estimate SE CI Lower CI Upper p-value
Better gen_health -1.2752 0.2894 -1.8427 -0.7078 0
Worse gen_health -6.2482 0.9291 -8.0724 -4.4240 0
mod_int_health <- lm(physhlth_days ~ exercise * gen_health, data = brfss_ci)

tidy(mod_int_health, conf.int = TRUE) |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(
    caption = "Interaction Model: Exercise * General Health → Physical Health Days",
    col.names = c("Term", "Estimate", "SE", "t", "p-value", "CI Lower", "CI Upper")
  ) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Interaction Model: Exercise * General Health → Physical Health Days
Term Estimate SE t p-value CI Lower CI Upper
(Intercept) 1.0265 0.6006 1.7091 0.0875 -0.1509 2.2040
exerciseYes -0.3512 0.6365 -0.5517 0.5812 -1.5990 0.8967
gen_healthVery good 1.5443 0.7190 2.1479 0.0318 0.1348 2.9539
gen_healthGood 2.6806 0.6843 3.9171 0.0001 1.3390 4.0222
gen_healthFair 11.9436 0.7507 15.9098 0.0000 10.4719 13.4153
gen_healthPoor 22.9735 0.8421 27.2804 0.0000 21.3225 24.6244
exerciseYes:gen_healthVery good -1.1797 0.7671 -1.5378 0.1242 -2.6837 0.3242
exerciseYes:gen_healthGood -0.6113 0.7408 -0.8251 0.4094 -2.0637 0.8411
exerciseYes:gen_healthFair -3.6128 0.8568 -4.2164 0.0000 -5.2926 -1.9330
exerciseYes:gen_healthPoor -3.1573 1.2019 -2.6270 0.0086 -5.5135 -0.8011
mod_no_int_health <- lm(physhlth_days ~ exercise + gen_health, data = brfss_ci)

anova(mod_no_int_health, mod_int_health) |>
  tidy() |>
  mutate(across(where(is.numeric), \(x) round(x, 4))) |>
  kable(caption = "Partial F-test: Is the Sleep x Education Interaction Significant?") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Partial F-test: Is the Sleep x Education Interaction Significant?
term df.residual rss df sumsq statistic p.value
physhlth_days ~ exercise + gen_health 4994 204523.0 NA NA NA NA
physhlth_days ~ exercise * gen_health 4990 203417.2 4 1105.796 6.7815 0

There is consistent evidence that general health status modifies the association between exercise and poor physical health days. The stratified analysis, individual interaction terms, and partial F-test all support this conclusion.

General health is more likely a mediator of the exercise-physical health relationship: - Adults with excellent general health status tend to exercise more (associated with exposure) - Adults with excellent general health status have less physically unhealthy days (associated with outcome) - Excellent health status is caused by exercise and/or physical health ( on the causal pathway)


Task 5: Public Health Interpretation (20 points)

5a. (10 pts) Based on your analyses, write a 4–5 sentence paragraph for a public health audience summarizing:

  • Whether the association between exercise and physical health differs by any subgroup (interaction assessment) >Based on the analyses the association of exercise and physical health days does not differ between men and women. Based on the education specific category assessed association between exercise and physical health days does decrease steadily, but the evidence is not meaningful.

Based on the analysis income is a meaningful confounder, meaning that it both affect physical activity in the past 30 days and poor physical health days. It’s more likely that people with higher income have more time to exercise, have better access to healthcare and overall have less poor physical health days.
Overall, adjusting for all other factors (confounders) exercise is associated with 4.1551 fewer days of poor physical health status in comparison with exercise alone (almost 5 days fewer) (with 11.8% difference) As the data has been collected within a cross-sectional study (survey), meaning that information of exercise, days of poor physical activity and other factors has been collected at the same point of time, we can not established wether no physical activity caused poor physical health days, or other way around and what happened first.

5b. (10 pts) A colleague suggests including gen_health as a covariate in the final model because it changes the exercise coefficient by more than 10%. You disagree. Write a 3–4 sentence argument explaining why adjusting for general health may not be appropriate if the goal is to estimate the total effect of exercise on physical health days. Use the concept of mediation in your argument. As the general health status changes the coefficient of exercise on poor physical health days by more than 50%, we should consider it mediator and nor merely a confounder. Because it is more likely that general health status not only associated with both exposure and outcome, but also lies on a causal pathway.

End of Lab Activity