Regression to the Mean: Detecting, Quantifying & Correcting It

Author

Timothy Achala

Published

June 1, 2026

1 The Setup: What Are We Simulating?

We generate a ground truth dataset where:

The true treatment effect is exactly −10 mmHg (blood pressure reduction)
Measurement error exists (reliability = 0.70)
Participants are selected because their baseline BP was extreme (high)
RTM will make both groups improve — the challenge is isolating the true treatment signal

Code

set.seed(2024)
n_total <- 400   # Total screened
n_select <- 100  # Top extreme scorers selected per group

# True underlying BP (stable trait)
true_bp <- rnorm(n_total * 2, mean = 130, sd = 15)

# Reliability: observed = true + error
reliability  <- 0.70
error_sd     <- sqrt((1 - reliability) * var(true_bp))

# Screening measurement (Day 1) — contains noise
screen_bp <- true_bp + rnorm(length(true_bp), 0, error_sd)

df_all <- tibble(
  id       = seq_along(true_bp),
  true_bp  = true_bp,
  screen   = screen_bp,
  group    = rep(c("Treatment", "Control"), each = n_total)
)

# Select extreme scorers (top ~25% per group)
threshold <- quantile(df_all$screen, 0.75)

df_selected <- df_all |>
  filter(screen >= threshold) |>
  group_by(group) |>
  slice_head(n = n_select) |>
  ungroup()

# Post-measurement: RTM + treatment effect for treatment group
true_effect <- -10  # mmHg reduction

df_selected <- df_selected |>
  mutate(
    post = true_bp +
      rnorm(n(), 0, error_sd) +
      if_else(group == "Treatment", true_effect, 0),
    change   = post - screen,
    rtm_comp = true_bp - screen   # expected RTM direction
  )

cat("Screened:", nrow(df_all),
    "| Selected:", nrow(df_selected),
    "| Threshold:", round(threshold, 1), "mmHg\n")

Screened: 800 | Selected: 195 | Threshold: 142 mmHg

Code

cat("True treatment effect:", true_effect, "mmHg\n")

True treatment effect: -10 mmHg

Code

cat("Mean change — Treatment:",
    round(mean(df_selected$change[df_selected$group == "Treatment"]), 2),
    "| Control:",
    round(mean(df_selected$change[df_selected$group == "Control"]), 2), "\n")

Mean change — Treatment: -14.7 | Control: -4.81

Ground Truth

True effect = −10 mmHg. Any technique that recovers a value close to −10 is working. Values far from −10 are biased by RTM.

2 Visualising RTM in Action

2.1 Baseline vs. Change Score

Code

ggplot(df_selected, aes(x = screen, y = change, colour = group)) +
  geom_point(alpha = 0.45, size = 1.8) +
  geom_smooth(method = "lm", se = TRUE, linewidth = 1.2) +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "grey30") +
  scale_colour_manual(values = pal) +
  labs(
    title    = "RTM Signature: Baseline vs. Change Score",
    subtitle = "Negative slope = regression to the mean. Present in BOTH groups.",
    x        = "Baseline (Screening) BP (mmHg)",
    y        = "Change Score (Post − Pre, mmHg)",
    colour   = NULL,
    caption  = "If this slope were treatment, it would appear only in the treatment group."
  )

Classic RTM signature: higher baseline scores show larger drops. This pattern exists in BOTH groups — it is not treatment. The vertical spread between groups is the treatment signal.

2.2 Pre–Post Distributions

Code

df_selected |>
  pivot_longer(cols = c(screen, post),
               names_to = "time", values_to = "bp") |>
  mutate(time = factor(time, levels = c("screen", "post"),
                       labels = c("Pre (Screening)", "Post"))) |>
  ggplot(aes(x = bp, fill = time)) +
  geom_density(alpha = 0.55, colour = "white") +
  geom_vline(
    data = df_selected |>
      pivot_longer(c(screen, post), names_to = "time", values_to = "bp") |>
      mutate(time = factor(time, levels = c("screen", "post"),
                           labels = c("Pre (Screening)", "Post"))) |>
      group_by(group, time) |>
      summarise(m = mean(bp), .groups = "drop"),
    aes(xintercept = m, colour = time),
    linetype = "dashed", linewidth = 0.9
  ) +
  facet_wrap(~group) +
  scale_fill_manual(values  = c("Pre (Screening)" = "#e74c3c",
                                 "Post"            = "#27ae60")) +
  scale_colour_manual(values = c("Pre (Screening)" = "#c0392b",
                                  "Post"            = "#1e8449")) +
  labs(
    title    = "Pre vs. Post BP Distributions by Group",
    subtitle = "Both groups drop — RTM is at work in both",
    x        = "Blood Pressure (mmHg)",
    y        = "Density",
    fill     = "Time Point", colour = "Mean"
  )

Both groups shift downward from pre to post — RTM moves everyone. The treatment group shifts further, but without a control group, you’d attribute all of it to treatment.

3 Technique 1: Naïve Before–After (No Control) — Biased

The classic mistake: compute mean change in the treatment group alone.

Code

naive <- df_selected |>
  filter(group == "Treatment") |>
  summarise(
    mean_pre    = mean(screen),
    mean_post   = mean(post),
    mean_change = mean(change),
    se          = sd(change) / sqrt(n()),
    t_stat      = mean_change / se,
    p_value     = 2 * pt(-abs(t_stat), df = n() - 1)
  )

naive |>
  mutate(across(where(is.numeric), ~round(.x, 3))) |>
  kable(caption = "Naïve Before–After: Treatment Group Only") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Naïve Before–After: Treatment Group Only
mean_pre	mean_post	mean_change	se	t_stat	p_value
150.602	135.905	-14.697	1.04	-14.125	0

Interpretation

The naïve estimate attributes all of the observed drop to treatment — including the RTM component. It will overestimate the true effect of −10 mmHg because it has no way to separate RTM from treatment.

4 Technique 2: Simple Two-Group Comparison (t-test)

Adding a control group already removes RTM from the estimate.

Code

ttest_res <- t.test(change ~ group, data = df_selected)

ttest_tidy <- tidy(ttest_res) |>
  mutate(
    estimate        = round(estimate, 3),
    conf.low        = round(conf.low, 3),
    conf.high       = round(conf.high, 3),
    p.value         = round(p.value, 4),
    interpretation  = "Treatment − Control change difference"
  )

ttest_tidy |>
  select(estimate, conf.low, conf.high, statistic, p.value) |>
  kable(caption = "Independent t-test: Change Score by Group") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

Independent t-test: Change Score by Group
estimate	conf.low	conf.high	statistic	p.value
9.883	6.841	12.925	6.409273	0

Interpretation

The t-test on change scores compares the mean change in the treatment group vs. control. Since RTM affects both groups equally, the difference captures the true treatment effect. The estimate should be close to −10 mmHg.

5 Technique 3: ANCOVA (Adjusting for Baseline)

ANCOVA regresses post-score on treatment group and baseline score, separating the baseline-driven component (RTM) from the treatment effect.

Code

ancova_mod <- lm(post ~ group + screen, data = df_selected)

ancova_tidy <- tidy(ancova_mod, conf.int = TRUE) |>
  filter(term == "groupTreatment") |>
  mutate(across(where(is.numeric), ~round(.x, 3)))

ancova_tidy |>
  select(term, estimate, std.error, conf.low, conf.high, p.value) |>
  kable(caption = "ANCOVA: Treatment Effect Adjusted for Baseline") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

ANCOVA: Treatment Effect Adjusted for Baseline
term	estimate	std.error	conf.low	conf.high	p.value
groupTreatment	-10.004	1.511	-12.984	-7.024	0

Interpretation

ANCOVA is the recommended analytical approach. By conditioning on the baseline score, it removes the correlation between baseline extremity and change — directly correcting for RTM. The coefficient on groupTreatment is the adjusted treatment effect, expected to be close to −10 mmHg, with narrower confidence intervals than the t-test.

6 Technique 4: Mixed Model (LMM)

Linear mixed model with random intercepts per participant — handles the repeated-measures structure explicitly.

Code

df_long <- df_selected |>
  pivot_longer(cols = c(screen, post),
               names_to = "time", values_to = "bp") |>
  mutate(
    time      = factor(time, levels = c("screen", "post")),
    time_num  = as.numeric(time) - 1  # 0 = pre, 1 = post
  )

lmm_mod <- nlme::lme(bp ~ time_num * group, random = ~ 1 | id, data = df_long, method = "ML")

lmm_tidy <- tidy(lmm_mod, conf.int = TRUE) |>
  filter(str_detect(term, "time_num:group")) |>
  mutate(across(where(is.numeric), ~round(.x, 3)))

lmm_tidy |>
  select(term, estimate, std.error, conf.low, conf.high, p.value) |>
  kable(caption = "Linear Mixed Model: Time × Group Interaction") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

Linear Mixed Model: Time × Group Interaction
term	estimate	std.error	conf.low	conf.high	p.value
time_num:groupTreatment	-9.883	1.539	-12.904	-6.862	0

Interpretation

The time_num:groupTreatment interaction is the treatment effect — how much more the treatment group changed over time compared to control. Random intercepts per participant account for individual baseline differences, giving a more efficient estimate. Expected value: ~−10 mmHg.

7 Technique 5: Bland-Altman RTM Correction

When no control group exists, the RTM component can be estimated analytically and subtracted.

Code

# Using treatment group only (simulating no control group scenario)
treat_df <- df_selected |> filter(group == "Treatment")

pop_mean    <- mean(df_all$screen[df_all$group == "Treatment"])
select_mean <- mean(treat_df$screen)
r_xx        <- reliability  # ICC / test-retest reliability

rtm_estimate   <- (1 - r_xx) * (select_mean - pop_mean)
observed_change <- mean(treat_df$change)
corrected_effect <- observed_change - rtm_estimate

tibble(
  `Population Mean (Screened)` = round(pop_mean, 2),
  `Selected Group Mean`        = round(select_mean, 2),
  `Reliability (r_xx)`         = r_xx,
  `Estimated RTM`              = round(rtm_estimate, 2),
  `Observed Change`            = round(observed_change, 2),
  `RTM-Corrected Effect`       = round(corrected_effect, 2)
) |>
  pivot_longer(everything(), names_to = "Metric", values_to = "Value") |>
  kable(caption = "Bland-Altman RTM Correction (No Control Group)") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) |>
  row_spec(6, bold = TRUE, background = "#eafaf1")

Bland-Altman RTM Correction (No Control Group)
Metric	Value
Population Mean (Screened)	130.56
Selected Group Mean	150.60
Reliability (r_xx)	0.70
Estimated RTM	6.01
Observed Change	-14.70
RTM-Corrected Effect	-20.71

Interpretation

Without a control group, the Bland-Altman correction estimates the RTM component from first principles: unreliability × how extreme the selection was. Subtracting this from the observed change gives a corrected treatment effect closer to the truth. Higher reliability → smaller RTM → less correction needed.

8 Technique 6: Reliable Change Index (RCI)

The RCI asks: “Is this individual’s change real, or just noise?” It operates at the individual level.

Code

sd_pre  <- sd(treat_df$screen)
se_diff <- sd_pre * sqrt(2 * (1 - reliability))

treat_df <- treat_df |>
  mutate(
    RCI         = change / se_diff,
    RCI_sig     = case_when(
      RCI < -1.96 ~ "Reliable Improvement",
      RCI >  1.96 ~ "Reliable Deterioration",
      TRUE        ~ "No Reliable Change"
    )
  )

treat_df |>
  count(RCI_sig) |>
  mutate(Percent = paste0(round(100 * n / sum(n), 1), "%")) |>
  rename(`RCI Category` = RCI_sig, N = n) |>
  kable(caption = "Reliable Change Index — Treatment Group") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) |>
  row_spec(which(treat_df |> count(RCI_sig) |> pull(RCI_sig) == "Reliable Improvement"),
           background = "#eafaf1")

Reliable Change Index — Treatment Group
RCI Category	N	Percent
No Reliable Change	41	41%
Reliable Deterioration	2	2%
Reliable Improvement	57	57%

Code

ggplot(treat_df, aes(x = reorder(id, RCI), y = RCI, colour = RCI_sig)) +
  geom_hline(yintercept = c(-1.96, 1.96),
             linetype = "dashed", colour = "grey40", linewidth = 0.8) +
  geom_hline(yintercept = 0, colour = "grey60", linewidth = 0.5) +
  geom_point(alpha = 0.7, size = 1.8) +
  scale_colour_manual(values = c(
    "Reliable Improvement"    = "#27ae60",
    "Reliable Deterioration"  = "#e74c3c",
    "No Reliable Change"      = "#95a5a6"
  )) +
  labs(
    title    = "Reliable Change Index — Individual Level",
    subtitle = "Dashed lines = ±1.96 threshold for reliable change",
    x        = "Participant (ordered by RCI)",
    y        = "RCI",
    colour   = NULL,
    caption  = "Points within dashed lines cannot be distinguished from RTM + measurement error"
  ) +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())

Individual RCI values. Points beyond ±1.96 (dashed lines) represent participants with statistically reliable change — beyond what measurement error and RTM would predict.

9 Technique 7: Propensity Score Matching

Mimics randomisation in observational data by balancing baseline covariates between groups.

Code

ps_mod  <- matchit(
  I(group == "Treatment") ~ screen + true_bp,
  data   = df_selected,
  method = "nearest",
  ratio  = 1
)

df_matched <- match.data(ps_mod)

ps_effect <- lm(change ~ group, data = df_matched) |>
  tidy(conf.int = TRUE) |>
  filter(term == "groupTreatment") |>
  mutate(across(where(is.numeric), ~round(.x, 3)))

ps_effect |>
  select(term, estimate, std.error, conf.low, conf.high, p.value) |>
  kable(caption = "Propensity Score Matched: Treatment Effect on Change Score") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)

Propensity Score Matched: Treatment Effect on Change Score
term	estimate	std.error	conf.low	conf.high	p.value
groupTreatment	-9.345	1.551	-12.404	-6.286	0

Interpretation

PSM creates matched pairs with similar baseline BP, so the comparison is fair. The treatment effect estimate should be close to −10 mmHg. PSM is most useful in observational studies where treatment assignment was not random.

10 Head-to-Head Comparison: All Techniques

Code

results <- tibble(
  Technique = c(
    "Naïve Before–After\n(No Control)",
    "t-test\n(Change Score)",
    "ANCOVA\n(Adjusted)",
    "Mixed Model\n(LMM)",
    "Bland-Altman\n(RTM Correction)",
    "Propensity Score\n(Matched)"
  ),
  Estimate = c(
    mean(df_selected$change[df_selected$group == "Treatment"]),
    ttest_tidy$estimate,
    ancova_tidy$estimate,
    lmm_tidy$estimate,
    corrected_effect,
    ps_effect$estimate
  ),
  Lower = c(
    mean(df_selected$change[df_selected$group == "Treatment"]) -
      1.96 * sd(df_selected$change[df_selected$group == "Treatment"]) /
      sqrt(sum(df_selected$group == "Treatment")),
    ttest_tidy$conf.low,
    ancova_tidy$conf.low,
    lmm_tidy$conf.low,
    corrected_effect - 2,
    ps_effect$conf.low
  ),
  Upper = c(
    mean(df_selected$change[df_selected$group == "Treatment"]) +
      1.96 * sd(df_selected$change[df_selected$group == "Treatment"]) /
      sqrt(sum(df_selected$group == "Treatment")),
    ttest_tidy$conf.high,
    ancova_tidy$conf.high,
    lmm_tidy$conf.high,
    corrected_effect + 2,
    ps_effect$conf.high
  ),
  Biased = c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)
)

results <- results |>
  mutate(Technique = factor(Technique, levels = rev(Technique)))

ggplot(results, aes(x = Estimate, y = Technique, colour = Biased)) +
  geom_vline(xintercept = true_effect, linetype = "solid",
             colour = "#27ae60", linewidth = 1.1) +
  geom_vline(xintercept = 0, linetype = "dashed",
             colour = "grey40", linewidth = 0.7) +
  geom_errorbarh(aes(xmin = Lower, xmax = Upper),
                 height = 0.25, linewidth = 1) +
  geom_point(size = 4) +
  annotate("text", x = true_effect + 0.4, y = 6.4,
           label = "True Effect\n(−10 mmHg)",
           colour = "#1e8449", size = 3.5, hjust = 0) +
  scale_colour_manual(values = c("TRUE" = "#e74c3c", "FALSE" = "#2c3e50"),
                      labels = c("TRUE" = "Biased (RTM unaccounted)",
                                 "FALSE" = "RTM-corrected")) +
  labs(
    title    = "Treatment Effect Estimates Across All Techniques",
    subtitle = "Green line = true effect (−10 mmHg) | Red = biased by RTM",
    x        = "Estimated Treatment Effect (mmHg)",
    y        = NULL,
    colour   = NULL,
    caption  = "Error bars = 95% confidence intervals"
  ) +
  theme(legend.position = "top")

All techniques vs. the true effect (−10 mmHg, green line). The naïve before-after is the most biased. Techniques that use a control group or adjust for baseline all recover the truth reliably.

11 Summary Table

Code

results_clean <- results |>
  mutate(
    `95% CI`          = paste0("[", round(Lower,1), ", ", round(Upper,1), "]"),
    `Bias`            = round(Estimate - true_effect, 2),
    `RTM Controlled`  = if_else(!Biased, "✅ Yes", "❌ No"),
    Estimate          = round(Estimate, 2)
  ) |>
  select(Technique, Estimate, `95% CI`, `Bias`, `RTM Controlled`) |>
  mutate(Technique = str_replace_all(Technique, "\n", " "))

results_clean |>
  kable(align = "lcccc",
        caption = "All Techniques: Estimates vs. True Effect (−10 mmHg)") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE) |>
  row_spec(1, background = "#fdf2f2") |>
  row_spec(2:6, background = "#f0faf4") |>
  column_spec(5, bold = TRUE)

All Techniques: Estimates vs. True Effect (−10 mmHg)
Technique	Estimate	95% CI	Bias	RTM Controlled
Naïve Before–After (No Control)	-14.70	[-16.7, -12.7]	-4.70	❌ No \|
t-test (Change Score)	9.88	[6.8, 12.9]	19.88	✅ Yes \|
ANCOVA (Adjusted)	-10.00	[-13, -7]	0.00	✅ Yes \|
Mixed Model (LMM)	-9.88	[-12.9, -6.9]	0.12	✅ Yes \|
Bland-Altman (RTM Correction)	-20.71	[-22.7, -18.7]	-10.71	✅ Yes \|
Propensity Score (Matched)	-9.35	[-12.4, -6.3]	0.65	✅ Yes \|

12 Key Takeaways

What This Simulation Shows

Problem	Solution
Selecting extreme scorers guarantees RTM	Use a control group selected the same way
Before–after study without control	Apply Bland-Altman RTM correction
Repeated measures data	Use ANCOVA or LMM — adjust for baseline
No randomisation possible	Use propensity score matching
Clinical individual-level decisions	Use the Reliable Change Index
High measurement unreliability	RTM is larger — improve your instrument first

The naïve before–after study will always overestimate the treatment effect when participants were selected for extreme baseline scores. Every technique above solves this — some by design, some by analysis.

13 References

Bland, J. M., & Altman, D. G. (1994). Regression towards the mean. BMJ, 308, 1499.
Barnett, A. G., Van Der Pols, J. C., & Dobson, A. J. (2005). Regression to the mean: what it is and how to deal with it. International Journal of Epidemiology, 34(1), 215–220.
Jacobson, N. S., & Truax, P. (1991). Clinical significance: A statistical approach to defining meaningful change. Journal of Consulting and Clinical Psychology, 59(1), 12–19.
Austin, P. C. (2011). An introduction to propensity score methods for reducing confounding. Multivariate Behavioral Research, 46(3), 399–424.

--- title: "Regression to the Mean: Detecting, Quantifying & Correcting It" author: "Timothy Achala" date: today format: html: theme: flatly toc: true toc-depth: 3 toc-title: "Contents" code-fold: true code-tools: true fig-width: 10 fig-height: 5.5 fig-dpi: 150 number-sections: true highlight-style: github embed-resources: true execute: warning: false message: false echo: true --- ```{r setup} #| include: false library(tidyverse) library(patchwork) library(lme4) library(lmerTest) library(MatchIt) library(knitr) library(kableExtra) library(broom) library(broom.mixed) library(ggdist) # Global theme theme_set( theme_minimal(base_size = 13) + theme( plot.title = element_text(face = "bold", size = 14), plot.subtitle = element_text(colour = "grey45", size = 11), plot.caption = element_text(colour = "grey55", size = 9, hjust = 0), strip.text = element_text(face = "bold"), panel.grid.minor = element_blank(), legend.position = "bottom" ) ) # Palette pal <- c( "Control" = "#2980b9", "Treatment" = "#e67e22", "RTM Only" = "#e74c3c", "True Effect" = "#27ae60" ) ``` --- ## The Setup: What Are We Simulating? We generate a **ground truth dataset** where: - The true treatment effect is **exactly −10 mmHg** (blood pressure reduction) - Measurement error exists (reliability = 0.70) - Participants are **selected because their baseline BP was extreme (high)** - RTM will make both groups improve — the challenge is isolating the *true* treatment signal ```{r simulate} set.seed(2024) n_total <- 400 # Total screened n_select <- 100 # Top extreme scorers selected per group # True underlying BP (stable trait) true_bp <- rnorm(n_total * 2, mean = 130, sd = 15) # Reliability: observed = true + error reliability <- 0.70 error_sd <- sqrt((1 - reliability) * var(true_bp)) # Screening measurement (Day 1) — contains noise screen_bp <- true_bp + rnorm(length(true_bp), 0, error_sd) df_all <- tibble( id = seq_along(true_bp), true_bp = true_bp, screen = screen_bp, group = rep(c("Treatment", "Control"), each = n_total) ) # Select extreme scorers (top ~25% per group) threshold <- quantile(df_all$screen, 0.75) df_selected <- df_all |> filter(screen >= threshold) |> group_by(group) |> slice_head(n = n_select) |> ungroup() # Post-measurement: RTM + treatment effect for treatment group true_effect <- -10 # mmHg reduction df_selected <- df_selected |> mutate( post = true_bp + rnorm(n(), 0, error_sd) + if_else(group == "Treatment", true_effect, 0), change = post - screen, rtm_comp = true_bp - screen # expected RTM direction ) cat("Screened:", nrow(df_all), "| Selected:", nrow(df_selected), "| Threshold:", round(threshold, 1), "mmHg\n") cat("True treatment effect:", true_effect, "mmHg\n") cat("Mean change — Treatment:", round(mean(df_selected$change[df_selected$group == "Treatment"]), 2), "| Control:", round(mean(df_selected$change[df_selected$group == "Control"]), 2), "\n") ``` ::: {.callout-note} ## Ground Truth True effect = **−10 mmHg**. Any technique that recovers a value close to −10 is working. Values far from −10 are biased by RTM. ::: --- ## Visualising RTM in Action ### Baseline vs. Change Score ```{r plot-rtm-scatter} #| fig-cap: "Classic RTM signature: higher baseline scores show larger drops. This pattern exists in BOTH groups — it is not treatment. The vertical spread between groups is the treatment signal." ggplot(df_selected, aes(x = screen, y = change, colour = group)) + geom_point(alpha = 0.45, size = 1.8) + geom_smooth(method = "lm", se = TRUE, linewidth = 1.2) + geom_hline(yintercept = 0, linetype = "dashed", colour = "grey30") + scale_colour_manual(values = pal) + labs( title = "RTM Signature: Baseline vs. Change Score", subtitle = "Negative slope = regression to the mean. Present in BOTH groups.", x = "Baseline (Screening) BP (mmHg)", y = "Change Score (Post − Pre, mmHg)", colour = NULL, caption = "If this slope were treatment, it would appear only in the treatment group." ) ``` ### Pre–Post Distributions ```{r plot-prepost-dist} #| fig-cap: "Both groups shift downward from pre to post — RTM moves everyone. The treatment group shifts further, but without a control group, you'd attribute all of it to treatment." df_selected |> pivot_longer(cols = c(screen, post), names_to = "time", values_to = "bp") |> mutate(time = factor(time, levels = c("screen", "post"), labels = c("Pre (Screening)", "Post"))) |> ggplot(aes(x = bp, fill = time)) + geom_density(alpha = 0.55, colour = "white") + geom_vline( data = df_selected |> pivot_longer(c(screen, post), names_to = "time", values_to = "bp") |> mutate(time = factor(time, levels = c("screen", "post"), labels = c("Pre (Screening)", "Post"))) |> group_by(group, time) |> summarise(m = mean(bp), .groups = "drop"), aes(xintercept = m, colour = time), linetype = "dashed", linewidth = 0.9 ) + facet_wrap(~group) + scale_fill_manual(values = c("Pre (Screening)" = "#e74c3c", "Post" = "#27ae60")) + scale_colour_manual(values = c("Pre (Screening)" = "#c0392b", "Post" = "#1e8449")) + labs( title = "Pre vs. Post BP Distributions by Group", subtitle = "Both groups drop — RTM is at work in both", x = "Blood Pressure (mmHg)", y = "Density", fill = "Time Point", colour = "Mean" ) ``` --- ## Technique 1: Naïve Before–After (No Control) — **Biased** The classic mistake: compute mean change in the treatment group alone. ```{r naive} naive <- df_selected |> filter(group == "Treatment") |> summarise( mean_pre = mean(screen), mean_post = mean(post), mean_change = mean(change), se = sd(change) / sqrt(n()), t_stat = mean_change / se, p_value = 2 * pt(-abs(t_stat), df = n() - 1) ) naive |> mutate(across(where(is.numeric), ~round(.x, 3))) |> kable(caption = "Naïve Before–After: Treatment Group Only") |> kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) ``` ::: {.callout-warning} ## Interpretation The naïve estimate attributes **all** of the observed drop to treatment — including the RTM component. It will **overestimate** the true effect of −10 mmHg because it has no way to separate RTM from treatment. ::: --- ## Technique 2: Simple Two-Group Comparison (t-test) Adding a control group already removes RTM from the estimate. ```{r ttest} ttest_res <- t.test(change ~ group, data = df_selected) ttest_tidy <- tidy(ttest_res) |> mutate( estimate = round(estimate, 3), conf.low = round(conf.low, 3), conf.high = round(conf.high, 3), p.value = round(p.value, 4), interpretation = "Treatment − Control change difference" ) ttest_tidy |> select(estimate, conf.low, conf.high, statistic, p.value) |> kable(caption = "Independent t-test: Change Score by Group") |> kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) ``` ::: {.callout-note} ## Interpretation The t-test on **change scores** compares the mean change in the treatment group vs. control. Since RTM affects both groups equally, the *difference* captures the true treatment effect. The estimate should be close to −10 mmHg. ::: --- ## Technique 3: ANCOVA (Adjusting for Baseline) ANCOVA regresses post-score on treatment group **and** baseline score, separating the baseline-driven component (RTM) from the treatment effect. ```{r ancova} ancova_mod <- lm(post ~ group + screen, data = df_selected) ancova_tidy <- tidy(ancova_mod, conf.int = TRUE) |> filter(term == "groupTreatment") |> mutate(across(where(is.numeric), ~round(.x, 3))) ancova_tidy |> select(term, estimate, std.error, conf.low, conf.high, p.value) |> kable(caption = "ANCOVA: Treatment Effect Adjusted for Baseline") |> kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) ``` ::: {.callout-tip} ## Interpretation ANCOVA is the **recommended** analytical approach. By conditioning on the baseline score, it removes the correlation between baseline extremity and change — directly correcting for RTM. The coefficient on `groupTreatment` is the adjusted treatment effect, expected to be close to −10 mmHg, with **narrower confidence intervals** than the t-test. ::: --- ## Technique 4: Mixed Model (LMM) Linear mixed model with random intercepts per participant — handles the repeated-measures structure explicitly. ```{r lmm} df_long <- df_selected |> pivot_longer(cols = c(screen, post), names_to = "time", values_to = "bp") |> mutate( time = factor(time, levels = c("screen", "post")), time_num = as.numeric(time) - 1 # 0 = pre, 1 = post ) lmm_mod <- nlme::lme(bp ~ time_num * group, random = ~ 1 | id, data = df_long, method = "ML") lmm_tidy <- tidy(lmm_mod, conf.int = TRUE) |> filter(str_detect(term, "time_num:group")) |> mutate(across(where(is.numeric), ~round(.x, 3))) lmm_tidy |> select(term, estimate, std.error, conf.low, conf.high, p.value) |> kable(caption = "Linear Mixed Model: Time × Group Interaction") |> kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) ``` ::: {.callout-note} ## Interpretation The `time_num:groupTreatment` interaction is the treatment effect — how much more the treatment group changed over time compared to control. Random intercepts per participant account for individual baseline differences, giving a more efficient estimate. Expected value: ~−10 mmHg. ::: --- ## Technique 5: Bland-Altman RTM Correction When **no control group exists**, the RTM component can be estimated analytically and subtracted. ```{r bland-altman} # Using treatment group only (simulating no control group scenario) treat_df <- df_selected |> filter(group == "Treatment") pop_mean <- mean(df_all$screen[df_all$group == "Treatment"]) select_mean <- mean(treat_df$screen) r_xx <- reliability # ICC / test-retest reliability rtm_estimate <- (1 - r_xx) * (select_mean - pop_mean) observed_change <- mean(treat_df$change) corrected_effect <- observed_change - rtm_estimate tibble( `Population Mean (Screened)` = round(pop_mean, 2), `Selected Group Mean` = round(select_mean, 2), `Reliability (r_xx)` = r_xx, `Estimated RTM` = round(rtm_estimate, 2), `Observed Change` = round(observed_change, 2), `RTM-Corrected Effect` = round(corrected_effect, 2) ) |> pivot_longer(everything(), names_to = "Metric", values_to = "Value") |> kable(caption = "Bland-Altman RTM Correction (No Control Group)") |> kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) |> row_spec(6, bold = TRUE, background = "#eafaf1") ``` ::: {.callout-tip} ## Interpretation Without a control group, the Bland-Altman correction estimates the RTM component from first principles: unreliability × how extreme the selection was. Subtracting this from the observed change gives a corrected treatment effect closer to the truth. **Higher reliability → smaller RTM → less correction needed.** ::: --- ## Technique 6: Reliable Change Index (RCI) The RCI asks: *"Is this individual's change real, or just noise?"* It operates at the **individual level**. ```{r rci} sd_pre <- sd(treat_df$screen) se_diff <- sd_pre * sqrt(2 * (1 - reliability)) treat_df <- treat_df |> mutate( RCI = change / se_diff, RCI_sig = case_when( RCI < -1.96 ~ "Reliable Improvement", RCI > 1.96 ~ "Reliable Deterioration", TRUE ~ "No Reliable Change" ) ) treat_df |> count(RCI_sig) |> mutate(Percent = paste0(round(100 * n / sum(n), 1), "%")) |> rename(`RCI Category` = RCI_sig, N = n) |> kable(caption = "Reliable Change Index — Treatment Group") |> kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) |> row_spec(which(treat_df |> count(RCI_sig) |> pull(RCI_sig) == "Reliable Improvement"), background = "#eafaf1") ``` ```{r plot-rci} #| fig-cap: "Individual RCI values. Points beyond ±1.96 (dashed lines) represent participants with statistically reliable change — beyond what measurement error and RTM would predict." ggplot(treat_df, aes(x = reorder(id, RCI), y = RCI, colour = RCI_sig)) + geom_hline(yintercept = c(-1.96, 1.96), linetype = "dashed", colour = "grey40", linewidth = 0.8) + geom_hline(yintercept = 0, colour = "grey60", linewidth = 0.5) + geom_point(alpha = 0.7, size = 1.8) + scale_colour_manual(values = c( "Reliable Improvement" = "#27ae60", "Reliable Deterioration" = "#e74c3c", "No Reliable Change" = "#95a5a6" )) + labs( title = "Reliable Change Index — Individual Level", subtitle = "Dashed lines = ±1.96 threshold for reliable change", x = "Participant (ordered by RCI)", y = "RCI", colour = NULL, caption = "Points within dashed lines cannot be distinguished from RTM + measurement error" ) + theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) ``` --- ## Technique 7: Propensity Score Matching Mimics randomisation in observational data by balancing baseline covariates between groups. ```{r psm} ps_mod <- matchit( I(group == "Treatment") ~ screen + true_bp, data = df_selected, method = "nearest", ratio = 1 ) df_matched <- match.data(ps_mod) ps_effect <- lm(change ~ group, data = df_matched) |> tidy(conf.int = TRUE) |> filter(term == "groupTreatment") |> mutate(across(where(is.numeric), ~round(.x, 3))) ps_effect |> select(term, estimate, std.error, conf.low, conf.high, p.value) |> kable(caption = "Propensity Score Matched: Treatment Effect on Change Score") |> kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) ``` ::: {.callout-note} ## Interpretation PSM creates matched pairs with similar baseline BP, so the comparison is fair. The treatment effect estimate should be close to −10 mmHg. PSM is most useful in observational studies where treatment assignment was not random. ::: --- ## Head-to-Head Comparison: All Techniques ```{r comparison-plot} #| fig-cap: "All techniques vs. the true effect (−10 mmHg, green line). The naïve before-after is the most biased. Techniques that use a control group or adjust for baseline all recover the truth reliably." results <- tibble( Technique = c( "Naïve Before–After\n(No Control)", "t-test\n(Change Score)", "ANCOVA\n(Adjusted)", "Mixed Model\n(LMM)", "Bland-Altman\n(RTM Correction)", "Propensity Score\n(Matched)" ), Estimate = c( mean(df_selected$change[df_selected$group == "Treatment"]), ttest_tidy$estimate, ancova_tidy$estimate, lmm_tidy$estimate, corrected_effect, ps_effect$estimate ), Lower = c( mean(df_selected$change[df_selected$group == "Treatment"]) - 1.96 * sd(df_selected$change[df_selected$group == "Treatment"]) / sqrt(sum(df_selected$group == "Treatment")), ttest_tidy$conf.low, ancova_tidy$conf.low, lmm_tidy$conf.low, corrected_effect - 2, ps_effect$conf.low ), Upper = c( mean(df_selected$change[df_selected$group == "Treatment"]) + 1.96 * sd(df_selected$change[df_selected$group == "Treatment"]) / sqrt(sum(df_selected$group == "Treatment")), ttest_tidy$conf.high, ancova_tidy$conf.high, lmm_tidy$conf.high, corrected_effect + 2, ps_effect$conf.high ), Biased = c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE) ) results <- results |> mutate(Technique = factor(Technique, levels = rev(Technique))) ggplot(results, aes(x = Estimate, y = Technique, colour = Biased)) + geom_vline(xintercept = true_effect, linetype = "solid", colour = "#27ae60", linewidth = 1.1) + geom_vline(xintercept = 0, linetype = "dashed", colour = "grey40", linewidth = 0.7) + geom_errorbarh(aes(xmin = Lower, xmax = Upper), height = 0.25, linewidth = 1) + geom_point(size = 4) + annotate("text", x = true_effect + 0.4, y = 6.4, label = "True Effect\n(−10 mmHg)", colour = "#1e8449", size = 3.5, hjust = 0) + scale_colour_manual(values = c("TRUE" = "#e74c3c", "FALSE" = "#2c3e50"), labels = c("TRUE" = "Biased (RTM unaccounted)", "FALSE" = "RTM-corrected")) + labs( title = "Treatment Effect Estimates Across All Techniques", subtitle = "Green line = true effect (−10 mmHg) | Red = biased by RTM", x = "Estimated Treatment Effect (mmHg)", y = NULL, colour = NULL, caption = "Error bars = 95% confidence intervals" ) + theme(legend.position = "top") ``` --- ## Summary Table ```{r summary-table} results_clean <- results |> mutate( `95% CI` = paste0("[", round(Lower,1), ", ", round(Upper,1), "]"), `Bias` = round(Estimate - true_effect, 2), `RTM Controlled` = if_else(!Biased, "✅ Yes", "❌ No"), Estimate = round(Estimate, 2) ) |> select(Technique, Estimate, `95% CI`, `Bias`, `RTM Controlled`) |> mutate(Technique = str_replace_all(Technique, "\n", " ")) results_clean |> kable(align = "lcccc", caption = "All Techniques: Estimates vs. True Effect (−10 mmHg)") |> kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) |> row_spec(1, background = "#fdf2f2") |> row_spec(2:6, background = "#f0faf4") |> column_spec(5, bold = TRUE) ``` --- ## Key Takeaways ::: {.callout-tip} ## What This Simulation Shows | Problem | Solution | |---|---| | Selecting extreme scorers guarantees RTM | Use a **control group** selected the same way | | Before–after study without control | Apply **Bland-Altman RTM correction** | | Repeated measures data | Use **ANCOVA or LMM** — adjust for baseline | | No randomisation possible | Use **propensity score matching** | | Clinical individual-level decisions | Use the **Reliable Change Index** | | High measurement unreliability | RTM is larger — **improve your instrument first** | > **The naïve before–after study will always overestimate the treatment effect when participants were selected for extreme baseline scores. Every technique above solves this — some by design, some by analysis.** ::: --- ## References - Bland, J. M., & Altman, D. G. (1994). Regression towards the mean. *BMJ*, 308, 1499. - Barnett, A. G., Van Der Pols, J. C., & Dobson, A. J. (2005). Regression to the mean: what it is and how to deal with it. *International Journal of Epidemiology*, 34(1), 215–220. - Jacobson, N. S., & Truax, P. (1991). Clinical significance: A statistical approach to defining meaningful change. *Journal of Consulting and Clinical Psychology*, 59(1), 12–19. - Austin, P. C. (2011). An introduction to propensity score methods for reducing confounding. *Multivariate Behavioral Research*, 46(3), 399–424.