Regression to the Mean: Detecting, Quantifying & Correcting It

Author

Timothy Achala

Published

June 1, 2026


1 The Setup: What Are We Simulating?

We generate a ground truth dataset where:

  • The true treatment effect is exactly −10 mmHg (blood pressure reduction)
  • Measurement error exists (reliability = 0.70)
  • Participants are selected because their baseline BP was extreme (high)
  • RTM will make both groups improve — the challenge is isolating the true treatment signal
Code
set.seed(2024)
n_total <- 400   # Total screened
n_select <- 100  # Top extreme scorers selected per group

# True underlying BP (stable trait)
true_bp <- rnorm(n_total * 2, mean = 130, sd = 15)

# Reliability: observed = true + error
reliability  <- 0.70
error_sd     <- sqrt((1 - reliability) * var(true_bp))

# Screening measurement (Day 1) — contains noise
screen_bp <- true_bp + rnorm(length(true_bp), 0, error_sd)

df_all <- tibble(
  id       = seq_along(true_bp),
  true_bp  = true_bp,
  screen   = screen_bp,
  group    = rep(c("Treatment", "Control"), each = n_total)
)

# Select extreme scorers (top ~25% per group)
threshold <- quantile(df_all$screen, 0.75)

df_selected <- df_all |>
  filter(screen >= threshold) |>
  group_by(group) |>
  slice_head(n = n_select) |>
  ungroup()

# Post-measurement: RTM + treatment effect for treatment group
true_effect <- -10  # mmHg reduction

df_selected <- df_selected |>
  mutate(
    post = true_bp +
      rnorm(n(), 0, error_sd) +
      if_else(group == "Treatment", true_effect, 0),
    change   = post - screen,
    rtm_comp = true_bp - screen   # expected RTM direction
  )

cat("Screened:", nrow(df_all),
    "| Selected:", nrow(df_selected),
    "| Threshold:", round(threshold, 1), "mmHg\n")
Screened: 800 | Selected: 195 | Threshold: 142 mmHg
Code
cat("True treatment effect:", true_effect, "mmHg\n")
True treatment effect: -10 mmHg
Code
cat("Mean change — Treatment:",
    round(mean(df_selected$change[df_selected$group == "Treatment"]), 2),
    "| Control:",
    round(mean(df_selected$change[df_selected$group == "Control"]), 2), "\n")
Mean change — Treatment: -14.7 | Control: -4.81 
Ground Truth

True effect = −10 mmHg. Any technique that recovers a value close to −10 is working. Values far from −10 are biased by RTM.


2 Visualising RTM in Action

2.1 Baseline vs. Change Score

Code
ggplot(df_selected, aes(x = screen, y = change, colour = group)) +
  geom_point(alpha = 0.45, size = 1.8) +
  geom_smooth(method = "lm", se = TRUE, linewidth = 1.2) +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "grey30") +
  scale_colour_manual(values = pal) +
  labs(
    title    = "RTM Signature: Baseline vs. Change Score",
    subtitle = "Negative slope = regression to the mean. Present in BOTH groups.",
    x        = "Baseline (Screening) BP (mmHg)",
    y        = "Change Score (Post − Pre, mmHg)",
    colour   = NULL,
    caption  = "If this slope were treatment, it would appear only in the treatment group."
  )

Classic RTM signature: higher baseline scores show larger drops. This pattern exists in BOTH groups — it is not treatment. The vertical spread between groups is the treatment signal.

2.2 Pre–Post Distributions

Code
df_selected |>
  pivot_longer(cols = c(screen, post),
               names_to = "time", values_to = "bp") |>
  mutate(time = factor(time, levels = c("screen", "post"),
                       labels = c("Pre (Screening)", "Post"))) |>
  ggplot(aes(x = bp, fill = time)) +
  geom_density(alpha = 0.55, colour = "white") +
  geom_vline(
    data = df_selected |>
      pivot_longer(c(screen, post), names_to = "time", values_to = "bp") |>
      mutate(time = factor(time, levels = c("screen", "post"),
                           labels = c("Pre (Screening)", "Post"))) |>
      group_by(group, time) |>
      summarise(m = mean(bp), .groups = "drop"),
    aes(xintercept = m, colour = time),
    linetype = "dashed", linewidth = 0.9
  ) +
  facet_wrap(~group) +
  scale_fill_manual(values  = c("Pre (Screening)" = "#e74c3c",
                                 "Post"            = "#27ae60")) +
  scale_colour_manual(values = c("Pre (Screening)" = "#c0392b",
                                  "Post"            = "#1e8449")) +
  labs(
    title    = "Pre vs. Post BP Distributions by Group",
    subtitle = "Both groups drop — RTM is at work in both",
    x        = "Blood Pressure (mmHg)",
    y        = "Density",
    fill     = "Time Point", colour = "Mean"
  )

Both groups shift downward from pre to post — RTM moves everyone. The treatment group shifts further, but without a control group, you’d attribute all of it to treatment.

3 Technique 1: Naïve Before–After (No Control) — Biased

The classic mistake: compute mean change in the treatment group alone.

Code
naive <- df_selected |>
  filter(group == "Treatment") |>
  summarise(
    mean_pre    = mean(screen),
    mean_post   = mean(post),
    mean_change = mean(change),
    se          = sd(change) / sqrt(n()),
    t_stat      = mean_change / se,
    p_value     = 2 * pt(-abs(t_stat), df = n() - 1)
  )

naive |>
  mutate(across(where(is.numeric), ~round(.x, 3))) |>
  kable(caption = "Naïve Before–After: Treatment Group Only") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Naïve Before–After: Treatment Group Only
mean_pre mean_post mean_change se t_stat p_value
150.602 135.905 -14.697 1.04 -14.125 0
Interpretation

The naïve estimate attributes all of the observed drop to treatment — including the RTM component. It will overestimate the true effect of −10 mmHg because it has no way to separate RTM from treatment.


4 Technique 2: Simple Two-Group Comparison (t-test)

Adding a control group already removes RTM from the estimate.

Code
ttest_res <- t.test(change ~ group, data = df_selected)

ttest_tidy <- tidy(ttest_res) |>
  mutate(
    estimate        = round(estimate, 3),
    conf.low        = round(conf.low, 3),
    conf.high       = round(conf.high, 3),
    p.value         = round(p.value, 4),
    interpretation  = "Treatment − Control change difference"
  )

ttest_tidy |>
  select(estimate, conf.low, conf.high, statistic, p.value) |>
  kable(caption = "Independent t-test: Change Score by Group") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
Independent t-test: Change Score by Group
estimate conf.low conf.high statistic p.value
9.883 6.841 12.925 6.409273 0
Interpretation

The t-test on change scores compares the mean change in the treatment group vs. control. Since RTM affects both groups equally, the difference captures the true treatment effect. The estimate should be close to −10 mmHg.


5 Technique 3: ANCOVA (Adjusting for Baseline)

ANCOVA regresses post-score on treatment group and baseline score, separating the baseline-driven component (RTM) from the treatment effect.

Code
ancova_mod <- lm(post ~ group + screen, data = df_selected)

ancova_tidy <- tidy(ancova_mod, conf.int = TRUE) |>
  filter(term == "groupTreatment") |>
  mutate(across(where(is.numeric), ~round(.x, 3)))

ancova_tidy |>
  select(term, estimate, std.error, conf.low, conf.high, p.value) |>
  kable(caption = "ANCOVA: Treatment Effect Adjusted for Baseline") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
ANCOVA: Treatment Effect Adjusted for Baseline
term estimate std.error conf.low conf.high p.value
groupTreatment -10.004 1.511 -12.984 -7.024 0
Interpretation

ANCOVA is the recommended analytical approach. By conditioning on the baseline score, it removes the correlation between baseline extremity and change — directly correcting for RTM. The coefficient on groupTreatment is the adjusted treatment effect, expected to be close to −10 mmHg, with narrower confidence intervals than the t-test.


6 Technique 4: Mixed Model (LMM)

Linear mixed model with random intercepts per participant — handles the repeated-measures structure explicitly.

Code
df_long <- df_selected |>
  pivot_longer(cols = c(screen, post),
               names_to = "time", values_to = "bp") |>
  mutate(
    time      = factor(time, levels = c("screen", "post")),
    time_num  = as.numeric(time) - 1  # 0 = pre, 1 = post
  )

lmm_mod <- nlme::lme(bp ~ time_num * group, random = ~ 1 | id, data = df_long, method = "ML")

lmm_tidy <- tidy(lmm_mod, conf.int = TRUE) |>
  filter(str_detect(term, "time_num:group")) |>
  mutate(across(where(is.numeric), ~round(.x, 3)))

lmm_tidy |>
  select(term, estimate, std.error, conf.low, conf.high, p.value) |>
  kable(caption = "Linear Mixed Model: Time × Group Interaction") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
Linear Mixed Model: Time × Group Interaction
term estimate std.error conf.low conf.high p.value
time_num:groupTreatment -9.883 1.539 -12.904 -6.862 0
Interpretation

The time_num:groupTreatment interaction is the treatment effect — how much more the treatment group changed over time compared to control. Random intercepts per participant account for individual baseline differences, giving a more efficient estimate. Expected value: ~−10 mmHg.


7 Technique 5: Bland-Altman RTM Correction

When no control group exists, the RTM component can be estimated analytically and subtracted.

Code
# Using treatment group only (simulating no control group scenario)
treat_df <- df_selected |> filter(group == "Treatment")

pop_mean    <- mean(df_all$screen[df_all$group == "Treatment"])
select_mean <- mean(treat_df$screen)
r_xx        <- reliability  # ICC / test-retest reliability

rtm_estimate   <- (1 - r_xx) * (select_mean - pop_mean)
observed_change <- mean(treat_df$change)
corrected_effect <- observed_change - rtm_estimate

tibble(
  `Population Mean (Screened)` = round(pop_mean, 2),
  `Selected Group Mean`        = round(select_mean, 2),
  `Reliability (r_xx)`         = r_xx,
  `Estimated RTM`              = round(rtm_estimate, 2),
  `Observed Change`            = round(observed_change, 2),
  `RTM-Corrected Effect`       = round(corrected_effect, 2)
) |>
  pivot_longer(everything(), names_to = "Metric", values_to = "Value") |>
  kable(caption = "Bland-Altman RTM Correction (No Control Group)") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) |>
  row_spec(6, bold = TRUE, background = "#eafaf1")
Bland-Altman RTM Correction (No Control Group)
Metric Value
Population Mean (Screened) 130.56
Selected Group Mean 150.60
Reliability (r_xx) 0.70
Estimated RTM 6.01
Observed Change -14.70
RTM-Corrected Effect -20.71
Interpretation

Without a control group, the Bland-Altman correction estimates the RTM component from first principles: unreliability × how extreme the selection was. Subtracting this from the observed change gives a corrected treatment effect closer to the truth. Higher reliability → smaller RTM → less correction needed.


8 Technique 6: Reliable Change Index (RCI)

The RCI asks: “Is this individual’s change real, or just noise?” It operates at the individual level.

Code
sd_pre  <- sd(treat_df$screen)
se_diff <- sd_pre * sqrt(2 * (1 - reliability))

treat_df <- treat_df |>
  mutate(
    RCI         = change / se_diff,
    RCI_sig     = case_when(
      RCI < -1.96 ~ "Reliable Improvement",
      RCI >  1.96 ~ "Reliable Deterioration",
      TRUE        ~ "No Reliable Change"
    )
  )

treat_df |>
  count(RCI_sig) |>
  mutate(Percent = paste0(round(100 * n / sum(n), 1), "%")) |>
  rename(`RCI Category` = RCI_sig, N = n) |>
  kable(caption = "Reliable Change Index — Treatment Group") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE) |>
  row_spec(which(treat_df |> count(RCI_sig) |> pull(RCI_sig) == "Reliable Improvement"),
           background = "#eafaf1")
Reliable Change Index — Treatment Group
RCI Category N Percent
No Reliable Change 41 41%
Reliable Deterioration 2 2%
Reliable Improvement 57 57%
Code
ggplot(treat_df, aes(x = reorder(id, RCI), y = RCI, colour = RCI_sig)) +
  geom_hline(yintercept = c(-1.96, 1.96),
             linetype = "dashed", colour = "grey40", linewidth = 0.8) +
  geom_hline(yintercept = 0, colour = "grey60", linewidth = 0.5) +
  geom_point(alpha = 0.7, size = 1.8) +
  scale_colour_manual(values = c(
    "Reliable Improvement"    = "#27ae60",
    "Reliable Deterioration"  = "#e74c3c",
    "No Reliable Change"      = "#95a5a6"
  )) +
  labs(
    title    = "Reliable Change Index — Individual Level",
    subtitle = "Dashed lines = ±1.96 threshold for reliable change",
    x        = "Participant (ordered by RCI)",
    y        = "RCI",
    colour   = NULL,
    caption  = "Points within dashed lines cannot be distinguished from RTM + measurement error"
  ) +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())

Individual RCI values. Points beyond ±1.96 (dashed lines) represent participants with statistically reliable change — beyond what measurement error and RTM would predict.

9 Technique 7: Propensity Score Matching

Mimics randomisation in observational data by balancing baseline covariates between groups.

Code
ps_mod  <- matchit(
  I(group == "Treatment") ~ screen + true_bp,
  data   = df_selected,
  method = "nearest",
  ratio  = 1
)

df_matched <- match.data(ps_mod)

ps_effect <- lm(change ~ group, data = df_matched) |>
  tidy(conf.int = TRUE) |>
  filter(term == "groupTreatment") |>
  mutate(across(where(is.numeric), ~round(.x, 3)))

ps_effect |>
  select(term, estimate, std.error, conf.low, conf.high, p.value) |>
  kable(caption = "Propensity Score Matched: Treatment Effect on Change Score") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)
Propensity Score Matched: Treatment Effect on Change Score
term estimate std.error conf.low conf.high p.value
groupTreatment -9.345 1.551 -12.404 -6.286 0
Interpretation

PSM creates matched pairs with similar baseline BP, so the comparison is fair. The treatment effect estimate should be close to −10 mmHg. PSM is most useful in observational studies where treatment assignment was not random.


10 Head-to-Head Comparison: All Techniques

Code
results <- tibble(
  Technique = c(
    "Naïve Before–After\n(No Control)",
    "t-test\n(Change Score)",
    "ANCOVA\n(Adjusted)",
    "Mixed Model\n(LMM)",
    "Bland-Altman\n(RTM Correction)",
    "Propensity Score\n(Matched)"
  ),
  Estimate = c(
    mean(df_selected$change[df_selected$group == "Treatment"]),
    ttest_tidy$estimate,
    ancova_tidy$estimate,
    lmm_tidy$estimate,
    corrected_effect,
    ps_effect$estimate
  ),
  Lower = c(
    mean(df_selected$change[df_selected$group == "Treatment"]) -
      1.96 * sd(df_selected$change[df_selected$group == "Treatment"]) /
      sqrt(sum(df_selected$group == "Treatment")),
    ttest_tidy$conf.low,
    ancova_tidy$conf.low,
    lmm_tidy$conf.low,
    corrected_effect - 2,
    ps_effect$conf.low
  ),
  Upper = c(
    mean(df_selected$change[df_selected$group == "Treatment"]) +
      1.96 * sd(df_selected$change[df_selected$group == "Treatment"]) /
      sqrt(sum(df_selected$group == "Treatment")),
    ttest_tidy$conf.high,
    ancova_tidy$conf.high,
    lmm_tidy$conf.high,
    corrected_effect + 2,
    ps_effect$conf.high
  ),
  Biased = c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)
)

results <- results |>
  mutate(Technique = factor(Technique, levels = rev(Technique)))

ggplot(results, aes(x = Estimate, y = Technique, colour = Biased)) +
  geom_vline(xintercept = true_effect, linetype = "solid",
             colour = "#27ae60", linewidth = 1.1) +
  geom_vline(xintercept = 0, linetype = "dashed",
             colour = "grey40", linewidth = 0.7) +
  geom_errorbarh(aes(xmin = Lower, xmax = Upper),
                 height = 0.25, linewidth = 1) +
  geom_point(size = 4) +
  annotate("text", x = true_effect + 0.4, y = 6.4,
           label = "True Effect\n(−10 mmHg)",
           colour = "#1e8449", size = 3.5, hjust = 0) +
  scale_colour_manual(values = c("TRUE" = "#e74c3c", "FALSE" = "#2c3e50"),
                      labels = c("TRUE" = "Biased (RTM unaccounted)",
                                 "FALSE" = "RTM-corrected")) +
  labs(
    title    = "Treatment Effect Estimates Across All Techniques",
    subtitle = "Green line = true effect (−10 mmHg) | Red = biased by RTM",
    x        = "Estimated Treatment Effect (mmHg)",
    y        = NULL,
    colour   = NULL,
    caption  = "Error bars = 95% confidence intervals"
  ) +
  theme(legend.position = "top")

All techniques vs. the true effect (−10 mmHg, green line). The naïve before-after is the most biased. Techniques that use a control group or adjust for baseline all recover the truth reliably.

11 Summary Table

Code
results_clean <- results |>
  mutate(
    `95% CI`          = paste0("[", round(Lower,1), ", ", round(Upper,1), "]"),
    `Bias`            = round(Estimate - true_effect, 2),
    `RTM Controlled`  = if_else(!Biased, "✅ Yes", "❌ No"),
    Estimate          = round(Estimate, 2)
  ) |>
  select(Technique, Estimate, `95% CI`, `Bias`, `RTM Controlled`) |>
  mutate(Technique = str_replace_all(Technique, "\n", " "))

results_clean |>
  kable(align = "lcccc",
        caption = "All Techniques: Estimates vs. True Effect (−10 mmHg)") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE) |>
  row_spec(1, background = "#fdf2f2") |>
  row_spec(2:6, background = "#f0faf4") |>
  column_spec(5, bold = TRUE)
All Techniques: Estimates vs. True Effect (−10 mmHg)
Technique Estimate 95% CI Bias RTM Controlled
Naïve Before–After (No Control) -14.70 [-16.7, -12.7] -4.70 ❌ No |
t-test (Change Score) 9.88 [6.8, 12.9] 19.88 ✅ Yes |
ANCOVA (Adjusted) -10.00 [-13, -7] 0.00 ✅ Yes |
Mixed Model (LMM) -9.88 [-12.9, -6.9] 0.12 ✅ Yes |
Bland-Altman (RTM Correction) -20.71 [-22.7, -18.7] -10.71 ✅ Yes |
Propensity Score (Matched) -9.35 [-12.4, -6.3] 0.65 ✅ Yes |

12 Key Takeaways

What This Simulation Shows
Problem Solution
Selecting extreme scorers guarantees RTM Use a control group selected the same way
Before–after study without control Apply Bland-Altman RTM correction
Repeated measures data Use ANCOVA or LMM — adjust for baseline
No randomisation possible Use propensity score matching
Clinical individual-level decisions Use the Reliable Change Index
High measurement unreliability RTM is larger — improve your instrument first

The naïve before–after study will always overestimate the treatment effect when participants were selected for extreme baseline scores. Every technique above solves this — some by design, some by analysis.


13 References

  • Bland, J. M., & Altman, D. G. (1994). Regression towards the mean. BMJ, 308, 1499.
  • Barnett, A. G., Van Der Pols, J. C., & Dobson, A. J. (2005). Regression to the mean: what it is and how to deal with it. International Journal of Epidemiology, 34(1), 215–220.
  • Jacobson, N. S., & Truax, P. (1991). Clinical significance: A statistical approach to defining meaningful change. Journal of Consulting and Clinical Psychology, 59(1), 12–19.
  • Austin, P. C. (2011). An introduction to propensity score methods for reducing confounding. Multivariate Behavioral Research, 46(3), 399–424.