The Stepwise Trap: How ‘Data-Driven’ Model Selection Lies to You

A Simulation Study with Commentary

Author

Timothy Achala

Published

June 19, 2026

1 · The Setup: A Confession Before the Crime

“I trust the algorithm. It picks only the significant variables.” — Every junior analyst, at least once.

There is something seductive about stepwise selection. You hand it your messy dataset — 25 variables, no theory, a deadline — and it hands you back a clean, parsimonious model with nothing but asterisks. What’s not to love?

Everything. Let’s show you exactly why, with numbers you can’t argue with.

We will simulate a world where we know the truth: only 3 out of 20 predictors actually matter. Then we’ll watch stepwise selection embarrass itself — repeatedly — across 500 independent samples from that same truth.

2 · Simulation Design

Code

n_obs      <- 80    # small-ish — realistic clinical/epi sample
n_pred     <- 20    # 20 candidate predictors
n_true     <- 3     # only 3 are real
n_sims     <- 500   # repetitions

true_vars  <- paste0("X", 1:n_true)   # X1, X2, X3
noise_vars <- paste0("X", (n_true+1):n_pred)  # X4 … X20

# True coefficients (known only to us, the simulators / gods of this world)
beta_true  <- c(intercept = 1, X1 = 0.6, X2 = -0.4, X3 = 0.8)
# Everything else = 0. Silence. Noise.

Note

Ground truth: $Y = 1 + 0.6X_1 - 0.4X_2 + 0.8X_3 + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, 1)$

$X_4, X_5, \ldots, X_{20}$ are pure noise — zero relationship with $Y$. Stepwise doesn’t know that. Let’s see what it does.

3 · Running the Experiment (500 Times)

Code

simulate_one <- function(seed) {
  set.seed(seed)
  
  # Generate data
  X   <- matrix(rnorm(n_obs * n_pred), nrow = n_obs,
                dimnames = list(NULL, paste0("X", 1:n_pred)))
  eps <- rnorm(n_obs)
  Y   <- beta_true["intercept"] +
         X[, "X1"] * beta_true["X1"] +
         X[, "X2"] * beta_true["X2"] +
         X[, "X3"] * beta_true["X3"] + eps
  
  df <- as.data.frame(cbind(Y, X))
  
  # ── Full model ─────────────────────────────────────────────────────────────
  full_mod <- lm(Y ~ ., data = df)
  
  # ── Stepwise (both directions, AIC) ────────────────────────────────────────
  step_mod <- stepAIC(full_mod, direction = "both", trace = FALSE)
  step_vars <- names(coef(step_mod))[-1]   # drop intercept
  
  # ── Metrics ────────────────────────────────────────────────────────────────
  # In-sample R²
  r2_full <- summary(full_mod)$r.squared
  r2_step <- summary(step_mod)$r.squared
  
  # Out-of-sample RMSE on a fresh test set (same DGP)
  X_test  <- matrix(rnorm(200 * n_pred), nrow = 200,
                    dimnames = list(NULL, paste0("X", 1:n_pred)))
  eps_test <- rnorm(200)
  Y_test  <- beta_true["intercept"] +
             X_test[, "X1"] * beta_true["X1"] +
             X_test[, "X2"] * beta_true["X2"] +
             X_test[, "X3"] * beta_true["X3"] + eps_test
  
  df_test <- as.data.frame(cbind(Y = Y_test, X_test))
  
  rmse_full <- sqrt(mean((Y_test - predict(full_mod, df_test))^2))
  rmse_step <- sqrt(mean((Y_test - predict(step_mod, df_test))^2))
  
  # Variable selection accuracy
  true_pos   <- sum(true_vars %in% step_vars)  # real vars found
  false_pos  <- sum(step_vars %in% noise_vars) # noise vars included
  
  tibble(
    sim       = seed,
    n_step    = length(step_vars),
    true_pos  = true_pos,
    false_pos = false_pos,
    r2_full   = r2_full,
    r2_step   = r2_step,
    rmse_full = rmse_full,
    rmse_step = rmse_step,
    step_vars = list(step_vars)
  )
}

results <- map_dfr(1:n_sims, simulate_one)

4 · The Evidence

4.1 · How Many Noise Variables Does Stepwise Drag In?

Code

fp_summary <- results %>%
  count(false_pos) %>%
  mutate(pct = n / n_sims * 100)

p1 <- ggplot(results, aes(x = false_pos)) +
  geom_histogram(binwidth = 1, fill = col_noise, colour = "white",
                 alpha = 0.85) +
  geom_vline(xintercept = 0, linetype = "dashed", colour = col_true, linewidth = 1) +
  annotate("text", x = 0.3, y = Inf, vjust = 1.8, hjust = 0,
           colour = col_true, fontface = "bold", size = 3.8,
           label = "← Correct answer:\n   0 noise variables") +
  labs(
    title = "🚨 Stepwise keeps noise — over and over",
    subtitle = glue("Across {n_sims} simulations: median **{median(results$false_pos)}** false-positive noise variables selected per run"),
    x = "Number of noise variables retained by stepwise",
    y = "Count (out of 500 simulations)",
    caption = "n = 80 obs, 20 predictors (3 real + 17 noise), stepwise AIC both-direction"
  ) +
  theme_tim()

p1

Each simulation used 20 predictors, only 3 of which were real. The red distribution shows how many *noise* variables stepwise selection dragged into the final model.

Commentary: The correct answer is zero noise variables. Stepwise rarely gets that right. Most runs include 2–5 pure noise variables in the final model — and then reports them with p-values that look totally respectable. That’s not discovery. That’s statistical confabulation.

4.2 · Does It Even Find the True Variables?

Code

tp_tbl <- results %>%
  count(true_pos) %>%
  mutate(pct = round(n / n_sims * 100, 1),
         label = glue("{pct}%"))

p2 <- ggplot(results, aes(x = true_pos)) +
  geom_histogram(binwidth = 1, fill = col_true, colour = "white", alpha = 0.85) +
  scale_x_continuous(breaks = 0:3) +
  labs(
    title = "True variables found by stepwise",
    subtitle = "3 real predictors exist — how many does stepwise recover?",
    x = "Number of real predictors correctly retained",
    y = "Count (out of 500 simulations)",
    caption = "Stepwise finds all 3 real vars in only a fraction of runs"
  ) +
  theme_tim()

p2

Distribution of true positives (real predictors recovered) across 500 simulations.

Code

tp_tbl %>%
  gt() %>%
  cols_label(true_pos = "Real vars found", n = "Simulations", pct = "%", label = "Share") %>%
  tab_header(title = "True Positive Recovery Rate") %>%
  tab_style(style = cell_fill(color = "#d6eaf8"),
            locations = cells_body(rows = true_pos == 3)) %>%
  fmt_number(columns = pct, decimals = 1)

Real vars found	Simulations	%	Share
True Positive Recovery Rate
2	11	2.2	2.2%
3	489	97.8	97.8%

4.3 · The R² Inflation Illusion

“But the model has an R² of 0.48 — surely that means something?”

Code

r2_long <- results %>%
  dplyr::select(sim, r2_full, r2_step) %>%
  pivot_longer(-sim, names_to = "model", values_to = "r2") %>%
  mutate(model = case_match(model,
                            "r2_full" ~ "Full model",
                            "r2_step" ~ "Stepwise"))

p3 <- ggplot(r2_long, aes(x = r2, fill = model)) +
  geom_density(alpha = 0.65, colour = NA) +
  scale_fill_manual(values = c("Full model" = col_full, "Stepwise" = col_step)) +
  labs(
    title = "📈 Stepwise claims higher R² — but it's a mirage",
    subtitle = "In-sample R² is inflated by variable-hunting; it won't replicate",
    x = "In-sample R²",
    y = "Density",
    fill = NULL
  ) +
  theme_tim()

p3

Stepwise inflates in-sample R² (optimism), then pays the price out-of-sample.

Code

rmse_long <- results %>%
  dplyr::select(sim, rmse_full, rmse_step) %>%
  pivot_longer(-sim, names_to = "model", values_to = "rmse") %>%
  mutate(model = case_match(model,
                            "rmse_full" ~ "Full model",
                            "rmse_step" ~ "Stepwise"))

p4 <- ggplot(rmse_long, aes(x = rmse, fill = model)) +
  geom_density(alpha = 0.65, colour = NA) +
  scale_fill_manual(values = c("Full model" = col_full, "Stepwise" = col_step)) +
  labs(
    title = "📉 Out-of-sample RMSE — the real reckoning",
    subtitle = "Stepwise's 'lean' model is not more accurate; it's just more confidently wrong",
    x = "RMSE on held-out test set (n = 200)",
    y = "Density",
    fill = NULL
  ) +
  theme_tim()

p4

Out-of-sample prediction error: who actually generalises better?

Code

results %>%
  summarise(
    `Stepwise — median RMSE` = round(median(rmse_step), 3),
    `Full model — median RMSE` = round(median(rmse_full), 3),
    `Stepwise — median in-sample R²` = round(median(r2_step), 3),
    `Full model — median in-sample R²` = round(median(r2_full), 3),
    `Stepwise — median noise vars included` = round(median(false_pos), 1),
    `Stepwise — median true vars recovered` = round(median(true_pos), 1)
  ) %>%
  pivot_longer(everything(), names_to = "Metric", values_to = "Value") %>%
  gt() %>%
  tab_header(title = "Summary Across 500 Simulations") %>%
  fmt_number(columns = Value, decimals = 3) %>%
  tab_style(
    style = cell_fill(color = "#fde8e8"),
    locations = cells_body(rows = str_detect(Metric, "Stepwise"))
  ) %>%
  tab_style(
    style = cell_fill(color = "#e8f8e8"),
    locations = cells_body(rows = str_detect(Metric, "Full"))
  )

Metric	Value
Summary Across 500 Simulations
Stepwise — median RMSE	1.104
Full model — median RMSE	1.157
Stepwise — median in-sample R²	0.618
Full model — median in-sample R²	0.652
Stepwise — median noise vars included	4.000
Stepwise — median true vars recovered	3.000

4.4 · The Variable Lottery: Which Noise Variables “Win”?

The next part is the most disturbing. Of the 17 pure noise variables, how often does each get selected?

Code

noise_freq <- results %>%
  unnest(step_vars) %>%
  filter(step_vars %in% noise_vars) %>%
  count(step_vars, name = "times_selected") %>%
  mutate(
    pct       = times_selected / n_sims * 100,
    step_vars = fct_reorder(step_vars, times_selected)
  )

p5 <- ggplot(noise_freq, aes(x = pct, y = step_vars)) +
  geom_col(fill = col_noise, alpha = 0.85) +
  geom_text(aes(label = glue("{round(pct,0)}%")),
            hjust = -0.15, size = 3.2, colour = "grey30") +
  geom_vline(xintercept = 5, linetype = "dashed",
             colour = "grey50", linewidth = 0.7) +
  annotate("text", x = 5.5, y = 1, hjust = 0, size = 3,
           colour = "grey40", label = "α = 0.05\nnominal rate →") +
  scale_x_continuous(limits = c(0, 55),
                     labels = function(x) paste0(x, "%")) +
  labs(
    title = "🎰 Pure noise variables selected by stepwise",
    subtitle = "Every one of these variables is **completely unrelated to Y** — yet stepwise picks them constantly",
    x = "% of simulations where variable was selected",
    y = NULL,
    caption = "Dashed line = 5% (expected false positive rate under a single honest test)"
  ) +
  theme_tim()

p5

Frequency of each noise variable being selected by stepwise across 500 simulations. These variables have zero relationship with Y.

Commentary: Every bar to the right of the dashed line is a variable that a researcher would present in a paper as “independently associated with the outcome.” The journals would publish it. Future meta-analyses would include it. And it would be zero. Pure chance. This is how noise gets enshrined.

4.5 · The Coefficient Instability Catastrophe

Let’s zoom in on one simulation and show how the selected model lies about its own coefficients.

Code

# Re-run 200 sims and collect X1 coefficient
coef_sims <- map_dfr(1:200, function(seed) {
  set.seed(seed)
  X   <- matrix(rnorm(n_obs * n_pred), nrow = n_obs,
                dimnames = list(NULL, paste0("X", 1:n_pred)))
  eps <- rnorm(n_obs)
  Y   <- beta_true["intercept"] +
         X[, "X1"] * beta_true["X1"] +
         X[, "X2"] * beta_true["X2"] +
         X[, "X3"] * beta_true["X3"] + eps
  df  <- as.data.frame(cbind(Y, X))
  
  full_mod <- lm(Y ~ ., data = df)
  step_mod <- stepAIC(full_mod, direction = "both", trace = FALSE)
  
  step_coefs <- coef(step_mod)
  full_coefs <- coef(full_mod)
  
  tibble(
    seed      = seed,
    coef_full = full_coefs["X1"],
    coef_step = if ("X1" %in% names(step_coefs)) step_coefs["X1"] else NA_real_,
    x1_selected = "X1" %in% names(step_coefs)
  )
})

coef_long <- coef_sims %>%
  pivot_longer(c(coef_full, coef_step), names_to = "model", values_to = "estimate") %>%
  mutate(model = case_match(model,
                            "coef_full" ~ "Full model (always estimates X1)",
                            "coef_step" ~ "Stepwise (only when X1 is selected)"))

p6 <- ggplot(coef_long, aes(x = estimate, fill = model)) +
  geom_density(alpha = 0.65, colour = NA, na.rm = TRUE) +
  geom_vline(xintercept = 0.6, linetype = "dashed",
             colour = "black", linewidth = 1) +
  annotate("text", x = 0.63, y = Inf, vjust = 2, hjust = 0,
           label = "True β = 0.6", fontface = "bold", size = 3.5) +
  scale_fill_manual(values = c(
    "Full model (always estimates X1)"     = col_full,
    "Stepwise (only when X1 is selected)"  = col_step
  )) +
  labs(
    title = "Coefficient estimates for X1 (true β = 0.6)",
    subtitle = "Stepwise only shows X1 when it *looks* big — **selection bias inflates estimates**",
    x = "Estimated coefficient",
    y = "Density",
    fill = NULL,
    caption = "Stepwise estimates conditioned on selection — classic winner's curse"
  ) +
  theme_tim()

p6

Estimated coefficient of X1 (true β = 0.6) across 200 replications, stepwise vs. full model. Stepwise biases estimates upward in selected models.

This is the winner’s curse in plain sight. Stepwise only includes X1 when its estimated coefficient crossed whatever threshold the algorithm used. So the reported estimates are systematically larger than the truth. Your effect sizes are wrong. Your confidence intervals are wrong. Your power calculations for the next study will be wrong.

5 · A Side-by-Side Summary

Code

(p1 | p2) / (p3 | p4) +
  plot_annotation(
    title    = "Stepwise Selection: 500 Simulations, One Verdict",
    subtitle = "Ground truth: 3 real predictors, 17 noise. Stepwise doesn't know that — and acts accordingly.",
    theme    = theme(
      plot.title    = element_text(face = "bold", size = 15),
      plot.subtitle = element_text(size = 11, colour = "grey35")
    )
  )

The full picture. Left column: selection behaviour. Right column: predictive performance.

6 · What Should You Do Instead?

Okay. We’ve seen the crime scene. What’s the alternative?

The answer depends on your goal:

Goal	Better approach
Prediction	LASSO / Ridge / Elastic Net with cross-validation
Inference / explanation	Theory-driven pre-specification; adjust for all confounders, period
Exploratory	Penalised regression + bootstrap variable importance — report uncertainty honestly
Clinical score	LASSO with optimism-corrected validation (bootstrap or external dataset)

The fundamental rule

Model selection and inference cannot use the same data without a penalty. If you used the data to choose variables, your p-values are not what they say they are. Full stop.

7 · The Closing Argument

“But everyone does it. The reviewers expect it.”

That’s an institutional problem, not a statistical defence.

The simulation above is not exotic. It uses a perfectly ordinary linear DGP, a reasonable sample size (n = 80), and off-the-shelf stepwise-AIC — the exact setup taught in most regression courses and implemented in most analysis pipelines. And it produces:

False positives at rates far above 5%
Biased coefficient estimates
Inflated in-sample R² that doesn’t survive contact with new data
Inconsistent variable selection — run the same study twice, get a different “significant” model

Stepwise selection is not data-driven. It is noise-driven, dressed up in the clothes of rigor.

--- title: "The Stepwise Trap: How 'Data-Driven' Model Selection Lies to You" subtitle: "A Simulation Study with Commentary" author: "Timothy Achala" date: today format: html: toc: true toc-depth: 3 toc-title: "Contents" code-fold: true code-tools: true theme: flatly highlight-style: github self-contained: true fig-width: 9 fig-height: 5.5 fig-dpi: 150 embed-resources: true execute: echo: true warning: false message: false cache: false --- ```{r setup} #| include: false library(tidyverse) library(MASS) # stepAIC library(gt) library(patchwork) library(ggtext) library(glue) set.seed(2024) # ── palette ────────────────────────────────────────────────────────────────── col_true <- "#2166ac" # blue – true predictors col_noise <- "#d6604d" # red – noise predictors col_step <- "#b2182b" # dark red – stepwise monster col_full <- "#4dac26" # green – full model col_bg <- "#f8f9fa" theme_tim <- function() { theme_minimal(base_size = 12, base_family = "sans") + theme( plot.background = element_rect(fill = col_bg, colour = NA), panel.background = element_rect(fill = col_bg, colour = NA), plot.title = element_markdown(face = "bold", size = 14), plot.subtitle = element_markdown(size = 11, colour = "grey35"), plot.caption = element_markdown(size = 8.5, colour = "grey50"), axis.title = element_text(size = 11), legend.position = "bottom" ) } ``` --- ## 1 · The Setup: A Confession Before the Crime > *"I trust the algorithm. It picks only the significant variables."* > — Every junior analyst, at least once. There is something seductive about stepwise selection. You hand it your messy dataset — 25 variables, no theory, a deadline — and it hands you back a clean, parsimonious model with nothing but asterisks. What's not to love? **Everything.** Let's show you exactly why, with numbers you can't argue with. We will simulate a world where **we know the truth**: only 3 out of 20 predictors actually matter. Then we'll watch stepwise selection embarrass itself — repeatedly — across 500 independent samples from that same truth. --- ## 2 · Simulation Design ```{r sim-params} n_obs <- 80 # small-ish — realistic clinical/epi sample n_pred <- 20 # 20 candidate predictors n_true <- 3 # only 3 are real n_sims <- 500 # repetitions true_vars <- paste0("X", 1:n_true) # X1, X2, X3 noise_vars <- paste0("X", (n_true+1):n_pred) # X4 … X20 # True coefficients (known only to us, the simulators / gods of this world) beta_true <- c(intercept = 1, X1 = 0.6, X2 = -0.4, X3 = 0.8) # Everything else = 0. Silence. Noise. ``` ::: {.callout-note} **Ground truth:** $Y = 1 + 0.6X_1 - 0.4X_2 + 0.8X_3 + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, 1)$ $X_4, X_5, \ldots, X_{20}$ are **pure noise** — zero relationship with $Y$. Stepwise doesn't know that. Let's see what it does. ::: --- ## 3 · Running the Experiment (500 Times) ```{r run-simulation} simulate_one <- function(seed) { set.seed(seed) # Generate data X <- matrix(rnorm(n_obs * n_pred), nrow = n_obs, dimnames = list(NULL, paste0("X", 1:n_pred))) eps <- rnorm(n_obs) Y <- beta_true["intercept"] + X[, "X1"] * beta_true["X1"] + X[, "X2"] * beta_true["X2"] + X[, "X3"] * beta_true["X3"] + eps df <- as.data.frame(cbind(Y, X)) # ── Full model ───────────────────────────────────────────────────────────── full_mod <- lm(Y ~ ., data = df) # ── Stepwise (both directions, AIC) ──────────────────────────────────────── step_mod <- stepAIC(full_mod, direction = "both", trace = FALSE) step_vars <- names(coef(step_mod))[-1] # drop intercept # ── Metrics ──────────────────────────────────────────────────────────────── # In-sample R² r2_full <- summary(full_mod)$r.squared r2_step <- summary(step_mod)$r.squared # Out-of-sample RMSE on a fresh test set (same DGP) X_test <- matrix(rnorm(200 * n_pred), nrow = 200, dimnames = list(NULL, paste0("X", 1:n_pred))) eps_test <- rnorm(200) Y_test <- beta_true["intercept"] + X_test[, "X1"] * beta_true["X1"] + X_test[, "X2"] * beta_true["X2"] + X_test[, "X3"] * beta_true["X3"] + eps_test df_test <- as.data.frame(cbind(Y = Y_test, X_test)) rmse_full <- sqrt(mean((Y_test - predict(full_mod, df_test))^2)) rmse_step <- sqrt(mean((Y_test - predict(step_mod, df_test))^2)) # Variable selection accuracy true_pos <- sum(true_vars %in% step_vars) # real vars found false_pos <- sum(step_vars %in% noise_vars) # noise vars included tibble( sim = seed, n_step = length(step_vars), true_pos = true_pos, false_pos = false_pos, r2_full = r2_full, r2_step = r2_step, rmse_full = rmse_full, rmse_step = rmse_step, step_vars = list(step_vars) ) } results <- map_dfr(1:n_sims, simulate_one) ``` --- ## 4 · The Evidence ### 4.1 · How Many Noise Variables Does Stepwise Drag In? ```{r plot-false-pos} #| fig-cap: "Each simulation used 20 predictors, only 3 of which were real. The red distribution shows how many *noise* variables stepwise selection dragged into the final model." fp_summary <- results %>% count(false_pos) %>% mutate(pct = n / n_sims * 100) p1 <- ggplot(results, aes(x = false_pos)) + geom_histogram(binwidth = 1, fill = col_noise, colour = "white", alpha = 0.85) + geom_vline(xintercept = 0, linetype = "dashed", colour = col_true, linewidth = 1) + annotate("text", x = 0.3, y = Inf, vjust = 1.8, hjust = 0, colour = col_true, fontface = "bold", size = 3.8, label = "← Correct answer:\n 0 noise variables") + labs( title = "🚨 Stepwise keeps noise — over and over", subtitle = glue("Across {n_sims} simulations: median **{median(results$false_pos)}** false-positive noise variables selected per run"), x = "Number of noise variables retained by stepwise", y = "Count (out of 500 simulations)", caption = "n = 80 obs, 20 predictors (3 real + 17 noise), stepwise AIC both-direction" ) + theme_tim() p1 ``` > **Commentary:** *The correct answer is zero noise variables. Stepwise rarely gets that right. Most runs include 2–5 pure noise variables in the final model — and then reports them with p-values that look totally respectable. That's not discovery. That's statistical confabulation.* --- ### 4.2 · Does It Even Find the True Variables? ```{r plot-true-pos} #| fig-cap: "Distribution of true positives (real predictors recovered) across 500 simulations." tp_tbl <- results %>% count(true_pos) %>% mutate(pct = round(n / n_sims * 100, 1), label = glue("{pct}%")) p2 <- ggplot(results, aes(x = true_pos)) + geom_histogram(binwidth = 1, fill = col_true, colour = "white", alpha = 0.85) + scale_x_continuous(breaks = 0:3) + labs( title = "True variables found by stepwise", subtitle = "3 real predictors exist — how many does stepwise recover?", x = "Number of real predictors correctly retained", y = "Count (out of 500 simulations)", caption = "Stepwise finds all 3 real vars in only a fraction of runs" ) + theme_tim() p2 ``` ```{r tp-table} tp_tbl %>% gt() %>% cols_label(true_pos = "Real vars found", n = "Simulations", pct = "%", label = "Share") %>% tab_header(title = "True Positive Recovery Rate") %>% tab_style(style = cell_fill(color = "#d6eaf8"), locations = cells_body(rows = true_pos == 3)) %>% fmt_number(columns = pct, decimals = 1) ``` --- ### 4.3 · The R² Inflation Illusion > *"But the model has an R² of 0.48 — surely that means something?"* ```{r plot-r2} #| fig-cap: "Stepwise inflates in-sample R² (optimism), then pays the price out-of-sample." r2_long <- results %>% dplyr::select(sim, r2_full, r2_step) %>% pivot_longer(-sim, names_to = "model", values_to = "r2") %>% mutate(model = case_match(model, "r2_full" ~ "Full model", "r2_step" ~ "Stepwise")) p3 <- ggplot(r2_long, aes(x = r2, fill = model)) + geom_density(alpha = 0.65, colour = NA) + scale_fill_manual(values = c("Full model" = col_full, "Stepwise" = col_step)) + labs( title = "📈 Stepwise claims higher R² — but it's a mirage", subtitle = "In-sample R² is inflated by variable-hunting; it won't replicate", x = "In-sample R²", y = "Density", fill = NULL ) + theme_tim() p3 ``` ```{r plot-rmse} #| fig-cap: "Out-of-sample prediction error: who actually generalises better?" rmse_long <- results %>% dplyr::select(sim, rmse_full, rmse_step) %>% pivot_longer(-sim, names_to = "model", values_to = "rmse") %>% mutate(model = case_match(model, "rmse_full" ~ "Full model", "rmse_step" ~ "Stepwise")) p4 <- ggplot(rmse_long, aes(x = rmse, fill = model)) + geom_density(alpha = 0.65, colour = NA) + scale_fill_manual(values = c("Full model" = col_full, "Stepwise" = col_step)) + labs( title = "📉 Out-of-sample RMSE — the real reckoning", subtitle = "Stepwise's 'lean' model is not more accurate; it's just more confidently wrong", x = "RMSE on held-out test set (n = 200)", y = "Density", fill = NULL ) + theme_tim() p4 ``` ```{r rmse-summary} results %>% summarise( `Stepwise — median RMSE` = round(median(rmse_step), 3), `Full model — median RMSE` = round(median(rmse_full), 3), `Stepwise — median in-sample R²` = round(median(r2_step), 3), `Full model — median in-sample R²` = round(median(r2_full), 3), `Stepwise — median noise vars included` = round(median(false_pos), 1), `Stepwise — median true vars recovered` = round(median(true_pos), 1) ) %>% pivot_longer(everything(), names_to = "Metric", values_to = "Value") %>% gt() %>% tab_header(title = "Summary Across 500 Simulations") %>% fmt_number(columns = Value, decimals = 3) %>% tab_style( style = cell_fill(color = "#fde8e8"), locations = cells_body(rows = str_detect(Metric, "Stepwise")) ) %>% tab_style( style = cell_fill(color = "#e8f8e8"), locations = cells_body(rows = str_detect(Metric, "Full")) ) ``` --- ### 4.4 · The Variable Lottery: Which Noise Variables "Win"? > *The next part is the most disturbing. Of the 17 pure noise variables, how often does each get selected?* ```{r plot-noise-freq} #| fig-cap: "Frequency of each noise variable being selected by stepwise across 500 simulations. These variables have zero relationship with Y." noise_freq <- results %>% unnest(step_vars) %>% filter(step_vars %in% noise_vars) %>% count(step_vars, name = "times_selected") %>% mutate( pct = times_selected / n_sims * 100, step_vars = fct_reorder(step_vars, times_selected) ) p5 <- ggplot(noise_freq, aes(x = pct, y = step_vars)) + geom_col(fill = col_noise, alpha = 0.85) + geom_text(aes(label = glue("{round(pct,0)}%")), hjust = -0.15, size = 3.2, colour = "grey30") + geom_vline(xintercept = 5, linetype = "dashed", colour = "grey50", linewidth = 0.7) + annotate("text", x = 5.5, y = 1, hjust = 0, size = 3, colour = "grey40", label = "α = 0.05\nnominal rate →") + scale_x_continuous(limits = c(0, 55), labels = function(x) paste0(x, "%")) + labs( title = "🎰 Pure noise variables selected by stepwise", subtitle = "Every one of these variables is **completely unrelated to Y** — yet stepwise picks them constantly", x = "% of simulations where variable was selected", y = NULL, caption = "Dashed line = 5% (expected false positive rate under a single honest test)" ) + theme_tim() p5 ``` > **Commentary:** *Every bar to the right of the dashed line is a variable that a researcher would present in a paper as "independently associated with the outcome." The journals would publish it. Future meta-analyses would include it. And it would be zero. Pure chance. This is how noise gets enshrined.* --- ### 4.5 · The Coefficient Instability Catastrophe > *Let's zoom in on one simulation and show how the selected model lies about its own coefficients.* ```{r coef-instability} #| fig-cap: "Estimated coefficient of X1 (true β = 0.6) across 200 replications, stepwise vs. full model. Stepwise biases estimates upward in selected models." # Re-run 200 sims and collect X1 coefficient coef_sims <- map_dfr(1:200, function(seed) { set.seed(seed) X <- matrix(rnorm(n_obs * n_pred), nrow = n_obs, dimnames = list(NULL, paste0("X", 1:n_pred))) eps <- rnorm(n_obs) Y <- beta_true["intercept"] + X[, "X1"] * beta_true["X1"] + X[, "X2"] * beta_true["X2"] + X[, "X3"] * beta_true["X3"] + eps df <- as.data.frame(cbind(Y, X)) full_mod <- lm(Y ~ ., data = df) step_mod <- stepAIC(full_mod, direction = "both", trace = FALSE) step_coefs <- coef(step_mod) full_coefs <- coef(full_mod) tibble( seed = seed, coef_full = full_coefs["X1"], coef_step = if ("X1" %in% names(step_coefs)) step_coefs["X1"] else NA_real_, x1_selected = "X1" %in% names(step_coefs) ) }) coef_long <- coef_sims %>% pivot_longer(c(coef_full, coef_step), names_to = "model", values_to = "estimate") %>% mutate(model = case_match(model, "coef_full" ~ "Full model (always estimates X1)", "coef_step" ~ "Stepwise (only when X1 is selected)")) p6 <- ggplot(coef_long, aes(x = estimate, fill = model)) + geom_density(alpha = 0.65, colour = NA, na.rm = TRUE) + geom_vline(xintercept = 0.6, linetype = "dashed", colour = "black", linewidth = 1) + annotate("text", x = 0.63, y = Inf, vjust = 2, hjust = 0, label = "True β = 0.6", fontface = "bold", size = 3.5) + scale_fill_manual(values = c( "Full model (always estimates X1)" = col_full, "Stepwise (only when X1 is selected)" = col_step )) + labs( title = "Coefficient estimates for X1 (true β = 0.6)", subtitle = "Stepwise only shows X1 when it *looks* big — **selection bias inflates estimates**", x = "Estimated coefficient", y = "Density", fill = NULL, caption = "Stepwise estimates conditioned on selection — classic winner's curse" ) + theme_tim() p6 ``` > **This is the winner's curse in plain sight.** Stepwise only includes X1 when its estimated coefficient crossed whatever threshold the algorithm used. So the reported estimates are *systematically larger than the truth*. Your effect sizes are wrong. Your confidence intervals are wrong. Your power calculations for the next study will be wrong. --- ## 5 · A Side-by-Side Summary ```{r panel-plot} #| fig-height: 10 #| fig-cap: "The full picture. Left column: selection behaviour. Right column: predictive performance." (p1 | p2) / (p3 | p4) + plot_annotation( title = "Stepwise Selection: 500 Simulations, One Verdict", subtitle = "Ground truth: 3 real predictors, 17 noise. Stepwise doesn't know that — and acts accordingly.", theme = theme( plot.title = element_text(face = "bold", size = 15), plot.subtitle = element_text(size = 11, colour = "grey35") ) ) ``` --- ## 6 · What Should You Do Instead? > *Okay. We've seen the crime scene. What's the alternative?* The answer depends on your goal: | Goal | Better approach | |---|---| | **Prediction** | LASSO / Ridge / Elastic Net with cross-validation | | **Inference / explanation** | Theory-driven pre-specification; adjust for all confounders, period | | **Exploratory** | Penalised regression + bootstrap variable importance — report uncertainty honestly | | **Clinical score** | LASSO with optimism-corrected validation (bootstrap or external dataset) | ::: {.callout-warning} ## The fundamental rule **Model selection and inference cannot use the same data without a penalty.** If you used the data to choose variables, your p-values are not what they say they are. Full stop. ::: --- ## 7 · The Closing Argument > *"But everyone does it. The reviewers expect it."* That's an institutional problem, not a statistical defence. The simulation above is not exotic. It uses a perfectly ordinary linear DGP, a reasonable sample size (n = 80), and off-the-shelf stepwise-AIC — the exact setup taught in most regression courses and implemented in most analysis pipelines. And it produces: - **False positives** at rates far above 5% - **Biased coefficient estimates** - **Inflated in-sample R²** that doesn't survive contact with new data - **Inconsistent variable selection** — run the same study twice, get a different "significant" model Stepwise selection is not data-driven. It is **noise-driven**, dressed up in the clothes of rigor. ---