The Stepwise Trap: How ‘Data-Driven’ Model Selection Lies to You

A Simulation Study with Commentary

Author

Timothy Achala

Published

June 19, 2026


1 · The Setup: A Confession Before the Crime

“I trust the algorithm. It picks only the significant variables.” — Every junior analyst, at least once.

There is something seductive about stepwise selection. You hand it your messy dataset — 25 variables, no theory, a deadline — and it hands you back a clean, parsimonious model with nothing but asterisks. What’s not to love?

Everything. Let’s show you exactly why, with numbers you can’t argue with.

We will simulate a world where we know the truth: only 3 out of 20 predictors actually matter. Then we’ll watch stepwise selection embarrass itself — repeatedly — across 500 independent samples from that same truth.


2 · Simulation Design

Code
n_obs      <- 80    # small-ish — realistic clinical/epi sample
n_pred     <- 20    # 20 candidate predictors
n_true     <- 3     # only 3 are real
n_sims     <- 500   # repetitions

true_vars  <- paste0("X", 1:n_true)   # X1, X2, X3
noise_vars <- paste0("X", (n_true+1):n_pred)  # X4 … X20

# True coefficients (known only to us, the simulators / gods of this world)
beta_true  <- c(intercept = 1, X1 = 0.6, X2 = -0.4, X3 = 0.8)
# Everything else = 0. Silence. Noise.
Note

Ground truth: \(Y = 1 + 0.6X_1 - 0.4X_2 + 0.8X_3 + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, 1)\)

\(X_4, X_5, \ldots, X_{20}\) are pure noise — zero relationship with \(Y\). Stepwise doesn’t know that. Let’s see what it does.


3 · Running the Experiment (500 Times)

Code
simulate_one <- function(seed) {
  set.seed(seed)
  
  # Generate data
  X   <- matrix(rnorm(n_obs * n_pred), nrow = n_obs,
                dimnames = list(NULL, paste0("X", 1:n_pred)))
  eps <- rnorm(n_obs)
  Y   <- beta_true["intercept"] +
         X[, "X1"] * beta_true["X1"] +
         X[, "X2"] * beta_true["X2"] +
         X[, "X3"] * beta_true["X3"] + eps
  
  df <- as.data.frame(cbind(Y, X))
  
  # ── Full model ─────────────────────────────────────────────────────────────
  full_mod <- lm(Y ~ ., data = df)
  
  # ── Stepwise (both directions, AIC) ────────────────────────────────────────
  step_mod <- stepAIC(full_mod, direction = "both", trace = FALSE)
  step_vars <- names(coef(step_mod))[-1]   # drop intercept
  
  # ── Metrics ────────────────────────────────────────────────────────────────
  # In-sample R²
  r2_full <- summary(full_mod)$r.squared
  r2_step <- summary(step_mod)$r.squared
  
  # Out-of-sample RMSE on a fresh test set (same DGP)
  X_test  <- matrix(rnorm(200 * n_pred), nrow = 200,
                    dimnames = list(NULL, paste0("X", 1:n_pred)))
  eps_test <- rnorm(200)
  Y_test  <- beta_true["intercept"] +
             X_test[, "X1"] * beta_true["X1"] +
             X_test[, "X2"] * beta_true["X2"] +
             X_test[, "X3"] * beta_true["X3"] + eps_test
  
  df_test <- as.data.frame(cbind(Y = Y_test, X_test))
  
  rmse_full <- sqrt(mean((Y_test - predict(full_mod, df_test))^2))
  rmse_step <- sqrt(mean((Y_test - predict(step_mod, df_test))^2))
  
  # Variable selection accuracy
  true_pos   <- sum(true_vars %in% step_vars)  # real vars found
  false_pos  <- sum(step_vars %in% noise_vars) # noise vars included
  
  tibble(
    sim       = seed,
    n_step    = length(step_vars),
    true_pos  = true_pos,
    false_pos = false_pos,
    r2_full   = r2_full,
    r2_step   = r2_step,
    rmse_full = rmse_full,
    rmse_step = rmse_step,
    step_vars = list(step_vars)
  )
}

results <- map_dfr(1:n_sims, simulate_one)

4 · The Evidence

4.1 · How Many Noise Variables Does Stepwise Drag In?

Code
fp_summary <- results %>%
  count(false_pos) %>%
  mutate(pct = n / n_sims * 100)

p1 <- ggplot(results, aes(x = false_pos)) +
  geom_histogram(binwidth = 1, fill = col_noise, colour = "white",
                 alpha = 0.85) +
  geom_vline(xintercept = 0, linetype = "dashed", colour = col_true, linewidth = 1) +
  annotate("text", x = 0.3, y = Inf, vjust = 1.8, hjust = 0,
           colour = col_true, fontface = "bold", size = 3.8,
           label = "← Correct answer:\n   0 noise variables") +
  labs(
    title = "🚨 Stepwise keeps noise — over and over",
    subtitle = glue("Across {n_sims} simulations: median **{median(results$false_pos)}** false-positive noise variables selected per run"),
    x = "Number of noise variables retained by stepwise",
    y = "Count (out of 500 simulations)",
    caption = "n = 80 obs, 20 predictors (3 real + 17 noise), stepwise AIC both-direction"
  ) +
  theme_tim()

p1

Each simulation used 20 predictors, only 3 of which were real. The red distribution shows how many noise variables stepwise selection dragged into the final model.

Commentary: The correct answer is zero noise variables. Stepwise rarely gets that right. Most runs include 2–5 pure noise variables in the final model — and then reports them with p-values that look totally respectable. That’s not discovery. That’s statistical confabulation.


4.2 · Does It Even Find the True Variables?

Code
tp_tbl <- results %>%
  count(true_pos) %>%
  mutate(pct = round(n / n_sims * 100, 1),
         label = glue("{pct}%"))

p2 <- ggplot(results, aes(x = true_pos)) +
  geom_histogram(binwidth = 1, fill = col_true, colour = "white", alpha = 0.85) +
  scale_x_continuous(breaks = 0:3) +
  labs(
    title = "True variables found by stepwise",
    subtitle = "3 real predictors exist — how many does stepwise recover?",
    x = "Number of real predictors correctly retained",
    y = "Count (out of 500 simulations)",
    caption = "Stepwise finds all 3 real vars in only a fraction of runs"
  ) +
  theme_tim()

p2

Distribution of true positives (real predictors recovered) across 500 simulations.
Code
tp_tbl %>%
  gt() %>%
  cols_label(true_pos = "Real vars found", n = "Simulations", pct = "%", label = "Share") %>%
  tab_header(title = "True Positive Recovery Rate") %>%
  tab_style(style = cell_fill(color = "#d6eaf8"),
            locations = cells_body(rows = true_pos == 3)) %>%
  fmt_number(columns = pct, decimals = 1)
True Positive Recovery Rate
Real vars found Simulations % Share
2 11 2.2 2.2%
3 489 97.8 97.8%

4.3 · The R² Inflation Illusion

“But the model has an R² of 0.48 — surely that means something?”

Code
r2_long <- results %>%
  dplyr::select(sim, r2_full, r2_step) %>%
  pivot_longer(-sim, names_to = "model", values_to = "r2") %>%
  mutate(model = case_match(model,
                            "r2_full" ~ "Full model",
                            "r2_step" ~ "Stepwise"))

p3 <- ggplot(r2_long, aes(x = r2, fill = model)) +
  geom_density(alpha = 0.65, colour = NA) +
  scale_fill_manual(values = c("Full model" = col_full, "Stepwise" = col_step)) +
  labs(
    title = "📈 Stepwise claims higher R² — but it's a mirage",
    subtitle = "In-sample R² is inflated by variable-hunting; it won't replicate",
    x = "In-sample R²",
    y = "Density",
    fill = NULL
  ) +
  theme_tim()

p3

Stepwise inflates in-sample R² (optimism), then pays the price out-of-sample.
Code
rmse_long <- results %>%
  dplyr::select(sim, rmse_full, rmse_step) %>%
  pivot_longer(-sim, names_to = "model", values_to = "rmse") %>%
  mutate(model = case_match(model,
                            "rmse_full" ~ "Full model",
                            "rmse_step" ~ "Stepwise"))

p4 <- ggplot(rmse_long, aes(x = rmse, fill = model)) +
  geom_density(alpha = 0.65, colour = NA) +
  scale_fill_manual(values = c("Full model" = col_full, "Stepwise" = col_step)) +
  labs(
    title = "📉 Out-of-sample RMSE — the real reckoning",
    subtitle = "Stepwise's 'lean' model is not more accurate; it's just more confidently wrong",
    x = "RMSE on held-out test set (n = 200)",
    y = "Density",
    fill = NULL
  ) +
  theme_tim()

p4

Out-of-sample prediction error: who actually generalises better?
Code
results %>%
  summarise(
    `Stepwise — median RMSE` = round(median(rmse_step), 3),
    `Full model — median RMSE` = round(median(rmse_full), 3),
    `Stepwise — median in-sample R²` = round(median(r2_step), 3),
    `Full model — median in-sample R²` = round(median(r2_full), 3),
    `Stepwise — median noise vars included` = round(median(false_pos), 1),
    `Stepwise — median true vars recovered` = round(median(true_pos), 1)
  ) %>%
  pivot_longer(everything(), names_to = "Metric", values_to = "Value") %>%
  gt() %>%
  tab_header(title = "Summary Across 500 Simulations") %>%
  fmt_number(columns = Value, decimals = 3) %>%
  tab_style(
    style = cell_fill(color = "#fde8e8"),
    locations = cells_body(rows = str_detect(Metric, "Stepwise"))
  ) %>%
  tab_style(
    style = cell_fill(color = "#e8f8e8"),
    locations = cells_body(rows = str_detect(Metric, "Full"))
  )
Summary Across 500 Simulations
Metric Value
Stepwise — median RMSE 1.104
Full model — median RMSE 1.157
Stepwise — median in-sample R² 0.618
Full model — median in-sample R² 0.652
Stepwise — median noise vars included 4.000
Stepwise — median true vars recovered 3.000

4.4 · The Variable Lottery: Which Noise Variables “Win”?

The next part is the most disturbing. Of the 17 pure noise variables, how often does each get selected?

Code
noise_freq <- results %>%
  unnest(step_vars) %>%
  filter(step_vars %in% noise_vars) %>%
  count(step_vars, name = "times_selected") %>%
  mutate(
    pct       = times_selected / n_sims * 100,
    step_vars = fct_reorder(step_vars, times_selected)
  )

p5 <- ggplot(noise_freq, aes(x = pct, y = step_vars)) +
  geom_col(fill = col_noise, alpha = 0.85) +
  geom_text(aes(label = glue("{round(pct,0)}%")),
            hjust = -0.15, size = 3.2, colour = "grey30") +
  geom_vline(xintercept = 5, linetype = "dashed",
             colour = "grey50", linewidth = 0.7) +
  annotate("text", x = 5.5, y = 1, hjust = 0, size = 3,
           colour = "grey40", label = "α = 0.05\nnominal rate →") +
  scale_x_continuous(limits = c(0, 55),
                     labels = function(x) paste0(x, "%")) +
  labs(
    title = "🎰 Pure noise variables selected by stepwise",
    subtitle = "Every one of these variables is **completely unrelated to Y** — yet stepwise picks them constantly",
    x = "% of simulations where variable was selected",
    y = NULL,
    caption = "Dashed line = 5% (expected false positive rate under a single honest test)"
  ) +
  theme_tim()

p5

Frequency of each noise variable being selected by stepwise across 500 simulations. These variables have zero relationship with Y.

Commentary: Every bar to the right of the dashed line is a variable that a researcher would present in a paper as “independently associated with the outcome.” The journals would publish it. Future meta-analyses would include it. And it would be zero. Pure chance. This is how noise gets enshrined.


4.5 · The Coefficient Instability Catastrophe

Let’s zoom in on one simulation and show how the selected model lies about its own coefficients.

Code
# Re-run 200 sims and collect X1 coefficient
coef_sims <- map_dfr(1:200, function(seed) {
  set.seed(seed)
  X   <- matrix(rnorm(n_obs * n_pred), nrow = n_obs,
                dimnames = list(NULL, paste0("X", 1:n_pred)))
  eps <- rnorm(n_obs)
  Y   <- beta_true["intercept"] +
         X[, "X1"] * beta_true["X1"] +
         X[, "X2"] * beta_true["X2"] +
         X[, "X3"] * beta_true["X3"] + eps
  df  <- as.data.frame(cbind(Y, X))
  
  full_mod <- lm(Y ~ ., data = df)
  step_mod <- stepAIC(full_mod, direction = "both", trace = FALSE)
  
  step_coefs <- coef(step_mod)
  full_coefs <- coef(full_mod)
  
  tibble(
    seed      = seed,
    coef_full = full_coefs["X1"],
    coef_step = if ("X1" %in% names(step_coefs)) step_coefs["X1"] else NA_real_,
    x1_selected = "X1" %in% names(step_coefs)
  )
})

coef_long <- coef_sims %>%
  pivot_longer(c(coef_full, coef_step), names_to = "model", values_to = "estimate") %>%
  mutate(model = case_match(model,
                            "coef_full" ~ "Full model (always estimates X1)",
                            "coef_step" ~ "Stepwise (only when X1 is selected)"))

p6 <- ggplot(coef_long, aes(x = estimate, fill = model)) +
  geom_density(alpha = 0.65, colour = NA, na.rm = TRUE) +
  geom_vline(xintercept = 0.6, linetype = "dashed",
             colour = "black", linewidth = 1) +
  annotate("text", x = 0.63, y = Inf, vjust = 2, hjust = 0,
           label = "True β = 0.6", fontface = "bold", size = 3.5) +
  scale_fill_manual(values = c(
    "Full model (always estimates X1)"     = col_full,
    "Stepwise (only when X1 is selected)"  = col_step
  )) +
  labs(
    title = "Coefficient estimates for X1 (true β = 0.6)",
    subtitle = "Stepwise only shows X1 when it *looks* big — **selection bias inflates estimates**",
    x = "Estimated coefficient",
    y = "Density",
    fill = NULL,
    caption = "Stepwise estimates conditioned on selection — classic winner's curse"
  ) +
  theme_tim()

p6

Estimated coefficient of X1 (true β = 0.6) across 200 replications, stepwise vs. full model. Stepwise biases estimates upward in selected models.

This is the winner’s curse in plain sight. Stepwise only includes X1 when its estimated coefficient crossed whatever threshold the algorithm used. So the reported estimates are systematically larger than the truth. Your effect sizes are wrong. Your confidence intervals are wrong. Your power calculations for the next study will be wrong.


5 · A Side-by-Side Summary

Code
(p1 | p2) / (p3 | p4) +
  plot_annotation(
    title    = "Stepwise Selection: 500 Simulations, One Verdict",
    subtitle = "Ground truth: 3 real predictors, 17 noise. Stepwise doesn't know that — and acts accordingly.",
    theme    = theme(
      plot.title    = element_text(face = "bold", size = 15),
      plot.subtitle = element_text(size = 11, colour = "grey35")
    )
  )

The full picture. Left column: selection behaviour. Right column: predictive performance.

6 · What Should You Do Instead?

Okay. We’ve seen the crime scene. What’s the alternative?

The answer depends on your goal:

Goal Better approach
Prediction LASSO / Ridge / Elastic Net with cross-validation
Inference / explanation Theory-driven pre-specification; adjust for all confounders, period
Exploratory Penalised regression + bootstrap variable importance — report uncertainty honestly
Clinical score LASSO with optimism-corrected validation (bootstrap or external dataset)
The fundamental rule

Model selection and inference cannot use the same data without a penalty. If you used the data to choose variables, your p-values are not what they say they are. Full stop.


7 · The Closing Argument

“But everyone does it. The reviewers expect it.”

That’s an institutional problem, not a statistical defence.

The simulation above is not exotic. It uses a perfectly ordinary linear DGP, a reasonable sample size (n = 80), and off-the-shelf stepwise-AIC — the exact setup taught in most regression courses and implemented in most analysis pipelines. And it produces:

  • False positives at rates far above 5%
  • Biased coefficient estimates
  • Inflated in-sample R² that doesn’t survive contact with new data
  • Inconsistent variable selection — run the same study twice, get a different “significant” model

Stepwise selection is not data-driven. It is noise-driven, dressed up in the clothes of rigor.