“I trust the algorithm. It picks only the significant variables.” — Every junior analyst, at least once.
There is something seductive about stepwise selection. You hand it your messy dataset — 25 variables, no theory, a deadline — and it hands you back a clean, parsimonious model with nothing but asterisks. What’s not to love?
Everything. Let’s show you exactly why, with numbers you can’t argue with.
We will simulate a world where we know the truth: only 3 out of 20 predictors actually matter. Then we’ll watch stepwise selection embarrass itself — repeatedly — across 500 independent samples from that same truth.
2 · Simulation Design
Code
n_obs <-80# small-ish — realistic clinical/epi samplen_pred <-20# 20 candidate predictorsn_true <-3# only 3 are realn_sims <-500# repetitionstrue_vars <-paste0("X", 1:n_true) # X1, X2, X3noise_vars <-paste0("X", (n_true+1):n_pred) # X4 … X20# True coefficients (known only to us, the simulators / gods of this world)beta_true <-c(intercept =1, X1 =0.6, X2 =-0.4, X3 =0.8)# Everything else = 0. Silence. Noise.
4.1 · How Many Noise Variables Does Stepwise Drag In?
Code
fp_summary <- results %>%count(false_pos) %>%mutate(pct = n / n_sims *100)p1 <-ggplot(results, aes(x = false_pos)) +geom_histogram(binwidth =1, fill = col_noise, colour ="white",alpha =0.85) +geom_vline(xintercept =0, linetype ="dashed", colour = col_true, linewidth =1) +annotate("text", x =0.3, y =Inf, vjust =1.8, hjust =0,colour = col_true, fontface ="bold", size =3.8,label ="← Correct answer:\n 0 noise variables") +labs(title ="🚨 Stepwise keeps noise — over and over",subtitle =glue("Across {n_sims} simulations: median **{median(results$false_pos)}** false-positive noise variables selected per run"),x ="Number of noise variables retained by stepwise",y ="Count (out of 500 simulations)",caption ="n = 80 obs, 20 predictors (3 real + 17 noise), stepwise AIC both-direction" ) +theme_tim()p1
Each simulation used 20 predictors, only 3 of which were real. The red distribution shows how many noise variables stepwise selection dragged into the final model.
Commentary:The correct answer is zero noise variables. Stepwise rarely gets that right. Most runs include 2–5 pure noise variables in the final model — and then reports them with p-values that look totally respectable. That’s not discovery. That’s statistical confabulation.
4.2 · Does It Even Find the True Variables?
Code
tp_tbl <- results %>%count(true_pos) %>%mutate(pct =round(n / n_sims *100, 1),label =glue("{pct}%"))p2 <-ggplot(results, aes(x = true_pos)) +geom_histogram(binwidth =1, fill = col_true, colour ="white", alpha =0.85) +scale_x_continuous(breaks =0:3) +labs(title ="True variables found by stepwise",subtitle ="3 real predictors exist — how many does stepwise recover?",x ="Number of real predictors correctly retained",y ="Count (out of 500 simulations)",caption ="Stepwise finds all 3 real vars in only a fraction of runs" ) +theme_tim()p2
Distribution of true positives (real predictors recovered) across 500 simulations.
“But the model has an R² of 0.48 — surely that means something?”
Code
r2_long <- results %>% dplyr::select(sim, r2_full, r2_step) %>%pivot_longer(-sim, names_to ="model", values_to ="r2") %>%mutate(model =case_match(model,"r2_full"~"Full model","r2_step"~"Stepwise"))p3 <-ggplot(r2_long, aes(x = r2, fill = model)) +geom_density(alpha =0.65, colour =NA) +scale_fill_manual(values =c("Full model"= col_full, "Stepwise"= col_step)) +labs(title ="📈 Stepwise claims higher R² — but it's a mirage",subtitle ="In-sample R² is inflated by variable-hunting; it won't replicate",x ="In-sample R²",y ="Density",fill =NULL ) +theme_tim()p3
Stepwise inflates in-sample R² (optimism), then pays the price out-of-sample.
Code
rmse_long <- results %>% dplyr::select(sim, rmse_full, rmse_step) %>%pivot_longer(-sim, names_to ="model", values_to ="rmse") %>%mutate(model =case_match(model,"rmse_full"~"Full model","rmse_step"~"Stepwise"))p4 <-ggplot(rmse_long, aes(x = rmse, fill = model)) +geom_density(alpha =0.65, colour =NA) +scale_fill_manual(values =c("Full model"= col_full, "Stepwise"= col_step)) +labs(title ="📉 Out-of-sample RMSE — the real reckoning",subtitle ="Stepwise's 'lean' model is not more accurate; it's just more confidently wrong",x ="RMSE on held-out test set (n = 200)",y ="Density",fill =NULL ) +theme_tim()p4
Out-of-sample prediction error: who actually generalises better?
Code
results %>%summarise(`Stepwise — median RMSE`=round(median(rmse_step), 3),`Full model — median RMSE`=round(median(rmse_full), 3),`Stepwise — median in-sample R²`=round(median(r2_step), 3),`Full model — median in-sample R²`=round(median(r2_full), 3),`Stepwise — median noise vars included`=round(median(false_pos), 1),`Stepwise — median true vars recovered`=round(median(true_pos), 1) ) %>%pivot_longer(everything(), names_to ="Metric", values_to ="Value") %>%gt() %>%tab_header(title ="Summary Across 500 Simulations") %>%fmt_number(columns = Value, decimals =3) %>%tab_style(style =cell_fill(color ="#fde8e8"),locations =cells_body(rows =str_detect(Metric, "Stepwise")) ) %>%tab_style(style =cell_fill(color ="#e8f8e8"),locations =cells_body(rows =str_detect(Metric, "Full")) )
Summary Across 500 Simulations
Metric
Value
Stepwise — median RMSE
1.104
Full model — median RMSE
1.157
Stepwise — median in-sample R²
0.618
Full model — median in-sample R²
0.652
Stepwise — median noise vars included
4.000
Stepwise — median true vars recovered
3.000
4.4 · The Variable Lottery: Which Noise Variables “Win”?
The next part is the most disturbing. Of the 17 pure noise variables, how often does each get selected?
Code
noise_freq <- results %>%unnest(step_vars) %>%filter(step_vars %in% noise_vars) %>%count(step_vars, name ="times_selected") %>%mutate(pct = times_selected / n_sims *100,step_vars =fct_reorder(step_vars, times_selected) )p5 <-ggplot(noise_freq, aes(x = pct, y = step_vars)) +geom_col(fill = col_noise, alpha =0.85) +geom_text(aes(label =glue("{round(pct,0)}%")),hjust =-0.15, size =3.2, colour ="grey30") +geom_vline(xintercept =5, linetype ="dashed",colour ="grey50", linewidth =0.7) +annotate("text", x =5.5, y =1, hjust =0, size =3,colour ="grey40", label ="α = 0.05\nnominal rate →") +scale_x_continuous(limits =c(0, 55),labels =function(x) paste0(x, "%")) +labs(title ="🎰 Pure noise variables selected by stepwise",subtitle ="Every one of these variables is **completely unrelated to Y** — yet stepwise picks them constantly",x ="% of simulations where variable was selected",y =NULL,caption ="Dashed line = 5% (expected false positive rate under a single honest test)" ) +theme_tim()p5
Frequency of each noise variable being selected by stepwise across 500 simulations. These variables have zero relationship with Y.
Commentary:Every bar to the right of the dashed line is a variable that a researcher would present in a paper as “independently associated with the outcome.” The journals would publish it. Future meta-analyses would include it. And it would be zero. Pure chance. This is how noise gets enshrined.
4.5 · The Coefficient Instability Catastrophe
Let’s zoom in on one simulation and show how the selected model lies about its own coefficients.
Code
# Re-run 200 sims and collect X1 coefficientcoef_sims <-map_dfr(1:200, function(seed) {set.seed(seed) X <-matrix(rnorm(n_obs * n_pred), nrow = n_obs,dimnames =list(NULL, paste0("X", 1:n_pred))) eps <-rnorm(n_obs) Y <- beta_true["intercept"] + X[, "X1"] * beta_true["X1"] + X[, "X2"] * beta_true["X2"] + X[, "X3"] * beta_true["X3"] + eps df <-as.data.frame(cbind(Y, X)) full_mod <-lm(Y ~ ., data = df) step_mod <-stepAIC(full_mod, direction ="both", trace =FALSE) step_coefs <-coef(step_mod) full_coefs <-coef(full_mod)tibble(seed = seed,coef_full = full_coefs["X1"],coef_step =if ("X1"%in%names(step_coefs)) step_coefs["X1"] elseNA_real_,x1_selected ="X1"%in%names(step_coefs) )})coef_long <- coef_sims %>%pivot_longer(c(coef_full, coef_step), names_to ="model", values_to ="estimate") %>%mutate(model =case_match(model,"coef_full"~"Full model (always estimates X1)","coef_step"~"Stepwise (only when X1 is selected)"))p6 <-ggplot(coef_long, aes(x = estimate, fill = model)) +geom_density(alpha =0.65, colour =NA, na.rm =TRUE) +geom_vline(xintercept =0.6, linetype ="dashed",colour ="black", linewidth =1) +annotate("text", x =0.63, y =Inf, vjust =2, hjust =0,label ="True β = 0.6", fontface ="bold", size =3.5) +scale_fill_manual(values =c("Full model (always estimates X1)"= col_full,"Stepwise (only when X1 is selected)"= col_step )) +labs(title ="Coefficient estimates for X1 (true β = 0.6)",subtitle ="Stepwise only shows X1 when it *looks* big — **selection bias inflates estimates**",x ="Estimated coefficient",y ="Density",fill =NULL,caption ="Stepwise estimates conditioned on selection — classic winner's curse" ) +theme_tim()p6
Estimated coefficient of X1 (true β = 0.6) across 200 replications, stepwise vs. full model. Stepwise biases estimates upward in selected models.
This is the winner’s curse in plain sight. Stepwise only includes X1 when its estimated coefficient crossed whatever threshold the algorithm used. So the reported estimates are systematically larger than the truth. Your effect sizes are wrong. Your confidence intervals are wrong. Your power calculations for the next study will be wrong.
5 · A Side-by-Side Summary
Code
(p1 | p2) / (p3 | p4) +plot_annotation(title ="Stepwise Selection: 500 Simulations, One Verdict",subtitle ="Ground truth: 3 real predictors, 17 noise. Stepwise doesn't know that — and acts accordingly.",theme =theme(plot.title =element_text(face ="bold", size =15),plot.subtitle =element_text(size =11, colour ="grey35") ) )
The full picture. Left column: selection behaviour. Right column: predictive performance.
6 · What Should You Do Instead?
Okay. We’ve seen the crime scene. What’s the alternative?
The answer depends on your goal:
Goal
Better approach
Prediction
LASSO / Ridge / Elastic Net with cross-validation
Inference / explanation
Theory-driven pre-specification; adjust for all confounders, period
LASSO with optimism-corrected validation (bootstrap or external dataset)
The fundamental rule
Model selection and inference cannot use the same data without a penalty. If you used the data to choose variables, your p-values are not what they say they are. Full stop.
7 · The Closing Argument
“But everyone does it. The reviewers expect it.”
That’s an institutional problem, not a statistical defence.
The simulation above is not exotic. It uses a perfectly ordinary linear DGP, a reasonable sample size (n = 80), and off-the-shelf stepwise-AIC — the exact setup taught in most regression courses and implemented in most analysis pipelines. And it produces:
False positives at rates far above 5%
Biased coefficient estimates
Inflated in-sample R² that doesn’t survive contact with new data
Inconsistent variable selection — run the same study twice, get a different “significant” model
Stepwise selection is not data-driven. It is noise-driven, dressed up in the clothes of rigor.
Source Code
---title: "The Stepwise Trap: How 'Data-Driven' Model Selection Lies to You"subtitle: "A Simulation Study with Commentary"author: "Timothy Achala"date: todayformat: html: toc: true toc-depth: 3 toc-title: "Contents" code-fold: true code-tools: true theme: flatly highlight-style: github self-contained: true fig-width: 9 fig-height: 5.5 fig-dpi: 150 embed-resources: trueexecute: echo: true warning: false message: false cache: false---```{r setup}#| include: falselibrary(tidyverse)library(MASS) # stepAIClibrary(gt)library(patchwork)library(ggtext)library(glue)set.seed(2024)# ── palette ──────────────────────────────────────────────────────────────────col_true <-"#2166ac"# blue – true predictorscol_noise <-"#d6604d"# red – noise predictorscol_step <-"#b2182b"# dark red – stepwise monstercol_full <-"#4dac26"# green – full modelcol_bg <-"#f8f9fa"theme_tim <-function() {theme_minimal(base_size =12, base_family ="sans") +theme(plot.background =element_rect(fill = col_bg, colour =NA),panel.background =element_rect(fill = col_bg, colour =NA),plot.title =element_markdown(face ="bold", size =14),plot.subtitle =element_markdown(size =11, colour ="grey35"),plot.caption =element_markdown(size =8.5, colour ="grey50"),axis.title =element_text(size =11),legend.position ="bottom" )}```---## 1 · The Setup: A Confession Before the Crime> *"I trust the algorithm. It picks only the significant variables."*> — Every junior analyst, at least once.There is something seductive about stepwise selection. You hand it your messy dataset — 25 variables, no theory, a deadline — and it hands you back a clean, parsimonious model with nothing but asterisks. What's not to love?**Everything.** Let's show you exactly why, with numbers you can't argue with.We will simulate a world where **we know the truth**: only 3 out of 20 predictors actually matter. Then we'll watch stepwise selection embarrass itself — repeatedly — across 500 independent samples from that same truth.---## 2 · Simulation Design```{r sim-params}n_obs <-80# small-ish — realistic clinical/epi samplen_pred <-20# 20 candidate predictorsn_true <-3# only 3 are realn_sims <-500# repetitionstrue_vars <-paste0("X", 1:n_true) # X1, X2, X3noise_vars <-paste0("X", (n_true+1):n_pred) # X4 … X20# True coefficients (known only to us, the simulators / gods of this world)beta_true <-c(intercept =1, X1 =0.6, X2 =-0.4, X3 =0.8)# Everything else = 0. Silence. Noise.```::: {.callout-note}**Ground truth:** $Y = 1 + 0.6X_1 - 0.4X_2 + 0.8X_3 + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, 1)$$X_4, X_5, \ldots, X_{20}$ are **pure noise** — zero relationship with $Y$. Stepwise doesn't know that. Let's see what it does.:::---## 3 · Running the Experiment (500 Times)```{r run-simulation}simulate_one <-function(seed) {set.seed(seed)# Generate data X <-matrix(rnorm(n_obs * n_pred), nrow = n_obs,dimnames =list(NULL, paste0("X", 1:n_pred))) eps <-rnorm(n_obs) Y <- beta_true["intercept"] + X[, "X1"] * beta_true["X1"] + X[, "X2"] * beta_true["X2"] + X[, "X3"] * beta_true["X3"] + eps df <-as.data.frame(cbind(Y, X))# ── Full model ───────────────────────────────────────────────────────────── full_mod <-lm(Y ~ ., data = df)# ── Stepwise (both directions, AIC) ──────────────────────────────────────── step_mod <-stepAIC(full_mod, direction ="both", trace =FALSE) step_vars <-names(coef(step_mod))[-1] # drop intercept# ── Metrics ────────────────────────────────────────────────────────────────# In-sample R² r2_full <-summary(full_mod)$r.squared r2_step <-summary(step_mod)$r.squared# Out-of-sample RMSE on a fresh test set (same DGP) X_test <-matrix(rnorm(200* n_pred), nrow =200,dimnames =list(NULL, paste0("X", 1:n_pred))) eps_test <-rnorm(200) Y_test <- beta_true["intercept"] + X_test[, "X1"] * beta_true["X1"] + X_test[, "X2"] * beta_true["X2"] + X_test[, "X3"] * beta_true["X3"] + eps_test df_test <-as.data.frame(cbind(Y = Y_test, X_test)) rmse_full <-sqrt(mean((Y_test -predict(full_mod, df_test))^2)) rmse_step <-sqrt(mean((Y_test -predict(step_mod, df_test))^2))# Variable selection accuracy true_pos <-sum(true_vars %in% step_vars) # real vars found false_pos <-sum(step_vars %in% noise_vars) # noise vars includedtibble(sim = seed,n_step =length(step_vars),true_pos = true_pos,false_pos = false_pos,r2_full = r2_full,r2_step = r2_step,rmse_full = rmse_full,rmse_step = rmse_step,step_vars =list(step_vars) )}results <-map_dfr(1:n_sims, simulate_one)```---## 4 · The Evidence### 4.1 · How Many Noise Variables Does Stepwise Drag In?```{r plot-false-pos}#| fig-cap: "Each simulation used 20 predictors, only 3 of which were real. The red distribution shows how many *noise* variables stepwise selection dragged into the final model."fp_summary <- results %>%count(false_pos) %>%mutate(pct = n / n_sims *100)p1 <-ggplot(results, aes(x = false_pos)) +geom_histogram(binwidth =1, fill = col_noise, colour ="white",alpha =0.85) +geom_vline(xintercept =0, linetype ="dashed", colour = col_true, linewidth =1) +annotate("text", x =0.3, y =Inf, vjust =1.8, hjust =0,colour = col_true, fontface ="bold", size =3.8,label ="← Correct answer:\n 0 noise variables") +labs(title ="🚨 Stepwise keeps noise — over and over",subtitle =glue("Across {n_sims} simulations: median **{median(results$false_pos)}** false-positive noise variables selected per run"),x ="Number of noise variables retained by stepwise",y ="Count (out of 500 simulations)",caption ="n = 80 obs, 20 predictors (3 real + 17 noise), stepwise AIC both-direction" ) +theme_tim()p1```> **Commentary:** *The correct answer is zero noise variables. Stepwise rarely gets that right. Most runs include 2–5 pure noise variables in the final model — and then reports them with p-values that look totally respectable. That's not discovery. That's statistical confabulation.*---### 4.2 · Does It Even Find the True Variables?```{r plot-true-pos}#| fig-cap: "Distribution of true positives (real predictors recovered) across 500 simulations."tp_tbl <- results %>%count(true_pos) %>%mutate(pct =round(n / n_sims *100, 1),label =glue("{pct}%"))p2 <-ggplot(results, aes(x = true_pos)) +geom_histogram(binwidth =1, fill = col_true, colour ="white", alpha =0.85) +scale_x_continuous(breaks =0:3) +labs(title ="True variables found by stepwise",subtitle ="3 real predictors exist — how many does stepwise recover?",x ="Number of real predictors correctly retained",y ="Count (out of 500 simulations)",caption ="Stepwise finds all 3 real vars in only a fraction of runs" ) +theme_tim()p2``````{r tp-table}tp_tbl %>%gt() %>%cols_label(true_pos ="Real vars found", n ="Simulations", pct ="%", label ="Share") %>%tab_header(title ="True Positive Recovery Rate") %>%tab_style(style =cell_fill(color ="#d6eaf8"),locations =cells_body(rows = true_pos ==3)) %>%fmt_number(columns = pct, decimals =1)```---### 4.3 · The R² Inflation Illusion> *"But the model has an R² of 0.48 — surely that means something?"*```{r plot-r2}#| fig-cap: "Stepwise inflates in-sample R² (optimism), then pays the price out-of-sample."r2_long <- results %>% dplyr::select(sim, r2_full, r2_step) %>%pivot_longer(-sim, names_to ="model", values_to ="r2") %>%mutate(model =case_match(model,"r2_full"~"Full model","r2_step"~"Stepwise"))p3 <-ggplot(r2_long, aes(x = r2, fill = model)) +geom_density(alpha =0.65, colour =NA) +scale_fill_manual(values =c("Full model"= col_full, "Stepwise"= col_step)) +labs(title ="📈 Stepwise claims higher R² — but it's a mirage",subtitle ="In-sample R² is inflated by variable-hunting; it won't replicate",x ="In-sample R²",y ="Density",fill =NULL ) +theme_tim()p3``````{r plot-rmse}#| fig-cap: "Out-of-sample prediction error: who actually generalises better?"rmse_long <- results %>% dplyr::select(sim, rmse_full, rmse_step) %>%pivot_longer(-sim, names_to ="model", values_to ="rmse") %>%mutate(model =case_match(model,"rmse_full"~"Full model","rmse_step"~"Stepwise"))p4 <-ggplot(rmse_long, aes(x = rmse, fill = model)) +geom_density(alpha =0.65, colour =NA) +scale_fill_manual(values =c("Full model"= col_full, "Stepwise"= col_step)) +labs(title ="📉 Out-of-sample RMSE — the real reckoning",subtitle ="Stepwise's 'lean' model is not more accurate; it's just more confidently wrong",x ="RMSE on held-out test set (n = 200)",y ="Density",fill =NULL ) +theme_tim()p4``````{r rmse-summary}results %>%summarise(`Stepwise — median RMSE`=round(median(rmse_step), 3),`Full model — median RMSE`=round(median(rmse_full), 3),`Stepwise — median in-sample R²`=round(median(r2_step), 3),`Full model — median in-sample R²`=round(median(r2_full), 3),`Stepwise — median noise vars included`=round(median(false_pos), 1),`Stepwise — median true vars recovered`=round(median(true_pos), 1) ) %>%pivot_longer(everything(), names_to ="Metric", values_to ="Value") %>%gt() %>%tab_header(title ="Summary Across 500 Simulations") %>%fmt_number(columns = Value, decimals =3) %>%tab_style(style =cell_fill(color ="#fde8e8"),locations =cells_body(rows =str_detect(Metric, "Stepwise")) ) %>%tab_style(style =cell_fill(color ="#e8f8e8"),locations =cells_body(rows =str_detect(Metric, "Full")) )```---### 4.4 · The Variable Lottery: Which Noise Variables "Win"?> *The next part is the most disturbing. Of the 17 pure noise variables, how often does each get selected?*```{r plot-noise-freq}#| fig-cap: "Frequency of each noise variable being selected by stepwise across 500 simulations. These variables have zero relationship with Y."noise_freq <- results %>%unnest(step_vars) %>%filter(step_vars %in% noise_vars) %>%count(step_vars, name ="times_selected") %>%mutate(pct = times_selected / n_sims *100,step_vars =fct_reorder(step_vars, times_selected) )p5 <-ggplot(noise_freq, aes(x = pct, y = step_vars)) +geom_col(fill = col_noise, alpha =0.85) +geom_text(aes(label =glue("{round(pct,0)}%")),hjust =-0.15, size =3.2, colour ="grey30") +geom_vline(xintercept =5, linetype ="dashed",colour ="grey50", linewidth =0.7) +annotate("text", x =5.5, y =1, hjust =0, size =3,colour ="grey40", label ="α = 0.05\nnominal rate →") +scale_x_continuous(limits =c(0, 55),labels =function(x) paste0(x, "%")) +labs(title ="🎰 Pure noise variables selected by stepwise",subtitle ="Every one of these variables is **completely unrelated to Y** — yet stepwise picks them constantly",x ="% of simulations where variable was selected",y =NULL,caption ="Dashed line = 5% (expected false positive rate under a single honest test)" ) +theme_tim()p5```> **Commentary:** *Every bar to the right of the dashed line is a variable that a researcher would present in a paper as "independently associated with the outcome." The journals would publish it. Future meta-analyses would include it. And it would be zero. Pure chance. This is how noise gets enshrined.*---### 4.5 · The Coefficient Instability Catastrophe> *Let's zoom in on one simulation and show how the selected model lies about its own coefficients.*```{r coef-instability}#| fig-cap: "Estimated coefficient of X1 (true β = 0.6) across 200 replications, stepwise vs. full model. Stepwise biases estimates upward in selected models."# Re-run 200 sims and collect X1 coefficientcoef_sims <-map_dfr(1:200, function(seed) {set.seed(seed) X <-matrix(rnorm(n_obs * n_pred), nrow = n_obs,dimnames =list(NULL, paste0("X", 1:n_pred))) eps <-rnorm(n_obs) Y <- beta_true["intercept"] + X[, "X1"] * beta_true["X1"] + X[, "X2"] * beta_true["X2"] + X[, "X3"] * beta_true["X3"] + eps df <-as.data.frame(cbind(Y, X)) full_mod <-lm(Y ~ ., data = df) step_mod <-stepAIC(full_mod, direction ="both", trace =FALSE) step_coefs <-coef(step_mod) full_coefs <-coef(full_mod)tibble(seed = seed,coef_full = full_coefs["X1"],coef_step =if ("X1"%in%names(step_coefs)) step_coefs["X1"] elseNA_real_,x1_selected ="X1"%in%names(step_coefs) )})coef_long <- coef_sims %>%pivot_longer(c(coef_full, coef_step), names_to ="model", values_to ="estimate") %>%mutate(model =case_match(model,"coef_full"~"Full model (always estimates X1)","coef_step"~"Stepwise (only when X1 is selected)"))p6 <-ggplot(coef_long, aes(x = estimate, fill = model)) +geom_density(alpha =0.65, colour =NA, na.rm =TRUE) +geom_vline(xintercept =0.6, linetype ="dashed",colour ="black", linewidth =1) +annotate("text", x =0.63, y =Inf, vjust =2, hjust =0,label ="True β = 0.6", fontface ="bold", size =3.5) +scale_fill_manual(values =c("Full model (always estimates X1)"= col_full,"Stepwise (only when X1 is selected)"= col_step )) +labs(title ="Coefficient estimates for X1 (true β = 0.6)",subtitle ="Stepwise only shows X1 when it *looks* big — **selection bias inflates estimates**",x ="Estimated coefficient",y ="Density",fill =NULL,caption ="Stepwise estimates conditioned on selection — classic winner's curse" ) +theme_tim()p6```> **This is the winner's curse in plain sight.** Stepwise only includes X1 when its estimated coefficient crossed whatever threshold the algorithm used. So the reported estimates are *systematically larger than the truth*. Your effect sizes are wrong. Your confidence intervals are wrong. Your power calculations for the next study will be wrong.---## 5 · A Side-by-Side Summary```{r panel-plot}#| fig-height: 10#| fig-cap: "The full picture. Left column: selection behaviour. Right column: predictive performance."(p1 | p2) / (p3 | p4) +plot_annotation(title ="Stepwise Selection: 500 Simulations, One Verdict",subtitle ="Ground truth: 3 real predictors, 17 noise. Stepwise doesn't know that — and acts accordingly.",theme =theme(plot.title =element_text(face ="bold", size =15),plot.subtitle =element_text(size =11, colour ="grey35") ) )```---## 6 · What Should You Do Instead?> *Okay. We've seen the crime scene. What's the alternative?*The answer depends on your goal:| Goal | Better approach ||---|---|| **Prediction** | LASSO / Ridge / Elastic Net with cross-validation || **Inference / explanation** | Theory-driven pre-specification; adjust for all confounders, period || **Exploratory** | Penalised regression + bootstrap variable importance — report uncertainty honestly || **Clinical score** | LASSO with optimism-corrected validation (bootstrap or external dataset) |::: {.callout-warning}## The fundamental rule**Model selection and inference cannot use the same data without a penalty.** If you used the data to choose variables, your p-values are not what they say they are. Full stop.:::---## 7 · The Closing Argument> *"But everyone does it. The reviewers expect it."*That's an institutional problem, not a statistical defence.The simulation above is not exotic. It uses a perfectly ordinary linear DGP, a reasonable sample size (n = 80), and off-the-shelf stepwise-AIC — the exact setup taught in most regression courses and implemented in most analysis pipelines. And it produces:- **False positives** at rates far above 5%- **Biased coefficient estimates**- **Inflated in-sample R²** that doesn't survive contact with new data- **Inconsistent variable selection** — run the same study twice, get a different "significant" modelStepwise selection is not data-driven. It is **noise-driven**, dressed up in the clothes of rigor.---