Code
# Load packages
library(ggdag)
Attaching package: 'ggdag'
The following object is masked from 'package:stats':
filter
Code
library(dagitty)
library(ggplot2)
library(viridis)Loading required package: viridisLite
Statistical model selection procedures such as Akaike’s Information Criterion (AIC) are widely used to identify well-fitting models and are often implicitly treated as tools for causal inference. However, model selection optimizes predictive performance rather than causal validity, and may therefore favor models that yield biased estimates of causal effects. In contrast, causal directed acyclic graphs (DAGs) make assumptions about the data-generating process explicit and provide principled rules for covariate adjustment that are independent of model fit criteria. Here, we use simulation to demonstrate that DAG-based adjustment can outperform model selection for causal effect estimation, even when the model preferred by AIC has substantially better predictive fit.
Specifically, we simulate data under two common causal structures: 1. a scenario involving a collider 2 a scenario involving a confounder
For each scenario, we generate data 100 times from a known data-generating process with a fixed true causal effect. For each simulated dataset, we estimate the effect of an exposure on an outcome using both a DAG-identified adjustment set and models selected using AIC. We then compare the distribution and mean of estimated effects across replications to the known true effect. This repeated-simulation approach allows to assess bias and consistency, and to illustrate how model selection can systematically favor causally invalid models despite superior predictive performance.
# Load packages
library(ggdag)
Attaching package: 'ggdag'
The following object is masked from 'package:stats':
filter
library(dagitty)
library(ggplot2)
library(viridis)Loading required package: viridisLite
adjustment_set_formulas <- function(dag, exposure, required_variable, outcome, effect = "total",
type = "minimal", formula_parts = c(outcome), latent = NULL, remove = NULL, plot = TRUE,
...) {
if (plot)
gg <- ggdag_adjustment_set(.tdy_dag = tidy_dagitty(dag), exposure = exposure,
outcome = outcome, ...) + theme_dag()
temp_set <- adjustmentSets(x = dag, exposure = exposure, outcome = outcome, effect = effect,
type = type)
form_set <- lapply(temp_set, function(x) {
if (!is.null(remove))
x <- x[!x %in% remove]
if (length(x) > 0){
form <- paste(formula_parts[1], " ~ ", exposure, " + ", paste(x, collapse = " + "),
if (length(formula_parts) == 2)
formula_parts[2] else NULL)
} else {
form <- paste(formula_parts[1], " ~ ", exposure, if (length(formula_parts) == 2)
formula_parts[2] else NULL)
}
return(form)
})
form_set <- form_set[!duplicated(form_set)]
if (!is.null(latent))
for (i in latent) form_set <- form_set[!grepl(paste0(" ", i, " "), form_set)]
# form_set <- sapply(form_set, as.formula)
names(form_set) <- seq_along(form_set)
# add formula as attribute
attributes(form_set)$exposure.formula <- paste(formula_parts[1], " ~ ", exposure,
if (length(formula_parts) == 2)
formula_parts[2] else NULL)
if (plot)
return(gg) else return(form_set)
}A collider is a variable that is causally influenced by two other variables, such as when both an exposure and an outcome affect a third variable. In DAG terminology, a collider appears in the structure X→C←Y. A key property of colliders is that, although they block association between their causes by default, conditioning on a collider (or one of its descendants) opens a non-causal path between those causes. As a result, controlling for a collider induces a spurious association that did not exist prior to conditioning. This phenomenon, known as collider bias, means that including a collider as a covariate can bias causal effect estimates, even if the collider strongly predicts the outcome. Because model selection criteria reward improved predictive fit, they often favor models that condition on colliders, making collider bias a systematic failure mode of model selection for causal inference.
# DAG structure:
# U -> X -> Y
# U -> Y
# X -> C <- Y (C is a collider)
dag <- dagify(
Y ~ X,
C ~ X + Y,
exposure = "X",
outcome = "Y"
)
tidy_dag <- tidy_dagitty(dag)
tidy_dag$data$type <- ""
tidy_dag$data$label <- tidy_dag$data$name
tidy_dag$data$label[tidy_dag$data$name == "X"] <- "X\n(predictor)"
tidy_dag$data$label[tidy_dag$data$name == "Y"] <- "Y\n(response)"
tidy_dag$data$label[tidy_dag$data$name == "C"] <- "Z\n(collider)"
ggplot(tidy_dag, aes(x = x, y = y, xend = xend, yend = yend)) + scale_color_viridis_d(begin = 0.2,
end = 0.8, alpha = 0.5) + geom_dag_edges_fan(edge_color = viridis(10, alpha = 0.4)[2],
arrow = grid::arrow(length = grid::unit(10, "pt"), type = "closed")) + geom_dag_point(size = 24) + geom_dag_text(color = "white", aes(label = label)) + theme_dag(base_size = 14) + theme(legend.position = "bottom")# sample size
N <- 100
# True causal effect of X -> Y
eff_size <- 10
out <- pbapply::pblapply(1:100, function(i) {
# X is the exposure
X <- rnorm(N)
# Y is the outcome
Y <- eff_size * X + rnorm(N)
# C is a COLLIDER caused by both X and Y
C <- X + Y + rnorm(N)
dat <- data.frame(X, Y, C)
m_dag <- lm(Y ~ X, data = dat)
## model selection approach
# Candidate models a researcher might try
m1 <- lm(Y ~ X , data = dat)
m2 <- lm(Y ~ X + C, data = dat)
# define adjustment set
aictab <- AIC(m1, m2)
# get object from name
aic_mod <- get(rownames(aictab)[which.min(aictab$AIC)])
# DAG approach
#
d <- as.dagitty(dag)
# Find valid adjustment sets
# All minimal sufficient adjustment sets
a <- adjustment_set_formulas(
d,
exposure = "X",
outcome = "Y",
type = "all",
plot = FALSE
)
dag_mod <- lm(a[[1]], data = dat)
# return estimates
return(data.frame(
method = c("DAG causal\ninference", "Model selection"),
estimate = c(coef(dag_mod)["X"], coef(aic_mod)["X"])
))
})
results <- do.call(rbind, out)Blue line represents true effect size:
agg_results <- aggregate(estimate ~ method, data = results, mean)
agg_results$sd <- aggregate(estimate ~ method, data = results, sd)$estimate
ggplot(agg_results, aes(x = method, y = estimate)) +
geom_pointrange(aes(ymin = estimate - sd, ymax = estimate + sd)) +
geom_hline(yintercept = eff_size,
linetype = "dashed",
color = "blue") +
ylab("Estimated effect of X on Y") +
xlab("Method")A confounder is a variable that causally influences both the exposure and the outcome, creating a spurious association between them if left unadjusted. In DAGs, a confounder appears in the structure X←Z→Y. Unlike colliders, confounders transmit non-causal association by default, and conditioning on them blocks this backdoor path. Proper adjustment for confounders is therefore necessary to recover unbiased estimates of causal effects. However, model selection procedures do not distinguish confounders from other predictive variables, such as mediators or post-treatment variables. As a result, even in the presence of a true confounder, model selection may favor over-controlled models that include variables downstream of the outcome, leading to biased estimates of the total causal effect. DAG-based reasoning is thus required not only to identify confounders that should be adjusted for, but also to identify variables that should not be conditioned on, regardless of their predictive utility.
# DAG structure:
#
# Z -> X -> Y -> M
# Z -> Y
#
# Z is a confounder (should adjust)
# M is post-treatment (should NOT adjust for total effect)
dag <- dagify(
X ~ Z,
Y ~ X + Z,
M ~ Y,
exposure = "X",
outcome = "Y"
)
tidy_dag <- tidy_dagitty(dag)
tidy_dag$data$type <- ""
tidy_dag$data$label <- tidy_dag$data$name
tidy_dag$data$label[tidy_dag$data$name == "M"] <- "M\n(mediator)"
tidy_dag$data$label[tidy_dag$data$name == "X"] <- "X\n(predictor)"
tidy_dag$data$label[tidy_dag$data$name == "Y"] <- "Y\n(response)"
tidy_dag$data$label[tidy_dag$data$name == "Z"] <- "Z\n(counfounder)"
ggplot(tidy_dag, aes(x = x, y = y, xend = xend, yend = yend)) + scale_color_viridis_d(begin = 0.2,
end = 0.8, alpha = 0.5) + geom_dag_edges_fan(edge_color = viridis(10, alpha = 0.4)[2],
arrow = grid::arrow(length = grid::unit(10, "pt"), type = "closed")) + geom_dag_point(size = 24) + geom_dag_text(color = "white", aes(label = label)) + theme_dag(base_size = 14) + theme(legend.position = "bottom")# sample size
N <- 100
# True causal effect of X -> Y
eff_size <- 1
out <- pbapply::pblapply(1:100, function(i) {
# Z is a TRUE CONFOUNDER (common cause of X and Y)
Z <- rnorm(N)
# X is the exposure
X <- 0.8 * Z + rnorm(N)
# Y is the outcome
# True causal effect of X -> Y is EXACTLY 1.0
Y <- 1.0 * X + 0.8 * Z + rnorm(N)
# M is a POST-TREATMENT VARIABLE (mediator / outcome proxy)
# Caused by Y, and therefore downstream of X
M <- 0.9 * Y + rnorm(N)
dat <- data.frame(X, Y, Z, M)
m_dag <- lm(Y ~ X, data = dat)
## model selection approach
# Candidate models a researcher might try
# m1: No adjustment (confounding bias)
m1 <- lm(Y ~ X, data = dat)
# m2: Correct DAG-based model (adjust for confounder Z)
m2 <- lm(Y ~ X + Z, data = dat)
# m3: Overcontrolled model (adjusts for post-treatment variable M)
m3 <- lm(Y ~ X + Z + M, data = dat)
# define adjustment set
aictab <- AIC(m1, m2, m3)
# get object from name
aic_mod <- get(rownames(aictab)[which.min(aictab$AIC)])
# DAG approach
#
d <- as.dagitty(dag)
# Find valid adjustment sets
# All minimal sufficient adjustment sets
a <- adjustment_set_formulas(
d,
exposure = "X",
outcome = "Y",
type = "all",
plot = FALSE
)
dag_mod <- lm(a[[1]], data = dat)
# return estimates
return(data.frame(
method = c("DAG causal\ninference", "Model selection"),
estimate = c(coef(dag_mod)["X"], coef(aic_mod)["X"])
))
})
results2 <- do.call(rbind, out)Blue line represents true effect size:
agg_results2 <- aggregate(estimate ~ method, data = results2, mean)
agg_results2$sd <- aggregate(estimate ~ method, data = results2, sd)$estimate
ggplot(agg_results2, aes(x = method, y = estimate)) +
geom_pointrange(aes(ymin = estimate - sd, ymax = estimate + sd)) +
geom_hline(yintercept = eff_size,
linetype = "dashed",
color = "blue") +
ylab("Estimated effect of X on Y") +
xlab("Method")─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.3.2 (2023-10-31)
os Ubuntu 22.04.2 LTS
system x86_64, linux-gnu
ui X11
language es_ES:en
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/Costa_Rica
date 2026-01-19
pandoc 3.8 @ /home/marce/anaconda3/bin/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package * version date (UTC) lib source
boot 1.3-28 2021-05-03 [4] CRAN (R 4.1.1)
cachem 1.1.0 2024-05-16 [1] CRAN (R 4.3.2)
cli 3.6.5 2025-04-23 [1] CRAN (R 4.3.2)
curl 7.0.0 2025-08-19 [1] CRAN (R 4.3.2)
dagitty * 0.3-4 2023-12-07 [1] CRAN (R 4.3.2)
devtools 2.4.5 2022-10-11 [1] CRAN (R 4.3.2)
digest 0.6.39 2025-11-19 [1] CRAN (R 4.3.2)
dplyr 1.1.4 2023-11-17 [1] CRAN (R 4.3.2)
ellipsis 0.3.2 2021-04-29 [3] CRAN (R 4.1.1)
evaluate 1.0.5 2025-08-27 [1] CRAN (R 4.3.2)
farver 2.1.2 2024-05-13 [1] CRAN (R 4.3.2)
fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.3.2)
fs 1.6.6 2025-04-12 [1] CRAN (R 4.3.2)
generics 0.1.4 2025-05-09 [1] CRAN (R 4.3.2)
ggdag * 0.2.12 2024-03-08 [1] CRAN (R 4.3.2)
ggforce 0.4.2 2024-02-19 [1] CRAN (R 4.3.2)
ggplot2 * 4.0.1 2025-11-14 [1] CRAN (R 4.3.2)
ggraph 2.2.1 2024-03-07 [1] CRAN (R 4.3.2)
ggrepel 0.9.5 2024-01-10 [1] CRAN (R 4.3.2)
glue 1.8.0 2024-09-30 [1] CRAN (R 4.3.2)
graphlayouts 1.1.1 2024-03-09 [1] CRAN (R 4.3.2)
gridExtra 2.3 2017-09-09 [3] CRAN (R 4.0.1)
gtable 0.3.6 2024-10-25 [1] CRAN (R 4.3.2)
htmltools 0.5.9 2025-12-04 [1] CRAN (R 4.3.2)
htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.3.2)
httpuv 1.6.13 2023-12-06 [1] CRAN (R 4.3.2)
igraph 2.1.4 2025-01-23 [1] CRAN (R 4.3.2)
jsonlite 2.0.0 2025-03-27 [1] CRAN (R 4.3.2)
knitr 1.50 2025-03-16 [1] CRAN (R 4.3.2)
labeling 0.4.3 2023-08-29 [1] CRAN (R 4.3.2)
later 1.3.2 2023-12-06 [1] CRAN (R 4.3.2)
lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.3.2)
magrittr 2.0.4 2025-09-12 [1] CRAN (R 4.3.2)
MASS 7.3-55 2022-01-13 [4] CRAN (R 4.1.2)
memoise 2.0.1 2021-11-26 [3] CRAN (R 4.1.2)
mime 0.13 2025-03-17 [1] CRAN (R 4.3.2)
miniUI 0.1.1.1 2018-05-18 [3] CRAN (R 4.0.1)
pbapply 1.7-4 2025-07-20 [1] CRAN (R 4.3.2)
pillar 1.11.1 2025-09-17 [1] CRAN (R 4.3.2)
pkgbuild 1.4.8 2025-05-26 [1] CRAN (R 4.3.2)
pkgconfig 2.0.3 2019-09-22 [3] CRAN (R 4.0.1)
pkgload 1.4.0 2024-06-28 [1] CRAN (R 4.3.2)
polyclip 1.10-7 2024-07-23 [1] CRAN (R 4.3.2)
profvis 0.3.8 2023-05-02 [1] CRAN (R 4.3.2)
promises 1.2.1 2023-08-10 [1] CRAN (R 4.3.2)
purrr 1.2.0 2025-11-04 [1] CRAN (R 4.3.2)
R6 2.6.1 2025-02-15 [1] CRAN (R 4.3.2)
RColorBrewer 1.1-3 2022-04-03 [1] CRAN (R 4.3.2)
Rcpp 1.1.0 2025-07-02 [1] CRAN (R 4.3.2)
remotes 2.5.0 2024-03-17 [1] CRAN (R 4.3.2)
rlang 1.1.6 2025-04-11 [1] CRAN (R 4.3.2)
rmarkdown 2.30 2025-09-28 [1] CRAN (R 4.3.2)
rstudioapi 0.17.1 2024-10-22 [1] CRAN (R 4.3.2)
S7 0.2.1 2025-11-14 [1] CRAN (R 4.3.2)
scales 1.4.0 2025-04-24 [1] CRAN (R 4.3.2)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.2)
shiny 1.8.0 2023-11-17 [1] CRAN (R 4.3.2)
stringi 1.8.7 2025-03-27 [1] CRAN (R 4.3.2)
stringr 1.6.0 2025-11-04 [1] CRAN (R 4.3.2)
tibble 3.3.0 2025-06-08 [1] CRAN (R 4.3.2)
tidygraph 1.3.1 2024-01-30 [1] CRAN (R 4.3.2)
tidyr 1.3.1 2024-01-24 [1] CRAN (R 4.3.2)
tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.3.2)
tweenr 2.0.2 2022-09-06 [1] CRAN (R 4.3.2)
urlchecker 1.0.1 2021-11-30 [1] CRAN (R 4.3.2)
usethis 3.1.0 2024-11-26 [1] CRAN (R 4.3.2)
V8 4.4.2 2024-02-15 [1] CRAN (R 4.3.2)
vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.3.2)
viridis * 0.6.5 2024-01-29 [1] CRAN (R 4.3.2)
viridisLite * 0.4.2 2023-05-02 [1] CRAN (R 4.3.2)
withr 3.0.2 2024-10-28 [1] CRAN (R 4.3.2)
xfun 0.55 2025-12-16 [1] CRAN (R 4.3.2)
xtable 1.8-4 2019-04-21 [3] CRAN (R 4.0.1)
yaml 2.3.12 2025-12-10 [1] CRAN (R 4.3.2)
[1] /home/marce/R/x86_64-pc-linux-gnu-library/4.3
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library
──────────────────────────────────────────────────────────────────────────────