Machine learning for estimating average treatment effects

HEOR-DS 530, Practical Demonstration

Author

Noemi Kreif. Previous version developed together with Julia Hatamyar

Published

May 1, 2026

This tutorial demonstrates the main coding workflow for estimating and interpreting conditional average treatment effects (CATEs) with causal forests.

The structure is as follows:

estimate nuisance functions,
fit a causal forest using policy-relevant effect modifiers,
inspect heterogeneity,
summarize it with interpretable models,
and compare the causal-forest scores with Double Machine Learning scores.

Note

This document is designed for live teaching. The code is intentionally explicit rather than overly compact, so students can map each line to a step in the assignment.

1 Learning Goals

By the end of the walkthrough, students should be able to:

estimate nuisance functions for causal forests;
explain why outcome and propensity score models matter for confounding adjustment;
choose effect modifiers with a policy decision in mind;
interpret predicted CATEs and their distribution;
use variable importance and Best Linear Predictor (BLP) regressions as summaries of heterogeneity;
compare causal-forest double-robust scores with Double Machine Learning scores.

2 Setup

Check availability of packages and stop with a clear message if a package is missing.
Apply UW branding for plots and tables via formatQuarto.
Create helper functions for tables and plots to keep the code clean and focused on the analysis steps.

3 Data

We use the IHDP simulation data. The outcome Y is a cognitive test score, and W indicates treatment assignment. All other columns are baseline covariates.

Code

load("ihdp_sim.data")

data.frame(
  rows = nrow(IQdata),
  columns = ncol(IQdata)
) |>
  gt_uw("Dataset dimensions")

Dataset dimensions
rows	columns
787	36

Code

head(dplyr::select(IQdata, Y, W, bw, momage, nnhealth, ppvt.imp, female)) |>
  gt_uw("First six rows: key analysis variables")

First six rows: key analysis variables
Y	W	bw	momage	nnhealth	ppvt.imp	female
120	1	1559	33	94	75.00000	1
90	1	1420	15	85	73.00000	1
76	0	1000	33	89	77.00000	0
43	0	1430	22	112	62.37235	0
73	0	1984	20	99	73.00000	0
83	0	1320	23	110	56.00000	1

4 Variable Dictionary

The variable descriptions below are taken from VariableDescription.doc. Some names in the original document differ slightly from the names used in IQdata; the table keeps both labels visible.

IHDP variable descriptions
Source: VariableDescription.doc
Variable in source document	Variable in IQdata	Description
treat	W	Indicator for IHDP intervention group; 1 for intervention, 0 for control.
bw	bw	Child birthweight in grams.
b.head	Not included in IQdata	Child head circumference at birth, measured in centimeters.
preterm	Not included in IQdata	Number of weeks pre-term that the child was born.
birth.o	birth.o	Birth order.
nnhealth	nnhealth	Neonatal health index based on birth-related variables such as hospital days and gestational age.
momage	momage	Mother's age when she gave birth to the child.
sex	female	Child gender; female is coded as 1.
twin	Not included in IQdata	Indicator for whether the child is a twin.
b.marry	b.marry	Indicator for whether the mother was married when the child was born.
momed indicators	mom.lths, mom.hs, mom.scoll, mom.coll	Maternal education at birth: less than high school, high school, some college, or college.
first	Not included in IQdata	Indicator for whether the child is first born.
cig	cigs	Indicator for whether the mother smoked cigarettes while pregnant.
booze	alcohol	Indicator for whether the mother consumed alcohol while pregnant.
drugs	drugs	Indicator for whether the mother ever used drugs while pregnant.
work.dur	workdur.imp	Indicator for whether the mother worked during pregnancy.
prenatal / whenpren	whenpren	Timing of prenatal care; in this dataset, trimester when prenatal care began.
site indicator variables	site1-site8	Indicators for study site; represented as site1 through site8 in this dataset.
momwhite, momblack, momhisp	momwhite, momblack, momhisp	Indicators for mother's race or ethnicity; white includes white and other in the original description.
parity	parity	Number of children the mother has given birth to.
moreprem	moreprem	Number of other children the mother has given birth to prematurely.
ppvt.imp	ppvt.imp	Mother's PPVT score measured one year post-birth; some values are imputed.
mlt.birt	mlt.birt	Number of multiple births the mother has had.
livwho	livwho	Family member with whom the child primarily lives.
language	language	Primary language spoken at home.
othstudy	othstudy	Indicator for whether the family was participating in another study at the same time.

Code

X <- IQdata[, !names(IQdata) %in% c("W", "Y")]
Y <- IQdata$Y
W <- IQdata$W

data.frame(
  n = nrow(IQdata),
  covariates = ncol(X),
  treated = sum(W == 1),
  control = sum(W == 0),
  mean_outcome = mean(Y)
) |>
  dplyr::mutate(mean_outcome = round(mean_outcome, 2)) |>
  gt_uw("Analytic sample")

Analytic sample
n	covariates	treated	control	mean_outcome
787	34	226	561	88

5 Step 1: Estimate Nuisance Models

Causal forests use nuisance functions to adjust for confounding:

W.hat: the propensity score, or probability of treatment conditional on covariates;
Y.hat: the conditional mean of the outcome.

Here we estimate both with regression_forest() and tune all available hyperparameters.

Code

propscore_forest <- regression_forest(
  X = X,
  Y = W,
  num.threads = 1,
  tune.parameters = "all"
)

outcome_forest <- regression_forest(
  X = X,
  Y = Y,
  num.threads = 1,
  tune.parameters = "all"
)

W.hat <- predict(propscore_forest)$predictions
Y.hat <- predict(outcome_forest)$predictions

Identify the hyperparameter that controls how many variables are tried at each split. In grf, this is mtry.

Code

default_mtry <- min(ceiling(sqrt(ncol(X)) + 20), ncol(X))

dplyr::bind_rows(
  data.frame(model = "Default rule", parameter = "mtry", value = as.character(default_mtry)),
  tuning_table(propscore_forest, "Propensity forest"),
  tuning_table(outcome_forest, "Outcome forest")
) |>
  gt_uw("Selected tuning parameters")

Selected tuning parameters
model	parameter	value
Default rule	mtry	26
Propensity forest	sample.fraction	0.5
Propensity forest	mtry	26
Propensity forest	min.node.size	5
Propensity forest	honesty.fraction	0.5
Propensity forest	honesty.prune.leaves	1
Propensity forest	alpha	0.05
Propensity forest	imbalance.penalty	0
Outcome forest	sample.fraction	0.379
Outcome forest	mtry	12
Outcome forest	min.node.size	2
Outcome forest	honesty.fraction	0.626
Outcome forest	honesty.prune.leaves	1
Outcome forest	alpha	0.01
Outcome forest	imbalance.penalty	0.655

Code

par(mfrow = c(1, 2))
hist(W.hat, main = "Predicted propensity score", xlab = "W.hat", col = pal[["gold"]])
hist(Y.hat, main = "Predicted outcome", xlab = "Y.hat", col = pal[["purple"]])

Predicted propensity scores and predicted outcomes from nuisance forests.

Code

par(mfrow = c(1, 1))

Tip

The nuisance models are not the final estimand. They are supporting models that help the causal forest compare treated and control observations that are comparable on observed covariates.

6 Step 2: Choose Policy-Relevant Effect Modifiers

Let’s choose variables that could plausibly be used by a policy maker. This is not only a modeling decision; it is also an ethical and operational decision.

For this demonstration, momwhite is kept as the reference category, while momblack and momhisp enter the model.

Code

myvars <- c(
  "bw", "female", "ppvt.imp", "momage", "nnhealth", "birth.o",
  "cigs", "alcohol", "b.marry", "mom.hs", "mom.coll", "mom.scoll",
  "momblack", "momhisp"
)

descriptive_vars <- c(myvars, "momwhite")
X_selected <- X[, names(X) %in% myvars]

data.frame(effect_modifier = names(X_selected)) |>
  gt_uw("Selected effect modifiers")

Selected effect modifiers
effect_modifier
bw
momage
nnhealth
birth.o
cigs
alcohol
ppvt.imp
female
b.marry
mom.hs
mom.coll
mom.scoll
momblack
momhisp

Warning

This list is meant for discussion, not as a final policy recommendation. Some variables may be predictive but ethically sensitive or unavailable to a local decision maker. The model can reveal heterogeneity; it does not decide what is fair to use for targeting.

7 Step 3: Fit the Causal Forest

We now estimate the CATE function:

\[ \tau(x) = E[Y(1) - Y(0) \mid X = x] \]

Code

tau_forest <- causal_forest(
  X = X_selected,
  Y = Y,
  W = W,
  W.hat = W.hat,
  Y.hat = Y.hat,
  num.threads = 1,
  tune.parameters = "all"
)

tau_hat <- predict(tau_forest)$predictions

Code

mean_cate <- mean(tau_hat)

data.frame(
  statistic = c("Minimum", "25th percentile", "Median", "Mean", "75th percentile", "Maximum"),
  value = as.numeric(summary(tau_hat))
) |>
  dplyr::mutate(value = round(value, 3)) |>
  gt_uw("Distribution of predicted CATEs")

Distribution of predicted CATEs
statistic	value
Minimum	-1.349
25th percentile	6.435
Median	8.548
Mean	8.713
75th percentile	10.845
Maximum	18.091

Code

data.frame(
  diagnostic = c("Average predicted CATE", "Any predicted CATE below 0?"),
  value = c(round(mean_cate, 3), ifelse(any(tau_hat < 0), "Yes", "No"))
) |>
  gt_uw("CATE diagnostics")

CATE diagnostics
diagnostic	value
Average predicted CATE	8.713
Any predicted CATE below 0?	Yes

Code

ggplot(data.frame(tau_hat), aes(x = tau_hat)) +
  geom_histogram(bins = 30, fill = pal[["purple"]], colour = "white", alpha = 0.88) +
  geom_vline(xintercept = 0, colour = pal[["gold_dark"]], linewidth = 1) +
  geom_vline(xintercept = mean_cate, colour = pal[["gold"]], linewidth = 1.2) +
  labs(
    title = "Predicted CATEs",
    x = "Predicted treatment effect",
    y = "Number of children"
  )

Distribution of predicted individualized treatment effects.

Teaching prompt If some predicted CATEs are below zero, do we conclude the intervention harmed those children? Discuss estimation uncertainty, model dependence, and the difference between individual predictions and subgroup-level evidence.

8 Step 4: Who Benefits More?

One simple descriptive exercise is to split observations by above-average versus below-average predicted benefit, then compare baseline covariates.

Code

highcate <- as.numeric(tau_hat >= mean_cate)

balancetable_CATE <- CreateTableOne(
  vars = descriptive_vars,
  strata = "highcate",
  data = cbind(IQdata, highcate)
)

tableone_df <- print(
  balancetable_CATE,
  smd = TRUE,
  printToggle = FALSE,
  noSpaces = TRUE
) |>
  as.data.frame() |>
  tibble::rownames_to_column("Covariate")

tableone_df |>
  gt_uw("Baseline characteristics by predicted benefit group")

Baseline characteristics by predicted benefit group
Covariate	0	1	p	test	SMD
n	411	376
bw (mean (SD))	1549.75 (392.99)	2077.75 (353.13)	<0.001		1.413
female (mean (SD))	0.50 (0.50)	0.52 (0.50)	0.574		0.040
ppvt.imp (mean (SD))	82.27 (22.70)	81.68 (19.00)	0.692		0.028
momage (mean (SD))	24.63 (6.34)	25.41 (5.70)	0.071		0.129
nnhealth (mean (SD))	101.89 (16.19)	97.19 (14.65)	<0.001		0.304
birth.o (mean (SD))	1.65 (0.93)	2.19 (1.28)	<0.001		0.488
cigs (mean (SD))	3.30 (6.62)	4.97 (8.24)	0.002		0.223
alcohol (mean (SD))	0.38 (1.60)	0.30 (1.07)	0.362		0.066
b.marry (mean (SD))	1.56 (0.60)	1.62 (0.68)	0.192		0.093
mom.hs (mean (SD))	0.21 (0.41)	0.34 (0.47)	<0.001		0.303
mom.coll (mean (SD))	0.20 (0.40)	0.22 (0.42)	0.413		0.058
mom.scoll (mean (SD))	0.19 (0.40)	0.09 (0.28)	<0.001		0.310
momblack (mean (SD))	0.47 (0.50)	0.47 (0.50)	0.912		0.008
momhisp (mean (SD))	0.09 (0.28)	0.11 (0.32)	0.259		0.080
momwhite (mean (SD))	0.44 (0.50)	0.42 (0.49)	0.569		0.041

9 Step 5: Variable Importance for Heterogeneity

The variable_importance() function tells us which variables the causal forest split on most often while estimating heterogeneity. This is useful, but limited: it is not a regression coefficient, and it does not tell us the direction of association.

Code

tau_varimp <- variable_importance(tau_forest)

importance_df <- data.frame(
  variable = colnames(X_selected),
  importance = tau_varimp
) |>
  arrange(desc(importance))

ggplot(importance_df, aes(x = reorder(variable, importance), y = importance)) +
  geom_col(fill = pal[["purple"]], alpha = 0.9) +
  coord_flip() +
  labs(
    title = "Which variables drive forest splits?",
    x = NULL,
    y = "Variable importance"
  )

Causal-forest variable importance for treatment effect heterogeneity.

10 Step 6: Double-Robust Scores and the ATE

The causal forest also produces double-robust scores. Their average is a double-robust estimate of the average treatment effect.

Double-robust means that the ATE estimate is consistent if either the propensity score model or the outcome model is correctly specified. This is a useful robustness check, but it does not mean the ATE estimate is unbiased if both models are misspecified.

Code

DR_scores <- get_scores(tau_forest)

average_treatment_effect(tau_forest) |>
  ate_table("Causal forest") |>
  dplyr::bind_rows(
    data.frame(
      estimator = "Mean of DR scores",
      estimate = mean(DR_scores),
      std_error = NA_real_
    )
  ) |>
  dplyr::mutate(
    estimate = round(estimate, 3),
    std_error = round(std_error, 3)
  ) |>
  gt_uw("Average treatment effect estimates")

Average treatment effect estimates
estimator	estimate	std_error
Causal forest	8.995	1.274
Mean of DR scores	8.995	NA

Code

ggplot(data.frame(DR_scores), aes(x = DR_scores)) +
  geom_histogram(bins = 30, fill = pal[["gold"]], colour = "white") +
  geom_vline(xintercept = mean(DR_scores), colour = pal[["purple"]], linewidth = 1.2) +
  labs(
    title = "Double-robust scores from the causal forest",
    x = "Score",
    y = "Number of children"
  )

Distribution of causal-forest double-robust scores.

11 Step 7: Best Linear Predictor of Heterogeneity

A BLP regression summarizes heterogeneity by regressing the double-robust scores on selected effect modifiers. This is deliberately interpretable, but it can miss nonlinear relationships and interactions.

Code

linear_predictors_CF <- lm(
  DR_scores ~ .,
  data = as.data.frame(cbind(DR_scores, X_selected))
)

model_table(
  linear_predictors_CF,
  "Best Linear Predictor using causal-forest scores"
)

Best Linear Predictor using causal-forest scores
Term	Estimate	SE	t	p-value	95% CI low	95% CI high
(Intercept)	-19.210	14.345	-1.339	0.181	-47.369	8.950
bw	0.008	0.003	2.866	0.004	0.003	0.014
momage	0.098	0.283	0.345	0.731	-0.459	0.654
nnhealth	-0.068	0.084	-0.816	0.415	-0.232	0.096
birth.o	1.628	1.274	1.278	0.202	-0.872	4.128
cigs	0.077	0.182	0.423	0.672	-0.281	0.435
alcohol	-0.810	0.954	-0.849	0.396	-2.683	1.063
ppvt.imp	0.097	0.090	1.076	0.282	-0.080	0.274
female	3.195	2.576	1.240	0.215	-1.861	8.251
b.marry	0.812	2.306	0.352	0.725	-3.715	5.339
mom.hs	5.533	3.405	1.625	0.105	-1.151	12.217
mom.coll	-0.393	4.151	-0.095	0.925	-8.542	7.755
mom.scoll	-1.335	5.822	-0.229	0.819	-12.764	10.093
momblack	4.678	3.557	1.315	0.189	-2.304	11.661
momhisp	6.315	4.968	1.271	0.204	-3.438	16.067

Code

get_blp_df <- function(model, label) {
  s <- broom::tidy(model)
  data.frame(
    variable = s$term,
    estimate = s$estimate,
    lower = s$estimate - 1.96 * s$std.error,
    upper = s$estimate + 1.96 * s$std.error,
    model = label,
    stringsAsFactors = FALSE
  )
}

Code

cf_blp_plot_df <- get_blp_df(linear_predictors_CF, "Causal Forest")
cf_blp_plot_df <- subset(cf_blp_plot_df, variable != "(Intercept)")
cf_blp_plot_df$variable <- factor(
  cf_blp_plot_df$variable,
  levels = cf_blp_plot_df$variable[order(cf_blp_plot_df$estimate)]
)

ggplot(cf_blp_plot_df, aes(x = estimate, y = variable)) +
  geom_vline(xintercept = 0, colour = "grey60", linetype = 2) +
  geom_pointrange(aes(xmin = lower, xmax = upper), colour = pal[["purple"]]) +
  labs(
    title = "Linear summary of treatment effect heterogeneity",
    x = "Coefficient estimate with 95% CI",
    y = NULL
  )

Best Linear Predictor coefficients using causal-forest scores.

Note

We can interpret the birth weight coefficient as the expected change in the double-robust score associated with a one-unit increase in bw, holding the other selected effect modifiers fixed. If the scale is hard to interpret, rescale bw before fitting the BLP.

12 Step 8: Compare With a Traditional Interaction Regression

A traditional approach is to fit a linear outcome model with treatment-by- covariate interactions. This imposes stronger functional-form assumptions: linear covariate effects and linear interactions with treatment.

Code

confounders <- setdiff(names(IQdata), "Y")
interaction_terms <- paste0("W:", myvars, collapse = " + ")

linear_formula <- as.formula(
  paste("Y ~", paste(confounders, collapse = " + "), "+", interaction_terms)
)

linear_interact <- lm(linear_formula, data = IQdata)

broom::tidy(linear_interact, conf.int = TRUE) |>
  dplyr::filter(term == "W" | grepl(":W|W:", term)) |>
  dplyr::mutate(
    dplyr::across(
      c(estimate, std.error, statistic, p.value, conf.low, conf.high),
      \(x) round(x, 3)
    )
  ) |>
  gt_uw("Treatment and interaction terms from the traditional outcome regression") |>
  gt::cols_label(
    term = "Term",
    estimate = "Estimate",
    std.error = "SE",
    statistic = "t",
    p.value = "p-value",
    conf.low = "95% CI low",
    conf.high = "95% CI high"
  )

Treatment and interaction terms from the traditional outcome regression
Term	Estimate	SE	t	p-value	95% CI low	95% CI high
W	-32.791	13.441	-2.440	0.015	-59.179	-6.403
bw:W	0.009	0.003	3.335	0.001	0.004	0.015
female:W	2.142	2.461	0.870	0.384	-2.689	6.973
ppvt.imp:W	0.178	0.085	2.108	0.035	0.012	0.344
momage:W	-0.033	0.278	-0.118	0.906	-0.579	0.513
nnhealth:W	0.039	0.081	0.477	0.633	-0.120	0.197
birth.o:W	1.452	1.277	1.137	0.256	-1.056	3.960
cigs:W	0.067	0.163	0.409	0.683	-0.254	0.387
alcohol:W	-0.754	0.990	-0.761	0.447	-2.699	1.190
b.marry:W	0.564	2.228	0.253	0.800	-3.810	4.939
mom.hs:W	3.429	3.302	1.038	0.299	-3.054	9.912
mom.coll:W	-0.520	4.027	-0.129	0.897	-8.426	7.387
mom.scoll:W	-5.197	5.367	-0.968	0.333	-15.733	5.339
momblack:W	3.267	3.323	0.983	0.326	-3.256	9.790
momhisp:W	5.627	5.057	1.113	0.266	-4.300	15.554

Compare this output to the causal-forest BLP. Are the same variables highlighted? If not, is that a substantive disagreement or a consequence of different modeling assumptions?

13 Step 9: Extension With Double Machine Learning

Double Machine Learning also produces double-robust scores, but it obtains them through cross-fitting nuisance models rather than through the causal forest CATE estimator. Comparing BLPs from both scores is a robustness check on the heterogeneity story.

Code

set.seed(123456)

obj_dml_data <- DoubleMLData$new(IQdata, y_col = "Y", d_cols = "W")

Y_learner <- lrn(
  "regr.ranger",
  num.trees = 100,
  min.node.size = 2,
  max.depth = 5,
  num.threads = 1
)

W_learner <- lrn(
  "classif.ranger",
  num.trees = 500,
  min.node.size = 2,
  max.depth = 5,
  num.threads = 1
)

doubleml <- DoubleMLIRM$new(
  obj_dml_data,
  ml_g = Y_learner,
  ml_m = W_learner,
  n_folds = 5,
  score = "ATE"
)

doubleml$fit(store_predictions = TRUE)
dml_ate <- data.frame(
  estimator = "DoubleML",
  estimate = as.numeric(doubleml$coef),
  std_error = as.numeric(doubleml$se),
  row.names = NULL
)

cf_ate <- average_treatment_effect(tau_forest) |>
  ate_table("Causal forest")

dplyr::bind_rows(dml_ate, cf_ate) |>
  dplyr::mutate(
    estimate = round(estimate, 3),
    std_error = round(std_error, 3)
  ) |>
  gt_uw("ATE comparison")

ATE comparison
estimator	estimate	std_error
DoubleML	8.716	1.331
Causal forest	8.995	1.274

Code

DML_scores <- as.numeric(doubleml$psi[, 1, 1])

linear_predictors_DML <- lm(
  DML_scores ~ .,
  data = as.data.frame(cbind(DML_scores, X_selected))
)

model_table(
  linear_predictors_DML,
  "Best Linear Predictor using DoubleML scores"
)

Best Linear Predictor using DoubleML scores
Term	Estimate	SE	t	p-value	95% CI low	95% CI high
(Intercept)	-26.439	14.993	-1.763	0.078	-55.871	2.993
bw	0.009	0.003	3.139	0.002	0.003	0.015
momage	-0.003	0.296	-0.011	0.991	-0.585	0.578
nnhealth	-0.074	0.087	-0.847	0.397	-0.245	0.097
birth.o	1.650	1.331	1.240	0.215	-0.963	4.263
cigs	0.108	0.191	0.566	0.572	-0.266	0.482
alcohol	-0.864	0.997	-0.867	0.386	-2.822	1.093
ppvt.imp	0.104	0.094	1.106	0.269	-0.081	0.289
female	3.664	2.692	1.361	0.174	-1.620	8.949
b.marry	-0.415	2.410	-0.172	0.863	-5.146	4.317
mom.hs	5.329	3.559	1.497	0.135	-1.657	12.315
mom.coll	-0.441	4.338	-0.102	0.919	-8.958	8.075
mom.scoll	-1.838	6.085	-0.302	0.763	-13.784	10.107
momblack	5.240	3.718	1.410	0.159	-2.057	12.538
momhisp	7.156	5.192	1.378	0.169	-3.037	17.349

Code

cf_coefs <- broom::tidy(linear_predictors_CF)
dml_coefs <- broom::tidy(linear_predictors_DML)

compare_BLP <- data.frame(
  variable = cf_coefs$term,
  est_CF = round(cf_coefs$estimate, 3),
  SE_CF = round(cf_coefs$std.error, 3),
  est_DML = round(
    dml_coefs$estimate[match(cf_coefs$term, dml_coefs$term)],
    3
  ),
  SE_DML = round(
    dml_coefs$std.error[match(cf_coefs$term, dml_coefs$term)],
    3
  )
)

compare_BLP |>
  gt_uw("BLP coefficient comparison")

BLP coefficient comparison
variable	est_CF	SE_CF	est_DML	SE_DML
(Intercept)	-19.210	14.345	-26.439	14.993
bw	0.008	0.003	0.009	0.003
momage	0.098	0.283	-0.003	0.296
nnhealth	-0.068	0.084	-0.074	0.087
birth.o	1.628	1.274	1.650	1.331
cigs	0.077	0.182	0.108	0.191
alcohol	-0.810	0.954	-0.864	0.997
ppvt.imp	0.097	0.090	0.104	0.094
female	3.195	2.576	3.664	2.692
b.marry	0.812	2.306	-0.415	2.410
mom.hs	5.533	3.405	5.329	3.559
mom.coll	-0.393	4.151	-0.441	4.338
mom.scoll	-1.335	5.822	-1.838	6.085
momblack	4.678	3.557	5.240	3.718
momhisp	6.315	4.968	7.156	5.192

Code

blp_plot_df <- rbind(
  get_blp_df(linear_predictors_CF, "Causal Forest"),
  get_blp_df(linear_predictors_DML, "Double ML")
)

blp_plot_df <- subset(blp_plot_df, variable != "(Intercept)")

cf_only <- blp_plot_df[blp_plot_df$model == "Causal Forest", ]
ordervars <- cf_only$variable[order(cf_only$estimate)]
blp_plot_df$variable <- factor(blp_plot_df$variable, levels = ordervars)

ggplot(blp_plot_df, aes(x = estimate, y = variable, colour = model)) +
  geom_vline(xintercept = 0, colour = "grey60", linetype = 2) +
  geom_pointrange(
    aes(xmin = lower, xmax = upper),
    position = position_dodge(width = 0.55),
    linewidth = 0.45
  ) +
  scale_colour_manual(values = c("Causal Forest" = pal[["purple"]], "Double ML" = pal[["gold_dark"]])) +
  labs(
    title = "Do both score constructions tell the same story?",
    x = "Coefficient estimate with 95% CI",
    y = NULL,
    colour = NULL
  )

Side-by-side comparison of BLP coefficients from causal-forest and DoubleML scores.

14 Policy Discussion

Evidence of benefit: Is the estimated average treatment effect positive?
Evidence of heterogeneity: Are predicted CATEs meaningfully different across children?
Targeting ethics: Even if some variables predict benefit, should they be used to allocate access?

Important

Prediction is not policy. A targeting rule should be judged not only by estimated benefit, but also by transparency, feasibility, budget constraints, and equity.