Machine learning for estimating average treatment effects

HEOR-DS 530, Practical Demonstration

Author

Noemi Kreif. Previous version developed together with Julia Hatamyar

Published

May 1, 2026

This tutorial demonstrates the main coding workflow for estimating and interpreting conditional average treatment effects (CATEs) with causal forests.

The structure is as follows:

Note

This document is designed for live teaching. The code is intentionally explicit rather than overly compact, so students can map each line to a step in the assignment.

1 Learning Goals

By the end of the walkthrough, students should be able to:

  1. estimate nuisance functions for causal forests;
  2. explain why outcome and propensity score models matter for confounding adjustment;
  3. choose effect modifiers with a policy decision in mind;
  4. interpret predicted CATEs and their distribution;
  5. use variable importance and Best Linear Predictor (BLP) regressions as summaries of heterogeneity;
  6. compare causal-forest double-robust scores with Double Machine Learning scores.

2 Setup

  • Check availability of packages and stop with a clear message if a package is missing.
  • Apply UW branding for plots and tables via formatQuarto.
  • Create helper functions for tables and plots to keep the code clean and focused on the analysis steps.

3 Data

We use the IHDP simulation data. The outcome Y is a cognitive test score, and W indicates treatment assignment. All other columns are baseline covariates.

Code
load("ihdp_sim.data")

data.frame(
  rows = nrow(IQdata),
  columns = ncol(IQdata)
) |>
  gt_uw("Dataset dimensions")
Dataset dimensions
rows columns
787 36
Code
head(dplyr::select(IQdata, Y, W, bw, momage, nnhealth, ppvt.imp, female)) |>
  gt_uw("First six rows: key analysis variables")
First six rows: key analysis variables
Y W bw momage nnhealth ppvt.imp female
120 1 1559 33 94 75.00000 1
90 1 1420 15 85 73.00000 1
76 0 1000 33 89 77.00000 0
43 0 1430 22 112 62.37235 0
73 0 1984 20 99 73.00000 0
83 0 1320 23 110 56.00000 1

4 Variable Dictionary

The variable descriptions below are taken from VariableDescription.doc. Some names in the original document differ slightly from the names used in IQdata; the table keeps both labels visible.

IHDP variable descriptions
Source: VariableDescription.doc
Variable in source document Variable in IQdata Description
treat W Indicator for IHDP intervention group; 1 for intervention, 0 for control.
bw bw Child birthweight in grams.
b.head Not included in IQdata Child head circumference at birth, measured in centimeters.
preterm Not included in IQdata Number of weeks pre-term that the child was born.
birth.o birth.o Birth order.
nnhealth nnhealth Neonatal health index based on birth-related variables such as hospital days and gestational age.
momage momage Mother's age when she gave birth to the child.
sex female Child gender; female is coded as 1.
twin Not included in IQdata Indicator for whether the child is a twin.
b.marry b.marry Indicator for whether the mother was married when the child was born.
momed indicators mom.lths, mom.hs, mom.scoll, mom.coll Maternal education at birth: less than high school, high school, some college, or college.
first Not included in IQdata Indicator for whether the child is first born.
cig cigs Indicator for whether the mother smoked cigarettes while pregnant.
booze alcohol Indicator for whether the mother consumed alcohol while pregnant.
drugs drugs Indicator for whether the mother ever used drugs while pregnant.
work.dur workdur.imp Indicator for whether the mother worked during pregnancy.
prenatal / whenpren whenpren Timing of prenatal care; in this dataset, trimester when prenatal care began.
site indicator variables site1-site8 Indicators for study site; represented as site1 through site8 in this dataset.
momwhite, momblack, momhisp momwhite, momblack, momhisp Indicators for mother's race or ethnicity; white includes white and other in the original description.
parity parity Number of children the mother has given birth to.
moreprem moreprem Number of other children the mother has given birth to prematurely.
ppvt.imp ppvt.imp Mother's PPVT score measured one year post-birth; some values are imputed.
mlt.birt mlt.birt Number of multiple births the mother has had.
livwho livwho Family member with whom the child primarily lives.
language language Primary language spoken at home.
othstudy othstudy Indicator for whether the family was participating in another study at the same time.
Code
X <- IQdata[, !names(IQdata) %in% c("W", "Y")]
Y <- IQdata$Y
W <- IQdata$W

data.frame(
  n = nrow(IQdata),
  covariates = ncol(X),
  treated = sum(W == 1),
  control = sum(W == 0),
  mean_outcome = mean(Y)
) |>
  dplyr::mutate(mean_outcome = round(mean_outcome, 2)) |>
  gt_uw("Analytic sample")
Analytic sample
n covariates treated control mean_outcome
787 34 226 561 88

5 Step 1: Estimate Nuisance Models

Causal forests use nuisance functions to adjust for confounding:

  1. W.hat: the propensity score, or probability of treatment conditional on covariates;
  2. Y.hat: the conditional mean of the outcome.

Here we estimate both with regression_forest() and tune all available hyperparameters.

Code
propscore_forest <- regression_forest(
  X = X,
  Y = W,
  num.threads = 1,
  tune.parameters = "all"
)

outcome_forest <- regression_forest(
  X = X,
  Y = Y,
  num.threads = 1,
  tune.parameters = "all"
)

W.hat <- predict(propscore_forest)$predictions
Y.hat <- predict(outcome_forest)$predictions

Identify the hyperparameter that controls how many variables are tried at each split. In grf, this is mtry.

Code
default_mtry <- min(ceiling(sqrt(ncol(X)) + 20), ncol(X))

dplyr::bind_rows(
  data.frame(model = "Default rule", parameter = "mtry", value = as.character(default_mtry)),
  tuning_table(propscore_forest, "Propensity forest"),
  tuning_table(outcome_forest, "Outcome forest")
) |>
  gt_uw("Selected tuning parameters")
Selected tuning parameters
model parameter value
Default rule mtry 26
Propensity forest sample.fraction 0.5
Propensity forest mtry 26
Propensity forest min.node.size 5
Propensity forest honesty.fraction 0.5
Propensity forest honesty.prune.leaves 1
Propensity forest alpha 0.05
Propensity forest imbalance.penalty 0
Outcome forest sample.fraction 0.379
Outcome forest mtry 12
Outcome forest min.node.size 2
Outcome forest honesty.fraction 0.626
Outcome forest honesty.prune.leaves 1
Outcome forest alpha 0.01
Outcome forest imbalance.penalty 0.655
Code
par(mfrow = c(1, 2))
hist(W.hat, main = "Predicted propensity score", xlab = "W.hat", col = pal[["gold"]])
hist(Y.hat, main = "Predicted outcome", xlab = "Y.hat", col = pal[["purple"]])

Predicted propensity scores and predicted outcomes from nuisance forests.
Code
par(mfrow = c(1, 1))
Tip

The nuisance models are not the final estimand. They are supporting models that help the causal forest compare treated and control observations that are comparable on observed covariates.

6 Step 2: Choose Policy-Relevant Effect Modifiers

Let’s choose variables that could plausibly be used by a policy maker. This is not only a modeling decision; it is also an ethical and operational decision.

For this demonstration, momwhite is kept as the reference category, while momblack and momhisp enter the model.

Code
myvars <- c(
  "bw", "female", "ppvt.imp", "momage", "nnhealth", "birth.o",
  "cigs", "alcohol", "b.marry", "mom.hs", "mom.coll", "mom.scoll",
  "momblack", "momhisp"
)

descriptive_vars <- c(myvars, "momwhite")
X_selected <- X[, names(X) %in% myvars]

data.frame(effect_modifier = names(X_selected)) |>
  gt_uw("Selected effect modifiers")
Selected effect modifiers
effect_modifier
bw
momage
nnhealth
birth.o
cigs
alcohol
ppvt.imp
female
b.marry
mom.hs
mom.coll
mom.scoll
momblack
momhisp
Warning

This list is meant for discussion, not as a final policy recommendation. Some variables may be predictive but ethically sensitive or unavailable to a local decision maker. The model can reveal heterogeneity; it does not decide what is fair to use for targeting.

7 Step 3: Fit the Causal Forest

We now estimate the CATE function:

\[ \tau(x) = E[Y(1) - Y(0) \mid X = x] \]

Code
tau_forest <- causal_forest(
  X = X_selected,
  Y = Y,
  W = W,
  W.hat = W.hat,
  Y.hat = Y.hat,
  num.threads = 1,
  tune.parameters = "all"
)

tau_hat <- predict(tau_forest)$predictions
Code
mean_cate <- mean(tau_hat)

data.frame(
  statistic = c("Minimum", "25th percentile", "Median", "Mean", "75th percentile", "Maximum"),
  value = as.numeric(summary(tau_hat))
) |>
  dplyr::mutate(value = round(value, 3)) |>
  gt_uw("Distribution of predicted CATEs")
Distribution of predicted CATEs
statistic value
Minimum -1.349
25th percentile 6.435
Median 8.548
Mean 8.713
75th percentile 10.845
Maximum 18.091
Code
data.frame(
  diagnostic = c("Average predicted CATE", "Any predicted CATE below 0?"),
  value = c(round(mean_cate, 3), ifelse(any(tau_hat < 0), "Yes", "No"))
) |>
  gt_uw("CATE diagnostics")
CATE diagnostics
diagnostic value
Average predicted CATE 8.713
Any predicted CATE below 0? Yes
Code
ggplot(data.frame(tau_hat), aes(x = tau_hat)) +
  geom_histogram(bins = 30, fill = pal[["purple"]], colour = "white", alpha = 0.88) +
  geom_vline(xintercept = 0, colour = pal[["gold_dark"]], linewidth = 1) +
  geom_vline(xintercept = mean_cate, colour = pal[["gold"]], linewidth = 1.2) +
  labs(
    title = "Predicted CATEs",
    x = "Predicted treatment effect",
    y = "Number of children"
  )

Distribution of predicted individualized treatment effects.

Teaching prompt If some predicted CATEs are below zero, do we conclude the intervention harmed those children? Discuss estimation uncertainty, model dependence, and the difference between individual predictions and subgroup-level evidence.

8 Step 4: Who Benefits More?

One simple descriptive exercise is to split observations by above-average versus below-average predicted benefit, then compare baseline covariates.

Code
highcate <- as.numeric(tau_hat >= mean_cate)

balancetable_CATE <- CreateTableOne(
  vars = descriptive_vars,
  strata = "highcate",
  data = cbind(IQdata, highcate)
)

tableone_df <- print(
  balancetable_CATE,
  smd = TRUE,
  printToggle = FALSE,
  noSpaces = TRUE
) |>
  as.data.frame() |>
  tibble::rownames_to_column("Covariate")

tableone_df |>
  gt_uw("Baseline characteristics by predicted benefit group")
Baseline characteristics by predicted benefit group
Covariate 0 1 p test SMD
n 411 376
bw (mean (SD)) 1549.75 (392.99) 2077.75 (353.13) <0.001 1.413
female (mean (SD)) 0.50 (0.50) 0.52 (0.50) 0.574 0.040
ppvt.imp (mean (SD)) 82.27 (22.70) 81.68 (19.00) 0.692 0.028
momage (mean (SD)) 24.63 (6.34) 25.41 (5.70) 0.071 0.129
nnhealth (mean (SD)) 101.89 (16.19) 97.19 (14.65) <0.001 0.304
birth.o (mean (SD)) 1.65 (0.93) 2.19 (1.28) <0.001 0.488
cigs (mean (SD)) 3.30 (6.62) 4.97 (8.24) 0.002 0.223
alcohol (mean (SD)) 0.38 (1.60) 0.30 (1.07) 0.362 0.066
b.marry (mean (SD)) 1.56 (0.60) 1.62 (0.68) 0.192 0.093
mom.hs (mean (SD)) 0.21 (0.41) 0.34 (0.47) <0.001 0.303
mom.coll (mean (SD)) 0.20 (0.40) 0.22 (0.42) 0.413 0.058
mom.scoll (mean (SD)) 0.19 (0.40) 0.09 (0.28) <0.001 0.310
momblack (mean (SD)) 0.47 (0.50) 0.47 (0.50) 0.912 0.008
momhisp (mean (SD)) 0.09 (0.28) 0.11 (0.32) 0.259 0.080
momwhite (mean (SD)) 0.44 (0.50) 0.42 (0.49) 0.569 0.041

9 Step 5: Variable Importance for Heterogeneity

The variable_importance() function tells us which variables the causal forest split on most often while estimating heterogeneity. This is useful, but limited: it is not a regression coefficient, and it does not tell us the direction of association.

Code
tau_varimp <- variable_importance(tau_forest)

importance_df <- data.frame(
  variable = colnames(X_selected),
  importance = tau_varimp
) |>
  arrange(desc(importance))

ggplot(importance_df, aes(x = reorder(variable, importance), y = importance)) +
  geom_col(fill = pal[["purple"]], alpha = 0.9) +
  coord_flip() +
  labs(
    title = "Which variables drive forest splits?",
    x = NULL,
    y = "Variable importance"
  )

Causal-forest variable importance for treatment effect heterogeneity.

10 Step 6: Double-Robust Scores and the ATE

The causal forest also produces double-robust scores. Their average is a double-robust estimate of the average treatment effect.

Double-robust means that the ATE estimate is consistent if either the propensity score model or the outcome model is correctly specified. This is a useful robustness check, but it does not mean the ATE estimate is unbiased if both models are misspecified.

Code
DR_scores <- get_scores(tau_forest)

average_treatment_effect(tau_forest) |>
  ate_table("Causal forest") |>
  dplyr::bind_rows(
    data.frame(
      estimator = "Mean of DR scores",
      estimate = mean(DR_scores),
      std_error = NA_real_
    )
  ) |>
  dplyr::mutate(
    estimate = round(estimate, 3),
    std_error = round(std_error, 3)
  ) |>
  gt_uw("Average treatment effect estimates")
Average treatment effect estimates
estimator estimate std_error
Causal forest 8.995 1.274
Mean of DR scores 8.995 NA
Code
ggplot(data.frame(DR_scores), aes(x = DR_scores)) +
  geom_histogram(bins = 30, fill = pal[["gold"]], colour = "white") +
  geom_vline(xintercept = mean(DR_scores), colour = pal[["purple"]], linewidth = 1.2) +
  labs(
    title = "Double-robust scores from the causal forest",
    x = "Score",
    y = "Number of children"
  )

Distribution of causal-forest double-robust scores.

11 Step 7: Best Linear Predictor of Heterogeneity

A BLP regression summarizes heterogeneity by regressing the double-robust scores on selected effect modifiers. This is deliberately interpretable, but it can miss nonlinear relationships and interactions.

Code
linear_predictors_CF <- lm(
  DR_scores ~ .,
  data = as.data.frame(cbind(DR_scores, X_selected))
)

model_table(
  linear_predictors_CF,
  "Best Linear Predictor using causal-forest scores"
)
Best Linear Predictor using causal-forest scores
Term Estimate SE t p-value 95% CI low 95% CI high
(Intercept) -19.210 14.345 -1.339 0.181 -47.369 8.950
bw 0.008 0.003 2.866 0.004 0.003 0.014
momage 0.098 0.283 0.345 0.731 -0.459 0.654
nnhealth -0.068 0.084 -0.816 0.415 -0.232 0.096
birth.o 1.628 1.274 1.278 0.202 -0.872 4.128
cigs 0.077 0.182 0.423 0.672 -0.281 0.435
alcohol -0.810 0.954 -0.849 0.396 -2.683 1.063
ppvt.imp 0.097 0.090 1.076 0.282 -0.080 0.274
female 3.195 2.576 1.240 0.215 -1.861 8.251
b.marry 0.812 2.306 0.352 0.725 -3.715 5.339
mom.hs 5.533 3.405 1.625 0.105 -1.151 12.217
mom.coll -0.393 4.151 -0.095 0.925 -8.542 7.755
mom.scoll -1.335 5.822 -0.229 0.819 -12.764 10.093
momblack 4.678 3.557 1.315 0.189 -2.304 11.661
momhisp 6.315 4.968 1.271 0.204 -3.438 16.067
Code
get_blp_df <- function(model, label) {
  s <- broom::tidy(model)
  data.frame(
    variable = s$term,
    estimate = s$estimate,
    lower = s$estimate - 1.96 * s$std.error,
    upper = s$estimate + 1.96 * s$std.error,
    model = label,
    stringsAsFactors = FALSE
  )
}
Code
cf_blp_plot_df <- get_blp_df(linear_predictors_CF, "Causal Forest")
cf_blp_plot_df <- subset(cf_blp_plot_df, variable != "(Intercept)")
cf_blp_plot_df$variable <- factor(
  cf_blp_plot_df$variable,
  levels = cf_blp_plot_df$variable[order(cf_blp_plot_df$estimate)]
)

ggplot(cf_blp_plot_df, aes(x = estimate, y = variable)) +
  geom_vline(xintercept = 0, colour = "grey60", linetype = 2) +
  geom_pointrange(aes(xmin = lower, xmax = upper), colour = pal[["purple"]]) +
  labs(
    title = "Linear summary of treatment effect heterogeneity",
    x = "Coefficient estimate with 95% CI",
    y = NULL
  )

Best Linear Predictor coefficients using causal-forest scores.
Note

We can interpret the birth weight coefficient as the expected change in the double-robust score associated with a one-unit increase in bw, holding the other selected effect modifiers fixed. If the scale is hard to interpret, rescale bw before fitting the BLP.

12 Step 8: Compare With a Traditional Interaction Regression

A traditional approach is to fit a linear outcome model with treatment-by- covariate interactions. This imposes stronger functional-form assumptions: linear covariate effects and linear interactions with treatment.

Code
confounders <- setdiff(names(IQdata), "Y")
interaction_terms <- paste0("W:", myvars, collapse = " + ")

linear_formula <- as.formula(
  paste("Y ~", paste(confounders, collapse = " + "), "+", interaction_terms)
)

linear_interact <- lm(linear_formula, data = IQdata)

broom::tidy(linear_interact, conf.int = TRUE) |>
  dplyr::filter(term == "W" | grepl(":W|W:", term)) |>
  dplyr::mutate(
    dplyr::across(
      c(estimate, std.error, statistic, p.value, conf.low, conf.high),
      \(x) round(x, 3)
    )
  ) |>
  gt_uw("Treatment and interaction terms from the traditional outcome regression") |>
  gt::cols_label(
    term = "Term",
    estimate = "Estimate",
    std.error = "SE",
    statistic = "t",
    p.value = "p-value",
    conf.low = "95% CI low",
    conf.high = "95% CI high"
  )
Treatment and interaction terms from the traditional outcome regression
Term Estimate SE t p-value 95% CI low 95% CI high
W -32.791 13.441 -2.440 0.015 -59.179 -6.403
bw:W 0.009 0.003 3.335 0.001 0.004 0.015
female:W 2.142 2.461 0.870 0.384 -2.689 6.973
ppvt.imp:W 0.178 0.085 2.108 0.035 0.012 0.344
momage:W -0.033 0.278 -0.118 0.906 -0.579 0.513
nnhealth:W 0.039 0.081 0.477 0.633 -0.120 0.197
birth.o:W 1.452 1.277 1.137 0.256 -1.056 3.960
cigs:W 0.067 0.163 0.409 0.683 -0.254 0.387
alcohol:W -0.754 0.990 -0.761 0.447 -2.699 1.190
b.marry:W 0.564 2.228 0.253 0.800 -3.810 4.939
mom.hs:W 3.429 3.302 1.038 0.299 -3.054 9.912
mom.coll:W -0.520 4.027 -0.129 0.897 -8.426 7.387
mom.scoll:W -5.197 5.367 -0.968 0.333 -15.733 5.339
momblack:W 3.267 3.323 0.983 0.326 -3.256 9.790
momhisp:W 5.627 5.057 1.113 0.266 -4.300 15.554

Compare this output to the causal-forest BLP. Are the same variables highlighted? If not, is that a substantive disagreement or a consequence of different modeling assumptions?

13 Step 9: Extension With Double Machine Learning

Double Machine Learning also produces double-robust scores, but it obtains them through cross-fitting nuisance models rather than through the causal forest CATE estimator. Comparing BLPs from both scores is a robustness check on the heterogeneity story.

Code
set.seed(123456)

obj_dml_data <- DoubleMLData$new(IQdata, y_col = "Y", d_cols = "W")

Y_learner <- lrn(
  "regr.ranger",
  num.trees = 100,
  min.node.size = 2,
  max.depth = 5,
  num.threads = 1
)

W_learner <- lrn(
  "classif.ranger",
  num.trees = 500,
  min.node.size = 2,
  max.depth = 5,
  num.threads = 1
)

doubleml <- DoubleMLIRM$new(
  obj_dml_data,
  ml_g = Y_learner,
  ml_m = W_learner,
  n_folds = 5,
  score = "ATE"
)

doubleml$fit(store_predictions = TRUE)
dml_ate <- data.frame(
  estimator = "DoubleML",
  estimate = as.numeric(doubleml$coef),
  std_error = as.numeric(doubleml$se),
  row.names = NULL
)

cf_ate <- average_treatment_effect(tau_forest) |>
  ate_table("Causal forest")

dplyr::bind_rows(dml_ate, cf_ate) |>
  dplyr::mutate(
    estimate = round(estimate, 3),
    std_error = round(std_error, 3)
  ) |>
  gt_uw("ATE comparison")
ATE comparison
estimator estimate std_error
DoubleML 8.716 1.331
Causal forest 8.995 1.274
Code
DML_scores <- as.numeric(doubleml$psi[, 1, 1])

linear_predictors_DML <- lm(
  DML_scores ~ .,
  data = as.data.frame(cbind(DML_scores, X_selected))
)

model_table(
  linear_predictors_DML,
  "Best Linear Predictor using DoubleML scores"
)
Best Linear Predictor using DoubleML scores
Term Estimate SE t p-value 95% CI low 95% CI high
(Intercept) -26.439 14.993 -1.763 0.078 -55.871 2.993
bw 0.009 0.003 3.139 0.002 0.003 0.015
momage -0.003 0.296 -0.011 0.991 -0.585 0.578
nnhealth -0.074 0.087 -0.847 0.397 -0.245 0.097
birth.o 1.650 1.331 1.240 0.215 -0.963 4.263
cigs 0.108 0.191 0.566 0.572 -0.266 0.482
alcohol -0.864 0.997 -0.867 0.386 -2.822 1.093
ppvt.imp 0.104 0.094 1.106 0.269 -0.081 0.289
female 3.664 2.692 1.361 0.174 -1.620 8.949
b.marry -0.415 2.410 -0.172 0.863 -5.146 4.317
mom.hs 5.329 3.559 1.497 0.135 -1.657 12.315
mom.coll -0.441 4.338 -0.102 0.919 -8.958 8.075
mom.scoll -1.838 6.085 -0.302 0.763 -13.784 10.107
momblack 5.240 3.718 1.410 0.159 -2.057 12.538
momhisp 7.156 5.192 1.378 0.169 -3.037 17.349
Code
cf_coefs <- broom::tidy(linear_predictors_CF)
dml_coefs <- broom::tidy(linear_predictors_DML)

compare_BLP <- data.frame(
  variable = cf_coefs$term,
  est_CF = round(cf_coefs$estimate, 3),
  SE_CF = round(cf_coefs$std.error, 3),
  est_DML = round(
    dml_coefs$estimate[match(cf_coefs$term, dml_coefs$term)],
    3
  ),
  SE_DML = round(
    dml_coefs$std.error[match(cf_coefs$term, dml_coefs$term)],
    3
  )
)

compare_BLP |>
  gt_uw("BLP coefficient comparison")
BLP coefficient comparison
variable est_CF SE_CF est_DML SE_DML
(Intercept) -19.210 14.345 -26.439 14.993
bw 0.008 0.003 0.009 0.003
momage 0.098 0.283 -0.003 0.296
nnhealth -0.068 0.084 -0.074 0.087
birth.o 1.628 1.274 1.650 1.331
cigs 0.077 0.182 0.108 0.191
alcohol -0.810 0.954 -0.864 0.997
ppvt.imp 0.097 0.090 0.104 0.094
female 3.195 2.576 3.664 2.692
b.marry 0.812 2.306 -0.415 2.410
mom.hs 5.533 3.405 5.329 3.559
mom.coll -0.393 4.151 -0.441 4.338
mom.scoll -1.335 5.822 -1.838 6.085
momblack 4.678 3.557 5.240 3.718
momhisp 6.315 4.968 7.156 5.192
Code
blp_plot_df <- rbind(
  get_blp_df(linear_predictors_CF, "Causal Forest"),
  get_blp_df(linear_predictors_DML, "Double ML")
)

blp_plot_df <- subset(blp_plot_df, variable != "(Intercept)")

cf_only <- blp_plot_df[blp_plot_df$model == "Causal Forest", ]
ordervars <- cf_only$variable[order(cf_only$estimate)]
blp_plot_df$variable <- factor(blp_plot_df$variable, levels = ordervars)

ggplot(blp_plot_df, aes(x = estimate, y = variable, colour = model)) +
  geom_vline(xintercept = 0, colour = "grey60", linetype = 2) +
  geom_pointrange(
    aes(xmin = lower, xmax = upper),
    position = position_dodge(width = 0.55),
    linewidth = 0.45
  ) +
  scale_colour_manual(values = c("Causal Forest" = pal[["purple"]], "Double ML" = pal[["gold_dark"]])) +
  labs(
    title = "Do both score constructions tell the same story?",
    x = "Coefficient estimate with 95% CI",
    y = NULL,
    colour = NULL
  )

Side-by-side comparison of BLP coefficients from causal-forest and DoubleML scores.

14 Policy Discussion

  1. Evidence of benefit: Is the estimated average treatment effect positive?
  2. Evidence of heterogeneity: Are predicted CATEs meaningfully different across children?
  3. Targeting ethics: Even if some variables predict benefit, should they be used to allocate access?
Important

Prediction is not policy. A targeting rule should be judged not only by estimated benefit, but also by transparency, feasibility, budget constraints, and equity.