Noemi Kreif. Previous version developed together with Julia Hatamyar
Published
May 1, 2026
This tutorial demonstrates the main coding workflow for estimating and interpreting conditional average treatment effects (CATEs) with causal forests.
The structure is as follows:
estimate nuisance functions,
fit a causal forest using policy-relevant effect modifiers,
inspect heterogeneity,
summarize it with interpretable models,
and compare the causal-forest scores with Double Machine Learning scores.
Note
This document is designed for live teaching. The code is intentionally explicit rather than overly compact, so students can map each line to a step in the assignment.
1 Learning Goals
By the end of the walkthrough, students should be able to:
estimate nuisance functions for causal forests;
explain why outcome and propensity score models matter for confounding adjustment;
choose effect modifiers with a policy decision in mind;
interpret predicted CATEs and their distribution;
use variable importance and Best Linear Predictor (BLP) regressions as summaries of heterogeneity;
compare causal-forest double-robust scores with Double Machine Learning scores.
2 Setup
Check availability of packages and stop with a clear message if a package is missing.
Apply UW branding for plots and tables via formatQuarto.
Create helper functions for tables and plots to keep the code clean and focused on the analysis steps.
3 Data
We use the IHDP simulation data. The outcome Y is a cognitive test score, and W indicates treatment assignment. All other columns are baseline covariates.
head(dplyr::select(IQdata, Y, W, bw, momage, nnhealth, ppvt.imp, female)) |>gt_uw("First six rows: key analysis variables")
First six rows: key analysis variables
Y
W
bw
momage
nnhealth
ppvt.imp
female
120
1
1559
33
94
75.00000
1
90
1
1420
15
85
73.00000
1
76
0
1000
33
89
77.00000
0
43
0
1430
22
112
62.37235
0
73
0
1984
20
99
73.00000
0
83
0
1320
23
110
56.00000
1
4 Variable Dictionary
The variable descriptions below are taken from VariableDescription.doc. Some names in the original document differ slightly from the names used in IQdata; the table keeps both labels visible.
IHDP variable descriptions
Source: VariableDescription.doc
Variable in source document
Variable in IQdata
Description
treat
W
Indicator for IHDP intervention group; 1 for intervention, 0 for control.
bw
bw
Child birthweight in grams.
b.head
Not included in IQdata
Child head circumference at birth, measured in centimeters.
preterm
Not included in IQdata
Number of weeks pre-term that the child was born.
birth.o
birth.o
Birth order.
nnhealth
nnhealth
Neonatal health index based on birth-related variables such as hospital days and gestational age.
momage
momage
Mother's age when she gave birth to the child.
sex
female
Child gender; female is coded as 1.
twin
Not included in IQdata
Indicator for whether the child is a twin.
b.marry
b.marry
Indicator for whether the mother was married when the child was born.
momed indicators
mom.lths, mom.hs, mom.scoll, mom.coll
Maternal education at birth: less than high school, high school, some college, or college.
first
Not included in IQdata
Indicator for whether the child is first born.
cig
cigs
Indicator for whether the mother smoked cigarettes while pregnant.
booze
alcohol
Indicator for whether the mother consumed alcohol while pregnant.
drugs
drugs
Indicator for whether the mother ever used drugs while pregnant.
work.dur
workdur.imp
Indicator for whether the mother worked during pregnancy.
prenatal / whenpren
whenpren
Timing of prenatal care; in this dataset, trimester when prenatal care began.
site indicator variables
site1-site8
Indicators for study site; represented as site1 through site8 in this dataset.
momwhite, momblack, momhisp
momwhite, momblack, momhisp
Indicators for mother's race or ethnicity; white includes white and other in the original description.
parity
parity
Number of children the mother has given birth to.
moreprem
moreprem
Number of other children the mother has given birth to prematurely.
ppvt.imp
ppvt.imp
Mother's PPVT score measured one year post-birth; some values are imputed.
mlt.birt
mlt.birt
Number of multiple births the mother has had.
livwho
livwho
Family member with whom the child primarily lives.
language
language
Primary language spoken at home.
othstudy
othstudy
Indicator for whether the family was participating in another study at the same time.
par(mfrow =c(1, 2))hist(W.hat, main ="Predicted propensity score", xlab ="W.hat", col = pal[["gold"]])hist(Y.hat, main ="Predicted outcome", xlab ="Y.hat", col = pal[["purple"]])
Predicted propensity scores and predicted outcomes from nuisance forests.
Code
par(mfrow =c(1, 1))
Tip
The nuisance models are not the final estimand. They are supporting models that help the causal forest compare treated and control observations that are comparable on observed covariates.
6 Step 2: Choose Policy-Relevant Effect Modifiers
Let’s choose variables that could plausibly be used by a policy maker. This is not only a modeling decision; it is also an ethical and operational decision.
For this demonstration, momwhite is kept as the reference category, while momblack and momhisp enter the model.
This list is meant for discussion, not as a final policy recommendation. Some variables may be predictive but ethically sensitive or unavailable to a local decision maker. The model can reveal heterogeneity; it does not decide what is fair to use for targeting.
Distribution of predicted individualized treatment effects.
Teaching prompt If some predicted CATEs are below zero, do we conclude the intervention harmed those children? Discuss estimation uncertainty, model dependence, and the difference between individual predictions and subgroup-level evidence.
8 Step 4: Who Benefits More?
One simple descriptive exercise is to split observations by above-average versus below-average predicted benefit, then compare baseline covariates.
Baseline characteristics by predicted benefit group
Covariate
0
1
p
test
SMD
n
411
376
bw (mean (SD))
1549.75 (392.99)
2077.75 (353.13)
<0.001
1.413
female (mean (SD))
0.50 (0.50)
0.52 (0.50)
0.574
0.040
ppvt.imp (mean (SD))
82.27 (22.70)
81.68 (19.00)
0.692
0.028
momage (mean (SD))
24.63 (6.34)
25.41 (5.70)
0.071
0.129
nnhealth (mean (SD))
101.89 (16.19)
97.19 (14.65)
<0.001
0.304
birth.o (mean (SD))
1.65 (0.93)
2.19 (1.28)
<0.001
0.488
cigs (mean (SD))
3.30 (6.62)
4.97 (8.24)
0.002
0.223
alcohol (mean (SD))
0.38 (1.60)
0.30 (1.07)
0.362
0.066
b.marry (mean (SD))
1.56 (0.60)
1.62 (0.68)
0.192
0.093
mom.hs (mean (SD))
0.21 (0.41)
0.34 (0.47)
<0.001
0.303
mom.coll (mean (SD))
0.20 (0.40)
0.22 (0.42)
0.413
0.058
mom.scoll (mean (SD))
0.19 (0.40)
0.09 (0.28)
<0.001
0.310
momblack (mean (SD))
0.47 (0.50)
0.47 (0.50)
0.912
0.008
momhisp (mean (SD))
0.09 (0.28)
0.11 (0.32)
0.259
0.080
momwhite (mean (SD))
0.44 (0.50)
0.42 (0.49)
0.569
0.041
9 Step 5: Variable Importance for Heterogeneity
The variable_importance() function tells us which variables the causal forest split on most often while estimating heterogeneity. This is useful, but limited: it is not a regression coefficient, and it does not tell us the direction of association.
Causal-forest variable importance for treatment effect heterogeneity.
10 Step 6: Double-Robust Scores and the ATE
The causal forest also produces double-robust scores. Their average is a double-robust estimate of the average treatment effect.
Double-robust means that the ATE estimate is consistent if either the propensity score model or the outcome model is correctly specified. This is a useful robustness check, but it does not mean the ATE estimate is unbiased if both models are misspecified.
ggplot(data.frame(DR_scores), aes(x = DR_scores)) +geom_histogram(bins =30, fill = pal[["gold"]], colour ="white") +geom_vline(xintercept =mean(DR_scores), colour = pal[["purple"]], linewidth =1.2) +labs(title ="Double-robust scores from the causal forest",x ="Score",y ="Number of children" )
Distribution of causal-forest double-robust scores.
11 Step 7: Best Linear Predictor of Heterogeneity
A BLP regression summarizes heterogeneity by regressing the double-robust scores on selected effect modifiers. This is deliberately interpretable, but it can miss nonlinear relationships and interactions.
Code
linear_predictors_CF <-lm( DR_scores ~ .,data =as.data.frame(cbind(DR_scores, X_selected)))model_table( linear_predictors_CF,"Best Linear Predictor using causal-forest scores")
Best Linear Predictor coefficients using causal-forest scores.
Note
We can interpret the birth weight coefficient as the expected change in the double-robust score associated with a one-unit increase in bw, holding the other selected effect modifiers fixed. If the scale is hard to interpret, rescale bw before fitting the BLP.
12 Step 8: Compare With a Traditional Interaction Regression
A traditional approach is to fit a linear outcome model with treatment-by- covariate interactions. This imposes stronger functional-form assumptions: linear covariate effects and linear interactions with treatment.
Code
confounders <-setdiff(names(IQdata), "Y")interaction_terms <-paste0("W:", myvars, collapse =" + ")linear_formula <-as.formula(paste("Y ~", paste(confounders, collapse =" + "), "+", interaction_terms))linear_interact <-lm(linear_formula, data = IQdata)broom::tidy(linear_interact, conf.int =TRUE) |> dplyr::filter(term =="W"|grepl(":W|W:", term)) |> dplyr::mutate( dplyr::across(c(estimate, std.error, statistic, p.value, conf.low, conf.high), \(x) round(x, 3) ) ) |>gt_uw("Treatment and interaction terms from the traditional outcome regression") |> gt::cols_label(term ="Term",estimate ="Estimate",std.error ="SE",statistic ="t",p.value ="p-value",conf.low ="95% CI low",conf.high ="95% CI high" )
Treatment and interaction terms from the traditional outcome regression
Term
Estimate
SE
t
p-value
95% CI low
95% CI high
W
-32.791
13.441
-2.440
0.015
-59.179
-6.403
bw:W
0.009
0.003
3.335
0.001
0.004
0.015
female:W
2.142
2.461
0.870
0.384
-2.689
6.973
ppvt.imp:W
0.178
0.085
2.108
0.035
0.012
0.344
momage:W
-0.033
0.278
-0.118
0.906
-0.579
0.513
nnhealth:W
0.039
0.081
0.477
0.633
-0.120
0.197
birth.o:W
1.452
1.277
1.137
0.256
-1.056
3.960
cigs:W
0.067
0.163
0.409
0.683
-0.254
0.387
alcohol:W
-0.754
0.990
-0.761
0.447
-2.699
1.190
b.marry:W
0.564
2.228
0.253
0.800
-3.810
4.939
mom.hs:W
3.429
3.302
1.038
0.299
-3.054
9.912
mom.coll:W
-0.520
4.027
-0.129
0.897
-8.426
7.387
mom.scoll:W
-5.197
5.367
-0.968
0.333
-15.733
5.339
momblack:W
3.267
3.323
0.983
0.326
-3.256
9.790
momhisp:W
5.627
5.057
1.113
0.266
-4.300
15.554
Compare this output to the causal-forest BLP. Are the same variables highlighted? If not, is that a substantive disagreement or a consequence of different modeling assumptions?
13 Step 9: Extension With Double Machine Learning
Double Machine Learning also produces double-robust scores, but it obtains them through cross-fitting nuisance models rather than through the causal forest CATE estimator. Comparing BLPs from both scores is a robustness check on the heterogeneity story.
Side-by-side comparison of BLP coefficients from causal-forest and DoubleML scores.
14 Policy Discussion
Evidence of benefit: Is the estimated average treatment effect positive?
Evidence of heterogeneity: Are predicted CATEs meaningfully different across children?
Targeting ethics: Even if some variables predict benefit, should they be used to allocate access?
Important
Prediction is not policy. A targeting rule should be judged not only by estimated benefit, but also by transparency, feasibility, budget constraints, and equity.
Source Code
---title: "Machine learning for estimating average treatment effects"subtitle: "HEOR-DS 530, Practical Demonstration"author: "Noemi Kreif. Previous version developed together with Julia Hatamyar"date: "May 2026"format: html: theme: - flatly - theme-template.scss toc: true toc-depth: 3 number-sections: true code-fold: show code-tools: true page-layout: article df-print: paged pdf: number-sections: trueexecute: warning: false message: false freeze: autoeditor: source---This tutorial demonstrates the main coding workflow for estimating and interpreting conditional average treatment effects (CATEs) with causal forests.The structure is as follows:- estimate nuisance functions, - fit a causal forest using policy-relevant effect modifiers,- inspect heterogeneity, - summarize it with interpretable models, - and compare the causal-forest scores with Double Machine Learning scores.::: callout-noteThis document is designed for live teaching. The code is intentionally explicit rather than overly compact, so students can map each line to a step in the assignment.:::## Learning GoalsBy the end of the walkthrough, students should be able to:1. estimate nuisance functions for causal forests;2. explain why outcome and propensity score models matter for confounding adjustment;3. choose effect modifiers with a policy decision in mind;4. interpret predicted CATEs and their distribution;5. use variable importance and Best Linear Predictor (BLP) regressions as summaries of heterogeneity;6. compare causal-forest double-robust scores with Double Machine Learning scores.## Setup- Check availability of packages and stop with a clear message if a package is missing.- Apply UW branding for plots and tables via `formatQuarto`.- Create helper functions for tables and plots to keep the code clean and focused on the analysis steps.```{r}#| label: setup#| echo: false#| include: falserm(list =ls())set.seed(123456)Sys.setenv(OMP_NUM_THREADS ="1",OPENBLAS_NUM_THREADS ="1",MKL_NUM_THREADS ="1",VECLIB_MAXIMUM_THREADS ="1",R_USER_CACHE_DIR ="/private/tmp/r-cache")dir.create("/private/tmp/r-cache", recursive =TRUE, showWarnings =FALSE)``````{r}#| label: packages#| echo: falseremotes::install_github("CarlosPiant/formatQuarto") # for UW-themed tables and plotsrequired_packages <-c("ranger", "grf", "caret", "tableone", "DoubleML", "mlr3","mlr3learners", "dplyr", "ggplot2", "gt", "broom", "tibble","formatQuarto")missing_packages <- required_packages[!vapply(required_packages, requireNamespace, logical(1), quietly =TRUE)]if (length(missing_packages) >0) {stop("Please install the following packages before rendering: ",paste(missing_packages, collapse =", ") )}invisible(lapply(required_packages, library, character.only =TRUE))``````{r}#| label: themes#| echo: false#| include: falseuse_theme("uw")pal <-palette_uw()model_table <-function(model, title) { broom::tidy(model, conf.int =TRUE) |> dplyr::mutate( dplyr::across(c(estimate, std.error, statistic, p.value, conf.low, conf.high), \(x) round(x, 3) ) ) |>gt_uw(title = title) |> gt::cols_label(term ="Term",estimate ="Estimate",std.error ="SE",statistic ="t",p.value ="p-value",conf.low ="95% CI low",conf.high ="95% CI high" )}tuning_table <-function(forest, model_name) {data.frame(parameter =names(forest$tuning.output$params),value =unlist(forest$tuning.output$params),row.names =NULL ) |> dplyr::mutate(model = model_name,value =as.character(round(as.numeric(value), 3)) ) |> dplyr::select(model, parameter, value)}ate_table <-function(x, label) {data.frame(estimator = label,estimate =unname(x["estimate"]),std_error =unname(x["std.err"]) )}```## DataWe use the IHDP simulation data. The outcome `Y` is a cognitive test score, and `W` indicates treatment assignment. All other columns are baseline covariates.```{r}#| label: load-dataload("ihdp_sim.data")data.frame(rows =nrow(IQdata),columns =ncol(IQdata)) |>gt_uw("Dataset dimensions")head(dplyr::select(IQdata, Y, W, bw, momage, nnhealth, ppvt.imp, female)) |>gt_uw("First six rows: key analysis variables")```## Variable DictionaryThe variable descriptions below are taken from `VariableDescription.doc`. Some names in the original document differ slightly from the names used in `IQdata`; the table keeps both labels visible.```{r}#| label: variable-dictionary#| echo: falsevariable_dictionary <-data.frame(source_variable =c("treat","bw","b.head","preterm","birth.o","nnhealth","momage","sex","twin","b.marry","momed indicators","first","cig","booze","drugs","work.dur","prenatal / whenpren","site indicator variables","momwhite, momblack, momhisp","parity","moreprem","ppvt.imp","mlt.birt","livwho","language","othstudy" ),dataset_variable =c("W","bw","Not included in IQdata","Not included in IQdata","birth.o","nnhealth","momage","female","Not included in IQdata","b.marry","mom.lths, mom.hs, mom.scoll, mom.coll","Not included in IQdata","cigs","alcohol","drugs","workdur.imp","whenpren","site1-site8","momwhite, momblack, momhisp","parity","moreprem","ppvt.imp","mlt.birt","livwho","language","othstudy" ),description =c("Indicator for IHDP intervention group; 1 for intervention, 0 for control.","Child birthweight in grams.","Child head circumference at birth, measured in centimeters.","Number of weeks pre-term that the child was born.","Birth order.","Neonatal health index based on birth-related variables such as hospital days and gestational age.","Mother's age when she gave birth to the child.","Child gender; female is coded as 1.","Indicator for whether the child is a twin.","Indicator for whether the mother was married when the child was born.","Maternal education at birth: less than high school, high school, some college, or college.","Indicator for whether the child is first born.","Indicator for whether the mother smoked cigarettes while pregnant.","Indicator for whether the mother consumed alcohol while pregnant.","Indicator for whether the mother ever used drugs while pregnant.","Indicator for whether the mother worked during pregnancy.","Timing of prenatal care; in this dataset, trimester when prenatal care began.","Indicators for study site; represented as site1 through site8 in this dataset.","Indicators for mother's race or ethnicity; white includes white and other in the original description.","Number of children the mother has given birth to.","Number of other children the mother has given birth to prematurely.","Mother's PPVT score measured one year post-birth; some values are imputed.","Number of multiple births the mother has had.","Family member with whom the child primarily lives.","Primary language spoken at home.","Indicator for whether the family was participating in another study at the same time." ))variable_dictionary |>gt_uw("IHDP variable descriptions", "Source: VariableDescription.doc") |> gt::cols_label(source_variable ="Variable in source document",dataset_variable ="Variable in IQdata",description ="Description" )``````{r}#| label: data-objectsX <- IQdata[, !names(IQdata) %in%c("W", "Y")]Y <- IQdata$YW <- IQdata$Wdata.frame(n =nrow(IQdata),covariates =ncol(X),treated =sum(W ==1),control =sum(W ==0),mean_outcome =mean(Y)) |> dplyr::mutate(mean_outcome =round(mean_outcome, 2)) |>gt_uw("Analytic sample")```## Step 1: Estimate Nuisance ModelsCausal forests use nuisance functions to adjust for confounding:1. `W.hat`: the propensity score, or probability of treatment conditional on covariates;2. `Y.hat`: the conditional mean of the outcome.Here we estimate both with `regression_forest()` and tune all available hyperparameters.```{r}#| label: nuisance-models#| cache: truepropscore_forest <-regression_forest(X = X,Y = W,num.threads =1,tune.parameters ="all")outcome_forest <-regression_forest(X = X,Y = Y,num.threads =1,tune.parameters ="all")W.hat <-predict(propscore_forest)$predictionsY.hat <-predict(outcome_forest)$predictions```Identify the hyperparameter that controls how many variables are tried at each split. In `grf`, this is `mtry`.```{r}#| label: mtry-checkdefault_mtry <-min(ceiling(sqrt(ncol(X)) +20), ncol(X))dplyr::bind_rows(data.frame(model ="Default rule", parameter ="mtry", value =as.character(default_mtry)),tuning_table(propscore_forest, "Propensity forest"),tuning_table(outcome_forest, "Outcome forest")) |>gt_uw("Selected tuning parameters")``````{r}#| label: nuisance-plots#| fig-cap: "Predicted propensity scores and predicted outcomes from nuisance forests."par(mfrow =c(1, 2))hist(W.hat, main ="Predicted propensity score", xlab ="W.hat", col = pal[["gold"]])hist(Y.hat, main ="Predicted outcome", xlab ="Y.hat", col = pal[["purple"]])par(mfrow =c(1, 1))```::: callout-tipThe nuisance models are not the final estimand. They are supporting models that help the causal forest compare treated and control observations that are comparable on observed covariates.:::## Step 2: Choose Policy-Relevant Effect ModifiersLet's choose variables that could plausibly be used by a policy maker. This is not only a modeling decision; it is also an ethical and operational decision.For this demonstration, `momwhite` is kept as the reference category, while `momblack` and `momhisp` enter the model.```{r}#| label: select-effect-modifiersmyvars <-c("bw", "female", "ppvt.imp", "momage", "nnhealth", "birth.o","cigs", "alcohol", "b.marry", "mom.hs", "mom.coll", "mom.scoll","momblack", "momhisp")descriptive_vars <-c(myvars, "momwhite")X_selected <- X[, names(X) %in% myvars]data.frame(effect_modifier =names(X_selected)) |>gt_uw("Selected effect modifiers")```::: callout-warningThis list is meant for discussion, not as a final policy recommendation. Some variables may be predictive but ethically sensitive or unavailable to a local decision maker. The model can reveal heterogeneity; it does not decide what is fair to use for targeting.:::## Step 3: Fit the Causal ForestWe now estimate the CATE function:$$ \tau(x) = E[Y(1) - Y(0) \mid X = x] $$```{r}#| label: causal-forest#| cache: truetau_forest <-causal_forest(X = X_selected,Y = Y,W = W,W.hat = W.hat,Y.hat = Y.hat,num.threads =1,tune.parameters ="all")tau_hat <-predict(tau_forest)$predictions``````{r}#| label: cate-summarymean_cate <-mean(tau_hat)data.frame(statistic =c("Minimum", "25th percentile", "Median", "Mean", "75th percentile", "Maximum"),value =as.numeric(summary(tau_hat))) |> dplyr::mutate(value =round(value, 3)) |>gt_uw("Distribution of predicted CATEs")data.frame(diagnostic =c("Average predicted CATE", "Any predicted CATE below 0?"),value =c(round(mean_cate, 3), ifelse(any(tau_hat <0), "Yes", "No"))) |>gt_uw("CATE diagnostics")``````{r}#| label: cate-histogram#| fig-cap: "Distribution of predicted individualized treatment effects."ggplot(data.frame(tau_hat), aes(x = tau_hat)) +geom_histogram(bins =30, fill = pal[["purple"]], colour ="white", alpha =0.88) +geom_vline(xintercept =0, colour = pal[["gold_dark"]], linewidth =1) +geom_vline(xintercept = mean_cate, colour = pal[["gold"]], linewidth =1.2) +labs(title ="Predicted CATEs",x ="Predicted treatment effect",y ="Number of children" )```::: tutorial-note[Teaching prompt]{.label} If some predicted CATEs are below zero, do we conclude the intervention harmed those children? Discuss estimation uncertainty, model dependence, and the difference between individual predictions and subgroup-level evidence.:::## Step 4: Who Benefits More?One simple descriptive exercise is to split observations by above-average versus below-average predicted benefit, then compare baseline covariates.```{r}#| label: tableone-catehighcate <-as.numeric(tau_hat >= mean_cate)balancetable_CATE <-CreateTableOne(vars = descriptive_vars,strata ="highcate",data =cbind(IQdata, highcate))tableone_df <-print( balancetable_CATE,smd =TRUE,printToggle =FALSE,noSpaces =TRUE) |>as.data.frame() |> tibble::rownames_to_column("Covariate")tableone_df |>gt_uw("Baseline characteristics by predicted benefit group")```## Step 5: Variable Importance for HeterogeneityThe `variable_importance()` function tells us which variables the causal forest split on most often while estimating heterogeneity. This is useful, but limited: it is not a regression coefficient, and it does not tell us the direction of association.```{r}#| label: tau-variable-importance#| fig-cap: "Causal-forest variable importance for treatment effect heterogeneity."tau_varimp <-variable_importance(tau_forest)importance_df <-data.frame(variable =colnames(X_selected),importance = tau_varimp) |>arrange(desc(importance))ggplot(importance_df, aes(x =reorder(variable, importance), y = importance)) +geom_col(fill = pal[["purple"]], alpha =0.9) +coord_flip() +labs(title ="Which variables drive forest splits?",x =NULL,y ="Variable importance" )```## Step 6: Double-Robust Scores and the ATEThe causal forest also produces double-robust scores. Their average is a double-robust estimate of the average treatment effect.Double-robust means that the ATE estimate is consistent if either the propensity score model or the outcome model is correctly specified. This is a useful robustness check, but it does not mean the ATE estimate is unbiased if both models are misspecified.```{r}#| label: cf-scoresDR_scores <-get_scores(tau_forest)average_treatment_effect(tau_forest) |>ate_table("Causal forest") |> dplyr::bind_rows(data.frame(estimator ="Mean of DR scores",estimate =mean(DR_scores),std_error =NA_real_ ) ) |> dplyr::mutate(estimate =round(estimate, 3),std_error =round(std_error, 3) ) |>gt_uw("Average treatment effect estimates")``````{r}#| label: dr-score-histogram#| fig-cap: "Distribution of causal-forest double-robust scores."ggplot(data.frame(DR_scores), aes(x = DR_scores)) +geom_histogram(bins =30, fill = pal[["gold"]], colour ="white") +geom_vline(xintercept =mean(DR_scores), colour = pal[["purple"]], linewidth =1.2) +labs(title ="Double-robust scores from the causal forest",x ="Score",y ="Number of children" )```## Step 7: Best Linear Predictor of HeterogeneityA BLP regression summarizes heterogeneity by regressing the double-robust scores on selected effect modifiers. This is deliberately interpretable, but it can miss nonlinear relationships and interactions.```{r}#| label: blp-cflinear_predictors_CF <-lm( DR_scores ~ .,data =as.data.frame(cbind(DR_scores, X_selected)))model_table( linear_predictors_CF,"Best Linear Predictor using causal-forest scores")``````{r}#| label: coef-helperget_blp_df <-function(model, label) { s <- broom::tidy(model)data.frame(variable = s$term,estimate = s$estimate,lower = s$estimate -1.96* s$std.error,upper = s$estimate +1.96* s$std.error,model = label,stringsAsFactors =FALSE )}``````{r}#| label: blp-cf-plot#| fig-cap: "Best Linear Predictor coefficients using causal-forest scores."cf_blp_plot_df <-get_blp_df(linear_predictors_CF, "Causal Forest")cf_blp_plot_df <-subset(cf_blp_plot_df, variable !="(Intercept)")cf_blp_plot_df$variable <-factor( cf_blp_plot_df$variable,levels = cf_blp_plot_df$variable[order(cf_blp_plot_df$estimate)])ggplot(cf_blp_plot_df, aes(x = estimate, y = variable)) +geom_vline(xintercept =0, colour ="grey60", linetype =2) +geom_pointrange(aes(xmin = lower, xmax = upper), colour = pal[["purple"]]) +labs(title ="Linear summary of treatment effect heterogeneity",x ="Coefficient estimate with 95% CI",y =NULL )```::: callout-noteWe can interpret the birth weight coefficient as the expected change in the double-robust score associated with a one-unit increase in `bw`, holding the other selected effect modifiers fixed. If the scale is hard to interpret, rescale `bw` before fitting the BLP.:::## Step 8: Compare With a Traditional Interaction RegressionA traditional approach is to fit a linear outcome model with treatment-by- covariate interactions. This imposes stronger functional-form assumptions: linear covariate effects and linear interactions with treatment.```{r}#| label: interaction-modelconfounders <-setdiff(names(IQdata), "Y")interaction_terms <-paste0("W:", myvars, collapse =" + ")linear_formula <-as.formula(paste("Y ~", paste(confounders, collapse =" + "), "+", interaction_terms))linear_interact <-lm(linear_formula, data = IQdata)broom::tidy(linear_interact, conf.int =TRUE) |> dplyr::filter(term =="W"|grepl(":W|W:", term)) |> dplyr::mutate( dplyr::across(c(estimate, std.error, statistic, p.value, conf.low, conf.high), \(x) round(x, 3) ) ) |>gt_uw("Treatment and interaction terms from the traditional outcome regression") |> gt::cols_label(term ="Term",estimate ="Estimate",std.error ="SE",statistic ="t",p.value ="p-value",conf.low ="95% CI low",conf.high ="95% CI high" )```::: tutorial-noteCompare this output to the causal-forest BLP. Are the same variables highlighted? If not, is that a substantive disagreement or a consequence of different modeling assumptions?:::## Step 9: Extension With Double Machine LearningDouble Machine Learning also produces double-robust scores, but it obtains them through cross-fitting nuisance models rather than through the causal forest CATE estimator. Comparing BLPs from both scores is a robustness check on the heterogeneity story.```{r}#| label: doubleml-fit#| cache: trueset.seed(123456)obj_dml_data <- DoubleMLData$new(IQdata, y_col ="Y", d_cols ="W")Y_learner <-lrn("regr.ranger",num.trees =100,min.node.size =2,max.depth =5,num.threads =1)W_learner <-lrn("classif.ranger",num.trees =500,min.node.size =2,max.depth =5,num.threads =1)doubleml <- DoubleMLIRM$new( obj_dml_data,ml_g = Y_learner,ml_m = W_learner,n_folds =5,score ="ATE")doubleml$fit(store_predictions =TRUE)dml_ate <-data.frame(estimator ="DoubleML",estimate =as.numeric(doubleml$coef),std_error =as.numeric(doubleml$se),row.names =NULL)cf_ate <-average_treatment_effect(tau_forest) |>ate_table("Causal forest")dplyr::bind_rows(dml_ate, cf_ate) |> dplyr::mutate(estimate =round(estimate, 3),std_error =round(std_error, 3) ) |>gt_uw("ATE comparison")``````{r}#| label: doubleml-blpDML_scores <-as.numeric(doubleml$psi[, 1, 1])linear_predictors_DML <-lm( DML_scores ~ .,data =as.data.frame(cbind(DML_scores, X_selected)))model_table( linear_predictors_DML,"Best Linear Predictor using DoubleML scores")``````{r}#| label: compare-blp-tablecf_coefs <- broom::tidy(linear_predictors_CF)dml_coefs <- broom::tidy(linear_predictors_DML)compare_BLP <-data.frame(variable = cf_coefs$term,est_CF =round(cf_coefs$estimate, 3),SE_CF =round(cf_coefs$std.error, 3),est_DML =round( dml_coefs$estimate[match(cf_coefs$term, dml_coefs$term)],3 ),SE_DML =round( dml_coefs$std.error[match(cf_coefs$term, dml_coefs$term)],3 ))compare_BLP |>gt_uw("BLP coefficient comparison")``````{r}#| label: compare-blp-plot#| fig-cap: "Side-by-side comparison of BLP coefficients from causal-forest and DoubleML scores."blp_plot_df <-rbind(get_blp_df(linear_predictors_CF, "Causal Forest"),get_blp_df(linear_predictors_DML, "Double ML"))blp_plot_df <-subset(blp_plot_df, variable !="(Intercept)")cf_only <- blp_plot_df[blp_plot_df$model =="Causal Forest", ]ordervars <- cf_only$variable[order(cf_only$estimate)]blp_plot_df$variable <-factor(blp_plot_df$variable, levels = ordervars)ggplot(blp_plot_df, aes(x = estimate, y = variable, colour = model)) +geom_vline(xintercept =0, colour ="grey60", linetype =2) +geom_pointrange(aes(xmin = lower, xmax = upper),position =position_dodge(width =0.55),linewidth =0.45 ) +scale_colour_manual(values =c("Causal Forest"= pal[["purple"]], "Double ML"= pal[["gold_dark"]])) +labs(title ="Do both score constructions tell the same story?",x ="Coefficient estimate with 95% CI",y =NULL,colour =NULL )```## Policy Discussion1. **Evidence of benefit:** Is the estimated average treatment effect positive?2. **Evidence of heterogeneity:** Are predicted CATEs meaningfully different across children?3. **Targeting ethics:** Even if some variables predict benefit, should they be used to allocate access?::: callout-importantPrediction is not policy. A targeting rule should be judged not only by estimated benefit, but also by transparency, feasibility, budget constraints, and equity.:::