This tutorial discusses and implements different methods to estimate heterogeneous treatment effects (HTE) in R: OLS with interaction terms, post-selection LASSO, honest causal trees, and causal forests.
We will compare the heterogeneity captured by each of these methods, and then compare the causal models by their mean squared error (MSE).
First, let’s define our directory and install required packages:
# Directory
setwd('~/Google Drive/PhD Insper/Thesis/Paper 3/Empirics/Tutorials/HTE Randomized')
# Packages
library(glmnet) # LASSO
library(rpart)
library(rpart.plot)
library(randomForest) # Random Forest
library(devtools) # GitHub installation
library(tidyverse)
library(ggplot2)
library(dplyr)
library(grf) # Generalized Random Forests
#install_github('susanathey/causalTree')
library(causalTree) # Causal Tree
#install_github('swager/randomForestCI')
library(randomForestCI)
# install_github('swager/balanceHD')
library(balanceHD)
library(SuperLearner)
# install.packages("caret")
library(caret)
# install.packages("xgboost")
library(xgboost)
We use data from the article “Social Pressure and Voter Turnout: Evidence from a Large-Scale Field Experiment” by Gerber, Green and Larimer (2008). In a randomized experiment, the article studies the effect of letters encouraging voters to vote on voter turnout. The paper estimated an average increase of 8 p.p. in turnout.
The data are split into our set of variables \(Z = \{Y_i,X_i,W_i\}\), where \(Y\) is the outcome variable (voted or not), \(X\) is the treatment variable (received a letter or not) and \(W\) are the covariates.
# Load data
my_data <- readRDS('social_voting.rds')
#Restrict the sample size
n_obs <- 33000 # Change this number depending on the speed of your computer. 6000 is also fine.
my_data <- my_data[sample(nrow(my_data), n_obs), ]
# Split data into 2 samples (train and test)
folds = createFolds(1:nrow(my_data), k=2)
Y_train <- my_data[folds[[1]],1]
Y_test <- my_data[folds[[2]],1]
X_train <- my_data[folds[[1]],2]
X_test <- my_data[folds[[2]],2]
W_train <- my_data[folds[[1]],3:ncol(my_data)]
W_test <- my_data[folds[[2]],3:ncol(my_data)]
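As a quick sanity check (a small aside of my own, not part of the original analysis), the treatment share should be similar across the two folds, since assignment was randomized:
# Treatment shares in the two folds should be close under randomization
c(train = mean(X_train), test = mean(X_test))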
### Creates a vector of 0s and a vector of 1s of length n (hack for later usage)
zeros <- function(n) {
return(integer(n))
}
ones <- function(n) {
return(integer(n)+1)
}
We first estimate our CATE using standard OLS. Let's start with a simple OLS with interaction terms. This is a simple way to estimate differential effects of \(X\) on \(Y\): we only need to include interaction terms in an OLS regression. The algorithm is: (i) regress \(Y\) on \(X\), \(W\) and the interactions \(W \times X\); (ii) predict the outcome for everyone with \(X\) set to 1 and again with \(X\) set to 0; (iii) take the difference between the two predictions as the CATE.
Thus, we model an OLS with interactions as follows:
\[ Y = \beta_0 + \beta_1 X + \beta_2 W + \beta_3 (W \times X) + \varepsilon \] Where \(X\) is the treatment vector and \(W\) the covariates.
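For intuition, this is just an interacted regression. Here is a minimal base-R sketch (illustrative only; below we fit the same model through SuperLearner, and the name ols_base is mine):
# Interacted OLS in base R: Y ~ X * . expands to Y ~ X + W + X:W for every covariate in W
ols_base <- lm(Y ~ X * ., data = data.frame(Y = Y_train, X = X_train, W_train))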
We will use the R package SuperLearner to implement some of our ML algorithms:
# Estimate a linear model algorithm
sl_lm = SuperLearner(Y = Y_train,
X = data.frame(X=X_train, W_train , W_train*X_train),
family = binomial(), # Distribution of errors
SL.library = "SL.lm", # Linear model
cvControl = list(V=0)) # V = 0: skip internal cross-validation (single learner)
summary(sl_lm$fitLibrary$SL.lm_All$object)
##
## Call:
## stats::lm(formula = Y ~ ., data = X, weights = obsWeights, model = model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8128 -0.3359 -0.2130 0.5319 1.0120
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.126735 0.013235 9.576 < 2e-16 ***
## X 0.069671 0.032219 2.162 0.030600 *
## g2000 -0.008606 0.012438 -0.692 0.488973
## g2002 0.067405 0.011408 5.909 3.52e-09 ***
## p2000 0.089134 0.008906 10.008 < 2e-16 ***
## p2002 0.118349 0.008117 14.580 < 2e-16 ***
## p2004 0.141986 0.007945 17.872 < 2e-16 ***
## sex -0.003755 0.003836 -0.979 0.327670
## yob -0.030279 0.004269 -7.092 1.37e-12 ***
## city 0.045528 0.004182 10.887 < 2e-16 ***
## hh_size 0.012088 0.004092 2.954 0.003143 **
## totalpopulation_estimate 0.012005 0.005001 2.401 0.016372 *
## percent_male -0.005445 0.004680 -1.163 0.244710
## median_age -0.008889 0.009230 -0.963 0.335501
## percent_62yearsandover 0.019804 0.009216 2.149 0.031660 *
## percent_white 0.073592 0.022403 3.285 0.001022 **
## percent_black 0.053259 0.016463 3.235 0.001219 **
## percent_asian 0.029786 0.011591 2.570 0.010188 *
## median_income 0.041025 0.008821 4.651 3.33e-06 ***
## employ_20to64 -0.017455 0.005472 -3.190 0.001425 **
## highschool 0.033956 0.013895 2.444 0.014543 *
## bach_orhigher -0.005137 0.015210 -0.338 0.735545
## percent_hispanicorlatino 0.015011 0.006459 2.324 0.020141 *
## g2000.1 -0.058461 0.030338 -1.927 0.054000 .
## g2002.1 0.047490 0.027288 1.740 0.081826 .
## p2000.1 -0.004109 0.021583 -0.190 0.849005
## p2002.1 0.027793 0.019593 1.419 0.156056
## p2004.1 0.027862 0.019377 1.438 0.150488
## sex.1 -0.006998 0.009350 -0.748 0.454206
## yob.1 0.005543 0.010267 0.540 0.589287
## city.1 -0.003481 0.009911 -0.351 0.725407
## hh_size.1 -0.015547 0.009794 -1.587 0.112441
## totalpopulation_estimate.1 0.024643 0.011959 2.061 0.039364 *
## percent_male.1 0.021209 0.011000 1.928 0.053868 .
## median_age.1 -0.007161 0.023018 -0.311 0.755724
## percent_62yearsandover.1 0.006990 0.022414 0.312 0.755160
## percent_white.1 -0.130322 0.049613 -2.627 0.008628 **
## percent_black.1 -0.098030 0.036552 -2.682 0.007327 **
## percent_asian.1 -0.027219 0.026129 -1.042 0.297553
## median_income.1 -0.011259 0.021521 -0.523 0.600885
## employ_20to64.1 0.017057 0.013017 1.310 0.190101
## highschool.1 -0.019287 0.033646 -0.573 0.566484
## bach_orhigher.1 -0.035951 0.037331 -0.963 0.335541
## percent_hispanicorlatino.1 -0.053741 0.015465 -3.475 0.000512 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4482 on 16456 degrees of freedom
## Multiple R-squared: 0.07874, Adjusted R-squared: 0.07634
## F-statistic: 32.71 on 43 and 16456 DF, p-value: < 2.2e-16
Note that the treatment \(X\) itself has a positive, statistically significant effect of about 0.07 on voter turnout. We also observe statistically significant heterogeneous effects in several interaction terms, e.g. with totalpopulation_estimate, percent_white, percent_black and percent_hispanicorlatino. It seems that the effect is heterogeneous across some of our covariates. Moreover, many features are relevant in the linear model, which reinforces the need to estimate the CATE instead of only the ATE. That is why we take one step further.
We can simply predict our outcome for both treated and non-treated groups in order to estimate the CATE:
\[ CATE = \hat{\tau} = E(Y|X = 1,W) - E(Y|X=0,W) \] For this, we use the predict function in our sl_lm model for each group, using our test sample:
# Prediction on control group (X = 0)
ols_pred_control <- predict(sl_lm, data.frame(X = zeros(nrow(W_test)), W_test, W_test*zeros(nrow(W_test))), onlySL = T)
# Prediction on treated group (X = 1)
ols_pred_treated <- predict(sl_lm, data.frame(X = ones(nrow(W_test)), W_test, W_test*ones(nrow(W_test))), onlySL = T)
# Calculate CATE
cate_ols <- ols_pred_treated$pred - ols_pred_control$pred
plot(cate_ols)
# Calculate ATE
mean(cate_ols)
## [1] 0.08088858
plot(mean(cate_ols))
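As a sanity check (another aside of mine), the mean of these CATEs should be close to the simple difference in means on the test sample, and to the roughly 8 p.p. reported in the paper:
# Difference-in-means estimate of the ATE on the test sample
mean(Y_test[X_test == 1]) - mean(Y_test[X_test == 0])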
Now we use the LASSO to estimate heterogeneous effects. Before estimating the CATE, however, we use it as a screening algorithm: to reduce the number of variables, we let the LASSO select the relevant ones. We will use the SuperLearner library again:
# Defining LASSO
lasso = create.Learner("SL.glmnet",
params = list(alpha = 1),
name_prefix = "lasso")
# Getting coefficients by LASSO
get_lasso_coeffs <- function(sl_lasso) {
return(coef(sl_lasso$fitLibrary$lasso_1_All$object, s = "lambda.min")[-1,])
}
SL.library <- lasso$names
predict_y_lasso <- SuperLearner(Y = Y_train,
X = data.frame(X = X_train, W_train ,W_train*X_train),
family = binomial(),
SL.library = SL.library,
cvControl = list(V=0))
kept_variables <- which(get_lasso_coeffs(predict_y_lasso)!=0)
predict_x_lasso <- SuperLearner(Y = X_train,
X = data.frame(W_train),
family = binomial(),
SL.library = lasso$names,
cvControl = list(V=0))
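# Note: the +1 below shifts the indices by one because the treatment X occupies the first column of the interacted design matrix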
kept_variables2 <- which(get_lasso_coeffs(predict_x_lasso)!=0) + 1
After the LASSO has selected the variables, we can use OLS to estimate treatment heterogeneity in the relevant variables chosen by the algorithm above. Let's fit our post-selection OLS:
sl_post_lasso <- SuperLearner(Y = Y_train,
X = data.frame(X = X_train, W_train, W_train*X_train)[,c(kept_variables,kept_variables2)],
family = binomial(),
SL.library = "SL.lm",
cvControl = list(V=0))
summary(sl_post_lasso$fitLibrary$SL.lm_All$object)
##
## Call:
## stats::lm(formula = Y ~ ., data = X, weights = obsWeights, model = model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8032 -0.3371 -0.2229 0.5447 0.9717
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.130403 0.009609 13.571 < 2e-16 ***
## g2002 0.059027 0.010267 5.749 9.14e-09 ***
## p2000 0.086575 0.008034 10.776 < 2e-16 ***
## p2002 0.117455 0.008065 14.564 < 2e-16 ***
## p2004 0.136567 0.007802 17.504 < 2e-16 ***
## yob -0.026107 0.003696 -7.063 1.70e-12 ***
## city 0.043216 0.003580 12.072 < 2e-16 ***
## employ_20to64 -0.016205 0.003607 -4.492 7.10e-06 ***
## g2002.1 0.061280 0.015613 3.925 8.71e-05 ***
## p2002.1 0.032079 0.019158 1.674 0.0941 .
## p2004.1 0.028270 0.018047 1.566 0.1173
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4492 on 16489 degrees of freedom
## Multiple R-squared: 0.07296, Adjusted R-squared: 0.0724
## F-statistic: 129.8 on 10 and 16489 DF, p-value: < 2.2e-16
Now, after the LASSO has selected our most relevant variables, we find only weak evidence of heterogeneity: among the interaction terms, only the g2002 interaction is clearly significant, with the p2002 interaction marginal. We also have a set of variables other than the treatment that are statistically significant.
What is remaining is our estimation of CATE for post-selection LASSO. We can code it as the following, using the test sample:
# Prediction on control group (X = 0)
postlasso_pred_control <- predict(sl_post_lasso, data.frame(X = zeros(nrow(W_test)), W_test, W_test*zeros(nrow(W_test)))[,c(kept_variables,kept_variables2)], onlySL = T)
# Prediction on treated group (X = 1)
postlasso_pred_treated <- predict(sl_post_lasso, data.frame(X = ones(nrow(W_test)), W_test, W_test*ones(nrow(W_test)))[,c(kept_variables,kept_variables2)], onlySL = T)
# Estimating CATE with post-selection LASSO
cate_postlasso <- postlasso_pred_treated$pred - postlasso_pred_control$pred
# Plot cate_postlasso
plot(cate_postlasso)
# ATE
mean(cate_postlasso)
## [1] 0.07634882
# Plot ATE
plot(mean(cate_postlasso))
Now we are going to predict the CATE with tree-based algorithms. There are a few packages that do this, but I like Susan Athey's causalTree and grf. The first is a general algorithm that builds a regression model and returns an object, implementing ideas from CART (Classification and Regression Trees) by Breiman et al. The second implements the generalized random forest algorithm for causal forests, which uses a tree-based splitting rule that divides the covariate space by the heterogeneity of treatment effects and, moreover, relies on honesty as one of its main assumptions.
In summary, we want to model
\[ Y_i = \mu(W_i) + \tau(W_i) X_i + \varepsilon_i \] where \(W_i\) are the covariates that may bring heterogeneity to the effect of the treatment \(X_i\) on the outcome \(Y_i\). The algorithm finds
\[ \hat{\tau}(w) = \operatorname{argmin}_{\tau} \sum_i \alpha_i(w) (Y_i - \tau X_i)^2 \] where the weights \(\alpha_i(w)\), which measure how similar observation \(i\) is to the target point \(w\), are estimated by a random forest algorithm. Let's begin by building our causal tree:
# Writing the regression formula (to make later calls easier)
tree_fml <- as.formula(paste("Y", paste(names(W_train), collapse = ' + '), sep = " ~ "))
# Building causal tree
causal_tree <- causalTree(formula = tree_fml,
data = data.frame(Y = Y_train,W_train),
treatment = X_train,
split.Rule = "CT", # Causal Tree
split.Honest = FALSE, # So far, we are not assuming honesty
split.alpha = 1,
cv.option = "CT",
cv.Honest = FALSE,
split.Bucket = TRUE,
bucketNum = 5,
bucketMax = 100,
minsize = 250 # Number of obs in treatment and control on leaf
)
## [1] 6
## [1] "CTD"
rpart.plot(causal_tree, roundint = FALSE)
Honest trees are obtained by using one part of the sample (train) to build the leaves and another part (test) to estimate the treatment effects within them. The function honest.causalTree does the job:
honest_tree <- honest.causalTree(formula = tree_fml,
data = data.frame(Y=Y_train, W_train),
treatment = X_train,
est_data = data.frame(Y=Y_test, W_test),
est_treatment = X_test,
split.alpha = 0.5,
split.Rule = "CT",
split.Honest = TRUE,
cv.alpha = 0.5,
cv.option = "CT",
cv.Honest = TRUE,
split.Bucket = TRUE,
bucketNum = 5,
bucketMax = 100, # maximum number of buckets
minsize = 250) # number of observations in treatment and control on leaf
## [1] 6
## [1] "CTD"
rpart.plot(honest_tree, roundint = F)
Then, we prune the tree with cross validation, choosing the simplest tree that minimizes the objective function in a left-out sample:
opcpid <- which.min(honest_tree$cp[, 4])
opcp <- honest_tree$cp[opcpid, 1]
honest_tree_prune <- prune(honest_tree, cp = opcp)
rpart.plot(honest_tree_prune, roundint = F)
That is, we have split our sample so as to maximize the heterogeneity in each node. That is how these algorithms work: by splitting into sub-samples according to heterogeneity, we can estimate the treatment effect more accurately.
To estimate the standard errors on the leaves, we can use OLS. The regression is specified such that the coefficients on the leaf-treatment interactions are the treatment effects.
# Constructing factors variables for the leaves
leaf_test <- as.factor(round(predict(honest_tree_prune,
newdata = data.frame(Y = Y_test, W_test),
type = "vector"), 4))
# Run an OLS that estimate the treatment effect magnitudes and standard errors
honest_ols_test <- lm(Y ~ leaf + X * leaf - X -1, data = data.frame(Y = Y_test, X = X_test, leaf = leaf_test, W_test))
summary(honest_ols_test)
##
## Call:
## lm(formula = Y ~ leaf + X * leaf - X - 1, data = data.frame(Y = Y_test,
## X = X_test, leaf = leaf_test, W_test))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4891 -0.3558 -0.2580 0.6314 0.7952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## leaf0.0114 0.258019 0.012300 20.977 < 2e-16 ***
## leaf0.0512 0.355807 0.010221 34.813 < 2e-16 ***
## leaf0.0624 0.294144 0.009778 30.081 < 2e-16 ***
## leaf0.0704 0.315320 0.010874 28.996 < 2e-16 ***
## leaf0.0707 0.223160 0.012759 17.491 < 2e-16 ***
## leaf0.099 0.204778 0.012037 17.012 < 2e-16 ***
## leaf0.1026 0.368602 0.009555 38.577 < 2e-16 ***
## leaf0.1316 0.357551 0.013163 27.162 < 2e-16 ***
## leaf0.0114:X 0.011354 0.030571 0.371 0.71034
## leaf0.0512:X 0.051228 0.025254 2.028 0.04253 *
## leaf0.0624:X 0.062358 0.023907 2.608 0.00911 **
## leaf0.0704:X 0.070394 0.026921 2.615 0.00893 **
## leaf0.0707:X 0.070718 0.032081 2.204 0.02751 *
## leaf0.099:X 0.098976 0.029485 3.357 0.00079 ***
## leaf0.1026:X 0.102637 0.023684 4.334 1.48e-05 ***
## leaf0.1316:X 0.131579 0.030698 4.286 1.83e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4607 on 16484 degrees of freedom
## Multiple R-squared: 0.3307, Adjusted R-squared: 0.3301
## F-statistic: 509.1 on 16 and 16484 DF, p-value: < 2.2e-16
That is, only the treatment effect in the first leaf (leaf0.0114:X) is not statistically significant. All the other leaf-level heterogeneous effects are relevant in our honest tree.
But we still need to predict our CATE from our tree. This is easily done by the function predict:
# Estimate CATE
cate_honesttree <- predict(honest_tree_prune, newdata = data.frame(Y = Y_test, W_test), type = "vector")
# Plot CATE
plot(cate_honesttree)
# ATE
mean(cate_honesttree)
## [1] 0.0743514
# Plot ATE
plot(mean(cate_honesttree))
Finally, we estimate the CATE with the causal forest algorithm. The method is similar to the R-learner, but with a tree-based splitting procedure. It takes a residual-on-residual approach, orthogonalizing the outcome and the treatment with estimates of the conditional mean outcome and the propensity score, in order to find the conditional average treatment effect. For more information, check Jacob (2021).
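To make the residual-on-residual idea concrete, here is a sketch of the orthogonalization that causal_forest() performs internally; the nuisance estimates can also be computed by hand with regression forests and passed in through the Y.hat and W.hat arguments (the names Y_forest, X_forest and cf_manual are mine):
# Sketch: estimate the nuisance functions E[Y|W] and E[X|W] with regression forests
Y_forest <- regression_forest(W_train, Y_train)
Y_hat <- predict(Y_forest)$predictions # out-of-bag estimates of E[Y|W]
X_forest <- regression_forest(W_train, X_train)
X_hat <- predict(X_forest)$predictions # out-of-bag propensity estimates E[X|W]
# Equivalent to letting causal_forest() compute these internally:
cf_manual <- causal_forest(X = W_train, Y = Y_train, W = X_train, Y.hat = Y_hat, W.hat = X_hat)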
Let's do this in two ways. The first uses the grf package, which estimates the CATE directly in the function. The second uses the causalForest function from the causalTree package, estimating a random forest of honest causal trees.
The grf algorithm relies on honesty as a main assumption. We can easily accommodate this by fitting the causal forest on our training sample and predicting on our test sample. Estimating our causal forest is simple:
# Train our causal forest
cf_grf <- causal_forest(X = W_train,
Y = Y_train,
W = X_train)
# Get predictions from test sample
effects_cf_grf <- predict(cf_grf, W_test)
# Extract the vector of estimated effects
effects_cf_grf_pred <- effects_cf_grf$predictions
# Histogram
hist(effects_cf_grf_pred)
# Plot of treatment effects against the first covariate
plot(W_test[, 1], effects_cf_grf_pred, xlab = names(W_test)[1], ylab = "tau")
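It is also informative to check which covariates the forest splits on most often; grf provides variable_importance() for this (an extra check, not in the original tutorial; the name vi is mine):
# Covariates ranked by how often the forest split on them
vi <- variable_importance(cf_grf)
head(data.frame(variable = colnames(W_train), importance = round(vi, 3))[order(vi, decreasing = TRUE), ])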
# Estimate the average treatment effect (ATE) over the full sample
cate_grf <- average_treatment_effect(cf_grf, target.sample = "all")
cate_grf
## estimate std.err
## 0.07716841 0.00963298
# Estimate the average treatment effect on the treated (ATT)
average_treatment_effect(cf_grf, target.sample = "treated")
## estimate std.err
## 0.078843577 0.009610122
# Best Linear Projection of the CATE
cate_best_grf <- best_linear_projection(cf_grf)
cate_best_grf
##
## Best linear projection of the conditional average treatment effect.
## Confidence intervals are cluster- and heteroskedasticity-robust (HC3):
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.077163 0.009747 7.9166 2.594e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
That is, our estimated average treatment effect is around 0.077, both when we use the average_treatment_effect function and when we use best_linear_projection. Let's see whether this holds without the grf package.
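grf also offers an omnibus check for whether the forest has detected real heterogeneity (a complementary check on my part, not in the original): test_calibration() regresses the outcome on the mean and the differential forest predictions, and a significant coefficient near 1 on the differential prediction indicates well-calibrated heterogeneity.
# Omnibus test: a significant 'differential.forest.prediction' coefficient suggests genuine heterogeneity
test_calibration(cf_grf)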
The same estimation can be obtained by the function causalForest inside the causalTree package:
cf_causalTree <- causalForest(tree_fml,
data=data.frame(Y=Y_train, W_train),
treatment=X_train,
split.Rule="CT",
split.Honest=T,
split.Bucket=T,
bucketNum = 5,
bucketMax = 100,
cv.option="CT",
cv.Honest=T,
minsize = 2,
split.alpha = 0.5,
cv.alpha = 0.5,
sample.size.total = floor(length(Y_train) / 2), # Y_train is a vector, so use length(), not nrow()
sample.size.train.frac = .5,
mtry = ceiling(ncol(W_train)/3),
nodesize = 5,
num.trees = 10,
ncov_sample = ncol(W_train),
ncolx = ncol(W_train))
## [1] "Building trees ..."
## [1] "Tree 1"
## [1] 6
## [1] "CTD"
## [1] "Tree 2"
## [1] 6
## [1] "CTD"
## [1] "Tree 3"
## [1] 6
## [1] "CTD"
## [1] "Tree 4"
## [1] 6
## [1] "CTD"
## [1] "Tree 5"
## [1] 6
## [1] "CTD"
## [1] "Tree 6"
## [1] 6
## [1] "CTD"
## [1] "Tree 7"
## [1] 6
## [1] "CTD"
## [1] "Tree 8"
## [1] 6
## [1] "CTD"
## [1] "Tree 9"
## [1] 6
## [1] "CTD"
## [1] "Tree 10"
## [1] 6
## [1] "CTD"
And, to estimate the CATE:
cate_causalTree <- predict(cf_causalTree, newdata = data.frame(Y = Y_test, W_test), type = "vector")
## [1] 16500 10
Finally, we can compare all of our models (OLS with interaction terms, post-selection LASSO, honest causal tree and the two causal forests). Let's first look at some histograms:
# Creating a data frame with all CATEs
het_effects <- data.frame(ols = cate_ols,
post_selec_lasso = cate_postlasso,
causal_tree = cate_honesttree,
causal_forest_grf = effects_cf_grf_pred,
causal_forest_causalTree = cate_causalTree)
# Set range of x-axis
xrange <- range(c(het_effects[,1],het_effects[,2],het_effects[,3],het_effects[,4],het_effects[,5]))
# Set plotting layout (one row, five columns)
par(mfrow = c(1,5))
hist(het_effects[, 1], main = "OLS", xlim = xrange)
hist(het_effects[, 2], main = "Post-selection Lasso", xlim = xrange)
hist(het_effects[, 3], main = "Honest Causal tree", xlim = xrange)
hist(het_effects[, 4], main = "GRF Causal forest", xlim = xrange)
hist(het_effects[, 5], main = "causalTree Causal forest", xlim = xrange)
And to finalize, we can summarize the results in a table with each of the CATEs:
summary_stats <- do.call(data.frame,
list(mean = apply(het_effects, 2, mean),
sd = apply(het_effects, 2, sd),
median = apply(het_effects, 2, median),
min = apply(het_effects, 2, min),
max = apply(het_effects, 2, max)))
summary_stats
From the histograms and the summary statistics, it seems that the causal forest from the grf package yields the most heterogeneity, since it has the highest standard deviation among the estimated treatment effects.
We can also compare the methods by looking at the mean squared error (MSE) on a test set, using the transformed outcome (which I call \(Y^*\)) as a proxy for the true treatment effect. Because the experiment is randomized, \(E[Y^* \mid W] = \tau(W)\), so \(Y^*\) is an unbiased (if noisy) proxy for the individual treatment effect. For this, we need the propensity score (\(P(X = 1)\)) of our treatment in the test sample. Let's code it:
# Construct propensity score from randomized experiment
prop_score <- mean(X_test)
# Construct Y_star in our test sample
Y_star <- X_test * (Y_test / prop_score) - (1 - X_test) * (Y_test / (1 - prop_score))
## MSEs
# OLS with interaction
MSE_ols <- mean((Y_star - cate_ols)^2)
# Post-selection LASSO
MSE_lasso <- mean((Y_star - cate_postlasso)^2)
# Honest Tree
MSE_causalTree <- mean((Y_star - cate_honesttree)^2)
# Causal Forest GRF (compare the individual predictions, not the scalar ATE)
MSE_cf_grf <- mean((Y_star - effects_cf_grf_pred)^2)
# Causal Forest causalTree
MSE_cf_causalTree <- mean((Y_star - cate_causalTree)^2)
# Create data frame with all MSEs
performance_MSE <- data.frame(matrix(rep(NA,1), nrow = 1, ncol = 1))
rownames(performance_MSE) <- c("OLS")
colnames(performance_MSE) <- c("MSE")
# Load in results
performance_MSE["OLS","MSE"] <- MSE_ols
performance_MSE["Post-selection LASSO","MSE"] <- MSE_lasso
performance_MSE["Honest Tree","MSE"] <- MSE_causalTree
performance_MSE["Causal Forest GRF","MSE"] <- MSE_cf_grf
performance_MSE["Causal Forest causalTree","MSE"] <- MSE_cf_causalTree
# Setting range
xrange2 <- range(performance_MSE$MSE - 2*sd(performance_MSE$MSE),
performance_MSE$MSE,
performance_MSE$MSE + 2*sd(performance_MSE$MSE))
# Create plot
MSEplot <- ggplot(performance_MSE) +
geom_bar(mapping = aes(x = factor(rownames(performance_MSE),
levels = rownames(performance_MSE)),
y = MSE),
stat = "identity", fill = "gray44", width=0.7,
position = position_dodge(width=0.2)) +
theme_bw() +
coord_cartesian(ylim=c(xrange2[1], xrange2[2])) +
theme(axis.ticks.x = element_blank(), axis.title.x = element_blank(),
panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
plot.background = element_blank(),
axis.text.x = element_text(angle = -90, hjust = 0, vjust = 0.5)) +
ylab("MSE out-of-sample") +
ggtitle("Comparing performance based on MSE") +
theme(plot.title = element_text(hjust = 0.5, face ="bold",
colour = "black", size = 14))
# Plot
MSEplot
From the MSE analysis, it seems that both causal forest models perform worse than the simpler models (OLS with interactions, post-selection LASSO and the honest tree). One explanation could be that, since there is little heterogeneity in this application, the simpler models do best.
https://cran.r-project.org/web/packages/SuperLearner/vignettes/Guide-to-SuperLearner.html
https://gsbdbi.github.io/ml_tutorial/
https://ml-in-econ.appspot.com
https://github.com/QuantLet/Meta_learner-for-Causal-ML/blob/main/GRF/Causal-Forest.R
https://grf-labs.github.io/grf/
https://www.markhw.com/blog/causalforestintro
https://lost-stats.github.io/Machine_Learning/causal_forest.html