Building a Regression Tree and Estimating Treatment Effect Using Honest Approach

In this analysis, we forego the implementation of honest cross-validation and focus solely on preparing the dataset for further statistical modeling. The initial step involves ensuring that your dataset, ideally structured with columns labeled ‘treatment’, ‘outcome’, and additional covariates such as ‘covar1’, ‘covar2’, etc., is properly loaded into the R environment. If your data resides in a CSV file, you should replace ‘your_data.csv’ with the actual file path or adapt the loading method accordingly to suit the format in which your data is stored. This preparation is critical as it forms the foundation for all subsequent analyses, ensuring that the data is appropriately structured and accessible for manipulation and analysis.

Data Generating Function

This section consider the Simulation exercise by Athey & Imbens(2016). The simulations compare outcomes for equal numbers of training and estimation samples (Ntr = Nest) set at either 500 or 1,000 observations. Particularly, we assess the trade-offs between sample size and bias reduction by comparing the Mean Squared Error of treatment effect (MSEτ) between adaptive and honest estimation methods. Specifically, we compute the ratio of MSEτ using adaptive estimation with $Ntr = 1,000$ to MSEτ using honest estimation with $Ntr = Nest = 500$. The evaluation of MSEτ employs a large test sample ($Nte = 8,000$) to minimize sampling variance.

Each design maintains a constant treatment probability of 0.5 and varies in the number of features (K), models for mean effects (η(x)) and treatment effects (κ(x)), and the specification of potential outcomes. Outcomes are defined as \[Y_i(w) = η(X_i) + \frac{1}{2} (2w - 1) κ(X_i) + e_i,\] where $e_i \sim N(0, 0.01)$ and the covariates $X_i$ are independent normal variables.

The designs are structured as follows:

Design 1 (K = 2):
- $\eta(x):$ $\frac{1}{2}(x_1 + x_2)$
- $\kappa(x):$ $\frac{1}{2} x_1$
Design 2 (K = 10):
- $\eta(x):$ $\frac{1}{2} \sum_{k=1}^2 x_k + \sum_{k=3}^6 x_k$
- $\kappa(x):$ $\sum_{k=1}^2 \mathbf{1}_{x_k > 0} \cdot x_k$
Design 3 (K = 20):
- $\eta(x):$ $\frac{1}{2} \sum_{k=1}^4 x_k + \sum_{k=5}^8 x_k$
- $\kappa(x):$ $\sum_{k=1}^4 \mathbf{1}_{x_k > 0} \cdot x_k$

In Designs 2 and 3, covariates affect the treatment effects and mean outcomes differently, demonstrating that varying covariate values influence the optimal splitting strategies within the models, particularly when considering negative values for optimization. This nuanced analysis highlights the complexity and importance of covariate selection in predictive modeling.

Now we consider only the first Data Generating Process.

# Initialize the seed to ensure reproducibility of the results
set.seed(1)

# Define the sizes for estimation, training, and testing datasets
N_est = 500  # Number of samples for estimation
N_tr = 500   # Number of samples for training
N_te = 8000  # Number of samples for testing
N = N_est + N_tr + N_te  # Total number of samples

# Number of covariates used in the model
K = 2

# Load the MASS library for multivariate distributions
library(MASS)

# Define the mean vector and covariance matrix for the multivariate normal distribution
mean_vec <- rep(0, K)  # Zero mean for simplicity
cov_mat <- diag(K)     # Identity matrix as covariance matrix for independence

# Generate multivariate normal covariates

x <- mvrnorm(n = N, mu = mean_vec, Sigma = cov_mat)

# Generate binary treatment variable w where w = 0 or 1 with equal probability
w = rbinom(n = N, size = 1, prob = 0.5)

# Generate random error terms with small variance
eps = rnorm(n = N, mean = 0, sd = 0.1)

# Calculate potential outcomes based on the covariates
eta = x[, 1]/2 + x[, 2]/2  # Linear combination of covariates
kappa = x[, 1]/2         # Interaction effect of the first covariate
y = eta + (2 * w - 1) * kappa/2 + eps  # Outcome equation with treatment effect

# Calculate the true treatment effect for simulation purposes
tau =  kappa
tau_eps = kappa + eps * 20

# Output the generated samples

# Define indices for each dataset
ind_est <- 1:N_est
ind_tr <- (N_est + 1):(N_tr + N_est)
ind_te <- (N_tr + N_est + 1):(N_tr + N_est + N_te)
ind_est_tr <- 1:(N_tr + N_est)

# If your data is stored in a data frame
# data <- read.csv('your_data.csv')
data = data.frame(y = y, w = w, x = x, tau = tau, tau_eps = tau_eps)
data_training_set = data[ind_tr, ]
data_est_set = data[ind_est, ]
data_te_set = data[ind_te, ]
data_est_tr_set = data[ind_est_tr, ]

print(dim(data_est_set))

## [1] 500   6

print(dim(data_te_set))

## [1] 8000    6

Tree building with regression trees

First we build a regression tree and prune the tree as exercise. We use the tree library for illustration. The tree buidling using the tree library can be done in a stata style, where you post the outcome and the dependent variables on.

Notice that the tree splitting criterion as used in the tree library is the typical MSE criteria used in the adaptive approach. So you should not include the treatment variable $w$ in your regression tree covariates.

In this exercise, we use the honest data split, so we train our tree only using the training dataset.

We also conduct some cross-validation exercise, using the cv.tree function from the tree package.

# Step 2: Fit the recursive partitioning model
library(tree)
model.tree <- tree(as.formula(paste("tau_eps ~ ", paste0("x.", 1:K, collapse = " + ")) ), data = data_training_set)

# Step 3: Determine the optimal tree complexity
# For example, using cross-validation
# The regular CV
model.tree.cv <- cv.tree(model.tree,,prune.tree)
plot (model.tree.cv$size , model.tree.cv$dev, type = "b")

# Step 5: Prune the tree to node = 5, fosimplicity
model.tree.prune <- prune.tree (model.tree , best = 5)
plot (model.tree.prune)
text (model.tree.prune, pretty = 0)

plot (model.tree)
text (model.tree, pretty = 0)

Estimate Treatment effect function

Now, construct the estimation for the treatment effect. To do so, we need to come up with the prediction of the tree leafs first. We use the function of predict(data, tree.structure, type='where') in order to do so.

The first step is to define the $\hat \tau$ function.

The tau_hat function calculates the treatment effect estimates for different terminal nodes in a decision tree structure. This function is particularly useful for analyzing the impact of different treatments applied to groups categorized based on certain criteria.

Parameters

data_set: A dataframe containing the data used for prediction and analysis. This data set must include a variable for the outcome and a binary variable w indicating treatment status (1 for treated, 0 for control).
tree.structure: An object representing a trained tree model. This tree model is used to predict which terminal node each observation in the dataset belongs to.

Process

Node Assignment:
- The function first uses the predict function with type="where" to determine the terminal node for each observation in the data_set. This information is stored in the new column where of the data_set.
Unique Terminal Nodes:
- It extracts the unique terminal nodes from the tree structure using tree.structure$where.
Treatment Effect Calculation:
- For each terminal node, the function subsets data_set based on the node assignment.
- It calculates the mean outcome y for the treated (w == 1) and control (w == 0) groups within each node.
- The treatment effect for each node is computed as the difference between the mean outcomes of the treated and control groups.

# Estimate treatment effects: write a function, that returns 
# the node list and the treatment effect
tau.hat <- function(data_est_set,model.tree.prune){
  data_est_set$where <- predict(model.tree.prune, data_est_set,type="where")
  # predict the node for the data_est
  terminal_nodes <- unique(model.tree.prune$where)
  treatment_effects <- sapply(terminal_nodes, function(node) {
  subset_data <- data_est_set[data_est_set$where == node, ]
  mean_treatment <- mean(subset_data$y[subset_data$w == 1])
  mean_control <- mean(subset_data$y[subset_data$w == 0])
  treatment_effect <- mean_treatment - mean_control
  return(treatment_effect)
  })
 return(list(terminal_nodes=terminal_nodes,treatment_effects=treatment_effects,model.tree=model.tree.prune)) 
}

# use the function
treatment_effects <- tau.hat(data_est_set,model.tree.prune)

Summarize the MSE

Then we define the estimation for the MSE function \[ \widehat{MSE}_\tau \left(S^{\text{te}}, S^{\text{tr}}, \Pi\right) \equiv -\frac{2}{N_{tr}} \sum_{i \in S^{\text{te}}} \hat{\tau} \left(X_i; S^{\text{te}}, \Pi\right) \cdot \hat{\tau}\left(X_i; S^{\text{tr}}, \Pi\right) \\ + \frac{1}{N_{tr}} \sum_{i \in S^{\text{te}}} \hat{\tau}^2 \left(X_i; S^{\text{tr}}, \Pi\right). \]

But we have a very large dataset in the test dataset ($N^{tr} = 8000$), therefore the estimation of $\hat{\tau}$ using the test dataset will be very close to that of the true value of $\tau$.

report.mse.test <- function(data_te_set,treatment_effects){
  data_te_set$where <- predict(treatment_effects$model.tree, data_te_set,type="where")
  # Create a new column 'tau_hat' in the dataframe
  data_te_set$tau_hat_tr <- 0
  data_te_set$tau_hat_te <- 0
  
  # Map treatment effects to the corresponding rows based on terminal nodes
  for (i in 1:length(treatment_effects$terminal_nodes)) {
    node <- treatment_effects$terminal_nodes[i]
    effect <- treatment_effects$treatment_effects[i]
    
    data_te_set$tau_hat_tr[data_te_set$where == node] <- effect
    data_te_set$tau_hat_te[data_te_set$where == node] <- mean( data_te_set$y[data_te_set$where == node] )
  }
  
  MSE = mean( (data_te_set$tau_hat_tr)^2) - 2 * mean(data_te_set$tau_hat_te*data_te_set$tau_hat_tr ) 
  return(MSE)
}

# Output results
cat("Estimated treatment effects within each terminal node:\n")

## Estimated treatment effects within each terminal node:

MSE <- report.mse.test(data_te_set,treatment_effects)
cat("\nSummary statistics of MSE:", MSE)

## 
## Summary statistics of MSE: -0.1890302

# Now estimate the MSE if we use pooled training + est to estimate the tree and treatment effect
model.tree.2 <- tree(as.formula(paste("tau ~ ", paste0("x.", 1:K, collapse = " + "))), data = data_est_tr_set)
model.tree.prune.2 <- prune.tree (model.tree.2 , best = 5)
treatment_effects.2 <- tau.hat(data_est_tr_set, model.tree.prune.2)
MSE.2 <- report.mse.test(data_te_set,treatment_effects.2)

cat("\nSummary statistics of MSE use same data for tree splitting and treatment estimation:", MSE.2)

## 
## Summary statistics of MSE use same data for tree splitting and treatment estimation: -0.2195343

Pack your tree building function of $\hat\pi$

Ideally, this $\hat\pi$ should be a function, that has certain decision rule. Once you give the sample a certain in sampledataset( $S^{tr}$ or $S^{tr-est}$, it should return a tree structure.

One simplest decision rule, is that we build the tree, using the in-sample goodness of fit \[\hat\Pi = \arg \min_{\Pi} \hat{MSE}(S^{tr}, S^{tr}, \Pi). \]

The example in doing so is is as specified. It takes a dataset, build a regression tree based on the data set, and eventually return a tree structure.

pi.simple <- function(data_set, best.node = 5){
  # Step 1: collect the estimators
  predictors <- names(datasets$training_set)[grepl("^x.[0-9]+$", names(data_set))]

  #Step 2: Construct the formula string: you should modify this!
  formula_str <- paste("tau ~", paste(predictors, collapse = " + "))

  #Step 3: Fit the model  
  model.tree <- tree(as.formula(formula_str), data = data_set)
  # Lets pick the leaf of 5
  model.tree.prune <- prune.tree(model.tree , best = best.node)
  return(model.tree.prune)
}