RFA package

The repository for this project can be accessed here. Check out my website to learn more about me and my research.
Random Forest Adjustment (RFA) is a procedure for estimating the relationship between some predictor and a response after partialing out variation in each given a set of covariates via random forests. Below I provide a brief explanation of the technique and demonstrate how it is implemented in R.
RFA is a flexible approach to regression adjustment for controlling for covariates in observational settings. By leveraging a nonparametric machine learning algorithm such as random forests, RFA sidesteps the incidental functional form assumptions imposed by the standard multiple linear regression approach to covariate adjustment. This makes RFA robust to more complex (nonlinear or nonadditive) forms of confounding relationships in observational data that a parametric approach may fail to account for. The latest version of the package can accommodate fixed and random effects as well.
RFA works by partialing out variation in an outcome \(y_i\) and some causal variable of interest \(z_i\) as a function of covariates \(x_{ip} \in X_i\), where \(y_i, z_i \not\!\perp\!\!\!\perp X_i\). The variables \(y_i\) and \(z_i\) denote the response and predictor values for the \(i^\text{th}\) observation, where \(i \in I = \{1, 2, ..., n\}\).
Because \(z_i\) and \(y_i\) are not independent of \(X_i\), the estimated slope coefficient (\(\alpha_1\)) from the following linear model may not reflect the true effect of \(z_i\) on \(y_i\): \[y_i = \alpha_0 + \alpha_1z_i + \epsilon_i. \tag{1}\] RFA’s solution to this problem is to use random forests to pre-process the explanatory variable and the response, partialing out the variation in \(y_i\) and \(z_i\) explained by \(X_i\) prior to estimating the relationship between the two. This is done via the following steps:
\[ \begin{aligned} \hat{y}_i & = \hat{f}_y(X_i), \\ \hat{y}_i^\varepsilon & = y_i - \hat{y}_i. \end{aligned} \tag{2} \]
\[ \begin{aligned} \hat{z}_i & = \hat{f}_z(X_i), \\ \hat{z}_i^\varepsilon & = z_i - \hat{z}_i. \end{aligned} \tag{3} \]
\[\hat{y}_i^\varepsilon = \beta_0 + \beta_1\hat{z}_i^\varepsilon + \mu_i.\tag{4}\]
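To make these steps concrete, here is a minimal hand-rolled sketch of equations (2) through (4) using the {ranger} package. This is only an illustration of the logic, not the {RFA} package’s internal code; in particular, the objects y, z, and X are hypothetical, and using out-of-bag predictions to form the residuals is an assumption on my part.

library(ranger)

# Step (2): residualize the response on the covariates with a random forest
# (y = response vector, X = data frame of covariates)
y_rf  <- ranger(y = y, x = X)
y_res <- y - y_rf$predictions   # out-of-bag predictions

# Step (3): residualize the explanatory variable of interest the same way
z_rf  <- ranger(y = z, x = X)
z_res <- z - z_rf$predictions

# Step (4): regress the residualized response on the residualized predictor
lm(y_res ~ z_res)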
If desired, fixed or random effects are incorporated by conducting an additional pre-processing step where the response, the causal variable, and each of the covariates are demeaned via fixed effect, random effect, or mixed effect regressions prior to random forest adjustment.
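As a rough sketch of what that pre-processing might look like in the fixed effects case (again my own reconstruction rather than the package’s actual code, with hypothetical objects y, z, X, and a grouping variable group):

# Demean each variable on the group dummies before the random forest step
demean_on <- function(v, group) residuals(lm(v ~ factor(group)))

y_dm <- demean_on(y, group)                                 # demeaned response
z_dm <- demean_on(z, group)                                 # demeaned causal variable
X_dm <- as.data.frame(lapply(X, demean_on, group = group))  # demeaned covariates
# ...then apply steps (2)-(4) above to y_dm, z_dm, and X_dm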
The advantage of RFA is that it sidesteps incidental functional form assumptions that the more conventional multiple regression adjustment strategy imposes. This can be easily demonstrated with a simple simulation.
For the simulation:
\[y_i = 1 + 5z_i + 0.5x_i + x_i^2 + e_i, \quad e_i \sim \mathcal{N}(0, \sigma = 10),\]
\[z_i = -1 + 0.05\,\text{stand}(x_i) - 0.1\,\text{stand}(x_i)^2 + u_i, \quad u_i \sim \mathcal{N}(0, \sigma = 2),\]
where \(\text{stand}(\cdot)\) standardizes a variable to mean zero and unit variance.
For each iteration of the simulated data-generating process, I recover estimates of the effect of \(z_i\) on \(y_i\) via five approaches:

- Naive OLS: \[y_i = \beta_0 + \beta_1z_i + \epsilon_i;\]
- OLS with a control for \(x_i\): \[y_i = \beta_0 + \beta_1z_i + \beta_2x_i + \epsilon_i;\]
- OLS with an interaction: \[y_i = \beta_0 + \beta_1z_i + \beta_2x_i + \beta_3z_i\cdot(x_i - \bar{x}) + \epsilon_i;\]
- adjustment via a Support Vector Machine (SVM), an alternative machine learner to random forests;
- adjustment via random forests (RFA).
Because both the SVM and random forest adjustments take a while to run, I restricted the analysis to 100 iterations.
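For reference, a single iteration of this data-generating process might look like the following. The write-up doesn’t state the per-iteration sample size or the distribution of \(x_i\), so the choice of n = 500 and the standard normal draw for x here are arbitrary.

# One draw from the simulated DGP (n and the distribution of x are assumptions)
set.seed(111)
n <- 500
stand <- function(v) (v - mean(v)) / sd(v)  # standardize to mean 0, sd 1

x <- rnorm(n)
z <- -1 + 0.05 * stand(x) - 0.1 * stand(x)^2 + rnorm(n, sd = 2)
y <- 1 + 5 * z + 0.5 * x + x^2 + rnorm(n, sd = 10)

# The three OLS benchmarks
naive <- lm(y ~ z)
ctrl  <- lm(y ~ z + x)
inter <- lm(y ~ z + x + I(z * (x - mean(x))))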
A summary of the performance of each of these approaches is given in the table below. Metrics I use to gauge performance include root mean squared error (RMSE), average bias, and coverage of the 95 percent confidence intervals.
| model | RMSE | Ave. Bias | Coverage |
|---|---|---|---|
| OLS | 222.9 | 5.12 | 0.95 |
| OLS + controls | 60.8 | -5.26 | 0.66 |
| OLS + interaction | 60.1 | -5.22 | 0.67 |
| RFA | 9.0 | -0.10 | 0.94 |
| SVM | 11.5 | -0.24 | 0.98 |
RFA clearly performs the best out of all of the adjustment strategies. This is also clear from the following figure showing the distribution of estimates after 100 runs of the simulation. The estimates provided by RFA are the least prone to error and the least biased. In terms of coverage, however, naive OLS, RFA, and SVM are relatively similar.
To perform RFA analysis in R I created a package called
{RFA}, which can be installed by writing:
## install.packages("devtools")
devtools::install_github("milesdwilliams15/RFA")
Here, I demonstrate how the package functions can be used to estimate
causal effects with observational data. I’ll do this with the
GerberGreenImai dataset included in the
{Matching} package. This dataset was used in Imai (2005) to
replicate and extend Gerber and Green’s (2000) get-out-the-vote (GOTV)
field experiment. The dataset was used to assess the causal effect of
telephone calls on turnout. Note that this data is used for
demonstration purposes only, and so results here should not be taken as
confirmation or refutation of the findings of the original authors.
First, let’s get the data into the working environment:
library(tidyverse) # grammar
library(Matching) # for the data
data(GerberGreenImai)
ggi <- GerberGreenImai # give it a shorter name

Next, to implement RFA, we need to load the RFA package:
library(RFA)
#> Loading required package: estimatr
#> Loading required package: ranger
#> Loading required package: lme4
#> Loading required package: Matrix
#>
#> Attaching package: 'Matrix'
#> The following objects are masked from 'package:tidyr':
#>
#>     expand, pack, unpack

The workhorse function in the package is rfa(). The help file for the function can be accessed by entering ?rfa. The function accepts the following arguments:
- formula = a formula object where the left-hand variable is the outcome and the right-hand variable is the explanatory variable of interest.
- covariates = a right-hand-side-only formula object specifying the covariates to be used in the random forest regressions.
- fes_and_res = a formula object only containing the right-hand side, specifying any fixed effects or random effects. For random effects, you should use the notation ‘~ (1 | id)’ as in the ‘lme4’ package.
- data = an optional data frame containing the variables used to implement the RFA routine.
- se_type = specifies the standard errors to be returned. If ‘clusters’ is not specified, the user can specify “classical”, “HC0”, “stata” (equivalent to “HC1”), “HC2”, or “HC3”. If ‘clusters’ is specified, the options are “CR0”, “stata” (CR1), and “CR2”. “stata” is the default.
- clusters = optional name (quoted) of the variable that corresponds to clusters in the data.
- ... = additional arguments to override the default settings for implementing random forests via ‘ranger’. See the ranger package for more details.

Below is an example of how to use rfa with the
ggi dataset. In the formula object below, the left-hand
side variable is a binary response that equals 1 when an individual
citizen turned out to vote in the 1998 congressional election in New
Haven, CT. The treatment variable (whether the individual received a GOTV
phone call) is the right-hand side variable in the formula. The
covariates to control for are given in the covariates
command in a right-hand side only formula. These covariates are: an
individual’s age, whether they voted in the previous election (in 1996),
the number of persons residing in their household, whether they are a
new voter, and whether they support the majority party.
rfa_fit <- rfa(
formula = VOTED98 ~ PHN.C1, # treatment variable
covariates = ~ AGE + VOTE96.1 + PERSONS + NEW + MAJORPTY, # confounders
data = ggi
)

The object rfa_fit consists of a list containing the
fitted regression results:
rfa_fit$fit
#> Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
#> (Intercept) -5.44e-05 0.00418 -0.013 0.989627 -0.00825 0.00814 10827
#> xres 1.00e-01 0.02736 3.667 0.000246 0.04670 0.15395 10827

the random forest regression for the response as a function of covariates:
rfa_fit$yrf
#> Ranger result
#>
#> Call:
#> ranger(y = varmat[, 1], x = cbind(covmat), ...)
#>
#> Type: Regression
#> Number of trees: 500
#> Sample size: 10829
#> Number of independent variables: 5
#> Mtry: 2
#> Target node size: 5
#> Variable importance mode: none
#> Splitrule: variance
#> OOB prediction error (MSE): 0.19
#> R squared (OOB): 0.234

the random forest regression for the causal variable of interest as a function of covariates:
rfa_fit$xrf
#> Ranger result
#>
#> Call:
#> ranger(y = varmat[, 2], x = cbind(covmat), ...)
#>
#> Type: Regression
#> Number of trees: 500
#> Sample size: 10829
#> Number of independent variables: 5
#> Mtry: 2
#> Target node size: 5
#> Variable importance mode: none
#> Splitrule: variance
#> OOB prediction error (MSE): 0.0221
#> R squared (OOB): 0.00778

and an updated dataframe that adds the processed response and explanatory variable to the data used in estimation:
head(rfa_fit$data)
#> VOTED98 PHN.C1 AGE VOTE96.1 PERSONS NEW MAJORPTY yres xres
#> 1221 1 1 74 1 2 0 1 0.287 0.955
#> 1228 0 1 37 1 2 0 0 -0.548 0.981
#> 1236 1 1 24 1 2 0 1 0.561 0.969
#> 1242 1 1 81 1 2 0 1 0.302 0.950
#> 1279 0 1 34 0 2 1 1 -0.325 0.980
#> 1289 1 1 54 0 2 0 1 0.715 0.973

From the above, we see that the RFA routine estimates that the effect
of a GOTV phone call on the probability of turning out to vote is
approximately 0.1. A summary for this estimate is provided by using
summary_rfa. The output also includes the OOB (out of bag)
\(R^2\) for the random forest
regressions for the response and the explanatory variable.
summary_rfa(rfa_fit) %>%
mutate_if(is.numeric, function(x) round(x, 3))
#> term estimate std.error statistic p.value conf.low conf.high
#> 1 (Intercept) 0.0 0.004 -0.013 0.99 -0.008 0.008
#> 2 xres 0.1 0.027 3.667 0.00 0.047 0.154
#> OOB.response OOB.predictor
#> 1 NA NA
#> 2 0.234 0.008

Sure enough, the ATE is statistically significant, with \(p < 0.01\).
We can also use the plot_rfa function to quickly make a coefficient plot for the treatment effect. Since plot_rfa is a wrapper for ggplot2, we can modify the RFA plot however we see fit:
plot_rfa(rfa_fit, varname = "Phone Call") +
labs(
x = "Estimate",
y = NULL,
title = "Effect of GOT phone calls on voter turnout"
) +
geom_vline(xintercept = 0, lty = 2) +
theme_test()

Fixed or random effects can be specified as well. We could do so
using the WARD column in the data.
If we wanted ward fixed effects we would write:
fe_rfa <- rfa(
formula = VOTED98 ~ PHN.C1, # treatment variable
covariates = ~ AGE + VOTE96.1 + PERSONS + NEW + MAJORPTY, # confounders
fes_and_res = ~ WARD, # FEs
data = ggi
)

If we wanted ward random effects we would write:
re_rfa <- rfa(
formula = VOTED98 ~ PHN.C1, # treatment variable
covariates = ~ AGE + VOTE96.1 + PERSONS + NEW + MAJORPTY, # confounders
fes_and_res = ~ (1 | WARD), # REs
data = ggi
)

To compare the results, we can easily create a regression table using
the {texreg} package:
library(texreg)
#> Version: 1.38.6
#> Date: 2022-04-06
#> Author: Philip Leifeld (University of Essex)
#>
#> Consider submitting praise using the praise or praise_interactive functions.
#> Please cite the JSS article in your publications -- see citation("texreg").
#>
#> Attaching package: 'texreg'
#> The following object is masked from 'package:tidyr':
#>
#> extract
htmlreg(
list(
"RFA" = rfa_fit$fit,
"w/ FEs" = fe_rfa$fit,
"w/ REs" = re_rfa$fit
),
custom.coef.map = list(
"xres" = "Phone Call"
),
custom.header = list("Voter Turnout" = 1:3),
include.ci = F,
digits = 3,
caption = "RFA estimates",
caption.above = T
) %>%
htmlTable::htmlTable()
[Table: “RFA estimates” of the Phone Call coefficient on voter turnout for the RFA, w/ FEs, and w/ REs models.]
The point estimate reported via random forest adjustment may not be the only, or even the most interesting, quantity of interest. Researchers may also want to assess which covariates in the data sample were the most substantial sources of confounding, or to characterize the effective sample used to identify the partial relationship between the explanatory variable and the response. As Aronow and Samii (2016) caution, adjusting for covariates often distorts the data sample used to identify a causal relationship of interest: reported estimates frequently reflect a local effect that generalizes to a more circumscribed set of cases than the sample the analysis began with.
Because it is so important to contextualize how adjustment for
covariates changes the sample used to estimate treatment effects, the
{RFA} package includes two additional functions that help
in performing some diagnostics. These are get_cod() and
get_importance().
The latter returns a data frame that reports variable importance metrics for each of the covariates used in the analysis, with respect to the response and explanatory variable of interest.
Usage for get_importance is quite simple. First, if
you’d like to obtain importance metrics, when you perform the RFA
estimation, use the importance argument to specify either
“impurity” or “permutation.” These are the two standard approaches to
quantifying the relative predictive importance of covariates in random
forests. The first sums the number of splits that include a given
variable across all of the regression trees grown in the random forest
(this is the Gini importance or Mean Decrease in Impurity). The second
is the average decrease in the accuracy of model predictions if a given
variable’s relationship with an outcome is “broken” via the permutation
process (this is also known as Mean Decrease in Accuracy).
For example, if we wanted to get the Gini importance for covariates we would simply specify:
rfa_fit <- rfa(
formula = VOTED98 ~ PHN.C1, # treatment variable
covariates = ~ AGE + VOTE96.1 + PERSONS + NEW + MAJORPTY, # confounders
data = ggi,
importance = "impurity" # Specify impurity importance
)

Then, we can use get_importance to return the importance
metrics of the covariates with respect to both the response and
explanatory variable of interest:
get_importance(rfa_fit)
#> Returning importance type: impurity
#> term response predictor
#> 1 AGE 173.5 5.678
#> 2 VOTE96.1 378.8 0.791
#> 3 PERSONS 11.9 0.785
#> 4 NEW 36.2 0.336
#> 5 MAJORPTY 15.1 0.520

These importance metrics can be easily visualized:
imp <- get_importance(rfa_fit)
#> Returning importance type: impurity
imp %>%
gather(
key = "outcome",
value = "value",
-term
) %>%
ggplot() +
aes(
x = value,
y = reorder(term, value),
color = outcome
) +
geom_point() +
scale_x_log10() +
labs(
x = "Gini Importance (log-10 scale)",
y = NULL,
title = "Variable Importance"
)

The importance metrics can also be analyzed to assess the extent to which a covariate’s importance for the explanatory variable of interest is correlated with its importance for the response, which is one signal of confounding.
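One simple way to do this (my own sketch, not a function provided by the package) is to correlate the two importance columns returned by get_importance():

# Correlation between covariate importance for the response and for the
# explanatory variable of interest; the rank-based version is less sensitive
# to the scale of the importance values
cor(imp$response, imp$predictor)
cor(imp$response, imp$predictor, method = "spearman")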
Outside of variable importance, we can also examine the “coefficient of distortion” (COD), which summarizes the difference, in standard deviation units, between the effective sample generated through random forest adjustment and the nominal sample (the original data sample collected). The COD thus helps summarize how (un)representative the effective sample used to identify the point estimate returned via random forest adjustment is of the nominal sample.
To summarize, the COD is calculated for a given covariate \(p\) by first scaling it to standard deviation units and then taking its weighted average: \[\text{COD}_p = \sum_{i = 1}^n w_i\left[ \frac{x_{ip} - \bar{x}_p}{\text{SD}(x_p)}\right] \Big/ \sum_{i=1}^n w_i.\] The weights \(w_i\) are simply the squares of the residualized explanatory variable of interest: \[w_i = (\hat{z}_i^\varepsilon)^2 = [z_i - \hat{f}_z(X_i)]^2.\]
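To fix ideas, the COD for a single covariate can be computed by hand from the processed data frame returned by rfa(). This is just a sketch mirroring the formula above; get_cod() is the package’s own implementation and may differ in details such as how inference is handled.

# Hand-computed COD for AGE, using the squared residualized treatment (xres) as weights
d <- rfa_fit$data
w <- d$xres^2                                 # w_i = squared residualized treatment
age_std <- (d$AGE - mean(d$AGE)) / sd(d$AGE)  # AGE in standard deviation units
sum(w * age_std) / sum(w)                     # should be close to get_cod()'s estimate for AGE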
The COD is calculated for each covariate used in estimation with the
get_cod() function. The function accepts three
arguments:
- rfa = the fitted rfa object.
- include_se = TRUE by default. This specifies whether you’d like to report standard errors and related summary statistics alongside the COD estimates per covariate.
- bootsims = the number of bootstrap iterations to perform to compute the standard errors. By default, this is set to 1,000.

Inference for each of the COD estimates is done via bootstrapping.
Applying get_cod to the fitted rfa object,
we get:
cods <-
get_cod(rfa_fit, bootsims = 100) # set to 100 to speed up
#> Bootstrapping....
cods # view
#> term estimate std.error statistic p.value conf.low conf.high
#> 1 AGE 0.4497 0.0613 7.340 2.13e-13 0.3296 0.570
#> 2 VOTE96.1 0.3472 0.0587 5.917 3.27e-09 0.2322 0.462
#> 3 PERSONS 0.0228 0.0631 0.361 7.18e-01 -0.1008 0.146
#> 4 NEW -0.2098 0.0494 -4.244 2.19e-05 -0.3067 -0.113
#> 5 MAJORPTY 0.1262 0.0589 2.142 3.22e-02 0.0107 0.242

Like the importance metrics, these results can be easily visualized:
ggplot(cods) +
aes(
x = estimate,
y = reorder(term, estimate^2),
xmin = conf.low,
xmax = conf.high
) +
geom_point() +
geom_errorbarh(height = 0) +
geom_vline(
xintercept = 0,
lty = 2
) +
labs(
x = "Distortion with 95% confidence intervals",
y = NULL,
title = "Effective vs. Nominal Data Samples"
)

These tools, combined, help to demystify random-forest-adjusted estimates. For instance, the importance metrics above reveal a moderately strong correlation between the importance of covariates in predicting the explanatory variable of interest (a get-out-the-vote phone call) and their importance in predicting the response (voter turnout). This suggests a nontrivial degree of confounding in the relationship between the explanatory variable and the response.
Further, the COD estimates suggest that the process of adjusting for this confounding produces an effective sample that differs markedly from the nominal sample. Notably, the effective sample skews significantly older, is more likely to have voted in the previous election two years prior, and is less likely to include new voters, among other differences. This implies a more constrained set of cases to which the identified relationship between GOTV phone calls and turnout can be generalized.
Taken together, the tools in the RFA package are powerful allies in the analysis of observational data. Random forest adjustment has the potential to be robust to various forms of confounding in observational data, while also being amenable to helpful diagnostics that contextualize the reported results.