RFA package

The repository for this project can be accessed here. Check out my website to learn more about me and my research.
Random Forest Adjustment (RFA) is a procedure for estimating the relationship between some predictor and a response after partialing out variation in each given a set of covariates via random forests. Below I provide a brief explanation of the technique and demonstrate how it is implemented in R.
RFA is a flexible approach to regression adjustment for controlling for covariates in observational settings. By leveraging a nonparametric machine learning algorithm such as random forests, RFA sidesteps the incidental functional form assumptions imposed by the standard multiple linear regression approach to covariate adjustment. This makes RFA robust to more complex (nonlinear or nonadditive) forms of confounding relationships in observational data that a parametric approach may fail to account for. The latest version of the package can accommodate fixed and random effects as well.
RFA works by partialing out variation in an outcome \(y_i\) and some causal variable of interest \(z_i\) as a function of covariates \(x_{ip} \in X_i\), where \(y_i, z_i \not\!\perp\!\!\!\perp X_i\). The variables \(y_i\) and \(z_i\) denote the response and predictor values for the \(i^\text{th}\) observation, where \(i \in I = \{1, 2, ..., n\}\).
Because \(z_i\) and \(y_i\) are not independent of \(X_i\), the estimated slope coefficient (\(\alpha_1\)) from the following linear model may not reflect the true effect of \(z_i\) on \(y_i\): \[y_i = \alpha_0 + \alpha_1z_i + \epsilon_i. \tag{1}\] RFA’s solution to this problem is to use random forests to pre-process the explanatory variable and the response, partialing out the variation in \(y_i\) and \(z_i\) explained by \(X_i\) prior to estimating the relationship between the two. This is done via the following steps:
\[ \begin{aligned} \hat{y}_i & = \hat{f}_y(X_i), \\ \hat{y}_i^\varepsilon & = y_i - \hat{y}_i. \end{aligned} \tag{2} \]
\[ \begin{aligned} \hat{z}_i & = \hat{f}_z(X_i), \\ \hat{z}_i^\varepsilon & = z_i - \hat{z}_i. \end{aligned} \tag{3} \]
\[\hat{y}_i^\varepsilon = \beta_0 + \beta_1\hat{z}_i^\varepsilon + \mu_i.\tag{4}\]
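To make these steps concrete, here is a minimal hand-rolled sketch of equations (2) through (4) using the {ranger} package. This is only an illustration of the logic, not the {RFA} package’s internal code; in particular, the objects y, z, and X are hypothetical, and using out-of-bag predictions to form the residuals is an assumption on my part.

library(ranger)

# Step (2): residualize the response on the covariates with a random forest
# (y = response vector, X = data frame of covariates)
y_rf  <- ranger(y = y, x = X)
y_res <- y - y_rf$predictions   # out-of-bag predictions

# Step (3): residualize the explanatory variable of interest the same way
z_rf  <- ranger(y = z, x = X)
z_res <- z - z_rf$predictions

# Step (4): regress the residualized response on the residualized predictor
lm(y_res ~ z_res)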
If desired, fixed or random effects are incorporated by conducting an additional pre-processing step where the response, the causal variable, and each of the covariates are demeaned via fixed effect, random effect, or mixed effect regressions prior to random forest adjustment.
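As a rough sketch of what that pre-processing might look like in the fixed effects case (again my own reconstruction rather than the package’s actual code, with hypothetical objects y, z, X, and a grouping variable group):

# Demean each variable on the group dummies before the random forest step
demean_on <- function(v, group) residuals(lm(v ~ factor(group)))

y_dm <- demean_on(y, group)                                 # demeaned response
z_dm <- demean_on(z, group)                                 # demeaned causal variable
X_dm <- as.data.frame(lapply(X, demean_on, group = group))  # demeaned covariates
# ...then apply steps (2)-(4) above to y_dm, z_dm, and X_dm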
The advantage of RFA is that it sidesteps incidental functional form assumptions that the more conventional multiple regression adjustment strategy imposes. This can be easily demonstrated with a simple simulation.
For the simulation:
\[y_i = 1 + 5z_i + 0.5x_i + x_i^2 + e_i, \quad e_i \sim \mathcal{N}(0, \sigma = 10),\]
\[z_i = -1 + 0.05\,\text{stand}(x_i) - 0.1\,\text{stand}(x_i)^2 + u_i, \quad u_i \sim \mathcal{N}(0, \sigma = 2),\]
where \(\text{stand}(\cdot)\) standardizes a variable to mean zero and unit variance.
For each iteration of the simulated data-generating process, I recover estimates of the effect of \(z_i\) on \(y_i\) via five approaches:

- Naive OLS: \[y_i = \beta_0 + \beta_1z_i + \epsilon_i;\]
- OLS with a control for \(x_i\): \[y_i = \beta_0 + \beta_1z_i + \beta_2x_i + \epsilon_i;\]
- OLS with an interaction: \[y_i = \beta_0 + \beta_1z_i + \beta_2x_i + \beta_3z_i\cdot(x_i - \bar{x}) + \epsilon_i;\]
- adjustment via a Support Vector Machine (SVM), an alternative machine learner to random forests;
- adjustment via random forests (RFA).
Because both the SVM and random forest adjustments take a while to run, I restricted the analysis to 100 iterations.
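For reference, a single iteration of this data-generating process might look like the following. The write-up doesn’t state the per-iteration sample size or the distribution of \(x_i\), so the choice of n = 500 and the standard normal draw for x here are arbitrary.

# One draw from the simulated DGP (n and the distribution of x are assumptions)
set.seed(111)
n <- 500
stand <- function(v) (v - mean(v)) / sd(v)  # standardize to mean 0, sd 1

x <- rnorm(n)
z <- -1 + 0.05 * stand(x) - 0.1 * stand(x)^2 + rnorm(n, sd = 2)
y <- 1 + 5 * z + 0.5 * x + x^2 + rnorm(n, sd = 10)

# The three OLS benchmarks
naive <- lm(y ~ z)
ctrl  <- lm(y ~ z + x)
inter <- lm(y ~ z + x + I(z * (x - mean(x))))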
A summary of the performance of each of these approaches is given in the table below. Metrics I use to gauge performance include root mean squared error (RMSE), average bias, and coverage of the 95 percent confidence intervals.
| model | RMSE | Ave. Bias | Coverage |
|---|---|---|---|
| OLS | 222.9 | 5.12 | 0.95 |
| OLS + controls | 60.8 | -5.26 | 0.66 |
| OLS + interaction | 60.1 | -5.22 | 0.67 |
| RFA | 9.0 | -0.10 | 0.94 |
| SVM | 11.5 | -0.24 | 0.98 |
RFA clearly performs the best out of all of the adjustment strategies. This is also clear from the following figure showing the distribution of estimates after 100 runs of the simulation. The estimates provided by RFA are the least prone to error and the least biased. In terms of coverage, however, naive OLS, RFA, and SVM are relatively similar.
To perform RFA analysis in R I created a package called
{RFA}, which can be installed by writing:
## install.packages("devtools")
devtools::install_github("milesdwilliams15/RFA")
Here, I demonstrate how the package functions can be used to estimate
causal effects with observational data. I’ll do this with the
GerberGreenImai dataset included in the
{Matching} package. This dataset was used in Imai (2005) to
replicate and extend Gerber and Green’s (2000) get-out-the-vote (GOTV)
field experiment. The dataset was used to assess the causal effect of
telephone calls on turnout. Note that this data is used for
demonstration purposes only, and so results here should not be taken as
confirmation or refutation of the findings of the original authors.
First, let’s get the data into the working environment:
library(tidyverse) # grammar
library(Matching) # for the data
data(GerberGreenImai)
ggi <- GerberGreenImai # give it a shorter name

Next, to implement RFA, we need to load the RFA package:
library(RFA)
#> Loading required package: estimatr
#> Loading required package: ranger
#> Loading required package: lme4
#> Loading required package: Matrix
#>
#> Attaching package: 'Matrix'
#> The following objects are masked from 'package:tidyr':
#>
#>     expand, pack, unpack

The workhorse function in the package is rfa(). The help file for the function can be accessed by entering ?rfa. The function accepts the following arguments:
- formula = a formula object where the left-hand variable is the outcome and the right-hand variable is the explanatory variable of interest.
- covariates = a right-hand-side-only formula object specifying the covariates to be used in the random forest regressions.
- fes_and_res = a formula object only containing the right-hand side, specifying any fixed effects or random effects. For random effects, you should use the notation ‘~ (1 | id)’ as in the ‘lme4’ package.
- data = an optional data frame containing the variables used to implement the RFA routine.
- se_type = specifies the standard errors to be returned. If ‘clusters’ is not specified, the user can specify “classical”, “HC0”, “stata” (equivalent to “HC1”), “HC2”, or “HC3”. If ‘clusters’ is specified, the options are “CR0”, “stata” (CR1), and “CR2”. “stata” is the default.
- clusters = optional name (quoted) of the variable that corresponds to clusters in the data.
- ... = additional arguments to override the default settings for implementing random forests via ‘ranger’. See the ranger package for more details.

Below is an example of how to use rfa with the
ggi dataset. In the formula object below, the left-hand
side variable is a binary response that equals 1 when an individual
citizen turned out to vote in the 1998 congressional election in New
Haven, CT. The treatment variable (whether the individual received a GOTV
phone call) is the right-hand side variable in the formula. The
covariates to control for are given in the covariates
command in a right-hand side only formula. These covariates are: an
individual’s age, whether they voted in the previous election (in 1996),
the number of persons residing in their household, whether they are a
new voter, and whether they support the majority party.
rfa_fit <- rfa(
formula = VOTED98 ~ PHN.C1, # treatment variable
covariates = ~ AGE + VOTE96.1 + PERSONS + NEW + MAJORPTY, # confounders
data = ggi
)

The object rfa_fit consists of a list containing the
fitted regression results:
rfa_fit$fit
#> Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
#> (Intercept) -5.44e-05 0.00418 -0.013 0.989627 -0.00825 0.00814 10827
#> xres 1.00e-01 0.02736 3.667 0.000246 0.04670 0.15395 10827

the random forest regression for the response as a function of covariates:
rfa_fit$yrf
#> Ranger result
#>
#> Call:
#> ranger(y = varmat[, 1], x = cbind(covmat), ...)
#>
#> Type: Regression
#> Number of trees: 500
#> Sample size: 10829
#> Number of independent variables: 5
#> Mtry: 2
#> Target node size: 5
#> Variable importance mode: none
#> Splitrule: variance
#> OOB prediction error (MSE): 0.19
#> R squared (OOB): 0.234

the random forest regression for the causal variable of interest as a function of covariates:
rfa_fit$xrf
#> Ranger result
#>
#> Call:
#> ranger(y = varmat[, 2], x = cbind(covmat), ...)
#>
#> Type: Regression
#> Number of trees: 500
#> Sample size: 10829
#> Number of independent variables: 5
#> Mtry: 2
#> Target node size: 5
#> Variable importance mode: none
#> Splitrule: variance
#> OOB prediction error (MSE): 0.0221
#> R squared (OOB): 0.00778

and an updated dataframe that adds the processed response and explanatory variable to the data used in estimation:
head(rfa_fit$data)
#> VOTED98 PHN.C1 AGE VOTE96.1 PERSONS NEW MAJORPTY yres xres
#> 1221 1 1 74 1 2 0 1 0.287 0.955
#> 1228 0 1 37 1 2 0 0 -0.548 0.981
#> 1236 1 1 24 1 2 0 1 0.561 0.969
#> 1242 1 1 81 1 2 0 1 0.302 0.950
#> 1279 0 1 34 0 2 1 1 -0.325 0.980
#> 1289 1 1 54 0 2 0 1 0.715 0.973

From the above, we see that the RFA routine estimates that the effect
of a GOTV phone call on the probability of turning out to vote is
approximately 0.1. A summary for this estimate is provided by using
summary_rfa. The output also includes the OOB (out of bag)
\(R^2\) for the random forest
regressions for the response and the explanatory variable.
summary_rfa(rfa_fit) %>%
mutate_if(is.numeric, function(x) round(x, 3))
#> term estimate std.error statistic p.value conf.low conf.high
#> 1 (Intercept) 0.0 0.004 -0.013 0.99 -0.008 0.008
#> 2 xres 0.1 0.027 3.667 0.00 0.047 0.154
#> OOB.response OOB.predictor
#> 1 NA NA
#> 2 0.234 0.008

Sure enough, the ATE is statistically significant, with \(p < 0.01\).
We can also use the plot_rfa function to quickly make a coefficient plot for the treatment effect. Since plot_rfa is a wrapper for ggplot2, we can modify the RFA plot however we see fit:
plot_rfa(rfa_fit, varname = "Phone Call") +
labs(
x = "Estimate",
y = NULL,
title = "Effect of GOT phone calls on voter turnout"
) +
geom_vline(xintercept = 0, lty = 2) +
theme_test()

Fixed or random effects can be specified as well. We could do so
using the WARD column in the data.
If we wanted ward fixed effects we would write:
fe_rfa <- rfa(
formula = VOTED98 ~ PHN.C1, # treatment variable
covariates = ~ AGE + VOTE96.1 + PERSONS + NEW + MAJORPTY, # confounders
fes_and_res = ~ WARD, # FEs
data = ggi
)

If we wanted ward random effects we would write:
re_rfa <- rfa(
formula = VOTED98 ~ PHN.C1, # treatment variable
covariates = ~ AGE + VOTE96.1 + PERSONS + NEW + MAJORPTY, # confounders
fes_and_res = ~ (1 | WARD), # REs
data = ggi
)

To compare the results, we can easily create a regression table using
the {texreg} package:
library(texreg)
#> Version: 1.38.6
#> Date: 2022-04-06
#> Author: Philip Leifeld (University of Essex)
#>
#> Consider submitting praise using the praise or praise_interactive functions.
#> Please cite the JSS article in your publications -- see citation("texreg").
#>
#> Attaching package: 'texreg'
#> The following object is masked from 'package:tidyr':
#>
#> extract
htmlreg(
list(
"RFA" = rfa_fit$fit,
"w/ FEs" = fe_rfa$fit,
"w/ REs" = re_rfa$fit
),
custom.coef.map = list(
"xres" = "Phone Call"
),
custom.header = list("Voter Turnout" = 1:3),
include.ci = F,
digits = 3,
caption = "RFA estimates",
caption.above = T
) %>%
htmlTable::htmlTable()
[Table: “RFA estimates” of the Phone Call coefficient on voter turnout for the RFA, w/ FEs, and w/ REs models.]
The point estimate reported via random forest adjustment may not be the only, or even the most interesting, quantity of interest. Researchers may also want to assess which covariates in the data sample were the most substantial sources of confounding, or to characterize the effective sample used to identify the partial relationship between the explanatory variable and the response. As Aronow and Samii (2016) caution, adjusting for covariates often distorts the data sample used to identify a causal relationship of interest: reported estimates frequently reflect a local effect that generalizes to a more circumscribed set of cases than the sample the analysis began with.
Because it is so important to contextualize how adjustment for
covariates changes the sample used to estimate treatment effects, the
{RFA} package includes two additional functions that help
in performing some diagnostics. These are get_cod() and
get_importance().
The latter returns a data frame that reports variable importance metrics for each of the covariates used in the analysis, with respect to the response and explanatory variable of interest.
Usage for get_importance is quite simple. First, if
you’d like to obtain importance metrics, when you perform the RFA
estimation, use the importance argument to specify either
“impurity” or “permutation.” These are the two standard approaches to
quantifying the relative predictive importance of covariates in random
forests. The first sums the number of splits that include a given
variable across all of the regression trees grown in the random forest
(this is the Gini importance or Mean Decrease in Impurity). The second
is the average decrease in the accuracy of model predictions if a given
variable’s relationship with an outcome is “broken” via the permutation
process (this is also known as Mean Decrease in Accuracy).
For example, if we wanted to get the Gini importance for covariates we would simply specify:
rfa_fit <- rfa(
formula = VOTED98 ~ PHN.C1, # treatment variable
covariates = ~ AGE + VOTE96.1 + PERSONS + NEW + MAJORPTY, # confounders
data = ggi,
importance = "impurity" # Specify impurity importance
)

Then, we can use get_importance to return the importance
metrics of the covariates with respect to both the response and
explanatory variable of interest:
get_importance(rfa_fit)
#> Returning importance type: impurity
#> term response predictor
#> 1 AGE 173.5 5.678
#> 2 VOTE96.1 378.8 0.791
#> 3 PERSONS 11.9 0.785
#> 4 NEW 36.2 0.336
#> 5 MAJORPTY 15.1 0.520

These importance metrics can be easily visualized:
imp <- get_importance(rfa_fit)
#> Returning importance type: impurity
imp %>%
gather(
key = "outcome",
value = "value",
-term
) %>%
ggplot() +
aes(
x = value,
y = reorder(term, value),
color = outcome
) +
geom_point() +
scale_x_log10() +
labs(
x = "Gini Importance (log-10 scale)",
y = NULL,
title = "Variable Importance"
)

The importance metrics can also be analyzed to assess the extent to which a covariate’s importance for the explanatory variable of interest is correlated with its importance for the response, which is one signal of confounding.
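One simple way to do this (my own sketch, not a function provided by the package) is to correlate the two importance columns returned by get_importance():

# Correlation between covariate importance for the response and for the
# explanatory variable of interest; the rank-based version is less sensitive
# to the scale of the importance values
cor(imp$response, imp$predictor)
cor(imp$response, imp$predictor, method = "spearman")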
Outside of variable importance, we can also examine the “coefficient of distortion” (COD), which summarizes the difference, in standard deviation units, between the effective sample generated through random forest adjustment and the nominal sample (the original data sample collected). The COD thus helps summarize how (un)representative the effective sample used to identify the point estimate returned via random forest adjustment is of the nominal sample.
To summarize, the COD is calculated for a given covariate \(p\) by first scaling it to standard deviation units and then taking its weighted average: \[\text{COD}_p = \sum_{i = 1}^n w_i\left[ \frac{x_{ip} - \bar{x}_p}{\text{SD}(x_p)}\right] \Big/ \sum_{i=1}^n w_i.\] The weights \(w_i\) are simply the squares of the residualized explanatory variable of interest: \[w_i = (\hat{z}_i^\varepsilon)^2 = [z_i - \hat{f}_z(X_i)]^2.\]
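To fix ideas, the COD for a single covariate can be computed by hand from the processed data frame returned by rfa(). This is just a sketch mirroring the formula above; get_cod() is the package’s own implementation and may differ in details such as how inference is handled.

# Hand-computed COD for AGE, using the squared residualized treatment (xres) as weights
d <- rfa_fit$data
w <- d$xres^2                                 # w_i = squared residualized treatment
age_std <- (d$AGE - mean(d$AGE)) / sd(d$AGE)  # AGE in standard deviation units
sum(w * age_std) / sum(w)                     # should be close to get_cod()'s estimate for AGE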
The COD is calculated for each covariate used in estimation with the
get_cod() function. The function accepts three
arguments:
- rfa = the fitted rfa object.
- include_se = TRUE by default. This specifies whether you’d like to report standard errors and related summary statistics alongside the COD estimates per covariate.
- bootsims = the number of bootstrap iterations to perform to compute the standard errors. By default, this is set to 1,000.

Inference for each of the COD estimates is done via bootstrapping.
Applying get_cod to the fitted rfa object,
we get:
cods <-
get_cod(rfa_fit, bootsims = 100) # set to 100 to speed up
#> Bootstrapping....
cods # view
#> term estimate std.error statistic p.value conf.low conf.high
#> 1 AGE 0.4497 0.0613 7.340 2.13e-13 0.3296 0.570
#> 2 VOTE96.1 0.3472 0.0587 5.917 3.27e-09 0.2322 0.462
#> 3 PERSONS 0.0228 0.0631 0.361 7.18e-01 -0.1008 0.146
#> 4 NEW -0.2098 0.0494 -4.244 2.19e-05 -0.3067 -0.113
#> 5 MAJORPTY 0.1262 0.0589 2.142 3.22e-02 0.0107 0.242

Like the importance metrics, these results can be easily visualized:
ggplot(cods) +
aes(
x = estimate,
y = reorder(term, estimate^2),
xmin = conf.low,
xmax = conf.high
) +
geom_point() +
geom_errorbarh(height = 0) +
geom_vline(
xintercept = 0,
lty = 2
) +
labs(
x = "Distortion with 95% confidence intervals",
y = NULL,
title = "Effective vs. Nominal Data Samples"
)

These tools, combined, help to demystify random-forest-adjusted estimates. For instance, the importance metrics above reveal a moderately strong correlation between the importance of covariates in predicting the explanatory variable of interest (a get-out-the-vote phone call) and their importance in predicting the response (voter turnout). This suggests a nontrivial degree of confounding in the relationship between the explanatory variable and the response.
Further, the COD estimates suggest that the process of adjusting for this confounding produces an effective sample that differs markedly from the nominal sample. Notably, the effective sample skews significantly older, is more likely to have voted in the previous election two years prior, and is less likely to include new voters, among other differences. This implies a more constrained set of cases to which the identified relationship between GOTV phone calls and turnout can be generalized.
Taken together, the tools in the RFA package are powerful allies in the analysis of observational data. Random forest adjustment has the potential to be robust to various forms of confounding in observational data, while also being amenable to helpful diagnostics that contextualize the reported results.