Coarsened exact matching (CEM) is a matching method designed to improve the estimation of causal effects; it is widely applicable and exceptionally easy to understand and use (Iacus et al., 2012).
Observational data tend to be technically and logistically easier to collect than randomised experimental data, and hence tend to be more readily available. They come with problems, however: the treatment assignment mechanism is typically non-random and unknown, so the distribution of the covariates (X) differs between the treatment and control groups. We must therefore match the treatment and control groups on these covariates to mitigate bias.
First, since it is generally not wise to obtain a very precise estimate of a drastically wrong quantity, the investigator should be more concerned about having an estimate with small bias than one with small variance. Second, since in many observational studies the sample sizes are large enough that the sampling variances of estimators are small, sensitivity to bias is the dominant source of uncertainty. CEM is a researcher's first line of defense against such problems.
We begin with a naive estimate of the sample average treatment effect on the treated (SATT) which would be useful only if the in-sample distribution of pre-treatment covariates were the same in the treatment and control groups:
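In symbols, the naive estimator is simply the difference in mean outcomes Y between the treated group T and the control group C (the notation below is ours, added for clarity):

$$\widehat{\mathrm{SATT}}_{\text{naive}} = \frac{1}{n_T}\sum_{i \in T} Y_i - \frac{1}{n_C}\sum_{i \in C} Y_i$$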
We load the package and the LeLonde data, remove the observations with missing values, and then inspect the data using the glimpse() function.
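The setup for this step might look as follows (a sketch, assuming the cem and dplyr packages are installed; LeLonde is the version of the data shipped with cem that contains missing values):
library(cem)    # matching functions plus the LeLonde and LL data sets
library(dplyr)  # provides glimpse()
data(LeLonde)   # modified Lalonde data, including missing values
Le <- data.frame(na.omit(LeLonde))  # drop rows with missing values before the analysis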
glimpse(Le)
## Observations: 650
## Variables: 13
## $ treated (dbl) 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1...
## $ age (int) 20, 39, 49, 26, 38, 28, 27, 33, 44, 31, 29, 20, 22, ...
## $ education (int) 12, 12, 8, 8, 10, 12, 12, 12, 9, 12, 12, 11, 10, 9, ...
## $ black (int) 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0...
## $ married (int) 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0...
## $ nodegree (int) 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0...
## $ re74 (dbl) 8644.1562, 19785.3203, 9714.5967, 37211.7578, 14759....
## $ re75 (dbl) 8644.1562, 6608.1372, 7285.9478, 36941.2656, 14701.9...
## $ re78 (dbl) 11656.50586, 499.25720, 16717.12109, 30247.50000, 43...
## $ hispanic (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ u74 (dbl) 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0...
## $ u75 (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ q1 (fctr) strongly agree, neutral, no opinion, no opinion, st...
This is a modified version of the Lalonde experimental data set, used here for illustration only (LaLonde, 1986). The program provided training (treated = 1) to the participants for 12 to 18 months and helped them find a job. The goal of the program was to increase participants' earnings, so 1978 earnings (re78) is the key outcome variable. All the other variables can be considered covariates. Some of these are dichotomous (married, nodegree, black, hispanic, u74, and u75), some are categorical (age and education), and the earnings variables are continuous and highly skewed, with point masses at zero.
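A quick way to see the skewness and the point masses at zero mentioned above (an illustrative check, not part of the original workflow):
summary(Le$re74)    # highly skewed distribution of 1974 earnings
mean(Le$re74 == 0)  # proportion of individuals with zero 1974 earnings
mean(Le$re75 == 0)  # proportion of individuals with zero 1975 earnings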
tr <- which(Le$treated == 1)
ct <- which(Le$treated == 0)
ntr <- length(tr) # count the number in the treated group
nct <- length(ct) # count the number in the control group
ntr
## [1] 258
nct
## [1] 392
We see that there are more control (392) than treated individuals (258). We can naively estimate the SATT by looking at the difference in the mean of the response variable between groups.
mean(Le$re78[tr]) - mean(Le$re78[ct])
## [1] 759.0479
This naive comparison suggests that treated individuals earn, on average, about $759 more than non-treated individuals.
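The same naive comparison can be wrapped in a two-sample test to attach a confidence interval to it (a sketch using base R's t.test(); not part of the package workflow):
t.test(re78 ~ treated, data = Le)  # naive comparison of mean 1978 earnings by group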
Because the variable treated was not randomly assigned, the pre-treatment covariates differ between the treated and control groups, so this naive difference in means cannot be trusted as an estimate of the SATT: the data are highly imbalanced and relatively few good matches exist within the control group. To see this, we focus on the pre-treatment covariates:
vars <- c("age", "education", "black", "married", "nodegree", "re74", "re75", "hispanic", "u74", "u75", "q1") # create a vector of covariate names
imbalance(group = Le$treated, data = Le[vars])
##
## Multivariate Imbalance Measure: L1=0.902
## Percentage of local common support: LCS=5.8%
##
## Univariate Imbalance Measures:
##
##              statistic   type           L1 min 25%      50%       75%        max
## age       -0.252373042 (diff) 5.102041e-03   0   0   0.0000   -1.0000    -6.0000
## education  0.153634710 (diff) 8.463851e-02   1   0   1.0000    1.0000     1.0000
## black     -0.010322734 (diff) 1.032273e-02   0   0   0.0000    0.0000     0.0000
## married   -0.009551495 (diff) 9.551495e-03   0   0   0.0000    0.0000     0.0000
## nodegree  -0.081217371 (diff) 8.121737e-02   0  -1   0.0000    0.0000     0.0000
## re74     -18.160446880 (diff) 5.551115e-17   0   0 284.0715  806.3452 -2139.0195
## re75     101.501761679 (diff) 5.551115e-17   0   0 485.6310 1238.4114   490.3945
## hispanic  -0.010144756 (diff) 1.014476e-02   0   0   0.0000    0.0000     0.0000
## u74       -0.045582186 (diff) 4.558219e-02   0   0   0.0000    0.0000     0.0000
## u75       -0.065555292 (diff) 6.555529e-02   0   0   0.0000    0.0000     0.0000
## q1         7.494021189 (Chi2) 1.067078e-01  NA  NA       NA        NA         NA
The overall imbalance is given by the multivariate L1 statistic, here L1 = 0.902. This measure is important because even if the marginal distribution of every single variable were perfectly balanced, the joint distribution could still be highly imbalanced.
In the table of univariate measures, the column labeled statistic reports the difference in means for numerical variables (marked (diff) in the type column) or a chi-square difference for categorical variables (marked (Chi2)). The column labeled L1 reports the L1 measure computed for that variable separately (which of course does not include interactions). The remaining columns report the differences between the empirical quantiles of the two groups' distributions at the 0th (min), 25th, 50th, 75th, and 100th (max) percentiles. When the type is (Chi2), the variable-specific L1 is the only measure defined, so the quantile differences are reported as NA.
Broadly speaking, this table shows that re74 and re75 are imbalanced in the raw data in many respects, and that age is balanced in means but not in the quantiles of the two distributions.
In our running example the treatment variable is dichotomous. In the following code, we match on all variables except re78, which is the outcome variable and so should never be included in the matching. We therefore pass re78 to the drop argument:
mat <- cem(treatment = "treated", data = Le, drop = "re78", eval.imbalance = TRUE)
mat
## G0 G1
## All 392 258
## Matched 95 84
## Unmatched 297 174
##
##
## Multivariate Imbalance Measure: L1=0.605
## Percentage of local common support: LCS=25.8%
##
## Univariate Imbalance Measures:
##
## statistic type L1 min 25% 50% 75% max
## age 9.404762e-02 (diff) 0.000000e+00 0 0 1 0.0000 0.0000
## education -2.222222e-02 (diff) 2.222222e-02 0 0 0 0.0000 0.0000
## black 1.110223e-16 (diff) 0.000000e+00 0 0 0 0.0000 0.0000
## married 0.000000e+00 (diff) 0.000000e+00 0 0 0 0.0000 0.0000
## nodegree 0.000000e+00 (diff) 0.000000e+00 0 0 0 0.0000 0.0000
## re74 1.496418e+02 (diff) 0.000000e+00 0 0 0 463.3308 889.5410
## re75 1.587521e+02 (diff) 0.000000e+00 0 0 0 843.6863 -640.9307
## hispanic 0.000000e+00 (diff) 0.000000e+00 0 0 0 0.0000 0.0000
## u74 0.000000e+00 (diff) 0.000000e+00 0 0 0 0.0000 0.0000
## u75 0.000000e+00 (diff) 0.000000e+00 0 0 0 0.0000 0.0000
## q1 2.083349e+00 (Chi2) 1.387779e-17 NA NA NA NA NA
These results show the number of observations matched, and thus retained, as well as those pruned because no comparable units were available. Because we set the eval.imbalance argument to TRUE, we can also see that the imbalance of the matched data has fallen, with L1 reduced from 0.902 to 0.605.
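By default cem() chooses the coarsening automatically, but the analyst can also control it directly through the cutpoints argument (and grouping for categorical variables). Below is a sketch with illustrative bin choices; the cut points are examples, not recommendations:
re74cut <- seq(0, 40000, by = 5000)  # $5,000-wide bins for 1974 earnings (illustrative)
agecut  <- c(20, 30, 40, 50)         # coarsen age into decades (illustrative)
mat2 <- cem(treatment = "treated", data = Le, drop = "re78",
            cutpoints = list(re74 = re74cut, age = agecut),
            eval.imbalance = TRUE)
mat2  # compare the resulting L1 and matched counts with the automatic coarsening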
To estimate the SATT, we re-run the matching on the original experimental data set LL (also supplied with the package and free of missing values) and pass the cem() output to att():
data(LL)
mat <- cem(treatment = "treated", data = LL, drop = "re78")
# mat$matched  # indicates, for each observation, whether it was matched; useful for subsetting
est <- att(mat, re78 ~ treated, data = LL)
est
##
## G0 G1
## All 425 297
## Matched 222 163
## Unmatched 203 134
##
## Linear regression model on CEM matched data:
##
## SATT point estimate: 550.962564 (p.value=0.368242)
## 95% conf. interval: [-647.777701, 1749.702830]
Compare this estimate of about $551 to our earlier naive estimate of $759. Note that the 95% confidence interval includes zero, so the effect is not statistically significant at the 5% level (p = 0.37). This information will be useful in determining the cost-benefit ratio for this intervention and in comparing it with other available interventions.
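As flagged in the commented-out line above, mat$matched records which observations were matched, which is handy if you want to carry out further analysis on the matched subset yourself (a sketch; matched_LL is our own name):
matched_LL <- LL[mat$matched == 1, ]  # keep only the matched observations
table(matched_LL$treated)             # matched control and treated counts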
The package also has the flexibility to match the data and then use an appropriate statistical model for parameter estimation. When used properly with informative data, CEM can reduce model dependence and bias and improve efficiency across a wide range of potential applications. This mitigates the bias inherent in drawing causal inferences from observational data, helping charities improve their estimates of the effect size of their interventions on the target group.
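For instance, any residual imbalance can be adjusted for by adding pre-treatment covariates to the model formula passed to att(); the covariates chosen below are illustrative only:
est_adj <- att(mat, re78 ~ treated + re74 + re75, data = LL)  # SATT adjusted for pre-treatment earnings
est_adj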
sessionInfo()
## R version 3.2.3 (2015-12-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 10586)
##
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252
## [2] LC_CTYPE=English_United Kingdom.1252
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United Kingdom.1252
##
## attached base packages:
## [1] tcltk stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] dplyr_0.4.3 cem_1.1.17 lattice_0.20-33
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.3 randomForest_4.6-12 assertthat_0.1
## [4] digest_0.6.9 MASS_7.3-45 R6_2.1.1
## [7] grid_3.2.3 DBI_0.3.1 nlme_3.1-122
## [10] MatchIt_2.4-21 formatR_1.2.1 magrittr_1.5
## [13] evaluate_0.8 stringi_1.0-1 combinat_0.0-8
## [16] rmarkdown_0.9.2 tools_3.2.3 stringr_1.0.0
## [19] parallel_3.2.3 yaml_2.1.13 htmltools_0.3
## [22] knitr_1.12