Coarsened exact matching (CEM) is a matching method designed to improve the estimation of causal effects; it is widely applicable and exceptionally easy to understand and use (Iacus et al., 2012).
Observational data tend to be technically and logistically easier to collect than randomised experimental data, and hence tend to be more readily available. They come with problems, however: the treatment assignment mechanism is typically non-random and unknown, so the distribution of the covariates (X) differs between the treatment and control groups. We must therefore match the treatment and control groups on these covariates to mitigate bias.
First, since it is generally not wise to obtain a very precise estimate of a drastically wrong quantity, the investigator should be more concerned about having an estimate with small bias than one with small variance. Second, since in many observational studies the sample sizes are large enough that the sampling variances of estimators are small, sensitivity to bias is the dominant source of uncertainty. CEM is a researcher's first line of defense against such problems.
We begin with a naive estimate of the sample average treatment effect on the treated (SATT) which would be useful only if the in-sample distribution of pre-treatment covariates were the same in the treatment and control groups:
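In symbols, the naive estimator is simply the difference in mean outcomes Y between the treated group T and the control group C (the notation below is ours, added for clarity):

$$\widehat{\mathrm{SATT}}_{\text{naive}} = \frac{1}{n_T}\sum_{i \in T} Y_i - \frac{1}{n_C}\sum_{i \in C} Y_i$$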
We load the package and the LeLonde data, remove the observations with missing values, and then inspect the data using the glimpse() function.
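The setup for this step might look as follows (a sketch, assuming the cem and dplyr packages are installed; LeLonde is the version of the data shipped with cem that contains missing values):
library(cem)    # matching functions plus the LeLonde and LL data sets
library(dplyr)  # provides glimpse()
data(LeLonde)   # modified Lalonde data, including missing values
Le <- data.frame(na.omit(LeLonde))  # drop rows with missing values before the analysis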
glimpse(Le)
## Observations: 650
## Variables: 13
## $ treated (dbl) 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1...
## $ age (int) 20, 39, 49, 26, 38, 28, 27, 33, 44, 31, 29, 20, 22, ...
## $ education (int) 12, 12, 8, 8, 10, 12, 12, 12, 9, 12, 12, 11, 10, 9, ...
## $ black (int) 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0...
## $ married (int) 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0...
## $ nodegree (int) 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0...
## $ re74 (dbl) 8644.1562, 19785.3203, 9714.5967, 37211.7578, 14759....
## $ re75 (dbl) 8644.1562, 6608.1372, 7285.9478, 36941.2656, 14701.9...
## $ re78 (dbl) 11656.50586, 499.25720, 16717.12109, 30247.50000, 43...
## $ hispanic (int) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ u74 (dbl) 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0...
## $ u75 (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ q1 (fctr) strongly agree, neutral, no opinion, no opinion, st...
This is a modified version of the Lalonde experimental data set, used here for illustration only (LaLonde, 1986). The program provided training (treated = 1) to the participants for 12 to 18 months and helped them find a job. The goal of the program was to increase participants' earnings, so 1978 earnings (re78) is the key outcome variable. All the other variables can be considered covariates. Some of these are dichotomous (married, nodegree, black, hispanic, u74, and u75), some are categorical (age and education), and the earnings variables are continuous and highly skewed, with point masses at zero.
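A quick way to see the skewness and the point masses at zero mentioned above (an illustrative check, not part of the original workflow):
summary(Le$re74)    # highly skewed distribution of 1974 earnings
mean(Le$re74 == 0)  # proportion of individuals with zero 1974 earnings
mean(Le$re75 == 0)  # proportion of individuals with zero 1975 earnings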
tr <- which(Le$treated == 1)
ct <- which(Le$treated == 0)
ntr <- length(tr) # count the number in the treated group
nct <- length(ct) # count the number in the control group
ntr
## [1] 258
nct
## [1] 392
We see that there are more control (392) than treated individuals (258). We can naively estimate the SATT by looking at the difference in the mean of the response variable between groups.
mean(Le$re78[tr]) - mean(Le$re78[ct])
## [1] 759.0479
This naive comparison suggests that treated individuals earn, on average, about $759 more than non-treated individuals.
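The same naive comparison can be wrapped in a two-sample test to attach a confidence interval to it (a sketch using base R's t.test(); not part of the package workflow):
t.test(re78 ~ treated, data = Le)  # naive comparison of mean 1978 earnings by group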
Because the variable treated was not randomly assigned, the pre-treatment covariates differ between the treated and control groups, so this naive difference in means cannot be trusted as an estimate of the SATT: the data are highly imbalanced and relatively few good matches exist within the control group. To see this, we focus on the pre-treatment covariates:
vars <- c("age", "education", "black", "married", "nodegree", "re74", "re75", "hispanic", "u74", "u75", "q1") # create a vector of covariate names
imbalance(group = Le$treated, data = Le[vars])
##
## Multivariate Imbalance Measure: L1=0.902
## Percentage of local common support: LCS=5.8%
##
## Univariate Imbalance Measures:
##
##              statistic   type           L1 min 25%      50%       75%        max
## age       -0.252373042 (diff) 5.102041e-03   0   0   0.0000   -1.0000    -6.0000
## education  0.153634710 (diff) 8.463851e-02   1   0   1.0000    1.0000     1.0000
## black     -0.010322734 (diff) 1.032273e-02   0   0   0.0000    0.0000     0.0000
## married   -0.009551495 (diff) 9.551495e-03   0   0   0.0000    0.0000     0.0000
## nodegree  -0.081217371 (diff) 8.121737e-02   0  -1   0.0000    0.0000     0.0000
## re74     -18.160446880 (diff) 5.551115e-17   0   0 284.0715  806.3452 -2139.0195
## re75     101.501761679 (diff) 5.551115e-17   0   0 485.6310 1238.4114   490.3945
## hispanic  -0.010144756 (diff) 1.014476e-02   0   0   0.0000    0.0000     0.0000
## u74       -0.045582186 (diff) 4.558219e-02   0   0   0.0000    0.0000     0.0000
## u75       -0.065555292 (diff) 6.555529e-02   0   0   0.0000    0.0000     0.0000
## q1         7.494021189 (Chi2) 1.067078e-01  NA  NA       NA        NA         NA
The overall imbalance is given by the multivariate L1 statistic, here L1 = 0.902. This measure is important because even if the marginal distribution of every single variable were perfectly balanced, the joint distribution could still be highly imbalanced.
In the table of univariate measures, the column labeled statistic reports the difference in means for numerical variables (marked (diff) in the type column) or a chi-square difference for categorical variables (marked (Chi2)). The column labeled L1 reports the L1 measure computed for that variable separately (which of course does not include interactions). The remaining columns report the differences between the empirical quantiles of the two groups' distributions at the 0th (min), 25th, 50th, 75th, and 100th (max) percentiles. When the type is (Chi2), the variable-specific L1 is the only measure defined, so the quantile differences are reported as NA.
Broadly speaking, this table shows that re74 and re75 are imbalanced in the raw data in many respects, and that age is balanced in means but not in the quantiles of the two distributions.
In our running example the treatment variable is dichotomous. In the following code, we match on all variables except re78, which is the outcome variable and so should never be included in the matching. We therefore pass re78 to the drop argument:
mat <- cem(treatment = "treated", data = Le, drop = "re78", eval.imbalance = TRUE)
mat
## G0 G1
## All 392 258
## Matched 95 84
## Unmatched 297 174
##
##
## Multivariate Imbalance Measure: L1=0.605
## Percentage of local common support: LCS=25.8%
##
## Univariate Imbalance Measures:
##
## statistic type L1 min 25% 50% 75% max
## age 9.404762e-02 (diff) 0.000000e+00 0 0 1 0.0000 0.0000
## education -2.222222e-02 (diff) 2.222222e-02 0 0 0 0.0000 0.0000
## black 1.110223e-16 (diff) 0.000000e+00 0 0 0 0.0000 0.0000
## married 0.000000e+00 (diff) 0.000000e+00 0 0 0 0.0000 0.0000
## nodegree 0.000000e+00 (diff) 0.000000e+00 0 0 0 0.0000 0.0000
## re74 1.496418e+02 (diff) 0.000000e+00 0 0 0 463.3308 889.5410
## re75 1.587521e+02 (diff) 0.000000e+00 0 0 0 843.6863 -640.9307
## hispanic 0.000000e+00 (diff) 0.000000e+00 0 0 0 0.0000 0.0000
## u74 0.000000e+00 (diff) 0.000000e+00 0 0 0 0.0000 0.0000
## u75 0.000000e+00 (diff) 0.000000e+00 0 0 0 0.0000 0.0000
## q1 2.083349e+00 (Chi2) 1.387779e-17 NA NA NA NA NA
These results show the number of observations matched, and thus retained, as well as those pruned because no comparable units were available. Because we set the eval.imbalance argument to TRUE, we can also see that the imbalance of the matched data has fallen, with L1 reduced from 0.902 to 0.605.
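By default cem() chooses the coarsening automatically, but the analyst can also control it directly through the cutpoints argument (and grouping for categorical variables). Below is a sketch with illustrative bin choices; the cut points are examples, not recommendations:
re74cut <- seq(0, 40000, by = 5000)  # $5,000-wide bins for 1974 earnings (illustrative)
agecut  <- c(20, 30, 40, 50)         # coarsen age into decades (illustrative)
mat2 <- cem(treatment = "treated", data = Le, drop = "re78",
            cutpoints = list(re74 = re74cut, age = agecut),
            eval.imbalance = TRUE)
mat2  # compare the resulting L1 and matched counts with the automatic coarsening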
To estimate the SATT, we re-run the matching on the original experimental data set LL (also supplied with the package and free of missing values) and pass the cem() output to att():
data(LL)
mat <- cem(treatment = "treated", data = LL, drop = "re78")
# mat$matched  # indicates, for each observation, whether it was matched; useful for subsetting
est <- att(mat, re78 ~ treated, data = LL)
est
##
## G0 G1
## All 425 297
## Matched 222 163
## Unmatched 203 134
##
## Linear regression model on CEM matched data:
##
## SATT point estimate: 550.962564 (p.value=0.368242)
## 95% conf. interval: [-647.777701, 1749.702830]
Compare this estimate of about $551 to our earlier naive estimate of $759. Note that the 95% confidence interval includes zero, so the effect is not statistically significant at the 5% level (p = 0.37). This information will be useful in determining the cost-benefit ratio for this intervention and in comparing it with other available interventions.
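As flagged in the commented-out line above, mat$matched records which observations were matched, which is handy if you want to carry out further analysis on the matched subset yourself (a sketch; matched_LL is our own name):
matched_LL <- LL[mat$matched == 1, ]  # keep only the matched observations
table(matched_LL$treated)             # matched control and treated counts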
The package also has the flexibility to match the data and then use an appropriate statistical model for parameter estimation. When used properly with informative data, CEM can reduce model dependence and bias and improve efficiency across a wide range of potential applications. This mitigates the bias inherent in drawing causal inferences from observational data, helping charities improve their estimates of the effect size of their interventions on the target group.
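For instance, any residual imbalance can be adjusted for by adding pre-treatment covariates to the model formula passed to att(); the covariates chosen below are illustrative only:
est_adj <- att(mat, re78 ~ treated + re74 + re75, data = LL)  # SATT adjusted for pre-treatment earnings
est_adj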
sessionInfo()
## R version 3.2.3 (2015-12-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 10586)
##
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252
## [2] LC_CTYPE=English_United Kingdom.1252
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United Kingdom.1252
##
## attached base packages:
## [1] tcltk stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] dplyr_0.4.3 cem_1.1.17 lattice_0.20-33
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.3 randomForest_4.6-12 assertthat_0.1
## [4] digest_0.6.9 MASS_7.3-45 R6_2.1.1
## [7] grid_3.2.3 DBI_0.3.1 nlme_3.1-122
## [10] MatchIt_2.4-21 formatR_1.2.1 magrittr_1.5
## [13] evaluate_0.8 stringi_1.0-1 combinat_0.0-8
## [16] rmarkdown_0.9.2 tools_3.2.3 stringr_1.0.0
## [19] parallel_3.2.3 yaml_2.1.13 htmltools_0.3
## [22] knitr_1.12