1. Matching: Motivation and Formulation

This section summarizes the motivation for the classical matching problem. Suppose we have a dataset that contains a response variable \(Y\), a treatment variable \(T\), and other covariates represented by \(X = (x_1, x_2, ..., x_t)\). Here, the treatment is a binary variable that indicates whether an observation belongs to the treatment group (\(T = 1\)) or the control group (\(T = 0\)). Our end goal is to investigate the effect of the treatment \(T\) on the response variable \(Y\). However, this is not possible unless we control for \(X = (x_1, ..., x_t)\). In an ideal world where we can carry out an experiment, we would compare two subjects with different \(T\) but exactly the same \(X\). In practice, most datasets are observational, so this kind of control over the covariates is rarely available. Therefore, we need to “match” each treated subject \(x\) with a corresponding subject \(x'\) in the control group that “resembles” \(x\) in terms of \(X = (x_1, ..., x_t)\). In a sense, matching mimics a randomized block experiment. With this intuition, we formulate the matching problem as follows:

Given a dataset \(D\) of \(n\) observations, denote each observation by \((T_i, X_i, Y_i)\) (\(i = 1, 2, ..., n\)), where \(T_i\), \(X_i\) and \(Y_i\) respectively denote the treatment variable, the covariates and the response variable for the \(i\)-th subject. We want to find a collection of disjoint subsets \(\{S_j\}_{j = 1}^m\) of \(D = \{(T_i, X_i, Y_i)\}_{i = 1}^n\) such that \(\{(T_i, X_i, Y_i) \ | \ T_i = 1\} \subset \cup_{j = 1}^m S_j\), and such that the \(X_i\) components of all \((T_i, X_i, Y_i) \in S_j\) are “similar” for any fixed \(j\).
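As a minimal sketch of this formulation, the simplest notion of “similar” is “identical”: each distinct covariate combination defines one subset \(S_j\), and only subsets containing both treated and control observations are kept. The toy data frame and the names `d`, `stratum`, `matched` and `subsets` below are illustrative assumptions, not part of the lalonde data or of MatchIt.

```r
# Toy data: hypothetical values, not taken from lalonde.
d <- data.frame(
  treat = c(1, 1, 0, 0, 0, 1, 0),
  age   = c(25, 30, 25, 30, 40, 50, 25),
  educ  = c(12, 10, 12, 10, 12, 12, 12),
  y     = c(5, 7, 4, 6, 3, 9, 4.5)
)

# One stratum per distinct covariate combination (age, educ).
d$stratum <- interaction(d$age, d$educ, drop = TRUE)

# Count treated and control observations in each stratum.
n_treated <- tapply(d$treat == 1, d$stratum, sum)
n_control <- tapply(d$treat == 0, d$stratum, sum)

# Keep only strata that contain both a treated and a control observation.
keep <- names(n_treated)[n_treated > 0 & n_control > 0]
matched <- d[d$stratum %in% keep, ]

# The subsets S_j: a list of data frames, one per retained stratum.
subsets <- split(matched, droplevels(matched$stratum))
```

In this sketch the treated observation with `age = 50` has no exact counterpart and is dropped, so the covering condition on the treated observations can fail when “similar” is taken to mean “identical”.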

2. Gary King’s Approach

2.1 Matching Methods

King proposed and implemented many matching methods in his MatchIt package, including exact matching, subclassification, nearest neighbor matching, optimal matching, full matching, genetic matching, and coarsened exact matching. Broadly speaking, these matching methods fall into two categories: the first looks for a best match in the control group for each treated subject; the second loosely groups “similar” observations (both treated and control) into the same “subclass”. Most often, both approaches keep as many treated observations as possible and drop unmatched control observations by assigning them zero weight.
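The first category can be illustrated with a greedy one-to-one nearest neighbor match on a single covariate. This is only a sketch under simplifying assumptions (one covariate, absolute distance, matching without replacement, treated subjects processed in the given order); it is not MatchIt's implementation, and all names and values are hypothetical.

```r
# Hypothetical one-dimensional covariate values.
x_treated <- c(t1 = 23, t2 = 35, t3 = 41)
x_control <- c(c1 = 20, c2 = 24, c3 = 33, c4 = 40, c5 = 60)

available <- rep(TRUE, length(x_control))   # controls not yet used
match_for <- character(length(x_treated))   # chosen control per treated subject
names(match_for) <- names(x_treated)

for (i in seq_along(x_treated)) {
  gap <- abs(x_control - x_treated[i])
  gap[!available] <- Inf                    # used controls cannot be reused
  j <- which.min(gap)                       # closest remaining control
  match_for[i] <- names(x_control)[j]
  available[j] <- FALSE
}

# Controls that were never chosen get weight 0; matched controls get weight 1.
control_weights <- as.numeric(!available)
names(control_weights) <- names(x_control)
```

Because the procedure is greedy, the result can depend on the order in which treated subjects are processed.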

2.2 How to Define “Similarity”?

King defined a notion called “balance”, which summarizes several pieces of information at once. To assess the quality of a matching, one needs to know how close the covariate distributions are in the treated and control groups. As King himself puts it, “MatchIt provides a number of ways to assess the balance of covariates after matching, including numerical summaries such as the ‘mean Diff’ (difference in means) or the difference in means divided by the treated group standard deviation, and summaries based on quantile-quantile plots that compare the empirical distributions of each covariate”. He also noted that balance diagnostics should be performed on all variables in \(X\), even if some covariates are excluded from the matching procedure.
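The two numerical summaries named in the quote can be computed by hand. The sketch below uses hypothetical values for a single covariate and also compares a few empirical quantiles as a rough numeric counterpart of a quantile-quantile plot; the chosen probabilities are an arbitrary assumption.

```r
# Hypothetical covariate values for a treated and a control group.
age_treated <- c(22, 25, 27, 30, 31)
age_control <- c(24, 28, 35, 41, 47)

# Difference in means between the two groups.
mean_diff <- mean(age_treated) - mean(age_control)

# Difference in means divided by the treated group standard deviation.
std_mean_diff <- mean_diff / sd(age_treated)

# Differences between matching empirical quantiles of the two groups.
probs <- c(0.25, 0.5, 0.75)
quantile_diff <- quantile(age_treated, probs) - quantile(age_control, probs)
```

Values of `std_mean_diff` and `quantile_diff` close to zero would indicate that the covariate is well balanced between the two groups.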

2.3 Highlights

All of these matching methods are nonparametric in the sense that they do not rely on model assumptions, so the matching step itself introduces no further model-dependent bias. However, we should note that matching is simply a form of data preprocessing. The bulk of the analysis still has to be done after matching!

3. Playing Around with MatchIt

The lalonde dataset will be our main example. First, load the data:

library(MatchIt)

## Warning: package 'MatchIt' was built under R version 3.2.3

## Loading required package: MASS

data(lalonde)

The main function in King’s package is \(matchit\). Suppose we want to match the dataset on the covariates \(age\) and \(educ\). Then we call the function as follows. Note that \(treat\) is our treatment variable here, and that the response variable (\(re78\) in lalonde) plays no role in the matching process.

m.out <- matchit(treat ~ age + educ, method = "exact", data = lalonde)

After matching, \(summary\) can be used to check the balance diagnostics.

summary(m.out)

## 
## Call:
## matchit(formula = treat ~ age + educ, data = lalonde, method = "exact")
## 
## Sample sizes:
##           Control Treated
## All           429     185
## Matched       200     132
## Discarded     229      53
## 
## Matched sample sizes by subclass:
##    Treated Control Total
## 1        1       1     2
## 2        3       1     4
## 3        1       2     3
## 4        1       1     2
## 5        6       8    14
## 6        1       1     2
## 7        2       2     4
## 8        3       4     7
## 9        1       3     4
## 10       2       2     4
## 11       3       2     5
## 12       1       2     3
## 13       4       3     7
## 14       4       1     5
## 15       2       1     3
## 16       1       4     5
## 17       2       5     7
## 18       1       2     3
## 19       1       1     2
## 20       3       7    10
## 21       1       1     2
## 22       5      12    17
## 23       3       9    12
## 24       6       7    13
## 25       2       1     3
## 26       4       2     6
## 27       5       5    10
## 28       3       4     7
## 29       3       5     8
## 30       4       3     7
## 31       1       2     3
## 32       4      13    17
## 33       1       1     2
## 34       2       5     7
## 35       1       1     2
## 36       1       1     2
## 37       1       1     2
## 38       1       1     2
## 39       2       1     3
## 40       4       4     8
## 41       2       2     4
## 42       1       2     3
## 43       1       6     7
## 44       1       1     2
## 45       2       3     5
## 46       2       1     3
## 47       1       2     3
## 48       3       5     8
## 49       2       4     6
## 50       2       8    10
## 51       2       5     7
## 52       1       1     2
## 53       1       3     4
## 54       2       2     4
## 55       1       3     4
## 56       1       2     3
## 57       1       7     8
## 58       1       1     2
## 59       1       1     2
## 60       1       1     2
## 61       2       3     5
## 62       1       1     2
## 63       1       2     3
## 64       1       1     2
## 65       1       1     2

Note that the object returned by \(matchit\) is unwieldy, but it can be converted into a plain data frame with \(match.data\). In the code below, the returned value \(m.data\) is simply the matched subset of the original lalonde data frame with two additional columns: \(weights\) and \(subclass\).

m.data <- match.data(m.out)

King then proposed using his Zelig package for the main analysis. Here we fit a least squares model and examine the mean difference in response between treated and control subjects.

library(Zelig)

## Warning: package 'Zelig' was built under R version 3.2.3

## Warning in .recacheSubclasses(def@className, def, doSubclasses, env):
## undefined subclass "externalRefMethod" of class "functionOrNULL";
## definition not updated

z.out <- zelig(re78 ~ treat + age + educ, data = m.data, model = "ls")

## How to cite this model in Zelig:
##   R Core Team. 2007.
##   ls: Least Squares Regression for Continuous Dependent Variables
##   in Christine Choirat, James Honaker, Kosuke Imai, Gary King, and Olivia Lau,
##   "Zelig: Everyone's Statistical Software," http://zeligproject.org/

x.out <- setx(z.out, treat = 0)
x1.out <- setx(z.out, treat = 1)
s.out <- sim(z.out, x = x.out, x1 = x1.out)
summary(s.out)

## 
##  sim x :
##  -----
## ev
##       mean       sd      50%     2.5%    97.5%
## 1 7257.349 470.9392 7255.506 6357.857 8163.512
## pv
##          mean       sd      50%      2.5%    97.5%
## [1,] 7325.433 6874.525 7651.145 -5822.243 20556.57
## 
##  sim x1 :
##  -----
## ev
##       mean       sd      50%    2.5%    97.5%
## 1 6159.382 595.6759 6130.173 4967.23 7319.159
## pv
##          mean       sd      50%      2.5%    97.5%
## [1,] 6360.206 6651.835 6344.884 -6123.589 20039.82
## fd
##        mean       sd       50%      2.5%    97.5%
## 1 -1097.967 745.0985 -1110.557 -2545.229 373.5773
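For a least squares model, the first difference (fd) above is the expected response at \(treat = 1\) minus the expected response at \(treat = 0\) with the other covariates held fixed, which is the fitted coefficient on \(treat\) up to simulation noise. The sketch below reproduces this idea in base R on simulated data; the data frame `toy`, its coefficients, and the choice to hold covariates at their sample means are assumptions for illustration, and the numbers it produces are unrelated to the output above.

```r
set.seed(1)
n <- 200

# Simulated stand-in for a matched data frame (hypothetical values).
toy <- data.frame(
  treat = rbinom(n, 1, 0.5),
  age   = round(runif(n, 18, 55)),
  educ  = sample(8:16, n, replace = TRUE)
)
toy$re78 <- 5000 + 1000 * toy$treat + 50 * toy$age + 200 * toy$educ +
  rnorm(n, sd = 3000)
toy$weights <- 1   # match.data would supply this column; set to 1 here

# Weighted least squares fit, analogous to the "ls" model above.
fit <- lm(re78 ~ treat + age + educ, data = toy, weights = weights)

# Two covariate profiles that differ only in the treatment indicator.
x0 <- data.frame(treat = 0, age = mean(toy$age), educ = mean(toy$educ))
x1 <- transform(x0, treat = 1)

# Difference in expected response between the two profiles.
first_diff <- predict(fit, x1) - predict(fit, x0)

# 95% confidence interval for the treatment coefficient.
treat_ci <- confint(fit)["treat", ]
```

Here `first_diff` coincides with `coef(fit)["treat"]`, because the model is linear and has no interaction terms involving \(treat\).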

4. Connection with the Sampling Problem and Walkr

Discussion: Matching and Sampling

Yuanchu Dang

December 23, 2015