Propensity Score Matching in R

Here is a demonstration of how to create and analyze matched data for propensity score analysis using the MatchIt package. First, we create an artificial data set with three covariates (school size, percentage of minority students, and percentage of students receiving free or reduced lunch), a dependent variable, and a “treatment” indicator showing whether a student attends a public (1) or private (0) school.

Running MatchIt is straightforward. First, we specify the model, which is a logistic regression of the treatment (i.e. public school) on the covariates size, minority, and freeLunch. Next, we select the data frame to use, which is the propScore data frame that contains the treatment indicator and the covariates. We then set method to "nearest" to request nearest neighbor matching and ratio to one, so that each student in the treatment (public) group is matched with one student in the control (private) group.

library(MatchIt)
set.seed(12345)
studentID = 1:10000
dep = abs(rnorm(10000))
public = c(rep(1,5000), rep(0,5000))
size = abs(rnorm(10000)*10000)
minority = abs(rnorm(10000)/10)
freeLunch = abs(rnorm(10000)/10)
# Keep the ID and the outcome in the data frame so they travel with the matched data
propScore = data.frame(studentID, dep, public, size, minority, freeLunch)

m.out = matchit(public ~ size + minority + freeLunch, data = propScore, method = "nearest", ratio = 1)
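
To make the two steps concrete, here is a minimal base R sketch of what nearest neighbor matching on a propensity score involves: estimate each unit's probability of treatment with a logistic regression, then pair each treated unit with the closest unused control. The toy data frame and the names toy, ps_model, and pairs are assumptions for illustration. The loop processes treated units in row order, which is a simplification and not necessarily the order MatchIt uses.

```r
# Small simulated data set in the same style as the example above
set.seed(42)
n <- 200
toy <- data.frame(
  public    = rep(c(1, 0), each = n / 2),
  size      = abs(rnorm(n) * 10000),
  minority  = abs(rnorm(n) / 10),
  freeLunch = abs(rnorm(n) / 10)
)

# Step 1: propensity score = fitted probability of treatment from a logistic regression
ps_model <- glm(public ~ size + minority + freeLunch, data = toy, family = binomial)
toy$distance <- fitted(ps_model)

# Step 2: greedy 1:1 matching without replacement
treated_rows <- which(toy$public == 1)
available    <- which(toy$public == 0)   # controls not yet used
match_of     <- integer(0)

for (i in treated_rows) {
  gaps <- abs(toy$distance[available] - toy$distance[i])
  pick <- which.min(gaps)                # closest remaining control
  match_of  <- c(match_of, available[pick])
  available <- available[-pick]          # each control is used at most once
}

pairs <- data.frame(treated = treated_rows, control = match_of)
head(pairs)
```

Because the toy data have as many controls as treated units, every control ends up in a pair, which mirrors what happens in the full example.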

Next, we want to check how well the matching procedure worked. The summary function reports balance for both the full and the matched data set. The first table, the summary of balance for all data, describes the data before matching. For the propensity score (distance) and each covariate, it gives the mean for the treated group (public schools, coded 1 on the public variable), the mean for the control group (private schools, coded 0), the standard deviation for the control group, and the difference in means. The eQQ columns give the median, mean, and maximum differences between the empirical quantiles of the treatment and control groups. Smaller eQQ values indicate better balance.
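
As a rough illustration of what the mean difference and eQQ columns summarize, the sketch below compares one covariate across two simulated groups. The vectors, the variable names, and the choice of 99 percentile points are assumptions for illustration; MatchIt's own calculation may differ in detail.

```r
# Two simulated groups for a single covariate
set.seed(1)
treated <- abs(rnorm(500) / 10)
control <- abs(rnorm(500) / 10)

# Difference in means between the groups
mean_diff <- mean(treated) - mean(control)

# Absolute gaps between the two empirical quantile functions
probs  <- seq(0.01, 0.99, by = 0.01)
qq_gap <- abs(unname(quantile(treated, probs)) - unname(quantile(control, probs)))

# Median, mean, and maximum quantile gap (smaller means better balance)
eqq <- c(med = median(qq_gap), mean = mean(qq_gap), max = max(qq_gap))
print(mean_diff)
print(eqq)
```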

The summary of balance for matched data is read in the same way as the summary of balance for all data, except that it uses only the matched observations. Comparing the two tables shows whether matching reduced the observed differences between the public and private groups. Finally, the percent balance improvement table reports how much each balance measure improved in the matched data relative to all the data. In this example there is no improvement: with 5,000 public and 5,000 private students and one-to-one matching, every student is matched, so the matched data are identical to the full data. The groups were also well balanced to begin with, because the covariates were randomly generated independently of school type.
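
The percent balance improvement can be sketched as the relative reduction in the absolute value of a balance measure. The helper name pct_improve is hypothetical, and the formula 100 * (|before| - |after|) / |before| is the definition as I understand it from the MatchIt documentation, so treat it as an assumption.

```r
# Hypothetical helper: percent reduction in the absolute balance statistic
pct_improve <- function(before, after) {
  100 * (abs(before) - abs(after)) / abs(before)
}

# Mean difference for size is 54.6502 both before and after matching
pct_improve(54.6502, 54.6502)

# A made-up case where matching halves the imbalance
pct_improve(10, 5)
```

The first call uses the values from the summary output in this example, where the statistic is unchanged by matching.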

There are also two plots that researchers can review to evaluate the effectiveness of the matching procedure. The first is the jitter plot, which shows the distribution of propensity scores for matched and unmatched units in both the treatment and control groups, so you can see how closely the matched groups overlap. The second is a histogram of the matched and raw (i.e. total) data sets, which shows how much the matching procedure improved the similarity of the two groups.

summary(m.out)

## 
## Call:
## matchit(formula = public ~ size + minority + freeLunch, data = propScore, 
##     method = "nearest", ratio = 1)
## 
## Summary of balance for all data:
##           Means Treated Means Control SD Control Mean Diff eQQ Med
## distance         0.5003        0.4997     0.0125    0.0006  0.0006
## size          7880.9095     7826.2593  5963.1835   54.6502 54.6155
## minority         0.0806        0.0790     0.0599    0.0016  0.0012
## freeLunch        0.0775        0.0800     0.0601   -0.0025  0.0022
##           eQQ Mean   eQQ Max
## distance    0.0006    0.0111
## size       82.1556 1533.5282
## minority    0.0016    0.0387
## freeLunch   0.0025    0.0565
## 
## 
## Summary of balance for matched data:
##           Means Treated Means Control SD Control Mean Diff eQQ Med
## distance         0.5003        0.4997     0.0125    0.0006  0.0006
## size          7880.9095     7826.2593  5963.1835   54.6502 54.6155
## minority         0.0806        0.0790     0.0599    0.0016  0.0012
## freeLunch        0.0775        0.0800     0.0601   -0.0025  0.0022
##           eQQ Mean   eQQ Max
## distance    0.0006    0.0111
## size       82.1556 1533.5282
## minority    0.0016    0.0387
## freeLunch   0.0025    0.0565
## 
## Percent Balance Improvement:
##           Mean Diff. eQQ Med eQQ Mean eQQ Max
## distance           0       0        0       0
## size               0       0        0       0
## minority           0       0        0       0
## freeLunch          0       0        0       0
## 
## Sample sizes:
##           Control Treated
## All          5000    5000
## Matched      5000    5000
## Unmatched       0       0
## Discarded       0       0

m.outCSV = match.data(m.out) 
write.csv(m.outCSV, "m.outCSV.csv")
plot(m.out, type = "jitter")

## [1] "To identify the units, use first mouse button; to stop, use second."

## integer(0)

plot(m.out, type = "hist")

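Finally, we can estimate the effect of attending a public rather than a private school using the matched data by testing the significance of the public variable. We use the zelig function from the Zelig package to fit a least squares model that also includes the covariates. In the output, the coefficient on public is small (-0.018) and not statistically significant (p = 0.136), which is expected because dep was generated independently of school type.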

library(Zelig)
z.out = zelig(dep ~ public + minority + freeLunch + size, model = "ls", data = m.outCSV)

## How to cite this model in Zelig:
##   R Core Team. 2007.
##   ls: Least Squares Regression for Continuous Dependent Variables
##   in Christine Choirat, Christopher Gandrud, James Honaker, Kosuke Imai, Gary King, and Olivia Lau,
##   "Zelig: Everyone's Statistical Software," http://zeligproject.org/

summary(z.out)

## Model: 
## 
## Call:
## z5$zelig(formula = dep ~ public + minority + freeLunch + size, 
##     data = m.outCSV)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8233 -0.4810 -0.1328  0.3601  3.0874 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  8.011e-01  1.622e-02  49.384   <2e-16
## public      -1.808e-02  1.213e-02  -1.490    0.136
## minority     9.758e-02  1.001e-01   0.975    0.330
## freeLunch   -1.410e-01  1.018e-01  -1.385    0.166
## size         7.746e-07  1.015e-06   0.763    0.446
## 
## Residual standard error: 0.6065 on 9995 degrees of freedom
## Multiple R-squared:  0.0005522,  Adjusted R-squared:  0.0001522 
## F-statistic: 1.381 on 4 and 9995 DF,  p-value: 0.2378
## 
## Next step: Use 'setx' method
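
Because model = "ls" requests least squares regression, a similar outcome model can be sketched with base R's lm function. The matched_toy data frame below is a hypothetical stand-in for the data returned by match.data, with columns named as in the example above; this is a minimal sketch and not the Zelig workflow itself.

```r
# Hypothetical stand-in for the matched data set
set.seed(2024)
n <- 1000
matched_toy <- data.frame(
  dep       = abs(rnorm(n)),
  public    = rep(c(1, 0), each = n / 2),
  size      = abs(rnorm(n) * 10000),
  minority  = abs(rnorm(n) / 10),
  freeLunch = abs(rnorm(n) / 10)
)

# Outcome model: treatment indicator plus the covariates used for matching
fit <- lm(dep ~ public + minority + freeLunch + size, data = matched_toy)

# Estimate, standard error, t value, and p value for the treatment effect
public_row <- coef(summary(fit))["public", ]
print(public_row)
```

With the real matched data, replacing matched_toy with m.outCSV would fit the same formula as the zelig call.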