This demonstration is mostly a summary from the following literature.

Reference
1) Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41-55. 2) Morgan, S. L., & Winship, C. (2015). Counterfactuals and causal inference. Cambridge University Press. 3) Iacus, Stefano M., Gary King, and Giuseppe Porro. “CEM: software for coarsened exact matching.” (2009): 1-27. 4) Ho, D. E., Imai, K., King, G., & Stuart, E. A. (2011). MatchIt: nonparametric preprocessing for parametric causal inference. Journal of Statistical Software, http://gking. harvard. edu/matchit.

Lalonde data set (Lalonde, 1986) “This program provided training to selected individuals for 12-18 months and help finding a job in the hopes of increasing their’ earnings. The treatment variable, treated, is 1 for participants (the treatment group) and 0 for nonparticipants (the control group). The key outcome variable is earnings in 1978 (re78).” (Iacus et al. 2009)

The goal of matching: create a dataset that looks closer to a dataset that we could have in a randomized experiment. This means that we want “the distribution of covariates” to be the same between our treatment and control groups. Put it differently, we want that our treatment and control groups to be as similar as possible in any other aspects besides the treatment.

How do we know we might have self-selection problem? (whether two groups are similar in all other aspects) 1) by theory 2) using imbalance test

variables for other aspects (=pre-treated variables) 1) age (age) 2) years of education (educ) 3) marital status (married) 4) lack of a high school diploma (nodegree) 5) race (black, hispan) 6) indicator variables for unemployment in 1974 (u74) and 1975 (u75) 7) real earnings in 1974 (re74) and 1975 (re75).

#default is logistic regression model, but you can also change it to a probit model
#nearest neighbor
matched1<-matchit(treat~re74+re75+educ+black+hispan+nodegree+married, data=Le, method="nearest", distance = "logit")

#optimal matching=greedy matching (choose 1 at a time, do not try to minimize a global distance measure. Pick the control matched with the smallest average absolute distance.)
install.packages("optmatch")
## package 'optmatch' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\wenra\AppData\Local\Temp\RtmpYrs0Lk\downloaded_packages
library("optmatch")
## Warning: package 'optmatch' was built under R version 3.6.2
## Loading required package: survival
## The optmatch package has an academic license. Enter relaxinfo() for more information.
matched2<-matchit(treat~re74+re75+educ+black+hispan+nodegree+married, data=Le, method="optimal", ratio = 2)
## Warning in optmatch::fullmatch(d, min.controls = ratio, max.controls = ratio, : Without 'data' argument the order of the match is not guaranteed
##     to be the same as your original data.
#ratio: how many controls to 1 treatment

#Full matching: 1 treatment with >=1 controls
matched3<-matchit(treat~re74+re75+educ+black+hispan+nodegree+married, data=Le, method="full", distance = "logit")
## Warning in optmatch::fullmatch(d, ...): Without 'data' argument the order of the match is not guaranteed
##     to be the same as your original data.
#CEM
matched4 <- matchit(treat ~ age + educ + black + hispan + married + nodegree
+ re74 + re75, data = Le, method = "cem")
## 
## Using 'treat'='1' as baseline group

What do you need to do after matching? You need to check if the matching process generate a set of matched samples that looks really similar between the treated and the control groups. 1) summary statistics 2) plots: Q-Q plot: if dots are not lie on the 45 degree line, it’s not a good match jitter plots: it shows the overall distribution of propensity scores (you want more dots overlapped) histograms: the shapes of histograms should look as similar as possible

summary(matched1)
## 
## Call:
## matchit(formula = treat ~ re74 + re75 + educ + black + hispan + 
##     nodegree + married, data = Le, method = "nearest", distance = "logit")
## 
## Summary of balance for all data:
##          Means Treated Means Control SD Control  Mean Diff   eQQ Med
## distance        0.5762        0.1828     0.2301     0.3935    0.5113
## re74         2095.5737     5619.2365  6788.7508 -3523.6628 2425.5720
## re75         1532.0553     2466.4844  3291.9962  -934.4291  981.0968
## educ           10.3459       10.2354     2.8552     0.1105    1.0000
## black           0.8432        0.2028     0.4026     0.6404    1.0000
## hispan          0.0595        0.1422     0.3497    -0.0827    0.0000
## nodegree        0.7081        0.5967     0.4911     0.1114    0.0000
## married         0.1892        0.5128     0.5004    -0.3236    0.0000
##           eQQ Mean   eQQ Max
## distance    0.3935    0.5979
## re74     3620.9240 9216.5000
## re75     1060.6582 6795.0100
## educ        0.7027    4.0000
## black       0.6432    1.0000
## hispan      0.0811    1.0000
## nodegree    0.1135    1.0000
## married     0.3243    1.0000
## 
## 
## Summary of balance for matched data:
##          Means Treated Means Control SD Control Mean Diff  eQQ Med
## distance        0.5762        0.3637     0.2542    0.2125   0.1705
## re74         2095.5737     2237.6204  4262.4912 -142.0467  80.3299
## re75         1532.0553     1569.3000  2642.6441  -37.2446 131.1770
## educ           10.3459       10.4595     2.5728   -0.1135   0.0000
## black           0.8432        0.4703     0.5005    0.3730   0.0000
## hispan          0.0595        0.2108     0.4090   -0.1514   0.0000
## nodegree        0.7081        0.6649     0.4733    0.0432   0.0000
## married         0.1892        0.1892     0.3927    0.0000   0.0000
##          eQQ Mean    eQQ Max
## distance   0.2125     0.4665
## re74     456.4220 13121.7500
## re75     318.8618 11365.7100
## educ       0.3946     3.0000
## black      0.3730     1.0000
## hispan     0.1514     1.0000
## nodegree   0.0432     1.0000
## married    0.0000     0.0000
## 
## Percent Balance Improvement:
##          Mean Diff.  eQQ Med eQQ Mean  eQQ Max
## distance    45.9871  66.6456  45.9953  21.9772
## re74        95.9688  96.6882  87.3949 -42.3724
## re75        96.0142  86.6296  69.9374 -67.2655
## educ        -2.7135 100.0000  43.8462  25.0000
## black       41.7636 100.0000  42.0168   0.0000
## hispan     -82.9424   0.0000 -86.6667   0.0000
## nodegree    61.1721   0.0000  61.9048   0.0000
## married    100.0000   0.0000 100.0000 100.0000
## 
## Sample sizes:
##           Control Treated
## All           429     185
## Matched       185     185
## Unmatched     244       0
## Discarded       0       0
#Q-Q plot
plot(matched1)

plot(matched4)

#jitter plots
plot(matched1, type = "jitter")

## [1] "To identify the units, use first mouse button; to stop, use second."
## integer(0)
plot(matched4, type = "jitter")

## [1] "To identify the units, use first mouse button; to stop, use second."
## integer(0)
#histograms
plot(matched1, type = "hist")

plot(matched4, type = "hist")

What to do after matching? 1) assigned the matched ones into a dataframe, and then conduct analysis 2) or export the matched dataset to other softwares for further analysis

#1
m.data1 <- match.data(matched1)
m.data2 <- match.data(matched4)
ps_ols <- lm(re78 ~ treat + age + educ, data = m.data1)
CEM_OLS <- lm(re78 ~ treat + age + educ, data = m.data2)
original <- lm(re78 ~ treat + age + educ, data = Le)
summary(ps_ols)
## 
## Call:
## lm(formula = re78 ~ treat + age + educ, data = m.data1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -9122  -5107  -1693   2953  53615 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -22.456   2004.350  -0.011  0.99107   
## treat       1002.241    715.952   1.400  0.16240   
## age            8.226     40.370   0.204  0.83864   
## educ         498.455    155.140   3.213  0.00143 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6863 on 366 degrees of freedom
## Multiple R-squared:  0.03206,    Adjusted R-squared:  0.02413 
## F-statistic: 4.041 on 3 and 366 DF,  p-value: 0.007579
summary(CEM_OLS)
## 
## Call:
## lm(formula = re78 ~ treat + age + educ, data = m.data2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -7495  -4548  -1732   2835  27046 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  640.55791 3071.33951   0.209   0.8351  
## treat        749.72927 1029.69839   0.728   0.4677  
## age           -0.08074   94.08939  -0.001   0.9993  
## educ         435.81350  254.62371   1.712   0.0892 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6056 on 142 degrees of freedom
## Multiple R-squared:  0.02695,    Adjusted R-squared:  0.006391 
## F-statistic: 1.311 on 3 and 142 DF,  p-value: 0.2733
library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
stargazer(ps_ols, CEM_OLS, original, type="text", title="Regression Results",
omit.stat=c("LL","ser","f"), single.row=TRUE)
## 
## Regression Results
## ==========================================================================
##                                   Dependent variable:                     
##              -------------------------------------------------------------
##                                          re78                             
##                      (1)                  (2)                 (3)         
## --------------------------------------------------------------------------
## treat        1,002.241 (715.952)  749.729 (1,029.698)  -480.729 (647.773) 
## age             8.226 (40.370)      -0.081 (94.089)    94.926*** (30.338) 
## educ         498.455*** (155.140) 435.813* (254.624)  505.607*** (113.471)
## Constant     -22.456 (2,004.350)  640.558 (3,071.340) -851.742 (1,562.921)
## --------------------------------------------------------------------------
## Observations         370                  146                 614         
## R2                  0.032                0.027               0.043        
## Adjusted R2         0.024                0.006               0.039        
## ==========================================================================
## Note:                                          *p<0.1; **p<0.05; ***p<0.01
#let's see the confidence interval
stargazer(ps_ols, CEM_OLS, original, type="text", title="Regression Results",
omit.stat=c("LL","ser","f"), ci=TRUE, ci.level=0.90, single.row=TRUE)
## 
## Regression Results
## =============================================================================================================
##                                                    Dependent variable:                                       
##              ------------------------------------------------------------------------------------------------
##                                                            re78                                              
##                            (1)                             (2)                             (3)               
## -------------------------------------------------------------------------------------------------------------
## treat        1,002.241 (-175.395, 2,179.876)  749.729 (-943.974, 2,443.432)   -480.729 (-1,546.220, 584.762) 
## age              8.226 (-58.176, 74.629)       -0.081 (-154.844, 154.683)      94.926*** (45.024, 144.828)   
## educ          498.455*** (243.272, 753.638)    435.813* (16.995, 854.632)     505.607*** (318.963, 692.251)  
## Constant     -22.456 (-3,319.319, 3,274.407) 640.558 (-4,411.346, 5,692.462) -851.742 (-3,422.518, 1,719.035)
## -------------------------------------------------------------------------------------------------------------
## Observations               370                             146                             614               
## R2                        0.032                           0.027                           0.043              
## Adjusted R2               0.024                           0.006                           0.039              
## =============================================================================================================
## Note:                                                                             *p<0.1; **p<0.05; ***p<0.01
#2
write.csv(m.data1, file = "matchLe.csv")

Well~Matching seems so easy, right? Not really, you have to think through if your dataset is suitable for matching. It is useful for individual level data, but if you want to conduct propensity matching at the national level, you might need to think twice.