This demonstration is mostly a summary of the following literature.
References
1) Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41-55.
2) Morgan, S. L., & Winship, C. (2015). Counterfactuals and causal inference: Methods and principles for social research. Cambridge University Press.
3) Iacus, S. M., King, G., & Porro, G. (2009). CEM: Software for coarsened exact matching. Journal of Statistical Software, 30(9), 1-27.
4) Ho, D. E., Imai, K., King, G., & Stuart, E. A. (2011). MatchIt: Nonparametric preprocessing for parametric causal inference. Journal of Statistical Software, 42(8), 1-28. http://gking.harvard.edu/matchit
Lalonde data set (LaLonde, 1986): “This program provided training to selected individuals for 12-18 months and help finding a job in the hopes of increasing their earnings. The treatment variable, treated, is 1 for participants (the treatment group) and 0 for nonparticipants (the control group). The key outcome variable is earnings in 1978 (re78).” (Iacus et al. 2009)
The goal of matching: create a dataset that looks closer to one we could have obtained from a randomized experiment. This means that we want “the distribution of covariates” to be the same between our treatment and control groups. Put differently, we want the treatment and control groups to be as similar as possible in every respect other than the treatment.
How do we know we might have a self-selection problem (i.e., whether the two groups are similar in all other respects)? 1) by theory 2) by an imbalance test (see the sketch after the variable list below)
Variables for the other aspects (= pre-treatment variables): 1) age (age) 2) years of education (educ) 3) marital status (married) 4) lack of a high school diploma (nodegree) 5) race (black, hispan) 6) indicator variables for unemployment in 1974 (u74) and 1975 (u75) 7) real earnings in 1974 (re74) and 1975 (re75).
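A quick way to eyeball imbalance before matching is to compare covariate means across the two groups, for example via standardized mean differences. Below is a minimal sketch in base R; it assumes the Lalonde data are already loaded in a data frame named Le (the same object used in the matchit() calls that follow), and it standardizes by the treated-group standard deviation, which is one common convention.
#rough imbalance check before matching: standardized mean differences
covs <- c("age", "educ", "black", "hispan", "married", "nodegree", "re74", "re75")
std_diff <- sapply(covs, function(v) {
  m1 <- mean(Le[Le$treat == 1, v])   #treated-group mean
  m0 <- mean(Le[Le$treat == 0, v])   #control-group mean
  (m1 - m0) / sd(Le[Le$treat == 1, v])
})
round(std_diff, 2)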
#the default distance is a propensity score from a logistic regression (logit) model, but you can also change it to a probit model
#nearest neighbor
matched1<-matchit(treat~re74+re75+educ+black+hispan+nodegree+married, data=Le, method="nearest", distance = "logit")
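For example, to use a probit rather than a logit propensity score model, only the distance argument changes (matched1_probit is a name introduced here just for illustration):
#same nearest-neighbor matching, but with a probit propensity score model
matched1_probit<-matchit(treat~re74+re75+educ+black+hispan+nodegree+married, data=Le, method="nearest", distance = "probit")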
#optimal matching vs. greedy matching: nearest-neighbor (greedy) matching picks matches one treated unit at a time and does not try to minimize a global distance measure; optimal matching instead chooses the set of matches that minimizes the total (average absolute) distance across all matched pairs.
install.packages("optmatch")
library("optmatch")
## Warning: package 'optmatch' was built under R version 3.6.2
## Loading required package: survival
## The optmatch package has an academic license. Enter relaxinfo() for more information.
matched2<-matchit(treat~re74+re75+educ+black+hispan+nodegree+married, data=Le, method="optimal", ratio = 2)
## Warning in optmatch::fullmatch(d, min.controls = ratio, max.controls = ratio, : Without 'data' argument the order of the match is not guaranteed
## to be the same as your original data.
#ratio: how many controls are matched to each treated unit
#Full matching: each treated unit is matched to one or more controls (or each control to one or more treated units)
matched3<-matchit(treat~re74+re75+educ+black+hispan+nodegree+married, data=Le, method="full", distance = "logit")
## Warning in optmatch::fullmatch(d, ...): Without 'data' argument the order of the match is not guaranteed
## to be the same as your original data.
#CEM (coarsened exact matching)
matched4 <- matchit(treat ~ age + educ + black + hispan + married + nodegree
+ re74 + re75, data = Le, method = "cem")
##
## Using 'treat'='1' as baseline group
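By default, matchit() lets the underlying cem routine choose the coarsening automatically. You can also supply your own cutpoints, which are passed through to cem; the bins for educ below are arbitrary and only for illustration:
#hypothetical example: coarsen years of education into self-chosen bins
matched4b <- matchit(treat ~ age + educ + black + hispan + married + nodegree
                     + re74 + re75, data = Le, method = "cem",
                     cutpoints = list(educ = c(0, 6.5, 8.5, 12.5, 17)))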
What do you need to do after matching? You need to check whether the matching process generated a set of matched samples in which the treated and control groups really look similar. 1) summary statistics 2) plots: Q-Q plots: if the dots do not lie on the 45-degree line, it’s not a good match; jitter plots: these show the overall distribution of propensity scores (you want more of the dots to overlap); histograms: the shapes of the treated and control histograms should look as similar as possible
summary(matched1)
##
## Call:
## matchit(formula = treat ~ re74 + re75 + educ + black + hispan +
## nodegree + married, data = Le, method = "nearest", distance = "logit")
##
## Summary of balance for all data:
## Means Treated Means Control SD Control Mean Diff eQQ Med
## distance 0.5762 0.1828 0.2301 0.3935 0.5113
## re74 2095.5737 5619.2365 6788.7508 -3523.6628 2425.5720
## re75 1532.0553 2466.4844 3291.9962 -934.4291 981.0968
## educ 10.3459 10.2354 2.8552 0.1105 1.0000
## black 0.8432 0.2028 0.4026 0.6404 1.0000
## hispan 0.0595 0.1422 0.3497 -0.0827 0.0000
## nodegree 0.7081 0.5967 0.4911 0.1114 0.0000
## married 0.1892 0.5128 0.5004 -0.3236 0.0000
## eQQ Mean eQQ Max
## distance 0.3935 0.5979
## re74 3620.9240 9216.5000
## re75 1060.6582 6795.0100
## educ 0.7027 4.0000
## black 0.6432 1.0000
## hispan 0.0811 1.0000
## nodegree 0.1135 1.0000
## married 0.3243 1.0000
##
##
## Summary of balance for matched data:
## Means Treated Means Control SD Control Mean Diff eQQ Med
## distance 0.5762 0.3637 0.2542 0.2125 0.1705
## re74 2095.5737 2237.6204 4262.4912 -142.0467 80.3299
## re75 1532.0553 1569.3000 2642.6441 -37.2446 131.1770
## educ 10.3459 10.4595 2.5728 -0.1135 0.0000
## black 0.8432 0.4703 0.5005 0.3730 0.0000
## hispan 0.0595 0.2108 0.4090 -0.1514 0.0000
## nodegree 0.7081 0.6649 0.4733 0.0432 0.0000
## married 0.1892 0.1892 0.3927 0.0000 0.0000
## eQQ Mean eQQ Max
## distance 0.2125 0.4665
## re74 456.4220 13121.7500
## re75 318.8618 11365.7100
## educ 0.3946 3.0000
## black 0.3730 1.0000
## hispan 0.1514 1.0000
## nodegree 0.0432 1.0000
## married 0.0000 0.0000
##
## Percent Balance Improvement:
## Mean Diff. eQQ Med eQQ Mean eQQ Max
## distance 45.9871 66.6456 45.9953 21.9772
## re74 95.9688 96.6882 87.3949 -42.3724
## re75 96.0142 86.6296 69.9374 -67.2655
## educ -2.7135 100.0000 43.8462 25.0000
## black 41.7636 100.0000 42.0168 0.0000
## hispan -82.9424 0.0000 -86.6667 0.0000
## nodegree 61.1721 0.0000 61.9048 0.0000
## married 100.0000 0.0000 100.0000 100.0000
##
## Sample sizes:
## Control Treated
## All 429 185
## Matched 185 185
## Unmatched 244 0
## Discarded 0 0
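If you prefer to read the balance table on a standardized scale (standardized mean differences rather than raw mean differences), summary() also accepts a standardize argument:
#balance table with standardized mean differences
summary(matched1, standardize = TRUE)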
#Q-Q plot
plot(matched1)
plot(matched4)
#jitter plots
plot(matched1, type = "jitter")
## [1] "To identify the units, use first mouse button; to stop, use second."
## integer(0)
plot(matched4, type = "jitter")
## [1] "To identify the units, use first mouse button; to stop, use second."
## integer(0)
#histograms
plot(matched1, type = "hist")
plot(matched4, type = "hist")
What to do after matching? 1) assign the matched observations to a data frame and then conduct your analysis 2) or export the matched dataset to other software for further analysis
#1
m.data1 <- match.data(matched1)
m.data2 <- match.data(matched4)
ps_ols <- lm(re78 ~ treat + age + educ, data = m.data1)
CEM_OLS <- lm(re78 ~ treat + age + educ, data = m.data2)
original <- lm(re78 ~ treat + age + educ, data = Le)
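One caveat: match.data() also returns a weights column, and for methods such as CEM or full matching, where matched strata can contain unequal numbers of treated and control units, these weights should normally be supplied to the outcome model. A minimal sketch (CEM_OLS_w is a name introduced here for illustration):
#CEM produces matching weights; include them in the regression
CEM_OLS_w <- lm(re78 ~ treat + age + educ, data = m.data2, weights = weights)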
summary(ps_ols)
##
## Call:
## lm(formula = re78 ~ treat + age + educ, data = m.data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9122 -5107 -1693 2953 53615
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -22.456 2004.350 -0.011 0.99107
## treat 1002.241 715.952 1.400 0.16240
## age 8.226 40.370 0.204 0.83864
## educ 498.455 155.140 3.213 0.00143 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6863 on 366 degrees of freedom
## Multiple R-squared: 0.03206, Adjusted R-squared: 0.02413
## F-statistic: 4.041 on 3 and 366 DF, p-value: 0.007579
summary(CEM_OLS)
##
## Call:
## lm(formula = re78 ~ treat + age + educ, data = m.data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7495 -4548 -1732 2835 27046
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 640.55791 3071.33951 0.209 0.8351
## treat 749.72927 1029.69839 0.728 0.4677
## age -0.08074 94.08939 -0.001 0.9993
## educ 435.81350 254.62371 1.712 0.0892 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6056 on 142 degrees of freedom
## Multiple R-squared: 0.02695, Adjusted R-squared: 0.006391
## F-statistic: 1.311 on 3 and 142 DF, p-value: 0.2733
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
stargazer(ps_ols, CEM_OLS, original, type="text", title="Regression Results",
omit.stat=c("LL","ser","f"), single.row=TRUE)
##
## Regression Results
## ==========================================================================
## Dependent variable:
## -------------------------------------------------------------
## re78
## (1) (2) (3)
## --------------------------------------------------------------------------
## treat 1,002.241 (715.952) 749.729 (1,029.698) -480.729 (647.773)
## age 8.226 (40.370) -0.081 (94.089) 94.926*** (30.338)
## educ 498.455*** (155.140) 435.813* (254.624) 505.607*** (113.471)
## Constant -22.456 (2,004.350) 640.558 (3,071.340) -851.742 (1,562.921)
## --------------------------------------------------------------------------
## Observations 370 146 614
## R2 0.032 0.027 0.043
## Adjusted R2 0.024 0.006 0.039
## ==========================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
#let's see the confidence intervals
stargazer(ps_ols, CEM_OLS, original, type="text", title="Regression Results",
omit.stat=c("LL","ser","f"), ci=TRUE, ci.level=0.90, single.row=TRUE)
##
## Regression Results
## =============================================================================================================
## Dependent variable:
## ------------------------------------------------------------------------------------------------
## re78
## (1) (2) (3)
## -------------------------------------------------------------------------------------------------------------
## treat 1,002.241 (-175.395, 2,179.876) 749.729 (-943.974, 2,443.432) -480.729 (-1,546.220, 584.762)
## age 8.226 (-58.176, 74.629) -0.081 (-154.844, 154.683) 94.926*** (45.024, 144.828)
## educ 498.455*** (243.272, 753.638) 435.813* (16.995, 854.632) 505.607*** (318.963, 692.251)
## Constant -22.456 (-3,319.319, 3,274.407) 640.558 (-4,411.346, 5,692.462) -851.742 (-3,422.518, 1,719.035)
## -------------------------------------------------------------------------------------------------------------
## Observations 370 146 614
## R2 0.032 0.027 0.043
## Adjusted R2 0.024 0.006 0.039
## =============================================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
#2
write.csv(m.data1, file = "matchLe.csv")
Well, matching seems easy, right? Not really: you have to think through whether your dataset is suitable for matching. It is useful for individual-level data, but if you want to conduct propensity score matching at the national level, you might need to think twice.