This is a introduction on propensity score matching on R, written in R Markdown. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see rmarkdown.
Propensity score is a balancing method to explore causal effect in observational studies. The basic idea is to explore the relationship between treatment and outcome, while controling the effect of covariants, or confoundings. This is done by minimizing or eliminating the selection bias in either measured or unmeasured baseline characteristics. This matching process mimics the randomized controlled trials (RCTs). Propensity score method can be implemented through R with package MatchIt and some further packages like Optmatch.
In potential outcomes framwork, let \( Z \) be the indicator variable denoting treatment received (\( Z=0 \) for control vs. \( Z=1 \) for treatment). Thus only one outcome, \( Y_i \) (\( Y_i=Z_iY_i(1)+(1-Z_i)Y_i(0) \)), is observed for each sample. For each sample, the effect of treatment is defined to be \( Y_i(1)-Y_i(0) \). The average treatment effect (ATE) is defined to be \( E[Y_i(1)-Y_i(0)] \) while a related measured of treatment effect is the average treatment effect for the treated (ATT), defined as \( E[Y(1)-Y(0)|Z=1] \).
The following is a basic reproduct based on r-tutiorial. Instead of using real data, I find it convenient to simulate some data and draw inference with PSM. More specifically, weâll analyze the effect of going to Catholic school (rather than public school) on student achievement, as measured by studentsâ spring 3rd grade math scores. Because students who attend Catholic school on average are different from students who attend public school, we will use propensity score matching to get more credible causal estimates of Catholic schooling and we use the following covariates for now:
wkmomed_1: Has the mother completed high school (1) or not (0)?
We will go through the following steps:
First we need to simulate some data, we generate each variable we need accroding to its characteristic. In the part, we generate a data frame (a common data type in R) with 1000 samples and 6 columns, including 4 covariants we want to balance on and one outcome varaible and one treatment variable. All the syntax with light blue background is the R code I wrote and every line begins with ## is the output or running result in R.
#data generation
n = 1000
#when size=1, binomial becomes bernoulli
#covariates
race_white = rbinom(n, 1, .4)
wkincome = rnorm(n)
p1numpla = rpois(n, 2)+1
wkmomed_1 = rbinom(n, 1, .7)
#treatment and outcome
catholic = rbinom(n, 1, .15)
c5r2mtsc_std = rnorm(n)
#combine together
csch = data.frame(cbind(race_white,wkincome,p1numpla,wkmomed_1, catholic, c5r2mtsc_std))
dim(csch)
## [1] 1000 6
We will then examine the difference of outcome variable and covariants using difference and t-test. dplyr is a package I like a lot in R for data manipulating.
mn <- with(csch, tapply(c5r2mtsc_std, catholic, mean))
mn
## 0 1
## 0.007622591 -0.123117580
mn[2] - mn[1]
## 1
## -0.1307402
with(csch, t.test(c5r2mtsc_std ~ catholic))
##
## Welch Two Sample t-test
##
## data: c5r2mtsc_std by catholic
## t = 1.4978, df = 230.318, p-value = 0.1356
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.0412502 0.3027305
## sample estimates:
## mean in group 0 mean in group 1
## 0.007622591 -0.123117580
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#csch_cov <- select(csch, catholic, race_white, wkincome, p1numpla, wkmomed_1)
csch %>%
group_by(catholic) %>%
summarise_each(funs(mean))
## Source: local data frame [2 x 6]
##
## catholic race_white wkincome p1numpla wkmomed_1 c5r2mtsc_std
## 1 0 0.3788969 -0.067361283 2.995204 0.6930456 0.007622591
## 2 1 0.3855422 0.009433034 3.066265 0.6807229 -0.123117580
Next we will estimate the propensity score, or the probability of being treated using linear model of logistic regression. Of course we can choose other method, such as classification trees and random forest in machine learning. Then we calculate the propensity score for each student using function predict and create a dataframe that has the propensity score as well as the studentâs actual treatment status.
m_ps <- glm(catholic ~ race_white + wkincome + p1numpla + wkmomed_1,
family=binomial(), data=csch)
require(arm)
## Loading required package: arm
display(m_ps, digits = 4)
## Error in eval(expr, envir, enclos): could not find function "display"
prs_df <- data.frame( pr_score = predict(m_ps, type = "response"),
catholic = csch$catholic )
head(prs_df)
## pr_score catholic
## 1 0.1542399 0
## 2 0.1778544 0
## 3 0.1749962 1
## 4 0.1695123 0
## 5 0.1552635 1
## 6 0.1525381 0
After prediction, we can examine the region of common support with visualization. ggplot2 package is a powerful tool for visualization in R.
require(ggplot2)
## Loading required package: ggplot2
levels(prs_df$catholic) <- paste("Actual school type attended:", c("Public", "Catholic"))
bb<- ggplot(prs_df, aes(x = pr_score)) +
geom_histogram(color = "white") +
facet_wrap(~catholic) +
xlab("Probability of going to Catholic school") +
theme_bw()
bb
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We move on to the next phase of mathing with propensity score. More specifically, we use the stratification on the propensity score method, in which we divide the sample within the region of common support into 4 quintiles. The default setting for subclass in package MatchIt is 6 subclasses.
#based on the estimated propensity score. nearest is selected here
library(MatchIt)
## Loading required package: MASS
m.out <- matchit(catholic ~ race_white + wkincome + p1numpla + wkmomed_1,
method = "subclass", subclass = 4, data = csch)
m.dat <- match.data(m.out)
dim(m.dat)
## [1] 1000 9
m.dat$pr_score=predict(m_ps, type = "response")
cc <- ggplot(m.dat, aes(x = pr_score)) +
geom_histogram(color = "white") +
facet_wrap(~subclass) +
xlab("Probability of going to Catholic school") +
theme_bw()
cc
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
First we examine by visualing, next we examine the mean difference in each group using t-test or anova in multiple measurements.
fn_bal <- function(dta, variable) {
dta$variable <- dta[, variable]
ggplot(dta, aes(x = distance, y = variable, color = catholic)) +
geom_point(alpha = 0.2, size = 1.5) +
geom_smooth(method = "loess", se = F) +
xlab("Propensity score") +
ylab(variable) +
theme_bw()
}
require(gridExtra)
## Loading required package: gridExtra
grid.arrange(
fn_bal(m.dat, "wkincome") + theme(legend.position = "none"),
fn_bal(m.dat, "p1numpla"),
fn_bal(m.dat, "wkmomed_1") + theme(legend.position = "none"),
fn_bal(m.dat, "race_white"), widths = c(0.8, 1, 0.8, 1)
)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:MASS':
##
## select
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
match_cov <- select(m.dat, catholic, race_white, wkincome, p1numpla, wkmomed_1, subclass)
match_cov %>%
group_by(subclass) %>%
summarise_each(funs(mean))
## Source: local data frame [4 x 6]
##
## subclass catholic race_white wkincome p1numpla wkmomed_1
## (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
## 1 1 0.1532847 0.3138686 -1.0440974 2.251825 0.8540146
## 2 2 0.1418685 0.3494810 -0.1565159 2.761246 0.7404844
## 3 3 0.1798246 0.4122807 0.3043264 3.214912 0.5745614
## 4 4 0.2009569 0.4736842 0.9919422 4.110048 0.5358852
The result is trivial since the data we generated is random without restriction. There exists no association relation between the treatment and the outcome, nor between covariants and the treatment. But the general process is done in the way.
library(dplyr)
library(arm)
## Loading required package: Matrix
## Loading required package: lme4
##
## arm (Version 1.8-6, built: 2015-7-7)
##
## Working directory is C:/Users/User/Google Drive/job_app/PhD/GU_Pedagog/report_1
with(m.dat, t.test(c5r2mtsc_std ~ catholic))
##
## Welch Two Sample t-test
##
## data: c5r2mtsc_std by catholic
## t = 1.4978, df = 230.318, p-value = 0.1356
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.0412502 0.3027305
## sample estimates:
## mean in group 0 mean in group 1
## 0.007622591 -0.123117580
#Or we can use OLS with or without covariates:
lm_treat1 <- lm(c5r2mtsc_std ~ factor(catholic), data = m.dat)
display(lm_treat1, digits = 3, detail = T)
## lm(formula = c5r2mtsc_std ~ factor(catholic), data = m.dat)
## coef.est coef.se t value Pr(>|t|)
## (Intercept) 0.008 0.035 0.219 0.826
## factor(catholic)1 -0.131 0.085 -1.533 0.125
## ---
## n = 1000, k = 2
## residual sd = 1.003, R-Squared = 0.00
lm_treat2 <- lm(c5r2mtsc_std ~ factor(catholic) + race_white + wkincome +
p1numpla + wkmomed_1, data = m.dat)
display(lm_treat2, digits = 3, detail = T)
## lm(formula = c5r2mtsc_std ~ factor(catholic) + race_white + wkincome +
## p1numpla + wkmomed_1, data = m.dat)
## coef.est coef.se t value Pr(>|t|)
## (Intercept) -0.054 0.091 -0.596 0.551
## factor(catholic)1 -0.134 0.085 -1.565 0.118
## race_white 0.008 0.066 0.129 0.897
## wkincome 0.005 0.032 0.144 0.886
## p1numpla 0.029 0.022 1.296 0.195
## wkmomed_1 -0.039 0.069 -0.566 0.572
## ---
## n = 1000, k = 6
## residual sd = 1.004, R-Squared = 0.00
This is a try-out of propensity score in R. In general, it follows the following steps.
Hope you like it and look forward to further dicussion in the future. Some papers provided and links above will give more detail.
A Step-by-Step Guide to Propensity Score Matching in R is a detailed paper for the first phase in PSM.
matchit is the official document for the package matchit
An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies is a review paper for methodology and compares four different methods in PSM.
Assessing the effect of small school size on mathematics achievement A propensity score approach is a demonstration of stratification based on PSM specifically.
Reducing Bias in Observational Studies Using Subclassification on the Propensity Score together with The central role of the propensity score in observational studies for causal effects is a frequenly cited paper in PSM and gives more mathematics detail.