Intro to Propensity Score Matching in R

📊 Overview

This notebook demonstrates how to use the MatchIt package in R to perform propensity score matching using the classic lalonde dataset.

📦 Load Required Packages

library(MatchIt)

library(cobalt)

##  cobalt (Version 4.6.0, Build Date: 2025-04-15)

## 
## Attaching package: 'cobalt'

## The following object is masked from 'package:MatchIt':
## 
##     lalonde

📁 Load the Lalonde Dataset

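# cobalt also provides an object named lalonde (see the masking message above),
# so request the MatchIt copy explicitly
data("lalonde", package = "MatchIt")
head(lalonde)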

treat: 1 = treated, 0 = control
re78: 1978 earnings (outcome)
Covariates: age, educ, race, married, nodegree, re74, re75
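
Before matching, it can help to look at the raw group sizes and the unadjusted difference in outcomes. The following is a minimal sketch using base R on the same data. The names n_by_group, raw_means, and naive_diff are illustrative, and the naive comparison is shown only for contrast; it is not a causal estimate.

```r
library(MatchIt)
data("lalonde", package = "MatchIt")

# Group sizes: 0 = control, 1 = treated
n_by_group <- table(lalonde$treat)
print(n_by_group)

# Unadjusted mean 1978 earnings per group (confounded by the covariates)
raw_means <- tapply(lalonde$re78, lalonde$treat, mean)
print(raw_means)

# Naive treated-minus-control difference, for comparison with the matched estimate
naive_diff <- unname(raw_means["1"] - raw_means["0"])
print(naive_diff)
```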

⚙️ Estimate Propensity Scores and Match

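# 1:1 nearest-neighbor matching on the estimated propensity score
m.out <- matchit(treat ~ age + educ + race + married + nodegree + re74 + re75,
                 data = lalonde,
                 method = "nearest",
                 ratio = 1)
# Balance before and after matching, plus matched sample sizes
summary(m.out)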

## 
## Call:
## matchit(formula = treat ~ age + educ + race + married + nodegree + 
##     re74 + re75, data = lalonde, method = "nearest", ratio = 1)
## 
## Summary of Balance for All Data:
##            Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## distance          0.5774        0.1822          1.7941     0.9211    0.3774
## age              25.8162       28.0303         -0.3094     0.4400    0.0813
## educ             10.3459       10.2354          0.0550     0.4959    0.0347
## raceblack         0.8432        0.2028          1.7615          .    0.6404
## racehispan        0.0595        0.1422         -0.3498          .    0.0827
## racewhite         0.0973        0.6550         -1.8819          .    0.5577
## married           0.1892        0.5128         -0.8263          .    0.3236
## nodegree          0.7081        0.5967          0.2450          .    0.1114
## re74           2095.5737     5619.2365         -0.7211     0.5181    0.2248
## re75           1532.0553     2466.4844         -0.2903     0.9563    0.1342
##            eCDF Max
## distance     0.6444
## age          0.1577
## educ         0.1114
## raceblack    0.6404
## racehispan   0.0827
## racewhite    0.5577
## married      0.3236
## nodegree     0.1114
## re74         0.4470
## re75         0.2876
## 
## Summary of Balance for Matched Data:
##            Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## distance          0.5774        0.3629          0.9739     0.7566    0.1321
## age              25.8162       25.3027          0.0718     0.4568    0.0847
## educ             10.3459       10.6054         -0.1290     0.5721    0.0239
## raceblack         0.8432        0.4703          1.0259          .    0.3730
## racehispan        0.0595        0.2162         -0.6629          .    0.1568
## racewhite         0.0973        0.3135         -0.7296          .    0.2162
## married           0.1892        0.2108         -0.0552          .    0.0216
## nodegree          0.7081        0.6378          0.1546          .    0.0703
## re74           2095.5737     2342.1076         -0.0505     1.3289    0.0469
## re75           1532.0553     1614.7451         -0.0257     1.4956    0.0452
##            eCDF Max Std. Pair Dist.
## distance     0.4216          0.9740
## age          0.2541          1.3938
## educ         0.0757          1.2474
## raceblack    0.3730          1.0259
## racehispan   0.1568          1.0743
## racewhite    0.2162          0.8390
## married      0.0216          0.8281
## nodegree     0.0703          1.0106
## re74         0.2757          0.7965
## re75         0.2054          0.7381
## 
## Sample Sizes:
##           Control Treated
## All           429     185
## Matched       185     185
## Unmatched     244       0
## Discarded       0       0
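
The distance row in the summary is the propensity score. The sketch below reproduces it by hand with glm() to show where the score comes from. It assumes MatchIt's defaults are in effect (a logistic regression of treatment on the covariates), and the names f, ps_fit, and ps_manual are illustrative.

```r
library(MatchIt)
data("lalonde", package = "MatchIt")

f <- treat ~ age + educ + race + married + nodegree + re74 + re75
m.out <- matchit(f, data = lalonde, method = "nearest", ratio = 1)

# Logistic regression of treatment on the covariates
ps_fit <- glm(f, data = lalonde, family = binomial(link = "logit"))
ps_manual <- fitted(ps_fit)

# Compare the hand-computed scores with the ones stored by matchit()
print(all.equal(unname(ps_manual), unname(m.out$distance)))

# Mean score by group; compare with the 'distance' row of the balance summary
print(tapply(ps_manual, lalonde$treat, mean))
```

If the two sets of scores agree, the matching step reduces to pairing each treated unit with the unmatched control whose fitted probability is closest.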

✅ Check Balance

love.plot(m.out, threshold = 0.1, stars = "std")

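The love.plot displays covariate mean differences before and after matching. With stars = "std", starred covariate names are the ones plotted as standardized mean differences; by default cobalt reports binary covariates as raw differences in proportion. Absolute standardized mean differences below 0.1 are conventionally considered well balanced. In the summary above, several covariates remain far above that threshold after matching (for example raceblack at 1.03 and racehispan at -0.66), so this 1:1 nearest-neighbor match leaves substantial imbalance.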
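
A numeric table is a useful companion to the plot. The sketch below calls cobalt's bal.tab() with a 0.1 threshold on mean differences, then recomputes one standardized mean difference by hand. The hand calculation assumes the difference in means is divided by the treated-group standard deviation, which is consistent with the age row of the summary above (-0.3094 before matching). The names bt and smd_age are illustrative.

```r
library(MatchIt)
library(cobalt)
data("lalonde", package = "MatchIt")

m.out <- matchit(treat ~ age + educ + race + married + nodegree + re74 + re75,
                 data = lalonde, method = "nearest", ratio = 1)

# Balance before (un = TRUE) and after matching, flagging |difference| > 0.1
bt <- bal.tab(m.out, un = TRUE, thresholds = c(m = 0.1))
print(bt)

# Standardized mean difference for age in the full sample, by hand:
# (treated mean - control mean) / treated-group standard deviation
smd_age <- with(lalonde,
                (mean(age[treat == 1]) - mean(age[treat == 0])) / sd(age[treat == 1]))
print(smd_age)
```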

📂 Extract Matched Data

matched_data <- match.data(m.out)
head(matched_data)
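
match.data() returns the matched rows of the original data plus a few bookkeeping columns. The sketch below inspects them. It assumes MatchIt's default column names (distance, weights, subclass); added_cols and pair_counts are illustrative names.

```r
library(MatchIt)
data("lalonde", package = "MatchIt")

m.out <- matchit(treat ~ age + educ + race + married + nodegree + re74 + re75,
                 data = lalonde, method = "nearest", ratio = 1)
matched_data <- match.data(m.out)

# Columns added by match.data(): propensity score, matching weight, pair id
added_cols <- setdiff(names(matched_data), names(lalonde))
print(added_cols)

# Number of matched units in each treatment group
print(table(matched_data$treat))

# With 1:1 matching, each subclass (pair) should hold one treated and one control unit
pair_counts <- table(matched_data$subclass, matched_data$treat)
print(all(pair_counts == 1))
```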

🧮 Estimate Treatment Effect

# Simple difference in means of 1978 earnings (re78)
with(matched_data, t.test(re78 ~ treat))

## 
##  Welch Two Sample t-test
## 
## data:  re78 by treat
## t = -1.2247, df = 345.59, p-value = 0.2215
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -2330.7493   542.0143
## sample estimates:
## mean in group 0 mean in group 1 
##        5454.776        6349.144

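Interpret the result: In the matched sample, the treated group earned about $894 more on average in 1978 (6349.14 vs. 5454.78). However, the 95% confidence interval for the control-minus-treated difference (-2330.75 to 542.01) includes zero and p = 0.22, so the difference is not statistically significant. A significant difference would suggest that the treatment (job training) affected earnings; this estimate does not support that conclusion. The covariate imbalance that remains after matching is a further reason to treat the estimate with caution.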
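
The Welch test treats the two matched groups as independent samples. One alternative, sketched here with base R only, uses the pair membership recorded in the subclass column: a weighted regression gives the same point estimate, and a paired t-test accounts for the pairing. This is an illustration under the same matching specification, not a recommended final analysis. The MatchIt documentation describes regression adjustment with cluster-robust standard errors, which is not shown here. The names fit, est, md, treated, control, and paired are illustrative.

```r
library(MatchIt)
data("lalonde", package = "MatchIt")

m.out <- matchit(treat ~ age + educ + race + married + nodegree + re74 + re75,
                 data = lalonde, method = "nearest", ratio = 1)
matched_data <- match.data(m.out)

# Weighted regression: the coefficient on treat is the treated-minus-control
# difference in mean re78 in the matched sample
fit <- lm(re78 ~ treat, data = matched_data, weights = weights)
est <- unname(coef(fit)["treat"])
print(est)

# Paired t-test: sort by pair id so treated and control rows line up pair by pair
md <- matched_data[order(matched_data$subclass), ]
treated <- md$re78[md$treat == 1]
control <- md$re78[md$treat == 0]
paired <- t.test(treated, control, paired = TRUE)
print(paired)
```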

📌 Conclusion

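Propensity score matching helps approximate causal comparisons in observational data by comparing treated and control units with similar estimated probabilities of treatment.
Always check covariate balance after matching, and revisit the matching specification if imbalance remains.
The MatchIt package simplifies this process in R.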

📚 Further Reading

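Austin, P. C. (2011). An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies. Multivariate Behavioral Research, 46(3), 399-424. PMC3144483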
MatchIt documentation: https://cran.r-project.org/web/packages/MatchIt
cobalt for balance diagnostics: https://cran.r-project.org/web/packages/cobalt