This assignment walks through a Propensity Matching Exercise, with a Machine Learning twist at the end.
# Required libraries
library(dplyr)
library(ggplot2)
library(MatchIt)
library(glmnet)
# download dataset and select + rename relevant variables
psm <- read.csv(url("http://www.mfilipski.com/files/ecls.csv")) %>%
select(c5r2mtsc_std, catholic, w3income_1k, p5numpla, w3momed_hsb) %>%
rename(scores = c5r2mtsc_std, income = w3income_1k, places = p5numpla, momDegree = w3momed_hsb)
t.test(scores ~ catholic, psm)
##
## Welch Two Sample t-test
##
## data: scores by catholic
## t = -1.7866, df = 1468.1, p-value = 0.07421
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.118653665 0.005539377
## sample estimates:
## mean in group 0 mean in group 1
## 0.1631279 0.2196851
Yes. (0.1631 for non-catholic schools v 0.2197 for catholic schools)
No, t-test shows difference of means across groups is not significant. And economic theory tells us it’s not that straightforward.
summary(lm(scores ~ catholic, psm))
##
## Call:
## lm(formula = scores ~ catholic, data = psm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2667 -0.6082 0.0538 0.6292 3.2705
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.16313 0.01424 11.459 <2e-16 ***
## catholic 0.05656 0.03440 1.644 0.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9549 on 5427 degrees of freedom
## Multiple R-squared: 0.000498, Adjusted R-squared: 0.0003138
## F-statistic: 2.704 on 1 and 5427 DF, p-value: 0.1002
Coefficient estimate suggests no relationship between attending catholic schools and math scores.
summary(lm(scores ~ catholic + income + places + momDegree, data = psm))
##
## Call:
## lm(formula = scores ~ catholic + income + places + momDegree,
## data = psm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3922 -0.5696 0.0372 0.5986 3.2572
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0733290 0.0491661 1.491 0.135900
## catholic -0.1273899 0.0328888 -3.873 0.000109 ***
## income 0.0053247 0.0003026 17.595 < 2e-16 ***
## places -0.1009432 0.0355779 -2.837 0.004567 **
## momDegree -0.3740355 0.0271962 -13.753 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8941 on 5424 degrees of freedom
## Multiple R-squared: 0.1242, Adjusted R-squared: 0.1235
## F-statistic: 192.3 on 4 and 5424 DF, p-value: < 2.2e-16
Coefficient estimate on the catholic dummy now suggests that attending catholic schools is associated with a 12% decrease in standardized math scores. Further, the coefficients on all regressors are statistically significant (p < 0.01 for all). This, plus the improved R-squared suggests this model captures more of the variation in math scores.
t.test(income ~ catholic, psm) # different
##
## Welch Two Sample t-test
##
## data: income by catholic
## t = -13.238, df = 1314.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -23.86719 -17.70620
## sample estimates:
## mean in group 0 mean in group 1
## 65.39393 86.18063
t.test(places ~ catholic, psm) # different
##
## Welch Two Sample t-test
##
## data: places by catholic
## t = 3.128, df = 1600.4, p-value = 0.001792
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.01235472 0.05390038
## sample estimates:
## mean in group 0 mean in group 1
## 1.106246 1.073118
t.test(momDegree ~ catholic, psm) # different
##
## Welch Two Sample t-test
##
## data: momDegree by catholic
## t = 12.362, df = 1545.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.1572715 0.2165946
## sample estimates:
## mean in group 0 mean in group 1
## 0.3923094 0.2053763
Yes, means for all other regressors are different across school types, which suggests pupils attending catholic schools are likely to come from a different background compared to those that attend non-catholic schools. Because there is something fundamentally different between the two groups, there is a chance that variations in math scores might arise from sources outside of the school. To attribute differences in scores to the type of school one attends, the groups attending different schools would have to be the same across all characteristics and differ only in the instruction received at school (or other school-related activities that might impact scores).
catholic based on the covariates we kept.# logit regression to see if catholic schools is predicted by covariates
cath <- glm(catholic ~ places + momDegree + income, family = "binomial", psm)
summary(cath)
##
## Call:
## glm(formula = catholic ~ places + momDegree + income, family = "binomial",
## data = psm)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0172 -0.6461 -0.5240 -0.4122 2.3672
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.6702950 0.1576164 -10.597 < 2e-16 ***
## places -0.2774346 0.1232930 -2.250 0.0244 *
## momDegree -0.6504757 0.0915710 -7.104 1.22e-12 ***
## income 0.0077921 0.0008115 9.602 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4972.4 on 5428 degrees of freedom
## Residual deviance: 4750.6 on 5425 degrees of freedom
## AIC: 4758.6
##
## Number of Fisher Scoring iterations: 4
Results indicate that attending a catholic school is correlated with income and mother’s education.
# append predictions based on the previous model
psm <-psm %>% mutate(prediction = predict(cath, type = "response"))
# plot
ggplot(psm, aes(x = prediction, group = factor(catholic), color = factor(catholic))) +
geom_density()+ scale_color_discrete(name = "Catholic School (1 = yes)")
# income against prediction
ggplot(psm, aes(prediction, income, group = factor(catholic), color = factor(catholic))) +
geom_smooth(method = "loess")+ scale_color_discrete(name = "Catholic School (1 = yes)")
# 'places' against prediction
ggplot(psm, aes(prediction, places, group = factor(catholic), color = factor(catholic))) +
geom_smooth(method = "loess")+ scale_color_discrete(name = "Catholic School (1 = yes)")
They look similar, but not perfectly aligned.
Create a matched dataset.
# create a new matched database using the MatchIt package
psmMatched <- match.data(matchit(catholic ~ places + momDegree + income, psm))
dim(psmMatched)
## [1] 1860 9
Number of observations reduced from 5429 to 1860. The remaining observations can be thought of as being ready for comparison.
# sort by distance
psmMatched %>% arrange(distance) %>% head(10)
## scores catholic income places momDegree prediction distance weights
## 1 -0.3835843 0 17.5005 2 1 0.06069529 0.06069529 1
## 2 -2.5922340 1 17.5005 2 1 0.06069529 0.06069529 1
## 3 -2.3690526 0 22.5005 2 1 0.06295488 0.06295488 1
## 4 0.5271555 1 22.5005 2 1 0.06295488 0.06295488 1
## 5 -0.2195955 0 27.5005 2 1 0.06529274 0.06529274 1
## 6 -1.6878765 1 27.5005 2 1 0.06529274 0.06529274 1
## 7 -0.2550080 0 32.5005 2 1 0.06771114 0.06771114 1
## 8 -1.2722942 1 32.5005 2 1 0.06771114 0.06771114 1
## 9 -0.7403859 0 37.5005 2 1 0.07021240 0.07021240 1
## 10 -0.7841369 1 37.5005 2 1 0.07021240 0.07021240 1
## subclass
## 1 598
## 2 598
## 3 222
## 4 222
## 5 272
## 6 272
## 7 859
## 8 859
## 9 728
## 10 728
Observations are matched on distance.
na.exclude(psmMatched) %>% group_by(catholic) %>% summarise_all(mean)
## # A tibble: 2 x 9
## catholic scores income places momDegree prediction distance weights subclass
## * <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0.335 86.2 1.07 0.205 0.203 0.203 1 NA
## 2 1 0.220 86.2 1.07 0.205 0.203 0.203 1 NA
t.test(income ~ catholic, psmMatched) # same
##
## Welch Two Sample t-test
##
## data: income by catholic
## t = 0, df = 1858, p-value = 1
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.985565 3.985565
## sample estimates:
## mean in group 0 mean in group 1
## 86.18063 86.18063
t.test(places ~ catholic, psmMatched) # same
##
## Welch Two Sample t-test
##
## data: places by catholic
## t = 0, df = 1858, p-value = 1
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.02550007 0.02550007
## sample estimates:
## mean in group 0 mean in group 1
## 1.073118 1.073118
t.test(momDegree~ catholic, psmMatched) # same
##
## Welch Two Sample t-test
##
## data: momDegree by catholic
## t = 0, df = 1858, p-value = 1
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.03676158 0.03676158
## sample estimates:
## mean in group 0 mean in group 1
## 0.2053763 0.2053763
Now the means of covariates across catholic and non-catholic schools are similar and the two groups are more comparable.
# income against prediction after matching
ggplot(psmMatched, aes(prediction, income, color = factor(catholic))) +
geom_smooth(method = "loess") + scale_color_discrete(name = "Catholic School (1 = yes)")
# 'places' against prediction after matching
ggplot(psmMatched, aes(prediction, places, color = factor(catholic))) +
geom_smooth(method = "loess") + scale_color_discrete(name = "Catholic School (1 = yes)")
Means are nearly identical at each level of the pscore distribution.
psmMatched %>% group_by(catholic) %>% summarise_all(mean) #same means
## # A tibble: 2 x 9
## catholic scores income places momDegree prediction distance weights subclass
## * <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0.335 86.2 1.07 0.205 0.203 0.203 1 NA
## 2 1 0.220 86.2 1.07 0.205 0.203 0.203 1 NA
Covariate means are the same.
summary(lm(scores ~ catholic, psmMatched))
##
## Call:
## lm(formula = scores ~ catholic, data = psmMatched)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2884 -0.5547 0.0443 0.5931 2.8356
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.33456 0.02896 11.552 < 2e-16 ***
## catholic -0.11488 0.04096 -2.805 0.00509 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8832 on 1858 degrees of freedom
## Multiple R-squared: 0.004216, Adjusted R-squared: 0.00368
## F-statistic: 7.867 on 1 and 1858 DF, p-value: 0.005087
summary(lm(scores ~ catholic + income + places + momDegree, psmMatched))
##
## Call:
## lm(formula = scores ~ catholic + income + places + momDegree,
## data = psmMatched)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3758 -0.5347 0.0265 0.5692 2.8515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1513954 0.0910427 1.663 0.09650 .
## catholic -0.1148784 0.0392301 -2.928 0.00345 **
## income 0.0041331 0.0004566 9.052 < 2e-16 ***
## places -0.0911405 0.0700264 -1.302 0.19324
## momDegree -0.3662377 0.0495217 -7.396 2.12e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.846 on 1855 degrees of freedom
## Multiple R-squared: 0.08793, Adjusted R-squared: 0.08596
## F-statistic: 44.71 on 4 and 1855 DF, p-value: < 2.2e-16
Coefficient estimate on catholic is almost the same but the standard error decreases if we include covariates in the regression suggesting that matching allows us to interpret it as the variation in math scores caused by attending catholic schools (because the two groups are now similar across other observables.)
psmMatched %>% select(distance, prediction) %>% head(20)
## distance prediction
## 1 0.18835756 0.18835756
## 2 0.16839012 0.16839012
## 3 0.18835756 0.18835756
## 4 0.21995740 0.21995740
## 5 0.31455557 0.31455557
## 6 0.18835756 0.18835756
## 7 0.08746743 0.08746743
## 8 0.16036417 0.16036417
## 9 0.09556052 0.09556052
## 10 0.18835756 0.18835756
## 11 0.10801363 0.10801363
## 12 0.10801363 0.10801363
## 13 0.14047341 0.14047341
## 14 0.07857707 0.07857707
## 15 0.08144465 0.08144465
## 16 0.13583484 0.13583484
## 17 0.21995740 0.21995740
## 18 0.18835756 0.18835756
## 19 0.16839012 0.16839012
## 20 0.08440729 0.08440729
distance and prediction are the same. Alternatively:
ggplot(psmMatched, aes(distance, prediction)) + geom_point(size = 0.4, color = "midnightblue") + geom_line(linetype = "dotted", color = "sienna4")
# 45-degree line is the locus of y = x, implies `distance` = `prediction`.
# running a matchit command with distance=prediction and repeating the same plot
data.frame(match.data(matchit(formula(cath), psm, distance = predict(cath, type = "response")))) %>%
ggplot(aes(distance, prediction)) + geom_point(size = 0.02)
# select only numeric variables
mlPsm <- read.csv(url("http://www.mfilipski.com/files/ecls.csv")) %>%
select(where(is.numeric))
yMat <- as.matrix(psm %>% select(catholic))
xMat <- mlPsm %>% select(2:18) %>% as.matrix()
# run the lasso
temp <- cv.glmnet((as.matrix(mlPsm[,2:18])), as.matrix(mlPsm[,1]), family = "binomial")
lasso<- glmnet(as.matrix(mlPsm[,2:18]), as.matrix(mlPsm[,1]), family = "binomial", lambda = temp$lambda.1se)
# coef function returns lambda.1se coefficients by default
coef(lasso)
## 18 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -2.687873e+00
## race_white 5.838045e-02
## race_black .
## race_hispanic .
## race_asian .
## p5numpla .
## p5hmage 2.245165e-02
## p5hdage .
## w3daded_hsb -2.348271e-01
## w3momed_hsb -2.584542e-01
## w3momscr 1.366464e-03
## w3dadscr .
## w3income 4.518323e-06
## w3povrty -3.252269e-01
## p5fstamp .
## c5r2mtsc .
## c5r2mtsc_std .
## w3income_1k .
Based on this, we would select race, mother’s age, parents’ education, income, mother’s score, and poverty status for our regression.
# matching with predictions from the lasso model as the distance
mlPsmMatch <- match.data(matchit(catholic~w3income_1k + p5numpla + w3momed_hsb,
data = mlPsm,
distance=as.numeric(predict(lasso,newx = as.matrix(mlPsm[,2:18]),
type= "response"))))
# income v prediction
ggplot(mlPsmMatch, aes(distance, w3income_1k), color = factor(catholic)) + geom_smooth(method = "loess")
# momDegree v prediction
ggplot(mlPsmMatch, aes(distance, w3momed_hsb), color = factor(catholic)) + geom_smooth(method = "loess")
Overlapping plots when using catholic as factor \(\implies\) two groups are similar after matching.
When data is large (or at least feature-rich), using PSM together with ML methods might help reveal interesting relationships in the data that would otherwise be difficult to establish due to the presence of confounders. 🤦 🤷