This assignment walks through a Propensity Matching Exercise, with a Machine Learning twist at the end.

Part A: Understanding and running PSM

1. Download data.

# Required libraries 
library(dplyr)
library(ggplot2)
library(MatchIt)
library(glmnet)
# download dataset and select + rename relevant variables 
psm <- read.csv(url("http://www.mfilipski.com/files/ecls.csv")) %>% 
  select(c5r2mtsc_std, catholic, w3income_1k, p5numpla, w3momed_hsb) %>% 
  rename(scores = c5r2mtsc_std, income = w3income_1k, places = p5numpla, momDegree =  w3momed_hsb)

2. Identify problem.

Do test scores differ for pupils of catholic and non-catholic schools on average?

t.test(scores ~ catholic, psm)

## 
##  Welch Two Sample t-test
## 
## data:  scores by catholic
## t = -1.7866, df = 1468.1, p-value = 0.07421
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.118653665  0.005539377
## sample estimates:
## mean in group 0 mean in group 1 
##       0.1631279       0.2196851

Yes. (0.1631 for non-catholic schools v 0.2197 for catholic schools)

Can you conclude that going to a catholic school leads to higher math scores? Why or why not?

No, t-test shows difference of means across groups is not significant. And economic theory tells us it’s not that straightforward.

Run regressions:
- A regression of just math scores on the catholic dummy.

summary(lm(scores ~ catholic, psm))

## 
## Call:
## lm(formula = scores ~ catholic, data = psm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2667 -0.6082  0.0538  0.6292  3.2705 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.16313    0.01424  11.459   <2e-16 ***
## catholic     0.05656    0.03440   1.644      0.1    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9549 on 5427 degrees of freedom
## Multiple R-squared:  0.000498,   Adjusted R-squared:  0.0003138 
## F-statistic: 2.704 on 1 and 5427 DF,  p-value: 0.1002

Coefficient estimate suggests no relationship between attending catholic schools and math scores.

Another regression that includes all the variables listed above.

summary(lm(scores ~ catholic + income + places + momDegree, data = psm))

## 
## Call:
## lm(formula = scores ~ catholic + income + places + momDegree, 
##     data = psm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3922 -0.5696  0.0372  0.5986  3.2572 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.0733290  0.0491661   1.491 0.135900    
## catholic    -0.1273899  0.0328888  -3.873 0.000109 ***
## income       0.0053247  0.0003026  17.595  < 2e-16 ***
## places      -0.1009432  0.0355779  -2.837 0.004567 ** 
## momDegree   -0.3740355  0.0271962 -13.753  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8941 on 5424 degrees of freedom
## Multiple R-squared:  0.1242, Adjusted R-squared:  0.1235 
## F-statistic: 192.3 on 4 and 5424 DF,  p-value: < 2.2e-16

Coefficient estimate on the catholic dummy now suggests that attending catholic schools is associated with a 12% decrease in standardized math scores. Further, the coefficients on all regressors are statistically significant (p < 0.01 for all). This, plus the improved R-squared suggests this model captures more of the variation in math scores.

Do the means of the covariates (income, mother’s education etc) differ between catholic vs non-catholic schools?

t.test(income ~ catholic, psm) # different

## 
##  Welch Two Sample t-test
## 
## data:  income by catholic
## t = -13.238, df = 1314.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -23.86719 -17.70620
## sample estimates:
## mean in group 0 mean in group 1 
##        65.39393        86.18063

t.test(places ~ catholic, psm) # different

## 
##  Welch Two Sample t-test
## 
## data:  places by catholic
## t = 3.128, df = 1600.4, p-value = 0.001792
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.01235472 0.05390038
## sample estimates:
## mean in group 0 mean in group 1 
##        1.106246        1.073118

t.test(momDegree ~ catholic, psm) # different

## 
##  Welch Two Sample t-test
## 
## data:  momDegree by catholic
## t = 12.362, df = 1545.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1572715 0.2165946
## sample estimates:
## mean in group 0 mean in group 1 
##       0.3923094       0.2053763

Yes, means for all other regressors are different across school types, which suggests pupils attending catholic schools are likely to come from a different background compared to those that attend non-catholic schools. Because there is something fundamentally different between the two groups, there is a chance that variations in math scores might arise from sources outside of the school. To attribute differences in scores to the type of school one attends, the groups attending different schools would have to be the same across all characteristics and differ only in the instruction received at school (or other school-related activities that might impact scores).

3. Estimate the propensity to go to catholic school, by hand.

Run a logit regression to predict catholic based on the covariates we kept.

# logit regression to see if catholic schools is predicted by covariates
cath <- glm(catholic ~ places  + momDegree + income, family = "binomial", psm)
summary(cath)

## 
## Call:
## glm(formula = catholic ~ places + momDegree + income, family = "binomial", 
##     data = psm)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0172  -0.6461  -0.5240  -0.4122   2.3672  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.6702950  0.1576164 -10.597  < 2e-16 ***
## places      -0.2774346  0.1232930  -2.250   0.0244 *  
## momDegree   -0.6504757  0.0915710  -7.104 1.22e-12 ***
## income       0.0077921  0.0008115   9.602  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4972.4  on 5428  degrees of freedom
## Residual deviance: 4750.6  on 5425  degrees of freedom
## AIC: 4758.6
## 
## Number of Fisher Scoring iterations: 4

Results indicate that attending a catholic school is correlated with income and mother’s education.

Making a prediction for who is likely to go to catholic school, using this model.

# append predictions based on the previous model 
psm  <-psm %>% mutate(prediction = predict(cath, type = "response"))

# plot 
ggplot(psm, aes(x = prediction, group = factor(catholic), color = factor(catholic))) + 
  geom_density()+ scale_color_discrete(name = "Catholic School (1 = yes)")

# income against prediction
ggplot(psm, aes(prediction, income, group = factor(catholic), color = factor(catholic))) + 
  geom_smooth(method = "loess")+ scale_color_discrete(name = "Catholic School (1 = yes)")

# 'places' against prediction
ggplot(psm, aes(prediction, places, group = factor(catholic), color = factor(catholic))) + 
  geom_smooth(method = "loess")+ scale_color_discrete(name = "Catholic School (1 = yes)")

They look similar, but not perfectly aligned.

4. Run a Matching algorithm to get a balanced dataset

Create a matched dataset.
- Run a MatchIt command with the same equation as your logit.

# create a new matched database using the MatchIt package
psmMatched <- match.data(matchit(catholic ~ places  + momDegree + income,  psm))
dim(psmMatched)

## [1] 1860    9

Number of observations reduced from 5429 to 1860. The remaining observations can be thought of as being ready for comparison.

Sort the data by distance adn show the first ten observations.

# sort by distance
psmMatched %>% arrange(distance) %>% head(10)

##        scores catholic  income places momDegree prediction   distance weights
## 1  -0.3835843        0 17.5005      2         1 0.06069529 0.06069529       1
## 2  -2.5922340        1 17.5005      2         1 0.06069529 0.06069529       1
## 3  -2.3690526        0 22.5005      2         1 0.06295488 0.06295488       1
## 4   0.5271555        1 22.5005      2         1 0.06295488 0.06295488       1
## 5  -0.2195955        0 27.5005      2         1 0.06529274 0.06529274       1
## 6  -1.6878765        1 27.5005      2         1 0.06529274 0.06529274       1
## 7  -0.2550080        0 32.5005      2         1 0.06771114 0.06771114       1
## 8  -1.2722942        1 32.5005      2         1 0.06771114 0.06771114       1
## 9  -0.7403859        0 37.5005      2         1 0.07021240 0.07021240       1
## 10 -0.7841369        1 37.5005      2         1 0.07021240 0.07021240       1
##    subclass
## 1       598
## 2       598
## 3       222
## 4       222
## 5       272
## 6       272
## 7       859
## 8       859
## 9       728
## 10      728

Observations are matched on distance.

Means of covariates across school types.

na.exclude(psmMatched) %>% group_by(catholic) %>% summarise_all(mean)

## # A tibble: 2 x 9
##   catholic scores income places momDegree prediction distance weights subclass
## *    <int>  <dbl>  <dbl>  <dbl>     <dbl>      <dbl>    <dbl>   <dbl>    <dbl>
## 1        0  0.335   86.2   1.07     0.205      0.203    0.203       1       NA
## 2        1  0.220   86.2   1.07     0.205      0.203    0.203       1       NA

t.test(income ~ catholic, psmMatched) # same

## 
##  Welch Two Sample t-test
## 
## data:  income by catholic
## t = 0, df = 1858, p-value = 1
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.985565  3.985565
## sample estimates:
## mean in group 0 mean in group 1 
##        86.18063        86.18063

t.test(places ~ catholic, psmMatched) # same

## 
##  Welch Two Sample t-test
## 
## data:  places by catholic
## t = 0, df = 1858, p-value = 1
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.02550007  0.02550007
## sample estimates:
## mean in group 0 mean in group 1 
##        1.073118        1.073118

t.test(momDegree~ catholic, psmMatched) # same

## 
##  Welch Two Sample t-test
## 
## data:  momDegree by catholic
## t = 0, df = 1858, p-value = 1
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.03676158  0.03676158
## sample estimates:
## mean in group 0 mean in group 1 
##       0.2053763       0.2053763

Reflect on what PSM did for your identification strategy.

Now the means of covariates across catholic and non-catholic schools are similar and the two groups are more comparable.

5. Visually check the balance in the new dataset.

# income against prediction after matching 
ggplot(psmMatched, aes(prediction, income, color = factor(catholic))) + 
  geom_smooth(method = "loess") + scale_color_discrete(name = "Catholic School (1 = yes)")

# 'places' against prediction after matching 
ggplot(psmMatched, aes(prediction, places, color = factor(catholic))) + 
  geom_smooth(method = "loess") + scale_color_discrete(name = "Catholic School (1 = yes)")

Means are nearly identical at each level of the pscore distribution.

6. Average treatment effect in the new sample

Means comparison.

psmMatched %>% group_by(catholic) %>% summarise_all(mean) #same means

## # A tibble: 2 x 9
##   catholic scores income places momDegree prediction distance weights subclass
## *    <int>  <dbl>  <dbl>  <dbl>     <dbl>      <dbl>    <dbl>   <dbl>    <dbl>
## 1        0  0.335   86.2   1.07     0.205      0.203    0.203       1       NA
## 2        1  0.220   86.2   1.07     0.205      0.203    0.203       1       NA

Covariate means are the same.

Run regressions

summary(lm(scores ~ catholic, psmMatched))

## 
## Call:
## lm(formula = scores ~ catholic, data = psmMatched)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2884 -0.5547  0.0443  0.5931  2.8356 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.33456    0.02896  11.552  < 2e-16 ***
## catholic    -0.11488    0.04096  -2.805  0.00509 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8832 on 1858 degrees of freedom
## Multiple R-squared:  0.004216,   Adjusted R-squared:  0.00368 
## F-statistic: 7.867 on 1 and 1858 DF,  p-value: 0.005087

summary(lm(scores ~ catholic + income + places + momDegree, psmMatched))

## 
## Call:
## lm(formula = scores ~ catholic + income + places + momDegree, 
##     data = psmMatched)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3758 -0.5347  0.0265  0.5692  2.8515 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.1513954  0.0910427   1.663  0.09650 .  
## catholic    -0.1148784  0.0392301  -2.928  0.00345 ** 
## income       0.0041331  0.0004566   9.052  < 2e-16 ***
## places      -0.0911405  0.0700264  -1.302  0.19324    
## momDegree   -0.3662377  0.0495217  -7.396 2.12e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.846 on 1855 degrees of freedom
## Multiple R-squared:  0.08793,    Adjusted R-squared:  0.08596 
## F-statistic: 44.71 on 4 and 1855 DF,  p-value: < 2.2e-16

Coefficient estimate on catholic is almost the same but the standard error decreases if we include covariates in the regression suggesting that matching allows us to interpret it as the variation in math scores caused by attending catholic schools (because the two groups are now similar across other observables.)

Part B: Deconstructing PSM

7. Split the PSM procedure.

Reproduce a PSM ‘by hand’

psmMatched %>% select(distance, prediction) %>% head(20)

##      distance prediction
## 1  0.18835756 0.18835756
## 2  0.16839012 0.16839012
## 3  0.18835756 0.18835756
## 4  0.21995740 0.21995740
## 5  0.31455557 0.31455557
## 6  0.18835756 0.18835756
## 7  0.08746743 0.08746743
## 8  0.16036417 0.16036417
## 9  0.09556052 0.09556052
## 10 0.18835756 0.18835756
## 11 0.10801363 0.10801363
## 12 0.10801363 0.10801363
## 13 0.14047341 0.14047341
## 14 0.07857707 0.07857707
## 15 0.08144465 0.08144465
## 16 0.13583484 0.13583484
## 17 0.21995740 0.21995740
## 18 0.18835756 0.18835756
## 19 0.16839012 0.16839012
## 20 0.08440729 0.08440729

distance and prediction are the same. Alternatively:

ggplot(psmMatched, aes(distance, prediction)) +  geom_point(size = 0.4, color = "midnightblue") + geom_line(linetype = "dotted", color = "sienna4")

# 45-degree line is the locus of y = x, implies  `distance` = `prediction`.

# running a matchit command with distance=prediction and repeating the same plot
data.frame(match.data(matchit(formula(cath), psm, distance = predict(cath, type = "response")))) %>% 
  ggplot(aes(distance, prediction)) + geom_point(size = 0.02)

8. Machine Learning for Matching

Running a LASSO procedure on the original dataset

# select only numeric variables
mlPsm <- read.csv(url("http://www.mfilipski.com/files/ecls.csv")) %>% 
  select(where(is.numeric)) 
yMat <- as.matrix(psm %>% select(catholic))
xMat <- mlPsm %>% select(2:18) %>% as.matrix()
# run the lasso
temp <- cv.glmnet((as.matrix(mlPsm[,2:18])), as.matrix(mlPsm[,1]), family = "binomial")
lasso<- glmnet(as.matrix(mlPsm[,2:18]), as.matrix(mlPsm[,1]), family = "binomial", lambda = temp$lambda.1se)

# coef function returns lambda.1se coefficients by default
coef(lasso)

## 18 x 1 sparse Matrix of class "dgCMatrix"
##                          s0
## (Intercept)   -2.687873e+00
## race_white     5.838045e-02
## race_black     .           
## race_hispanic  .           
## race_asian     .           
## p5numpla       .           
## p5hmage        2.245165e-02
## p5hdage        .           
## w3daded_hsb   -2.348271e-01
## w3momed_hsb   -2.584542e-01
## w3momscr       1.366464e-03
## w3dadscr       .           
## w3income       4.518323e-06
## w3povrty      -3.252269e-01
## p5fstamp       .           
## c5r2mtsc       .           
## c5r2mtsc_std   .           
## w3income_1k    .

Based on this, we would select race, mother’s age, parents’ education, income, mother’s score, and poverty status for our regression.

9. Lasso predictions for matching

# matching with predictions from the lasso model as the distance
mlPsmMatch <-  match.data(matchit(catholic~w3income_1k + p5numpla + w3momed_hsb,
                                    data = mlPsm,
                                    distance=as.numeric(predict(lasso,newx = as.matrix(mlPsm[,2:18]),
                                                                type= "response"))))

Plots

# income v prediction
ggplot(mlPsmMatch, aes(distance, w3income_1k), color = factor(catholic)) + geom_smooth(method = "loess")

# momDegree v prediction
ggplot(mlPsmMatch, aes(distance, w3momed_hsb), color = factor(catholic)) + geom_smooth(method = "loess")

Overlapping plots when using catholic as factor \(\implies\) two groups are similar after matching.

When data is large (or at least feature-rich), using PSM together with ML methods might help reveal interesting relationships in the data that would otherwise be difficult to establish due to the presence of confounders. 🤦 🤷

HW10 PSM and LASSO

maila

4/28/2021