Background:

This assignment will take you through a Propensity Score Matching exercise, with a Machine Learning twist at the end. The example data contains test results of school children along with various characteristics.

The key question is: Do catholic schools have an effect on student’s performance?

The key issus is, of course, that students in catholic schools (“Treated”) might be different from those in non-catholic schools (“Control”). The exercise will show how propensity score matching attempts to reduce this issue (Note: it does not solve it completely).

Note: This assignment heavily uses code from Dr. Filipski’s HW11 Key which was distributed in class.

Part A: Understanding and running PSM

1. Download the data

• From here: http://www.mfilipski.com/files/ecls.csv (Data is originally form here)

• For the purpose of questions 1 - 6, make a smaller dataset which only contains the variables:

  • c5r2mtsc_std: Standardized math scores
  • catholic: Whether the school is catholic
  • w3income_1k: Family income in thousands
  • p5numpla: Number of places the student has lived for at least 4 months
  • w3momed_hsb: Is the mother’s education level high-school or below (1) or some college or more (0)?
setwd("~/Library/CloudStorage/OneDrive-UniversityofGeorgia/4th Sem PhD/Adv Econometric Applications_Filipski/Assignments/HW11a")
library(readr)
library(dplyr)
ecls <- read_csv("ecls.csv")

#making dataset smaller
ecls_small <- select(ecls, c5r2mtsc_std, catholic, w3income_1k, p5numpla, w3momed_hsb)

2. Identify the problem:

• Do test scores on c5r2mtsc_std differ for pupils of catholic and non-catholic schools on average?

library(stats)
ecls_small %>%
  group_by(catholic) %>%
  summarise_at(vars(c5r2mtsc_std), list(name = mean))
## # A tibble: 2 × 2
##   catholic  name
##      <dbl> <dbl>
## 1        0 0.163
## 2        1 0.220
t.test(c5r2mtsc_std ~ catholic, data = ecls_small)
## 
##  Welch Two Sample t-test
## 
## data:  c5r2mtsc_std by catholic
## t = -1.7866, df = 1468.1, p-value = 0.07421
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -0.118653665  0.005539377
## sample estimates:
## mean in group 0 mean in group 1 
##       0.1631279       0.2196851

Simply comparing the means: one going to catholic school score higher (0.219) in math compared to the non-Catholics (0.163). Upon t-test, we can see that the null of equal means is rejected at 10% level.

• Can you conclude that going to a catholic school leads to higher math scores? Why or why not?

Just looking into the mean values, we can not confirm this result. T-test somehow says they are different, but not with higher degree of confidence. We might be referring to the spurious correlation. This may hold true even after controlling for all other confounders, but yet, we don’t know. So we cannot conclude that.

• Run regressions: – A regression of just math scores on the catholic dummy

library(stargazer)
model1 <- lm(c5r2mtsc_std ~ catholic, data=ecls_small)
stargazer (model1,
           type="text",
           align=TRUE,
           no.space=TRUE,
           covariate.labels = c("catholic=1"),
           title="TABLE: OLS")
## 
## TABLE: OLS
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                            c5r2mtsc_std        
## -----------------------------------------------
## catholic=1                     0.057           
##                               (0.034)          
## Constant                     0.163***          
##                               (0.014)          
## -----------------------------------------------
## Observations                   5,429           
## R2                            0.0005           
## Adjusted R2                   0.0003           
## Residual Std. Error      0.955 (df = 5427)     
## F Statistic            2.704 (df = 1; 5427)    
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

Notice that if we would have concluded that the Catholics’ math score are significantly different, that would have been a false conclusion. We may end up getting significant effect upon controlling other variables, but as of now, we can not say they are statistically different.

– Another regression that includes all the variables listed above

model2 <- lm(c5r2mtsc_std ~ catholic + w3income_1k + p5numpla+ w3momed_hsb, data=ecls_small)
stargazer (model2,
           type="text",
           align=TRUE,
           no.space=TRUE,
           covariate.labels = c("catholic=1", "family income(1000s)", "total places lived", "mother education below HS=1"),
           title="TABLE: OLS 2")
## 
## TABLE: OLS 2
## =======================================================
##                                 Dependent variable:    
##                             ---------------------------
##                                    c5r2mtsc_std        
## -------------------------------------------------------
## catholic=1                           -0.127***         
##                                       (0.033)          
## family income(1000s)                 0.005***          
##                                      (0.0003)          
## total places lived                   -0.101***         
##                                       (0.036)          
## mother education below HS=1          -0.374***         
##                                       (0.027)          
## Constant                               0.073           
##                                       (0.049)          
## -------------------------------------------------------
## Observations                           5,429           
## R2                                     0.124           
## Adjusted R2                            0.124           
## Residual Std. Error              0.894 (df = 5424)     
## F Statistic                  192.272*** (df = 4; 5424) 
## =======================================================
## Note:                       *p<0.1; **p<0.05; ***p<0.01

– Comment on the results WOW! You see the completely opposite relation upon controlling for observables. Everything else constant, the catholics score less than thr non-catholic counterparts.

• Do the means of the covariates (income, mother’s education etc) differ between catholic vs. non-catholic schools?

– Compare the means

ecls_small %>%
  group_by(catholic) %>%
  summarise_all(mean)
## # A tibble: 2 × 5
##   catholic c5r2mtsc_std w3income_1k p5numpla w3momed_hsb
##      <dbl>        <dbl>       <dbl>    <dbl>       <dbl>
## 1        0        0.163        65.4     1.11       0.392
## 2        1        0.220        86.2     1.07       0.205

Again, notice that catholic school affiliated observations have higher income, lower number of places the student lived for at least 4 months, and higher proportion of mothers educated up to college or more. But are these difference significant?

– Test the difference in means for statistical significance:

ttest_inc <- t.test(w3income_1k ~ catholic, data = ecls_small)
ttest_places <- t.test(p5numpla ~ catholic, data = ecls_small)
ttest_meduc <- t.test(w3momed_hsb ~ catholic, data = ecls_small)
ttest_inc$p.value
## [1] 1.212805e-37
ttest_places$p.value
## [1] 0.001791557
ttest_meduc$p.value
## [1] 1.530072e-33

– Discuss what would be the problems with an identification strategy based on simple regressions:

From t-test, for all three variables, the mean among catholics and non-catholics are significantly different. At the baseline, the observables are not equal. It means it does not mimics the randomization-type setting. Even after we control for all these variables, the effect of catholics variable remains biased.

3. Estimate the propensity to go to catholic school, by hand

• Run a logit regression to predict catholic based on the covariates we kept (but not the math scores, of course).

logit <- glm(catholic ~  w3income_1k  + p5numpla + w3momed_hsb, family = binomial(), data = ecls)
summary(logit)
## 
## Call:
## glm(formula = catholic ~ w3income_1k + p5numpla + w3momed_hsb, 
##     family = binomial(), data = ecls)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0172  -0.6461  -0.5240  -0.4122   2.3672  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.6702950  0.1576164 -10.597  < 2e-16 ***
## w3income_1k  0.0077921  0.0008115   9.602  < 2e-16 ***
## p5numpla    -0.2774346  0.1232930  -2.250   0.0244 *  
## w3momed_hsb -0.6504757  0.0915710  -7.104 1.22e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4972.4  on 5428  degrees of freedom
## Residual deviance: 4750.6  on 5425  degrees of freedom
## AIC: 4758.6
## 
## Number of Fisher Scoring iterations: 4

• Make a prediction for who is likely to go to catholic school, using this model

df_propensity <- data.frame(propensity_score = predict(logit, type = "response"),
                     catholic = logit$model$catholic)

• Check that there is common support.

– (Meaning check both groups have predicted values in overlapping ranges) – We may check this visually. For instance: we can do this with two histograms side-by-side. Or plot densities.

ecls$df_propensity = df_propensity$propensity_score

– Plot the income variable against the logit prediction (by catholic), and add the lowess smooth densities to the plot. They should look similar, but not perfectly aligned.

library(ggplot2)
ggplot(ecls, aes(x = df_propensity, y = w3income_1k, color = as.factor(catholic))) +
  geom_point(alpha = 0.2, size = 1.3) +
  geom_smooth(method = "loess", se = F) +
  xlab("Prediction")

#densities
bal <- ggplot(df_propensity, aes(propensity_score, fill = as.factor(catholic))) + 
  stat_density(aes(y = ..density..),position = "identity" 
               , color = "black", alpha=0.5)
bal

We can clearly see the overlapping section of these density curves. Therefore, yes, it has the common support.

4. Run a Matching algorithm to get a balanced dataset

• Create a matched dataset:

– Run a MatchIt command with the same equation as your logit. (which by default runs a logit just like the one you did, then matches each observation of the treated group with another from the control group.) – Follow that by match.data(your_matchit_output) to create your matched dataset

library(MatchIt)
psm <- matchit(catholic ~  w3income_1k  + p5numpla + w3momed_hsb,
                     method = "nearest",  data = ecls)
psm_data <- match.data(psm)

– Look at the dimensions of what you just created (say, dta_psm). What can you conclude about the number of observations that matched?

dim(psm_data)
## [1] 1860   27

The sample size has been reduced in the psm_data, because for rest of the observations there are no comparable groups. Only 1860 observations have their pairs. It is not necessary that there are only single comparable control for each treated observation.

• Sort the data by distance, and show the 10 first observations. – Notice how observations were matched on distance?

head(arrange(psm_data, distance), 10)
## # A tibble: 10 × 27
##    childid  catholic race         race_white race_black race_hispanic race_asian
##    <chr>       <dbl> <chr>             <dbl>      <dbl>         <dbl>      <dbl>
##  1 1108020C        0 HISPANIC, R…          0          0             1          0
##  2 1117012C        1 BLACK OR AF…          0          1             0          0
##  3 0040042C        0 WHITE, NON-…          1          0             0          0
##  4 0485014C        1 WHITE, NON-…          1          0             0          0
##  5 0436007C        0 WHITE, NON-…          1          0             0          0
##  6 0529004C        1 HISPANIC, R…          0          0             1          0
##  7 0136024C        0 WHITE, NON-…          1          0             0          0
##  8 0184013C        1 AMERICAN IN…          0          0             0          0
##  9 0058014C        0 HISPANIC, R…          0          0             1          0
## 10 0120004C        1 HISPANIC, R…          0          0             1          0
## # ℹ 20 more variables: p5numpla <dbl>, p5hmage <dbl>, p5hdage <dbl>,
## #   w3daded <chr>, w3momed <chr>, w3daded_hsb <dbl>, w3momed_hsb <dbl>,
## #   w3momscr <dbl>, w3dadscr <dbl>, w3inccat <chr>, w3income <dbl>,
## #   w3povrty <dbl>, p5fstamp <dbl>, c5r2mtsc <dbl>, c5r2mtsc_std <dbl>,
## #   w3income_1k <dbl>, df_propensity <dbl>, distance <dbl>, weights <dbl>,
## #   subclass <fct>

Notice that for those observations with same distance are assigned the same subclass. One of the two observations acts as treatment and another as control.

• In this new dataset, do the means of the covariates (income, etc) differ between catholic vs. non-catholic schools?

– Compare the means

psm_data %>%
  group_by(catholic) %>%
  summarise_all(mean)
## # A tibble: 2 × 27
##   catholic childid  race race_white race_black race_hispanic race_asian p5numpla
##      <dbl>   <dbl> <dbl>      <dbl>      <dbl>         <dbl>      <dbl>    <dbl>
## 1        0      NA    NA      0.705     0.0710        0.0882     0.0828     1.07
## 2        1      NA    NA      0.767     0.0366        0.116      0.0462     1.07
## # ℹ 19 more variables: p5hmage <dbl>, p5hdage <dbl>, w3daded <dbl>,
## #   w3momed <dbl>, w3daded_hsb <dbl>, w3momed_hsb <dbl>, w3momscr <dbl>,
## #   w3dadscr <dbl>, w3inccat <dbl>, w3income <dbl>, w3povrty <dbl>,
## #   p5fstamp <dbl>, c5r2mtsc <dbl>, c5r2mtsc_std <dbl>, w3income_1k <dbl>,
## #   df_propensity <dbl>, distance <dbl>, weights <dbl>, subclass <dbl>

– Test the difference in means for statistical significance

ttest_inc2 <- t.test(w3income_1k ~ catholic, data = psm_data)
ttest_places2 <- t.test(p5numpla ~ catholic, data = psm_data)
ttest_meduc2 <- t.test(w3momed_hsb ~ catholic, data = psm_data)
ttest_inc2$p.value
## [1] 1
ttest_places2$p.value
## [1] 1
ttest_meduc2$p.value
## [1] 1

• Reflect on what PSM did for your identification strategy. PSM selected a control group that is comparable to the observable features of the treated group. Now, once they are comparable (quasi-experimental setting), ceteris paribus assumption holds. But, it removes other data which it deems unnecessary to make a comparison or for which the comaprable observation is not found.

5. Visually check the balance in the new dataset:

• Plot the income variable against the propensity score (by catholic) and plot the loess smooth densities

ggplot(psm_data, aes(x = distance, y = w3income_1k, color = as.factor(catholic))) +
    geom_point(alpha = 0.2, size = 1.3) +
    geom_smooth(method = "loess", se = F) +
    xlab("Propensity score") 

– Verify that the means are nearly identical at each level of the pscore distribution Let’s do for one more:

ggplot(psm_data, aes(x = distance, y = w3momed_hsb, color = as.factor(catholic))) +
    geom_point(alpha = 0.2, size = 1.3) +
    geom_smooth(method = "loess", se = F) +
    xlab("Propensity score") 

This is almost the perfect match.

6. Compute the average treatment effect in your new sample

• Compare means of the two groups.

library(stats)
psm_data %>%
  group_by(catholic) %>%
  summarise_at(vars(c5r2mtsc_std), list(name = mean))
## # A tibble: 2 × 2
##   catholic  name
##      <dbl> <dbl>
## 1        0 0.321
## 2        1 0.220
t.test(c5r2mtsc_std ~ catholic, data = psm_data)
## 
##  Welch Two Sample t-test
## 
## data:  c5r2mtsc_std by catholic
## t = 2.4547, df = 1849.4, p-value = 0.01419
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  0.02033367 0.18198232
## sample estimates:
## mean in group 0 mean in group 1 
##       0.3208431       0.2196851

According to the t-test, math score for catholic and non-catholics are significantly different.

• Run regressions: – A regression of just scores on the catholic dummy

library(stargazer)
model3 <- lm(c5r2mtsc_std ~ catholic, data=psm_data)
stargazer (model3,
           type="text",
           align=TRUE,
           no.space=TRUE,
           covariate.labels = c("catholic=1"),
           title="TABLE: OLS with PSM DATA")
## 
## TABLE: OLS with PSM DATA
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                            c5r2mtsc_std        
## -----------------------------------------------
## catholic=1                   -0.101**          
##                               (0.041)          
## Constant                     0.321***          
##                               (0.029)          
## -----------------------------------------------
## Observations                   1,860           
## R2                             0.003           
## Adjusted R2                    0.003           
## Residual Std. Error      0.889 (df = 1858)     
## F Statistic           6.025** (df = 1; 1858)   
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

Catholic schools have lower math score, which is statistically significant.

– Another regression that includes all the variables listed above

model4 <- lm(c5r2mtsc_std ~ catholic + w3income_1k + p5numpla+ w3momed_hsb, data=psm_data)
stargazer (model4,
           type="text",
           align=TRUE,
           no.space=TRUE,
           covariate.labels = c("catholic=1", "family income(1000s)", "total places lived", "mother education below HS=1"),
           title="TABLE: OLS 2 WITH PSM DATA")
## 
## TABLE: OLS 2 WITH PSM DATA
## =======================================================
##                                 Dependent variable:    
##                             ---------------------------
##                                    c5r2mtsc_std        
## -------------------------------------------------------
## catholic=1                           -0.101**          
##                                       (0.040)          
## family income(1000s)                 0.004***          
##                                      (0.0005)          
## total places lived                    -0.115           
##                                       (0.071)          
## mother education below HS=1          -0.297***         
##                                       (0.050)          
## Constant                               0.151           
##                                       (0.092)          
## -------------------------------------------------------
## Observations                           1,860           
## R2                                     0.074           
## Adjusted R2                            0.072           
## Residual Std. Error              0.857 (df = 1855)     
## F Statistic                  37.083*** (df = 4; 1855)  
## =======================================================
## Note:                       *p<0.1; **p<0.05; ***p<0.01

• Comment on the results of those regressions after the PSM procedure

This is just wow. Even after adding bunch of controls, the result on the parameter of interest is exaclty same. It means that we now have pair of observations that perfectly mimics the randomization like setting. “Everything else constant” is mimicked after we ran PSM. Therefore, it does not matter whether we control for other observables or not, the effect is stably same.

Part B: Deconstructing PSM

7. Split the PSM procedure

• Reproduce a PSM ‘by hand’: – Recall your logit-predicted values from part 3. – Verify that they are the same as the distance used by your matchit command. [Hint: just plot psm$distance against your logit predictions - assuming psm is the name of what came out of your matchit command]

#lets make a plot
plot(psm_data$distance, psm_data$df_propensity)

The prediction of logit and the distance calculated by matchit function are equivalent.

– Run your same matchit command but with distance= your logit predictions. See that this is the same.

#psm2_hand <- matchit(catholic ~  w3income_1k  + p5numpla + w3momed_hsb,
#                     distance=as.numeric(df_propensity), method= "nearest", data = ecls)

#psm2_data <- match.data(psm2_hand)

We are asked to define our own propensity score. Not sure why the distance=df_propensity did not worked.

• Reflect on how you can replace the logit stage by any other prediction “score”.

Logit is basically trying to predict the probability of catholic=1 in this case. We can substitute this prediction concern with multiple techniques ranging from probit and linear probability models to machine learning methods. Note that ML methods are very good in predictions. So, in the first stage where our concern is just to well-predict the model, we can use range of these methods.

8. Use Machine Learning for matching

• Run a LASSO procedure on the ORIGINAL dataset (the one before your select command in q1)

– Keep only numeric variables – Run the lasso on all the numeric covariates

numeric <- unlist(lapply(ecls, is.numeric))  
ymat <- as.matrix(ecls$catholic)
xall <- as.matrix(ecls[ , numeric])
xall <- xall[,2:dim(xall)[2]]
head(xall)
##      race_white race_black race_hispanic race_asian p5numpla p5hmage p5hdage
## [1,]          1          0             0          0        1      47      45
## [2,]          1          0             0          0        1      41      48
## [3,]          1          0             0          0        1      43      55
## [4,]          1          0             0          0        1      38      39
## [5,]          1          0             0          0        1      47      57
## [6,]          1          0             0          0        1      41      41
##      w3daded_hsb w3momed_hsb w3momscr w3dadscr w3income w3povrty p5fstamp
## [1,]           0           0    53.50     77.5  62500.5        0        0
## [2,]           0           0    34.95     53.5  45000.5        0        0
## [3,]           0           0    63.43     53.5  62500.5        0        0
## [4,]           0           0    53.50     53.5  87500.5        0        0
## [5,]           0           0    61.56     77.5 150000.5        0        0
## [6,]           0           0    38.18     53.5  62500.5        0        0
##      c5r2mtsc c5r2mtsc_std w3income_1k df_propensity
## [1,]   60.043    0.9817533     62.5005     0.1883576
## [2,]   56.280    0.5943775     45.0005     0.1683901
## [3,]   55.272    0.4906106     62.5005     0.1883576
## [4,]   64.604    1.4512779     87.5005     0.2199574
## [5,]   75.721    2.5956991    150.0005     0.3145556
## [6,]   54.248    0.3851966     62.5005     0.1883576

– Hint: You can use cv.glmnet followed by glmnet with the best lambda (use lambda.1se, not lambda.min) [lambda.1se the “most regularized model such that error is within one standard error of the minimum”] – Hint: The glmnet command likes to work with matrix objects, so make sure you make x into a matrix.

library(glmnet)
pick_lasso <- cv.glmnet(y=ymat, x=xall, alpha=1, family="binomial")
plot(pick_lasso)

We can see that the best model selected by LASSO has 15-16 variables only.

– List the variables selected by your LASSO model [hint: use coef() command]

coef(pick_lasso)
## 19 x 1 sparse Matrix of class "dgCMatrix"
##                          s1
## (Intercept)   -2.625412e+00
## race_white     1.058582e-01
## race_black     .           
## race_hispanic  .           
## race_asian     .           
## p5numpla       .           
## p5hmage        2.483002e-02
## p5hdage        1.035086e-03
## w3daded_hsb   -2.663184e-01
## w3momed_hsb   -2.889917e-01
## w3momscr       2.285426e-03
## w3dadscr       .           
## w3income       4.502920e-06
## w3povrty      -4.425720e-01
## p5fstamp       .           
## c5r2mtsc      -4.894784e-03
## c5r2mtsc_std  -2.520588e-08
## w3income_1k    1.302790e-04
## df_propensity  .

• Comment on the results:

It says that the non-zero variables like race_white, p5hmage, ……. etc should be included in the model. These variables are refined variables after shrinkage and regularization operation.

– If you used lambda.1se as your regularization parameter, how many variables would you have chosen to include?

best_lasso <- glmnet(y=ymat, x=xall, alpha=1, lambda=pick_lasso$lambda.1se)
coef(best_lasso)
## 19 x 1 sparse Matrix of class "dgCMatrix"
##                          s0
## (Intercept)    2.534294e-02
## race_white     1.257058e-02
## race_black     .           
## race_hispanic  .           
## race_asian     .           
## p5numpla       .           
## p5hmage        3.305478e-03
## p5hdage        7.539383e-05
## w3daded_hsb   -3.438121e-02
## w3momed_hsb   -3.379810e-02
## w3momscr       4.043497e-04
## w3dadscr       2.549573e-05
## w3income       7.859703e-07
## w3povrty      -2.400276e-02
## p5fstamp       .           
## c5r2mtsc      -6.884298e-04
## c5r2mtsc_std  -1.827005e-09
## w3income_1k    2.517531e-06
## df_propensity  .

– Which variables would you have chosen to run the PSM command?

The best_lasso model variables that has been selected after the regularization and shrinkage.

– Do these variables differ from those you used before?

Yes, they are different.

9. Use Lasso predictions for matching.

#generating the predictions from best_lasso model:
probability_lasso = predict(best_lasso, newx = xall)
psm_lasso <- matchit(catholic ~   w3income_1k  + p5numpla + w3momed_hsb, 
                     distance = as.numeric(probability_lasso), method= "nearest", data = ecls)
psm_data_lasso <- match.data(psm_lasso)


model5 <- lm(c5r2mtsc_std ~ catholic +
                  w3income_1k + p5numpla + w3momed_hsb, data = psm_data_lasso)
stargazer (model5,
           type="text",
           align=TRUE,
           no.space=TRUE,
           title="TABLE: OLS")
## 
## TABLE: OLS
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                            c5r2mtsc_std        
## -----------------------------------------------
## catholic                     -0.156***         
##                               (0.040)          
## w3income_1k                  0.004***          
##                              (0.0004)          
## p5numpla                      -0.076           
##                               (0.063)          
## w3momed_hsb                  -0.363***         
##                               (0.051)          
## Constant                      0.192**          
##                               (0.085)          
## -----------------------------------------------
## Observations                   1,860           
## R2                             0.087           
## Adjusted R2                    0.085           
## Residual Std. Error      0.851 (df = 1855)     
## F Statistic          44.156*** (df = 4; 1855)  
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

When LASSO is used in the prediction stage (it means, the second stage will use the prediction score from first stage as distance), then the effect of catholic is still significant, but lesser than what immediate past model gave.