Background:

This assignment will take you through a Propensity Score Matching exercise, with a Machine Learning twist at the end. The example data contains test results of school children along with various characteristics.

The key question is: Do catholic schools have an effect on student’s performance?

The key issus is, of course, that students in catholic schools (“Treated”) might be different from those in non-catholic schools (“Control”). The exercise will show how propensity score matching attempts to reduce this issue (Note: it does not solve it completely).

Part A: Understanding and running PSM

1. Download the data

• From here: http://www.mfilipski.com/files/ecls.csv (Data is originally form here)

• For the purpose of questions 1 - 6, make a smaller dataset which only contains the variables:
c5r2mtsc_std: Standardized math scores
catholic: Whether the school is catholic
w3income_1k: Family income in thousands
p5numpla: Number of places the student has lived for at least 4 months
w3momed_hsb: Is the mother’s education level high-school or below (1) or some college or more (0)?
[hint: use select()]

library(csvread)
library(dplyr)
library(stargazer)
library(ggplot2)

ecls_full <- read.csv2("~/Downloads/AAEC 8610/R Working Directory/HW12/ecls.csv", sep = ",",
                       stringsAsFactors = TRUE)

ecls <- select(ecls_full, c5r2mtsc_std, catholic, w3income_1k, p5numpla, w3momed_hsb)


ecls$c5r2mtsc_std <- as.numeric(ecls$c5r2mtsc_std)
ecls$catholic <- as.numeric(ecls$catholic)
ecls$w3income_1k <- as.numeric(ecls$w3income_1k)
ecls$p5numpla <- as.numeric(ecls$p5numpla)
ecls$w3momed_hsb <- as.numeric(ecls$w3momed_hsb)
head(ecls)
##   c5r2mtsc_std catholic w3income_1k p5numpla w3momed_hsb
## 1         3929        0          11        1           0
## 2         3274        0           9        1           0
## 3         3074        0          11        1           0
## 4         4532        1          13        1           0
## 5         4953        0           2        1           0
## 6         2862        0          11        1           0

2. Identify the problem
• Do test scores on c5r2mtsc_std differ for pupils of catholic and non-catholic schools on average?
• Can you conclude that going to a catholic school leads to higher math scores? Why or why not?

summary <-ecls %>%
  group_by(catholic) %>%
  summarise(n_students = n(),
            mean_math = mean(c5r2mtsc_std),
            std_error = sd(c5r2mtsc_std) / sqrt(n()))
View(summary)

The test score on c5r2mtsc_std differ for pupils of catholic and non-catholic schools on average. I am unable to conclude that going to a catholic school leads to higher math scores because of the existence of “self-selection bias”. Parents who decides to send their children to catholic schools are very likely to be different than parents who decides not to due to various factors such as religious background and household income.


• Run regressions:
– A regression of just math scores on the catholic dummy

– Another regression that includes all the variables listed above

– Comment on the results

lm_naive <- lm(c5r2mtsc_std ~ catholic, data = ecls)

stargazer(lm_naive, type = "text",
          align = TRUE,  
          keep.stat = c("n", "rsq", "adj.rsq"),
          table.layout = "=d=t-s=n") 
## 
## ========================================
##                     c5r2mtsc_std        
## ========================================
## catholic               -27.275          
##                       (51.415)          
##                                         
## Constant            2,483.336***        
##                       (21.280)          
##                                         
## ----------------------------------------
## Observations            5,429           
## R2                     0.0001           
## Adjusted R2            -0.0001          
## ========================================
## Note:        *p<0.1; **p<0.05; ***p<0.01
lm_controls <- lm(c5r2mtsc_std ~ catholic  +
                    w3income_1k + p5numpla + w3momed_hsb, data = ecls)

stargazer(lm_controls, type = "text",
          align = TRUE,  
          keep.stat = c("n", "rsq", "adj.rsq"), 
          table.layout = "=d=t-s=n") 
## 
## ========================================
##                     c5r2mtsc_std        
## ========================================
## catholic             -149.619***        
##                       (50.824)          
##                                         
## w3income_1k            -4.155           
##                        (5.045)          
##                                         
## p5numpla              -94.000*          
##                       (55.513)          
##                                         
## w3momed_hsb          -650.169***        
##                       (39.937)          
##                                         
## Constant            2,877.101***        
##                       (81.088)          
##                                         
## ----------------------------------------
## Observations            5,429           
## R2                      0.047           
## Adjusted R2             0.047           
## ========================================
## Note:        *p<0.1; **p<0.05; ***p<0.01

The second regression indicates that in addition to whether student attends a catholic school, family income, number of places students have previously lived and mother’s educational level have impacts on mean standardized math scores and the results are all significant at 1% level.

• Do the means of the covariates (income, mother’s education etc) differ between catholic vs. non-catholic schools? You can:

– Compare the means [Hint: summarise_all()]

– Test the difference in means for statistical significance [Hint: t.test]

– Discuss what would be the problems with an identification strategy based on simple regressions.

summarise_mean <- ecls %>%
  group_by(catholic) %>%
  summarise_all(mean)
print(summarise_mean)
## # A tibble: 2 × 5
##   catholic c5r2mtsc_std w3income_1k p5numpla w3momed_hsb
##      <dbl>        <dbl>       <dbl>    <dbl>       <dbl>
## 1        0        2483.        8.35     1.11       0.392
## 2        1        2456.        8.91     1.07       0.205
t1 <- t.test(ecls[, "w3income_1k"] ~ ecls[, 'catholic'])
t2 <- t.test(ecls[, "p5numpla"] ~ ecls[, 'catholic'])
t3 <- t.test(ecls[, "w3momed_hsb"] ~ ecls[, 'catholic'])

print(t1$p.value); print(t2$p.value); print(t3$p.value)
## [1] 0.0001468243
## [1] 0.001791557
## [1] 1.530072e-33

The difference in means of the covariates between catholic versus non-catholic schools in terms of income, mother’s educational background and the number of places students have previously lived are all statistically significant. A bias is likely to exist since the covariates are statistically significant.

3. Estimate the propensity to go to catholic school, by hand

• Run a logit regression to predict catholic based on the covariates we kept (but not the math scores, of course).
– [Hint: you can use the glm command and the option family = binomial() ]

cath_logit <- glm(catholic ~  w3income_1k  + p5numpla + w3momed_hsb,
                  family = binomial(), data = ecls)
summary(cath_logit)
## 
## Call:
## glm(formula = catholic ~ w3income_1k + p5numpla + w3momed_hsb, 
##     family = binomial(), data = ecls)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.260122   0.167404  -7.527 5.17e-14 ***
## w3income_1k  0.029658   0.009715   3.053  0.00227 ** 
## p5numpla    -0.285348   0.123301  -2.314  0.02065 *  
## w3momed_hsb -0.893003   0.087035 -10.260  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4972.4  on 5428  degrees of freedom
## Residual deviance: 4830.6  on 5425  degrees of freedom
## AIC: 4838.6
## 
## Number of Fisher Scoring iterations: 4

• Make a prediction for who is likely to go to catholic school, using this model – [Hint: use the predict command]

prs_df <- data.frame(pr_score = predict(cath_logit, type = "response"),
                     catholic = cath_logit$model$catholic)
head(prs_df)
##    pr_score catholic
## 1 0.2280715        0
## 2 0.2177974        0
## 3 0.2280715        0
## 4 0.2386824        1
## 5 0.1844996        0
## 6 0.2280715        0
ecls$prs_df = prs_df$pr_score
ggplot(ecls, aes(x = prs_df, y = w3income_1k, color = as.factor(catholic))) +
    geom_point(alpha = 0.2, size = 1.3) +
    geom_smooth(method = "loess", se = F) +
    xlab("Logit prediction") 

• Check that there is common support.

– (Meaning check both groups have predicted values in overlapping ranges)

– You may check this visually. For instance: you can do this with two histograms side-by-side. Or plot densities.

– Plot the income variable against the logit prediction (by catholic), and add the lowess smooth densities to the plot. They should look similar, but not perfectly aligned

labs <- paste("Actual school type attended:", c("Catholic", "Public"))
prs_df %>%
  mutate(catholic = ifelse(catholic == 1, labs[1], labs[2])) %>%
  ggplot(aes(x = pr_score)) +
  geom_histogram(color = "white") +
  facet_wrap(~catholic) +
  xlab("Probability of going to Catholic school") +
  theme_bw()

4. Run a Matching algorithm to get the impact of catholic schools by PSM

• Run a MatchIt command, which runs a logit just like the one you did, then matches each observation of the treated group with another from the control group. Store that as psm.

• Follow that by match.data(psm) and call it psm_data

library(MatchIt)
library(glmnet)
psm <- matchit(catholic ~  w3income_1k  + p5numpla + w3momed_hsb,
                     method = "nearest",  data = ecls)
# This is the new dataframe with matched data
psm_data <- match.data(psm)

• Look at the dimensions of dta_psm What can you conclude about the number of observations that matched?

dim(psm_data) # 2704 = 1352 pairs were matched
## [1] 1860    9

• Sort the data by distance, and show the 10 first observations. Do you see how observations were matched on distance?

head(arrange(psm_data, distance), 10)
##    c5r2mtsc_std catholic w3income_1k p5numpla w3momed_hsb     prs_df   distance
## 1          2819        0           2        2           1 0.06510058 0.06510058
## 2          3440        1           2        2           1 0.06510058 0.06510058
## 3          1188        0           2        2           1 0.06510058 0.06510058
## 4           593        1           2        2           1 0.06510058 0.06510058
## 5          1945        0           3        2           1 0.06692909 0.06692909
## 6          2022        1           3        2           1 0.06692909 0.06692909
## 7          2353        0          13        3           1 0.06763392 0.06763392
## 8          3653        1          13        3           1 0.06763392 0.06763392
## 9          2431        0           5        2           1 0.07072986 0.07072986
## 10         3147        1           5        2           1 0.07072986 0.07072986
##    weights subclass
## 1        1      506
## 2        1      506
## 3        1      584
## 4        1      584
## 5        1      598
## 6        1      598
## 7        1      463
## 8        1      463
## 9        1      222
## 10       1      222

• In this new dataset, statistically test the means of the covariates (income, etc) between catholic vs. non-catholic schools? Is there a statistically significant difference? What does that tell you about your identification strategy?

• Do the means of the covariates (income, etc) between catholic vs. non-catholic schools? (hint: summarise_all) Test the difference in means for statistical significance. Reflect on what PSM did for your identification strategy.

psm_data %>%
  group_by(catholic) %>%
  summarise_all(mean)
## # A tibble: 2 × 9
##   catholic c5r2mtsc_std w3income_1k p5numpla w3momed_hsb prs_df distance weights
##      <dbl>        <dbl>       <dbl>    <dbl>       <dbl>  <dbl>    <dbl>   <dbl>
## 1        0        2636.        8.91     1.07       0.205  0.191    0.191       1
## 2        1        2456.        8.91     1.07       0.205  0.191    0.191       1
## # ℹ 1 more variable: subclass <dbl>

Significant tests:

t1 <- t.test(psm_data[, "w3income_1k"] ~ psm_data[, 'catholic'])
t2 <- t.test(psm_data[, "p5numpla"] ~ psm_data[, 'catholic'])
t3 <- t.test(psm_data[, "w3momed_hsb"] ~ psm_data[, 'catholic'])
t1$p.value; t2$p.value; t3$p.value
## [1] 1
## [1] 1
## [1] 1

5. Visually check the balance in the new dataset:

• Plot the income variable against the propensity score (by catholic) and plot the lowess smooth densities

• Verify that the means are nearly identical at each level of the pscore distribution

ggplot(psm_data, aes(x = distance, y = w3momed_hsb, color = as.factor(catholic))) +
    geom_point(alpha = 0.2, size = 1.3) +
    geom_smooth(method = "loess", se = F) +
    xlab("Propensity score") 

g1 <- ggplot(psm_data, aes(x = distance, y = w3momed_hsb, color = as.factor(catholic), shape = as.factor(catholic))) +
    geom_point(data = subset(psm_data, catholic == 0), size = 1.9, color = "red") +
    geom_point(alpha = 0.2, size = 1.3) +
    geom_smooth(method = "loess", se = F) +
    xlab("Propensity score") 
g1

6.Compute the average treatment effect in your new sample

• Use the same two regressions as in the very beginning. How do results differ from before?

psm_data %>%
  group_by(catholic) %>%
  summarise(n_students = n(),
            mean_math = mean(c5r2mtsc_std),
            std_error = sd(c5r2mtsc_std) / sqrt(n_students))
## # A tibble: 2 × 4
##   catholic n_students mean_math std_error
##      <dbl>      <int>     <dbl>     <dbl>
## 1        0        930     2636.      47.5
## 2        1        930     2456.      46.8
lm_psm_naive <- lm(c5r2mtsc_std ~ catholic, data = psm_data)
stargazer(lm_psm_naive,type = "text",
          align = TRUE,  
          keep.stat = c("n", "rsq", "adj.rsq"), 
          table.layout = "=d=t-s=n") 
## 
## ========================================
##                     c5r2mtsc_std        
## ========================================
## catholic             -179.890***        
##                       (66.676)          
##                                         
## Constant            2,635.952***        
##                       (47.147)          
##                                         
## ----------------------------------------
## Observations            1,860           
## R2                      0.004           
## Adjusted R2             0.003           
## ========================================
## Note:        *p<0.1; **p<0.05; ***p<0.01
lm_psm_controls <- lm(c5r2mtsc_std ~ catholic +
                  w3income_1k + p5numpla + w3momed_hsb, data = psm_data)
stargazer(lm_psm_controls,type = "text",
          align = TRUE,  
          keep.stat = c("n", "rsq", "adj.rsq"), 
          table.layout = "=d=t-s=n") 
## 
## ========================================
##                     c5r2mtsc_std        
## ========================================
## catholic             -179.890***        
##                       (65.702)          
##                                         
## w3income_1k          -21.083***         
##                        (7.987)          
##                                         
## p5numpla               -39.442          
##                       (117.402)         
##                                         
## w3momed_hsb          -586.530***        
##                       (81.358)          
##                                         
## Constant            2,986.554***        
##                       (156.024)         
##                                         
## ----------------------------------------
## Observations            1,860           
## R2                      0.034           
## Adjusted R2             0.032           
## ========================================
## Note:        *p<0.1; **p<0.05; ***p<0.01

Run a LASSO procedure on the ORIGINAL dataset (the one with all the variables) Hints: You can use cv.glmnet followed by glmnet with the best lambda (use lambda.1se, not lambda.min) (lambda.1se the “most regularized model such that error is within one standard error of the minimum”) This command likes to have x in matrix form, so make sure you make x into a matrix and only keep the numeric variables.

numeric <- unlist(lapply(ecls_full, is.numeric))  
ymat <- as.matrix(ecls_full$catholic)
xall <- as.matrix(ecls_full[ , numeric])
xall <- xall[,2:dim(xall)[2]]
head(xall)
##      race_white race_black race_hispanic race_asian p5numpla p5hmage p5hdage
## [1,]          1          0             0          0        1      47      45
## [2,]          1          0             0          0        1      41      48
## [3,]          1          0             0          0        1      43      55
## [4,]          1          0             0          0        1      38      39
## [5,]          1          0             0          0        1      47      57
## [6,]          1          0             0          0        1      41      41
##      w3daded_hsb w3momed_hsb w3povrty p5fstamp
## [1,]           0           0        0        0
## [2,]           0           0        0        0
## [3,]           0           0        0        0
## [4,]           0           0        0        0
## [5,]           0           0        0        0
## [6,]           0           0        0        0
pick_lasso <- cv.glmnet(y=ymat, x=xall, alpha=1, family="binomial")
plot(pick_lasso)

best_lasso <- glmnet(y=ymat, x=xall, alpha=1, lambda=pick_lasso$lambda.1se)

coef(best_lasso)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                         s0
## (Intercept)    0.067076396
## race_white     0.010771864
## race_black     .          
## race_hispanic  .          
## race_asian     .          
## p5numpla       .          
## p5hmage        0.003504172
## p5hdage        .          
## w3daded_hsb   -0.043277882
## w3momed_hsb   -0.040207289
## w3povrty      -0.039297203
## p5fstamp       .

Use Lasso predictions for matching

pr_lasso = predict(best_lasso, newx = xall)
psm_lasso <- matchit(catholic ~   w3income_1k  + p5numpla + w3momed_hsb, 
                     distance = as.numeric(pr_lasso), method= "nearest", data = ecls)
psm_data_lasso <- match.data(psm_lasso)
lm_psm_lasso <- lm(c5r2mtsc_std ~ catholic +
                  w3income_1k + p5numpla + w3momed_hsb, data = psm_data_lasso)
                  
stargazer(lm_psm_lasso, type = "text",
          align = TRUE,  
          keep.stat = c("n", "rsq", "adj.rsq"), 
          table.layout = "=d=t-s=n") 
## 
## ========================================
##                     c5r2mtsc_std        
## ========================================
## catholic             -274.354***        
##                       (65.768)          
##                                         
## w3income_1k            -12.228          
##                        (8.192)          
##                                         
## p5numpla               -54.433          
##                       (104.386)         
##                                         
## w3momed_hsb          -540.369***        
##                       (81.426)          
##                                         
## Constant            3,008.739***        
##                       (146.368)         
##                                         
## ----------------------------------------
## Observations            1,860           
## R2                      0.033           
## Adjusted R2             0.031           
## ========================================
## Note:        *p<0.1; **p<0.05; ***p<0.01