1. Download the data

Make a smaller dataset which only contains the variables:
c5r2mtsc_std: Standardized math scores
catholic: Standardized math scores
w3income_1k: Family income in thousands
p5numpla: Number of places the student has lived for at least 4 months
w3momed_hsb: Is the mother’s education level high-school or below (1) or some college or more (0)?

library(dplyr) 
library(ggplot2)
library(MatchIt)
library(glmnet)
ecls <- read.csv("C:/Users/SHOUY/Desktop/AAEC8610/code/hw10/ecls.csv")
keep <- c("c5r2mtsc_std","catholic","w3income_1k","p5numpla","w3momed_hsb")
small <- ecls[,(names(ecls) %in% keep)]

2.Identify the problem

Do test scores on c5r2mtsc_std differ for pupils of catholic and non-catholic schools on average?

Answer:The test sores are different on average in the catholic and non-catholic schools, but quite close. The summary table is shown as below.

small %>%                           
  group_by(catholic) %>%
    summarize(n_students= n(), 
            mean_math = mean(c5r2mtsc_std),
            std_error = sd(c5r2mtsc_std))
## # A tibble: 2 x 4
##   catholic n_students mean_math std_error
##      <int>      <int>     <dbl>     <dbl>
## 1        0       4499     0.163     0.974
## 2        1        930     0.220     0.858

Can you conclude that going to a catholic school leads to higher math scores? Why or why not?
A regression of just math scores on the catholic dummy

reg1<-lm(c5r2mtsc_std~catholic,small)
summary(reg1)
## 
## Call:
## lm(formula = c5r2mtsc_std ~ catholic, data = small)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2667 -0.6082  0.0538  0.6292  3.2705 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.16313    0.01424  11.459   <2e-16 ***
## catholic     0.05656    0.03440   1.644      0.1    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9549 on 5427 degrees of freedom
## Multiple R-squared:  0.000498,   Adjusted R-squared:  0.0003138 
## F-statistic: 2.704 on 1 and 5427 DF,  p-value: 0.1002

Another regression that includes all the variables listed above.

reg2<-lm(c5r2mtsc_std~catholic+w3income_1k+p5numpla+w3momed_hsb,small)
summary(reg2)
## 
## Call:
## lm(formula = c5r2mtsc_std ~ catholic + w3income_1k + p5numpla + 
##     w3momed_hsb, data = small)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3922 -0.5696  0.0372  0.5986  3.2572 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.0733290  0.0491661   1.491 0.135900    
## catholic    -0.1273899  0.0328888  -3.873 0.000109 ***
## w3income_1k  0.0053247  0.0003026  17.595  < 2e-16 ***
## p5numpla    -0.1009432  0.0355779  -2.837 0.004567 ** 
## w3momed_hsb -0.3740355  0.0271962 -13.753  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8941 on 5424 degrees of freedom
## Multiple R-squared:  0.1242, Adjusted R-squared:  0.1235 
## F-statistic: 192.3 on 4 and 5424 DF,  p-value: < 2.2e-16

Answer: In the first regression,the coefficient of catholic is not significant in the regression result. But in the second regression, the coefficient of catholic= -0.1273899is now statistically significant. So we cannot conclude that going to a catholic school leads to higher math scores since coefficient is significant negative.

Do the means of the covariates (income, etc) between catholic vs. non-catholic schools?

small %>%                           
  group_by(catholic) %>%
    summarise_all(mean)
## # A tibble: 2 x 5
##   catholic p5numpla w3momed_hsb c5r2mtsc_std w3income_1k
##      <int>    <dbl>       <dbl>        <dbl>       <dbl>
## 1        0     1.11       0.392        0.163        65.4
## 2        1     1.07       0.205        0.220        86.2
cath<-small[which(small$catholic==1),] 
ncath<-small[which(small$catholic==0),] 
t.test(cath$p5numpla,ncath$p5numpla)
## 
##  Welch Two Sample t-test
## 
## data:  cath$p5numpla and ncath$p5numpla
## t = -3.128, df = 1600.4, p-value = 0.001792
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.05390038 -0.01235472
## sample estimates:
## mean of x mean of y 
##  1.073118  1.106246
t.test(cath$w3momed_hsb,ncath$w3momed_hsb)
## 
##  Welch Two Sample t-test
## 
## data:  cath$w3momed_hsb and ncath$w3momed_hsb
## t = -12.362, df = 1545.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.2165946 -0.1572715
## sample estimates:
## mean of x mean of y 
## 0.2053763 0.3923094
t.test(cath$w3income_1k,ncath$w3income_1k)
## 
##  Welch Two Sample t-test
## 
## data:  cath$w3income_1k and ncath$w3income_1k
## t = 13.238, df = 1314.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  17.70620 23.86719
## sample estimates:
## mean of x mean of y 
##  86.18063  65.39393

Answer: The means are all performed significant differences in the variables in two groups. The identification strategy problem is that the differences of dependent variables in catholic and non-catholic schools can be driven by the differences of other variables in two groups rather than only dummy variable.

3. Estimate the propensity to go to catholic school, by hand

reg3<-glm(catholic ~ w3income_1k + p5numpla + w3momed_hsb,family = binomial(), data = small)
summary(reg3)
## 
## Call:
## glm(formula = catholic ~ w3income_1k + p5numpla + w3momed_hsb, 
##     family = binomial(), data = small)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0172  -0.6461  -0.5240  -0.4122   2.3672  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.6702950  0.1576164 -10.597  < 2e-16 ***
## w3income_1k  0.0077921  0.0008115   9.602  < 2e-16 ***
## p5numpla    -0.2774346  0.1232930  -2.250   0.0244 *  
## w3momed_hsb -0.6504757  0.0915710  -7.104 1.22e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4972.4  on 5428  degrees of freedom
## Residual deviance: 4750.6  on 5425  degrees of freedom
## AIC: 4758.6
## 
## Number of Fisher Scoring iterations: 4
small$pr.score<-predict(reg3,type = "response") 
head(small)
##   catholic p5numpla w3momed_hsb c5r2mtsc_std w3income_1k  pr.score
## 1        0        1           0    0.9817533     62.5005 0.1883576
## 2        0        1           0    0.5943775     45.0005 0.1683901
## 3        0        1           0    0.4906106     62.5005 0.1883576
## 4        1        1           0    1.4512779     87.5005 0.2199574
## 5        0        1           0    2.5956991    150.0005 0.3145556
## 6        0        1           0    0.3851966     62.5005 0.1883576
cath$pr.score<-predict(reg3,type = "response",newdata = cath) 
ncath$pr.score<-predict(reg3,type = "response",newdata = ncath) 
F1<-ggplot(data=small,aes(x = pr.score, y = w3income_1k))+
 geom_point(aes(color=factor(catholic)),size=1.2)+
 geom_smooth(aes(color=factor(catholic)))
F1

F2<-ggplot(data=small,aes(pr.score))+
 geom_density(aes(color=factor(catholic),fill=factor(catholic),alpha=0.5),size=1.2)
F2

4. Run a Matching algorithm to get a balanced dataset

psm<-matchit(catholic ~ w3income_1k + p5numpla + w3momed_hsb,data=small,distance="glm")
summary(psm)
## 
## Call:
## matchit(formula = catholic ~ w3income_1k + p5numpla + w3momed_hsb, 
##     data = small, distance = "glm")
## 
## Summary of Balance for All Data:
##             Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## distance           0.2034        0.1647          0.5198     0.9539    0.1217
## w3income_1k       86.1806       65.3939          0.4744     1.0647    0.1011
## p5numpla           1.0731        1.1062         -0.1182     0.6323    0.0066
## w3momed_hsb        0.2054        0.3923         -0.4627          .    0.1869
##             eCDF Max
## distance      0.2544
## w3income_1k   0.2478
## p5numpla      0.0265
## w3momed_hsb   0.1869
## 
## 
## Summary of Balance for Matched Data:
##             Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## distance           0.2034        0.2034               0          1         0
## w3income_1k       86.1806       86.1806               0          1         0
## p5numpla           1.0731        1.0731               0          1         0
## w3momed_hsb        0.2054        0.2054               0          .         0
##             eCDF Max Std. Pair Dist.
## distance           0               0
## w3income_1k        0               0
## p5numpla           0               0
## w3momed_hsb        0               0
## 
## Percent Balance Improvement:
##             Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
## distance                100        100       100      100
## w3income_1k             100        100       100      100
## p5numpla                100        100       100      100
## w3momed_hsb             100          .       100      100
## 
## Sample Sizes:
##           Control Treated
## All          4499     930
## Matched       930     930
## Unmatched    3569       0
## Discarded       0       0
psm_data <- match.data(psm)
dim(psm_data)
## [1] 1860    9
psm_data<-psm_data[order(psm_data$distance),]
head(psm_data, 10)
##      catholic p5numpla w3momed_hsb c5r2mtsc_std w3income_1k   pr.score
## 176         0        2           1   -0.3835843     17.5005 0.06069529
## 4274        1        2           1   -2.5922340     17.5005 0.06069529
## 41          0        2           1   -2.3690526     22.5005 0.06295488
## 2083        1        2           1    0.5271555     22.5005 0.06295488
## 561         0        2           1   -0.2195955     27.5005 0.06529274
## 2304        1        2           1   -1.6878765     27.5005 0.06529274
## 178         0        2           1   -0.2550080     32.5005 0.06771114
## 716         1        2           1   -1.2722942     32.5005 0.06771114
## 288         0        2           1   -0.7403859     37.5005 0.07021240
## 485         1        2           1   -0.7841369     37.5005 0.07021240
##        distance weights subclass
## 176  0.06069529       1      598
## 4274 0.06069529       1      598
## 41   0.06295488       1      222
## 2083 0.06295488       1      222
## 561  0.06529274       1      272
## 2304 0.06529274       1      272
## 178  0.06771114       1      859
## 716  0.06771114       1      859
## 288  0.07021240       1      728
## 485  0.07021240       1      728
psm_data %>%                           
  group_by(catholic) %>%
    summarise_all(mean)
## # A tibble: 2 x 9
##   catholic p5numpla w3momed_hsb c5r2mtsc_std w3income_1k pr.score distance
##      <int>    <dbl>       <dbl>        <dbl>       <dbl>    <dbl>    <dbl>
## 1        0     1.07       0.205        0.335        86.2    0.203    0.203
## 2        1     1.07       0.205        0.220        86.2    0.203    0.203
## # ... with 2 more variables: weights <dbl>, subclass <dbl>
psm_data1<-psm_data[which(psm_data$catholic==1),] 
psm_data0<-psm_data[which(psm_data$catholic==0),] 
t.test(psm_data1$p5numpla,psm_data0$p5numpla)
## 
##  Welch Two Sample t-test
## 
## data:  psm_data1$p5numpla and psm_data0$p5numpla
## t = 0, df = 1858, p-value = 1
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.02550007  0.02550007
## sample estimates:
## mean of x mean of y 
##  1.073118  1.073118
t.test(psm_data1$w3momed_hsb,psm_data0$w3momed_hsb)
## 
##  Welch Two Sample t-test
## 
## data:  psm_data1$w3momed_hsb and psm_data0$w3momed_hsb
## t = 0, df = 1858, p-value = 1
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.03676158  0.03676158
## sample estimates:
## mean of x mean of y 
## 0.2053763 0.2053763
t.test(psm_data1$w3income_1k,psm_data0$w3income_1k)
## 
##  Welch Two Sample t-test
## 
## data:  psm_data1$w3income_1k and psm_data0$w3income_1k
## t = 0, df = 1858, p-value = 1
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.985565  3.985565
## sample estimates:
## mean of x mean of y 
##  86.18063  86.18063

5. Visually check the balance in the new dataset:

F3<-ggplot(data=psm_data,aes(x = pr.score, y = w3income_1k))+
 geom_point(aes(color=factor(catholic)),size=1.2)+
 geom_smooth(aes(color=factor(catholic)))
F3

6. Compute the average treatment effect in your new sample

psm_data %>%                           
  group_by(catholic) %>%
    summarize(n_students= n(), 
            mean_math = mean(c5r2mtsc_std),
            std_error = sd(c5r2mtsc_std))
## # A tibble: 2 x 4
##   catholic n_students mean_math std_error
##      <int>      <int>     <dbl>     <dbl>
## 1        0        930     0.335     0.908
## 2        1        930     0.220     0.858
reg4<-lm(c5r2mtsc_std~catholic,psm_data)
summary(reg4)
## 
## Call:
## lm(formula = c5r2mtsc_std ~ catholic, data = psm_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2884 -0.5547  0.0443  0.5931  2.8356 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.33456    0.02896  11.552  < 2e-16 ***
## catholic    -0.11488    0.04096  -2.805  0.00509 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8832 on 1858 degrees of freedom
## Multiple R-squared:  0.004216,   Adjusted R-squared:  0.00368 
## F-statistic: 7.867 on 1 and 1858 DF,  p-value: 0.005087
reg5<-lm(c5r2mtsc_std~catholic+w3income_1k+p5numpla+w3momed_hsb,psm_data)
summary(reg5)
## 
## Call:
## lm(formula = c5r2mtsc_std ~ catholic + w3income_1k + p5numpla + 
##     w3momed_hsb, data = psm_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3758 -0.5347  0.0265  0.5692  2.8515 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.1513954  0.0910427   1.663  0.09650 .  
## catholic    -0.1148784  0.0392301  -2.928  0.00345 ** 
## w3income_1k  0.0041331  0.0004566   9.052  < 2e-16 ***
## p5numpla    -0.0911405  0.0700264  -1.302  0.19324    
## w3momed_hsb -0.3662377  0.0495217  -7.396 2.12e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.846 on 1855 degrees of freedom
## Multiple R-squared:  0.08793,    Adjusted R-squared:  0.08596 
## F-statistic: 44.71 on 4 and 1855 DF,  p-value: < 2.2e-16

Comment on the results of those regressions after the PSM procedure

Answer: In both regression results, the catholic is significant after PSM procedure. Since the value of the coefficient of catholic is negative, that means catholic schools can get higher scores. Income and w3momed_hsb are also significant in the second regression result.

7. Split the PSM procedure

plot(small$pr.score,psm$distance)

Reflect on how you can replace the logit stage by any other prediction “score”.
The distance and pr.score are same which is shown in the graph above. So we can get the predicted value from the logit stage and use it to replace the distance in PSM.

8. Use Machine Learning for matching

ecls<-select_if(ecls, is.numeric)

lasso<-cv.glmnet(data.matrix(ecls[,2:18]),data.matrix(ecls[,1]),type.measure ="deviance")
plot(lasso)

coef(lasso,s="lambda.1se")
## 18 x 1 sparse Matrix of class "dgCMatrix"
##                           1
## (Intercept)    9.937395e-02
## race_white     .           
## race_black     .           
## race_hispanic  .           
## race_asian     .           
## p5numpla       .           
## p5hmage        1.065969e-03
## p5hdage        .           
## w3daded_hsb   -1.659486e-02
## w3momed_hsb   -1.645021e-02
## w3momscr       .           
## w3dadscr       .           
## w3income       6.438272e-07
## w3povrty       .           
## p5fstamp       .           
## c5r2mtsc       .           
## c5r2mtsc_std   .           
## w3income_1k    .

Comment on the results
Answer: If I used lambda.1se as parameter,I would choose 7 variables to include as the results shown above. I would choose to have p5hmage, w3daded_hsb.w3momed_hsb,w3momscr,w3income,w3povrty and w3income_1k. They are patial same with what we have chosen before but still have some differences.

9. Use Lasso predictions for matching.

prlasso<-predict(lasso,data.matrix(ecls[,2:18]),s="lambda.1se")
dis<-t(prlasso[,1])
psm2<-matchit(catholic ~ w3income_1k + p5numpla + w3momed_hsb,data=small,distance=dis[1,])
psm2_data <- match.data(psm2)
F4<-ggplot(data=psm2_data,aes(x = pr.score, y = w3income_1k))+
 geom_point(aes(color=factor(catholic)),size=1.2)+
 geom_smooth(aes(color=factor(catholic)))
F4

plot(psm2_data$pr.score,psm2_data$distance)

Answer: The pr.score-income graph shown above can be compared with the our orginal graph using PSM command in previous parts. There’s slight difference between them but with same trend. The last graph shows the relationship between the distance we given by ML method and the pr.score by PSM method.