This assignment will take you through a Propensity Score Matching exercise, with a Machine Learning twist at the end. The example data contains test results of school children along with various characteristics. The key question is: Do catholic schools have an effect on student’s performance? The key issus is, of course, that students in catholic schools (“Treated”) might be different from those in non-catholic schools (“Control”). The exercise will show how propensity score matching attempts to reduce this issue (Note: it does not solve it completely).
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## c5r2mtsc_std catholic w3income_1k p5numpla w3momed_hsb
## Min. :-3.1036 0:4499 Min. : 5.00 Min. :1.000 Min. :0.0000
## 1st Qu.:-0.4416 1: 930 1st Qu.: 37.50 1st Qu.:1.000 1st Qu.:0.0000
## Median : 0.2256 Median : 62.50 Median :1.000 Median :0.0000
## Mean : 0.1728 Mean : 68.95 Mean :1.101 Mean :0.3603
## 3rd Qu.: 0.8024 3rd Qu.: 87.50 3rd Qu.:1.000 3rd Qu.:1.0000
## Max. : 3.4337 Max. :200.00 Max. :5.000 Max. :1.0000
Do test scores on c5r2mtsc_std differ for pupils of catholic and non-catholic schools on average? * Can you conclude that going to a catholic school leads to higher math scores? Why or why not? * Run regressions: 1. A regression of just math scores on the catholic dummy 2. Another regression that includes all the variables listed above 3. Comment on the results * Do the means of the covariates (income, mother’s education etc) differ between catholic vs. non-catholic schools? You can: 1. Compare the means [Hint: summarise_all()] 2. Test the difference in means for statistical significance [Hint: t.test] 3. Discuss what would be the problems with an identification strategy based on simple regressions.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.5922 -0.3232 0.2510 0.2197 0.7729 3.0552
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.1036 -0.4577 0.2207 0.1631 0.8084 3.4337
##
## Welch Two Sample t-test
##
## data: c5r2mtsc_std by catholic
## t = -1.7866, df = 1468.1, p-value = 0.07421
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.118653665 0.005539377
## sample estimates:
## mean in group 0 mean in group 1
## 0.1631279 0.2196851
There is significant difference in the t-test difference means of the scores between the catholic and non-catholic schools distribution in the Math scores. Also if we look closely at the box plot we see a lot of variation between the two as well present.
##
## Call:
## lm(formula = c5r2mtsc_std ~ catholic, data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2667 -0.6082 0.0538 0.6292 3.2705
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.16313 0.01424 11.459 <2e-16 ***
## catholic1 0.05656 0.03440 1.644 0.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9549 on 5427 degrees of freedom
## Multiple R-squared: 0.000498, Adjusted R-squared: 0.0003138
## F-statistic: 2.704 on 1 and 5427 DF, p-value: 0.1002
##
## Call:
## lm(formula = c5r2mtsc_std ~ ., data = dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3922 -0.5696 0.0372 0.5986 3.2572
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0733290 0.0491661 1.491 0.135900
## catholic1 -0.1273899 0.0328888 -3.873 0.000109 ***
## w3income_1k 0.0053247 0.0003026 17.595 < 2e-16 ***
## p5numpla -0.1009432 0.0355779 -2.837 0.004567 **
## w3momed_hsb -0.3740355 0.0271962 -13.753 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8941 on 5424 degrees of freedom
## Multiple R-squared: 0.1242, Adjusted R-squared: 0.1235
## F-statistic: 192.3 on 4 and 5424 DF, p-value: < 2.2e-16
If we are to compare the two regression we see that catholic school admits have a reduced standardized test scores and the effect although in the ols first equation is positive but is insignificant. For the second regression with all the variables we can see that mother’s age and better standard of life proxied with greater family income increases the standardized test scores and vice versa.
## # A tibble: 2 x 5
## catholic c5r2mtsc_std w3income_1k p5numpla w3momed_hsb
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 0 0.163 65.4 1.11 0.392
## 2 1 0.220 86.2 1.07 0.205
##
## Welch Two Sample t-test
##
## data: w3income_1k by catholic
## t = -13.238, df = 1314.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -23.86719 -17.70620
## sample estimates:
## mean in group 0 mean in group 1
## 65.39393 86.18063
##
## Welch Two Sample t-test
##
## data: p5numpla by catholic
## t = 3.128, df = 1600.4, p-value = 0.001792
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.01235472 0.05390038
## sample estimates:
## mean in group 0 mean in group 1
## 1.106246 1.073118
##
## Welch Two Sample t-test
##
## data: w3momed_hsb by catholic
## t = 12.362, df = 1545.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.1572715 0.2165946
## sample estimates:
## mean in group 0 mean in group 1
## 0.3923094 0.2053763
There is significant difference in the t-test difference means of the scores between the catholic and non-catholic schools distributios. There could be possible parallel trends assumption in DID design getting violated. There could exist selction on observables and unobservables which means existence of endogeneity and ommitted variable bias present in the dataset. There is potential sample selection problems with too less variables being the correct representative of the datasset and there are other factors contributing to attending catholic schools and not attending.
##
## Call:
## glm(formula = catholic ~ w3income_1k + p5numpla + w3momed_hsb,
## family = "binomial", data = dataset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0172 -0.6461 -0.5240 -0.4122 2.3672
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.6702950 0.1576164 -10.597 < 2e-16 ***
## w3income_1k 0.0077921 0.0008115 9.602 < 2e-16 ***
## p5numpla -0.2774346 0.1232930 -2.250 0.0244 *
## w3momed_hsb -0.6504757 0.0915710 -7.104 1.22e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4972.4 on 5428 degrees of freedom
## Residual deviance: 4750.6 on 5425 degrees of freedom
## AIC: 4758.6
##
## Number of Fisher Scoring iterations: 4
## c5r2mtsc_std catholic w3income_1k p5numpla w3momed_hsb
## Min. :-3.1036 0:4499 Min. : 5.00 Min. :1.000 Min. :0.0000
## 1st Qu.:-0.4416 1: 930 1st Qu.: 37.50 1st Qu.:1.000 1st Qu.:0.0000
## Median : 0.2256 Median : 62.50 Median :1.000 Median :0.0000
## Mean : 0.1728 Mean : 68.95 Mean :1.101 Mean :0.3603
## 3rd Qu.: 0.8024 3rd Qu.: 87.50 3rd Qu.:1.000 3rd Qu.:1.0000
## Max. : 3.4337 Max. :200.00 Max. :5.000 Max. :1.0000
## catholicpred
## Min. :0.04333
## 1st Qu.:0.10801
## Median :0.16839
## Mean :0.17130
## 3rd Qu.:0.21996
## Max. :0.40389
## A matchit object
## - method: 1:1 nearest neighbor matching without replacement
## - distance: Propensity score
## - estimated with logistic regression
## - number of obs.: 5429 (original), 1860 (matched)
## - target estimand: ATT
## - covariates: w3income_1k, p5numpla, w3momed_hsb
## c5r2mtsc_std catholic w3income_1k p5numpla w3momed_hsb
## Min. :-2.9539 0:930 Min. : 7.50 Min. :1.000 Min. :0.0000
## 1st Qu.:-0.2794 1:930 1st Qu.: 62.50 1st Qu.:1.000 1st Qu.:0.0000
## Median : 0.3035 Median : 87.50 Median :1.000 Median :0.0000
## Mean : 0.2771 Mean : 86.18 Mean :1.073 Mean :0.2054
## 3rd Qu.: 0.8766 3rd Qu.: 87.50 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. : 3.1066 Max. :200.00 Max. :3.000 Max. :1.0000
##
## distance weights subclass
## Min. :0.0607 Min. :1 1 : 2
## 1st Qu.:0.1604 1st Qu.:1 2 : 2
## Median :0.1884 Median :1 3 : 2
## Mean :0.2034 Mean :1 4 : 2
## 3rd Qu.:0.2200 3rd Qu.:1 5 : 2
## Max. :0.4039 Max. :1 6 : 2
## (Other):1848
## c5r2mtsc_std catholic w3income_1k p5numpla w3momed_hsb
## Min. :-2.9539 0:930 Min. : 7.50 Min. :1.000 Min. :0.0000
## 1st Qu.:-0.2794 1:930 1st Qu.: 62.50 1st Qu.:1.000 1st Qu.:0.0000
## Median : 0.3035 Median : 87.50 Median :1.000 Median :0.0000
## Mean : 0.2771 Mean : 86.18 Mean :1.073 Mean :0.2054
## 3rd Qu.: 0.8766 3rd Qu.: 87.50 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. : 3.1066 Max. :200.00 Max. :3.000 Max. :1.0000
##
## distance weights subclass
## Min. :0.0607 Min. :1 1 : 2
## 1st Qu.:0.1604 1st Qu.:1 2 : 2
## Median :0.1884 Median :1 3 : 2
## Mean :0.2034 Mean :1 4 : 2
## 3rd Qu.:0.2200 3rd Qu.:1 5 : 2
## Max. :0.4039 Max. :1 6 : 2
## (Other):1848
## c5r2mtsc_std catholic w3income_1k p5numpla w3momed_hsb distance weights
## 176 -0.3835843 0 17.5005 2 1 0.06069529 1
## 4274 -2.5922340 1 17.5005 2 1 0.06069529 1
## 41 -2.3690526 0 22.5005 2 1 0.06295488 1
## 2083 0.5271555 1 22.5005 2 1 0.06295488 1
## 561 -0.2195955 0 27.5005 2 1 0.06529274 1
## 2304 -1.6878765 1 27.5005 2 1 0.06529274 1
## 178 -0.2550080 0 32.5005 2 1 0.06771114 1
## 716 -1.2722942 1 32.5005 2 1 0.06771114 1
## 288 -0.7403859 0 37.5005 2 1 0.07021240 1
## 485 -0.7841369 1 37.5005 2 1 0.07021240 1
## subclass
## 176 598
## 4274 598
## 41 222
## 2083 222
## 561 272
## 2304 272
## 178 859
## 716 859
## 288 728
## 485 728
## Warning in mean.default(subclass): argument is not numeric or logical: returning
## NA
## Warning in mean.default(subclass): argument is not numeric or logical: returning
## NA
## # A tibble: 2 x 8
## catholic c5r2mtsc_std w3income_1k p5numpla w3momed_hsb distance weights
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0.335 86.2 1.07 0.205 0.203 1
## 2 1 0.220 86.2 1.07 0.205 0.203 1
## # … with 1 more variable: subclass <dbl>
## [1] 1
## [1] 1
## [1] 1
We have 1860 observations in all with a sample of Catholics & Non-Catholics, i.e. 930 students in each category. With the Popensity score matching we got equal pairs of treated vs non-treated sampe and weighed them based on distance metric. We still have identification issue with our treatment variable even after PSM as there exists some unobserved factor that is correlated with the decision to go to Catholic School but not correlated with the covariates.
## Warning in mean.default(subclass): argument is not numeric or logical: returning
## NA
## Warning in mean.default(subclass): argument is not numeric or logical: returning
## NA
## # A tibble: 2 x 8
## catholic c5r2mtsc_std w3income_1k p5numpla w3momed_hsb distance weights
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0.335 86.2 1.07 0.205 0.203 1
## 2 1 0.220 86.2 1.07 0.205 0.203 1
## # … with 1 more variable: subclass <dbl>
##
## Welch Two Sample t-test
##
## data: c5r2mtsc_std by catholic
## t = 2.8048, df = 1852.1, p-value = 0.005087
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.03455008 0.19520664
## sample estimates:
## mean in group 0 mean in group 1
## 0.3345634 0.2196851
##
## Call:
## lm(formula = c5r2mtsc_std ~ catholic, data = newmatcheddata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2884 -0.5547 0.0443 0.5931 2.8356
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.33456 0.02896 11.552 < 2e-16 ***
## catholic1 -0.11488 0.04096 -2.805 0.00509 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8832 on 1858 degrees of freedom
## Multiple R-squared: 0.004216, Adjusted R-squared: 0.00368
## F-statistic: 7.867 on 1 and 1858 DF, p-value: 0.005087
##
## Call:
## lm(formula = c5r2mtsc_std ~ catholic + p5numpla + w3momed_hsb,
## data = newmatcheddata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3884 -0.5685 0.0329 0.5975 2.7356
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.52796 0.08273 6.382 2.21e-10 ***
## catholic1 -0.11488 0.04008 -2.866 0.0042 **
## p5numpla -0.09342 0.07154 -1.306 0.1917
## w3momed_hsb -0.45350 0.04962 -9.139 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8642 on 1856 degrees of freedom
## Multiple R-squared: 0.04764, Adjusted R-squared: 0.0461
## F-statistic: 30.95 on 3 and 1856 DF, p-value: < 2.2e-16
We observe that with this new PSM matched data, our treatment effect of being in catholic schools on standardized score is negative and statistically significant.
##
## Call:
## glm(formula = catholic ~ w3income_1k + p5numpla + w3momed_hsb,
## family = "binomial", data = dataset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0172 -0.6461 -0.5240 -0.4122 2.3672
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.6702950 0.1576164 -10.597 < 2e-16 ***
## w3income_1k 0.0077921 0.0008115 9.602 < 2e-16 ***
## p5numpla -0.2774346 0.1232930 -2.250 0.0244 *
## w3momed_hsb -0.6504757 0.0915710 -7.104 1.22e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4972.4 on 5428 degrees of freedom
## Residual deviance: 4750.6 on 5425 degrees of freedom
## AIC: 4758.6
##
## Number of Fisher Scoring iterations: 4
## c5r2mtsc_std catholic w3income_1k p5numpla w3momed_hsb
## Min. :-3.1036 0:4499 Min. : 5.00 Min. :1.000 Min. :0.0000
## 1st Qu.:-0.4416 1: 930 1st Qu.: 37.50 1st Qu.:1.000 1st Qu.:0.0000
## Median : 0.2256 Median : 62.50 Median :1.000 Median :0.0000
## Mean : 0.1728 Mean : 68.95 Mean :1.101 Mean :0.3603
## 3rd Qu.: 0.8024 3rd Qu.: 87.50 3rd Qu.:1.000 3rd Qu.:1.0000
## Max. : 3.4337 Max. :200.00 Max. :5.000 Max. :1.0000
## catholicpred
## Min. :0.04333
## 1st Qu.:0.10801
## Median :0.16839
## Mean :0.17130
## 3rd Qu.:0.21996
## Max. :0.40389
##
## The downloaded binary packages are in
## /var/folders/np/zfkhhhb16cjfjjh96791mx140000gn/T//RtmpthanNQ/downloaded_packages
## Loading required package: Matrix
## Loaded glmnet 4.1-1
## c5r2mtsc_std catholic race_white race_black
## Min. :-3.1036 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:-0.4416 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
## Median : 0.2256 Median :0.0000 Median :1.0000 Median :0.00000
## Mean : 0.1728 Mean :0.1713 Mean :0.6731 Mean :0.06576
## 3rd Qu.: 0.8024 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:0.00000
## Max. : 3.4337 Max. :1.0000 Max. :1.0000 Max. :1.00000
## race_asian race_hispanic p5hmage p5hdage
## Min. :0.00000 Min. :0.0000 Min. :19.00 Min. :22.00
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:34.00 1st Qu.:36.00
## Median :0.00000 Median :0.0000 Median :38.00 Median :40.00
## Mean :0.06299 Mean :0.1464 Mean :38.13 Mean :40.46
## 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:42.00 3rd Qu.:44.00
## Max. :1.00000 Max. :1.0000 Max. :71.00 Max. :69.00
## w3daded_hsb p5numpla w3momed_hsb w3income_1k
## Min. :0.0000 Min. :1.000 Min. :0.0000 Min. : 5.00
## 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.: 37.50
## Median :0.0000 Median :1.000 Median :0.0000 Median : 62.50
## Mean :0.4332 Mean :1.101 Mean :0.3603 Mean : 68.95
## 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.: 87.50
## Max. :1.0000 Max. :5.000 Max. :1.0000 Max. :200.00
## w3momscr w3dadscr w3povrty p5fstamp
## Min. :29.60 Min. :29.60 Min. :0.00000 Min. :0.00000
## 1st Qu.:35.78 1st Qu.:35.78 1st Qu.:0.00000 1st Qu.:0.00000
## Median :38.18 Median :39.18 Median :0.00000 Median :0.00000
## Mean :44.55 Mean :43.16 Mean :0.08694 Mean :0.03887
## 3rd Qu.:53.50 3rd Qu.:53.50 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :77.50 Max. :77.50 Max. :1.00000 Max. :1.00000
##
## Call: glmnet(x = x.matrix, y = y, family = "gaussian", alpha = 1)
##
## Df %Dev Lambda
## 1 0 0.00 0.288400
## 2 1 1.55 0.262800
## 3 2 3.34 0.239400
## 4 3 4.97 0.218200
## 5 3 6.45 0.198800
## 6 4 7.91 0.181100
## 7 4 9.25 0.165000
## 8 4 10.36 0.150400
## 9 5 11.29 0.137000
## 10 5 12.12 0.124800
## 11 6 12.83 0.113700
## 12 6 13.43 0.103600
## 13 8 14.00 0.094430
## 14 9 14.57 0.086040
## 15 9 15.04 0.078400
## 16 9 15.44 0.071440
## 17 9 15.76 0.065090
## 18 10 16.06 0.059310
## 19 10 16.38 0.054040
## 20 10 16.64 0.049240
## 21 11 16.91 0.044860
## 22 11 17.16 0.040880
## 23 11 17.37 0.037250
## 24 11 17.55 0.033940
## 25 12 17.69 0.030920
## 26 13 17.81 0.028180
## 27 13 17.93 0.025670
## 28 13 18.02 0.023390
## 29 13 18.10 0.021310
## 30 13 18.16 0.019420
## 31 13 18.22 0.017700
## 32 13 18.26 0.016120
## 33 13 18.30 0.014690
## 34 13 18.33 0.013390
## 35 13 18.35 0.012200
## 36 13 18.38 0.011110
## 37 13 18.39 0.010130
## 38 13 18.41 0.009226
## 39 13 18.42 0.008407
## 40 14 18.44 0.007660
## 41 14 18.46 0.006979
## 42 14 18.48 0.006359
## 43 14 18.50 0.005794
## 44 14 18.51 0.005280
## 45 14 18.52 0.004811
## 46 14 18.53 0.004383
## 47 14 18.54 0.003994
## 48 14 18.54 0.003639
## 49 14 18.55 0.003316
## 50 14 18.55 0.003021
## 51 14 18.56 0.002753
## 52 14 18.56 0.002508
## 53 14 18.56 0.002285
## 54 14 18.57 0.002082
## 55 14 18.57 0.001897
## 56 14 18.57 0.001729
## 57 14 18.57 0.001575
## 58 14 18.57 0.001435
## 59 14 18.57 0.001308
## 60 14 18.57 0.001192
## 61 14 18.57 0.001086
## 62 15 18.57 0.000989
## 63 15 18.57 0.000901
## 64 15 18.57 0.000821
## 65 15 18.58 0.000748
## 66 15 18.58 0.000682
## 67 15 18.58 0.000621
## 68 15 18.58 0.000566
## 69 15 18.58 0.000516
## 16 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -0.455557786
## catholic -0.010889736
## race_white 0.210778482
## race_black -0.128278509
## race_asian 0.085651014
## race_hispanic .
## p5hmage 0.005514440
## p5hdage .
## w3daded_hsb -0.199875569
## p5numpla .
## w3momed_hsb -0.161435158
## w3income_1k 0.002446267
## w3momscr 0.003339817
## w3dadscr 0.002685208
## w3povrty -0.085842306
## p5fstamp .
For regularization, we choose about 10 out of 15 covariates. And they do differ as we see certain variables which we used previously are getting dropped here suggesting low penalty being imposed on them.
## [1] 0.001085757
We can see Overfitting - switch spot (Log(lambda)=3.5) - Underfitting