Make a smaller dataset which only contains the variables:
c5r2mtsc_std: Standardized math scores
catholic: Standardized math scores
w3income_1k: Family income in thousands
p5numpla: Number of places the student has lived for at least 4 months
w3momed_hsb: Is the mother’s education level high-school or below (1) or some college or more (0)?
library(dplyr)
library(ggplot2)
library(MatchIt)
library(glmnet)
ecls <- read.csv("C:/Users/SHOUY/Desktop/AAEC8610/code/hw10/ecls.csv")
keep <- c("c5r2mtsc_std","catholic","w3income_1k","p5numpla","w3momed_hsb")
small <- ecls[,(names(ecls) %in% keep)]
Do test scores on c5r2mtsc_std differ for pupils of catholic and non-catholic schools on average?
Answer:The test sores are different on average in the catholic and non-catholic schools, but quite close. The summary table is shown as below.
small %>%
group_by(catholic) %>%
summarize(n_students= n(),
mean_math = mean(c5r2mtsc_std),
std_error = sd(c5r2mtsc_std))
## # A tibble: 2 x 4
## catholic n_students mean_math std_error
## <int> <int> <dbl> <dbl>
## 1 0 4499 0.163 0.974
## 2 1 930 0.220 0.858
Can you conclude that going to a catholic school leads to higher math scores? Why or why not?
A regression of just math scores on the catholic dummy
reg1<-lm(c5r2mtsc_std~catholic,small)
summary(reg1)
##
## Call:
## lm(formula = c5r2mtsc_std ~ catholic, data = small)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2667 -0.6082 0.0538 0.6292 3.2705
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.16313 0.01424 11.459 <2e-16 ***
## catholic 0.05656 0.03440 1.644 0.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9549 on 5427 degrees of freedom
## Multiple R-squared: 0.000498, Adjusted R-squared: 0.0003138
## F-statistic: 2.704 on 1 and 5427 DF, p-value: 0.1002
Another regression that includes all the variables listed above.
reg2<-lm(c5r2mtsc_std~catholic+w3income_1k+p5numpla+w3momed_hsb,small)
summary(reg2)
##
## Call:
## lm(formula = c5r2mtsc_std ~ catholic + w3income_1k + p5numpla +
## w3momed_hsb, data = small)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3922 -0.5696 0.0372 0.5986 3.2572
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0733290 0.0491661 1.491 0.135900
## catholic -0.1273899 0.0328888 -3.873 0.000109 ***
## w3income_1k 0.0053247 0.0003026 17.595 < 2e-16 ***
## p5numpla -0.1009432 0.0355779 -2.837 0.004567 **
## w3momed_hsb -0.3740355 0.0271962 -13.753 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8941 on 5424 degrees of freedom
## Multiple R-squared: 0.1242, Adjusted R-squared: 0.1235
## F-statistic: 192.3 on 4 and 5424 DF, p-value: < 2.2e-16
Answer: In the first regression,the coefficient of catholic is not significant in the regression result. But in the second regression, the coefficient of catholic= -0.1273899is now statistically significant. So we cannot conclude that going to a catholic school leads to higher math scores since coefficient is significant negative.
Do the means of the covariates (income, etc) between catholic vs. non-catholic schools?
small %>%
group_by(catholic) %>%
summarise_all(mean)
## # A tibble: 2 x 5
## catholic p5numpla w3momed_hsb c5r2mtsc_std w3income_1k
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 0 1.11 0.392 0.163 65.4
## 2 1 1.07 0.205 0.220 86.2
cath<-small[which(small$catholic==1),]
ncath<-small[which(small$catholic==0),]
t.test(cath$p5numpla,ncath$p5numpla)
##
## Welch Two Sample t-test
##
## data: cath$p5numpla and ncath$p5numpla
## t = -3.128, df = 1600.4, p-value = 0.001792
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.05390038 -0.01235472
## sample estimates:
## mean of x mean of y
## 1.073118 1.106246
t.test(cath$w3momed_hsb,ncath$w3momed_hsb)
##
## Welch Two Sample t-test
##
## data: cath$w3momed_hsb and ncath$w3momed_hsb
## t = -12.362, df = 1545.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.2165946 -0.1572715
## sample estimates:
## mean of x mean of y
## 0.2053763 0.3923094
t.test(cath$w3income_1k,ncath$w3income_1k)
##
## Welch Two Sample t-test
##
## data: cath$w3income_1k and ncath$w3income_1k
## t = 13.238, df = 1314.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 17.70620 23.86719
## sample estimates:
## mean of x mean of y
## 86.18063 65.39393
Answer: The means are all performed significant differences in the variables in two groups. The identification strategy problem is that the differences of dependent variables in catholic and non-catholic schools can be driven by the differences of other variables in two groups rather than only dummy variable.
reg3<-glm(catholic ~ w3income_1k + p5numpla + w3momed_hsb,family = binomial(), data = small)
summary(reg3)
##
## Call:
## glm(formula = catholic ~ w3income_1k + p5numpla + w3momed_hsb,
## family = binomial(), data = small)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0172 -0.6461 -0.5240 -0.4122 2.3672
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.6702950 0.1576164 -10.597 < 2e-16 ***
## w3income_1k 0.0077921 0.0008115 9.602 < 2e-16 ***
## p5numpla -0.2774346 0.1232930 -2.250 0.0244 *
## w3momed_hsb -0.6504757 0.0915710 -7.104 1.22e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4972.4 on 5428 degrees of freedom
## Residual deviance: 4750.6 on 5425 degrees of freedom
## AIC: 4758.6
##
## Number of Fisher Scoring iterations: 4
small$pr.score<-predict(reg3,type = "response")
head(small)
## catholic p5numpla w3momed_hsb c5r2mtsc_std w3income_1k pr.score
## 1 0 1 0 0.9817533 62.5005 0.1883576
## 2 0 1 0 0.5943775 45.0005 0.1683901
## 3 0 1 0 0.4906106 62.5005 0.1883576
## 4 1 1 0 1.4512779 87.5005 0.2199574
## 5 0 1 0 2.5956991 150.0005 0.3145556
## 6 0 1 0 0.3851966 62.5005 0.1883576
cath$pr.score<-predict(reg3,type = "response",newdata = cath)
ncath$pr.score<-predict(reg3,type = "response",newdata = ncath)
F1<-ggplot(data=small,aes(x = pr.score, y = w3income_1k))+
geom_point(aes(color=factor(catholic)),size=1.2)+
geom_smooth(aes(color=factor(catholic)))
F1
F2<-ggplot(data=small,aes(pr.score))+
geom_density(aes(color=factor(catholic),fill=factor(catholic),alpha=0.5),size=1.2)
F2
psm<-matchit(catholic ~ w3income_1k + p5numpla + w3momed_hsb,data=small,distance="glm")
summary(psm)
##
## Call:
## matchit(formula = catholic ~ w3income_1k + p5numpla + w3momed_hsb,
## data = small, distance = "glm")
##
## Summary of Balance for All Data:
## Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## distance 0.2034 0.1647 0.5198 0.9539 0.1217
## w3income_1k 86.1806 65.3939 0.4744 1.0647 0.1011
## p5numpla 1.0731 1.1062 -0.1182 0.6323 0.0066
## w3momed_hsb 0.2054 0.3923 -0.4627 . 0.1869
## eCDF Max
## distance 0.2544
## w3income_1k 0.2478
## p5numpla 0.0265
## w3momed_hsb 0.1869
##
##
## Summary of Balance for Matched Data:
## Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## distance 0.2034 0.2034 0 1 0
## w3income_1k 86.1806 86.1806 0 1 0
## p5numpla 1.0731 1.0731 0 1 0
## w3momed_hsb 0.2054 0.2054 0 . 0
## eCDF Max Std. Pair Dist.
## distance 0 0
## w3income_1k 0 0
## p5numpla 0 0
## w3momed_hsb 0 0
##
## Percent Balance Improvement:
## Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
## distance 100 100 100 100
## w3income_1k 100 100 100 100
## p5numpla 100 100 100 100
## w3momed_hsb 100 . 100 100
##
## Sample Sizes:
## Control Treated
## All 4499 930
## Matched 930 930
## Unmatched 3569 0
## Discarded 0 0
psm_data <- match.data(psm)
dim(psm_data)
## [1] 1860 9
psm_data<-psm_data[order(psm_data$distance),]
head(psm_data, 10)
## catholic p5numpla w3momed_hsb c5r2mtsc_std w3income_1k pr.score
## 176 0 2 1 -0.3835843 17.5005 0.06069529
## 4274 1 2 1 -2.5922340 17.5005 0.06069529
## 41 0 2 1 -2.3690526 22.5005 0.06295488
## 2083 1 2 1 0.5271555 22.5005 0.06295488
## 561 0 2 1 -0.2195955 27.5005 0.06529274
## 2304 1 2 1 -1.6878765 27.5005 0.06529274
## 178 0 2 1 -0.2550080 32.5005 0.06771114
## 716 1 2 1 -1.2722942 32.5005 0.06771114
## 288 0 2 1 -0.7403859 37.5005 0.07021240
## 485 1 2 1 -0.7841369 37.5005 0.07021240
## distance weights subclass
## 176 0.06069529 1 598
## 4274 0.06069529 1 598
## 41 0.06295488 1 222
## 2083 0.06295488 1 222
## 561 0.06529274 1 272
## 2304 0.06529274 1 272
## 178 0.06771114 1 859
## 716 0.06771114 1 859
## 288 0.07021240 1 728
## 485 0.07021240 1 728
psm_data %>%
group_by(catholic) %>%
summarise_all(mean)
## # A tibble: 2 x 9
## catholic p5numpla w3momed_hsb c5r2mtsc_std w3income_1k pr.score distance
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 1.07 0.205 0.335 86.2 0.203 0.203
## 2 1 1.07 0.205 0.220 86.2 0.203 0.203
## # ... with 2 more variables: weights <dbl>, subclass <dbl>
psm_data1<-psm_data[which(psm_data$catholic==1),]
psm_data0<-psm_data[which(psm_data$catholic==0),]
t.test(psm_data1$p5numpla,psm_data0$p5numpla)
##
## Welch Two Sample t-test
##
## data: psm_data1$p5numpla and psm_data0$p5numpla
## t = 0, df = 1858, p-value = 1
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.02550007 0.02550007
## sample estimates:
## mean of x mean of y
## 1.073118 1.073118
t.test(psm_data1$w3momed_hsb,psm_data0$w3momed_hsb)
##
## Welch Two Sample t-test
##
## data: psm_data1$w3momed_hsb and psm_data0$w3momed_hsb
## t = 0, df = 1858, p-value = 1
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.03676158 0.03676158
## sample estimates:
## mean of x mean of y
## 0.2053763 0.2053763
t.test(psm_data1$w3income_1k,psm_data0$w3income_1k)
##
## Welch Two Sample t-test
##
## data: psm_data1$w3income_1k and psm_data0$w3income_1k
## t = 0, df = 1858, p-value = 1
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -3.985565 3.985565
## sample estimates:
## mean of x mean of y
## 86.18063 86.18063
F3<-ggplot(data=psm_data,aes(x = pr.score, y = w3income_1k))+
geom_point(aes(color=factor(catholic)),size=1.2)+
geom_smooth(aes(color=factor(catholic)))
F3
psm_data %>%
group_by(catholic) %>%
summarize(n_students= n(),
mean_math = mean(c5r2mtsc_std),
std_error = sd(c5r2mtsc_std))
## # A tibble: 2 x 4
## catholic n_students mean_math std_error
## <int> <int> <dbl> <dbl>
## 1 0 930 0.335 0.908
## 2 1 930 0.220 0.858
reg4<-lm(c5r2mtsc_std~catholic,psm_data)
summary(reg4)
##
## Call:
## lm(formula = c5r2mtsc_std ~ catholic, data = psm_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2884 -0.5547 0.0443 0.5931 2.8356
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.33456 0.02896 11.552 < 2e-16 ***
## catholic -0.11488 0.04096 -2.805 0.00509 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8832 on 1858 degrees of freedom
## Multiple R-squared: 0.004216, Adjusted R-squared: 0.00368
## F-statistic: 7.867 on 1 and 1858 DF, p-value: 0.005087
reg5<-lm(c5r2mtsc_std~catholic+w3income_1k+p5numpla+w3momed_hsb,psm_data)
summary(reg5)
##
## Call:
## lm(formula = c5r2mtsc_std ~ catholic + w3income_1k + p5numpla +
## w3momed_hsb, data = psm_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3758 -0.5347 0.0265 0.5692 2.8515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1513954 0.0910427 1.663 0.09650 .
## catholic -0.1148784 0.0392301 -2.928 0.00345 **
## w3income_1k 0.0041331 0.0004566 9.052 < 2e-16 ***
## p5numpla -0.0911405 0.0700264 -1.302 0.19324
## w3momed_hsb -0.3662377 0.0495217 -7.396 2.12e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.846 on 1855 degrees of freedom
## Multiple R-squared: 0.08793, Adjusted R-squared: 0.08596
## F-statistic: 44.71 on 4 and 1855 DF, p-value: < 2.2e-16
Comment on the results of those regressions after the PSM procedure
Answer: In both regression results, the catholic is significant after PSM procedure. Since the value of the coefficient of catholic is negative, that means catholic schools can get higher scores. Income and w3momed_hsb are also significant in the second regression result.
plot(small$pr.score,psm$distance)
Reflect on how you can replace the logit stage by any other prediction “score”.
The distance and pr.score are same which is shown in the graph above. So we can get the predicted value from the logit stage and use it to replace the distance in PSM.
ecls<-select_if(ecls, is.numeric)
lasso<-cv.glmnet(data.matrix(ecls[,2:18]),data.matrix(ecls[,1]),type.measure ="deviance")
plot(lasso)
coef(lasso,s="lambda.1se")
## 18 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 9.937395e-02
## race_white .
## race_black .
## race_hispanic .
## race_asian .
## p5numpla .
## p5hmage 1.065969e-03
## p5hdage .
## w3daded_hsb -1.659486e-02
## w3momed_hsb -1.645021e-02
## w3momscr .
## w3dadscr .
## w3income 6.438272e-07
## w3povrty .
## p5fstamp .
## c5r2mtsc .
## c5r2mtsc_std .
## w3income_1k .
Comment on the results
Answer: If I used lambda.1se as parameter,I would choose 7 variables to include as the results shown above. I would choose to have p5hmage, w3daded_hsb.w3momed_hsb,w3momscr,w3income,w3povrty and w3income_1k. They are patial same with what we have chosen before but still have some differences.
prlasso<-predict(lasso,data.matrix(ecls[,2:18]),s="lambda.1se")
dis<-t(prlasso[,1])
psm2<-matchit(catholic ~ w3income_1k + p5numpla + w3momed_hsb,data=small,distance=dis[1,])
psm2_data <- match.data(psm2)
F4<-ggplot(data=psm2_data,aes(x = pr.score, y = w3income_1k))+
geom_point(aes(color=factor(catholic)),size=1.2)+
geom_smooth(aes(color=factor(catholic)))
F4
plot(psm2_data$pr.score,psm2_data$distance)
Answer: The pr.score-income graph shown above can be compared with the our orginal graph using PSM command in previous parts. There’s slight difference between them but with same trend. The last graph shows the relationship between the distance we given by ML method and the pr.score by PSM method.