library(MatchIt)
library(cobalt)
library(stargazer)
library(tidyverse)
library(ggplot2)
library(knitr)
opts_knit$set(global.par = TRUE)
#install.packages("pander")
library(pander)
library(readr)
ecls <- read_csv("ecls.csv")
cleaned<- na.omit(ecls)
In this assignment we’ll analyze the effect of going to Catholic school, as opposed to public school, on student achievement. Because students who attend Catholic school on average are different from students who attend public school, we will use propensity score matching to get more credible causal estimates of Catholic schooling.
To examine the effect of going to Catholic school (“Treated”) versus public school (“Control”) on student achievement using matching we will go through the following steps:
In addition, before we implement a matching method, we’ll conduct the following analyses using the non-matched data:
Here is some basic information about public and catholic school students in terms of math achievement.
Note that we’re using students’ standardized math score (c5r2mtsc_std) – with a mean of 0 and standard deviation of 1 – as the outcome variable of interest.
The independent variable of interest is catholic (1 = student went to catholic school; 0 = student went to public school).
Find the mean and standard error of the standardized math score for each group.
cleaned %>%
group_by(catholic) %>%
summarise(n_students = n(),
mean_math = mean(c5r2mtsc_std),
std_error = sd(c5r2mtsc_std) / sqrt(n_students)) %>% round(digits = 4) %>% kable(col.names = c("Catholic","No. Students","Mean Math Score","Std. Err"))
| Catholic | No. Students | Mean Math Score | Std. Err |
|---|---|---|---|
| 0 | 4499 | 0.1631 | 0.0145 |
| 1 | 930 | 0.2197 | 0.0281 |
By looking at the means, standardized scores are showing that the Catholic school group does perform better then the public school group.
Use a standard t-test, test if there is a difference in the standardized math test scores between the two groups.
t <- t.test(ecls$c5r2mtsc_std~ecls$catholic)
Because the p-value is 1.84119510^{-19} we are able to reject the null hypothesis and the Catholic school students appear to have higher average test scores.
We’ll work with the following covariates for now:
race_white: Is the student white (1) or not (0)?p5hmage: Mother’s agew3income: Family incomep5numpla: Number of places the student has lived for at least 4 monthsw3momed_hsb: Is the mother’s education level high-school or below (1) or some college or more (0)?Calculate the mean for each covariate by the treatment status.
ecls1<- cleaned[ ,c("catholic","race_white","p5hmage","w3income","p5numpla","w3momed_hsb")]
setNames(aggregate(ecls1[ ,c("race_white","p5hmage","w3income","p5numpla","w3momed_hsb")],list(ecls1$catholic),mean, na.rm = TRUE)
%>% round(digits = 4), c("Catholic","Race - White (1)/Non-White (0)", "Mother's Age", "Family Income","Nbr Places Lived 4+ Months","Mother's Education - High School or Below (1)/College + (0)"))
Catholic Race - White (1)/Non-White (0) Mother’s Age Family Income 1 0 0.6537 37.7946 65393.93 2 1 0.7667 39.7753 86180.63 Nbr Places Lived 4+ Months 1 1.1062 2 1.0731 Mother’s Education - High School or Below (1)/College + (0) 1 0.3923 2 0.2054
What do you see? Take a moment to reflect on what these differences suggest for the relationship of interest (that between Catholic schooling and student achievement).
There are more white students in the Catholic school group. Mothers age and family income are higher in the Catholic school group. The number of places lived in the last 4 months is slightly lower in the Catholic school group. This may perhaps represent more family stability for students in the Catholic school group. More mothers who are college educated send their children to Catholic school than women who have lower levels of education.
Perform a t-test for each variable across groups. Are there any significant differences?
t1<-t.test(ecls1$race_white ~ ecls1$catholic)
t2<-t.test(ecls1$p5hmage ~ ecls1$catholic)
t3<-t.test(ecls1$w3income ~ ecls1$catholic)
t4<-t.test(ecls1$p5numpla ~ ecls1$catholic)
t5<-t.test(ecls1$w3momed_hsb ~ ecls1$catholic)
Race: Our estimate for the control group is 0.6537008, for the treated group is 0.7666667 with a p.value equal to 6.814075610^{-13}.
Mother’s Age: Our estimate for the control group is 37.794621, for the treated group is 39.7752688 with a p.value equal to 4.283437410^{-28}.
Family Income: Our estimate for the control group is 6.539392910^{4}, for the treated group is 8.618062510^{4} with a p.value equal to 1.212804810^{-37}.
No. of Places lived for more than 4 months: Our estimate for the control group is 1.1062458, for the treated group is 1.0731183 with a p.value equal to 0.0017916.
Mother’s Education Level: Our estimate for the control group is 0.3923094, for the treated group is 0.2053763 with a p.value equal to 1.530071610^{-33}.
We estimate the propensity score by running a logit model (probit also works) where the outcome variable is a binary variable indicating treatment status. What covariates should you include?
library(MatchIt); data(cleaned)
cleaned.formu <- catholic~race_white + p5hmage + w3income + p5numpla + w3momed
glm1 <- glm(cleaned.formu, family=binomial, data=ecls)
stargazer::stargazer(glm1,type="html",covariate.labels = c("Race - White (1)/Non-White (0)", "Mother's Age", "Family Income","Nbr Places Lived 4+ Months","Mother's Education - High School or Below (1)/College + (0)","Mother - Bachelor's Degree","Mother - Doctorate or Professional Degree","Mother - Graduate/Professional School - No Degree","Mother - High School Diploma/Equivalent","Mother - Master's Degree (MA, MS)","Mother - Some College","Mother - Vocational/Technical Program"),single.row = TRUE)
| Dependent variable: | |
| catholic | |
| Race - White (1)/Non-White (0) | 0.242*** (0.071) |
| Mother’s Age | 0.030*** (0.005) |
| Family Income | 0.00001*** (0.00000) |
| Nbr Places Lived 4+ Months | -0.131 (0.092) |
| Mother’s Education - High School or Below (1)/College + (0) | -0.095 (0.455) |
| Mother - Bachelor’s Degree | 2.131*** (0.368) |
| Mother - Doctorate or Professional Degree | 1.610*** (0.416) |
| Mother - Graduate/Professional School - No Degree | 2.258*** (0.393) |
| Mother - High School Diploma/Equivalent | 1.291*** (0.367) |
| Mother - Master’s Degree (MA, MS) | 1.735*** (0.381) |
| Mother - Some College | 1.714*** (0.365) |
| Mother - Vocational/Technical Program | 1.570*** (0.386) |
| Constant | -5.033*** (0.435) |
| Observations | 9,267 |
| Log Likelihood | -3,538.428 |
| Akaike Inf. Crit. | 7,102.857 |
| Note: | p<0.1; p<0.05; p<0.01 |
For the matching to give you a causal estimate in the end, you need to include any covariate that is related to both the treatment assignment and potential outcomes.
Give a reason for including each variable.
You need to include these variables: race_white, p5hmage, w3income and w4momed.
Race if white or nonwhite (race_white) will help you understand if certain races enter Catholic school. Mother’s age (p5hmage) may mean that older mothers may have more money to send children to Catholic school because they delayed pregnancy to have a college education and/or a career. Families with higher income may have extra money to send children to Catholic school so will need to include w3income. If mother had less education (w3momed), the mother may have less money to send children to Catholic school because she did not have a career and or did not see the need for Catholic schooling.
Using the covariates you have selected above, estimate a logit model where the binary variable is 1 for catholic school and zero other wise.
Provide the logit coefficients in a nice table.
formula1 <- catholic ~ race_white + p5hmage + w3income + p5numpla + w3momed_hsb
glm1 <- glm(formula1, family=binomial, data=cleaned)
stargazer::stargazer(glm1,type="html",covariate.labels = c("Race - White (1)/Non-White (0)", "Mother's Age", "Family Income","Nbr Places Lived 4+ Months","Mother's Education - High School or Below (1)/College + (0)"),single.row = TRUE)
| Dependent variable: | |
| catholic | |
| Race - White (1)/Non-White (0) | 0.300*** (0.087) |
| Mother’s Age | 0.040*** (0.007) |
| Family Income | 0.00001*** (0.00000) |
| Nbr Places Lived 4+ Months | -0.211* (0.123) |
| Mother’s Education - High School or Below (1)/College + (0) | -0.564*** (0.093) |
| Constant | -3.409*** (0.326) |
| Observations | 5,429 |
| Log Likelihood | -2,353.385 |
| Akaike Inf. Crit. | 4,718.770 |
| Note: | p<0.1; p<0.05; p<0.01 |
Using the model you estimated in part 2.2, predict the probability of being in a catholic school for each observation and save these data in a new variable within your dataframe called ps_score.
p.score <-predict.glm(glm1,data=cleaned)
Make a histogram of the propensity scores for each group, Catholic and public school.
hga <- hist(p.score[cleaned$catholic==1],
main="Histogram of Propensity Scores for Catholic school vs. Public School",
xlab="Catholic School",
border="blue",
col="lightblue",
plot = FALSE
)
hgb <- hist(p.score[cleaned$catholic==0],
main="Histogram of Propensity Scores for Catholic school vs. Public School",
xlab="Public School",
border="blue",
col="lightblue",
plot = FALSE
)
c1 <- rgb(173, 216, 230, max = 255, alpha = 80, names = "ltblue")
c2 <- rgb(255, 192, 203, max = 255, alpha = 80, names = "ltpink")
plot(hga,
col = c1,
main="Histogram of Propensity Scores for Catholic school vs. Public School",
xlab="Catholic School",
xlim = c(-5, 5), ylim = c(0,600))
plot(hgb,
col = c2,
main="Histogram of Propensity Scores for Catholic school vs. Public School",
xlab="Public School",
add = TRUE)
Use the package Matchit to estimate the propensity score model and treatment effect. Use the nearest neighbor approach for matching.
m.mahal<-matchit(catholic ~ race_white + p5hmage + w3income + w3momed,
data = cleaned,
distance = "glm", link = "probit",
mahvars = ~ race_white + p5hmage + w3income + w3momed,
caliper = .1, ratio = 2)
mahal.match<-match.data(m.mahal)
ncleaned<-matchit(cleaned.formu, data = cleaned, ratio=1, method ="nearest")
nn.match<-match.data(ncleaned)
library(RItools)
pander(xBalance(cleaned.formu, data = cleaned, report=c("chisquare.test")))
results:
overall:
| chisquare | df | p.value | |
|---|---|---|---|
| unstrat | 287.9 | 12 | 1.589e-54 |
matches<-data.frame(ncleaned$match.matrix)
group1<-match(row.names(matches),row.names(nn.match))
group2<-match(matches[ ,],row.names(nn.match))
yT<-nn.match$catholic[group1]
yC<-nn.match$catholic[group2]
matched.cases<-cbind(matches,yT,yC)
ttest <- t.test(matched.cases$yT,matched.cases$yC,paired=TRUE)
pander(ttest)
| Test statistic | df | P value | Alternative hypothesis |
|---|---|---|---|
| 3.596 | 292 | 0.0003799 * * * | two.sided |
| mean of the differences |
|---|
| 0.1433 |
Perform a balance test as we did in the notes, but use a Love plot.
library("MatchIt")
data("cleaned", package = "MatchIt")
m.out <- matchit(catholic ~ race_white + p5hmage + w3income + w3momed,
data = cleaned, method = "full")
v<-data.frame(old = c("w3momed9TH-12TH GRADE","distance","w3income","p5hmage","w3momed8TH GRADE OR BELOW","w3momedHIGH SCHOOL DIPLOMA/EQUIVALENT","race_white","w3momedBACHELOR'S DEGREE","w3momedDOCTORATE OR PROFESSIONAL DEGREE","w3momedSOME COLLEGE"),
new = c("Mother's Education - 9th - 12th Grade","Distance","Family Income","Mother's Age","Mother's Education - 8th Grade or Below","Mother's Education - High School Diploma/Equivalent","Race (White)","Mother's Education - Bachelor's Degree","Mother's Education - Graduate or Professional - No Degree","Mother's Education - Some College"))
love.plot(m.out,
var.order = "unadjusted",
var.names = v)
Compare your propensity score results with those found using non-matched data in part 1.
With the unmatched t test the difference between the standardized test scores was not significant. When matching was significant, we were able to tell there was a significant difference in the standardized test scores between the two groups. This made our analysis much better and more helpful.
Why would we want to use a matching method instead of a simple regression? What advantages and disadvantages does the propensity score matching method have over regression?
Disadvantage: matched studies result in a smaller batch size because you have to omit the N/As which can lead to reduced power.
Advantage: matching minimizes variability between groups caused by extraneous variables and makes the groups more balanced.