library(MatchIt)
library(cobalt)
library(stargazer)
library(tidyverse)
library(ggplot2)
library(knitr)
opts_knit$set(global.par = TRUE)
#install.packages("pander")
library(pander)
library(readr)
ecls <- read_csv("ecls.csv")

cleaned<- na.omit(ecls)

In this assignment we’ll analyze the effect of going to Catholic school, as opposed to public school, on student achievement. Because students who attend Catholic school on average are different from students who attend public school, we will use propensity score matching to get more credible causal estimates of Catholic schooling.

To examine the effect of going to Catholic school (“Treated”) versus public school (“Control”) on student achievement using matching we will go through the following steps:

Estimate the propensity score (the probability of being Treated given a set of pre-treatment covariates).
Examine the region of common support.
Choose and execute a matching algorithm. In this tutorial we’ll use nearest neighbor propensity score matching.
Examine covariate balance after matching.
Estimate treatment effects.

In addition, before we implement a matching method, we’ll conduct the following analyses using the non-matched data:

Examine the difference-in-means between Treated and Control for the outcome variable.
Examine the difference-in-means between Treated and Control for pre-treatment covariates. Before we start, load a few packages and read in ecls.csv:

1 Pre-analysis using non-matched data

1.1 Difference-in-means: outcome variable

Here is some basic information about public and catholic school students in terms of math achievement.

Note that we’re using students’ standardized math score (c5r2mtsc_std) – with a mean of 0 and standard deviation of 1 – as the outcome variable of interest.

The independent variable of interest is catholic (1 = student went to catholic school; 0 = student went to public school).

Find the mean and standard error of the standardized math score for each group.

cleaned %>%
  group_by(catholic) %>%
  summarise(n_students = n(),
            mean_math = mean(c5r2mtsc_std),
            std_error = sd(c5r2mtsc_std) / sqrt(n_students)) %>% round(digits = 4)  %>% kable(col.names = c("Catholic","No. Students","Mean Math Score","Std. Err"))

Catholic	No. Students	Mean Math Score	Std. Err
0	4499	0.1631	0.0145
1	930	0.2197	0.0281

By looking at the means, standardized scores are showing that the Catholic school group does perform better then the public school group.

1.2 Test between the groups

Use a standard t-test, test if there is a difference in the standardized math test scores between the two groups.

t <- t.test(ecls$c5r2mtsc_std~ecls$catholic)

Because the p-value is 1.84119510^{-19} we are able to reject the null hypothesis and the Catholic school students appear to have higher average test scores.

1.3 Difference-in-means: pre-treatment covariates

We’ll work with the following covariates for now:

race_white: Is the student white (1) or not (0)?
p5hmage: Mother’s age
w3income: Family income
p5numpla: Number of places the student has lived for at least 4 months
w3momed_hsb: Is the mother’s education level high-school or below (1) or some college or more (0)?

Calculate the mean for each covariate by the treatment status.

ecls1<- cleaned[ ,c("catholic","race_white","p5hmage","w3income","p5numpla","w3momed_hsb")]
setNames(aggregate(ecls1[ ,c("race_white","p5hmage","w3income","p5numpla","w3momed_hsb")],list(ecls1$catholic),mean, na.rm = TRUE)
          %>% round(digits = 4), c("Catholic","Race - White (1)/Non-White (0)", "Mother's Age", "Family Income","Nbr Places Lived 4+ Months","Mother's Education - High School or Below (1)/College + (0)"))

Catholic Race - White (1)/Non-White (0) Mother’s Age Family Income 1 0 0.6537 37.7946 65393.93 2 1 0.7667 39.7753 86180.63 Nbr Places Lived 4+ Months 1 1.1062 2 1.0731 Mother’s Education - High School or Below (1)/College + (0) 1 0.3923 2 0.2054

What do you see? Take a moment to reflect on what these differences suggest for the relationship of interest (that between Catholic schooling and student achievement).

There are more white students in the Catholic school group. Mothers age and family income are higher in the Catholic school group. The number of places lived in the last 4 months is slightly lower in the Catholic school group. This may perhaps represent more family stability for students in the Catholic school group. More mothers who are college educated send their children to Catholic school than women who have lower levels of education.

Perform a t-test for each variable across groups. Are there any significant differences?

t1<-t.test(ecls1$race_white ~ ecls1$catholic)
t2<-t.test(ecls1$p5hmage ~ ecls1$catholic)
t3<-t.test(ecls1$w3income ~ ecls1$catholic)
t4<-t.test(ecls1$p5numpla ~ ecls1$catholic)
t5<-t.test(ecls1$w3momed_hsb ~ ecls1$catholic)

Race: Our estimate for the control group is 0.6537008, for the treated group is 0.7666667 with a p.value equal to 6.814075610^{-13}.

Mother’s Age: Our estimate for the control group is 37.794621, for the treated group is 39.7752688 with a p.value equal to 4.283437410^{-28}.

Family Income: Our estimate for the control group is 6.539392910^{4}, for the treated group is 8.618062510^{4} with a p.value equal to 1.212804810^{-37}.

No. of Places lived for more than 4 months: Our estimate for the control group is 1.1062458, for the treated group is 1.0731183 with a p.value equal to 0.0017916.

Mother’s Education Level: Our estimate for the control group is 0.3923094, for the treated group is 0.2053763 with a p.value equal to 1.530071610^{-33}.

2 Propensity score estimation

2.1 Propensity Scores

We estimate the propensity score by running a logit model (probit also works) where the outcome variable is a binary variable indicating treatment status. What covariates should you include?

library(MatchIt); data(cleaned)
cleaned.formu <- catholic~race_white + p5hmage + w3income + p5numpla + w3momed 
glm1 <- glm(cleaned.formu, family=binomial, data=ecls)
stargazer::stargazer(glm1,type="html",covariate.labels = c("Race - White (1)/Non-White (0)", "Mother's Age", "Family Income","Nbr Places Lived 4+ Months","Mother's Education - High School or Below (1)/College + (0)","Mother - Bachelor's Degree","Mother - Doctorate or Professional Degree","Mother - Graduate/Professional School - No Degree","Mother - High School Diploma/Equivalent","Mother - Master's Degree (MA, MS)","Mother - Some College","Mother - Vocational/Technical Program"),single.row = TRUE)


	Dependent variable:

	catholic

Race - White (1)/Non-White (0)	0.242^*** (0.071)
Mother’s Age	0.030^*** (0.005)
Family Income	0.00001^*** (0.00000)
Nbr Places Lived 4+ Months	-0.131 (0.092)
Mother’s Education - High School or Below (1)/College + (0)	-0.095 (0.455)
Mother - Bachelor’s Degree	2.131^*** (0.368)
Mother - Doctorate or Professional Degree	1.610^*** (0.416)
Mother - Graduate/Professional School - No Degree	2.258^*** (0.393)
Mother - High School Diploma/Equivalent	1.291^*** (0.367)
Mother - Master’s Degree (MA, MS)	1.735^*** (0.381)
Mother - Some College	1.714^*** (0.365)
Mother - Vocational/Technical Program	1.570^*** (0.386)
Constant	-5.033^*** (0.435)

Observations	9,267
Log Likelihood	-3,538.428
Akaike Inf. Crit.	7,102.857

Note:	p<0.1; p<0.05; p<0.01

For the matching to give you a causal estimate in the end, you need to include any covariate that is related to both the treatment assignment and potential outcomes.

Give a reason for including each variable.

You need to include these variables: race_white, p5hmage, w3income and w4momed.

Race if white or nonwhite (race_white) will help you understand if certain races enter Catholic school. Mother’s age (p5hmage) may mean that older mothers may have more money to send children to Catholic school because they delayed pregnancy to have a college education and/or a career. Families with higher income may have extra money to send children to Catholic school so will need to include w3income. If mother had less education (w3momed), the mother may have less money to send children to Catholic school because she did not have a career and or did not see the need for Catholic schooling.

2.2 Use logit to estimate the propensity score values

Using the covariates you have selected above, estimate a logit model where the binary variable is 1 for catholic school and zero other wise.

Provide the logit coefficients in a nice table.

formula1 <- catholic ~ race_white + p5hmage + w3income + p5numpla + w3momed_hsb
glm1 <- glm(formula1, family=binomial, data=cleaned)
stargazer::stargazer(glm1,type="html",covariate.labels = c("Race - White (1)/Non-White (0)", "Mother's Age", "Family Income","Nbr Places Lived 4+ Months","Mother's Education - High School or Below (1)/College + (0)"),single.row = TRUE)


	Dependent variable:

	catholic

Race - White (1)/Non-White (0)	0.300^*** (0.087)
Mother’s Age	0.040^*** (0.007)
Family Income	0.00001^*** (0.00000)
Nbr Places Lived 4+ Months	-0.211^* (0.123)
Mother’s Education - High School or Below (1)/College + (0)	-0.564^*** (0.093)
Constant	-3.409^*** (0.326)

Observations	5,429
Log Likelihood	-2,353.385
Akaike Inf. Crit.	4,718.770

Note:	p<0.1; p<0.05; p<0.01

2.3 Save the propensity scores

Using the model you estimated in part 2.2, predict the probability of being in a catholic school for each observation and save these data in a new variable within your dataframe called ps_score.

p.score <-predict.glm(glm1,data=cleaned)

2.4 Create a Historgram for both groups

Make a histogram of the propensity scores for each group, Catholic and public school.

hga <- hist(p.score[cleaned$catholic==1],
     main="Histogram of Propensity Scores for Catholic school vs. Public School", 
     xlab="Catholic School", 
     border="blue", 
     col="lightblue",
     plot = FALSE
    )

hgb <- hist(p.score[cleaned$catholic==0],
     main="Histogram of Propensity Scores for Catholic school vs. Public School", 
     xlab="Public School", 
     border="blue", 
     col="lightblue",
     plot = FALSE
     )

c1 <- rgb(173, 216, 230, max = 255, alpha = 80, names = "ltblue") 
c2 <- rgb(255, 192, 203, max = 255, alpha = 80, names = "ltpink")

plot(hga, 
     col = c1,
     main="Histogram of Propensity Scores for Catholic school vs. Public School", 
     xlab="Catholic School",
     xlim = c(-5, 5), ylim = c(0,600)) 
plot(hgb, 
     col = c2, 
     main="Histogram of Propensity Scores for Catholic school vs. Public School", 
     xlab="Public School",
     add = TRUE)

2.5 Estimate the propensity score treatment effect

Use the package Matchit to estimate the propensity score model and treatment effect. Use the nearest neighbor approach for matching.

m.mahal<-matchit(catholic ~ race_white + p5hmage + w3income + w3momed,
                 data = cleaned,
                 distance = "glm", link = "probit",
                 mahvars = ~ race_white + p5hmage + w3income + w3momed,
                 caliper = .1, ratio = 2)
mahal.match<-match.data(m.mahal)
ncleaned<-matchit(cleaned.formu, data = cleaned, ratio=1, method ="nearest")
nn.match<-match.data(ncleaned)

library(RItools)

pander(xBalance(cleaned.formu, data = cleaned, report=c("chisquare.test")))

results:
overall:

chisquare df p.value

unstrat 287.9 12 1.589e-54

	chisquare	df	p.value
unstrat	287.9	12	1.589e-54

matches<-data.frame(ncleaned$match.matrix)
group1<-match(row.names(matches),row.names(nn.match))
group2<-match(matches[ ,],row.names(nn.match))
yT<-nn.match$catholic[group1]
yC<-nn.match$catholic[group2]
matched.cases<-cbind(matches,yT,yC)
ttest <- t.test(matched.cases$yT,matched.cases$yC,paired=TRUE)

pander(ttest)

Paired t-test: `matched.cases$yT` and `matched.cases$yC` (continued below)
Test statistic	df	P value	Alternative hypothesis
3.596	292	0.0003799 * * *	two.sided

mean of the differences
0.1433

2.6 A love plot

Perform a balance test as we did in the notes, but use a Love plot.

library("MatchIt")
data("cleaned", package = "MatchIt")
m.out <- matchit(catholic ~ race_white + p5hmage + w3income + w3momed, 
                 data = cleaned, method = "full")

v<-data.frame(old = c("w3momed9TH-12TH GRADE","distance","w3income","p5hmage","w3momed8TH GRADE OR BELOW","w3momedHIGH SCHOOL DIPLOMA/EQUIVALENT","race_white","w3momedBACHELOR'S DEGREE","w3momedDOCTORATE OR PROFESSIONAL DEGREE","w3momedSOME COLLEGE"),
         new = c("Mother's Education - 9th - 12th Grade","Distance","Family Income","Mother's Age","Mother's Education - 8th Grade or Below","Mother's Education - High School Diploma/Equivalent","Race (White)","Mother's Education - Bachelor's Degree","Mother's Education - Graduate or Professional - No Degree","Mother's Education - Some College"))


love.plot(m.out, 
     var.order = "unadjusted",
     var.names = v)

2.7 Compare results

Compare your propensity score results with those found using non-matched data in part 1.

With the unmatched t test the difference between the standardized test scores was not significant. When matching was significant, we were able to tell there was a significant difference in the standardized test scores between the two groups. This made our analysis much better and more helpful.

2.8 Theoretical Question

Why would we want to use a matching method instead of a simple regression? What advantages and disadvantages does the propensity score matching method have over regression?

Disadvantage: matched studies result in a smaller batch size because you have to omit the N/As which can lead to reduced power.

Advantage: matching minimizes variability between groups caused by extraneous variables and makes the groups more balanced.

Propensity Score Assignment

Dzhamilya Chakhalidze, Alec Rabalais, Jennifer Russo, John Schulten, Qais Shaban, Amy Shah

MSBA Data Analytics III