In this homework, you will test your understanding of causal
inference methods and the type of data patterns or treatment effects
they help reveal.
Your task is to use R to illustrate
identification strategies with simulated data, similarly to what I do in
my slides.
Pick any TWO causal inference methodologies discussed in
the course: Difference-in-differences, Event Study, PSM, IPW, RDD, IV,
Synthetic control, or anything else mentioned.
1. Create fake data to simulate a relevant situation
The first causal inference methodology I would like to employ is the RDD. I created a fake data set that depicted a hypothetical situation in which 2500 middle school students enrolled in certain academic year are required to take the entrance exam, which is prevalent in major Asian Pacific countries. The overall grade follows a uniform distribution with a mean of 60. Those who scored less than 60 will automatically enroll in a free tutoring program offered by school throughout the entire academic year. By the end of the academic year all students will participate in the exit exam. The main interest of this simulation is to examine the impact of the free tutoring program on students’ performance on test scores.
library(dplyr)
library(stargazer)
library(ggplot2)
setwd("/Users/hunteryuan/Downloads/AAEC 8610/R Working Directory/HW7")
n = 2500; beg_score = runif(n, 20, 100)
summary(beg_score)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.02 40.88 61.15 60.64 80.17 99.92
Alpha = 60; Beta = 0.39; Gamma = 9.27
id = c(1:n)
edu_program <- data.frame(id, beg_score)
# Create a dummy that indicates whether a student was enrolled in the tutoring program.
edu_program$tutor <- ifelse(edu_program$beg_score < 60, 1, 0)
# The equation below suggests a true RDD model.
edu_program$fin_score <- Alpha + Beta*(edu_program$beg_score) + Gamma*(edu_program$tutor)
summary(edu_program$fin_score)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 77.08 84.19 88.19 88.12 91.86 98.97
2. Plot the data to show what pattern you are trying to
estimate
Fig_12 <- ggplot(edu_program, aes(x = as.numeric(beg_score), y = fin_score)) +
geom_point(aes(color = factor(tutor))) +
scale_x_continuous (limits=c(20,100), breaks= seq(20, 100, 10)) +
scale_y_continuous (limits=c(70, 100), breaks=seq(70, 100, 5)) +
scale_color_manual (values=c("red","yellow"),
name="tutor", labels=c("Tutoring Program Not Enrolled", "Tutorig Program Enrolled")) +
labs(title = "Figure 1. Student Exam Performance", x = "Entrance Exam Grade", y = "Exit Exam Grade") +
theme(panel.background = element_blank(),
legend.background = element_rect(size=0.5, linetype="solid", colour ="black"),
legend.direction = "vertical", legend.position = c(0.8,0.3), legend.title = element_blank(),
panel.grid.minor = element_blank(), panel.grid.major = element_blank(), panel.border= element_blank(),
axis.line.x = element_line(color="black", size = 0.5),
axis.line.y = element_line(color="black", size = 0.5),
plot.title = element_text(hjust = 0.5, face ="bold"),
plot.caption = element_text(hjust = 0, size = 7.5),
axis.text.x = element_text(colour = "black", size = 10),
axis.text.y = element_text(colour = "black", size = 10),
axis.title = element_text(size = 12))
Fig_12
3. Use R to run the relevant regression
I conducted a simple regression discontinuity test Run with the same slope and a jump) for the exit exam score by tutor. No additional package was used.
RDD_edu <- lm(fin_score ~ beg_score + tutor, data = edu_program)
4. Present results and discuss
stargazer(RDD_edu, type = "text", title = "Table 1. Simple RD Regression",
align = TRUE,
dep.var.labels = c("Final Exam"), covariate.labels = c("Entrance Exam", "Tutoring Program"),
keep.stat = c("n", "rsq"), omit = c("adj.rsq"),
table.layout = "=d=t-s=n")
##
## Table 1. Simple RD Regression
## ============================================
## Final Exam
## ============================================
## Entrance Exam 0.390***
## (0.000)
##
## Tutoring Program 9.270***
## (0.000)
##
## Constant 60.000***
## (0.000)
##
## --------------------------------------------
## Observations 2,500
## R2 1.000
## ============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The coefficient estimate of tutor is 9.27, suggesting that participating in the tutoring program boosted the score of students’ exit exam by 9.27 points and the result is significant at 1% level of significance and also consistent with the true model we proposed. Student’s performance on the entrance exam also displayed a positive sign but with significantly smaller magnitude, suggesting a student’s prior knowledge does not appear to be a significant factor in their exit exam grade, in comparison to the tutoring program.
5. Add any plots/graphs/comments you deem necessary.
library(rdd)
edu_program$RDD_fit <- predict(RDD_edu, edu_program)
RDD_pack <- RDestimate(fin_score~beg_score, data = edu_program, cutpoint = 60)
print(RDD_pack)
##
## Call:
## RDestimate(formula = fin_score ~ beg_score, data = edu_program,
## cutpoint = 60)
##
## Coefficients:
## LATE Half-BW Double-BW
## -9.27 -9.27 -9.27
summary(RDD_pack)
##
## Call:
## RDestimate(formula = fin_score ~ beg_score, data = edu_program,
## cutpoint = 60)
##
## Type:
## sharp
##
## Estimates:
## Bandwidth Observations Estimate Std. Error z value Pr(>|z|)
## LATE 4.322 293 -9.27 2.159e-14 -4.294e+14 0
## Half-BW 2.161 142 -9.27 9.191e-15 -1.009e+15 0
## Double-BW 8.643 550 -9.27 6.029e-15 -1.538e+15 0
##
## LATE ***
## Half-BW ***
## Double-BW ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## F-statistics:
## F Num. DoF Denom. DoF p
## LATE 2.799e+29 3 289 0
## Half-BW 4.858e+29 3 138 0
## Double-BW 7.635e+29 3 546 0
In this section I installed the rdd package in R and used the built-in command to run the simple RD test again. The result is both consistent with the coefficient estimate from the previous, as well as the true parameter.
1. Create fake data to simulate a relevant situation
The second causal inference methodology I would like to employ is the RDD. Under the same background in which middle school students are required to take the entrance exam in certain academic year in some Asian Pacific Country. In this part assume there is a random sampling of 2000 students in one middle school. Instead of those who scored less than 60 points having to enroll in the free tutoring program, 1000 students are randomly selected to participate in the program throughout the academic year while the rest of the 1000 students do not participate. By the end of the academic year all of the students in our sample will participate in the exit exam. The main interest of this simulation remains the same: examining the impact of the free tutoring program on students’ performance on test scores.
library(haven)
library(kableExtra)
library(vtable)
m = 1000; beg_grade = rnorm(m,70,5);
summary(beg_grade)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 55.18 66.84 70.02 70.12 73.49 86.94
end_grade = rnorm(m,83,5);
summary(end_grade);
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 66.10 79.11 82.97 82.76 86.09 100.71
id_2 = c(1:m);
grp_tutor <- data.frame(id=id_2, beg_grade, end_grade)
grp_tutor$tutor <- 1
View(grp_tutor)
beg_grade_cont = rnorm(m,55,5);
summary(beg_grade_cont)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 37.36 51.60 54.87 55.04 58.53 72.74
end_grade_cont = rnorm(m,60,5);
summary(end_grade_cont);
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 44.99 56.33 59.88 59.94 63.42 78.90
id_3 = c(1001:2000);
grp_control <- data.frame(id=id_3, beg_grade=beg_grade_cont, end_grade=end_grade_cont)
grp_control$tutor <- 0
View(grp_control)
grade <- rbind(grp_tutor, grp_control)
head(grade)
## id beg_grade end_grade tutor
## 1 1 78.97950 79.36146 1
## 2 2 73.58649 91.67095 1
## 3 3 71.15999 81.80617 1
## 4 4 67.80682 89.34094 1
## 5 5 73.69121 84.56540 1
## 6 6 73.90256 79.00032 1
2. Plot the data to show what pattern you are trying to estimate
I created a table of summary statistics that displayed the mean exam grade for the control group and the tutoring at the beginning and the end of an academic year.
grade$group_name <- ifelse(grade$tutor == "0", "Control Croup", "Tutoring Group")
Table_21 <- t(grade %>%
group_by(group_name) %>%
summarise(entrace_grade_bar = mean(beg_grade, na.rm = TRUE),
final_grade_bar = mean(end_grade, na.rm = TRUE)))
colnames(Table_21) <- Table_21[1, ]
Table_21 <- Table_21[-1, ]
rownames(Table_21) <- c("Year 0", "Year 1")
Table_21
## Control Croup Tutoring Group
## Year 0 "55.04262" "70.12146"
## Year 1 "59.93581" "82.75716"
3. Use R to run the relevant regression
I conducted a standard DID regression on test score. No additional package was used.
grade_long <- reshape(grade, varying=c("beg_grade", "end_grade"),
v.names=c("grade_tot"),
timevar = "time",
times=c("1", "2"),
idvar = c("id", "tutor"),
direction = "long")
DID_23 <- lm(grade_tot~ tutor + time + tutor*time, data=grade_long)
4. Present results and discuss
I ran a standard “Difference-in-Difference” regression with the following equation:
where:
\(Y_{i,t}\) is the outcome employment variable for individual i from time t;
Tutor is a binary dummy variable that is equal to 0 if that individual is assigned to the tutoring program and equal to 1 if that individual is not assigned in the tutoring program;
Tutor\(*\) \(Time_t\) is the interaction term between two periods and the tutoring program;
\(\epsilon_{i,t}\) is the error term.
stargazer(DID_23, type="text", title="Figure 2. DID Regression on Exam Performance",
align=TRUE, dep.var.labels = "Difference",
covariate.labels=c("Tutor", "Time", "Tutor * Time"),
keep.stat = c("n", "rsq"), omit=c("Constant", "adj.rsq"),
table.layout = "=d=t-s=n")
##
## Figure 2. DID Regression on Exam Performance
## ========================================
## Difference
## ========================================
## Tutor 15.079***
## (0.228)
##
## Time 4.893***
## (0.228)
##
## Tutor * Time 7.743***
## (0.323)
##
## ----------------------------------------
## Observations 4,000
## R2 0.812
## ========================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The result indicates that reaching the end of an academic year, students who enrolled in the tutoring program observed an increase of 8.29 points in their exam score, in relative to students who did not enroll. The coefficient estimate is significant at 99% level of confidence.
5. Add any plots/graphs/comments you deem necessary.
I compute the difference-in-differences estimator manually using the summary statistics from Question 2:
\(\Delta^{Year0}_{Tutor-Control}\) = 70.09555
- 54.9989 = 15.09666
\(\Delta^{Year1}_{Tutor-Control}\) = 83.10376 - 59.71503 = 23.38846
\(\beta_{DID} =\)\(\Delta^{Year1}_{Tutor-Control}\)-\(\Delta^{Year0}_{Tutor-Control}\) = 23.38846
- 15.09666 = 8.2918
The obtained result is almost identical to the
coefficient estimates from the DID regression.