Simulation Homework

In this homework, you will test your understanding of causal inference methods and the type of data patterns or treatment effects they help reveal.
Your task is to use R to illustrate identification strategies with simulated data, similarly to what I do in my slides.
Pick any TWO causal inference methodologies discussed in the course: Difference-in-differences, Event Study, PSM, IPW, RDD, IV, Synthetic control, or anything else mentioned.

Part 1. Regression Discontinuity Design

1. Create fake data to simulate a relevant situation

The first causal inference methodology I would like to employ is the RDD. I created a fake data set that depicted a hypothetical situation in which 2500 middle school students enrolled in certain academic year are required to take the entrance exam, which is prevalent in major Asian Pacific countries. The overall grade follows a uniform distribution with a mean of 60. Those who scored less than 60 will automatically enroll in a free tutoring program offered by school throughout the entire academic year. By the end of the academic year all students will participate in the exit exam. The main interest of this simulation is to examine the impact of the free tutoring program on students’ performance on test scores.

library(dplyr)
library(stargazer)
library(ggplot2)

setwd("/Users/hunteryuan/Downloads/AAEC 8610/R Working Directory/HW7")

n = 2500; beg_score = runif(n, 20, 100)
summary(beg_score)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20.02   40.88   61.15   60.64   80.17   99.92
Alpha = 60; Beta = 0.39; Gamma = 9.27
id = c(1:n)
edu_program <- data.frame(id, beg_score)

# Create a dummy that indicates whether a student was enrolled in the tutoring program. 
edu_program$tutor <- ifelse(edu_program$beg_score < 60, 1, 0)

# The equation below suggests a true RDD model. 
edu_program$fin_score <- Alpha + Beta*(edu_program$beg_score) + Gamma*(edu_program$tutor)

summary(edu_program$fin_score)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   77.08   84.19   88.19   88.12   91.86   98.97


2. Plot the data to show what pattern you are trying to estimate

Fig_12 <- ggplot(edu_program, aes(x = as.numeric(beg_score), y = fin_score)) +
  geom_point(aes(color = factor(tutor))) + 
  scale_x_continuous (limits=c(20,100), breaks= seq(20, 100, 10)) + 
  scale_y_continuous (limits=c(70, 100), breaks=seq(70, 100, 5)) +
  scale_color_manual (values=c("red","yellow"), 
                      name="tutor", labels=c("Tutoring Program Not Enrolled", "Tutorig Program Enrolled")) +   
  labs(title = "Figure 1. Student Exam Performance", x = "Entrance Exam Grade", y = "Exit Exam Grade") +
  theme(panel.background = element_blank(),
        legend.background = element_rect(size=0.5, linetype="solid", colour ="black"), 
        legend.direction = "vertical", legend.position = c(0.8,0.3), legend.title = element_blank(),
        panel.grid.minor = element_blank(), panel.grid.major = element_blank(), panel.border= element_blank(), 
        axis.line.x = element_line(color="black", size = 0.5),
        axis.line.y = element_line(color="black", size = 0.5),
        plot.title = element_text(hjust = 0.5, face ="bold"), 
        plot.caption = element_text(hjust = 0, size = 7.5),
        axis.text.x = element_text(colour = "black", size = 10),
        axis.text.y = element_text(colour = "black", size = 10),
        axis.title = element_text(size = 12))

Fig_12

3. Use R to run the relevant regression

I conducted a simple regression discontinuity test Run with the same slope and a jump) for the exit exam score by tutor. No additional package was used.

RDD_edu <- lm(fin_score ~ beg_score + tutor, data = edu_program)

4. Present results and discuss

stargazer(RDD_edu, type = "text", title = "Table 1. Simple RD Regression", 
          align = TRUE,  
          dep.var.labels = c("Final Exam"), covariate.labels = c("Entrance Exam", "Tutoring Program"),
          keep.stat = c("n", "rsq"), omit = c("adj.rsq"), 
          table.layout = "=d=t-s=n")
## 
## Table 1. Simple RD Regression
## ============================================
##                          Final Exam         
## ============================================
## Entrance Exam             0.390***          
##                            (0.000)          
##                                             
## Tutoring Program          9.270***          
##                            (0.000)          
##                                             
## Constant                  60.000***         
##                            (0.000)          
##                                             
## --------------------------------------------
## Observations                2,500           
## R2                          1.000           
## ============================================
## Note:            *p<0.1; **p<0.05; ***p<0.01

The coefficient estimate of tutor is 9.27, suggesting that participating in the tutoring program boosted the score of students’ exit exam by 9.27 points and the result is significant at 1% level of significance and also consistent with the true model we proposed. Student’s performance on the entrance exam also displayed a positive sign but with significantly smaller magnitude, suggesting a student’s prior knowledge does not appear to be a significant factor in their exit exam grade, in comparison to the tutoring program.

5. Add any plots/graphs/comments you deem necessary.

library(rdd)
edu_program$RDD_fit <- predict(RDD_edu, edu_program)
RDD_pack <- RDestimate(fin_score~beg_score, data = edu_program, cutpoint = 60)
print(RDD_pack)
## 
## Call:
## RDestimate(formula = fin_score ~ beg_score, data = edu_program, 
##     cutpoint = 60)
## 
## Coefficients:
##      LATE    Half-BW  Double-BW  
##     -9.27      -9.27      -9.27
summary(RDD_pack)
## 
## Call:
## RDestimate(formula = fin_score ~ beg_score, data = edu_program, 
##     cutpoint = 60)
## 
## Type:
## sharp 
## 
## Estimates:
##            Bandwidth  Observations  Estimate  Std. Error  z value     Pr(>|z|)
## LATE       4.322      293           -9.27     2.159e-14   -4.294e+14  0       
## Half-BW    2.161      142           -9.27     9.191e-15   -1.009e+15  0       
## Double-BW  8.643      550           -9.27     6.029e-15   -1.538e+15  0       
##               
## LATE       ***
## Half-BW    ***
## Double-BW  ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## F-statistics:
##            F          Num. DoF  Denom. DoF  p
## LATE       2.799e+29  3         289         0
## Half-BW    4.858e+29  3         138         0
## Double-BW  7.635e+29  3         546         0

In this section I installed the rdd package in R and used the built-in command to run the simple RD test again. The result is both consistent with the coefficient estimate from the previous, as well as the true parameter.


Part 2. Difference-in-Difference Estimation

1. Create fake data to simulate a relevant situation

The second causal inference methodology I would like to employ is the RDD. Under the same background in which middle school students are required to take the entrance exam in certain academic year in some Asian Pacific Country. In this part assume there is a random sampling of 2000 students in one middle school. Instead of those who scored less than 60 points having to enroll in the free tutoring program, 1000 students are randomly selected to participate in the program throughout the academic year while the rest of the 1000 students do not participate. By the end of the academic year all of the students in our sample will participate in the exit exam. The main interest of this simulation remains the same: examining the impact of the free tutoring program on students’ performance on test scores.

library(haven)
library(kableExtra)
library(vtable)

m = 1000; beg_grade = rnorm(m,70,5); 
summary(beg_grade)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   55.18   66.84   70.02   70.12   73.49   86.94
end_grade = rnorm(m,83,5);
summary(end_grade);
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   66.10   79.11   82.97   82.76   86.09  100.71
id_2 = c(1:m);
grp_tutor <- data.frame(id=id_2, beg_grade, end_grade)
grp_tutor$tutor <- 1
View(grp_tutor)

beg_grade_cont = rnorm(m,55,5); 
summary(beg_grade_cont)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   37.36   51.60   54.87   55.04   58.53   72.74
end_grade_cont = rnorm(m,60,5);
summary(end_grade_cont);
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   44.99   56.33   59.88   59.94   63.42   78.90
id_3 = c(1001:2000);
grp_control <- data.frame(id=id_3, beg_grade=beg_grade_cont, end_grade=end_grade_cont)
grp_control$tutor <- 0
View(grp_control)

grade <- rbind(grp_tutor, grp_control)
head(grade)
##   id beg_grade end_grade tutor
## 1  1  78.97950  79.36146     1
## 2  2  73.58649  91.67095     1
## 3  3  71.15999  81.80617     1
## 4  4  67.80682  89.34094     1
## 5  5  73.69121  84.56540     1
## 6  6  73.90256  79.00032     1

2. Plot the data to show what pattern you are trying to estimate

I created a table of summary statistics that displayed the mean exam grade for the control group and the tutoring at the beginning and the end of an academic year.

grade$group_name  <- ifelse(grade$tutor == "0", "Control Croup", "Tutoring Group") 
Table_21 <- t(grade %>% 
                group_by(group_name) %>%
                summarise(entrace_grade_bar = mean(beg_grade, na.rm = TRUE),
                          final_grade_bar = mean(end_grade, na.rm = TRUE)))

colnames(Table_21) <- Table_21[1, ]
Table_21 <- Table_21[-1, ]

rownames(Table_21) <- c("Year 0", "Year 1")
Table_21
##        Control Croup Tutoring Group
## Year 0 "55.04262"    "70.12146"    
## Year 1 "59.93581"    "82.75716"

3. Use R to run the relevant regression

I conducted a standard DID regression on test score. No additional package was used.

grade_long <- reshape(grade, varying=c("beg_grade", "end_grade"), 
                      v.names=c("grade_tot"),
                      timevar = "time",
                      times=c("1", "2"),
                      idvar = c("id", "tutor"),
                      direction = "long")

DID_23 <- lm(grade_tot~ tutor + time + tutor*time, data=grade_long)

4. Present results and discuss

I ran a standard “Difference-in-Difference” regression with the following equation:

\(Y_{i,t}\) = \(\alpha\) + \(\beta\)Tutor + \(\gamma\)\(Time_t\) + \(\delta\)(Tutor\(*\) \(Time_t\)) + \(\epsilon_{i,t}\)


where:

\(Y_{i,t}\) is the outcome employment variable for individual i from time t;

Tutor is a binary dummy variable that is equal to 0 if that individual is assigned to the tutoring program and equal to 1 if that individual is not assigned in the tutoring program;

Tutor\(*\) \(Time_t\) is the interaction term between two periods and the tutoring program;

\(\epsilon_{i,t}\) is the error term.

stargazer(DID_23, type="text", title="Figure 2. DID Regression on Exam Performance", 
          align=TRUE, dep.var.labels = "Difference",
          covariate.labels=c("Tutor", "Time", "Tutor * Time"), 
          keep.stat = c("n", "rsq"), omit=c("Constant", "adj.rsq"), 
          table.layout = "=d=t-s=n")
## 
## Figure 2. DID Regression on Exam Performance
## ========================================
##                      Difference         
## ========================================
## Tutor                 15.079***         
##                        (0.228)          
##                                         
## Time                  4.893***          
##                        (0.228)          
##                                         
## Tutor * Time          7.743***          
##                        (0.323)          
##                                         
## ----------------------------------------
## Observations            4,000           
## R2                      0.812           
## ========================================
## Note:        *p<0.1; **p<0.05; ***p<0.01

The result indicates that reaching the end of an academic year, students who enrolled in the tutoring program observed an increase of 8.29 points in their exam score, in relative to students who did not enroll. The coefficient estimate is significant at 99% level of confidence.

5. Add any plots/graphs/comments you deem necessary.

I compute the difference-in-differences estimator manually using the summary statistics from Question 2:


\(\Delta^{Year0}_{Tutor-Control}\) = 70.09555 - 54.9989 = 15.09666

\(\Delta^{Year1}_{Tutor-Control}\) = 83.10376 - 59.71503 = 23.38846

\(\beta_{DID} =\)\(\Delta^{Year1}_{Tutor-Control}\)-\(\Delta^{Year0}_{Tutor-Control}\) = 23.38846 - 15.09666 = 8.2918
The obtained result is almost identical to the coefficient estimates from the DID regression.