library(knitr)
library(kableExtra)
library(tidyverse)
library(estimatr)
library(rdd)
library(rdrobust)
library(rddensity)

1 RDDs are Everywhere

Find an empirical study on any topic that interests you, where the study involves a regression discontinuity design. Do not use a study that has been assigned as reading for the course or used as an example in class (i.e., do not use Ludwig & Miller, 2007, or May et al., 2023). Upload a pdf of the study to Canvas and answer the following questions about the study.

1.1 Design (3 pts)

Briefly explain a) the forcing variable(s) involved in the design, b) the rule used to determine treatment assignment, and c) the primary outcome(s) examined.

  1. College course placement test scores were used as the forcing variable

  2. The cut-score that determined remediation assignment was determined by each institution in the sample based on community college system-level policies

  3. Several outcomes are investigated, including (a) the initial decision to enroll; (b) grades in subsequent courses in the same subject; and (c) post-treatment proficiency exam scores

1.2 Assumptions (2 pts)

Did the study present any evidence about the integrity of the forcing variable? If so, explain the approach used to examine the question and briefly summarize what was found.

  • The authors present strong evidence for the integrity of the forcing variable. First, they note that manipulation of scores is unlikely given testing policies: re-taking the test to change assignment to remediation is not allowed unless 20 credits of regular courses are completed. Manipulation of the cut-score is also prohibited by system-wide minimums for remediation cut-scores.

  • Density around the cut-off was assessed visually and revealed no discontinuities (graphs are provided). The McCrary test was also used to formally test this and confirmed no discontinuities were present.

1.3 Analysis (3 pts)

In your own words, explain the analytic approach used to estimate treatment effects. Note details such as a) use of parametric or non-parametric regression, b) choice of bandwidth (if relevant), c) choice of kernel (if relevant), and d) any bias corrections used. Provide citations to relevant passages/page numbers in the article.

  1. Local linear estimation (parametric) is used (p. 25)

  2. A bandwidth of +/-6 points was used for the main specification (p. 25)

  3. No kernel is specified.

  4. Bias correction was not used. However, sensitivity analysis indicated that the estimates were robust across a wider bandwidth, as well as with the inclusion of quadratic terms for the test score distance variables (p. 25)

2 Summer School

In the remainder of this assignment, you will analyze a simulated study that is inspired by a study of the effects of remedial education (summer school) on the achievement levels of elementary school students in Chicago Public Schools (Jacob & Lefgren, 2004). In 1996, Chicago Public Schools implemented a policy requiring that students grades 3, 6, and 8 meet certain performance standards on the Iowa Test of Basic Skills in order to advance to the next grade. Students who did not attain certain minimum scores on the math and reading portions of the exam were required to attend 6 weeks of summer school for further instruction. Summer school attendees were then re-tested at the end of the program, and students that still did not attain the minimum scores were retained in grade (i.e., third graders that did not pass had to repeat third grade). For purposes of these exercise, we will focus only on first-time third graders (excluding those who had already been retained) who passed the math portion of the exam; the goal is to estimate the causal effect of participating in summer school on reading performance the following year (when some of the students had advanced to fourth grade, while others were retained in third grade).

The simulated dataset (posted on Canvas) contains the following variables:

  • stuID: unique identifier for each family
  • schID: unique identifier for the student’s school
  • male: indicator variable equal to one if the student is male
  • free_lunch: indicator variable equal to one if the student is eligible for the free/reduced-price lunch program
  • ITBS_read_96: student’s score on the reading portion of the Iowa Test of Basic Skills in 3rd grade in 1996. Students scoring less than 180 were assigned to remedial summer school.
  • attend_summer: indicator variable equal to one if the student attended summer school
  • ITBS_read_97: student’s score on the reading portion of the Iowa Test of Basic Skills in 1997

The following questions ask you to evaluate whether the assumptions required for RD are reasonable, and then to estimate the effect of attending summer school on following-year reading performance for students near the margin of passing.

summer_school <- read_csv(file = "Summer_School.csv")

2.1 Continuity of \(F\) (1 pt)

Check the continuity of the forcing variable. Create a graphical diagnostic (e.g., a histogram of ITBS_read_96) or report a statistical analysis to evaluate whether the density of the forcing variable is continuous at 180 (the cut-off for being assigned to summer school). Is there any evidence of tampering with the forcing variable?

  • A histogram of the forcing variable shows a large jump at the cut-score.

  • The McCrary test affirms this interpretation, indicating that units exerted control over their scores to influence their assignment (p = 0.016).

ggplot(data = summer_school, aes(x = ITBS_read_96)) +
  geom_histogram(bins = 25) +
  geom_vline(xintercept = 180)

DCdensity(summer_school$ITBS_read_96, 180)

## [1] 0.01624201

2.2 Balance (2 pt)

Check balance on the covariates male and free_lunch in the neighborhood of the cut-off using linear regressions of the covariate on the forcing variable, treatment assignment, and their interaction. Report the difference in the proportion of males and the difference in the proportion of students receiving free lunch at the cut-off. What do the results of your analysis indicate about exogeneity of the forcing varaible?

  • The proportion of males just above the cut-score is not significantly different from the proportion of males just below it (B = -0.021, SE = 0.022, 95% CI [-0.064, 0.023]).

  • The proportion of students eligible for free lunch was not significantly different in the neighborhood of the cut-off (B = 0.003, SE = 0.018, 95% CI [-0.032, 0.039]).

  • These results indicate that the forcing variable is exogenous (not influenced by these variables)

summer_rdd <- summer_school %>%
  mutate(c_read_96 = ITBS_read_96 - 180) #center F

male_balance_bw5 <- lm_robust(
  male ~ c_read_96 + attend_summer + c_read_96:attend_summer,
  data = filter(summer_rdd,
                c_read_96 >= -5 & #bandwidth = 5
                  c_read_96 <= 5))

summary(male_balance_bw5)
## 
## Call:
## lm_robust(formula = male ~ c_read_96 + attend_summer + c_read_96:attend_summer, 
##     data = filter(summer_rdd, c_read_96 >= -5 & c_read_96 <= 
##         5))
## 
## Standard error type:  HC2 
## 
## Coefficients:
##                          Estimate Std. Error t value  Pr(>|t|) CI Lower
## (Intercept)              0.480966    0.03102 15.5052 1.566e-46  0.42006
## c_read_96                0.001173    0.01039  0.1129 9.101e-01 -0.01923
## attend_summer           -0.021386    0.06397 -0.3343 7.382e-01 -0.14699
## c_read_96:attend_summer -0.020791    0.02210 -0.9408 3.472e-01 -0.06419
##                         CI Upper  DF
## (Intercept)              0.54187 667
## c_read_96                0.02158 667
## attend_summer            0.10422 667
## c_read_96:attend_summer  0.02260 667
## 
## Multiple R-squared:  0.002038 ,  Adjusted R-squared:  -0.002451 
## F-statistic: 0.4325 on 3 and 667 DF,  p-value: 0.7298
free_lunch_balance <- lm_robust(
  free_lunch ~ c_read_96 + attend_summer + c_read_96:attend_summer,
  data = filter(summer_rdd,
                c_read_96 >= -5 & #bandwidth = 10
                c_read_96 <= 5))

summary(free_lunch_balance)
## 
## Call:
## lm_robust(formula = free_lunch ~ c_read_96 + attend_summer + 
##     c_read_96:attend_summer, data = filter(summer_rdd, c_read_96 >= 
##     -5 & c_read_96 <= 5))
## 
## Standard error type:  HC2 
## 
## Coefficients:
##                           Estimate Std. Error t value   Pr(>|t|) CI Lower
## (Intercept)              0.7735817   0.026273 29.4439 9.815e-123  0.72199
## c_read_96               -0.0009208   0.008963 -0.1027  9.182e-01 -0.01852
## attend_summer            0.0040561   0.052679  0.0770  9.386e-01 -0.09938
## c_read_96:attend_summer  0.0032444   0.018184  0.1784  8.584e-01 -0.03246
##                         CI Upper  DF
## (Intercept)              0.82517 667
## c_read_96                0.01668 667
## attend_summer            0.10749 667
## c_read_96:attend_summer  0.03895 667
## 
## Multiple R-squared:  4.936e-05 , Adjusted R-squared:  -0.004448 
## F-statistic: 0.01079 on 3 and 667 DF,  p-value: 0.9985

2.3 Instrument effectiveness (2 pt)

Does being assigned to summer school induce students to attend? In other words, is there a discontinuity at the cut-off in the probability of attending summer school? Provide graphical evidence and report a regression analysis in support of your answer.

  • Summer school assignment significantly increased attendance (B = -0.767, SE = 0.035, 95% CI [-0.853, -0.688]), making it an effective instrument for attendance.
first_stage <- rdrobust(y = summer_rdd$attend_summer, 
                        x = summer_rdd$c_read_96,
                        c = 0,
                        vce = "hc2")

summary(first_stage)
## Sharp RD estimates using local polynomial regression.
## 
## Number of Obs.                 4487
## BW type                       mserd
## Kernel                   Triangular
## VCE method                      HC2
## 
## Number of Obs.                  966         3521
## Eff. Number of Obs.             606         1114
## Order est. (p)                    1            1
## Order bias  (q)                   2            2
## BW est. (h)                  13.459       13.459
## BW bias (b)                  21.529       21.529
## rho (h/b)                     0.625        0.625
## Unique Obs.                     966         3521
## 
## =============================================================================
##         Method     Coef. Std. Err.         z     P>|z|      [ 95% C.I. ]       
## =============================================================================
##   Conventional    -0.767     0.035   -21.677     0.000    [-0.837 , -0.698]    
##         Robust         -         -   -18.293     0.000    [-0.853 , -0.688]    
## =============================================================================
with(summer_rdd,
     rdplot(y = attend_summer,
            x = c_read_96,
            c = 0))

2.4 Intent to treat (2 pts)

For this question, you can ignore the fact that students are nested within schools, and treat each observation as independent.

Estimate the effect of being assigned to summer school on following-year reading scores for students close to the cut-off. Report a point estimate, standard error, and 95% CI. Describe your estimation strategy in sufficient detail so that it could be replicated.

  • Assignment to summer school caused a 5.43 point increase in following-year reading scores for students near the cut-off (B = -5.43, SE = 1.50, 95% CI [-9.360, -2.67]).

  • The distribution of scores was curvilinear, so nonparametric estimation was used to estimate the difference in following-year reading scores around the cut-off.

  • First-year test scores were first centered around a cut-score of 180 then regressed on second-year scores. Bandwidth was selected using a MSE-optimal bandwidth selector (“mserd”); triangular kernels and heteroskedasticity-robust standard errors were also used.

itt <- rdrobust(y = summer_rdd$ITBS_read_97, 
                        x = summer_rdd$c_read_96,
                        c = 0,
                        bwselect = "mserd",
                        vce = "hc2")

summary(itt)
## Sharp RD estimates using local polynomial regression.
## 
## Number of Obs.                 4487
## BW type                       mserd
## Kernel                   Triangular
## VCE method                      HC2
## 
## Number of Obs.                  966         3521
## Eff. Number of Obs.             487          761
## Order est. (p)                    1            1
## Order bias  (q)                   2            2
## BW est. (h)                   9.733        9.733
## BW bias (b)                  18.085       18.085
## rho (h/b)                     0.538        0.538
## Unique Obs.                     966         3521
## 
## =============================================================================
##         Method     Coef. Std. Err.         z     P>|z|      [ 95% C.I. ]       
## =============================================================================
##   Conventional    -5.425     1.503    -3.609     0.000    [-8.371 , -2.478]    
##         Robust         -         -    -3.531     0.000    [-9.360 , -2.678]    
## =============================================================================
with(summer_rdd,
     rdplot(y = ITBS_read_97,
            x = c_read_96,
            c = 0))

2.5 CATEC (2 pts)

For this question, you can ignore the fact that students are nested within schools, and treat each observation as independent.

Estimate the effect of attending summer school on following-year reading scores for students close to the cut-off. Report a point estimate, standard error, and 95% CI. Describe your estimation strategy in sufficient detail so that it could be replicated.

  • Attending summer school caused a 6.40 point increase in following-year reading scores for students near the cut-off (B = 6.40, SE = 1.73, 95% CI [-0.841, -0.691]).

  • Nonparametric estimation was used to assess the difference in following-year reading scores around the cut-off.

  • First-year test scores were first centered around a cut-score of 180 then regressed on second-year scores. Bandwidth was selected using a MSE-optimal bandwidth selector (“mserd”); triangular kernels and heteroskedasticity-robust standard errors were also used.

catec <- with(summer_rdd,
              rdrobust(
                y = ITBS_read_97, #outcome
                x = c_read_96, #forcing variable
                c = 0, #cut off
                fuzzy = attend_summer, #treatment receipt
                bwselect = "mserd", #bandwidth estimator
                vce = "hc2"
              ))

summary(catec)
## Fuzzy RD estimates using local polynomial regression.
## 
## Number of Obs.                 4487
## BW type                       mserd
## Kernel                   Triangular
## VCE method                      HC2
## 
## Number of Obs.                  966         3521
## Eff. Number of Obs.             537          903
## Order est. (p)                    1            1
## Order bias  (q)                   2            2
## BW est. (h)                  11.343       11.343
## BW bias (b)                  20.062       20.062
## rho (h/b)                     0.565        0.565
## Unique Obs.                     966         3521
## 
## First-stage estimates.
## 
## =============================================================================
##         Method     Coef. Std. Err.         z     P>|z|      [ 95% C.I. ]       
## =============================================================================
##   Conventional    -0.766     0.038   -20.032     0.000    [-0.841 , -0.691]    
##         Robust         -         -   -17.469     0.000    [-0.856 , -0.683]    
## =============================================================================
## 
## Treatment effect estimates.
## 
## =============================================================================
##         Method     Coef. Std. Err.         z     P>|z|      [ 95% C.I. ]       
## =============================================================================
##   Conventional     6.402     1.727     3.706     0.000     [3.017 , 9.788]     
##         Robust         -         -     3.621     0.000     [3.305 , 11.108]    
## =============================================================================

2.6 Sensitivity analysis (3 pts)

Examine the sensitivity of the CATEC estimate from the previous question to a) variation in the functional form of the regression model and b) the width of the bandwidth/window used for estimation. Provide a table that summarizes estimates based on alternative models, and explain your assessment of whether the results are sensitive to the choice of model.

  • Sensitivity analysis shows that the results are sensitive to both bandwidth and the functional form of the regression model, although bandwidth had a much greater impact on the effect size.

  • Effect estimates were smaller when a parametric model was estimated.

  • Smaller bandwidths produced larger effect sizes.

Sensitivity to Form and Bandwidth

Method Bandwidth Effect Size
Nonparametric 9.67 (ideal) 6.40
Parametric 10 5.17
Nonparametric 19.33 (twice) 4.90
Parametric 20 3.68
Nonparametric 4.83 (half) 10.57
Parametric 5 9.14
summer_rdd <- summer_rdd %>%
  mutate(below_cutoff = ITBS_read_96 <= 180)

catec_lm_10 <- iv_robust(
  ITBS_read_97 ~ c_read_96 + attend_summer | c_read_96 + below_cutoff,
  data = filter(summer_rdd,
                c_read_96 >= -10 & #bandwidth = 10
                  c_read_96 <= 10))

summary(catec_lm_10)
## 
## Call:
## iv_robust(formula = ITBS_read_97 ~ c_read_96 + attend_summer | 
##     c_read_96 + below_cutoff, data = filter(summer_rdd, c_read_96 >= 
##     -10 & c_read_96 <= 10))
## 
## Standard error type:  HC2 
## 
## Coefficients:
##               Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper   DF
## (Intercept)   181.9346     0.7323 248.426 0.000e+00 180.4978  183.371 1277
## c_read_96       0.9382     0.1102   8.511 4.755e-17   0.7219    1.154 1277
## attend_summer   5.1707     1.6217   3.188 1.465e-03   1.9892    8.352 1277
## 
## Multiple R-squared:  0.1512 ,    Adjusted R-squared:  0.1499 
## F-statistic: 67.01 on 2 and 1277 DF,  p-value: < 2.2e-16
catec_lm_20 <- iv_robust(
  ITBS_read_97 ~ c_read_96 + attend_summer | c_read_96 + below_cutoff,
  data = filter(summer_rdd,
                c_read_96 >= -20 & #bandwidth = 5
                  c_read_96 <= 20))

summary(catec_lm_20)
## 
## Call:
## iv_robust(formula = ITBS_read_97 ~ c_read_96 + attend_summer | 
##     c_read_96 + below_cutoff, data = filter(summer_rdd, c_read_96 >= 
##     -20 & c_read_96 <= 20))
## 
## Standard error type:  HC2 
## 
## Coefficients:
##               Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper   DF
## (Intercept)   182.4254    0.54592  334.16 0.000e+00 181.3549 183.4959 2472
## c_read_96       0.8359    0.04164   20.08 3.560e-83   0.7543   0.9175 2472
## attend_summer   3.6782    1.18663    3.10 1.959e-03   1.3513   6.0051 2472
## 
## Multiple R-squared:  0.3268 ,    Adjusted R-squared:  0.3263 
## F-statistic: 524.8 on 2 and 2472 DF,  p-value: < 2.2e-16
catec_lm_5 <- iv_robust(
  ITBS_read_97 ~ c_read_96 + attend_summer | c_read_96 + below_cutoff,
  data = filter(summer_rdd,
                c_read_96 >= -5 & #bandwidth = 5
                  c_read_96 <= 5))

summary(catec_lm_5)
## 
## Call:
## iv_robust(formula = ITBS_read_97 ~ c_read_96 + attend_summer | 
##     c_read_96 + below_cutoff, data = filter(summer_rdd, c_read_96 >= 
##     -5 & c_read_96 <= 5))
## 
## Standard error type:  HC2 
## 
## Coefficients:
##               Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper  DF
## (Intercept)    180.118     0.9578 188.062 0.000e+00  178.237  181.998 668
## c_read_96        1.628     0.2944   5.530 4.591e-08    1.050    2.206 668
## attend_summer    9.139     2.2424   4.075 5.149e-05    4.735   13.542 668
## 
## Multiple R-squared:  0.152 , Adjusted R-squared:  0.1495 
## F-statistic: 16.26 on 2 and 668 DF,  p-value: 1.276e-07
catec_ep <- with(summer_rdd,
              rdrobust(
                y = ITBS_read_97, #outcome
                x = c_read_96, #forcing variable
                c = 0, #cut off
                fuzzy = attend_summer, #treatment receipt
                kernel = "epanechnikov",
                vce = "hc2"
              ))

summary(catec_ep)
## Fuzzy RD estimates using local polynomial regression.
## 
## Number of Obs.                 4487
## BW type                       mserd
## Kernel                   Epanechnikov
## VCE method                      HC2
## 
## Number of Obs.                  966         3521
## Eff. Number of Obs.             509          813
## Order est. (p)                    1            1
## Order bias  (q)                   2            2
## BW est. (h)                  10.273       10.273
## BW bias (b)                  19.781       19.781
## rho (h/b)                     0.519        0.519
## Unique Obs.                     966         3521
## 
## First-stage estimates.
## 
## =============================================================================
##         Method     Coef. Std. Err.         z     P>|z|      [ 95% C.I. ]       
## =============================================================================
##   Conventional    -0.762     0.039   -19.731     0.000    [-0.838 , -0.687]    
##         Robust         -         -   -17.407     0.000    [-0.852 , -0.679]    
## =============================================================================
## 
## Treatment effect estimates.
## 
## =============================================================================
##         Method     Coef. Std. Err.         z     P>|z|      [ 95% C.I. ]       
## =============================================================================
##   Conventional     6.082     1.759     3.458     0.001     [2.634 , 9.530]     
##         Robust         -         -     3.368     0.001     [2.820 , 10.671]    
## =============================================================================

2.6.1 Bandwidth Sensitivity

#ideal bandwidth (mserd = 9.665)
with(summer_rdd,
     rdbwselect(y = ITBS_read_97, x = c_read_96, c = 0)) %>%
     summary()
## Call: rdbwselect
## 
## Number of Obs.                 4487
## BW type                       mserd
## Kernel                   Triangular
## VCE method                       NN
## 
## Number of Obs.                  966         3521
## Order est. (p)                    1            1
## Order bias  (q)                   2            2
## Unique Obs.                     966         3521
## 
## =======================================================
##                   BW est. (h)    BW bias (b)
##             Left of c Right of c  Left of c Right of c
## =======================================================
##      mserd     9.665      9.665     18.040     18.040
## =======================================================
catec_half <- with(summer_rdd,
              rdrobust(
                y = ITBS_read_97, #outcome
                x = c_read_96, #forcing variable
                c = 0, #cut off
                fuzzy = attend_summer, #treatment receipt
                h = 4.83, #bandwidth estimator
                vce = "hc2"
              ))

summary(catec_half)
## Fuzzy RD estimates using local polynomial regression.
## 
## Number of Obs.                 4487
## BW type                      Manual
## Kernel                   Triangular
## VCE method                      HC2
## 
## Number of Obs.                  966         3521
## Eff. Number of Obs.             263          372
## Order est. (p)                    1            1
## Order bias  (q)                   2            2
## BW est. (h)                   4.830        4.830
## BW bias (b)                   4.830        4.830
## rho (h/b)                     1.000        1.000
## Unique Obs.                     966         3521
## 
## First-stage estimates.
## 
## =============================================================================
##         Method     Coef. Std. Err.         z     P>|z|      [ 95% C.I. ]       
## =============================================================================
##   Conventional    -0.773     0.057   -13.477     0.000    [-0.886 , -0.661]    
##         Robust         -         -    -9.715     0.000    [-0.926 , -0.615]    
## =============================================================================
## 
## Treatment effect estimates.
## 
## =============================================================================
##         Method     Coef. Std. Err.         z     P>|z|      [ 95% C.I. ]       
## =============================================================================
##   Conventional    10.569     2.604     4.059     0.000     [5.466 , 15.673]    
##         Robust         -         -     3.527     0.000     [5.984 , 20.952]    
## =============================================================================
catec_2x <- with(summer_rdd,
              rdrobust(
                y = ITBS_read_97, #outcome
                x = c_read_96, #forcing variable
                c = 0, #cut off
                fuzzy = attend_summer, #treatment receipt
                h = 19.33, #bandwidth estimator
                vce = "hc2"
              ))

summary(catec_2x)
## Fuzzy RD estimates using local polynomial regression.
## 
## Number of Obs.                 4487
## BW type                      Manual
## Kernel                   Triangular
## VCE method                      HC2
## 
## Number of Obs.                  966         3521
## Eff. Number of Obs.             775         1620
## Order est. (p)                    1            1
## Order bias  (q)                   2            2
## BW est. (h)                  19.330       19.330
## BW bias (b)                  19.330       19.330
## rho (h/b)                     1.000        1.000
## Unique Obs.                     966         3521
## 
## First-stage estimates.
## 
## =============================================================================
##         Method     Coef. Std. Err.         z     P>|z|      [ 95% C.I. ]       
## =============================================================================
##   Conventional    -0.763     0.030   -25.524     0.000    [-0.822 , -0.704]    
##         Robust         -         -   -17.842     0.000    [-0.858 , -0.689]    
## =============================================================================
## 
## Treatment effect estimates.
## 
## =============================================================================
##         Method     Coef. Std. Err.         z     P>|z|      [ 95% C.I. ]       
## =============================================================================
##   Conventional     4.900     1.351     3.628     0.000     [2.253 , 7.548]     
##         Robust         -         -     3.727     0.000     [3.504 , 11.278]    
## =============================================================================

3 Extra credit

3.1 Nesting (2 pts extra credit)

Propose methods of estimating the average effect of being assigned to summer school (i.e., the ITT at the cutoff) and the effect of attending summer school (i.e., the CATE at the cutoff) that account for the fact that students are nested within schools. Describe your estimation strategies in sufficient detail so that it could be replicated, and report a point estimate, standard error, and 95% CI for the results.

3.2 Heterogeneity (3 pts extra credit)

Propose a method of examining the degree of heterogeneity across schools in the effects of being assigned to summer school (i.e., the ITT at the cutoff). Describe your estimation strategy in sufficient detail so that it could be replicated, and report results including measures of uncertainty (i.e., standard error, 95% CI).