library(knitr)
library(kableExtra)
library(tidyverse)
library(estimatr)
library(rdd)
library(rdrobust)
library(rddensity)
Find an empirical study on any topic that interests you, where the study involves a regression discontinuity design. Do not use a study that has been assigned as reading for the course or used as an example in class (i.e., do not use Ludwig & Miller, 2007, or May et al., 2023). Upload a pdf of the study to Canvas and answer the following questions about the study.
Briefly explain a) the forcing variable(s) involved in the design, b) the rule used to determine treatment assignment, and c) the primary outcome(s) examined.
College course placement test scores were used as the forcing variable
The cut-score that determined remediation assignment was determined by each institution in the sample based on community college system-level policies
Several outcomes are investigated, including (a) the initial decision to enroll; (b) grades in subsequent courses in the same subject; and (c) post-treatment proficiency exam scores
Did the study present any evidence about the integrity of the forcing variable? If so, explain the approach used to examine the question and briefly summarize what was found.
The authors present strong evidence for the integrity of the forcing variable. First, they note that manipulation of scores is unlikely given testing policies: re-taking the test to change assignment to remediation is not allowed unless 20 credits of regular courses are completed. Manipulation of the cut-score is also prohibited by system-wide minimums for remediation cut-scores.
Density around the cut-off was assessed visually and revealed no discontinuities (graphs are provided). The McCrary test was also used to formally test this and confirmed no discontinuities were present.
In your own words, explain the analytic approach used to estimate treatment effects. Note details such as a) use of parametric or non-parametric regression, b) choice of bandwidth (if relevant), c) choice of kernel (if relevant), and d) any bias corrections used. Provide citations to relevant passages/page numbers in the article.
Local linear estimation (parametric) is used (p. 25)
A bandwidth of +/-6 points was used for the main specification (p. 25)
No kernel is specified.
Bias correction was not used. However, sensitivity analysis indicated that the estimates were robust across a wider bandwidth, as well as with the inclusion of quadratic terms for the test score distance variables (p. 25)
In the remainder of this assignment, you will analyze a simulated study that is inspired by a study of the effects of remedial education (summer school) on the achievement levels of elementary school students in Chicago Public Schools (Jacob & Lefgren, 2004). In 1996, Chicago Public Schools implemented a policy requiring that students grades 3, 6, and 8 meet certain performance standards on the Iowa Test of Basic Skills in order to advance to the next grade. Students who did not attain certain minimum scores on the math and reading portions of the exam were required to attend 6 weeks of summer school for further instruction. Summer school attendees were then re-tested at the end of the program, and students that still did not attain the minimum scores were retained in grade (i.e., third graders that did not pass had to repeat third grade). For purposes of these exercise, we will focus only on first-time third graders (excluding those who had already been retained) who passed the math portion of the exam; the goal is to estimate the causal effect of participating in summer school on reading performance the following year (when some of the students had advanced to fourth grade, while others were retained in third grade).
The simulated dataset (posted on Canvas) contains the following variables:
stuID
: unique identifier for each familyschID
: unique identifier for the student’s schoolmale
: indicator variable equal to one if the student is
malefree_lunch
: indicator variable equal to one if the
student is eligible for the free/reduced-price lunch programITBS_read_96
: student’s score on the reading portion of
the Iowa Test of Basic Skills in 3rd grade in 1996. Students
scoring less than 180 were assigned to remedial summer
school.attend_summer
: indicator variable equal to one if the
student attended summer schoolITBS_read_97
: student’s score on the reading portion of
the Iowa Test of Basic Skills in 1997The following questions ask you to evaluate whether the assumptions required for RD are reasonable, and then to estimate the effect of attending summer school on following-year reading performance for students near the margin of passing.
summer_school <- read_csv(file = "Summer_School.csv")
Check the continuity of the forcing variable. Create a graphical diagnostic (e.g., a histogram of ITBS_read_96) or report a statistical analysis to evaluate whether the density of the forcing variable is continuous at 180 (the cut-off for being assigned to summer school). Is there any evidence of tampering with the forcing variable?
A histogram of the forcing variable shows a large jump at the cut-score.
The McCrary test affirms this interpretation, indicating that units exerted control over their scores to influence their assignment (p = 0.016).
ggplot(data = summer_school, aes(x = ITBS_read_96)) +
geom_histogram(bins = 25) +
geom_vline(xintercept = 180)
DCdensity(summer_school$ITBS_read_96, 180)
## [1] 0.01624201
Check balance on the covariates male
and
free_lunch
in the neighborhood of the cut-off using linear
regressions of the covariate on the forcing variable, treatment
assignment, and their interaction. Report the difference in the
proportion of males and the difference in the proportion of students
receiving free lunch at the cut-off. What do the results of your
analysis indicate about exogeneity of the forcing varaible?
The proportion of males just above the cut-score is not significantly different from the proportion of males just below it (B = -0.021, SE = 0.022, 95% CI [-0.064, 0.023]).
The proportion of students eligible for free lunch was not significantly different in the neighborhood of the cut-off (B = 0.003, SE = 0.018, 95% CI [-0.032, 0.039]).
These results indicate that the forcing variable is exogenous (not influenced by these variables)
summer_rdd <- summer_school %>%
mutate(c_read_96 = ITBS_read_96 - 180) #center F
male_balance_bw5 <- lm_robust(
male ~ c_read_96 + attend_summer + c_read_96:attend_summer,
data = filter(summer_rdd,
c_read_96 >= -5 & #bandwidth = 5
c_read_96 <= 5))
summary(male_balance_bw5)
##
## Call:
## lm_robust(formula = male ~ c_read_96 + attend_summer + c_read_96:attend_summer,
## data = filter(summer_rdd, c_read_96 >= -5 & c_read_96 <=
## 5))
##
## Standard error type: HC2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower
## (Intercept) 0.480966 0.03102 15.5052 1.566e-46 0.42006
## c_read_96 0.001173 0.01039 0.1129 9.101e-01 -0.01923
## attend_summer -0.021386 0.06397 -0.3343 7.382e-01 -0.14699
## c_read_96:attend_summer -0.020791 0.02210 -0.9408 3.472e-01 -0.06419
## CI Upper DF
## (Intercept) 0.54187 667
## c_read_96 0.02158 667
## attend_summer 0.10422 667
## c_read_96:attend_summer 0.02260 667
##
## Multiple R-squared: 0.002038 , Adjusted R-squared: -0.002451
## F-statistic: 0.4325 on 3 and 667 DF, p-value: 0.7298
free_lunch_balance <- lm_robust(
free_lunch ~ c_read_96 + attend_summer + c_read_96:attend_summer,
data = filter(summer_rdd,
c_read_96 >= -5 & #bandwidth = 10
c_read_96 <= 5))
summary(free_lunch_balance)
##
## Call:
## lm_robust(formula = free_lunch ~ c_read_96 + attend_summer +
## c_read_96:attend_summer, data = filter(summer_rdd, c_read_96 >=
## -5 & c_read_96 <= 5))
##
## Standard error type: HC2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower
## (Intercept) 0.7735817 0.026273 29.4439 9.815e-123 0.72199
## c_read_96 -0.0009208 0.008963 -0.1027 9.182e-01 -0.01852
## attend_summer 0.0040561 0.052679 0.0770 9.386e-01 -0.09938
## c_read_96:attend_summer 0.0032444 0.018184 0.1784 8.584e-01 -0.03246
## CI Upper DF
## (Intercept) 0.82517 667
## c_read_96 0.01668 667
## attend_summer 0.10749 667
## c_read_96:attend_summer 0.03895 667
##
## Multiple R-squared: 4.936e-05 , Adjusted R-squared: -0.004448
## F-statistic: 0.01079 on 3 and 667 DF, p-value: 0.9985
Does being assigned to summer school induce students to attend? In other words, is there a discontinuity at the cut-off in the probability of attending summer school? Provide graphical evidence and report a regression analysis in support of your answer.
first_stage <- rdrobust(y = summer_rdd$attend_summer,
x = summer_rdd$c_read_96,
c = 0,
vce = "hc2")
summary(first_stage)
## Sharp RD estimates using local polynomial regression.
##
## Number of Obs. 4487
## BW type mserd
## Kernel Triangular
## VCE method HC2
##
## Number of Obs. 966 3521
## Eff. Number of Obs. 606 1114
## Order est. (p) 1 1
## Order bias (q) 2 2
## BW est. (h) 13.459 13.459
## BW bias (b) 21.529 21.529
## rho (h/b) 0.625 0.625
## Unique Obs. 966 3521
##
## =============================================================================
## Method Coef. Std. Err. z P>|z| [ 95% C.I. ]
## =============================================================================
## Conventional -0.767 0.035 -21.677 0.000 [-0.837 , -0.698]
## Robust - - -18.293 0.000 [-0.853 , -0.688]
## =============================================================================
with(summer_rdd,
rdplot(y = attend_summer,
x = c_read_96,
c = 0))
For this question, you can ignore the fact that students are nested within schools, and treat each observation as independent.
Estimate the effect of being assigned to summer school on following-year reading scores for students close to the cut-off. Report a point estimate, standard error, and 95% CI. Describe your estimation strategy in sufficient detail so that it could be replicated.
Assignment to summer school caused a 5.43 point increase in following-year reading scores for students near the cut-off (B = -5.43, SE = 1.50, 95% CI [-9.360, -2.67]).
The distribution of scores was curvilinear, so nonparametric estimation was used to estimate the difference in following-year reading scores around the cut-off.
First-year test scores were first centered around a cut-score of 180 then regressed on second-year scores. Bandwidth was selected using a MSE-optimal bandwidth selector (“mserd”); triangular kernels and heteroskedasticity-robust standard errors were also used.
itt <- rdrobust(y = summer_rdd$ITBS_read_97,
x = summer_rdd$c_read_96,
c = 0,
bwselect = "mserd",
vce = "hc2")
summary(itt)
## Sharp RD estimates using local polynomial regression.
##
## Number of Obs. 4487
## BW type mserd
## Kernel Triangular
## VCE method HC2
##
## Number of Obs. 966 3521
## Eff. Number of Obs. 487 761
## Order est. (p) 1 1
## Order bias (q) 2 2
## BW est. (h) 9.733 9.733
## BW bias (b) 18.085 18.085
## rho (h/b) 0.538 0.538
## Unique Obs. 966 3521
##
## =============================================================================
## Method Coef. Std. Err. z P>|z| [ 95% C.I. ]
## =============================================================================
## Conventional -5.425 1.503 -3.609 0.000 [-8.371 , -2.478]
## Robust - - -3.531 0.000 [-9.360 , -2.678]
## =============================================================================
with(summer_rdd,
rdplot(y = ITBS_read_97,
x = c_read_96,
c = 0))
For this question, you can ignore the fact that students are nested within schools, and treat each observation as independent.
Estimate the effect of attending summer school on following-year reading scores for students close to the cut-off. Report a point estimate, standard error, and 95% CI. Describe your estimation strategy in sufficient detail so that it could be replicated.
Attending summer school caused a 6.40 point increase in following-year reading scores for students near the cut-off (B = 6.40, SE = 1.73, 95% CI [-0.841, -0.691]).
Nonparametric estimation was used to assess the difference in following-year reading scores around the cut-off.
First-year test scores were first centered around a cut-score of 180 then regressed on second-year scores. Bandwidth was selected using a MSE-optimal bandwidth selector (“mserd”); triangular kernels and heteroskedasticity-robust standard errors were also used.
catec <- with(summer_rdd,
rdrobust(
y = ITBS_read_97, #outcome
x = c_read_96, #forcing variable
c = 0, #cut off
fuzzy = attend_summer, #treatment receipt
bwselect = "mserd", #bandwidth estimator
vce = "hc2"
))
summary(catec)
## Fuzzy RD estimates using local polynomial regression.
##
## Number of Obs. 4487
## BW type mserd
## Kernel Triangular
## VCE method HC2
##
## Number of Obs. 966 3521
## Eff. Number of Obs. 537 903
## Order est. (p) 1 1
## Order bias (q) 2 2
## BW est. (h) 11.343 11.343
## BW bias (b) 20.062 20.062
## rho (h/b) 0.565 0.565
## Unique Obs. 966 3521
##
## First-stage estimates.
##
## =============================================================================
## Method Coef. Std. Err. z P>|z| [ 95% C.I. ]
## =============================================================================
## Conventional -0.766 0.038 -20.032 0.000 [-0.841 , -0.691]
## Robust - - -17.469 0.000 [-0.856 , -0.683]
## =============================================================================
##
## Treatment effect estimates.
##
## =============================================================================
## Method Coef. Std. Err. z P>|z| [ 95% C.I. ]
## =============================================================================
## Conventional 6.402 1.727 3.706 0.000 [3.017 , 9.788]
## Robust - - 3.621 0.000 [3.305 , 11.108]
## =============================================================================
Examine the sensitivity of the CATEC estimate from the previous question to a) variation in the functional form of the regression model and b) the width of the bandwidth/window used for estimation. Provide a table that summarizes estimates based on alternative models, and explain your assessment of whether the results are sensitive to the choice of model.
Sensitivity analysis shows that the results are sensitive to both bandwidth and the functional form of the regression model, although bandwidth had a much greater impact on the effect size.
Effect estimates were smaller when a parametric model was estimated.
Smaller bandwidths produced larger effect sizes.
Sensitivity to Form and Bandwidth
Method | Bandwidth | Effect Size |
---|---|---|
Nonparametric | 9.67 (ideal) | 6.40 |
Parametric | 10 | 5.17 |
Nonparametric | 19.33 (twice) | 4.90 |
Parametric | 20 | 3.68 |
Nonparametric | 4.83 (half) | 10.57 |
Parametric | 5 | 9.14 |
summer_rdd <- summer_rdd %>%
mutate(below_cutoff = ITBS_read_96 <= 180)
catec_lm_10 <- iv_robust(
ITBS_read_97 ~ c_read_96 + attend_summer | c_read_96 + below_cutoff,
data = filter(summer_rdd,
c_read_96 >= -10 & #bandwidth = 10
c_read_96 <= 10))
summary(catec_lm_10)
##
## Call:
## iv_robust(formula = ITBS_read_97 ~ c_read_96 + attend_summer |
## c_read_96 + below_cutoff, data = filter(summer_rdd, c_read_96 >=
## -10 & c_read_96 <= 10))
##
## Standard error type: HC2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) 181.9346 0.7323 248.426 0.000e+00 180.4978 183.371 1277
## c_read_96 0.9382 0.1102 8.511 4.755e-17 0.7219 1.154 1277
## attend_summer 5.1707 1.6217 3.188 1.465e-03 1.9892 8.352 1277
##
## Multiple R-squared: 0.1512 , Adjusted R-squared: 0.1499
## F-statistic: 67.01 on 2 and 1277 DF, p-value: < 2.2e-16
catec_lm_20 <- iv_robust(
ITBS_read_97 ~ c_read_96 + attend_summer | c_read_96 + below_cutoff,
data = filter(summer_rdd,
c_read_96 >= -20 & #bandwidth = 5
c_read_96 <= 20))
summary(catec_lm_20)
##
## Call:
## iv_robust(formula = ITBS_read_97 ~ c_read_96 + attend_summer |
## c_read_96 + below_cutoff, data = filter(summer_rdd, c_read_96 >=
## -20 & c_read_96 <= 20))
##
## Standard error type: HC2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) 182.4254 0.54592 334.16 0.000e+00 181.3549 183.4959 2472
## c_read_96 0.8359 0.04164 20.08 3.560e-83 0.7543 0.9175 2472
## attend_summer 3.6782 1.18663 3.10 1.959e-03 1.3513 6.0051 2472
##
## Multiple R-squared: 0.3268 , Adjusted R-squared: 0.3263
## F-statistic: 524.8 on 2 and 2472 DF, p-value: < 2.2e-16
catec_lm_5 <- iv_robust(
ITBS_read_97 ~ c_read_96 + attend_summer | c_read_96 + below_cutoff,
data = filter(summer_rdd,
c_read_96 >= -5 & #bandwidth = 5
c_read_96 <= 5))
summary(catec_lm_5)
##
## Call:
## iv_robust(formula = ITBS_read_97 ~ c_read_96 + attend_summer |
## c_read_96 + below_cutoff, data = filter(summer_rdd, c_read_96 >=
## -5 & c_read_96 <= 5))
##
## Standard error type: HC2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
## (Intercept) 180.118 0.9578 188.062 0.000e+00 178.237 181.998 668
## c_read_96 1.628 0.2944 5.530 4.591e-08 1.050 2.206 668
## attend_summer 9.139 2.2424 4.075 5.149e-05 4.735 13.542 668
##
## Multiple R-squared: 0.152 , Adjusted R-squared: 0.1495
## F-statistic: 16.26 on 2 and 668 DF, p-value: 1.276e-07
catec_ep <- with(summer_rdd,
rdrobust(
y = ITBS_read_97, #outcome
x = c_read_96, #forcing variable
c = 0, #cut off
fuzzy = attend_summer, #treatment receipt
kernel = "epanechnikov",
vce = "hc2"
))
summary(catec_ep)
## Fuzzy RD estimates using local polynomial regression.
##
## Number of Obs. 4487
## BW type mserd
## Kernel Epanechnikov
## VCE method HC2
##
## Number of Obs. 966 3521
## Eff. Number of Obs. 509 813
## Order est. (p) 1 1
## Order bias (q) 2 2
## BW est. (h) 10.273 10.273
## BW bias (b) 19.781 19.781
## rho (h/b) 0.519 0.519
## Unique Obs. 966 3521
##
## First-stage estimates.
##
## =============================================================================
## Method Coef. Std. Err. z P>|z| [ 95% C.I. ]
## =============================================================================
## Conventional -0.762 0.039 -19.731 0.000 [-0.838 , -0.687]
## Robust - - -17.407 0.000 [-0.852 , -0.679]
## =============================================================================
##
## Treatment effect estimates.
##
## =============================================================================
## Method Coef. Std. Err. z P>|z| [ 95% C.I. ]
## =============================================================================
## Conventional 6.082 1.759 3.458 0.001 [2.634 , 9.530]
## Robust - - 3.368 0.001 [2.820 , 10.671]
## =============================================================================
#ideal bandwidth (mserd = 9.665)
with(summer_rdd,
rdbwselect(y = ITBS_read_97, x = c_read_96, c = 0)) %>%
summary()
## Call: rdbwselect
##
## Number of Obs. 4487
## BW type mserd
## Kernel Triangular
## VCE method NN
##
## Number of Obs. 966 3521
## Order est. (p) 1 1
## Order bias (q) 2 2
## Unique Obs. 966 3521
##
## =======================================================
## BW est. (h) BW bias (b)
## Left of c Right of c Left of c Right of c
## =======================================================
## mserd 9.665 9.665 18.040 18.040
## =======================================================
catec_half <- with(summer_rdd,
rdrobust(
y = ITBS_read_97, #outcome
x = c_read_96, #forcing variable
c = 0, #cut off
fuzzy = attend_summer, #treatment receipt
h = 4.83, #bandwidth estimator
vce = "hc2"
))
summary(catec_half)
## Fuzzy RD estimates using local polynomial regression.
##
## Number of Obs. 4487
## BW type Manual
## Kernel Triangular
## VCE method HC2
##
## Number of Obs. 966 3521
## Eff. Number of Obs. 263 372
## Order est. (p) 1 1
## Order bias (q) 2 2
## BW est. (h) 4.830 4.830
## BW bias (b) 4.830 4.830
## rho (h/b) 1.000 1.000
## Unique Obs. 966 3521
##
## First-stage estimates.
##
## =============================================================================
## Method Coef. Std. Err. z P>|z| [ 95% C.I. ]
## =============================================================================
## Conventional -0.773 0.057 -13.477 0.000 [-0.886 , -0.661]
## Robust - - -9.715 0.000 [-0.926 , -0.615]
## =============================================================================
##
## Treatment effect estimates.
##
## =============================================================================
## Method Coef. Std. Err. z P>|z| [ 95% C.I. ]
## =============================================================================
## Conventional 10.569 2.604 4.059 0.000 [5.466 , 15.673]
## Robust - - 3.527 0.000 [5.984 , 20.952]
## =============================================================================
catec_2x <- with(summer_rdd,
rdrobust(
y = ITBS_read_97, #outcome
x = c_read_96, #forcing variable
c = 0, #cut off
fuzzy = attend_summer, #treatment receipt
h = 19.33, #bandwidth estimator
vce = "hc2"
))
summary(catec_2x)
## Fuzzy RD estimates using local polynomial regression.
##
## Number of Obs. 4487
## BW type Manual
## Kernel Triangular
## VCE method HC2
##
## Number of Obs. 966 3521
## Eff. Number of Obs. 775 1620
## Order est. (p) 1 1
## Order bias (q) 2 2
## BW est. (h) 19.330 19.330
## BW bias (b) 19.330 19.330
## rho (h/b) 1.000 1.000
## Unique Obs. 966 3521
##
## First-stage estimates.
##
## =============================================================================
## Method Coef. Std. Err. z P>|z| [ 95% C.I. ]
## =============================================================================
## Conventional -0.763 0.030 -25.524 0.000 [-0.822 , -0.704]
## Robust - - -17.842 0.000 [-0.858 , -0.689]
## =============================================================================
##
## Treatment effect estimates.
##
## =============================================================================
## Method Coef. Std. Err. z P>|z| [ 95% C.I. ]
## =============================================================================
## Conventional 4.900 1.351 3.628 0.000 [2.253 , 7.548]
## Robust - - 3.727 0.000 [3.504 , 11.278]
## =============================================================================
Propose methods of estimating the average effect of being assigned to summer school (i.e., the ITT at the cutoff) and the effect of attending summer school (i.e., the CATE at the cutoff) that account for the fact that students are nested within schools. Describe your estimation strategies in sufficient detail so that it could be replicated, and report a point estimate, standard error, and 95% CI for the results.
Propose a method of examining the degree of heterogeneity across schools in the effects of being assigned to summer school (i.e., the ITT at the cutoff). Describe your estimation strategy in sufficient detail so that it could be replicated, and report results including measures of uncertainty (i.e., standard error, 95% CI).