Regression Discontinuity Design (RDD) for Program Evaluation

Federico Ferrero

How can we measure the effect of a program — such as high-dosage tutoring or gifted and talented initiatives — without placing students who truly need or qualify for support into a control group?

Randomly withholding access would be ethically unacceptable. Regression Discontinuity Designs provide a solution for evaluating such programs while maintaining fairness and rigor.

Why Regression Discontinuity Matters in Program Evaluation?

In education, we often want to know whether interventions like high-dosage tutoring or gifted and talented programs actually improve student outcomes. The gold standard for causal inference is a randomized controlled trial (RCT), but randomly assigning students to a “no intervention” control group is often ethically and politically impossible. We don’t want to withhold opportunities from students who could benefit the most.

Regression Discontinuity (RD) provides a solution. Many programs already have eligibility rules based on a cutoff, such as a failing score, a risk index, or a percentile threshold. Regression Discontinuity leverages pre-existing cutoff rules. For example, students just below the eligibility threshold receive the program, while students just above it do not. The RD design then compares posttest outcomes between these two groups—students who were very similar in their pretest scores but differed only in program participation.

This creates a natural comparison group that is as close as possible to random for students near the cutoff.
We can estimate causal effects without denying students access to programs they qualify for.
It preserves ethical standards while maintaining strong internal validity, often approaching the rigor of an RCT.

Assumptions and Requirements

A clear cutoff rule exists: Program assignment must be based on a known threshold (e.g., test score ≥ 70). Students on one side of the cutoff receive the program; those on the other side do not.
The cutoff is strictly applied: Students should not be arbitrarily moved above or below the threshold. If there is some noncompliance, it must be limited and well understood (this leads to a fuzzy RD rather than a sharp RD).
No manipulation around the cutoff: Students, teachers, or administrators should not be able to “game” the system (for example, inflating scores so a student barely qualifies).
Students near the cutoff are comparable: Individuals just above and just below the threshold should be very similar in all relevant ways except for program participation. This is what allows RD to approximate random assignment.
Sufficient data near the cutoff: You need enough students close to the threshold to estimate the effect reliably. RD does not use the full sample equally—most of the information comes from observations near the cutoff.
Statistical Power: RD designs have strong internal validity, similar to randomized experiments. However, they are less statistically powerful. To reach the same level of precision, an RD study may require up to 2.75 times more participants than a randomized experiment. For example, if a randomized study needs 100 participants to achieve adequate power, an RD design may need as many as 275.

Regression Continuity vs. Regression Discontinuity

KEY RULE: If there is a program effect, there will be a discontinuity in the regression lines at the cutoff.

As we can see in the plots below, each dot represents a student.

Left plot – Regression Continuity: Students with scores below the cutoff (3550) were not assigned to a tutoring program. When we compare their pretest scores with posttest scores, the outcome changes smoothly across the cutoff, showing regression continuity. There is no jump at the threshold.
Right plot – Regression Discontinuity: Students with scores below the cutoff (3550) were assigned to the tutoring program. After the program, when we compare pretest and posttest scores, we clearly see a jump at the cutoff, which is a regression discontinuity. This jump indicates the tutoring program had a measurable effect on student outcomes.It works!

Simulating an Educational RD Case (STAAR EOC Algebra I)

This simulated example mirrors a realistic Texas STAAR EOC Algebra I tutoring policy.

Running variable (X): 2023–24 EOY STAAR EOC Algebra I score score.
Treatment: assignment to Algebra I high-dosage tutoring during the 2024–25 school year.
Outcome (Y): 2024–25 EOY STAAR EOC Algebra I scale score.
Cutoff (c): scale score = 3550, corresponding to Approaches Grade Level according to the TEA Raw Score Conversion Table (EOY 2023–24).

Note: To find the specific minimum scale score needed to achieve Approaches Grade Level for a given test, we need to consult the official TEA Raw Score Conversion Tables for that particular subject, grade, and test administration year.

Policy rule (sharp RD):

Students with raw score < 3550 did not meet Approaches and were assigned to tutoring.
Students with raw score ≥ 3550 were not assigned to tutoring.

Local randomization near the cutoff: Students just below vs just above 3550 are assumed to be similar in ability except for treatment assignment. The estimand is the causal effect of tutoring for students near the Approaches cutoff.

Previewing the Simulated Dataset

Before moving to diagnostics and estimation, it is useful to inspect the structure of the simulated data.

summary(staar_rd)

##     eoy_2324       tutoring         eoy_2425   
##  Min.   :3109   Min.   :0.0000   Min.   :3031  
##  1st Qu.:3469   1st Qu.:0.0000   1st Qu.:3513  
##  Median :3549   Median :1.0000   Median :3597  
##  Mean   :3549   Mean   :0.5007   Mean   :3599  
##  3rd Qu.:3625   3rd Qu.:1.0000   3rd Qu.:3686  
##  Max.   :3960   Max.   :1.0000   Max.   :4043

head(staar_rd, 8)

##   eoy_2324 tutoring eoy_2425
## 1     3612        0     3519
## 2     3420        1     3544
## 3     3567        0     3744
## 4     3540        1     3608
## 5     3470        1     3575
## 6     3248        1     3442
## 7     3462        1     3746
## 8     3428        1     3280

The output above shows the first eight rows of the dataset, for each student:

eoy_2324: prior-year STAAR Algebra I score score (running variable).
tutoring: assignment to Algebra I tutoring (1 = assigned, 0 = not assigned).
eoy_2425: next-year STAAR Algebra I scale score (outcome).

Visual Inspection: The First Diagnostic

library(ggplot2)
ggplot(staar_rd, aes(x = eoy_2324, y = eoy_2425, color = factor(tutoring))) +
  geom_point(alpha = 0.6) +
  geom_vline(xintercept = cutoff_score, linetype = "dashed", color = "red") +
  annotate("text", x = cutoff_score + 10, y = max(staar_rd$eoy_2425), 
           label = "Cutoff = 3550", color = "red", hjust = 0) +
  # Local linear regressions per side of cutoff
  geom_smooth(data = subset(staar_rd, eoy_2324 < cutoff_score), 
              method = "lm", se = FALSE, color = "black") +
  geom_smooth(data = subset(staar_rd, eoy_2324 >= cutoff_score), 
              method = "lm", se = FALSE, color = "black") +
  labs(
    x = "
    EOY 2023-24 STAAR EOC
    Algebra I Scale Score",
    y = "
    EOY 2024-25 STAAR EOC
    Algebra I Scale Score",
    title = "Tutoring Effect on STAAR Algebra I",
    color = "Tutoring"
  ) +
  scale_color_manual(values = c("0" = "darkgreen", "1" = "purple"),
                     labels = c("Did Not Take", "Took")) +
  theme_minimal() +
  theme(panel.grid = element_blank(),
        plot.title = element_text(hjust = 0.5))

We can also zoom in on the area around the cutoff score…

ggplot(staar_rd, aes(x = eoy_2324, y = eoy_2425, color = factor(tutoring))) +
  geom_point(alpha = 0.6) +
  geom_vline(xintercept = cutoff_score, linetype = "dashed", color = "red") +
  annotate("text", x = cutoff_score + 10, y = max(staar_rd$eoy_2425), 
           label = "Cutoff = 3550", color = "red", hjust = 0) +
  # Local linear regressions per side of cutoff
  geom_smooth(data = subset(staar_rd, eoy_2324 < cutoff_score), 
              method = "lm", se = FALSE, color = "black") +
  geom_smooth(data = subset(staar_rd, eoy_2324 >= cutoff_score), 
              method = "lm", se = FALSE, color = "black") +
  labs(
    x = "
    EOY 2023-24 STAAR EOC
    Algebra I Scale Score",
    y = "
    EOY 2024-25 STAAR EOC
    Algebra I Scale Score",
    title = "Tutoring Effect on STAAR Algebra I",
    color = "Tutoring"
  ) +
  scale_color_manual(values = c("0" = "darkgreen", "1" = "purple"),
                     labels = c("Did Not Take", "Took")) +
  theme_minimal() +
  theme(panel.grid = element_blank(),
        plot.title = element_text(hjust = 0.5)) +
  coord_cartesian(xlim = c(3500, 3600))

After inspecting the scatterplots, we can observe a regression discontinuity at the cutoff score of 3550 (STAAR EOC Algebra I). This means that among students with similar pretest scores near the cutoff, those who were assigned to and participated in the tutoring program performed better on the posttest than those who were not assigned.

The program appears to have a positive effect, but to determine how large this effect is with statistical accuracy, we need to run formal statistical analyses.

Estimate Regression Discontinuity Metrics

# Load the packages
library(rdrobust) # Implements regression discontinuity (RD) estimation with robust standard errors, allowing us to estimate the causal effect of a program at a cutoff
library(rddensity) # Helps check the density of the running variable around the cutoff, which is important to test for manipulation (no “gaming” of the score)

# Fit a regression discontinuity model: This function estimates the size of the jump at the cutoff, i.e., the causal effect of tutoring on students near the threshold
rd_model <- rdrobust(y = staar_rd$eoy_2425, x = staar_rd$eoy_2324, c = cutoff_score)
summary(rd_model)

## Sharp RD estimates using local polynomial regression.
## 
## Number of Obs.                 3000
## BW type                       mserd
## Kernel                   Triangular
## VCE method                       NN
## 
## Number of Obs.                 1502         1498
## Eff. Number of Obs.             984          984
## Order est. (p)                    1            1
## Order bias  (q)                   2            2
## BW est. (h)                 112.731      112.731
## BW bias (b)                 182.005      182.005
## rho (h/b)                     0.619        0.619
## Unique Obs.                     273          271
## 
## =====================================================================
##                    Point    Robust Inference
##                 Estimate         z     P>|z|      [ 95% C.I. ]       
## ---------------------------------------------------------------------
##      RD Effect   -92.661    -8.287     0.000  [-114.495 , -70.696]   
## =====================================================================

Type of analysis: The results are from a sharp regression discontinuity (RD) design using local linear regression (order 1 polynomial) with a triangular kernel. This is appropriate because treatment (tutoring) was strictly assigned based on the cutoff.

Number of observations: 3000 in total. Left of cutoff: 1502 students. Right of cutoff: 1498 students.

The effective number of observations used in the estimation is smaller (≈984 per side) because RD uses observations close to the cutoff, giving more weight to students near 3550.

Bandwidth: The estimation bandwidth (h) is about 113 points. This means that the RD effect is estimated using students with pretest scores roughly within 113 points of the cutoff on either side.The bias bandwidth (b) is larger (182 points) for calculating bias-corrected inference.

Estimated treatment effect: RD Effect = -92.66. Since in your simulation the tutoring was coded as 1 for students below the cutoff, a negative effect here means that the posttest scores of treated students (below cutoff) are higher than those above the cutoff (because of coding direction).

Robust z-value = -8.287, p < 0.001 → the effect is highly statistically significant.

Confidence interval: 95% CI: [-114.5, -70.7]. This means we are 95% confident that the tutoring program increased posttest scores by roughly 71–115 points for students near the cutoff.

The regression discontinuity analysis shows a large and statistically significant jump at the cutoff (3550). Students who were assigned and took tutoring scored substantially higher on the posttest than similar students just above the cutoff, confirming that the program had a clear positive impact.

Estimate the Effect Size of the Program: How Much Did Tutoring Improve Student Scores?

We compare the tutoring effect to how much scores usually vary. This tells us how big the improvement was in a way that’s easy to understand, not just that the tutoring helped.

# Extract the RD effect estimate (jump at cutoff)
tutoring_effect <- rd_model$coef[1]  # takes the first coefficient from the RD model, which is the estimated jump in posttest scores at the cutoff
tutoring_effect

## [1] -92.66131

# Calculate a standardized effect size (Cohen's d)
sd_post <- sd(staar_rd$eoy_2425) # calculates the standard deviation of the posttest scores

effect_size_d <- tutoring_effect / sd_post # divides the RD effect by this standard deviation. This gives a standardized effect size, also called Cohen’s d, which tells you how many standard deviations the tutoring improved scores.
effect_size_d

## [1] -0.6963258

The RD analysis shows that students who received tutoring scored about 93 points higher at the cutoff compared to similar students who did not receive tutoring. When we standardize this effect using the posttest score variability, we get a Cohen’s d of approximately 0.70, meaning the tutoring increased scores by about 0.7 standard deviations.

Note: Because we coded students below the cutoff as treated (1), the RD effect appears as a negative number (-93), even though the program increases posttest scores as we saw in the scatterplot. A negative estimate here actually corresponds to a positive tutoring effect.

According to Cohen’s conventions, an effect size of 0.2 is small, 0.5 is medium, and 0.8 is large, so 0.7 represents a substantial positive impact of tutoring program on student performance.

Validity Checks for the RD Design

Bandwidth Selection

The code below is used to select the optimal bandwidth for a regression discontinuity (RD) analysis. In RD, the bandwidth determines how many observations near the cutoff are included when estimating the treatment effect. Choosing the right bandwidth is important: using a bandwidth that is too wide can include students far from the cutoff, who are less comparable, while using one that is too narrow can reduce statistical power. The function rdbwselect() from the rdrobust package computes data-driven, optimal bandwidths for RD estimation using standard methods, such as minimizing mean squared error.

bw <- rdbwselect(y = staar_rd$eoy_2425, x = staar_rd$eoy_2324, c = cutoff_score)
bw

## Call: rdbwselect
## 
## Number of Obs.                 3000
## BW type                       mserd
## Kernel                   Triangular
## VCE method                       NN
## 
## Number of Obs.                 1502         1498
## Order est. (p)                    1            1
## Order bias  (q)                   2            2
## Unique Obs.                     273          271

Out of 3,000 students, about 1,502 are below the cutoff and 1,498 above, but the effective number of observations used for estimation is smaller, focusing on students closest to the cutoff (≈273 below and 271 above). The method used to select the bandwidth is mserd (mean squared error–optimal), with a triangular kernel, which gives more weight to students nearer the cutoff. In short, this shows that the analysis focuses on students near the cutoff to get the most precise and unbiased estimate of the tutoring effect.

Density Test (Manipulation Check)

The code below checks whether students’ pretest scores are distributed smoothly around the cutoff, which is a critical assumption for the regression discontinuity design to produce valid causal estimates.

rd_density <- rddensity(staar_rd$eoy_2324, c = cutoff_score)

summary(rd_density)

## 
## Manipulation testing using local polynomial density estimation.
## 
## Number of obs =       3000
## Model =               unrestricted
## Kernel =              triangular
## BW method =           estimated
## VCE method =          jackknife
## 
## c = 3550              Left of c           Right of c          
## Number of obs         1502                1498                
## Eff. Number of obs    721                 875                 
## Order est. (p)        2                   2                   
## Order bias (q)        3                   3                   
## BW est. (h)           77.425              94.121              
## 
## Method                T                   P > |T|             
## Robust                0.7925              0.4281

## 
## P-values of binomial tests (H0: p=0.5).
## 
## Window Length / 2          <c     >=c    P>|T|
## 2.000                      22      34    0.1409
## 4.000                      37      53    0.1133
## 6.000                      56      78    0.0693
## 8.000                      70      98    0.0369
## 10.000                     93     122    0.0559
## 12.000                    119     144    0.1388
## 14.000                    135     161    0.1461
## 16.000                    158     186    0.1454
## 18.000                    177     207    0.1388
## 20.000                    200     229    0.1764

The density test shows that the distribution of pretest scores (eoy_2324) is smooth around the cutoff (3550). The robust test statistic is 0.79 with a p-value of 0.43, well above 0.05, indicating no evidence of manipulation. What we want is p-value greater than 0.05, meaning there is no statistically significant jump in the distribution of scores at the cutoff.

The binomial tests for different window lengths also show p-values mostly above 0.05, confirming that the number of students just below and above the cutoff is roughly balanced. This means students could not “game the cutoff” (they could not manipulate their scores to qualify for tutoring).

Overall, this supports the key RD assumption, so the observed jump in posttest scores can be confidently interpreted as the causal effect of the tutoring program.

Optional: Covariate Check

The code below checks whether other student characteristics are balanced across the cutoff. A non-significant effect indicates that the RD groups are comparable, supporting the validity of the estimated tutoring effect.

covariate <- rnorm(n)
rdrobust <- rdrobust(y = covariate, x = staar_rd$eoy_2324, c = cutoff_score)

summary(rdrobust)

## Sharp RD estimates using local polynomial regression.
## 
## Number of Obs.                 3000
## BW type                       mserd
## Kernel                   Triangular
## VCE method                       NN
## 
## Number of Obs.                 1502         1498
## Eff. Number of Obs.             649          677
## Order est. (p)                    1            1
## Order bias  (q)                   2            2
## BW est. (h)                  67.372       67.372
## BW bias (b)                 108.726      108.726
## rho (h/b)                     0.620        0.620
## Unique Obs.                     273          271
## 
## =====================================================================
##                    Point    Robust Inference
##                 Estimate         z     P>|z|      [ 95% C.I. ]       
## ---------------------------------------------------------------------
##      RD Effect    -0.042    -0.500     0.617    [-0.369 , 0.219]     
## =====================================================================

What we want: If the covariate does not jump at the cutoff, it confirms that students are comparable, and the observed jump in the posttest is likely caused by the tutoring program. Therefore, the estimate should be close to zero, meaning no jump at the cutoff. Likewise, p > 0.05 (not statistically significant) → no evidence of imbalance.

The covariate check shows that the RD estimate for this student-level variable is -0.042 with a p-value of 0.617, which is not statistically significant. The 95% confidence interval includes zero ([-0.369, 0.219]), indicating that the covariate is balanced across the cutoff. Students just below and just above the cutoff are similar in this characteristic, which supports the validity of the RD design and suggests that the jump in posttest scores is due to the tutoring program, not pre-existing differences.

Note: the same Regression Discontinuity analysis can be implemented in SPSS. For further guidance, material from Dr. Wuensch can be consulted for these purposes (see reference list below).

References

Trochim, W., Donnelly, J. P., & Arora, K. (2016). Research methods: The essential knowledge base. Delhi: Cengage Learning.
Matthews, M. S., Peters, S. J., & Housand, A. M. (2012). Regression discontinuity design in gifted and talented education research. Gifted Child Quarterly, 56(2), 105-112.
Wuensch, K. (2018). Using SPSS to Analyze Data From a Regression-Discontinuity Design.