Regression Discontinuity Design

Author

AS

1. Introduction

Without a random process that separates the treatment and control group, the treatment effect can be identified if the assignment to the treatment group follows a regression discontinuity design (RDD). This requires a (running) variable which, at a certain threshold, separates the observations into a treatment and control group.

2. Theory

There are two different variants of the RDD:

  1. sharp RDD:

    • the threshold separates the treatment and control group exactly
  2. fuzzy RDD:

    • the threshold influences the probability of being treated

    • this is in fact an instrumental variable approach (estimating a LATE)

The value of the outomce (Y) for individuals just below the threshold is the missing conterfactual outcome. It increases continuously with the cutoff variable, as opposed to the treatment.

2.1 Estimation methods

Three methods to estimate a RDD can be distinguished:

  1. Method 1:

    • select a subsample for which the value of the running variable is close to the threshold

    • problem: the smaller the sample, the larger the standard errors

  2. Method 2:

    • select a larger sample and estimate parametrically

    • problem: this depends on the functional form and polynomials

  3. Method 3:

    • select a subsample close to the threshold and estimate parametrically

    • extension: different functional forms on the left and right side of the cutoff

2.2 Advantage of RDD

With an RDD approach some assumptions can be tested. Individuals close to the threshold are nearly identical, except for characteristics which are affected by the treatment. Prior to the treatment, the outcome should not differ between the treatment and control group. The distribution of the variable which indicates the threshold should have no jumps around this cutoff value.

3. Replication

I am now replicating a study from Carpenter and Dobkin (2009). The data of their study is available here. They are estimating the effect of alcohol consumption on mortality by utilising the minimum drinking age within a regression discontinuity design.

Below I included a list of the required R packages for this tutorial.


Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
Loading required package: AER
Loading required package: car
Loading required package: carData

Attaching package: 'car'
The following object is masked from 'package:dplyr':

    recode
Loading required package: lmtest
Loading required package: zoo

Attaching package: 'zoo'
The following objects are masked from 'package:base':

    as.Date, as.Date.numeric
Loading required package: sandwich
Loading required package: survival
Loading required package: np
Nonparametric Kernel Methods for Mixed Datatypes (version 0.60-17)
[vignette("np_faq",package="np") provides answers to frequently asked questions]
[vignette("np",package="np") an overview]
[vignette("entropy_np",package="np") an overview of entropy-based methods]

Please consider citing R and rddtools,
citation()
citation("rddtools")

Attaching package: 'psych'
The following object is masked from 'package:car':

    logit
The following objects are masked from 'package:ggplot2':

    %+%, alpha
library(stargazer)

Please cite as: 
 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer 

3.1 The dataset

At first, I am reading the data with RStudio. The dataset contains aggregated values according to the respondents’ age.

remove(list=ls())

# install.packages("stevedata")
## devtools::install_github("svmiller/stevedata")
#data(package = "stevedata")
library(stevedata)     # for data 

We have a data frame with 50 observations on the following 19 variables.

agecell

a numeric

all

a numeric

allfitted

a numeric

internal

a numeric

internalfitted

a numeric

external

a numeric

externalfitted

a numeric

alcohol

a numeric

alcoholfitted

a numeric

homicide

a numeric

homicidefitted

a numeric

suicide

a numeric

suicidefitted

a numeric

mva

a numeric

mvafitted

a numeric

drugs

a numeric

drugsfitted

a numeric

externalother

a numeric

externalotherfitted

a numeric

?stevedata::mm_mlda
carpenter_dobkin_2009 <- stevedata::mm_mlda   


#describe(carpenter_dobkin_2009, skew = F, ranges = F,)
stargazer(as.data.frame(carpenter_dobkin_2009), type="text")

=====================================================
Statistic           N   Mean  St. Dev.  Min     Max  
-----------------------------------------------------
agecell             50 21.000  1.127   19.068 22.932 
all                 48 95.673  3.831   88.428 105.268
allfitted           50 95.803  3.286   91.706 102.892
internal            48 20.285  2.254   15.977 24.373 
internalfitted      50 20.281  1.995   16.738 24.044 
external            48 75.387  2.986   71.341 83.331 
externalfitted      50 75.522  2.270   73.158 81.784 
alcohol             48 1.257   0.350   0.639   2.519 
alcoholfitted       50 1.267   0.260   0.794   1.817 
homicide            48 16.912  0.730   14.948 18.411 
homicidefitted      50 16.953  0.453   16.261 17.762 
suicide             48 12.352  1.063   10.889 14.832 
suicidefitted       50 12.363  0.760   11.592 13.547 
mva                 48 31.623  2.385   26.855 36.385 
mvafitted           50 31.680  2.003   27.868 34.818 
drugs               48 4.250   0.616   3.202   5.565 
drugsfitted         50 4.255   0.521   3.449   5.130 
externalother       48 9.599   0.748   7.973  11.483 
externalotherfitted 50 9.610   0.465   8.388  10.353 
-----------------------------------------------------

Now, let us take a look at the threshold (= the minimum drinking age), which occurs at 21 years. There is a noticeable jump in the mortality rate after 21 years!

carpenter_dobkin_2009 %>% 
  ggplot(aes(x = agecell, y = all)) + 
  geom_point() +
  geom_vline(xintercept = 21, color = "red", size = 1, linetype = "dashed") + 
  annotate("text", x = 20.4, y = 105, label = "Minimum Drinking Age",
           size=4) +
  labs(y = "Mortality rate (per 100.000)",
       x = "Age (binned)")
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

3.2 Estimation: same slopes

The RDD can be estimated by OLS. The first regression applies the same slopes on both sides of the cutoff.

Model

At first, we have to compute a dummy variable (threshold), indicating whether an indidivual is below or above the cutoff. The dummy is equal to zero for observations below and equal to one for observations aboev the cutoff of 21 years. Then I am specifiying a linear model with function lm() to regress all deaths per 100.000 (all) on the threshold dummy and the respondents’ age which is centered around the threshold value of age (21 years). This is done with function I() by substracting the cutoff from each age bin.

lm_same_slope <- carpenter_dobkin_2009 %>% 
  mutate(threshold = ifelse(agecell >= 21, 1, 0)) %$% 
  lm(all ~ threshold + I(agecell - 21))

summary(lm_same_slope) 

Call:
lm(formula = all ~ threshold + I(agecell - 21))

Residuals:
    Min      1Q  Median      3Q     Max 
-5.0559 -1.8483  0.1149  1.4909  5.8043 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      91.8414     0.8050 114.083  < 2e-16 ***
threshold         7.6627     1.4403   5.320 3.15e-06 ***
I(agecell - 21)  -0.9747     0.6325  -1.541     0.13    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.493 on 45 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.5946,    Adjusted R-squared:  0.5765 
F-statistic: 32.99 on 2 and 45 DF,  p-value: 1.508e-09

The coefficient of the dummy avariable threshold is the average treatment effect. On average, the mortality rate per 100.000 for individuals reaching the minimum drinking age is 7.66 points higher.

Model via rddtools

There is an alternative approach by using R package rddtools which contains various functions related to applying the RDD. Within function rdd_reg_lm() I am using the argument slope = "same" to achieve the same result with the previous approach.

rdd_data(y = carpenter_dobkin_2009$all, 
         x = carpenter_dobkin_2009$agecell, 
         cutpoint = 21) %>% 
  rdd_reg_lm(slope = "same") %>% 
  summary()

Call:
lm(formula = y ~ ., data = dat_step1, weights = weights)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.0559 -1.8483  0.1149  1.4909  5.8043 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  91.8414     0.8050 114.083  < 2e-16 ***
D             7.6627     1.4403   5.320 3.15e-06 ***
x            -0.9747     0.6325  -1.541     0.13    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.493 on 45 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.5946,    Adjusted R-squared:  0.5765 
F-statistic: 32.99 on 2 and 45 DF,  p-value: 1.508e-09

Same coefficient !

Scatterplot

With a scatterplot I draw the fitted line of the regression, which shows the same slope at both sides of the threshold.

carpenter_dobkin_2009 %>%
  select(agecell, all) %>%
  mutate(threshold = as.factor(ifelse(agecell >= 21, 1, 0))) %>%
  ggplot(aes(x = agecell, y = all)) +
  geom_point(aes(color = threshold)) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_color_brewer(palette = "Accent") +
  guides(color = FALSE) +
  geom_vline(xintercept = 21, color = "red",
    size = 1, linetype = "dashed") +
  labs(y = "Mortality rate (per 100.000)",
       x = "Age (binned)")
Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

3.3 Estimation: different slopes

The second regression applies different slopes on both sides of the cutoff.

Model

With function lm() this can be achieved by specifying an interaction between the threshold dummy and age which is centered around the cutoff value.

lm_different_slope <- carpenter_dobkin_2009 %>%
  mutate(threshold = ifelse(agecell >= 21, 1, 0)) %$%
  lm(all ~ threshold + I(agecell - 21) + threshold:I(agecell - 21))

summary(lm_different_slope)

Call:
lm(formula = all ~ threshold + I(agecell - 21) + threshold:I(agecell - 
    21))

Residuals:
   Min     1Q Median     3Q    Max 
-4.368 -1.787  0.117  1.108  5.341 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                93.6184     0.9325 100.399  < 2e-16 ***
threshold                   7.6627     1.3187   5.811  6.4e-07 ***
I(agecell - 21)             0.8270     0.8189   1.010  0.31809    
threshold:I(agecell - 21)  -3.6034     1.1581  -3.111  0.00327 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.283 on 44 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.6677,    Adjusted R-squared:  0.645 
F-statistic: 29.47 on 3 and 44 DF,  p-value: 1.325e-10

This approach does not alter the interpretation of the treatment effect! On average, the mortality rate per 100.000 for individuals reaching the minimum drinking age is 7.66 points higher.

Model via rddtools

Again, we can use R package rddtools to get the job done. Now, the argument slope = "separate" has to be used inside function rdd_reg_lm().

rdd_data(y = carpenter_dobkin_2009$all, 
         x = carpenter_dobkin_2009$agecell, 
         cutpoint = 21) %>% 
  rdd_reg_lm(slope = "separate") %>% 
  summary()

Call:
lm(formula = y ~ ., data = dat_step1, weights = weights)

Residuals:
   Min     1Q Median     3Q    Max 
-4.368 -1.787  0.117  1.108  5.341 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  93.6184     0.9325 100.399  < 2e-16 ***
D             7.6627     1.3187   5.811  6.4e-07 ***
x             0.8270     0.8189   1.010  0.31809    
x_right      -3.6034     1.1581  -3.111  0.00327 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.283 on 44 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.6677,    Adjusted R-squared:  0.645 
F-statistic: 29.47 on 3 and 44 DF,  p-value: 1.325e-10

Scatterplot

Let us take at the different slopes with a scatterplot. The slope on the right side of the cutoff is negative while it is positive on the left side.

carpenter_dobkin_2009 %>%
  select(agecell, all) %>%
  mutate(threshold = as.factor(ifelse(agecell >= 21, 1, 0))) %>%
  ggplot(aes(x = agecell, y = all, color = threshold)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_color_brewer(palette = "Accent") +
  guides(color = FALSE) +
  geom_vline(xintercept = 21, color = "red",
    size = 1, linetype = "dashed") +
  labs(y = "Mortality rate (per 100.000)",
       x = "Age (binned)")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

3.4 Modifying the functional form

Particular attentions should be paid to the specification of the functional form when applying a RDD.

Model

Below, I am modelling a quadratic relationship between age and the mortality per 100.000 (all). The quadratic term enters the formula via function I(). As in the previous section, different slopes around the cutoff are used.

lm_quadratic <- carpenter_dobkin_2009 %>% 
mutate(threshold = ifelse(agecell >= 21, 1, 0)) %$% 
  lm(all ~ threshold + I(agecell - 21) + I((agecell -21)^2) + threshold:I(agecell - 21) +
       threshold:I((agecell - 21)^2))

summary(lm_quadratic)

Call:
lm(formula = all ~ threshold + I(agecell - 21) + I((agecell - 
    21)^2) + threshold:I(agecell - 21) + threshold:I((agecell - 
    21)^2))

Residuals:
    Min      1Q  Median      3Q     Max 
-4.3343 -1.3946  0.1849  1.2848  5.0817 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    93.0729     1.4038  66.301  < 2e-16 ***
threshold                       9.5478     1.9853   4.809 1.97e-05 ***
I(agecell - 21)                -0.8306     3.2901  -0.252    0.802    
I((agecell - 21)^2)            -0.8403     1.6153  -0.520    0.606    
threshold:I(agecell - 21)      -6.0170     4.6529  -1.293    0.203    
threshold:I((agecell - 21)^2)   2.9042     2.2843   1.271    0.211    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.285 on 42 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.6821,    Adjusted R-squared:  0.6442 
F-statistic: 18.02 on 5 and 42 DF,  p-value: 1.624e-09

On average, the mortality rate per 100.000 for individuals reaching the minimum drinking age is now 9.55 points higher.

Model via rddtools

In function rdd_reg_lm() I have to modify the argument order = to specify a quadratic term (which is a second order polynomial). It is quite easy to use higher order polynomials with package rddtools compared to the traditional approach with function lm().

rdd_data(y = carpenter_dobkin_2009$all, 
         x = carpenter_dobkin_2009$agecell, 
         cutpoint = 21) %>% 
  rdd_reg_lm(slope = "separate", order = 2) %>% 
  summary()

Call:
lm(formula = y ~ ., data = dat_step1, weights = weights)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.3343 -1.3946  0.1849  1.2848  5.0817 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  93.0729     1.4038  66.301  < 2e-16 ***
D             9.5478     1.9853   4.809 1.97e-05 ***
x            -0.8306     3.2901  -0.252    0.802    
`x^2`        -0.8403     1.6153  -0.520    0.606    
x_right      -6.0170     4.6529  -1.293    0.203    
`x^2_right`   2.9042     2.2843   1.271    0.211    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.285 on 42 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.6821,    Adjusted R-squared:  0.6442 
F-statistic: 18.02 on 5 and 42 DF,  p-value: 1.624e-09

Scatterplot

On the right side of the cutoff, this model seems to fit the data better!

carpenter_dobkin_2009 %>%
  select(agecell, all) %>%
  mutate(threshold = as.factor(ifelse(agecell >= 21, 1, 0))) %>%
  ggplot(aes(x = agecell, y = all, color = threshold)) +
  geom_point() +
  geom_smooth(method = "lm",
              formula = y ~ x + I(x ^ 2),
              se = FALSE) +
  scale_color_brewer(palette = "Accent") +
  guides(color = FALSE) +
  geom_vline(xintercept = 21, color = "red",
    size = 1, linetype = "dashed") +
  labs(y = "Mortality rate (per 100.000)",
       x = "Age (binned)")
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

3.5 Sensitivity analysis

Futhermore, it is advisable to check the sensitivity of the results with respect to limiting the sample size.

Model

Instead of using the full range of age, I am only using respondents aged between 20 and 22 years. Also, I am removing the quadratic term.

lm_sensitivity <- carpenter_dobkin_2009 %>%
  filter(agecell >= 20 & agecell <= 22) %>%
  mutate(threshold = ifelse(agecell >= 21, 1, 0)) %$%
  lm(all ~ threshold + I(agecell - 21) + threshold:I(agecell - 21))

summary(lm_sensitivity)

Call:
lm(formula = all ~ threshold + I(agecell - 21) + threshold:I(agecell - 
    21))

Residuals:
    Min      1Q  Median      3Q     Max 
-4.3038 -0.9132 -0.1746  1.1758  4.3307 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 92.524      1.370  67.550  < 2e-16 ***
threshold                    9.753      1.937   5.035 6.34e-05 ***
I(agecell - 21)             -1.612      2.407  -0.669    0.511    
threshold:I(agecell - 21)   -3.289      3.405  -0.966    0.346    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.366 on 20 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.7161,    Adjusted R-squared:  0.6735 
F-statistic: 16.82 on 3 and 20 DF,  p-value: 1.083e-05

This result is pretty similar to the previous approach with the quadratic approach. On average, the mortality rate per 100.000 for individuals reaching the minimum drinking age is 9.75 points higher.

Scatterplot

carpenter_dobkin_2009 %>%
  filter(agecell >= 20 & agecell <= 22) %>%
  select(agecell, all) %>%
  mutate(threshold = as.factor(ifelse(agecell >= 21, 1, 0))) %>%
  ggplot(aes(x = agecell, y = all, color = threshold)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_color_brewer(palette = "Accent") +
  guides(color = FALSE) +
  geom_vline(xintercept = 21, color = "red",
             size = 1, linetype = "dashed") +
  labs(y = "Mortality rate (per 100.000)",
       x = "Age (binned)")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

References

  1. Carpenter, Christopher, and Carlos Dobkin. 2009. “The Effect of Alcohol Consumption on Mortality: Regression Discontinuity Evidence from the Minimum Drinking Age.” American Economic Journal: Applied Economics 1 (1): 164–82. https://doi.org/10.1257/app.1.1.164.

  2. Data - https://www.rdocumentation.org/packages/stevedata/versions/0.6.0

  3. https://rpubs.com/phle/r_tutorial_regression_discontinuity_design

  4. Ludwig & Miller 2007

  5. Republicans spend more in counties just won