Without a random process that separates the treatment and control group, the treatment effect can be identified if the assignment to the treatment group follows a regression discontinuity design (RDD). This requires a (running) variable which, at a certain threshold, separates the observations into a treatment and control group.
2. Theory
There are two different variants of the RDD:
Sharp RDD:
the threshold separates the treatment and control group exactly
Fuzzy RDD:
the threshold influences the probability of being treated
this is in fact an instrumental variable approach (estimating a LATE)
The value of the outomce (Y) for individuals just below the threshold is the missing conterfactual outcome. It increases continuously with the cutoff variable, as opposed to the treatment.
2.1 Estimation methods
Three methods to estimate a RDD can be distinguished:
Method 1:
Select a subsample for which the value of the running variable is close to the threshold
Problem: the smaller the sample, the larger the standard errors
Method 2:
Select a larger sample and estimate parametrically
Problem: this depends on the functional form and polynomials
Method 3:
Select a subsample close to the threshold and estimate parametrically
Extension: different functional forms on the left and right side of the cutoff
2.2 Advantage of RDD
With an RDD approach some assumptions can be tested. Individuals close to the threshold are nearly identical, except for characteristics which are affected by the treatment. Prior to the treatment, the outcome should not differ between the treatment and control group. The distribution of the variable which indicates the threshold should have no jumps around this cutoff value.
3. Replication
I am now replicating a study from Carpenter and Dobkin (2009). The data of their study is available here. They are estimating the effect of alcohol consumption on mortality by utilising the minimum drinking age within a regression discontinuity design.
Below I included a list of the required R packages for this tutorial.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(rddtools)
Loading required package: AER
Loading required package: car
Loading required package: carData
Attaching package: 'car'
The following object is masked from 'package:dplyr':
recode
The following object is masked from 'package:purrr':
some
Loading required package: lmtest
Loading required package: zoo
Attaching package: 'zoo'
The following objects are masked from 'package:base':
as.Date, as.Date.numeric
Loading required package: sandwich
Loading required package: survival
Loading required package: np
Nonparametric Kernel Methods for Mixed Datatypes (version 0.60-17)
[vignette("np_faq",package="np") provides answers to frequently asked questions]
[vignette("np",package="np") an overview]
[vignette("entropy_np",package="np") an overview of entropy-based methods]
Please consider citing R and rddtools,
citation()
citation("rddtools")
library(stargazer)
Please cite as:
Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
3.1 The dataset
At first, I am reading the data with RStudio. The dataset contains aggregated values according to the respondents’ age.
remove(list=ls())# install.packages("stevedata")## devtools::install_github("svmiller/stevedata")# data(package = "stevedata")library(stevedata) # for data ?stevedata::mm_mldacarpenter_dobkin_2009 <-as.data.frame(stevedata::mm_mlda)
We have a data frame with 50 observations on the following 19 variables.
agecell: a numeric for age.
all: a numeric for all alcohol related deaths as reported on the death certificate. When more than one cause of death was included on the death certificate, mutually exclusive categories are created in the following order: homicide, suicide, MVA, mention of alcohol, mention of drug use, and other external causes. This variable is decomposed into two categories: deaths due to internal causes and deaths due to external causes. International Classification of Diseases, Ninth and Tenth revisions is used. A full list of the cause-of-death codes is provided in Web Appendix C.
internal: a numeric for alcohol related internal cause of death (as reported in the death certificate). Alcohol may worsen or exacerbate preexisting medical conditions that result in internal causes of death.
It includes: alcoholic psychoses, alcohol dependence syndrome, nondependent abuse of alcohol, alcoholic neuropathy, alcoholic cardiomyopathy, alcoholic gastritis, alcoholic fatty liver, acute alcoholic hepatitis, alcoholic cirrhosis of the liver, other alcoholic-related liver toxicity, and overdose by ethyl alcohol. The use of “alcohol-related” therefore refers to a death for which there was a strong likelihood that alcohol played an important and direct role in the outcome.
external: a numeric for alcohol related external cause of death (as reported in the death certificate). These are further classified into homicides, suicides, motor vehicle accidents, deaths with a mention of alcohol, deaths with a mention of drug use, and deaths due to other external causes.
alcohol: a numeric for alcohol overdose (binge drinking) or alcohol poisoning deaths.
homicide: a numeric
suicide: a numeric
mva(motor vehicle accident): a numeric
drugs: a numeric
externalother: a numeric for deaths due to “other external causes” include mortality from falls, burns, and drownings, all of which are strongly associated with alcohol consumption.
Now, let us take a look at the threshold (= the minimum drinking age), which occurs at 21 years. There is a noticeable jump in the mortality rate after 21 years!
carpenter_dobkin_2009 |>ggplot(aes(x = agecell, y = all)) +geom_point() +geom_vline(xintercept =21, color ="red", size =1, linetype ="dashed") +annotate("text", x =20.4, y =105, label ="Minimum Drinking Age",size=4) +labs(y ="Mortality rate (per 100.000)",x ="Age (binned)")
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
3.2 Estimation: same slopes
The RDD can be estimated by OLS. The first regression applies the same slopes on both sides of the cutoff.
Model
At first, we have to compute a dummy variable (threshold), indicating whether an individual is below or above the cutoff. The dummy is equal to zero for observations below and equal to one for observations above the cutoff of 21 years. Then I am specifying a linear model with function lm() to regress all deaths per 100.000 (all) on the threshold dummy and the respondents’ age which is centered around the threshold value of age (21 years). This is done with function I() by subtracting the cutoff from each age bin.
Call:
lm(formula = all ~ threshold + I(agecell - 21), data = mutate(carpenter_dobkin_2009,
threshold = ifelse(test = agecell >= 21, yes = 1, no = 0)))
Residuals:
Min 1Q Median 3Q Max
-5.0559 -1.8483 0.1149 1.4909 5.8043
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 91.8414 0.8050 114.083 < 2e-16 ***
threshold 7.6627 1.4403 5.320 3.15e-06 ***
I(agecell - 21) -0.9747 0.6325 -1.541 0.13
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.493 on 45 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.5946, Adjusted R-squared: 0.5765
F-statistic: 32.99 on 2 and 45 DF, p-value: 1.508e-09
The coefficient of the dummy available threshold is the average treatment effect. On average, the mortality rate per 100.000 for individuals reaching the minimum drinking age is 7.66 points higher.
The coefficient for threshold represents the discontinuous jump or treatment effect at the threshold. It captures the immediate effect of crossing the threshold on the dependent variable, all. This coefficient is the estimated causal effect of being on one side of the age threshold versus the other.
The coefficient for I(agecell - 21) represents the slope or trend of the dependent variable with respect to age, allowing for a linear effect of age on the outcome. This term controls for the gradual effect of age on the outcome around the threshold and helps separate the continuous trend in age from the discrete jump caused by crossing the threshold.
Model via rddtools
There is an alternative approach by using R package rddtools which contains various functions related to applying the RDD. Within function rdd_reg_lm() I am using the argument slope = "same" to achieve the same result with the previous approach.
Call:
lm(formula = all ~ threshold + I(agecell - 21) + threshold:I(agecell -
21), data = mutate(carpenter_dobkin_2009, threshold = ifelse(agecell >=
21, 1, 0)))
Residuals:
Min 1Q Median 3Q Max
-4.368 -1.787 0.117 1.108 5.341
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 93.6184 0.9325 100.399 < 2e-16 ***
threshold 7.6627 1.3187 5.811 6.4e-07 ***
I(agecell - 21) 0.8270 0.8189 1.010 0.31809
threshold:I(agecell - 21) -3.6034 1.1581 -3.111 0.00327 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.283 on 44 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.6677, Adjusted R-squared: 0.645
F-statistic: 29.47 on 3 and 44 DF, p-value: 1.325e-10
This approach does not alter the interpretation of the treatment effect! On average, the mortality rate per 100.000 for individuals reaching the minimum drinking age is 7.66 points higher.
Model via rddtools
Again, we can use R package rddtools to get the job done. Now, the argument slope = "separate" has to be used inside function rdd_reg_lm().
Call:
lm(formula = y ~ ., data = dat_step1, weights = weights)
Residuals:
Min 1Q Median 3Q Max
-4.368 -1.787 0.117 1.108 5.341
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 93.6184 0.9325 100.399 < 2e-16 ***
D 7.6627 1.3187 5.811 6.4e-07 ***
x 0.8270 0.8189 1.010 0.31809
x_right -3.6034 1.1581 -3.111 0.00327 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.283 on 44 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.6677, Adjusted R-squared: 0.645
F-statistic: 29.47 on 3 and 44 DF, p-value: 1.325e-10
Scatter plot
Let us take at the different slopes with a scatter plot. The slope on the right side of the cutoff is negative while it is positive on the left side.
carpenter_dobkin_2009 |>select(agecell, all) |>mutate(threshold =as.factor(ifelse(agecell >=21, 1, 0))) |>ggplot(aes(x = agecell, y = all, color = threshold)) +geom_point() +geom_smooth(method ="lm", se =FALSE) +scale_color_brewer(palette ="Accent") +guides(color =FALSE) +geom_vline(xintercept =21, color ="red",size =1, linetype ="dashed") +labs(y ="Mortality rate (per 100.000)",x ="Age (binned)")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
3.4 Modifying the functional form
Particular attentions should be paid to the specification of the functional form when applying a RDD.
Model
Below, I am modelling a quadratic relationship between age and the mortality per 100.000 (all). The quadratic term enters the formula via function I(). As in the previous section, different slopes around the cutoff are used.
Call:
lm(formula = all ~ threshold + I(agecell - 21) + I((agecell -
21)^2) + threshold:I(agecell - 21) + threshold:I((agecell -
21)^2), data = mutate(carpenter_dobkin_2009, threshold = ifelse(agecell >=
21, 1, 0)))
Residuals:
Min 1Q Median 3Q Max
-4.3343 -1.3946 0.1849 1.2848 5.0817
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 93.0729 1.4038 66.301 < 2e-16 ***
threshold 9.5478 1.9853 4.809 1.97e-05 ***
I(agecell - 21) -0.8306 3.2901 -0.252 0.802
I((agecell - 21)^2) -0.8403 1.6153 -0.520 0.606
threshold:I(agecell - 21) -6.0170 4.6529 -1.293 0.203
threshold:I((agecell - 21)^2) 2.9042 2.2843 1.271 0.211
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.285 on 42 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.6821, Adjusted R-squared: 0.6442
F-statistic: 18.02 on 5 and 42 DF, p-value: 1.624e-09
On average, the mortality rate per 100.000 for individuals reaching the minimum drinking age is now 9.55 points higher.
Model via rddtools
In function rdd_reg_lm() I have to modify the argument order = to specify a quadratic term (which is a second order polynomial). It is quite easy to use higher order polynomials with package rddtools compared to the traditional approach with function lm().
rdd_data(y = carpenter_dobkin_2009$all, x = carpenter_dobkin_2009$agecell, cutpoint =21) |>rdd_reg_lm(slope ="separate", order =2) |>summary()
Call:
lm(formula = y ~ ., data = dat_step1, weights = weights)
Residuals:
Min 1Q Median 3Q Max
-4.3343 -1.3946 0.1849 1.2848 5.0817
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 93.0729 1.4038 66.301 < 2e-16 ***
D 9.5478 1.9853 4.809 1.97e-05 ***
x -0.8306 3.2901 -0.252 0.802
`x^2` -0.8403 1.6153 -0.520 0.606
x_right -6.0170 4.6529 -1.293 0.203
`x^2_right` 2.9042 2.2843 1.271 0.211
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.285 on 42 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.6821, Adjusted R-squared: 0.6442
F-statistic: 18.02 on 5 and 42 DF, p-value: 1.624e-09
Scatterplot
On the right side of the cutoff, this model seems to fit the data better!
carpenter_dobkin_2009 |>select(agecell, all) |>mutate(threshold =as.factor(ifelse(agecell >=21, 1, 0))) |>ggplot(aes(x = agecell, y = all, color = threshold)) +geom_point() +geom_smooth(method ="lm",formula = y ~ x +I(x ^2),se =FALSE) +scale_color_brewer(palette ="Accent") +guides(color =FALSE) +geom_vline(xintercept =21, color ="red",size =1, linetype ="dashed") +labs(y ="Mortality rate (per 100.000)",x ="Age (binned)")
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
3.5 Sensitivity analysis
Furthermore, it is advisable to check the sensitivity of the results with respect to limiting the sample size.
Model
Instead of using the full range of age, I am only using respondents aged between 20 and 22 years. Also, I am removing the quadratic term.
Call:
lm(formula = all ~ threshold + I(agecell - 21) + threshold:I(agecell -
21), data = mutate(filter(carpenter_dobkin_2009, agecell >=
20 & agecell <= 22), threshold = ifelse(agecell >= 21, 1,
0)))
Residuals:
Min 1Q Median 3Q Max
-4.3038 -0.9132 -0.1746 1.1758 4.3307
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 92.524 1.370 67.550 < 2e-16 ***
threshold 9.753 1.937 5.035 6.34e-05 ***
I(agecell - 21) -1.612 2.407 -0.669 0.511
threshold:I(agecell - 21) -3.289 3.405 -0.966 0.346
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.366 on 20 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.7161, Adjusted R-squared: 0.6735
F-statistic: 16.82 on 3 and 20 DF, p-value: 1.083e-05
This result is pretty similar to the previous approach with the quadratic approach. On average, the mortality rate per 100.000 for individuals reaching the minimum drinking age is 9.75 points higher.
Scatter plot
carpenter_dobkin_2009 |>filter(agecell >=20& agecell <=22) |>select(agecell, all) |>mutate(threshold =as.factor(ifelse(agecell >=21, 1, 0))) |>ggplot(aes(x = agecell, y = all, color = threshold)) +geom_point() +geom_smooth(method ="lm", se =FALSE) +scale_color_brewer(palette ="Accent") +guides(color =FALSE) +geom_vline(xintercept =21, color ="red",size =1, linetype ="dashed") +labs(y ="Mortality rate (per 100.000)",x ="Age (binned)")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
References
Carpenter, Christopher, and Carlos Dobkin. 2009. “The Effect of Alcohol Consumption on Mortality: Regression Discontinuity Evidence from the Minimum Drinking Age.” American Economic Journal: Applied Economics 1 (1): 164–82. https://doi.org/10.1257/app.1.1.164.