Without a random process that separates the treatment and control group, the treatment effect can be identified if the assignment to the treatment group follows a regression discontinuity design (RDD). This requires a (running) variable which, at a certain threshold, separates the observations into a treatment and control group.
There are two different variants of the RDD:
sharp RDD:
fuzzy RDD:
the threshold influences the probability of being treated
this is in fact an instrumental variable approach (estimating a LATE)
The value of the outomce (Y) for individuals just below the threshold is the missing conterfactual outcome. It increases continuously with the cutoff variable, as opposed to the treatment.
Three methods to estimate a RDD can be distinguished:
Method 1:
select a subsample for which the value of the running variable is close to the threshold
problem: the smaller the sample, the larger the standard errors
Method 2:
select a larger sample and estimate parametrically
problem: this depends on the functional form and polynomials
Method 3:
select a subsample close to the threshold and estimate parametrically
extension: different functional forms on the left and right side of the cutoff
With an RDD approach some assumptions can be tested. Individuals close to the threshold are nearly identical, except for characteristics which are affected by the treatment. Prior to the treatment, the outcome should not differ between the treatment and control group. The distribution of the variable which indicates the threshold should have no jumps around this cutoff value.
I am now replicating a study from Carpenter and Dobkin (2009). The data of their study is available here. They are estimating the effect of alcohol consumption on mortality by utilising the minimum drinking age within a regression discontinuity design.
Below I included a list of the required R packages for this tutorial.
At first, I am reading the data with RStudio. The dataset contains aggregated values according to the respondents' age.
remove(list=ls())
# install.packages("stevedata")
## devtools::install_github("svmiller/stevedata")
#data(package = "stevedata")
library(stevedata) # for data
We have a data frame with 50 observations on the following 19 variables.
agecella numeric
alla numeric
allfitteda numeric
internala numeric
internalfitteda numeric
externala numeric
externalfitteda numeric
alcohola numeric
alcoholfitteda numeric
homicidea numeric
homicidefitteda numeric
suicidea numeric
suicidefitteda numeric
mvaa numeric
mvafitteda numeric
drugsa numeric
drugsfitteda numeric
externalothera numeric
externalotherfitteda numeric
?stevedata::mm_mlda
carpenter_dobkin_2009 <- stevedata::mm_mlda
#describe(carpenter_dobkin_2009, skew = F, ranges = F,)
stargazer(as.data.frame(carpenter_dobkin_2009), type="text")
##
## =====================================================
## Statistic N Mean St. Dev. Min Max
## -----------------------------------------------------
## agecell 50 21.000 1.127 19.068 22.932
## all 48 95.673 3.831 88.428 105.268
## allfitted 50 95.803 3.286 91.706 102.892
## internal 48 20.285 2.254 15.977 24.373
## internalfitted 50 20.281 1.995 16.738 24.044
## external 48 75.387 2.986 71.341 83.331
## externalfitted 50 75.522 2.270 73.158 81.784
## alcohol 48 1.257 0.350 0.639 2.519
## alcoholfitted 50 1.267 0.260 0.794 1.817
## homicide 48 16.912 0.730 14.948 18.411
## homicidefitted 50 16.953 0.453 16.261 17.762
## suicide 48 12.352 1.063 10.889 14.832
## suicidefitted 50 12.363 0.760 11.592 13.547
## mva 48 31.623 2.385 26.855 36.385
## mvafitted 50 31.680 2.003 27.868 34.818
## drugs 48 4.250 0.616 3.202 5.565
## drugsfitted 50 4.255 0.521 3.449 5.130
## externalother 48 9.599 0.748 7.973 11.483
## externalotherfitted 50 9.610 0.465 8.388 10.353
## -----------------------------------------------------
Now, let us take a look at the threshold (= the minimum drinking age), which occurs at 21 years. There is a noticeable jump in the mortality rate after 21 years!
carpenter_dobkin_2009 %>%
ggplot(aes(x = agecell, y = all)) +
geom_point() +
geom_vline(xintercept = 21, color = "red", size = 1, linetype = "dashed") +
annotate("text", x = 20.4, y = 105, label = "Minimum Drinking Age",
size=4) +
labs(y = "Mortality rate (per 100.000)",
x = "Age (binned)")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 2 rows containing missing values (`geom_point()`).
The RDD can be estimated by OLS. The first regression applies the same slopes on both sides of the cutoff.
At first, we have to compute a dummy variable
(threshold), indicating whether an indidivual is below or
above the cutoff. The dummy is equal to zero for observations below and
equal to one for observations aboev the cutoff of 21 years. Then I am
specifiying a linear model with function lm() to regress
all deaths per 100.000 (all) on the threshold
dummy and the respondents' age which is centered around the
threshold value of age (21 years). This is done with
function I() by substracting the cutoff from each age
bin.
lm_same_slope <- carpenter_dobkin_2009 %>%
mutate(threshold = ifelse(agecell >= 21, 1, 0)) %$%
lm(all ~ threshold + I(agecell - 21))
summary(lm_same_slope)
##
## Call:
## lm(formula = all ~ threshold + I(agecell - 21))
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0559 -1.8483 0.1149 1.4909 5.8043
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.8414 0.8050 114.083 < 2e-16 ***
## threshold 7.6627 1.4403 5.320 3.15e-06 ***
## I(agecell - 21) -0.9747 0.6325 -1.541 0.13
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.493 on 45 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.5946, Adjusted R-squared: 0.5765
## F-statistic: 32.99 on 2 and 45 DF, p-value: 1.508e-09
The coefficient of the dummy avariable threshold is the
average treatment effect. On average, the mortality rate per 100.000 for
individuals reaching the minimum drinking age is 7.66
points higher.
There is an alternative approach by using R package
rddtools which contains various functions related to
applying the RDD. Within function rdd_reg_lm() I am using
the argument slope = "same" to achieve the same result with
the previous approach.
rdd_data(y = carpenter_dobkin_2009$all,
x = carpenter_dobkin_2009$agecell,
cutpoint = 21) %>%
rdd_reg_lm(slope = "same") %>%
summary()
##
## Call:
## lm(formula = y ~ ., data = dat_step1, weights = weights)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0559 -1.8483 0.1149 1.4909 5.8043
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.8414 0.8050 114.083 < 2e-16 ***
## D 7.6627 1.4403 5.320 3.15e-06 ***
## x -0.9747 0.6325 -1.541 0.13
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.493 on 45 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.5946, Adjusted R-squared: 0.5765
## F-statistic: 32.99 on 2 and 45 DF, p-value: 1.508e-09
Same coefficient !
With a scatterplot I draw the fitted line of the regression, which shows the same slope at both sides of the threshold.
carpenter_dobkin_2009 %>%
select(agecell, all) %>%
mutate(threshold = as.factor(ifelse(agecell >= 21, 1, 0))) %>%
ggplot(aes(x = agecell, y = all)) +
geom_point(aes(color = threshold)) +
geom_smooth(method = "lm", se = FALSE) +
scale_color_brewer(palette = "Accent") +
guides(color = FALSE) +
geom_vline(xintercept = 21, color = "red",
size = 1, linetype = "dashed") +
labs(y = "Mortality rate (per 100.000)",
x = "Age (binned)")
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).
The second regression applies different slopes on both sides of the cutoff.
With function lm() this can be achieved by specifying an
interaction between the threshold dummy and
age which is centered around the cutoff value.
lm_different_slope <- carpenter_dobkin_2009 %>%
mutate(threshold = ifelse(agecell >= 21, 1, 0)) %$%
lm(all ~ threshold + I(agecell - 21) + threshold:I(agecell - 21))
summary(lm_different_slope)
##
## Call:
## lm(formula = all ~ threshold + I(agecell - 21) + threshold:I(agecell -
## 21))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.368 -1.787 0.117 1.108 5.341
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 93.6184 0.9325 100.399 < 2e-16 ***
## threshold 7.6627 1.3187 5.811 6.4e-07 ***
## I(agecell - 21) 0.8270 0.8189 1.010 0.31809
## threshold:I(agecell - 21) -3.6034 1.1581 -3.111 0.00327 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.283 on 44 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.6677, Adjusted R-squared: 0.645
## F-statistic: 29.47 on 3 and 44 DF, p-value: 1.325e-10
This approach does not alter the interpretation of the treatment effect! On average, the mortality rate per 100.000 for individuals reaching the minimum drinking age is 7.66 points higher.
Again, we can use R package rddtools to get the job
done. Now, the argument slope = "separate" has to be used
inside function rdd_reg_lm().
rdd_data(y = carpenter_dobkin_2009$all,
x = carpenter_dobkin_2009$agecell,
cutpoint = 21) %>%
rdd_reg_lm(slope = "separate") %>%
summary()
##
## Call:
## lm(formula = y ~ ., data = dat_step1, weights = weights)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.368 -1.787 0.117 1.108 5.341
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 93.6184 0.9325 100.399 < 2e-16 ***
## D 7.6627 1.3187 5.811 6.4e-07 ***
## x 0.8270 0.8189 1.010 0.31809
## x_right -3.6034 1.1581 -3.111 0.00327 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.283 on 44 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.6677, Adjusted R-squared: 0.645
## F-statistic: 29.47 on 3 and 44 DF, p-value: 1.325e-10
Let us take at the different slopes with a scatterplot. The slope on the right side of the cutoff is negative while it is positive on the left side.
carpenter_dobkin_2009 %>%
select(agecell, all) %>%
mutate(threshold = as.factor(ifelse(agecell >= 21, 1, 0))) %>%
ggplot(aes(x = agecell, y = all, color = threshold)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
scale_color_brewer(palette = "Accent") +
guides(color = FALSE) +
geom_vline(xintercept = 21, color = "red",
size = 1, linetype = "dashed") +
labs(y = "Mortality rate (per 100.000)",
x = "Age (binned)")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).
Particular attentions should be paid to the specification of the functional form when applying a RDD.
Below, I am modelling a quadratic relationship between
age and the mortality per 100.000 (all). The
quadratic term enters the formula via function I(). As in
the previous section, different slopes around the cutoff are used.
lm_quadratic <- carpenter_dobkin_2009 %>%
mutate(threshold = ifelse(agecell >= 21, 1, 0)) %$%
lm(all ~ threshold + I(agecell - 21) + I((agecell -21)^2) + threshold:I(agecell - 21) +
threshold:I((agecell - 21)^2))
summary(lm_quadratic)
##
## Call:
## lm(formula = all ~ threshold + I(agecell - 21) + I((agecell -
## 21)^2) + threshold:I(agecell - 21) + threshold:I((agecell -
## 21)^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3343 -1.3946 0.1849 1.2848 5.0817
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 93.0729 1.4038 66.301 < 2e-16 ***
## threshold 9.5478 1.9853 4.809 1.97e-05 ***
## I(agecell - 21) -0.8306 3.2901 -0.252 0.802
## I((agecell - 21)^2) -0.8403 1.6153 -0.520 0.606
## threshold:I(agecell - 21) -6.0170 4.6529 -1.293 0.203
## threshold:I((agecell - 21)^2) 2.9042 2.2843 1.271 0.211
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.285 on 42 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.6821, Adjusted R-squared: 0.6442
## F-statistic: 18.02 on 5 and 42 DF, p-value: 1.624e-09
On average, the mortality rate per 100.000 for individuals reaching the minimum drinking age is now 9.55 points higher.
In function rdd_reg_lm() I have to modify the argument
order = to specify a quadratic term (which is a second
order polynomial). It is quite easy to use higher order polynomials with
package rddtools compared to the traditional approach with
function lm().
rdd_data(y = carpenter_dobkin_2009$all,
x = carpenter_dobkin_2009$agecell,
cutpoint = 21) %>%
rdd_reg_lm(slope = "separate", order = 2) %>%
summary()
##
## Call:
## lm(formula = y ~ ., data = dat_step1, weights = weights)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3343 -1.3946 0.1849 1.2848 5.0817
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 93.0729 1.4038 66.301 < 2e-16 ***
## D 9.5478 1.9853 4.809 1.97e-05 ***
## x -0.8306 3.2901 -0.252 0.802
## `x^2` -0.8403 1.6153 -0.520 0.606
## x_right -6.0170 4.6529 -1.293 0.203
## `x^2_right` 2.9042 2.2843 1.271 0.211
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.285 on 42 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.6821, Adjusted R-squared: 0.6442
## F-statistic: 18.02 on 5 and 42 DF, p-value: 1.624e-09
On the right side of the cutoff, this model seems to fit the data better!
carpenter_dobkin_2009 %>%
select(agecell, all) %>%
mutate(threshold = as.factor(ifelse(agecell >= 21, 1, 0))) %>%
ggplot(aes(x = agecell, y = all, color = threshold)) +
geom_point() +
geom_smooth(method = "lm",
formula = y ~ x + I(x ^ 2),
se = FALSE) +
scale_color_brewer(palette = "Accent") +
guides(color = FALSE) +
geom_vline(xintercept = 21, color = "red",
size = 1, linetype = "dashed") +
labs(y = "Mortality rate (per 100.000)",
x = "Age (binned)")
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).
Futhermore, it is advisable to check the sensitivity of the results with respect to limiting the sample size.
Instead of using the full range of age, I am only using
respondents aged between 20 and 22 years. Also, I am removing the
quadratic term.
lm_sensitivity <- carpenter_dobkin_2009 %>%
filter(agecell >= 20 & agecell <= 22) %>%
mutate(threshold = ifelse(agecell >= 21, 1, 0)) %$%
lm(all ~ threshold + I(agecell - 21) + threshold:I(agecell - 21))
summary(lm_sensitivity)
##
## Call:
## lm(formula = all ~ threshold + I(agecell - 21) + threshold:I(agecell -
## 21))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3038 -0.9132 -0.1746 1.1758 4.3307
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 92.524 1.370 67.550 < 2e-16 ***
## threshold 9.753 1.937 5.035 6.34e-05 ***
## I(agecell - 21) -1.612 2.407 -0.669 0.511
## threshold:I(agecell - 21) -3.289 3.405 -0.966 0.346
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.366 on 20 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.7161, Adjusted R-squared: 0.6735
## F-statistic: 16.82 on 3 and 20 DF, p-value: 1.083e-05
This result is pretty similar to the previous approach with the quadratic approach. On average, the mortality rate per 100.000 for individuals reaching the minimum drinking age is 9.75 points higher.
carpenter_dobkin_2009 %>%
filter(agecell >= 20 & agecell <= 22) %>%
select(agecell, all) %>%
mutate(threshold = as.factor(ifelse(agecell >= 21, 1, 0))) %>%
ggplot(aes(x = agecell, y = all, color = threshold)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
scale_color_brewer(palette = "Accent") +
guides(color = FALSE) +
geom_vline(xintercept = 21, color = "red",
size = 1, linetype = "dashed") +
labs(y = "Mortality rate (per 100.000)",
x = "Age (binned)")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).
Carpenter, Christopher, and Carlos Dobkin. 2009. "The Effect of Alcohol Consumption on Mortality: Regression Discontinuity Evidence from the Minimum Drinking Age." American Economic Journal: Applied Economics 1 (1): 164–82. https://doi.org/10.1257/app.1.1.164.
Data - https://www.rdocumentation.org/packages/stevedata/versions/0.6.0
https://rpubs.com/phle/r_tutorial_regression_discontinuity_design