In this exercise, you will explore regression dicontinuity designs, based the following paper:
Carpenter and Dobkin(2009) The Effect of Alcohol Consumption on Mortality: Regression Discontinuity Evidence from the Minimum Drinking Age AEJ: Applied Economics 1(1)164-82
a. Download and read the paper:
For instance, from the AEA website.
b. Get the Carpenter & Dobkin and data:
Can be downloaded straight here http://masteringmetrics.com/wp-content/uploads/2015/01/AEJfigs.dta, from A&P Mastering Metrics resources webpage (If that link is dead just ask me for the file).
Look over the data, understand what the different variables are. The key variables you need are agecell, all, internal, external, as well as the versions of those variables that have the suffix fitted.
Make an over21 dummy to help with the following analysis.
library(haven)
library(stargazer)
library(foreign)
library(kableExtra)
library(vtable)
library(dplyr)
library(ggplot2)
library(tidyr)
library(rdd)
alcon <- read_dta("AEJfigs.dta")
head(alcon)
## # A tibble: 6 × 19
## agecell all allfit…¹ inter…² inter…³ exter…⁴ exter…⁵ alcohol alcoh…⁶ homic…⁷
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 19.1 92.8 91.7 16.6 16.7 76.2 75.0 0.639 0.794 16.3
## 2 19.2 95.1 91.9 18.3 16.9 76.8 75.0 0.677 0.838 16.9
## 3 19.2 92.1 92.0 18.9 17.1 73.2 75.0 0.866 0.878 15.2
## 4 19.3 88.4 92.2 16.1 17.3 72.3 74.9 0.867 0.915 16.7
## 5 19.4 88.7 92.3 17.4 17.4 71.3 74.9 1.02 0.949 14.9
## 6 19.5 90.2 92.5 17.9 17.6 72.3 74.9 1.17 0.981 15.6
## # … with 9 more variables: homicidefitted <dbl>, suicide <dbl>,
## # suicidefitted <dbl>, mva <dbl>, mvafitted <dbl>, drugs <dbl>,
## # drugsfitted <dbl>, externalother <dbl>, externalotherfitted <dbl>, and
## # abbreviated variable names ¹allfitted, ²internal, ³internalfitted,
## # ⁴external, ⁵externalfitted, ⁶alcoholfitted, ⁷homicide
alcon$over21 <- ifelse(alcon$agecell >= 21, 1, 0)
c. Reproduce Figure 3 of the paper
No need to bother trying to get their exact formatting here (unless you love doing that). Just make sure you show the data and the discontinuities.
alcon_slim <- alcon %>%
select(agecell, all, allfitted, internal, internalfitted, external, externalfitted) %>%
gather(key = "Cause" , value = "value", -agecell, na.rm = TRUE, convert = FALSE, factor_key = FALSE)
Fig3 <- ggplot(alcon_slim, aes(x = as.numeric(agecell), y = value)) +
geom_point(aes(color = Cause)) +
scale_x_continuous (limits=c(19,23), breaks= seq(19, 23, 0.5)) +
scale_y_continuous (limits=c(0, 120), breaks=seq(0, 120, 20)) +
scale_color_manual (values=c("#666666","#333333","#663300", "#330033", "#0066CC", "#0000CC"),
name="Cause", labels=c("All", "All fitted", "Internal", "Internal fitted", "External", "External fitted")) +
labs(title = "Figure 3. Age Profile for Death Rates", x = "Age", y = "Deaths per 100,000 person-years",
caption = "Notes: Deaths from the National Vital Statistics Records. Includes all deaths that occurred in the United States between 1997–2003. \n The population denominators are derived from the census. See online Appendix C for a list of causes of death.") +
theme(panel.background = element_blank(),
legend.background = element_rect(size=0.5, linetype="solid", colour ="black"),
legend.direction = "horizontal", legend.position = c(0.7,0.4), legend.title = element_blank(),
panel.grid.minor = element_blank(), panel.grid.major = element_blank(), panel.border= element_blank(),
axis.line.x = element_line(color="black", size = 0.5),
axis.line.y = element_line(color="black", size = 0.5),
plot.title = element_text(hjust = 0.5, face ="bold"),
plot.caption = element_text(hjust = 0, size = 7.5),
axis.text.x = element_text(colour = "black", size = 10),
axis.text.y = element_text(colour = "black", size = 10),
axis.title = element_text(size = 12))
Fig3
d. Simplest regression-discontinuity design
Run a simple RD regression (same slope, with a jump) for “all” deaths by age (I don’t mean all the variables, just the one variable that is called “all”).
RDD_d <- lm(all ~ agecell + over21, data = alcon)
stargazer(RDD_d, type = "text", title = "Simple RD Regression",
align = TRUE,
dep.var.labels = c("All Types of Deaths"), covariate.labels = c("agecell", "over21"),
keep.stat = c("n", "rsq"), omit = c("adj.rsq"),
table.layout = "=d=t-s=n")
##
## Simple RD Regression
## ========================================
## All Types of Deaths
## ========================================
## agecell -0.975
## (0.632)
##
## over21 7.663***
## (1.440)
##
## Constant 112.310***
## (12.668)
##
## ----------------------------------------
## Observations 48
## R2 0.595
## ========================================
## Note: *p<0.1; **p<0.05; ***p<0.01
How do you use these results to estimate the relationship between drinking and mortality?
The coefficient estimate of over21 is 7.663, suggesting that as the
age passes 21, alcohol consumption causes additional 7.67 death counts
per 100,000 individuals. The result is significant at 99% level of
confidence.
e. Plot the results Plot the all variable by age, and add the regression lines defined by your regression output.
# Get fitted values using predict() function
alcon$RDD_fit <- predict(RDD_d, alcon)
# Plot the all variable by age
plot(alcon$agecell, alcon$all, main = "Plot e. Simple RD Regression",
xaxs = "i", yaxs="i",
xlim = c(19,23), ylim = c(80, 110),
xlab = "Age", ylab = "Deaths per 100,000 person-years",
pch = 17, lwd = 2, lty = 1,
frame.plot = FALSE)
# Add regression lines
with(subset(alcon, agecell < 21), lines(agecell, RDD_fit, col = "red"))
with(subset(alcon, agecell >= 21), lines(agecell, RDD_fit, col = "red" ))
f. Allow more flexibility to your RD Run another
RD where not only the intercept, but also the slope can differ between
the under- and over-21 subsamples.
Analyse the results. What is your estimate of interest. Does it differ from the previous one?
RDD_f <- lm(all ~ agecell + over21 + agecell*over21, data = alcon)
stargazer(RDD_f, type = "text", title = "RD Regression with Flexibility",
align = TRUE,
dep.var.labels = c("All Types of Deaths"),
covariate.labels = c("agecell", "over21","agecell-over21 Interact"),
keep.stat = c("n", "rsq"), omit = c("adj.rsq"),
table.layout = "=d=t-s=n")
##
## RD Regression with Flexibility
## ===================================================
## All Types of Deaths
## ===================================================
## agecell 0.827
## (0.819)
##
## over21 83.333***
## (24.357)
##
## agecell-over21 Interact -3.603***
## (1.158)
##
## Constant 76.251***
## (16.396)
##
## ---------------------------------------------------
## Observations 48
## R2 0.668
## ===================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The estimate of interest would be over21. An over21-agecell
interaction term is included n this version of RD regression. Therefore
the magnitude of the effect of individual passing the age of 21 on
mortality should be calculated as: 83.333-3.603*(age=21)=7.67, which is
almost the same result compared to simple RD regresison. The estimate is
significant at 99% level of confidence.
g. Again, plot the data
Plot the data and the regression lines defined by the regression.
# Get fitted values using predict() function
alcon$RDD_flex <- predict(RDD_f, alcon)
# Plot the all variable by age
plot(alcon$agecell, alcon$all, main = "Plot g. RD Regression with Flexibility",
xaxs = "i", yaxs="i",
xlim = c(19,23), ylim = c(80, 110),
xlab = "Age", ylab = "Deaths per 100,000 person-years",
pch = 17, lwd = 2, lty = 1,
frame.plot = FALSE)
# Add regression lines
with(subset(alcon, agecell < 21), lines(agecell, RDD_flex, col = "orange"))
with(subset(alcon, agecell >= 21), lines(agecell, RDD_flex, col = "red" ))
h. Compare to pre-packaged RDD:
Use the rdd package.
Run the RDestimate command on all~agecell, specifying your data and the cutoff option at 21.
Print the output of your regression.
RDD_pack <- RDestimate(all~agecell, data = alcon, cutpoint = 21)
print(RDD_pack)
##
## Call:
## RDestimate(formula = all ~ agecell, data = alcon, cutpoint = 21)
##
## Coefficients:
## LATE Half-BW Double-BW
## 9.001 9.579 7.953
summary(RDD_pack)
##
## Call:
## RDestimate(formula = all ~ agecell, data = alcon, cutpoint = 21)
##
## Type:
## sharp
##
## Estimates:
## Bandwidth Observations Estimate Std. Error z value Pr(>|z|)
## LATE 1.6561 40 9.001 1.480 6.080 1.199e-09
## Half-BW 0.8281 20 9.579 1.914 5.004 5.609e-07
## Double-BW 3.3123 48 7.953 1.278 6.223 4.882e-10
##
## LATE ***
## Half-BW ***
## Double-BW ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## F-statistics:
## F Num. DoF Denom. DoF p
## LATE 33.08 3 36 3.799e-10
## Half-BW 29.05 3 16 2.078e-06
## Double-BW 32.54 3 44 6.129e-11
Plot the output of your regression (You can just write
plot(name). RDestimate outputs plotable objects). Discuss the results.
Do they differ from what you found? Why?
plot(RDD_pack)
The result indicates an average treatment effect for individuals turning the age of 21 being 9, which is a bit higher than the RD Regression we conducted in part d and part f. The result is statistically significant at 99% level of confidence.
i. Prepackaged linear RDD:
Re-run an RDestimate command, this time adding the options kernel = “rectangular” and bw=2.
Output the results. Plot them. Compare to previous output and discuss.
RDD_rect <- RDestimate(all ~ agecell, data = alcon, cutpoint=21,
kernel = "rectangular", bw=2)
print(RDD_rect)
##
## Call:
## RDestimate(formula = all ~ agecell, data = alcon, cutpoint = 21,
## bw = 2, kernel = "rectangular")
##
## Coefficients:
## LATE Half-BW Double-BW
## 7.663 9.753 7.663
summary(RDD_rect)
##
## Call:
## RDestimate(formula = all ~ agecell, data = alcon, cutpoint = 21,
## bw = 2, kernel = "rectangular")
##
## Type:
## sharp
##
## Estimates:
## Bandwidth Observations Estimate Std. Error z value Pr(>|z|)
## LATE 2 48 7.663 1.273 6.017 1.776e-09
## Half-BW 1 24 9.753 1.902 5.128 2.929e-07
## Double-BW 4 48 7.663 1.273 6.017 1.776e-09
##
## LATE ***
## Half-BW ***
## Double-BW ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## F-statistics:
## F Num. DoF Denom. DoF p
## LATE 29.47 3 44 2.651e-10
## Half-BW 16.82 3 20 2.167e-05
## Double-BW 29.47 3 44 2.651e-10
plot(RDD_rect)
The default setting for the kernel argument in the RDD package is Gaussian. After the argument was updated to “rectangular” implying a uniformly distributed density function, the ATE (7.663) decreases and becomes very close to the same coefficient estimates we retrieved from part d and part e, the simplest regression-discounity design.