In this exercise, you will explore regression dicontinuity designs, based the following paper: Carpenter and Dobkin(2009) The Effect of Alcohol Consumption on Mortality: Regression Discontinuity Evidence from the Minimum Drinking Age AEJ: Applied Economics 1(1)164-82.
a. Download and read the paper: For instance, from the AEA website. b. Get the Carpenter & Dobkin and data:
rm(list=ls())
library(haven)
library(dplyr)
library(ggplot2)
library(stargazer)
AEJfigs <- read_dta("../HW6/AEJfigs.dta")
AEJfigs<- AEJfigs %>%
mutate(over21 = as.factor(ifelse(agecell >= 21, 1, 0)))
View(AEJfigs)
head(AEJfigs)
## # A tibble: 6 x 20
## agecell all allfitted internal internalfitted external externalfitted
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 19.1 92.8 91.7 16.6 16.7 76.2 75.0
## 2 19.2 95.1 91.9 18.3 16.9 76.8 75.0
## 3 19.2 92.1 92.0 18.9 17.1 73.2 75.0
## 4 19.3 88.4 92.2 16.1 17.3 72.3 74.9
## 5 19.4 88.7 92.3 17.4 17.4 71.3 74.9
## 6 19.5 90.2 92.5 17.9 17.6 72.3 74.9
## # ... with 13 more variables: alcohol <dbl>, alcoholfitted <dbl>,
## # homicide <dbl>, homicidefitted <dbl>, suicide <dbl>, suicidefitted <dbl>,
## # mva <dbl>, mvafitted <dbl>, drugs <dbl>, drugsfitted <dbl>,
## # externalother <dbl>, externalotherfitted <dbl>, over21 <fct>
summary(AEJfigs)
## agecell all allfitted internal
## Min. :19.07 Min. : 88.43 Min. : 91.71 Min. :15.98
## 1st Qu.:20.08 1st Qu.: 92.79 1st Qu.: 93.04 1st Qu.:18.60
## Median :21.00 Median : 95.69 Median : 95.18 Median :20.29
## Mean :21.00 Mean : 95.67 Mean : 95.80 Mean :20.29
## 3rd Qu.:21.92 3rd Qu.: 98.03 3rd Qu.: 97.79 3rd Qu.:21.98
## Max. :22.93 Max. :105.27 Max. :102.89 Max. :24.37
## NA's :2 NA's :2
## internalfitted external externalfitted alcohol
## Min. :16.74 Min. :71.34 Min. :73.16 Min. :0.6391
## 1st Qu.:18.67 1st Qu.:73.04 1st Qu.:74.06 1st Qu.:0.9962
## Median :20.54 Median :74.81 Median :74.74 Median :1.2119
## Mean :20.28 Mean :75.39 Mean :75.52 Mean :1.2573
## 3rd Qu.:21.66 3rd Qu.:77.24 3rd Qu.:76.06 3rd Qu.:1.4701
## Max. :24.04 Max. :83.33 Max. :81.78 Max. :2.5193
## NA's :2 NA's :2
## alcoholfitted homicide homicidefitted suicide
## Min. :0.7943 Min. :14.95 Min. :16.26 Min. :10.89
## 1st Qu.:1.0724 1st Qu.:16.61 1st Qu.:16.54 1st Qu.:11.61
## Median :1.2471 Median :16.99 Median :16.99 Median :12.20
## Mean :1.2674 Mean :16.91 Mean :16.95 Mean :12.35
## 3rd Qu.:1.4455 3rd Qu.:17.29 3rd Qu.:17.25 3rd Qu.:12.82
## Max. :1.8174 Max. :18.41 Max. :17.76 Max. :14.83
## NA's :2 NA's :2
## suicidefitted mva mvafitted drugs
## Min. :11.59 Min. :26.86 Min. :27.87 Min. :3.202
## 1st Qu.:11.61 1st Qu.:30.12 1st Qu.:30.17 1st Qu.:3.755
## Median :12.25 Median :31.64 Median :31.73 Median :4.314
## Mean :12.36 Mean :31.62 Mean :31.68 Mean :4.250
## 3rd Qu.:13.04 3rd Qu.:33.10 3rd Qu.:33.40 3rd Qu.:4.756
## Max. :13.55 Max. :36.39 Max. :34.82 Max. :5.565
## NA's :2 NA's :2
## drugsfitted externalother externalotherfitted over21
## Min. :3.449 Min. : 7.973 Min. : 8.388 0:25
## 1st Qu.:3.769 1st Qu.: 9.149 1st Qu.: 9.347 1:25
## Median :4.323 Median : 9.561 Median : 9.690
## Mean :4.255 Mean : 9.599 Mean : 9.610
## 3rd Qu.:4.679 3rd Qu.:10.122 3rd Qu.: 9.939
## Max. :5.130 Max. :11.483 Max. :10.353
## NA's :2
No need to bother trying to get their exact formatting here (unless you love doing that). Just make sure you show the data and the discontinuities.
ggplot(AEJfigs, aes(x=agecell)) +
geom_point(aes(y=all, shape="ALL"))+
geom_point(mapping=aes(y=internal, shape="Internal")) +
geom_point(aes(y=external, shape="External"))+
geom_line(aes(y=internalfitted, linetype="Internal Fitted"))+
geom_line(aes(y=allfitted, linetype="All Fitted"))+
ylim(0, 120)+
xlim(19, 23)+
geom_line(aes(y=externalfitted, linetype="External Fitted"))+
theme_classic()+
xlab("Age")+
ylab("Deaths per 100,000 person-years")+ggtitle("Figure 3. Age Profile For Death Rates")+
theme(panel.background = element_blank(),
legend.title = element_blank())
Run a simple RD regression (same slope, with a jump) for “all” deaths by age (I don’t mean all the variables, just the one variable that is called “all”). Note: You are NOT trying to reproduce the paper’s tables. That is because you only have access to a collapsed dataset with <50 observations. Don’t expect to closely reproduce the results of the tables in the paper, which uses a full 1500 observations dataset. Also, you don’t have the same variables either. Analyse the results. How do you use these results to estimate the relationship between drinking and mortality?
simpleRDD<-lm(all~agecell+over21, data=AEJfigs)
stargazer(simpleRDD, type ="text", title= "A Simple RDD", align= TRUE, covariate.labels = c("Agecell", "Treated"))
##
## A Simple RDD
## ===============================================
## Dependent variable:
## ---------------------------
## all
## -----------------------------------------------
## Agecell -0.975
## (0.632)
##
## Treated 7.663***
## (1.440)
##
## Constant 112.310***
## (12.668)
##
## -----------------------------------------------
## Observations 48
## R2 0.595
## Adjusted R2 0.577
## Residual Std. Error 2.493 (df = 45)
## F Statistic 32.995*** (df = 2; 45)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Plot the all variable by age, and add the regression lines defined by your regression output (It’s ok if the lines extend throughout the whole figure. Don’t worry too much about formatting.)
# Get the fitted value.
AEJfigs %>%
ggplot(AEJfigs, mapping=aes(x=agecell)) +
geom_point(mapping=aes(y=all)) +
geom_line(data=fortify(simpleRDD), aes(x=agecell,y=.fitted)) +
xlab("Age") +
ylim(0, 120)+
xlim(19, 23)+
ylab("Deaths per 100,000 person-years")+ggtitle("Simplest RDD") +
theme_classic()+
theme(panel.background = element_blank(),
legend.title = element_blank())
Run another RD where not only the intercept, but also the slope can differ between the under- and over-21 subsamples. Analyse the results. What is your estimate of interest. Does it differ from the previous one?
mfrdd<-lm(all~agecell+over21+over21*agecell, data=AEJfigs)
stargazer(mfrdd, type ="text", title= "A More Flexible RDD", align= TRUE, covariate.labels = c("Agecell", "Treated", "Treated*agecell"))
##
## A More Flexible RDD
## ===============================================
## Dependent variable:
## ---------------------------
## all
## -----------------------------------------------
## Agecell 0.827
## (0.819)
##
## Treated 83.333***
## (24.357)
##
## Treated*agecell -3.603***
## (1.158)
##
## Constant 76.251***
## (16.396)
##
## -----------------------------------------------
## Observations 48
## R2 0.668
## Adjusted R2 0.645
## Residual Std. Error 2.283 (df = 44)
## F Statistic 29.466*** (df = 3; 44)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Again, plot the data and the regression lines defined by the regression.
AEJfigs %>%
ggplot(AEJfigs, mapping=aes(x=agecell)) +
geom_point(mapping=aes(y=all)) +
geom_line(data=fortify(mfrdd), aes(x=agecell,y=.fitted)) +
ylim(0, 120)+
xlim(19, 23)+
xlab("Age") +
ylab("Deaths per 100,000 person-years")+ggtitle("More Flexible RDD") +
theme_classic()+
theme(panel.background = element_blank(),
legend.title = element_blank())
Use the rdd package. Run the RDestimate command on all~agecell, specifying your data and the cutoff option at 21. Print the output of your regression. Plot the output of your regression (You can just write plot(name). RDestimate outputs plotable objects). Discuss the results. Do they differ from what you found? Why?
library(rdd)
ppRDD <- RDestimate(all~agecell, data=AEJfigs, cutpoint = 21)
summary(ppRDD)
##
## Call:
## RDestimate(formula = all ~ agecell, data = AEJfigs, cutpoint = 21)
##
## Type:
## sharp
##
## Estimates:
## Bandwidth Observations Estimate Std. Error z value Pr(>|z|)
## LATE 1.6561 40 9.001 1.480 6.080 1.199e-09
## Half-BW 0.8281 20 9.579 1.914 5.004 5.609e-07
## Double-BW 3.3123 48 7.953 1.278 6.223 4.882e-10
##
## LATE ***
## Half-BW ***
## Double-BW ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## F-statistics:
## F Num. DoF Denom. DoF p
## LATE 33.08 3 36 3.799e-10
## Half-BW 29.05 3 16 2.078e-06
## Double-BW 32.54 3 44 6.129e-11
plot(ppRDD)
Re-run an RDestimate command, this time adding the options kernel = “rectangular” and bw=2. Output the results. Plot them. Compare to previous output and discuss.
ppRDD <- RDestimate(all~agecell, data=AEJfigs, cutpoint = 21, kernel = "rectangular", bw=2 )
summary(ppRDD)
##
## Call:
## RDestimate(formula = all ~ agecell, data = AEJfigs, cutpoint = 21,
## bw = 2, kernel = "rectangular")
##
## Type:
## sharp
##
## Estimates:
## Bandwidth Observations Estimate Std. Error z value Pr(>|z|)
## LATE 2 48 7.663 1.273 6.017 1.776e-09
## Half-BW 1 24 9.753 1.902 5.128 2.929e-07
## Double-BW 4 48 7.663 1.273 6.017 1.776e-09
##
## LATE ***
## Half-BW ***
## Double-BW ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## F-statistics:
## F Num. DoF Denom. DoF p
## LATE 29.47 3 44 2.651e-10
## Half-BW 16.82 3 20 2.167e-05
## Double-BW 29.47 3 44 2.651e-10
plot(ppRDD)