Reference: Carpenter and Dobkin(2009) The Effect of Alcohol Consumption on Mortality: Regression Discontinuity Evidence from the Minimum Drinking Age AEJ: Applied Economics 1(1)164-82
For instance, from the AEA website.
Can be downloaded straight here http://masteringmetrics.com/wp-content/uploads/2015/01/AEJfigs.dta, from A&P Mastering Metrics resources webpage. Look over the data, understand what the different variables are. The key variables you need are agecell, all, internal, external, as well as the versions of those variables that have the suffix fitted. Make an over21 dummy to help with the following analysis.
library(haven)
d<-read_dta('hw6/AEJfigs.dta')
#make dummy
d$over21 <- ifelse(d$agecell >= 21, 1, 0)
head(d)
## # A tibble: 6 × 20
## agecell all allfitted internal internalfitted external externalfitted
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 19.1 92.8 91.7 16.6 16.7 76.2 75.0
## 2 19.2 95.1 91.9 18.3 16.9 76.8 75.0
## 3 19.2 92.1 92.0 18.9 17.1 73.2 75.0
## 4 19.3 88.4 92.2 16.1 17.3 72.3 74.9
## 5 19.4 88.7 92.3 17.4 17.4 71.3 74.9
## 6 19.5 90.2 92.5 17.9 17.6 72.3 74.9
## # … with 13 more variables: alcohol <dbl>, alcoholfitted <dbl>, homicide <dbl>,
## # homicidefitted <dbl>, suicide <dbl>, suicidefitted <dbl>, mva <dbl>,
## # mvafitted <dbl>, drugs <dbl>, drugsfitted <dbl>, externalother <dbl>,
## # externalotherfitted <dbl>, over21 <dbl>
library(tidyverse)
ggplot()+
geom_point(aes(x = agecell, y = all),data=d)+
geom_line(aes(x = agecell, y = allfitted),subset(d, agecell < 21))+
geom_line(aes(x = agecell, y = allfitted),subset(d, agecell >= 21))+
geom_point(aes(x = agecell, y = internal),data=d)+
geom_line(aes(x = agecell, y = internalfitted),subset(d, agecell < 21))+
geom_line(aes(x = agecell, y = internalfitted),subset(d, agecell >= 21))+
geom_point(aes(x = agecell, y = external),data=d)+
geom_line(aes(x = agecell, y = externalfitted),subset(d, agecell < 21))+
geom_line(aes(x = agecell, y = externalfitted),subset(d, agecell >= 21))+
ggtitle("Figure 3. Age Profile for Death Rates")+
xlab("Age")+ylab("Deaths per 100,000 person-years")
Run a simple RD regression (same slope, with a jump) for “all” deaths by age (I don’t mean all the variables, just the one variable that is called “all”).
library(stargazer)
g<-lm(all~agecell+over21,data=d)
stargazer(g, title="Simple RD regression results", type = "text")
##
## Simple RD regression results
## ===============================================
## Dependent variable:
## ---------------------------
## all
## -----------------------------------------------
## agecell -0.975
## (0.632)
##
## over21 7.663***
## (1.440)
##
## Constant 112.310***
## (12.668)
##
## -----------------------------------------------
## Observations 48
## R2 0.595
## Adjusted R2 0.577
## Residual Std. Error 2.493 (df = 45)
## F Statistic 32.995*** (df = 2; 45)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The dummy variable variable “over21” has a significant positive effect on the death, which can be interpreted that death would increase about 7 people per 1000 people at age of 21.
Plot the all variable by age, and add the regression lines defined by your regression output
d$fit<-predict(g,d)
ggplot()+
geom_line(aes(x = agecell, y = fit, group=over21,color=factor(over21)),data=d)+
geom_point(aes(x = agecell, y = all),data=d)+
ggtitle("RDD regression output")+
theme(legend.position = "none")+
xlab("Age")+ylab("Deaths per 100,000 person-years")
Run another RD where not only the intercept, but also the slope can differ between the under- and over-21 subsamples. Analyse the results. What is your estimate of interest. Does it differ from the previous one?
g2<-lm(all~agecell+over21+I(agecell*over21),data=d)
stargazer(g2, title="Another RD regression results", type = "text")
##
## Another RD regression results
## ===============================================
## Dependent variable:
## ---------------------------
## all
## -----------------------------------------------
## agecell 0.827
## (0.819)
##
## over21 83.333***
## (24.357)
##
## I(agecell * over21) -3.603***
## (1.158)
##
## Constant 76.251***
## (16.396)
##
## -----------------------------------------------
## Observations 48
## R2 0.668
## Adjusted R2 0.645
## Residual Std. Error 2.283 (df = 44)
## F Statistic 29.466*** (df = 3; 44)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
This time, I add interactive term to reflect the effect of “over21” depending on “agecell”. The result shows that death would increase about 83 people per 1000 people at age of 21. And age 21 depend on age to affect the mortality rate.The interactive effect is negative. In the previous RDD results without intercept, agecell has a negative effect on mortality rate but now, it becomes positive. And the coefficient of “over21” becomes much higher after using interactive.
Again, plot the data and the regression lines defined by the regression.
d$fit2<-predict(g2,d)
ggplot()+
geom_line(aes(x = agecell, y = fit2, group=over21,color=factor(over21)),data=d)+
geom_point(aes(x = agecell, y = all),data=d)+
ggtitle("Another RDD regression output")+
theme(legend.position = "none")+
xlab("Age")+ylab("Deaths per 100,000 person-years")
Use the rdd package. Run the RDestimate command on all~agecell, specifying your data and the cutoff option at 21. Print the output of your regression. Plot the output of your regression (You can just write plot(name). RDestimate outputs plotable objects). Discuss the results. Do they differ from what you found? Why?
library(rdd)
g3 <- RDestimate(all ~ agecell, data=d, cutpoint=21)
summary(g3)
##
## Call:
## RDestimate(formula = all ~ agecell, data = d, cutpoint = 21)
##
## Type:
## sharp
##
## Estimates:
## Bandwidth Observations Estimate Std. Error z value Pr(>|z|)
## LATE 1.6561 40 9.001 1.480 6.080 1.199e-09
## Half-BW 0.8281 20 9.579 1.914 5.004 5.609e-07
## Double-BW 3.3123 48 7.953 1.278 6.223 4.882e-10
##
## LATE ***
## Half-BW ***
## Double-BW ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## F-statistics:
## F Num. DoF Denom. DoF p
## LATE 33.08 3 36 3.799e-10
## Half-BW 29.05 3 16 2.078e-06
## Double-BW 32.54 3 44 6.129e-11
plot(g3)
The LATE estimate results is 9.001 which is higher than what we have before. The reason behind it could be that LATE results is estimated at a default bandwidth of 1.6561. And in the next question, we change the bandwidth and kernel method, the result become the same.
Re-run an RDestimate command, this time adding the options kernel = “rectangular” and bw=2. Output the results. Plot them. Compare to previous output and discuss.
library(rdd)
g4 <- RDestimate(all ~ agecell, data=d, cutpoint=21, kernel = "rectangular", bw=2)
summary(g4)
##
## Call:
## RDestimate(formula = all ~ agecell, data = d, cutpoint = 21,
## bw = 2, kernel = "rectangular")
##
## Type:
## sharp
##
## Estimates:
## Bandwidth Observations Estimate Std. Error z value Pr(>|z|)
## LATE 2 48 7.663 1.273 6.017 1.776e-09
## Half-BW 1 24 9.753 1.902 5.128 2.929e-07
## Double-BW 4 48 7.663 1.273 6.017 1.776e-09
##
## LATE ***
## Half-BW ***
## Double-BW ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## F-statistics:
## F Num. DoF Denom. DoF p
## LATE 29.47 3 44 2.651e-10
## Half-BW 16.82 3 20 2.167e-05
## Double-BW 29.47 3 44 2.651e-10
plot(g4)
After we add the new command, LATE estimate results is 7.663 which is lower than last pre-packaged RDD results; but it has the same results as simple RD regression.
–The end–