Regression Discontinuity

Reference: Carpenter and Dobkin(2009) The Effect of Alcohol Consumption on Mortality: Regression Discontinuity Evidence from the Minimum Drinking Age AEJ: Applied Economics 1(1)164-82

a. Download and read the paper:

For instance, from the AEA website.

b. Get the Carpenter & Dobkin and data:

Can be downloaded straight here http://masteringmetrics.com/wp-content/uploads/2015/01/AEJfigs.dta, from A&P Mastering Metrics resources webpage. Look over the data, understand what the different variables are. The key variables you need are agecell, all, internal, external, as well as the versions of those variables that have the suffix fitted. Make an over21 dummy to help with the following analysis.

library(haven)
d<-read_dta('hw6/AEJfigs.dta')
#make dummy
d$over21 <- ifelse(d$agecell >= 21, 1, 0)
head(d)
## # A tibble: 6 × 20
##   agecell   all allfitted internal internalfitted external externalfitted
##     <dbl> <dbl>     <dbl>    <dbl>          <dbl>    <dbl>          <dbl>
## 1    19.1  92.8      91.7     16.6           16.7     76.2           75.0
## 2    19.2  95.1      91.9     18.3           16.9     76.8           75.0
## 3    19.2  92.1      92.0     18.9           17.1     73.2           75.0
## 4    19.3  88.4      92.2     16.1           17.3     72.3           74.9
## 5    19.4  88.7      92.3     17.4           17.4     71.3           74.9
## 6    19.5  90.2      92.5     17.9           17.6     72.3           74.9
## # … with 13 more variables: alcohol <dbl>, alcoholfitted <dbl>, homicide <dbl>,
## #   homicidefitted <dbl>, suicide <dbl>, suicidefitted <dbl>, mva <dbl>,
## #   mvafitted <dbl>, drugs <dbl>, drugsfitted <dbl>, externalother <dbl>,
## #   externalotherfitted <dbl>, over21 <dbl>

c. Reproduce Figure 3 of the paper

library(tidyverse)
ggplot()+
  geom_point(aes(x = agecell, y = all),data=d)+
  geom_line(aes(x = agecell, y = allfitted),subset(d, agecell < 21))+
  geom_line(aes(x = agecell, y = allfitted),subset(d, agecell >= 21))+
    geom_point(aes(x = agecell, y = internal),data=d)+
  geom_line(aes(x = agecell, y = internalfitted),subset(d, agecell < 21))+
  geom_line(aes(x = agecell, y = internalfitted),subset(d, agecell >= 21))+
     geom_point(aes(x = agecell, y = external),data=d)+
  geom_line(aes(x = agecell, y = externalfitted),subset(d, agecell < 21))+
  geom_line(aes(x = agecell, y = externalfitted),subset(d, agecell >= 21))+
  ggtitle("Figure 3. Age Profile for Death Rates")+
    xlab("Age")+ylab("Deaths per 100,000 person-years")

d. Simplest regression-discontinuity design

Run a simple RD regression (same slope, with a jump) for “all” deaths by age (I don’t mean all the variables, just the one variable that is called “all”).

library(stargazer)
g<-lm(all~agecell+over21,data=d)
stargazer(g, title="Simple RD regression results", type = "text")
## 
## Simple RD regression results
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                                 all            
## -----------------------------------------------
## agecell                       -0.975           
##                               (0.632)          
##                                                
## over21                       7.663***          
##                               (1.440)          
##                                                
## Constant                    112.310***         
##                              (12.668)          
##                                                
## -----------------------------------------------
## Observations                    48             
## R2                             0.595           
## Adjusted R2                    0.577           
## Residual Std. Error       2.493 (df = 45)      
## F Statistic           32.995*** (df = 2; 45)   
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

The dummy variable variable “over21” has a significant positive effect on the death, which can be interpreted that death would increase about 7 people per 1000 people at age of 21.

e. Plot the results

Plot the all variable by age, and add the regression lines defined by your regression output

d$fit<-predict(g,d)

ggplot()+
  geom_line(aes(x = agecell, y = fit, group=over21,color=factor(over21)),data=d)+
  geom_point(aes(x = agecell, y = all),data=d)+
  ggtitle("RDD regression output")+
  theme(legend.position = "none")+
  xlab("Age")+ylab("Deaths per 100,000 person-years")

f. Allow more flexibility to your RD

Run another RD where not only the intercept, but also the slope can differ between the under- and over-21 subsamples. Analyse the results. What is your estimate of interest. Does it differ from the previous one?

g2<-lm(all~agecell+over21+I(agecell*over21),data=d)
stargazer(g2, title="Another RD regression results", type = "text")
## 
## Another RD regression results
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                                 all            
## -----------------------------------------------
## agecell                        0.827           
##                               (0.819)          
##                                                
## over21                       83.333***         
##                              (24.357)          
##                                                
## I(agecell * over21)          -3.603***         
##                               (1.158)          
##                                                
## Constant                     76.251***         
##                              (16.396)          
##                                                
## -----------------------------------------------
## Observations                    48             
## R2                             0.668           
## Adjusted R2                    0.645           
## Residual Std. Error       2.283 (df = 44)      
## F Statistic           29.466*** (df = 3; 44)   
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

This time, I add interactive term to reflect the effect of “over21” depending on “agecell”. The result shows that death would increase about 83 people per 1000 people at age of 21. And age 21 depend on age to affect the mortality rate.The interactive effect is negative. In the previous RDD results without intercept, agecell has a negative effect on mortality rate but now, it becomes positive. And the coefficient of “over21” becomes much higher after using interactive.

g. Again, plot the data

Again, plot the data and the regression lines defined by the regression.

d$fit2<-predict(g2,d)

ggplot()+
  geom_line(aes(x = agecell, y = fit2, group=over21,color=factor(over21)),data=d)+
  geom_point(aes(x = agecell, y = all),data=d)+
  ggtitle("Another RDD regression output")+
  theme(legend.position = "none")+
  xlab("Age")+ylab("Deaths per 100,000 person-years")

f. Compare to pre-packaged RDD:

Use the rdd package. Run the RDestimate command on all~agecell, specifying your data and the cutoff option at 21. Print the output of your regression. Plot the output of your regression (You can just write plot(name). RDestimate outputs plotable objects). Discuss the results. Do they differ from what you found? Why?

library(rdd)
g3 <- RDestimate(all ~ agecell, data=d, cutpoint=21)
summary(g3)
## 
## Call:
## RDestimate(formula = all ~ agecell, data = d, cutpoint = 21)
## 
## Type:
## sharp 
## 
## Estimates:
##            Bandwidth  Observations  Estimate  Std. Error  z value  Pr(>|z|) 
## LATE       1.6561     40            9.001     1.480       6.080    1.199e-09
## Half-BW    0.8281     20            9.579     1.914       5.004    5.609e-07
## Double-BW  3.3123     48            7.953     1.278       6.223    4.882e-10
##               
## LATE       ***
## Half-BW    ***
## Double-BW  ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## F-statistics:
##            F      Num. DoF  Denom. DoF  p        
## LATE       33.08  3         36          3.799e-10
## Half-BW    29.05  3         16          2.078e-06
## Double-BW  32.54  3         44          6.129e-11
plot(g3)

The LATE estimate results is 9.001 which is higher than what we have before. The reason behind it could be that LATE results is estimated at a default bandwidth of 1.6561. And in the next question, we change the bandwidth and kernel method, the result become the same.

g. Prepackaged linear RDD:

Re-run an RDestimate command, this time adding the options kernel = “rectangular” and bw=2. Output the results. Plot them. Compare to previous output and discuss.

library(rdd)
g4 <- RDestimate(all ~ agecell, data=d, cutpoint=21, kernel = "rectangular", bw=2)
summary(g4)
## 
## Call:
## RDestimate(formula = all ~ agecell, data = d, cutpoint = 21, 
##     bw = 2, kernel = "rectangular")
## 
## Type:
## sharp 
## 
## Estimates:
##            Bandwidth  Observations  Estimate  Std. Error  z value  Pr(>|z|) 
## LATE       2          48            7.663     1.273       6.017    1.776e-09
## Half-BW    1          24            9.753     1.902       5.128    2.929e-07
## Double-BW  4          48            7.663     1.273       6.017    1.776e-09
##               
## LATE       ***
## Half-BW    ***
## Double-BW  ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## F-statistics:
##            F      Num. DoF  Denom. DoF  p        
## LATE       29.47  3         44          2.651e-10
## Half-BW    16.82  3         20          2.167e-05
## Double-BW  29.47  3         44          2.651e-10
plot(g4)

After we add the new command, LATE estimate results is 7.663 which is lower than last pre-packaged RDD results; but it has the same results as simple RD regression.

–The end–