Overview

In this exercise, you will explore regression dicontinuity designs, based the following paper: Carpenter and Dobkin(2009) The Effect of Alcohol Consumption on Mortality: Regression Discontinuity Evidence from the Minimum Drinking Age AEJ: Applied Economics 1(1)164-82.

Tasks

a. Download and read the paper: For instance, from the AEA website. b. Get the Carpenter & Dobkin and data:

rm(list=ls())
library(haven)
library(dplyr)
library(ggplot2)
library(stargazer)
AEJfigs <- read_dta("../HW6/AEJfigs.dta")
AEJfigs<- AEJfigs %>% 
  mutate(over21 = as.factor(ifelse(agecell >= 21, 1, 0)))
View(AEJfigs)
head(AEJfigs)
## # A tibble: 6 x 20
##   agecell   all allfitted internal internalfitted external externalfitted
##     <dbl> <dbl>     <dbl>    <dbl>          <dbl>    <dbl>          <dbl>
## 1    19.1  92.8      91.7     16.6           16.7     76.2           75.0
## 2    19.2  95.1      91.9     18.3           16.9     76.8           75.0
## 3    19.2  92.1      92.0     18.9           17.1     73.2           75.0
## 4    19.3  88.4      92.2     16.1           17.3     72.3           74.9
## 5    19.4  88.7      92.3     17.4           17.4     71.3           74.9
## 6    19.5  90.2      92.5     17.9           17.6     72.3           74.9
## # ... with 13 more variables: alcohol <dbl>, alcoholfitted <dbl>,
## #   homicide <dbl>, homicidefitted <dbl>, suicide <dbl>, suicidefitted <dbl>,
## #   mva <dbl>, mvafitted <dbl>, drugs <dbl>, drugsfitted <dbl>,
## #   externalother <dbl>, externalotherfitted <dbl>, over21 <fct>
summary(AEJfigs)
##     agecell           all           allfitted         internal    
##  Min.   :19.07   Min.   : 88.43   Min.   : 91.71   Min.   :15.98  
##  1st Qu.:20.08   1st Qu.: 92.79   1st Qu.: 93.04   1st Qu.:18.60  
##  Median :21.00   Median : 95.69   Median : 95.18   Median :20.29  
##  Mean   :21.00   Mean   : 95.67   Mean   : 95.80   Mean   :20.29  
##  3rd Qu.:21.92   3rd Qu.: 98.03   3rd Qu.: 97.79   3rd Qu.:21.98  
##  Max.   :22.93   Max.   :105.27   Max.   :102.89   Max.   :24.37  
##                  NA's   :2                         NA's   :2      
##  internalfitted     external     externalfitted     alcohol      
##  Min.   :16.74   Min.   :71.34   Min.   :73.16   Min.   :0.6391  
##  1st Qu.:18.67   1st Qu.:73.04   1st Qu.:74.06   1st Qu.:0.9962  
##  Median :20.54   Median :74.81   Median :74.74   Median :1.2119  
##  Mean   :20.28   Mean   :75.39   Mean   :75.52   Mean   :1.2573  
##  3rd Qu.:21.66   3rd Qu.:77.24   3rd Qu.:76.06   3rd Qu.:1.4701  
##  Max.   :24.04   Max.   :83.33   Max.   :81.78   Max.   :2.5193  
##                  NA's   :2                       NA's   :2       
##  alcoholfitted       homicide     homicidefitted     suicide     
##  Min.   :0.7943   Min.   :14.95   Min.   :16.26   Min.   :10.89  
##  1st Qu.:1.0724   1st Qu.:16.61   1st Qu.:16.54   1st Qu.:11.61  
##  Median :1.2471   Median :16.99   Median :16.99   Median :12.20  
##  Mean   :1.2674   Mean   :16.91   Mean   :16.95   Mean   :12.35  
##  3rd Qu.:1.4455   3rd Qu.:17.29   3rd Qu.:17.25   3rd Qu.:12.82  
##  Max.   :1.8174   Max.   :18.41   Max.   :17.76   Max.   :14.83  
##                   NA's   :2                       NA's   :2      
##  suicidefitted        mva          mvafitted         drugs      
##  Min.   :11.59   Min.   :26.86   Min.   :27.87   Min.   :3.202  
##  1st Qu.:11.61   1st Qu.:30.12   1st Qu.:30.17   1st Qu.:3.755  
##  Median :12.25   Median :31.64   Median :31.73   Median :4.314  
##  Mean   :12.36   Mean   :31.62   Mean   :31.68   Mean   :4.250  
##  3rd Qu.:13.04   3rd Qu.:33.10   3rd Qu.:33.40   3rd Qu.:4.756  
##  Max.   :13.55   Max.   :36.39   Max.   :34.82   Max.   :5.565  
##                  NA's   :2                       NA's   :2      
##   drugsfitted    externalother    externalotherfitted over21
##  Min.   :3.449   Min.   : 7.973   Min.   : 8.388      0:25  
##  1st Qu.:3.769   1st Qu.: 9.149   1st Qu.: 9.347      1:25  
##  Median :4.323   Median : 9.561   Median : 9.690            
##  Mean   :4.255   Mean   : 9.599   Mean   : 9.610            
##  3rd Qu.:4.679   3rd Qu.:10.122   3rd Qu.: 9.939            
##  Max.   :5.130   Max.   :11.483   Max.   :10.353            
##                  NA's   :2

c. Reproduce Figure 3 of the paper

No need to bother trying to get their exact formatting here (unless you love doing that). Just make sure you show the data and the discontinuities.

ggplot(AEJfigs, aes(x=agecell)) + 
  geom_point(aes(y=all, shape="ALL"))+
  geom_point(mapping=aes(y=internal, shape="Internal")) +
  geom_point(aes(y=external, shape="External"))+
  geom_line(aes(y=internalfitted, linetype="Internal Fitted"))+
  geom_line(aes(y=allfitted, linetype="All Fitted"))+
  ylim(0, 120)+
  xlim(19, 23)+
  geom_line(aes(y=externalfitted, linetype="External Fitted"))+
  theme_classic()+

  xlab("Age")+
  ylab("Deaths per 100,000 person-years")+ggtitle("Figure 3. Age Profile For Death Rates")+
  theme(panel.background = element_blank(),
        legend.title = element_blank())

d.Simplest regression-discontinuity design

Run a simple RD regression (same slope, with a jump) for “all” deaths by age (I don’t mean all the variables, just the one variable that is called “all”). Note: You are NOT trying to reproduce the paper’s tables. That is because you only have access to a collapsed dataset with <50 observations. Don’t expect to closely reproduce the results of the tables in the paper, which uses a full 1500 observations dataset. Also, you don’t have the same variables either. Analyse the results. How do you use these results to estimate the relationship between drinking and mortality?

The coefficient is \(7.663^{***}\). Which means mortality differs significantly at the point 21 years of age. This means that once the person turns 21, there in an increase in mortality. Turning 21 means that there is access to drinking, which implies that mortality increases when a person turns 21. This design compares people a little just over 21 that are assumed to be similar to people who are a little just below 21. The one that are just a little over 21 have access to alcohol, while the people that are just below 21 have no access to alcohol. These two groups of people are similar to each other.

Here I run the simplest regression
simpleRDD<-lm(all~agecell+over21, data=AEJfigs)
stargazer(simpleRDD, type ="text", title= "A Simple RDD", align= TRUE, covariate.labels = c("Agecell", "Treated"))
## 
## A Simple RDD
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                                 all            
## -----------------------------------------------
## Agecell                       -0.975           
##                               (0.632)          
##                                                
## Treated                      7.663***          
##                               (1.440)          
##                                                
## Constant                    112.310***         
##                              (12.668)          
##                                                
## -----------------------------------------------
## Observations                    48             
## R2                             0.595           
## Adjusted R2                    0.577           
## Residual Std. Error       2.493 (df = 45)      
## F Statistic           32.995*** (df = 2; 45)   
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

e. Plot the results

Plot the all variable by age, and add the regression lines defined by your regression output (It’s ok if the lines extend throughout the whole figure. Don’t worry too much about formatting.)

# Get the fitted value.
AEJfigs %>% 
ggplot(AEJfigs, mapping=aes(x=agecell)) + 
  geom_point(mapping=aes(y=all)) + 
  geom_line(data=fortify(simpleRDD), aes(x=agecell,y=.fitted)) +
  xlab("Age") +
  ylim(0, 120)+
  xlim(19, 23)+
  ylab("Deaths per 100,000 person-years")+ggtitle("Simplest RDD") +
  theme_classic()+
  theme(panel.background = element_blank(),
        legend.title = element_blank())

f. Allow more flexibility to your RD

Run another RD where not only the intercept, but also the slope can differ between the under- and over-21 subsamples. Analyse the results. What is your estimate of interest. Does it differ from the previous one?

mfrdd<-lm(all~agecell+over21+over21*agecell, data=AEJfigs)
stargazer(mfrdd, type ="text", title= "A More Flexible RDD", align= TRUE, covariate.labels = c("Agecell", "Treated", "Treated*agecell"))
## 
## A More Flexible RDD
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                                 all            
## -----------------------------------------------
## Agecell                        0.827           
##                               (0.819)          
##                                                
## Treated                      83.333***         
##                              (24.357)          
##                                                
## Treated*agecell              -3.603***         
##                               (1.158)          
##                                                
## Constant                     76.251***         
##                              (16.396)          
##                                                
## -----------------------------------------------
## Observations                    48             
## R2                             0.668           
## Adjusted R2                    0.645           
## Residual Std. Error       2.283 (df = 44)      
## F Statistic           29.466*** (df = 3; 44)   
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

We added the interaction term here. This allows for a non linear relationship between mortality and age. We allow slopes can differ between the under and over 21. Our estimates of interest is the interaction term, which allows for the differential effect of MLDA on people who are at different ages.The negative sign does suggest that mortality due to access to drinking decreases as age increases.

g. Again, plot the data

Again, plot the data and the regression lines defined by the regression.

AEJfigs %>% 
ggplot(AEJfigs, mapping=aes(x=agecell)) + 
  geom_point(mapping=aes(y=all)) + 
  geom_line(data=fortify(mfrdd), aes(x=agecell,y=.fitted)) +
  ylim(0, 120)+
  xlim(19, 23)+
  xlab("Age") +
  ylab("Deaths per 100,000 person-years")+ggtitle("More Flexible RDD") +
  theme_classic()+
  theme(panel.background = element_blank(),
        legend.title = element_blank())

f. Compare to pre-packaged RDD:

Use the rdd package. Run the RDestimate command on all~agecell, specifying your data and the cutoff option at 21. Print the output of your regression. Plot the output of your regression (You can just write plot(name). RDestimate outputs plotable objects). Discuss the results. Do they differ from what you found? Why?

The results are different. This could be because, the SRDD takes all the data points. In fact, all data points are not comparable. The canned RDD package usess the a bandwith of 1.6561. This bandwith uses only the data points that are most comparable.The number of observation is 40 in this case, whereas it was 48 in the last case. The LATE estimate is 9.001 in this case, which is higher than the estimates 7.663 in the simple RDD case.

library(rdd)
ppRDD <- RDestimate(all~agecell, data=AEJfigs, cutpoint = 21)
summary(ppRDD)
## 
## Call:
## RDestimate(formula = all ~ agecell, data = AEJfigs, cutpoint = 21)
## 
## Type:
## sharp 
## 
## Estimates:
##            Bandwidth  Observations  Estimate  Std. Error  z value  Pr(>|z|) 
## LATE       1.6561     40            9.001     1.480       6.080    1.199e-09
## Half-BW    0.8281     20            9.579     1.914       5.004    5.609e-07
## Double-BW  3.3123     48            7.953     1.278       6.223    4.882e-10
##               
## LATE       ***
## Half-BW    ***
## Double-BW  ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## F-statistics:
##            F      Num. DoF  Denom. DoF  p        
## LATE       33.08  3         36          3.799e-10
## Half-BW    29.05  3         16          2.078e-06
## Double-BW  32.54  3         44          6.129e-11
plot(ppRDD)

g. Prepackaged linear RDD:

Re-run an RDestimate command, this time adding the options kernel = “rectangular” and bw=2. Output the results. Plot them. Compare to previous output and discuss.

ppRDD <- RDestimate(all~agecell, data=AEJfigs, cutpoint = 21, kernel = "rectangular", bw=2 )
summary(ppRDD)
## 
## Call:
## RDestimate(formula = all ~ agecell, data = AEJfigs, cutpoint = 21, 
##     bw = 2, kernel = "rectangular")
## 
## Type:
## sharp 
## 
## Estimates:
##            Bandwidth  Observations  Estimate  Std. Error  z value  Pr(>|z|) 
## LATE       2          48            7.663     1.273       6.017    1.776e-09
## Half-BW    1          24            9.753     1.902       5.128    2.929e-07
## Double-BW  4          48            7.663     1.273       6.017    1.776e-09
##               
## LATE       ***
## Half-BW    ***
## Double-BW  ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## F-statistics:
##            F      Num. DoF  Denom. DoF  p        
## LATE       29.47  3         44          2.651e-10
## Half-BW    16.82  3         20          2.167e-05
## Double-BW  29.47  3         44          2.651e-10
plot(ppRDD)

The options here specify bandwidth equal to 2. The estimates is equal to the simple RDD estimates with a linear specification. This estimate also includes all the observations while the one in part f does not.The kernal option is ‘Rectangular’, whereas the default is ‘Triangular’. This is basically a lineear RDD.