Overview

In this exercise, you will explore regression dicontinuity designs, based the following paper:

Carpenter and Dobkin(2009) The Effect of Alcohol Consumption on Mortality: Regression Discontinuity Evidence from the Minimum Drinking Age AEJ: Applied Economics 1(1)164-82

Tasks

a. Download and read the paper:

For instance, from the AEA website.

b. Get the Carpenter & Dobkin and data:

Can be downloaded straight here http://masteringmetrics.com/wp-content/uploads/2015/01/AEJfigs.dta, from A&P Mastering Metrics resources webpage (If that link is dead just ask me for the file).

Look over the data, understand what the different variables are. The key variables you need are agecell, all, internal, external, as well as the versions of those variables that have the suffix fitted.

Make an over21 dummy to help with the following analysis.

library(haven)
library(stargazer)
library(foreign)
library(kableExtra)
library(vtable)
library(dplyr)
library(ggplot2)
library(tidyr)
library(rdd)

alcon <- read_dta("AEJfigs.dta")
head(alcon)
## # A tibble: 6 × 19
##   agecell   all allfit…¹ inter…² inter…³ exter…⁴ exter…⁵ alcohol alcoh…⁶ homic…⁷
##     <dbl> <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1    19.1  92.8     91.7    16.6    16.7    76.2    75.0   0.639   0.794    16.3
## 2    19.2  95.1     91.9    18.3    16.9    76.8    75.0   0.677   0.838    16.9
## 3    19.2  92.1     92.0    18.9    17.1    73.2    75.0   0.866   0.878    15.2
## 4    19.3  88.4     92.2    16.1    17.3    72.3    74.9   0.867   0.915    16.7
## 5    19.4  88.7     92.3    17.4    17.4    71.3    74.9   1.02    0.949    14.9
## 6    19.5  90.2     92.5    17.9    17.6    72.3    74.9   1.17    0.981    15.6
## # … with 9 more variables: homicidefitted <dbl>, suicide <dbl>,
## #   suicidefitted <dbl>, mva <dbl>, mvafitted <dbl>, drugs <dbl>,
## #   drugsfitted <dbl>, externalother <dbl>, externalotherfitted <dbl>, and
## #   abbreviated variable names ¹​allfitted, ²​internal, ³​internalfitted,
## #   ⁴​external, ⁵​externalfitted, ⁶​alcoholfitted, ⁷​homicide
alcon$over21  <- ifelse(alcon$agecell >= 21, 1, 0)


c. Reproduce Figure 3 of the paper

No need to bother trying to get their exact formatting here (unless you love doing that). Just make sure you show the data and the discontinuities.

alcon_slim <- alcon %>% 
              select(agecell, all, allfitted, internal, internalfitted, external, externalfitted) %>% 
              gather(key = "Cause" , value = "value", -agecell, na.rm = TRUE, convert = FALSE, factor_key = FALSE)

Fig3 <- ggplot(alcon_slim, aes(x = as.numeric(agecell), y = value)) + 
        geom_point(aes(color = Cause)) + 
        scale_x_continuous (limits=c(19,23), breaks= seq(19, 23, 0.5)) + 
        scale_y_continuous (limits=c(0, 120), breaks=seq(0, 120, 20)) +
        scale_color_manual (values=c("#666666","#333333","#663300", "#330033", "#0066CC", "#0000CC"), 
                            name="Cause", labels=c("All", "All fitted", "Internal", "Internal fitted", "External", "External fitted")) +   
        labs(title = "Figure 3. Age Profile for Death Rates", x = "Age", y = "Deaths per 100,000 person-years", 
             caption = "Notes: Deaths from the National Vital Statistics Records. Includes all deaths that occurred in the United States between 1997–2003. \n The population denominators are derived from the census. See online Appendix C for a list of causes of death.") + 
        theme(panel.background = element_blank(),
        legend.background = element_rect(size=0.5, linetype="solid", colour ="black"), 
        legend.direction = "horizontal", legend.position = c(0.7,0.4), legend.title = element_blank(),
        panel.grid.minor = element_blank(), panel.grid.major = element_blank(), panel.border= element_blank(), 
        axis.line.x = element_line(color="black", size = 0.5),
        axis.line.y = element_line(color="black", size = 0.5),
        plot.title = element_text(hjust = 0.5, face ="bold"), 
        plot.caption = element_text(hjust = 0, size = 7.5),
        axis.text.x = element_text(colour = "black", size = 10),
        axis.text.y = element_text(colour = "black", size = 10),
        axis.title = element_text(size = 12))

Fig3


d. Simplest regression-discontinuity design

Run a simple RD regression (same slope, with a jump) for “all” deaths by age (I don’t mean all the variables, just the one variable that is called “all”).

RDD_d <- lm(all ~ agecell + over21, data = alcon)
stargazer(RDD_d, type = "text", title = "Simple RD Regression", 
          align = TRUE,  
          dep.var.labels = c("All Types of Deaths"), covariate.labels = c("agecell", "over21"),
          keep.stat = c("n", "rsq"), omit = c("adj.rsq"), 
          table.layout = "=d=t-s=n") 
## 
## Simple RD Regression
## ========================================
##                  All Types of Deaths    
## ========================================
## agecell                -0.975           
##                        (0.632)          
##                                         
## over21                7.663***          
##                        (1.440)          
##                                         
## Constant             112.310***         
##                       (12.668)          
##                                         
## ----------------------------------------
## Observations             48             
## R2                      0.595           
## ========================================
## Note:        *p<0.1; **p<0.05; ***p<0.01

How do you use these results to estimate the relationship between drinking and mortality?

The coefficient estimate of over21 is 7.663, suggesting that as the age passes 21, alcohol consumption causes additional 7.67 death counts per 100,000 individuals. The result is significant at 99% level of confidence.

e. Plot the results Plot the all variable by age, and add the regression lines defined by your regression output.

# Get fitted values using predict() function 
alcon$RDD_fit <- predict(RDD_d, alcon)

# Plot the all variable by age
plot(alcon$agecell, alcon$all, main = "Plot e. Simple RD Regression", 
     xaxs = "i", yaxs="i",
     xlim = c(19,23), ylim = c(80, 110), 
     xlab = "Age", ylab = "Deaths per 100,000 person-years",
     pch = 17, lwd = 2, lty = 1, 
     frame.plot = FALSE)

# Add regression lines
with(subset(alcon, agecell < 21), lines(agecell, RDD_fit, col = "red"))
with(subset(alcon, agecell  >= 21), lines(agecell, RDD_fit, col = "red" ))


f. Allow more flexibility to your RD Run another RD where not only the intercept, but also the slope can differ between the under- and over-21 subsamples.

Analyse the results. What is your estimate of interest. Does it differ from the previous one?

RDD_f <- lm(all ~ agecell + over21 + agecell*over21, data = alcon)
stargazer(RDD_f, type = "text", title = "RD Regression with Flexibility", 
          align = TRUE,  
          dep.var.labels = c("All Types of Deaths"), 
          covariate.labels = c("agecell", "over21","agecell-over21 Interact"),
          keep.stat = c("n", "rsq"), omit = c("adj.rsq"), 
          table.layout = "=d=t-s=n") 
## 
## RD Regression with Flexibility
## ===================================================
##                             All Types of Deaths    
## ===================================================
## agecell                            0.827           
##                                   (0.819)          
##                                                    
## over21                           83.333***         
##                                  (24.357)          
##                                                    
## agecell-over21 Interact          -3.603***         
##                                   (1.158)          
##                                                    
## Constant                         76.251***         
##                                  (16.396)          
##                                                    
## ---------------------------------------------------
## Observations                        48             
## R2                                 0.668           
## ===================================================
## Note:                   *p<0.1; **p<0.05; ***p<0.01


The estimate of interest would be over21. An over21-agecell interaction term is included n this version of RD regression. Therefore the magnitude of the effect of individual passing the age of 21 on mortality should be calculated as: 83.333-3.603*(age=21)=7.67, which is almost the same result compared to simple RD regresison. The estimate is significant at 99% level of confidence.

g. Again, plot the data

Plot the data and the regression lines defined by the regression.

# Get fitted values using predict() function 
alcon$RDD_flex <- predict(RDD_f, alcon)

# Plot the all variable by age
plot(alcon$agecell, alcon$all, main = "Plot g. RD Regression with Flexibility", 
     xaxs = "i", yaxs="i",
     xlim = c(19,23), ylim = c(80, 110), 
     xlab = "Age", ylab = "Deaths per 100,000 person-years",
     pch = 17, lwd = 2, lty = 1, 
     frame.plot = FALSE)

# Add regression lines
with(subset(alcon, agecell < 21), lines(agecell, RDD_flex, col = "orange"))
with(subset(alcon, agecell  >= 21), lines(agecell, RDD_flex, col = "red" ))


h. Compare to pre-packaged RDD:

Use the rdd package.

Run the RDestimate command on all~agecell, specifying your data and the cutoff option at 21.

Print the output of your regression.

RDD_pack <- RDestimate(all~agecell, data = alcon, cutpoint = 21)
print(RDD_pack)
## 
## Call:
## RDestimate(formula = all ~ agecell, data = alcon, cutpoint = 21)
## 
## Coefficients:
##      LATE    Half-BW  Double-BW  
##     9.001      9.579      7.953
summary(RDD_pack)
## 
## Call:
## RDestimate(formula = all ~ agecell, data = alcon, cutpoint = 21)
## 
## Type:
## sharp 
## 
## Estimates:
##            Bandwidth  Observations  Estimate  Std. Error  z value  Pr(>|z|) 
## LATE       1.6561     40            9.001     1.480       6.080    1.199e-09
## Half-BW    0.8281     20            9.579     1.914       5.004    5.609e-07
## Double-BW  3.3123     48            7.953     1.278       6.223    4.882e-10
##               
## LATE       ***
## Half-BW    ***
## Double-BW  ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## F-statistics:
##            F      Num. DoF  Denom. DoF  p        
## LATE       33.08  3         36          3.799e-10
## Half-BW    29.05  3         16          2.078e-06
## Double-BW  32.54  3         44          6.129e-11


Plot the output of your regression (You can just write plot(name). RDestimate outputs plotable objects). Discuss the results. Do they differ from what you found? Why?

plot(RDD_pack)

The result indicates an average treatment effect for individuals turning the age of 21 being 9, which is a bit higher than the RD Regression we conducted in part d and part f. The result is statistically significant at 99% level of confidence.

i. Prepackaged linear RDD:

Re-run an RDestimate command, this time adding the options kernel = “rectangular” and bw=2.

Output the results. Plot them. Compare to previous output and discuss.

RDD_rect <- RDestimate(all ~ agecell, data = alcon, cutpoint=21, 
                       kernel = "rectangular", bw=2)

print(RDD_rect)
## 
## Call:
## RDestimate(formula = all ~ agecell, data = alcon, cutpoint = 21, 
##     bw = 2, kernel = "rectangular")
## 
## Coefficients:
##      LATE    Half-BW  Double-BW  
##     7.663      9.753      7.663
summary(RDD_rect)
## 
## Call:
## RDestimate(formula = all ~ agecell, data = alcon, cutpoint = 21, 
##     bw = 2, kernel = "rectangular")
## 
## Type:
## sharp 
## 
## Estimates:
##            Bandwidth  Observations  Estimate  Std. Error  z value  Pr(>|z|) 
## LATE       2          48            7.663     1.273       6.017    1.776e-09
## Half-BW    1          24            9.753     1.902       5.128    2.929e-07
## Double-BW  4          48            7.663     1.273       6.017    1.776e-09
##               
## LATE       ***
## Half-BW    ***
## Double-BW  ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## F-statistics:
##            F      Num. DoF  Denom. DoF  p        
## LATE       29.47  3         44          2.651e-10
## Half-BW    16.82  3         20          2.167e-05
## Double-BW  29.47  3         44          2.651e-10
plot(RDD_rect)

The default setting for the kernel argument in the RDD package is Gaussian. After the argument was updated to “rectangular” implying a uniformly distributed density function, the ATE (7.663) decreases and becomes very close to the same coefficient estimates we retrieved from part d and part e, the simplest regression-discounity design.