AAEC 8610 Homework 10

Task a: Downloading and reading the paper
Task b: Getting the Carpenter & Dobkin and data
- Restricting the dataset to the variables needed
- Generating an over21 dummy to help with the following analysis
Task c: Reproducing Figure 3 of the paper
Task d: Estimating the simplest regression-discontinuity design
- Analyzing the results in the perspective of estimating the relationship between drinking and mortality
Task e: Plotting the simplest RDD results
Task f: Allowing more flexibility to your RD
Task g: Plotting the non-linear RDD results
- Higher-order polynomial Sharp RDD
- Lower-order polynomial Sharp RDD
Task h: Comparing to pre-packaged (nonlinear) RDD
- Printing the output of the (nonlinear) RDD using RDestimate
- Ploting the output of the (nonlinear) RDD using RDestimate
Task i: Using the prepackaged linear RDD
- Printing the output of the (linear) RDD using RDestimate with the options kernel = “rectangular” and bw=2
- Ploting the output of the (linear) RDD using RDestimate with the options kernel = “rectangular” and bw=2

Task a: Downloading and reading the paper

Paper: Carpenter and Dobkin(2009) The Effect of Alcohol Consumption on Mortality: Regression Discontinuity Evidence from the Minimum Drinking Age AEJ: Applied Economics 1(1)164-82

Task b: Getting the Carpenter & Dobkin and data

I download the data from A&P Mastering Metrics.

# Loading packages
knitr::opts_chunk$set(echo = TRUE, eval=TRUE, message=FALSE, warning=FALSE, fig.height=4)
necessaryPackages <- c("foreign","reshape","rvest","tidyverse","dplyr","stringr","ggplot2", "stargazer","readr","AER","rdd")
new.packages <- necessaryPackages[
              !(necessaryPackages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
lapply(necessaryPackages, require, character.only = TRUE)

## [[1]]
## [1] TRUE
## 
## [[2]]
## [1] TRUE
## 
## [[3]]
## [1] TRUE
## 
## [[4]]
## [1] TRUE
## 
## [[5]]
## [1] TRUE
## 
## [[6]]
## [1] TRUE
## 
## [[7]]
## [1] TRUE
## 
## [[8]]
## [1] TRUE
## 
## [[9]]
## [1] TRUE
## 
## [[10]]
## [1] TRUE
## 
## [[11]]
## [1] TRUE

# Importing the dataset
hw10data <- read.dta("AEJfigs.dta")

# Describing the dataset
  # str(hw10data)
  # summary(hw10data)

Restricting the dataset to the variables needed

The key variables I would need in the homework are agecell, all, internal, external, as well as the versions of those variables that have the suffix fitted.

hw10data  <- hw10data[c(1:7)]

Generating an over21 dummy to help with the following analysis

hw10data$over21 = ifelse(hw10data$agecell >= 21,1,0)

Task c: Reproducing Figure 3 of the paper

# scatterplot for all deaths
plot(hw10data$agecell, hw10data$all, cex = 0.5, pch=16, col = "steelblue", 
     xlim=c(19,23), ylim=c(0,120), xgap.axis=NA, ygap.axis=NA,
     xlab = "Age", ylab = "Deaths per 100,000 person-years", 
     title(sub = "FIGURE 3. AGE PROFILE FOR DEATH RATES"), frame.plot = FALSE)

# scatterplot for deaths due to internal causes
points(hw10data$agecell, hw10data$internal, cex = 0.35, pch=0, col = "green")

# scatterplot for deaths due to external causes
points(hw10data$agecell, hw10data$external, cex = 0.5, pch=17, col = "red")

# fitted OLS line for all deaths by over21 groups
lines(hw10data$agecell[hw10data$over21==0], hw10data$allfitted[hw10data$over21==0], lty=3, lwd=1)
lines(hw10data$agecell[hw10data$over21==1], hw10data$allfitted[hw10data$over21==1], lty=3, lwd=1)

# fitted OLS line for all deaths due to internal causes by over21 groups
lines(hw10data$agecell[hw10data$over21==0], hw10data$internalfitted[hw10data$over21==0], lty=2, lwd=1)
lines(hw10data$agecell[hw10data$over21==1], hw10data$internalfitted[hw10data$over21==1], lty=2, lwd=1)

# fitted OLS line for all deaths due to external causes by over21 groups
lines(hw10data$agecell[hw10data$over21==0], hw10data$externalfitted[hw10data$over21==0], lty=1, lwd=1)
lines(hw10data$agecell[hw10data$over21==1], hw10data$externalfitted[hw10data$over21==1], lty=1, lwd=1)

# legend
legend(19, 69, text.width=0.5, inset=0.005, ncol=2,text.font=0, col=c("steelblue","green","red","black","black","black"), pch=c(16,0,17,NA,NA,NA), lty=c(0,0,0,3,2,1), c("All","Internal","External","All fitted", "Int. fitted", "Ext. fitted"))

Task d: Estimating the simplest regression-discontinuity design

The simple RD regression (same slope, with a jump) for “all” deaths by age is specified below by a sharp linear RD model (Equations 1), with “all” as the outcome variable, “over21” as the treatment variable, and “agecell” as the running variable. The coefficient of interest is \(\alpha_{1}\).

\[\begin{equation} \tag{1} all = \alpha_{0} + \alpha_{1}over21 + \alpha_{2}agecell + \varepsilon \end{equation}\]

The output below gives the estimation results of Equation 1.

# Dataset of "complete" cases for all deaths 
hw10data <- hw10data[complete.cases(hw10data$all),]

# Regressions
reg.1 <- lm(all ~ over21 + agecell, data=hw10data)

# Formatting the regression output
stargazer(reg.1, type="text", no.space=TRUE, keep.stat = c("n","rsq","adj.rsq"), column.labels=c("(Overall mortality rate -- All deaths per 100,000 person-years)"))

## 
## ============================================================================
##                                    Dependent variable:                      
##              ---------------------------------------------------------------
##                                            all                              
##              (Overall mortality rate -- All deaths per 100,000 person-years)
## ----------------------------------------------------------------------------
## over21                                  7.663***                            
##                                          (1.440)                            
## agecell                                  -0.975                             
##                                          (0.632)                            
## Constant                               112.310***                           
##                                         (12.668)                            
## ----------------------------------------------------------------------------
## Observations                               48                               
## R2                                        0.595                             
## Adjusted R2                               0.577                             
## ============================================================================
## Note:                                            *p<0.1; **p<0.05; ***p<0.01

Analyzing the results in the perspective of estimating the relationship between drinking and mortality

Under the usual assumptions for RDD estimations, \(\alpha_{1}\) can be expressed in a small \(\delta\)-neighborhood around the cutoff age of 21 as follows: \[\alpha_{1}=\mathop{\mathbb{E}}[all|21 \le agecell \le 21+\delta, over21=1]-\mathop{\mathbb{E}}[all|21-\delta \le agecell \le 21, over21=0].\] Since observations just below the cutoff age of 21 are likely to be very similar to those just above the cutoff age of 21 (with the exception that those just above the cutoff are treated while those just below the cutoff are controls), \(\alpha_{1}\) represents a sharp mean difference in the outcome variable at age 21. In other words, it is the local average treatment effect (LATE) in the neighborhood of the cutoff. Given that \(\hat{\alpha_{1}}=7.663\) from the regression output, setting up a minimum legal drinking age (MLDA) at age 21 leads to an increase in overall mortality at age 21 of about 7.663 deaths per 100,000 person years. This impact estimate is significant at the 1% level of confidence.

Task e: Plotting the simplest RDD results

# scatterplot for all deaths
plot(hw10data$agecell, hw10data$all, cex = 0.5, pch=16, col = "steelblue", 
     xlim=c(19,23), ylim=c(80,120), xlab = "Age", ylab = "Deaths per 100,000 person-years", 
     title(sub = "PLOT OF LINEAR SHARP RDD RESULTS"), frame.plot = FALSE)

## Predicted overall mortality
hw10data$linRDD_all <- fitted(reg.1)

# fitted linear sharp RDD for all deaths
lines(hw10data$agecell[hw10data$over21==0], hw10data$linRDD_all[hw10data$over21==0], lty=3, lwd=1)
lines(hw10data$agecell[hw10data$over21==1], hw10data$linRDD_all[hw10data$over21==1], lty=3, lwd=1)

# Discontinuity line
abline(v=21)

# Legend
legend("topleft", text.width=1.25, inset=0.015, ncol=1,text.font=0, col=c("steelblue","black"), pch=c(16,NA), lty=c(0,3), c("All","fitted-linear All"))

Task f: Allowing more flexibility to your RD

I want to run other RDs allowing for different intercepts and different slopes between the under- and over-21 subsamples. Several options are possible to include non-linearity in the running variable for allowing more flexibility than the linear RD. Abstracting from the choice of the correct order of polynomial, I present a higher-order polynomial specification (Equation 2) and a lower-order polynomial specification (Equation 3). In both non-linear RD models, “agecell” is centered to allow an easy interpretation of the main effect of “over21” as the treatment effect.

\[\begin{equation} \tag{2} all = \beta_{0} + \beta_{1}over21 + \beta_{2}(agecell-21) + \beta_{3}(agecell-21)^{2} + \beta_{4}over21*(agecell-21) + \beta_{4}over21*(agecell-21)^{2} + u \end{equation}\]

\[\begin{equation} \tag{3} all = \gamma_{0} + \gamma_{1}over21 + \gamma_{2}(agecell-21) + \gamma_{3}|agecell-21|^{1/2} + \beta_{4}over21*(agecell-21) + \beta_{5}over21*|agecell-21|^{1/2} + v \end{equation}\]

The output below gives the estimation results of Equations 2 and 3.

# Centered variable and interaction with treatment
hw10data$age                    <- hw10data$agecell - 21
hw10data$age_Interaction_over21 <- hw10data$over21*hw10data$age

# Squared centered variable and interaction with treatment
hw10data$agesq                    <- hw10data$age*hw10data$age
hw10data$agesq_Interaction_over21 <- hw10data$over21*hw10data$agesq

# Square root centered variable and interaction with treatment
hw10data$agesqrt                    <- (abs(hw10data$age))^{1/2} 
hw10data$agesqrt_Interaction_over21 <- hw10data$over21*hw10data$agesqrt

# Regressions 
reg.2 <- lm(all ~ over21 + age + agesq + age_Interaction_over21 + agesq_Interaction_over21 , data=hw10data)
reg.3 <- lm(all ~ over21 + age + agesqrt + age_Interaction_over21 + agesqrt_Interaction_over21 , data=hw10data)

# Formatting the regression outputs
stargazer(reg.2, reg.3, type="text", no.space=TRUE, keep.stat = c("n","rsq","adj.rsq"), column.labels=c("(Higher-order polynomial)","(Lower-order polynomial)"))

## 
## =============================================================================
##                                           Dependent variable:                
##                            --------------------------------------------------
##                                                   all                        
##                            (Higher-order polynomial) (Lower-order polynomial)
##                                       (1)                      (2)           
## -----------------------------------------------------------------------------
## over21                             9.548***                 13.127***        
##                                     (1.985)                  (4.098)         
## age                                 -0.831                    2.238          
##                                     (3.290)                  (4.213)         
## agesq                               -0.840                                   
##                                     (1.615)                                  
## agesqrt                                                       2.486          
##                                                              (7.283)         
## age_Interaction_over21              -6.017                    1.798          
##                                     (4.653)                  (5.958)         
## agesq_Interaction_over21             2.904                                   
##                                     (2.284)                                  
## agesqrt_Interaction_over21                                   -14.489         
##                                                              (10.300)        
## Constant                           93.073***                92.681***        
##                                     (1.404)                  (2.898)         
## -----------------------------------------------------------------------------
## Observations                          48                        48           
## R2                                   0.682                    0.689          
## Adjusted R2                          0.644                    0.652          
## =============================================================================
## Note:                                             *p<0.1; **p<0.05; ***p<0.01

Using non-linear specifications, the impact estimate is still significant at the 1% level of confidence but with a slighlty higher magnitude than that from the linear specification.

Given that \(\hat{\beta_{1}}=9.548\) and \(\hat{\gamma_{1}}=13.127\), the estimated impact of an MLDA at age 21 is an increase in overall mortality at age 21 of about 9.548 deaths per 100,000 person years (using Equation 2) and 13.127 deaths per 100,000 person years (using Equation 3).

Task g: Plotting the non-linear RDD results

Higher-order polynomial Sharp RDD

# scatterplot for all deaths
plot(hw10data$age, hw10data$all, cex = 0.5, pch=16, col = "steelblue", 
     xlim=c(-2,2), ylim=c(80,120), xlab = "Centered Age", 
     ylab = "Deaths per 100,000 person-years", 
     title(sub = "PLOT OF NON-LINEAR (higher-order polynomial) SHARP RDD RESULTS"), frame.plot = FALSE)

## Predicted overall mortality
hw10data$nonlinHighRDD_all <- fitted(reg.2)

# fitted sharp RDD line for all deaths
lines(hw10data$age[hw10data$over21==0], hw10data$nonlinHighRDD_all[hw10data$over21==0], lty=3, lwd=1)
lines(hw10data$age[hw10data$over21==1], hw10data$nonlinHighRDD_all[hw10data$over21==1], lty=3, lwd=1)

# Discontinuity line
abline(v=0)

# Legend
legend("topright", text.width=1.5, inset=0.015, ncol=1,text.font=0, col=c("steelblue","black"), pch=c(16,NA), lty=c(0,3), c("All","fitted higher-order pol. All"))

Lower-order polynomial Sharp RDD

# scatterplot for all deaths
plot(hw10data$age, hw10data$all, cex = 0.5, pch=16, col = "steelblue", 
     xlim=c(-2,2), ylim=c(80,120), xlab = "Centered Age", 
     ylab = "Deaths per 100,000 person-years", 
     title(sub = "PLOT OF NON-LINEAR (lower-order polynomial) SHARP RDD RESULTS"), frame.plot = FALSE)

## Predicted overall mortality
hw10data$nonlinLowRDD_all <- fitted(reg.3)

# fitted sharp RDD line for all deaths
lines(hw10data$age[hw10data$over21==0], hw10data$nonlinLowRDD_all[hw10data$over21==0], lty=3, lwd=1)
lines(hw10data$age[hw10data$over21==1], hw10data$nonlinLowRDD_all[hw10data$over21==1], lty=3, lwd=1)

# Discontinuity line
abline(v=0)

# Legend
legend("topleft", text.width=1.5, inset=0.015, ncol=1,text.font=0, col=c("steelblue","black"), pch=c(16,NA), lty=c(0,3), c("All","fitted lower-order pol. All"))

The quality of these two nonlinear RD estimations and plots would depend on the effect of age being the same above and below the cutoff. The illustration of the impact estimate at age 21 on these two plots indicates how nonlinear RDD estimations can be prone to some boundary bias. The difference in impact estimates between the two polynomial specifications is more than one-third the magnitude of the impact estimate using the higher-order polynomial specification.

Task h: Comparing to pre-packaged (nonlinear) RDD

From now on, I use the rdd package. I run the RDestimate command on all~agecell, specifying the data and the cutoff age of 21.

Printing the output of the (nonlinear) RDD using RDestimate

reg.4=RDestimate(all~agecell,data=hw10data,cutpoint = 21)
summary(reg.4)

## 
## Call:
## RDestimate(formula = all ~ agecell, data = hw10data, cutpoint = 21)
## 
## Type:
## sharp 
## 
## Estimates:
##            Bandwidth  Observations  Estimate  Std. Error  z value  Pr(>|z|) 
## LATE       1.6561     40            9.001     1.480       6.080    1.199e-09
## Half-BW    0.8281     20            9.579     1.914       5.004    5.609e-07
## Double-BW  3.3123     48            7.953     1.278       6.223    4.882e-10
##               
## LATE       ***
## Half-BW    ***
## Double-BW  ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## F-statistics:
##            F      Num. DoF  Denom. DoF  p        
## LATE       33.08  3         36          3.799e-10
## Half-BW    29.05  3         16          2.078e-06
## Double-BW  32.54  3         44          6.129e-11

The estimation output shows that the LATE of an MLDA at age 21 is an increase in the overall mortality at age 21 by 9 deaths per 100,000 person years, which is significant at the 1% level of significance.

The LATE is computed within a bandwidth of 1.6561 that contains 40 points “close” to the cutoff. Halving the bandwidth restricts the number of neighboring points to the cutoff that are compared, and yields an impact estimate of 9.579 deaths per 100,000 person years. Doubling the bandwidth enlarges the number of neighboring points to the cutoff that are compared, and returns an impact estimate of 7.953 deaths per 100,000 person years.

The linear RDD impact estimate (Task d.) aligns with the results using the double bandwidth, while the higher-order polynomial RDD impact estimate (Task f.) aligns with the results using half of the bandwidth.

Ploting the output of the (nonlinear) RDD using RDestimate

plot(reg.4)
title(sub="PLOT OF NONLINEAR SHARP RDD RESULTS (using RDestimate) ", xlab="Age",ylab="Mortality rate from all causes (per 100,000)")

The plot shows that using the RDestimate command, without any additional details, fits a non-linear effect of age on mortality rate.

The plot of the 95% confidence interval of the fitted line below and above the cutoff indicates the bounds that can be placed on the estimated LATE. It is easier to observe that the lower-order polynomial RDD impact estimate of 13.127 deaths per 100,000 person years as well as its higher-order polynomial counterpart do lie in the 95% confidence interval of the estimated LATE.

Task i: Using the prepackaged linear RDD

Printing the output of the (linear) RDD using RDestimate with the options kernel = “rectangular” and bw=2

reg.5=RDestimate(all ~ agecell, data=hw10data, cutpoint=21, kernel="rectangular", bw=2)
summary(reg.5)

## 
## Call:
## RDestimate(formula = all ~ agecell, data = hw10data, cutpoint = 21, 
##     bw = 2, kernel = "rectangular")
## 
## Type:
## sharp 
## 
## Estimates:
##            Bandwidth  Observations  Estimate  Std. Error  z value  Pr(>|z|) 
## LATE       2          48            7.663     1.273       6.017    1.776e-09
## Half-BW    1          24            9.753     1.902       5.128    2.929e-07
## Double-BW  4          48            7.663     1.273       6.017    1.776e-09
##               
## LATE       ***
## Half-BW    ***
## Double-BW  ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## F-statistics:
##            F      Num. DoF  Denom. DoF  p        
## LATE       29.47  3         44          2.651e-10
## Half-BW    16.82  3         20          2.167e-05
## Double-BW  29.47  3         44          2.651e-10

Specifying a linear RDD with a bandwidth of 2 yields the linear-based impact estimate previously found (Task d.) with similar standard errors. Since this bandwidth already contains all points of sample, doubling the bandwidth changes nothing in the results. Halving the bandwitdth to 1 gives similar results to the higher-order polynomial RDD impact estimate (Task f.) and to the pre-packaged non-linear RDD results with a bandwidth of 0.8281 (Task h.).

Hence, in certain narrow bandwidth, linear and non-linear sharp RDDs give asymptotically similar impact estimates.

Ploting the output of the (linear) RDD using RDestimate with the options kernel = “rectangular” and bw=2

plot(reg.5)
title(sub="PLOT OF LINEAR SHARP RDD RESULTS (using RDestimate) ", xlab="Age",ylab="Mortality rate from all causes (per 100,000)")

Using local linear regressions with rectangular kernels and narrowed age bandwidth yields 95% confidence intervals of impact estimates that are similar to those obtained using local nonlinear regressions.