Paper: Carpenter and Dobkin(2009) The Effect of Alcohol Consumption on Mortality: Regression Discontinuity Evidence from the Minimum Drinking Age AEJ: Applied Economics 1(1)164-82
I download the data from A&P Mastering Metrics.
# Loading packages
knitr::opts_chunk$set(echo = TRUE, eval=TRUE, message=FALSE, warning=FALSE, fig.height=4)
necessaryPackages <- c("foreign","reshape","rvest","tidyverse","dplyr","stringr","ggplot2", "stargazer","readr","AER","rdd")
new.packages <- necessaryPackages[
!(necessaryPackages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
lapply(necessaryPackages, require, character.only = TRUE)
## [[1]]
## [1] TRUE
##
## [[2]]
## [1] TRUE
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] TRUE
##
## [[5]]
## [1] TRUE
##
## [[6]]
## [1] TRUE
##
## [[7]]
## [1] TRUE
##
## [[8]]
## [1] TRUE
##
## [[9]]
## [1] TRUE
##
## [[10]]
## [1] TRUE
##
## [[11]]
## [1] TRUE
# Importing the dataset
hw10data <- read.dta("AEJfigs.dta")
# Describing the dataset
# str(hw10data)
# summary(hw10data)
The key variables I would need in the homework are agecell, all, internal, external, as well as the versions of those variables that have the suffix fitted.
hw10data <- hw10data[c(1:7)]
hw10data$over21 = ifelse(hw10data$agecell >= 21,1,0)
# scatterplot for all deaths
plot(hw10data$agecell, hw10data$all, cex = 0.5, pch=16, col = "steelblue",
xlim=c(19,23), ylim=c(0,120), xgap.axis=NA, ygap.axis=NA,
xlab = "Age", ylab = "Deaths per 100,000 person-years",
title(sub = "FIGURE 3. AGE PROFILE FOR DEATH RATES"), frame.plot = FALSE)
# scatterplot for deaths due to internal causes
points(hw10data$agecell, hw10data$internal, cex = 0.35, pch=0, col = "green")
# scatterplot for deaths due to external causes
points(hw10data$agecell, hw10data$external, cex = 0.5, pch=17, col = "red")
# fitted OLS line for all deaths by over21 groups
lines(hw10data$agecell[hw10data$over21==0], hw10data$allfitted[hw10data$over21==0], lty=3, lwd=1)
lines(hw10data$agecell[hw10data$over21==1], hw10data$allfitted[hw10data$over21==1], lty=3, lwd=1)
# fitted OLS line for all deaths due to internal causes by over21 groups
lines(hw10data$agecell[hw10data$over21==0], hw10data$internalfitted[hw10data$over21==0], lty=2, lwd=1)
lines(hw10data$agecell[hw10data$over21==1], hw10data$internalfitted[hw10data$over21==1], lty=2, lwd=1)
# fitted OLS line for all deaths due to external causes by over21 groups
lines(hw10data$agecell[hw10data$over21==0], hw10data$externalfitted[hw10data$over21==0], lty=1, lwd=1)
lines(hw10data$agecell[hw10data$over21==1], hw10data$externalfitted[hw10data$over21==1], lty=1, lwd=1)
# legend
legend(19, 69, text.width=0.5, inset=0.005, ncol=2,text.font=0, col=c("steelblue","green","red","black","black","black"), pch=c(16,0,17,NA,NA,NA), lty=c(0,0,0,3,2,1), c("All","Internal","External","All fitted", "Int. fitted", "Ext. fitted"))
The simple RD regression (same slope, with a jump) for “all” deaths by age is specified below by a sharp linear RD model (Equations 1), with “all” as the outcome variable, “over21” as the treatment variable, and “agecell” as the running variable. The coefficient of interest is \(\alpha_{1}\).
\[\begin{equation} \tag{1} all = \alpha_{0} + \alpha_{1}over21 + \alpha_{2}agecell + \varepsilon \end{equation}\]
The output below gives the estimation results of Equation 1.
# Dataset of "complete" cases for all deaths
hw10data <- hw10data[complete.cases(hw10data$all),]
# Regressions
reg.1 <- lm(all ~ over21 + agecell, data=hw10data)
# Formatting the regression output
stargazer(reg.1, type="text", no.space=TRUE, keep.stat = c("n","rsq","adj.rsq"), column.labels=c("(Overall mortality rate -- All deaths per 100,000 person-years)"))
##
## ============================================================================
## Dependent variable:
## ---------------------------------------------------------------
## all
## (Overall mortality rate -- All deaths per 100,000 person-years)
## ----------------------------------------------------------------------------
## over21 7.663***
## (1.440)
## agecell -0.975
## (0.632)
## Constant 112.310***
## (12.668)
## ----------------------------------------------------------------------------
## Observations 48
## R2 0.595
## Adjusted R2 0.577
## ============================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Under the usual assumptions for RDD estimations, \(\alpha_{1}\) can be expressed in a small \(\delta\)-neighborhood around the cutoff age of 21 as follows: \[\alpha_{1}=\mathop{\mathbb{E}}[all|21 \le agecell \le 21+\delta, over21=1]-\mathop{\mathbb{E}}[all|21-\delta \le agecell \le 21, over21=0].\] Since observations just below the cutoff age of 21 are likely to be very similar to those just above the cutoff age of 21 (with the exception that those just above the cutoff are treated while those just below the cutoff are controls), \(\alpha_{1}\) represents a sharp mean difference in the outcome variable at age 21. In other words, it is the local average treatment effect (LATE) in the neighborhood of the cutoff. Given that \(\hat{\alpha_{1}}=7.663\) from the regression output, setting up a minimum legal drinking age (MLDA) at age 21 leads to an increase in overall mortality at age 21 of about 7.663 deaths per 100,000 person years. This impact estimate is significant at the 1% level of confidence.
# scatterplot for all deaths
plot(hw10data$agecell, hw10data$all, cex = 0.5, pch=16, col = "steelblue",
xlim=c(19,23), ylim=c(80,120), xlab = "Age", ylab = "Deaths per 100,000 person-years",
title(sub = "PLOT OF LINEAR SHARP RDD RESULTS"), frame.plot = FALSE)
## Predicted overall mortality
hw10data$linRDD_all <- fitted(reg.1)
# fitted linear sharp RDD for all deaths
lines(hw10data$agecell[hw10data$over21==0], hw10data$linRDD_all[hw10data$over21==0], lty=3, lwd=1)
lines(hw10data$agecell[hw10data$over21==1], hw10data$linRDD_all[hw10data$over21==1], lty=3, lwd=1)
# Discontinuity line
abline(v=21)
# Legend
legend("topleft", text.width=1.25, inset=0.015, ncol=1,text.font=0, col=c("steelblue","black"), pch=c(16,NA), lty=c(0,3), c("All","fitted-linear All"))
I want to run other RDs allowing for different intercepts and different slopes between the under- and over-21 subsamples. Several options are possible to include non-linearity in the running variable for allowing more flexibility than the linear RD. Abstracting from the choice of the correct order of polynomial, I present a higher-order polynomial specification (Equation 2) and a lower-order polynomial specification (Equation 3). In both non-linear RD models, “agecell” is centered to allow an easy interpretation of the main effect of “over21” as the treatment effect.
\[\begin{equation} \tag{2} all = \beta_{0} + \beta_{1}over21 + \beta_{2}(agecell-21) + \beta_{3}(agecell-21)^{2} + \beta_{4}over21*(agecell-21) + \beta_{4}over21*(agecell-21)^{2} + u \end{equation}\]
\[\begin{equation} \tag{3} all = \gamma_{0} + \gamma_{1}over21 + \gamma_{2}(agecell-21) + \gamma_{3}|agecell-21|^{1/2} + \beta_{4}over21*(agecell-21) + \beta_{5}over21*|agecell-21|^{1/2} + v \end{equation}\]
The output below gives the estimation results of Equations 2 and 3.
# Centered variable and interaction with treatment
hw10data$age <- hw10data$agecell - 21
hw10data$age_Interaction_over21 <- hw10data$over21*hw10data$age
# Squared centered variable and interaction with treatment
hw10data$agesq <- hw10data$age*hw10data$age
hw10data$agesq_Interaction_over21 <- hw10data$over21*hw10data$agesq
# Square root centered variable and interaction with treatment
hw10data$agesqrt <- (abs(hw10data$age))^{1/2}
hw10data$agesqrt_Interaction_over21 <- hw10data$over21*hw10data$agesqrt
# Regressions
reg.2 <- lm(all ~ over21 + age + agesq + age_Interaction_over21 + agesq_Interaction_over21 , data=hw10data)
reg.3 <- lm(all ~ over21 + age + agesqrt + age_Interaction_over21 + agesqrt_Interaction_over21 , data=hw10data)
# Formatting the regression outputs
stargazer(reg.2, reg.3, type="text", no.space=TRUE, keep.stat = c("n","rsq","adj.rsq"), column.labels=c("(Higher-order polynomial)","(Lower-order polynomial)"))
##
## =============================================================================
## Dependent variable:
## --------------------------------------------------
## all
## (Higher-order polynomial) (Lower-order polynomial)
## (1) (2)
## -----------------------------------------------------------------------------
## over21 9.548*** 13.127***
## (1.985) (4.098)
## age -0.831 2.238
## (3.290) (4.213)
## agesq -0.840
## (1.615)
## agesqrt 2.486
## (7.283)
## age_Interaction_over21 -6.017 1.798
## (4.653) (5.958)
## agesq_Interaction_over21 2.904
## (2.284)
## agesqrt_Interaction_over21 -14.489
## (10.300)
## Constant 93.073*** 92.681***
## (1.404) (2.898)
## -----------------------------------------------------------------------------
## Observations 48 48
## R2 0.682 0.689
## Adjusted R2 0.644 0.652
## =============================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Using non-linear specifications, the impact estimate is still significant at the 1% level of confidence but with a slighlty higher magnitude than that from the linear specification.
Given that \(\hat{\beta_{1}}=9.548\) and \(\hat{\gamma_{1}}=13.127\), the estimated impact of an MLDA at age 21 is an increase in overall mortality at age 21 of about 9.548 deaths per 100,000 person years (using Equation 2) and 13.127 deaths per 100,000 person years (using Equation 3).
# scatterplot for all deaths
plot(hw10data$age, hw10data$all, cex = 0.5, pch=16, col = "steelblue",
xlim=c(-2,2), ylim=c(80,120), xlab = "Centered Age",
ylab = "Deaths per 100,000 person-years",
title(sub = "PLOT OF NON-LINEAR (higher-order polynomial) SHARP RDD RESULTS"), frame.plot = FALSE)
## Predicted overall mortality
hw10data$nonlinHighRDD_all <- fitted(reg.2)
# fitted sharp RDD line for all deaths
lines(hw10data$age[hw10data$over21==0], hw10data$nonlinHighRDD_all[hw10data$over21==0], lty=3, lwd=1)
lines(hw10data$age[hw10data$over21==1], hw10data$nonlinHighRDD_all[hw10data$over21==1], lty=3, lwd=1)
# Discontinuity line
abline(v=0)
# Legend
legend("topright", text.width=1.5, inset=0.015, ncol=1,text.font=0, col=c("steelblue","black"), pch=c(16,NA), lty=c(0,3), c("All","fitted higher-order pol. All"))
# scatterplot for all deaths
plot(hw10data$age, hw10data$all, cex = 0.5, pch=16, col = "steelblue",
xlim=c(-2,2), ylim=c(80,120), xlab = "Centered Age",
ylab = "Deaths per 100,000 person-years",
title(sub = "PLOT OF NON-LINEAR (lower-order polynomial) SHARP RDD RESULTS"), frame.plot = FALSE)
## Predicted overall mortality
hw10data$nonlinLowRDD_all <- fitted(reg.3)
# fitted sharp RDD line for all deaths
lines(hw10data$age[hw10data$over21==0], hw10data$nonlinLowRDD_all[hw10data$over21==0], lty=3, lwd=1)
lines(hw10data$age[hw10data$over21==1], hw10data$nonlinLowRDD_all[hw10data$over21==1], lty=3, lwd=1)
# Discontinuity line
abline(v=0)
# Legend
legend("topleft", text.width=1.5, inset=0.015, ncol=1,text.font=0, col=c("steelblue","black"), pch=c(16,NA), lty=c(0,3), c("All","fitted lower-order pol. All"))
The quality of these two nonlinear RD estimations and plots would depend on the effect of age being the same above and below the cutoff. The illustration of the impact estimate at age 21 on these two plots indicates how nonlinear RDD estimations can be prone to some boundary bias. The difference in impact estimates between the two polynomial specifications is more than one-third the magnitude of the impact estimate using the higher-order polynomial specification.
From now on, I use the rdd package. I run the RDestimate command on all~agecell, specifying the data and the cutoff age of 21.
reg.4=RDestimate(all~agecell,data=hw10data,cutpoint = 21)
summary(reg.4)
##
## Call:
## RDestimate(formula = all ~ agecell, data = hw10data, cutpoint = 21)
##
## Type:
## sharp
##
## Estimates:
## Bandwidth Observations Estimate Std. Error z value Pr(>|z|)
## LATE 1.6561 40 9.001 1.480 6.080 1.199e-09
## Half-BW 0.8281 20 9.579 1.914 5.004 5.609e-07
## Double-BW 3.3123 48 7.953 1.278 6.223 4.882e-10
##
## LATE ***
## Half-BW ***
## Double-BW ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## F-statistics:
## F Num. DoF Denom. DoF p
## LATE 33.08 3 36 3.799e-10
## Half-BW 29.05 3 16 2.078e-06
## Double-BW 32.54 3 44 6.129e-11
The estimation output shows that the LATE of an MLDA at age 21 is an increase in the overall mortality at age 21 by 9 deaths per 100,000 person years, which is significant at the 1% level of significance.
The LATE is computed within a bandwidth of 1.6561 that contains 40 points “close” to the cutoff. Halving the bandwidth restricts the number of neighboring points to the cutoff that are compared, and yields an impact estimate of 9.579 deaths per 100,000 person years. Doubling the bandwidth enlarges the number of neighboring points to the cutoff that are compared, and returns an impact estimate of 7.953 deaths per 100,000 person years.
The linear RDD impact estimate (Task d.) aligns with the results using the double bandwidth, while the higher-order polynomial RDD impact estimate (Task f.) aligns with the results using half of the bandwidth.
plot(reg.4)
title(sub="PLOT OF NONLINEAR SHARP RDD RESULTS (using RDestimate) ", xlab="Age",ylab="Mortality rate from all causes (per 100,000)")
The plot shows that using the RDestimate command, without any additional details, fits a non-linear effect of age on mortality rate.
The plot of the 95% confidence interval of the fitted line below and above the cutoff indicates the bounds that can be placed on the estimated LATE. It is easier to observe that the lower-order polynomial RDD impact estimate of 13.127 deaths per 100,000 person years as well as its higher-order polynomial counterpart do lie in the 95% confidence interval of the estimated LATE.
reg.5=RDestimate(all ~ agecell, data=hw10data, cutpoint=21, kernel="rectangular", bw=2)
summary(reg.5)
##
## Call:
## RDestimate(formula = all ~ agecell, data = hw10data, cutpoint = 21,
## bw = 2, kernel = "rectangular")
##
## Type:
## sharp
##
## Estimates:
## Bandwidth Observations Estimate Std. Error z value Pr(>|z|)
## LATE 2 48 7.663 1.273 6.017 1.776e-09
## Half-BW 1 24 9.753 1.902 5.128 2.929e-07
## Double-BW 4 48 7.663 1.273 6.017 1.776e-09
##
## LATE ***
## Half-BW ***
## Double-BW ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## F-statistics:
## F Num. DoF Denom. DoF p
## LATE 29.47 3 44 2.651e-10
## Half-BW 16.82 3 20 2.167e-05
## Double-BW 29.47 3 44 2.651e-10
Specifying a linear RDD with a bandwidth of 2 yields the linear-based impact estimate previously found (Task d.) with similar standard errors. Since this bandwidth already contains all points of sample, doubling the bandwidth changes nothing in the results. Halving the bandwitdth to 1 gives similar results to the higher-order polynomial RDD impact estimate (Task f.) and to the pre-packaged non-linear RDD results with a bandwidth of 0.8281 (Task h.).
Hence, in certain narrow bandwidth, linear and non-linear sharp RDDs give asymptotically similar impact estimates.
plot(reg.5)
title(sub="PLOT OF LINEAR SHARP RDD RESULTS (using RDestimate) ", xlab="Age",ylab="Mortality rate from all causes (per 100,000)")
Using local linear regressions with rectangular kernels and narrowed age bandwidth yields 95% confidence intervals of impact estimates that are similar to those obtained using local nonlinear regressions.