Reference and disclaimer: This homework exercises "Regression Discontinuity Design," following Carpenter and Dobkin's (2009) paper, "The Effect of Alcohol Consumption on Mortality: Regression Discontinuity Evidence from the Minimum Drinking Age."

b. Get the data

Content

The data (AEJfigs.dta) can be downloaded directly from the Angrist and Pischke Mastering Metrics resources webpage.

Code

library(readstata13)   # read.dta13() comes from the readstata13 package

data <- read.dta13("D:/class/5. 2020 Spring/AAEC 8610/Homework/HW10/AEJfigs.dta")
# Indicator for being at or above the minimum legal drinking age of 21
# (kept as a factor, so lm() labels the dummy "over211" in the output below)
data$over21 <- factor(ifelse(data$agecell >= 21, 1, 0))

c. Reproduce Figure 3 of the paper

Figure 3

Code

plot(data$agecell, data$all, pch=16, xlim=c(19,23), ylim=c(0,140),
     xlab = "Age", ylab = "Deaths per 100,000 person-years", main = "Figure 3. Age Profile for Death Rates")
points(data$agecell, data$internal, pch=0)
points(data$agecell, data$external, pch=17)

# Fitted lines are drawn separately for age cells below 21 (rows 1-25) and at/above 21 (rows 26-50)
lines(data$agecell[1:25], data$allfitted[1:25], lty=3, lwd=1)
lines(data$agecell[26:50], data$allfitted[26:50], lty=3, lwd=1)

lines(data$agecell[1:25], data$internalfitted[1:25], lty=2, lwd=1)
lines(data$agecell[26:50], data$internalfitted[26:50], lty=2, lwd=1)

lines(data$agecell[1:25], data$externalfitted[1:25], lty=1, lwd=1)
lines(data$agecell[26:50], data$externalfitted[26:50], lty=1, lwd=1)

legend("topleft", inset=0.05, ncol=2, pch=c(16,0,17,NA,NA,NA), lty=c(0,0,0,3,2,1),
       c("All","Internal","External","All fitted", "Internal fitted", "External fitted"))

d. Simplest regression-discontinuity design

Analysis

These are the estimation results for the simple RD model, which imposes the same slope on both sides of the cutoff and allows a jump at age 21. The coefficient on over21 is the local average treatment effect (**LATE**). At the 1% significance level, over21 has a significant effect on the number of deaths per 100,000 person-years, and the estimated LATE is 7.6627.
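
For reference, the specification behind this output is a sharp RD with a common slope on both sides of the cutoff (notation added here):

$$\text{all}_i = \beta_0 + \beta_1\,\text{agecell}_i + \beta_2\,\text{over21}_i + \varepsilon_i$$

where $\beta_2$ is the jump at age 21, i.e., the LATE of 7.6627 reported below.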

## 
## Call:
## lm(formula = all ~ agecell + over21, data = data, na.action = "na.exclude")
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0559 -1.8483  0.1149  1.4909  5.8043 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 112.3097    12.6681   8.866 1.96e-11 ***
## agecell      -0.9747     0.6325  -1.541     0.13    
## over211       7.6627     1.4403   5.320 3.15e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.493 on 45 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.5946, Adjusted R-squared:  0.5765 
## F-statistic: 32.99 on 2 and 45 DF,  p-value: 1.508e-09

Given the significance of the over21 variable, and provided there is no discontinuity at age 21 in other determinants of mortality, this design can be used to estimate the relationship between drinking and mortality. The process is as follows:

First, in the simple RD regression, if we use the log of the number of deaths per 100,000 person-years as the dependent variable, the coefficient on over21 estimates **the percent change in the number of deaths at the age-21 cutoff (A)**.

Second, if we use the (log of the) proportion of days drinking as the dependent variable, the coefficient on over21 estimates **the percent change in the proportion of days drinking at the age-21 cutoff (B)**.

Lastly, the ratio A/B estimates the elasticity of mortality with respect to drinking days.
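
A hedged sketch of this procedure in R (the drinking variable is not part of AEJfigs.dta, so drinking below is a hypothetical column standing in for the proportion-of-days-drinking series used in the paper):

# (A) Percent change in deaths at the cutoff: same RD specification with a log outcome
A_fit <- lm(log(all) ~ agecell + over21, data = data, na.action = "na.exclude")
A <- coef(A_fit)["over211"]

# (B) Percent change in drinking at the cutoff -- 'drinking' is hypothetical here;
# it would come from survey data on the proportion of days drinking by age cell.
# B_fit <- lm(log(drinking) ~ agecell + over21, data = data, na.action = "na.exclude")
# B <- coef(B_fit)["over211"]

# (A/B) Elasticity of mortality with respect to drinking days:
# elasticity <- A / B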

Code

# Simple RD: a common slope in agecell on both sides of 21, with the jump captured by over21
OLS_d <- lm(all ~ agecell + over21, data, na.action="na.exclude")
summary(OLS_d)

e. Plot the results in Simple RD

Plot

Code

fitted_e = fitted(OLS_d)

plot(data$agecell, data$all, pch=16, xlim=c(19,23), ylim=c(80,120),
     xlab = "Age", ylab = "Deaths per 100,000 person-years", main = "Simple RD Regression")
lines(data$agecell, fitted_e, lty=1, lwd=1)

f. Allow more flexibility to your RD

Analysis

## 
## Call:
## lm(formula = all ~ agecell, data = subset(data, over21 == 0), 
##     na.action = "na.exclude")
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7972 -1.2565 -0.0694  0.9600  5.3409 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  76.2515    17.1528   4.445 0.000203 ***
## agecell       0.8270     0.8567   0.965 0.344879    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.388 on 22 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.04064,    Adjusted R-squared:  -0.002972 
## F-statistic: 0.9318 on 1 and 22 DF,  p-value: 0.3449
## 
## Call:
## lm(formula = all ~ agecell, data = subset(data, over21 == 1), 
##     na.action = "na.exclude")
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.3683 -1.8255  0.2738  1.1077  4.1014 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 159.5847    17.1402   9.311 4.36e-09 ***
## agecell      -2.7764     0.7793  -3.563  0.00174 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.172 on 22 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.3658, Adjusted R-squared:  0.337 
## F-statistic: 12.69 on 1 and 22 DF,  p-value: 0.001742

These are the results for the local linear regressions without weights. Because the two regressions are fit on different subsets of the data (age cells below 21 and at or above 21), their intercepts and slopes can differ.

To estimate the LATE at age 21, we compare the predicted values at age 21 from the two sides: the LATE is the difference between the two predictions at the cutoff. In this model the LATE is 7.6627, the same as in the Simple RD.

##        1 
## 7.662709

Code

OLS_f_before <- lm(all ~ agecell, data=subset(data,over21==0), na.action="na.exclude")
OLS_f_after <- lm(all ~ agecell, data=subset(data,over21==1), na.action="na.exclude")
summary(OLS_f_before)
summary(OLS_f_after)

# Predicted mortality at exactly age 21 from each side of the cutoff;
# their difference is the local linear RD estimate of the LATE.
fitted_f_before = predict(OLS_f_before, data.frame(agecell = c(21)))
fitted_f_after = predict(OLS_f_after, data.frame(agecell = c(21)))
LATE_local_linear = fitted_f_after - fitted_f_before
LATE_local_linear
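
Equivalently (a sketch of an alternative formulation, not the required two-regression approach), centering age at the cutoff and interacting it with over21 fits both sides in a single call, and the coefficient on over21 is the LATE directly:

# Single-regression version of the two-sided local linear fit:
# centering agecell at 21 makes the over21 coefficient equal the jump at the cutoff.
OLS_f_joint <- lm(all ~ I(agecell - 21) * over21, data = data, na.action = "na.exclude")
coef(OLS_f_joint)["over211"]   # reproduces LATE_local_linear (about 7.66)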

g. Again, plot the data

Plot

Code

fitted_g_before = fitted(OLS_f_before)
fitted_g_after = fitted(OLS_f_after)

plot(data$agecell, data$all, pch=16, xlim=c(19,23), ylim=c(80,120),
     xlab = "Age", ylab = "Deaths per 100,000 person-years", main = "RD Regression_local linear")
lines(subset(data,over21==0)$agecell, fitted_g_before, lty=1, lwd=1)
lines(subset(data,over21==1)$agecell, fitted_g_after, lty=1, lwd=1)

h. Compare to pre-packaged RDD

Analysis

## 
## Call:
## RDestimate(formula = all ~ agecell, data = data, cutpoint = 21)
## 
## Coefficients:
##      LATE    Half-BW  Double-BW  
##     9.001      9.579      7.953

These are the results from the pre-packaged RDD estimator; the estimated LATE is 9.001, which differs from the 7.6627 obtained in the models above.

The LATE differs because RDestimate uses a triangular kernel by default. In the local linear regression around age 21, observations farther from the cutoff receive smaller weights in the least-squares fit, so the farther an age cell is from 21, the less influence it has on the estimated regression lines.
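
A minimal sketch of what the triangular kernel does, assuming a hypothetical bandwidth (RDestimate chooses the Imbens-Kalyanaraman optimal bandwidth by default, so the estimate below will not reproduce 9.001 exactly):

# Triangular weights: largest at age 21, declining linearly, zero beyond the bandwidth
bw_tri <- 1.5                                          # hypothetical bandwidth, for illustration only
tri_w  <- pmax(0, 1 - abs(data$agecell - 21) / bw_tri)

# Weighted local linear regression with separate slopes and a jump at the cutoff
tri_fit <- lm(all ~ I(agecell - 21) * over21, data = data, weights = tri_w)
coef(tri_fit)["over211"]                               # jump at age 21 under triangular weighting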

The plot of the regression output is as follows:

Code

RD_h = RDestimate(all~agecell, data=data, cutpoint = 21)  # "triangular" kernel is the default
RD_h
plot(RD_h)

i. Prepackaged linear RDD

Analysis

## 
## Call:
## RDestimate(formula = all ~ agecell, data = data, cutpoint = 21, 
##     bw = 2, kernel = "rectangular")
## 
## Coefficients:
##      LATE    Half-BW  Double-BW  
##     7.663      9.753      7.663

These are the results for the pre-packaged RDD estimator using a rectangular kernel with bandwidth = 2, which covers the same age interval (19 to 23) used in the Simple RD and the unweighted local linear regressions. Under the rectangular kernel, every observation within the bandwidth receives the same weight, so the LATE is 7.663, matching the Simple RD and the unweighted local linear regression.
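
As a rough check (a sketch, not required output): a rectangular kernel with bandwidth 2 gives every age cell between 19 and 23 the same weight, so the fit collapses to an unweighted local linear regression with a jump at 21:

# Rectangular kernel: weight 1 for every cell with |agecell - 21| <= 2, zero outside
rect_w <- as.numeric(abs(data$agecell - 21) <= 2)      # here all cells fall inside the bandwidth

# Equal weights make this the unweighted local linear fit, reproducing the LATE of about 7.66
rect_fit <- lm(all ~ I(agecell - 21) * over21, data = data, weights = rect_w)
coef(rect_fit)["over211"]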

Which LATE is more accurate depends on how linear the age profile is on each side of the cutoff in the sample. If the relationship is nonlinear, the LATE from the triangular kernel, which down-weights observations far from the cutoff, will tend to be more accurate. If the relationship is close to linear, the rectangular kernel, which uses all observations within the bandwidth equally, is the more efficient choice.

The plot of the regression output is as follows:

Code

RD_i = RDestimate(all~agecell, data=data, cutpoint = 21, kernel = "rectangular", bw=2)
RD_i
plot(RD_i)