content
The data can be downloaded straight here, from A&P Mastering Metrics resources webpage.
Reference and disclaimer: This homework is attempting to excersice “Regression Discontinuity Design”, based on Carpenter and Dobkin(2009)’s paper (The Effect of Alcohol Consumption on Mortality: Regression Discontinuity).
The data can be downloaded straight here, from A&P Mastering Metrics resources webpage.
data <- read.dta13("D:/class/5. 2020 Spring/AAEC 8610/Homework/HW10/AEJfigs.dta")
data$over21 <- 1
data$over21[data$agecell < 21] <- "0"
plot(data$agecell, data$all, pch=16, xlim=c(19,23), ylim=c(0,140),
xlab = "Age", ylab = "Deaths per 100,000 person-years", main = "Figure 3. Age Profile for Death Rates")
points(data$agecell, data$internal, pch=0)
points(data$agecell, data$external, pch=17)
lines(data$agecell[1:25], data$allfitted[1:25], lty=3, lwd=1)
lines(data$agecell[26:50], data$allfitted[26:50], lty=3, lwd=1)
lines(data$agecell[1:25], data$internalfitted[1:25], lty=2, lwd=1)
lines(data$agecell[26:50], data$internalfitted[26:50], lty=2, lwd=1)
lines(data$agecell[1:25], data$externalfitted[1:25], lty=1, lwd=1)
lines(data$agecell[26:50], data$externalfitted[26:50], lty=1, lwd=1)
legend("topleft", inset=0.05, ncol=2, pch=c(16,0,17,NA,NA,NA), lty=c(0,0,0,3,2,1),
c("All","Internal","External","All fitted", "Internal fitted", "External fitted"))
This is the estimation result for simple RD model, given the same slope with a jump.The coefficient of over21 means that the local Average Treatment Effect (**LATE*“). At the 1 % significance level, over21** has a significant effect on the number of deaths per 100,000 person-years, and the LATE is 7.6627.
##
## Call:
## lm(formula = all ~ agecell + over21, data = data, na.action = "na.exclude")
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0559 -1.8483 0.1149 1.4909 5.8043
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 112.3097 12.6681 8.866 1.96e-11 ***
## agecell -0.9747 0.6325 -1.541 0.13
## over211 7.6627 1.4403 5.320 3.15e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.493 on 45 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.5946, Adjusted R-squared: 0.5765
## F-statistic: 32.99 on 2 and 45 DF, p-value: 1.508e-09
Due to this significance of over21 variable, it can be used to estimate the relationship between drinking and mortality, given that there is no discontinuity at age 21 in other causal variables.The process is as follows:
First, in the simple regression model, if we use the log of the number of deaths per 100,000 person-years as a dependent variable, we can estimate ** the % change in the number of deaths at age 20 (A)** by seeing the coefficient of over21.
Second, if we use the Proportion of days drinking as a dependent variable, we can estimate ** the % change in Proportion of days drinking at age 20 (B)** by seeing the coefficient of over21.
Lastly, through A/B, we can estimate the elasticity of the drinking days on the mortality.
OLS_d <- lm(all ~ agecell + over21, data, na.action="na.exclude")
summary(OLS_d)
fitted_e = fitted(OLS_d)
plot(data$agecell, data$all, pch=16, xlim=c(19,23), ylim=c(80,120),
xlab = "Age", ylab = "Deaths per 100,000 person-years", main = "Simple RD Regression")
lines(data$agecell, fitted_e, lty=1, lwd=1)
##
## Call:
## lm(formula = all ~ agecell, data = subset(data, over21 == 0),
## na.action = "na.exclude")
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7972 -1.2565 -0.0694 0.9600 5.3409
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 76.2515 17.1528 4.445 0.000203 ***
## agecell 0.8270 0.8567 0.965 0.344879
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.388 on 22 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.04064, Adjusted R-squared: -0.002972
## F-statistic: 0.9318 on 1 and 22 DF, p-value: 0.3449
##
## Call:
## lm(formula = all ~ agecell, data = subset(data, over21 == 1),
## na.action = "na.exclude")
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3683 -1.8255 0.2738 1.1077 4.1014
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 159.5847 17.1402 9.311 4.36e-09 ***
## agecell -2.7764 0.7793 -3.563 0.00174 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.172 on 22 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.3658, Adjusted R-squared: 0.337
## F-statistic: 12.69 on 1 and 22 DF, p-value: 0.001742
These are the resuls for local linear regression without considering weights. Since the data is different between the data before age 21 and the data after age 21, the coefficients and intercept can be different.
To estimate the LATE at age 21, we need to see the predicted values at age 21 for both sides. That is, the LATE is the difference between predicted values at age 21 in the models from both sides. In this model, the LATE is 7.6627 which is the same as that in Simple RD.
## 1
## 7.662709
OLS_f_before <- lm(all ~ agecell, data=subset(data,over21==0), na.action="na.exclude")
OLS_f_after <- lm(all ~ agecell, data=subset(data,over21==1), na.action="na.exclude")
summary(OLS_f_before)
summary(OLS_f_after)
fitted_f_before = predict(OLS_f_before, data.frame(agecell = c(21)))
fitted_f_after = predict(OLS_f_after, data.frame(agecell = c(21)))
LATE_local_linear = fitted_f_after - fitted_f_before
LATE_local_linear
fitted_g_before = fitted(OLS_f_before)
fitted_g_after = fitted(OLS_f_after)
plot(data$agecell, data$all, pch=16, xlim=c(19,23), ylim=c(80,120),
xlab = "Age", ylab = "Deaths per 100,000 person-years", main = "RD Regression_local linear")
lines(subset(data,over21==0)$agecell, fitted_g_before, lty=1, lwd=1)
lines(subset(data,over21==1)$agecell, fitted_g_after, lty=1, lwd=1)
##
## Call:
## RDestimate(formula = all ~ agecell, data = data, cutpoint = 21)
##
## Coefficients:
## LATE Half-BW Double-BW
## 9.001 9.579 7.953
This is the results for Pre-pacakaged RDD, and the estimated LATE is 9.001 which is different from those (7.6627) in the above models.
The reason for the different LATE among models is that in this model triangular kernel is used as the default. Therfore, in the local linear regression around age 21, the farther away from the age 21, the smaller the weight which is applied to squared errors at that age. Hence, the farther away from the age 21, the smaller the effect on the estimated model.
The plot of the regression output is as follows:
RD_h = RDestimate(all~agecell, data=data, cutpoint = 21) # "triangular" kernel is the default
RD_h
plot(RD_h)
##
## Call:
## RDestimate(formula = all ~ agecell, data = data, cutpoint = 21,
## bw = 2, kernel = "rectangular")
##
## Coefficients:
## LATE Half-BW Double-BW
## 7.663 9.753 7.663
This are the resuls for Pre-pacakaged RDD using the rectangular kernel bandwith =2 which ensures the same interval for the age data in Simple RD and Loocal linear Regression withough considering weights. Under the rectangular kernel, the same weights are appled to all data within the bandwidth. Hence, in this model, the LATE is 7.6627 which is the same as that in Simple RD and Loocal linear Regression withough weights.
which LATE is more accurate depends on the linearity of before/after data in the sample. If the sample shows the non-linearity, the LATE from the triangular kernel will be more accurate. On the other hand, if the sample shows the linearity, the LATE from the triangular kernel will be more accurate
The plot of the regression output is as follows:
RD_i = RDestimate(all~agecell, data=data, cutpoint = 21, kernel = "rectangular", bw=2)
RD_i
plot(RD_i)