setwd("~/Advanced Impact Evaluation")
library(tidyverse)
library(dplyr)
library(readr)
library(estimatr)
library(sandwich)
library(lmtest)
library(texreg)
df <- read_csv("~/Advanced Impact Evaluation/transfer (2).csv")
A regression discontinuity design (RDD) looks for treatment and control groups that are as similar as possible, so that the main difference between them is the treatment they receive. In other words, it aims for an apples-to-apples comparison. In this case study, municipal population is the assignment (running) variable. Brazilian municipalities with populations below the cutoff are the control group because they did not receive additional revenue, while municipalities with populations above the cutoff are the treatment group because they received additional revenue for public spending. RDD assumes that municipalities just above and just below the cutoff are similar, which enables a nearly apples-to-apples comparison with less bias from confounders. Differences in outcomes can then be attributed to the treatment rather than to pre-existing differences between municipalities.
In this scenario, if educational attainment, poverty rates, or literacy differ between municipalities on either side of the cutoff, the difference is attributed to the treatment, on the assumption that all other factors are similar. If municipalities just above and just below the cutoff differ systematically in ways other than the treatment and population, the RDD assumption is violated, and the estimated treatment effect can be biased.
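As a compact way to state this, the regressions reported in this section (for example `poverty91 ~ treat + pop82`) can be read as the following linear specification. The symbols are my own notation, not taken from the assignment: $Y_i$ is the outcome, $X_i$ is the 1982 population, $c$ is the population cutoff, and $D_i$ is assumed to correspond to the `treat` indicator.

$$
Y_i = \alpha + \tau D_i + \beta X_i + \varepsilon_i, \qquad D_i = \mathbf{1}[X_i \ge c]
$$

Under the continuity assumption described above, the coefficient $\tau$ estimates the jump in the expected outcome at the cutoff:

$$
\tau = \lim_{x \downarrow c} E[Y_i \mid X_i = x] - \lim_{x \uparrow c} E[Y_i \mid X_i = x]
$$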
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 1374 2735 4361 6967 13563
##
## Call:
## lm(formula = poverty91 ~ treat + pop82, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.57367 -0.17451 0.04475 0.20184 0.34897
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.022e-01 1.782e-02 33.797 <2e-16 ***
## treat -3.220e-02 1.659e-02 -1.941 0.0524 .
## pop82 1.702e-06 1.662e-06 1.024 0.3059
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2272 on 1784 degrees of freedom
## Multiple R-squared: 0.002216, Adjusted R-squared: 0.001097
## F-statistic: 1.981 on 2 and 1784 DF, p-value: 0.1383
##
## Call:
## lm(formula = literate91 ~ treat + pop82, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.54846 -0.13968 0.05504 0.14339 0.22818
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.250e-01 1.288e-02 64.060 < 2e-16 ***
## treat 2.004e-02 1.199e-02 1.672 0.09479 .
## pop82 -3.710e-06 1.201e-06 -3.091 0.00203 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1641 on 1783 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.005555, Adjusted R-squared: 0.00444
## F-statistic: 4.98 on 2 and 1783 DF, p-value: 0.006968
##
## Call:
## lm(formula = educ91 ~ treat + pop82, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9234 -1.1934 0.0759 1.1603 3.4656
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.681e+00 1.121e-01 41.767 <2e-16 ***
## treat 9.701e-02 1.043e-01 0.930 0.353
## pop82 -1.198e-05 1.045e-05 -1.147 0.252
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.428 on 1783 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.0007576, Adjusted R-squared: -0.0003632
## F-statistic: 0.6759 on 2 and 1783 DF, p-value: 0.5088
In model 4, increased government spending lowered the poverty rate by about 0.20 (coefficient -0.199, p = 0.035). In model 5, it raised literacy by 0.12, although this estimate is significant only at the 10% level (p = 0.060). Finally, in model 6, it raised educational attainment by 1.14 (p = 0.045). In other words, public spending moved all three outcomes in the expected direction.
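The contrast between the full-sample models (`data = df`) and models 4 to 6 (`data = df1`, a much smaller subsample, presumably restricted to municipalities near the cutoff) can be illustrated with simulated data. This is only a sketch: the cutoff, the bandwidth `h`, the effect size, and all variable names are hypothetical and do not reproduce how `df1` was built.

```r
# Simulated illustration only: cutoff, bandwidth, and effect size are made up.
set.seed(42)

n      <- 5000
cutoff <- 10000   # hypothetical population threshold
h      <- 500     # hypothetical bandwidth around the cutoff

sim <- data.frame(pop = runif(n, min = 5000, max = 15000))
sim$treat <- as.integer(sim$pop >= cutoff)

# Outcome: a curved relationship with population plus a true jump of -0.10
sim$y <- 0.6 + 0.3 * ((sim$pop - cutoff) / 5000)^3 -
  0.10 * sim$treat + rnorm(n, sd = 0.05)

# Linear specification on the full sample
fit_full <- lm(y ~ treat + pop, data = sim)

# Same specification, keeping only observations within h of the cutoff
sim_local <- subset(sim, abs(pop - cutoff) <= h)
fit_local <- lm(y ~ treat + pop, data = sim_local)

est_full  <- unname(coef(fit_full)["treat"])
est_local <- unname(coef(fit_local)["treat"])
c(full = est_full, local = est_local)
```

In this simulation the full-sample estimate is pulled away from the true jump because a straight line cannot follow the curvature far from the cutoff, while the local estimate stays close to -0.10 at the cost of a smaller sample and a larger standard error.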
##
## Call:
## lm(formula = poverty91 ~ treat + pop82, data = df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.47240 -0.15685 0.03911 0.19015 0.38528
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.7162611 2.6115544 -1.423 0.1578
## treat -0.1987937 0.0932323 -2.132 0.0354 *
## pop82 0.0004344 0.0002605 1.667 0.0986 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2215 on 100 degrees of freedom
## Multiple R-squared: 0.04541, Adjusted R-squared: 0.02632
## F-statistic: 2.379 on 2 and 100 DF, p-value: 0.09789
##
## Call:
## lm(formula = literate91 ~ treat + pop82, data = df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.49942 -0.14047 0.06532 0.12258 0.19959
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.7581100 1.8137012 2.072 0.0408 *
## treat 0.1230522 0.0647490 1.900 0.0603 .
## pop82 -0.0002963 0.0001809 -1.637 0.1047
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1539 on 100 degrees of freedom
## Multiple R-squared: 0.03493, Adjusted R-squared: 0.01563
## F-statistic: 1.81 on 2 and 100 DF, p-value: 0.169
##
## Call:
## lm(formula = educ91 ~ treat + pop82, data = df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1851 -1.1835 0.0961 0.9268 2.8718
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.936560 15.769558 1.835 0.0695 .
## treat 1.144067 0.562972 2.032 0.0448 *
## pop82 -0.002446 0.001573 -1.555 0.1232
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.338 on 100 degrees of freedom
## Multiple R-squared: 0.04206, Adjusted R-squared: 0.0229
## F-statistic: 2.195 on 2 and 100 DF, p-value: 0.1166
Here, we estimate returns to education with cross-section data from the High School and Beyond survey conducted by the Department of Education in 1980, with a follow-up in 1986. The survey included students from approximately 1,100 high schools.
## gender ethnicity score fcollege mcollege home
## male :2139 other :3050 Min. :28.95 no :3753 no :4088 no : 852
## female:2600 afam : 786 1st Qu.:43.92 yes: 986 yes: 651 yes:3887
## hispanic: 903 Median :51.19
## Mean :50.89
## 3rd Qu.:57.77
## Max. :72.81
## urban unemp wage distance tuition
## no :3635 Min. : 1.400 Min. : 6.590 Min. : 0.000 Min. :0.2575
## yes:1104 1st Qu.: 5.900 1st Qu.: 8.850 1st Qu.: 0.400 1st Qu.:0.4850
## Median : 7.100 Median : 9.680 Median : 1.000 Median :0.8245
## Mean : 7.597 Mean : 9.501 Mean : 1.803 Mean :0.8146
## 3rd Qu.: 8.900 3rd Qu.:10.150 3rd Qu.: 2.500 3rd Qu.:1.1270
## Max. :24.900 Max. :12.960 Max. :20.000 Max. :1.4042
## education income region
## Min. :12.00 low :3374 other:3796
## 1st Qu.:12.00 high:1365 west : 943
## Median :13.00
## Mean :13.81
## 3rd Qu.:16.00
## Max. :18.00
# 4. Card (1993) estimates returns to education by addressing endogeneity issues when regressing wages on education directly. Identify some of the omitted variables that would bias the effect of education on wages.
Several omitted variables could bias the estimated effect of education on wages. In particular, ability may affect both education and wages: more able individuals tend to attain more education and also earn more. Another omitted variable is motivation. Although it is also unobserved, motivated individuals may put more time and effort into achieving higher education levels and earnings.
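The direction of this bias can be written down with the standard omitted-variable formula. The notation is an assumption for illustration: $A_i$ stands for an unobserved factor such as ability, and $u_i$ is assumed to be uncorrelated with education.

$$
\log(wage_i) = \beta_0 + \beta_1\, educ_i + \gamma A_i + u_i
$$

If $A_i$ is left out of the regression, the OLS slope converges to

$$
\operatorname{plim} \hat{\beta}_1^{OLS} = \beta_1 + \gamma \, \frac{\operatorname{Cov}(educ_i, A_i)}{\operatorname{Var}(educ_i)}
$$

So when the omitted factor raises wages ($\gamma > 0$) and is positively correlated with education, OLS overstates the return to education. The bias is negative if the two signs differ.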
##
## Call:
## lm(formula = log(wage) ~ education, data = CollegeDistance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.36208 -0.06271 0.03255 0.07794 0.32436
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.213220 0.016228 136.379 <2e-16 ***
## education 0.002024 0.001166 1.737 0.0825 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1435 on 4737 degrees of freedom
## Multiple R-squared: 0.0006364, Adjusted R-squared: 0.0004254
## F-statistic: 3.016 on 1 and 4737 DF, p-value: 0.08249
After including the covariates, the coefficient on the variable of interest changed. The estimated effect of education on log wage was positive and small in the previous model (0.002024), and it became smaller still after including the covariates (0.0006723). In neither model was the coefficient statistically significant at the 5% level (p = 0.083 and p = 0.546, respectively).
##
## Call:
## lm(formula = log(wage) ~ education + unemp + ethnicity + gender +
## urban, data = CollegeDistance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.39998 -0.08223 0.02833 0.09486 0.37945
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1519999 0.0168512 127.706 <2e-16 ***
## education 0.0006723 0.0011121 0.605 0.5455
## unemp 0.0135938 0.0007203 18.874 <2e-16 ***
## ethnicityafam -0.0619139 0.0055990 -11.058 <2e-16 ***
## ethnicityhispanic -0.0535204 0.0052237 -10.246 <2e-16 ***
## genderfemale -0.0091150 0.0039785 -2.291 0.0220 *
## urbanyes 0.0089393 0.0048005 1.862 0.0626 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1361 on 4732 degrees of freedom
## Multiple R-squared: 0.1026, Adjusted R-squared: 0.1015
## F-statistic: 90.2 on 6 and 4732 DF, p-value: < 2.2e-16
College distance is a valid instrument if it is sufficiently correlated with education, the endogenous variable of interest (relevance). In addition, the instrument cannot be correlated with the outcome, wages, other than through education (exclusion). In other words, distance to college should not have a direct effect on earnings.
The instrument distance seems to be relevant. The main purpose of an instrument is to isolate exogenous variation in X: a valid instrument is correlated with X and not causally related to Y other than through X. In this case, distance is negatively correlated with education, the endogenous variable X (-0.09318309), and its raw correlation with the outcome variable wage is close to zero (-0.0003904288). The exclusion restriction itself cannot be tested directly with these correlations and has to be argued on substantive grounds.
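The logic of the two conditions can be sketched with simulated data in which the true return to education is known. Everything here is hypothetical (variable names, parameter values, and the unobserved `ability` confounder), and the two-stage procedure is done by hand with `lm()` only to show the mechanics. It is not the code used for the models in this report.

```r
# Simulated illustration only: names and parameter values are made up.
set.seed(123)

n        <- 5000
ability  <- rnorm(n)               # unobserved confounder
distance <- rexp(n, rate = 0.5)    # instrument, generated independently of ability
educ     <- 14 - 0.3 * distance + 0.8 * ability + rnorm(n)
lwage    <- 1.5 + 0.07 * educ + 0.3 * ability + rnorm(n, sd = 0.2)  # true effect: 0.07

# Relevance: the instrument must move the endogenous regressor
relevance_cor <- cor(distance, educ)
first_stage   <- lm(educ ~ distance)
f_stat <- summary(first_stage)$coefficients["distance", "t value"]^2

# Naive OLS is biased because ability is omitted
b_ols <- unname(coef(lm(lwage ~ educ))["educ"])

# Manual two-stage least squares: regress the outcome on first-stage fitted values.
# The point estimate is the 2SLS estimate, but the standard errors printed by this
# second lm() are not valid; a dedicated IV routine corrects them.
educ_hat     <- fitted(first_stage)
second_stage <- lm(lwage ~ educ_hat)
b_iv <- unname(coef(second_stage)["educ_hat"])

c(ols = b_ols, iv = b_iv, first_stage_F = f_stat)
```

Because `distance` is generated independently of `ability`, the exclusion restriction holds by construction here. In real data it is an assumption that has to be defended.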
## [1] -0.09318309
## [1] -0.0003904288
##
## Call:
## lm(formula = education ~ distance + unemp + ethnicity + gender +
## urban, data = CollegeDistance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1742 -1.7035 -0.4755 1.8914 4.5255
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.060680 0.083075 169.253 < 2e-16 ***
## distance -0.086846 0.012244 -7.093 1.51e-12 ***
## unemp 0.010267 0.009768 1.051 0.293
## ethnicityafam -0.524317 0.072444 -7.238 5.30e-13 ***
## ethnicityhispanic -0.274761 0.067879 -4.048 5.25e-05 ***
## genderfemale -0.024645 0.051731 -0.476 0.634
## urbanyes -0.092308 0.065039 -1.419 0.156
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.77 on 4732 degrees of freedom
## Multiple R-squared: 0.02298, Adjusted R-squared: 0.02174
## F-statistic: 18.55 on 6 and 4732 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = log(wage) ~ distance + unemp + ethnicity + gender +
## urban, data = CollegeDistance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.38155 -0.08052 0.02458 0.09610 0.37599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1638026 0.0063632 340.049 < 2e-16 ***
## distance -0.0058468 0.0009379 -6.234 4.94e-10 ***
## unemp 0.0149146 0.0007482 19.934 < 2e-16 ***
## ethnicityafam -0.0630613 0.0055489 -11.365 < 2e-16 ***
## ethnicityhispanic -0.0520024 0.0051992 -10.002 < 2e-16 ***
## genderfemale -0.0092693 0.0039624 -2.339 0.0194 *
## urbanyes 0.0002348 0.0049817 0.047 0.9624
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1355 on 4732 degrees of freedom
## Multiple R-squared: 0.1099, Adjusted R-squared: 0.1087
## F-statistic: 97.35 on 6 and 4732 DF, p-value: < 2.2e-16
##
## Call:
## ivreg(formula = log(wage) ~ education + unemp + ethnicity + gender +
## urban | distance + unemp + ethnicity + gender + urban, data = CollegeDistance)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.5885016 -0.1191974 -0.0001799 0.1452146 0.4576460
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.2171787 0.2018797 6.029 1.77e-09 ***
## education 0.0673242 0.0143812 4.681 2.93e-06 ***
## unemp 0.0142234 0.0009648 14.743 < 2e-16 ***
## ethnicityafam -0.0277621 0.0104342 -2.661 0.00782 **
## ethnicityhispanic -0.0335043 0.0081520 -4.110 4.02e-05 ***
## genderfemale -0.0076101 0.0052865 -1.440 0.15007
## urbanyes 0.0064494 0.0063892 1.009 0.31283
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1805 on 4732 degrees of freedom
## Multiple R-Squared: -0.5786, Adjusted R-squared: -0.5806
## Wald test: 54.89 on 6 and 4732 DF, p-value: < 2.2e-16
The first-stage equation (model 6) shows that education and distance are negatively related: each additional unit of distance is associated with 0.087 fewer years of education (coefficient -0.086846, t = -7.09). The first condition for instrumental variables, relevance, is satisfied.
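The reduced-form equation (model 7) shows a small but statistically significant negative coefficient of distance on log wage (-0.0058468, p < 0.001). In other words, living farther from a college is associated with slightly lower wages. The reduced form cannot by itself verify the second condition for instrumental variables: the exclusion restriction requires that this association operates only through education, which has to be argued rather than tested.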
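The first stage and the reduced form are linked to the IV estimate by simple arithmetic. With one instrument and one endogenous regressor, the 2SLS coefficient equals the reduced-form coefficient divided by the first-stage coefficient. The short check below copies the two coefficients on distance and the first-stage t value from the regression output in this section. The object names are mine.

```r
# Coefficients on distance, copied from the regression output in this section
first_stage_coef  <- -0.086846    # education ~ distance + controls
reduced_form_coef <- -0.0058468   # log(wage) ~ distance + controls
first_stage_t     <- -7.093       # t value of distance in the first stage

# Ratio of reduced form to first stage (the indirect least squares / Wald estimate)
iv_ratio <- reduced_form_coef / first_stage_coef

# With a single instrument, the first-stage F statistic is the squared t value
first_stage_F <- first_stage_t^2

round(c(iv_ratio = iv_ratio, first_stage_F = first_stage_F), 4)
```

The ratio reproduces, up to rounding, the education coefficient in the `ivreg` output (0.0673242), and the first-stage F of roughly 50 is well above the common rule-of-thumb threshold of 10 for a weak instrument.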
In model 5 (OLS) the coefficient of education on log wage is small (0.0006723), while in model 8 (ivreg) it is much larger (0.0673242). In both cases, the estimated effect of education on earnings is positive. However, only in model 8 is the coefficient statistically significant.