Regression Discontinuity

Import libraries and dataset

setwd("~/Advanced Impact Evaluation")

library(tidyverse)
library(dplyr)
library(readr)
library(estimatr)
library(sandwich)
library(lmtest)
library(texreg)

df <- read_csv("~/Advanced Impact Evaluation/transfer (2).csv")

In this exercise, you’ll estimate the effects of increased government spending on educational attainment, literacy, and poverty rates. Some scholars argue that government spending accomplishes very little in high corruption and inequality environments. Others suggest that accountability pressures and the large demand for public goods in such environments will drive elites to respond. To address this debate, we exploit the fact that until 1991, the municipality’s population determined the formula for government transfers to individual Brazilian municipalities. This meant municipalities with populations below the official cutoff did not receive additional revenue, while municipalities above the cutoff did.

Question 1: We will apply the regression discontinuity design to this application. State the required key assumption for this design and interpret it in the context of this specific application. What would be a scenario in which this assumption is violated?

A regression discontinuity design (RDD) compares units just on either side of a cutoff, so that the main difference between them is the treatment they receive: an apples-to-apples comparison. Here the running variable is municipal population. Brazilian municipalities with populations below the cutoff form the control group because they did not receive additional revenue, while municipalities at or above the cutoff form the treatment group because they received additional revenue for public spending. The key assumption is continuity: absent the transfer, expected outcomes would change smoothly with population through the cutoff, so municipalities just above and just below it are comparable in both observed and unobserved characteristics. Under this assumption, a jump in outcomes at the cutoff can be attributed to the treatment rather than to pre-existing differences between municipalities or to confounders.

In this application, any discontinuity in educational attainment, poverty rates, or literacy among municipalities near the cutoff is attributed to the additional transfers. The assumption is violated if municipalities just above and just below the cutoff differ systematically in something other than the transfer. For example, if local officials could influence reported population counts to land just above the cutoff, or if another policy switched on at the same population threshold, the estimated treatment effect would be biased.

Question 2: Begin by creating a variable that determines how close each municipality was to the cutoff that determined whether municipalities received a transfer (or did not). Transfers occurred at three separate population cutoffs: 10,188, 13,584, and 16,980 (we will focus only on the first for this exercise; you can try the other two on your own time). Using the first cutoff (10,188), create a single variable that characterizes the difference from the closest population cutoff (e.g., midpoint between two cutoffs).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    1374    2735    4361    6967   13563
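The code that produced the summary above is not shown. As a minimal sketch of how such a distance variable could be built, the following uses a handful of made-up populations in place of the real `pop82` column. The names `dist_signed`, `dist_abs`, and `treat` are illustrative assumptions; a minimum of 0 in the summary is consistent with an absolute distance, but the original may have been constructed differently.

```r
# Hypothetical populations standing in for the pop82 column of the real data.
pop82 <- c(8500, 10188, 10300, 12000, 15000)

cutoff <- 10188                              # first transfer cutoff from the exercise

dist_signed <- pop82 - cutoff                # negative below the cutoff, positive above
dist_abs    <- abs(dist_signed)              # unsigned distance; equals 0 exactly at the cutoff
treat       <- as.integer(pop82 >= cutoff)   # 1 at or above the cutoff, 0 below

data.frame(pop82, dist_signed, dist_abs, treat)
```

Keeping the signed version is convenient for plotting, because the cutoff then sits at zero.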

Question 3: You could run a regression with the treatment dummy = 1 for those at or above the cutoff and zero below. Note that the coefficient of the treatment dummy captures the RDD treatment effect, and the running variable population could have an independent effect on the outcome. This is a good first approximation of the treatment effect.

## 
## Call:
## lm(formula = poverty91 ~ treat + pop82, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.57367 -0.17451  0.04475  0.20184  0.34897 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.022e-01  1.782e-02  33.797   <2e-16 ***
## treat       -3.220e-02  1.659e-02  -1.941   0.0524 .  
## pop82        1.702e-06  1.662e-06   1.024   0.3059    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2272 on 1784 degrees of freedom
## Multiple R-squared:  0.002216,   Adjusted R-squared:  0.001097 
## F-statistic: 1.981 on 2 and 1784 DF,  p-value: 0.1383
## 
## Call:
## lm(formula = literate91 ~ treat + pop82, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.54846 -0.13968  0.05504  0.14339  0.22818 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.250e-01  1.288e-02  64.060  < 2e-16 ***
## treat        2.004e-02  1.199e-02   1.672  0.09479 .  
## pop82       -3.710e-06  1.201e-06  -3.091  0.00203 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1641 on 1783 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.005555,   Adjusted R-squared:  0.00444 
## F-statistic:  4.98 on 2 and 1783 DF,  p-value: 0.006968
## 
## Call:
## lm(formula = educ91 ~ treat + pop82, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9234 -1.1934  0.0759  1.1603  3.4656 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.681e+00  1.121e-01  41.767   <2e-16 ***
## treat        9.701e-02  1.043e-01   0.930    0.353    
## pop82       -1.198e-05  1.045e-05  -1.147    0.252    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.428 on 1783 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.0007576,  Adjusted R-squared:  -0.0003632 
## F-statistic: 0.6759 on 2 and 1783 DF,  p-value: 0.5088
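The `Call` lines in the output above show the specification used: each outcome regressed on the treatment dummy and the running variable. A minimal sketch of that specification on simulated data follows; the data frame `sim`, the sample size, and the built-in effect of -0.05 are invented for illustration and are not the exercise data.

```r
set.seed(1)

cutoff <- 10188
n <- 500

# Simulated municipalities (illustrative only, not the transfer data).
sim <- data.frame(pop82 = round(runif(n, 5000, 16000)))
sim$treat <- as.integer(sim$pop82 >= cutoff)                     # 1 at or above the cutoff
sim$poverty91 <- 0.6 - 0.05 * sim$treat + rnorm(n, sd = 0.2)     # made-up treatment effect

# Same form as the models above: treatment dummy plus the running variable.
fit_global <- lm(poverty91 ~ treat + pop82, data = sim)

# Estimate, standard error, t value and p-value for the treatment dummy.
coef(summary(fit_global))["treat", ]
```

Because this fit uses every municipality, it leans on the linear functional form far from the cutoff, which is why it is only a first approximation.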

Question 4: For the analysis, we will include only those municipalities within 3 points of the funding cutoff on either side. Using regressions, estimate the average causal effect of government transfer on the three outcome variables of interest: educational attainment, literacy, and poverty. Give a brief substantive interpretation of the results.

In model 4, increased government spending lowered the poverty rate: the coefficient on treat is about -0.20 (p = 0.035). In model 5, it raised literacy by about 0.12, although this estimate is significant only at the 10% level (p = 0.060). In model 6, it raised educational attainment by about 1.14 (p = 0.045). In other words, public spending moved all three outcomes in the expected direction, though with only 103 municipalities in the window the estimates are imprecise.

## 
## Call:
## lm(formula = poverty91 ~ treat + pop82, data = df1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.47240 -0.15685  0.03911  0.19015  0.38528 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -3.7162611  2.6115544  -1.423   0.1578  
## treat       -0.1987937  0.0932323  -2.132   0.0354 *
## pop82        0.0004344  0.0002605   1.667   0.0986 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2215 on 100 degrees of freedom
## Multiple R-squared:  0.04541,    Adjusted R-squared:  0.02632 
## F-statistic: 2.379 on 2 and 100 DF,  p-value: 0.09789
## 
## Call:
## lm(formula = literate91 ~ treat + pop82, data = df1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.49942 -0.14047  0.06532  0.12258  0.19959 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  3.7581100  1.8137012   2.072   0.0408 *
## treat        0.1230522  0.0647490   1.900   0.0603 .
## pop82       -0.0002963  0.0001809  -1.637   0.1047  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1539 on 100 degrees of freedom
## Multiple R-squared:  0.03493,    Adjusted R-squared:  0.01563 
## F-statistic:  1.81 on 2 and 100 DF,  p-value: 0.169
## 
## Call:
## lm(formula = educ91 ~ treat + pop82, data = df1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1851 -1.1835  0.0961  0.9268  2.8718 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 28.936560  15.769558   1.835   0.0695 .
## treat        1.144067   0.562972   2.032   0.0448 *
## pop82       -0.002446   0.001573  -1.555   0.1232  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.338 on 100 degrees of freedom
## Multiple R-squared:  0.04206,    Adjusted R-squared:  0.0229 
## F-statistic: 2.195 on 2 and 100 DF,  p-value: 0.1166
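The local models above are fitted on a subset called `df1`, whose construction is not shown. The sketch below illustrates one way to restrict a sample to a window around the cutoff, reading "3 points" as 3 percent of the cutoff value. That reading, the simulated data, and the names `pct_dist` and `local` are all assumptions.

```r
set.seed(2)

cutoff <- 10188
n <- 2000

# Simulated data for illustration only.
sim <- data.frame(pop82 = round(runif(n, 5000, 16000)))
sim$treat <- as.integer(sim$pop82 >= cutoff)
sim$educ91 <- 4.5 + 1.0 * sim$treat + rnorm(n, sd = 1.3)   # made-up jump of 1.0

# Distance from the cutoff, expressed in percent of the cutoff.
sim$pct_dist <- 100 * (sim$pop82 - cutoff) / cutoff

# Keep only observations within 3 percent on either side of the cutoff.
local <- subset(sim, abs(pct_dist) <= 3)

fit_local <- lm(educ91 ~ treat + pop82, data = local)
coef(summary(fit_local))["treat", ]
```

Narrowing the window trades precision for comparability: fewer observations remain, but they are closer to the cutoff.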

Use one of the packages to plot and visualize the analysis at the first cutoff (with default bandwidths).

## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'

## `geom_smooth()` using formula = 'y ~ x'
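The plots themselves did not survive in this output; only the `geom_smooth()` messages remain. As a rough sketch of the kind of plot those messages suggest, the following draws separate linear fits on each side of the cutoff with ggplot2. The simulated data and all aesthetic choices are assumptions, not the original figures.

```r
library(ggplot2)

set.seed(3)

cutoff <- 10188
n <- 400

# Simulated data for illustration only.
sim <- data.frame(pop82 = round(runif(n, 9000, 11400)))
sim$treat <- as.integer(sim$pop82 >= cutoff)
sim$literate91 <- 0.75 + 0.10 * sim$treat + rnorm(n, sd = 0.15)

# Colouring by treatment status makes geom_smooth() fit one line per side.
p <- ggplot(sim, aes(x = pop82, y = literate91, colour = factor(treat))) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm") +
  geom_vline(xintercept = cutoff, linetype = "dashed") +
  labs(x = "Population (pop82)", y = "Literacy rate (literate91)", colour = "Treated")

p
```

The vertical gap between the two fitted lines at the dashed line is the visual counterpart of the coefficient on `treat`.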

Instrumental Variables

Education and wages

Here, we estimate returns to education with cross-section data from the High School and Beyond survey conducted by the Department of Education in 1980, with a follow-up in 1986. The survey included students from approximately 1,100 high schools.

1. Load the CollegeDistance dataset from the AER package on R

2. Get an overview of the dataset

##     gender        ethnicity        score       fcollege   mcollege    home     
##  male  :2139   other   :3050   Min.   :28.95   no :3753   no :4088   no : 852  
##  female:2600   afam    : 786   1st Qu.:43.92   yes: 986   yes: 651   yes:3887  
##                hispanic: 903   Median :51.19                                   
##                                Mean   :50.89                                   
##                                3rd Qu.:57.77                                   
##                                Max.   :72.81                                   
##  urban          unemp             wage           distance         tuition      
##  no :3635   Min.   : 1.400   Min.   : 6.590   Min.   : 0.000   Min.   :0.2575  
##  yes:1104   1st Qu.: 5.900   1st Qu.: 8.850   1st Qu.: 0.400   1st Qu.:0.4850  
##             Median : 7.100   Median : 9.680   Median : 1.000   Median :0.8245  
##             Mean   : 7.597   Mean   : 9.501   Mean   : 1.803   Mean   :0.8146  
##             3rd Qu.: 8.900   3rd Qu.:10.150   3rd Qu.: 2.500   3rd Qu.:1.1270  
##             Max.   :24.900   Max.   :12.960   Max.   :20.000   Max.   :1.4042  
##    education      income       region    
##  Min.   :12.00   low :3374   other:3796  
##  1st Qu.:12.00   high:1365   west : 943  
##  Median :13.00                           
##  Mean   :13.81                           
##  3rd Qu.:16.00                           
##  Max.   :18.00
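The calls behind steps 1 and 2 are not shown. A summary like the one above can be reproduced along these lines, assuming the AER package is installed:

```r
library(AER)              # provides the CollegeDistance data and ivreg()

data("CollegeDistance")   # loads the data frame into the workspace

summary(CollegeDistance)  # per-variable overview, as printed above
dim(CollegeDistance)      # number of rows and columns
```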

3. The variable distance (the distance to the closest 4-year college, measured in units of 10 miles) will serve as an instrument in later exercises. Use a histogram to visualize the distribution of distance.
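The histogram is missing from this output. A minimal base R sketch of how it could be drawn is given here; the number of breaks and the labels are arbitrary choices.

```r
library(AER)
data("CollegeDistance")

# distance is recorded in units of 10 miles, so a value of 1 means 10 miles.
h <- hist(CollegeDistance$distance,
          breaks = 30,
          main = "Distance to the closest 4-year college",
          xlab = "Distance (10 miles)")
```

The summary statistics for `distance` (median 1.000, mean 1.803, maximum 20.000) indicate a strongly right-skewed distribution.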

4. Card (1993) estimates returns to education by addressing endogeneity issues when regressing wages on education directly. Identify some of the omitted variables that would bias the effect of education on wages.

There are several omitted variables that could bias the estimated effect of education on wages. In particular, a person's ability may affect both education and wages: more able individuals tend to achieve better educational results and may also earn more for reasons unrelated to schooling. Another omitted variable is motivation. Like ability, it is unobserved, and motivated individuals may invest more time and effort in both their education and their careers, raising both education levels and earnings.

5. Regress the logarithm of wage on education

## 
## Call:
## lm(formula = log(wage) ~ education, data = CollegeDistance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.36208 -0.06271  0.03255  0.07794  0.32436 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.213220   0.016228 136.379   <2e-16 ***
## education   0.002024   0.001166   1.737   0.0825 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1435 on 4737 degrees of freedom
## Multiple R-squared:  0.0006364,  Adjusted R-squared:  0.0004254 
## F-statistic: 3.016 on 1 and 4737 DF,  p-value: 0.08249

6. Add the covariates specified here to the previous regression: unemp, hispanic, af-am, female and urban. Note whether the coefficient of the variable of interest (education) changes.

After including the covariates, the coefficient on the variable of interest changed. The estimated effect of education on log wage was positive and small in the previous model (0.002024), and after including the covariates it became smaller still (0.0006723). In neither model is the coefficient statistically significant at the 5% level (p = 0.0825 and p = 0.5455).

## 
## Call:
## lm(formula = log(wage) ~ education + unemp + ethnicity + gender + 
##     urban, data = CollegeDistance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.39998 -0.08223  0.02833  0.09486  0.37945 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.1519999  0.0168512 127.706   <2e-16 ***
## education          0.0006723  0.0011121   0.605   0.5455    
## unemp              0.0135938  0.0007203  18.874   <2e-16 ***
## ethnicityafam     -0.0619139  0.0055990 -11.058   <2e-16 ***
## ethnicityhispanic -0.0535204  0.0052237 -10.246   <2e-16 ***
## genderfemale      -0.0091150  0.0039785  -2.291   0.0220 *  
## urbanyes           0.0089393  0.0048005   1.862   0.0626 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1361 on 4732 degrees of freedom
## Multiple R-squared:  0.1026, Adjusted R-squared:  0.1015 
## F-statistic:  90.2 on 6 and 4732 DF,  p-value: < 2.2e-16
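The two OLS fits for steps 5 and 6 can be reproduced from the `Call` lines printed in their outputs. The object names `ols_simple` and `ols_full` are illustrative.

```r
library(AER)
data("CollegeDistance")

# Step 5: log wage on education only.
ols_simple <- lm(log(wage) ~ education, data = CollegeDistance)

# Step 6: add the covariates. ethnicity and gender are factors, so lm()
# expands them into the afam, hispanic and female dummies automatically.
ols_full <- lm(log(wage) ~ education + unemp + ethnicity + gender + urban,
               data = CollegeDistance)

# Compare the education coefficient across the two specifications.
c(simple = unname(coef(ols_simple)["education"]),
  full   = unname(coef(ols_full)["education"]))
```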

7. Card (1993) suggests instrumental variables regression that uses college distance as an instrument for education. Argue why this could be a valid instrument.

College distance is a valid instrument if it satisfies two conditions. First, relevance: it must be correlated with education, the endogenous variable of interest, which is plausible because living farther from a college raises the cost of attending. Second, exclusion: it must not be correlated with wages other than through education. In other words, the distance to the nearest college should have no direct effect on earnings.

8. Compute the correlations of the instrument distance with the endogenous regressor education and the dependent variable wage. Is it relevant?

The instrument distance appears to be relevant, although the raw correlation is modest. An instrument is meant to isolate variation in X that is unrelated to the error term, so a valid instrument must be correlated with X and not causally related to Y other than through X. Here, distance is negatively correlated with education, the endogenous regressor X (-0.09318309), and essentially uncorrelated with the outcome variable wage (-0.0003904288). Relevance concerns only the first of these correlations.

## [1] -0.09318309
## [1] -0.0003904288
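The two numbers above are simple Pearson correlations. A sketch of how they could be computed, assuming the raw `wage` variable rather than its logarithm, as the question wording suggests:

```r
library(AER)
data("CollegeDistance")

# Correlation of the instrument with the endogenous regressor (relevance).
cor_educ <- cor(CollegeDistance$distance, CollegeDistance$education)

# Correlation of the instrument with the outcome.
cor_wage <- cor(CollegeDistance$distance, CollegeDistance$wage)

c(education = cor_educ, wage = cor_wage)
```

A raw correlation is only a first check of relevance; the first-stage regression with covariates is the more informative diagnostic.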

9. Repeat the two original OLS regressions as IV regressions, i.e., employ distance as an instrument for education in both regressions using ivreg().

## 
## Call:
## lm(formula = education ~ distance + unemp + ethnicity + gender + 
##     urban, data = CollegeDistance)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1742 -1.7035 -0.4755  1.8914  4.5255 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       14.060680   0.083075 169.253  < 2e-16 ***
## distance          -0.086846   0.012244  -7.093 1.51e-12 ***
## unemp              0.010267   0.009768   1.051    0.293    
## ethnicityafam     -0.524317   0.072444  -7.238 5.30e-13 ***
## ethnicityhispanic -0.274761   0.067879  -4.048 5.25e-05 ***
## genderfemale      -0.024645   0.051731  -0.476    0.634    
## urbanyes          -0.092308   0.065039  -1.419    0.156    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.77 on 4732 degrees of freedom
## Multiple R-squared:  0.02298,    Adjusted R-squared:  0.02174 
## F-statistic: 18.55 on 6 and 4732 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = log(wage) ~ distance + unemp + ethnicity + gender + 
##     urban, data = CollegeDistance)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.38155 -0.08052  0.02458  0.09610  0.37599 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.1638026  0.0063632 340.049  < 2e-16 ***
## distance          -0.0058468  0.0009379  -6.234 4.94e-10 ***
## unemp              0.0149146  0.0007482  19.934  < 2e-16 ***
## ethnicityafam     -0.0630613  0.0055489 -11.365  < 2e-16 ***
## ethnicityhispanic -0.0520024  0.0051992 -10.002  < 2e-16 ***
## genderfemale      -0.0092693  0.0039624  -2.339   0.0194 *  
## urbanyes           0.0002348  0.0049817   0.047   0.9624    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1355 on 4732 degrees of freedom
## Multiple R-squared:  0.1099, Adjusted R-squared:  0.1087 
## F-statistic: 97.35 on 6 and 4732 DF,  p-value: < 2.2e-16
## 
## Call:
## ivreg(formula = log(wage) ~ education + unemp + ethnicity + gender + 
##     urban | distance + unemp + ethnicity + gender + urban, data = CollegeDistance)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.5885016 -0.1191974 -0.0001799  0.1452146  0.4576460 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.2171787  0.2018797   6.029 1.77e-09 ***
## education          0.0673242  0.0143812   4.681 2.93e-06 ***
## unemp              0.0142234  0.0009648  14.743  < 2e-16 ***
## ethnicityafam     -0.0277621  0.0104342  -2.661  0.00782 ** 
## ethnicityhispanic -0.0335043  0.0081520  -4.110 4.02e-05 ***
## genderfemale      -0.0076101  0.0052865  -1.440  0.15007    
## urbanyes           0.0064494  0.0063892   1.009  0.31283    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1805 on 4732 degrees of freedom
## Multiple R-Squared: -0.5786, Adjusted R-squared: -0.5806 
## Wald test: 54.89 on 6 and 4732 DF,  p-value: < 2.2e-16
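The outputs above can be tied together in a short sketch. The formulas are taken from the printed `Call` lines; the IV model without covariates is not printed above and is included only because the question asks for both regressions. With a single instrument, the `ivreg()` estimate equals the reduced-form coefficient divided by the first-stage coefficient, which the last lines check. Object names are illustrative.

```r
library(AER)
data("CollegeDistance")

# IV version of the simple regression: distance instruments education.
iv_simple <- ivreg(log(wage) ~ education | distance, data = CollegeDistance)

# IV version with covariates; exogenous covariates appear on both sides of "|".
iv_full <- ivreg(log(wage) ~ education + unemp + ethnicity + gender + urban |
                   distance + unemp + ethnicity + gender + urban,
                 data = CollegeDistance)

# First stage and reduced form, as in the two lm() outputs above.
first_stage  <- lm(education ~ distance + unemp + ethnicity + gender + urban,
                   data = CollegeDistance)
reduced_form <- lm(log(wage) ~ distance + unemp + ethnicity + gender + urban,
                   data = CollegeDistance)

# Ratio of reduced-form to first-stage coefficient on the instrument.
wald_ratio <- coef(reduced_form)["distance"] / coef(first_stage)["distance"]

c(ivreg = unname(coef(iv_full)["education"]), ratio = unname(wald_ratio))
```

Using the printed coefficients, -0.0058468 / -0.086846 is about 0.0673, matching the education coefficient in the `ivreg()` output.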

10. Compare the estimates from the IVreg to those from OLS. Which one would you trust more?

The first-stage equation (model 6) shows that education and distance are negatively related: each additional 10 miles from the nearest college is associated with about 0.087 fewer years of education (t = -7.09). The first condition for an instrumental variable, relevance, is satisfied.

The reduced-form equation (model 7) shows a small but statistically significant negative coefficient of distance on log wage (-0.0058468, p < 0.001). This is what we would expect if distance affects wages through education. The second condition, the exclusion restriction, cannot be verified from this regression; it has to be argued on substantive grounds.

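In model 5 (OLS) the coefficient of education on log wage is small (0.0006723) and not statistically significant, while in model 8 (ivreg) it is much larger (0.0673242) and statistically significant. This IV estimate equals the reduced-form coefficient divided by the first-stage coefficient (-0.0058468 / -0.086846). In both cases the estimated effect of education on earnings is positive. If the exclusion restriction holds, the IV estimate is the more trustworthy of the two because OLS is likely biased by omitted variables; the cost is a larger standard error (0.0144 versus 0.0011).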