Causal Inference with Instrumental Variable Approach

Preparation

R provides a one-stop shop for the instrumental variable approach and two-stage least squares (2SLS) in its popular ivreg package. It comes with the standard regression functionality, such as reporting estimates, regression inference, heteroskedasticity-robust and clustered standard errors, prediction, and so on. It also works well with other diagnostic packages such as car, sandwich, and lmtest. Run install.packages("ivreg") to set up the package locally.

Loading Packages

# load necessary library and packages for the analysis
library(dplyr) 
library(tidyr)
library(RColorBrewer)
library(ggplot2)
library(knitr)
library(png)

library(animation)
library(gifski)
library(table1)
library(gtsummary)
library(gt)
library(corrplot)
library(xtable)
library(car)
library(DT)

library(AER)   
library(plm)    
library(stargazer)      
library(lattice)

library(wooldridge)
library(ivreg) # IV-regression

Two-Stage Least Squares (2SLS) and Instruments (IV)

We say a factor \(X\) has a causal interpretation for an outcome variable \(Y\) if a change in \(X\) has a significant effect on \(Y\), holding all other factors constant or random. There are two basic requirements for \(X\) to be a causal factor for \(Y\):

\(X\) needs to have a significant correlation with \(Y\).
\(X\) needs to be a predecessor of \(Y\).

Establishing causality is challenging, especially from observational data, i.e., historically observed data. The key variable of interest \(X\) is often endogenous. That is, something left in the unobserved part of the model jointly affects both \(X\) and the outcome variable \(Y\), which challenges our claim that the change in \(X\) alone causes the change in \(Y\). In the language of econometric modeling, \(X\) is endogenous if \(\text{Corr}(X,\epsilon) \ne 0\).

The way to escape endogeneity is to look for exogeneity, even from a separate source. An additional variable \(Z\) that is correlated with \(X\) (so that its variation can stand in for part of the variation in \(X\)) and uncorrelated with the unobserved error term \(\epsilon\) is called an instrumental variable (IV).

The criteria to pick IV are the following:

(Relevance) The instrument \(Z\) must be correlated with \(X\), i.e. \(\text{Corr}(X,Z) \ne 0\).
(Exclusion) The instrument \(Z\) cannot be correlated with the unobserved error \(\epsilon\), i.e. \(\text{Corr}(\epsilon,Z) = 0\).

Observe that the instrument \(Z\) may be correlated with the outcome variable \(Y\), i.e. \(\text{Corr}(Y,Z) \ne 0\). However, the only reason for \(\text{Corr}(Y,Z) \ne 0\) must be \(\text{Corr}(X,Z) \ne 0\): \(Z\) affects \(Y\) only through \(X\). This is apparent from the IV estimator for the causal effect of \(X\) on \(Y\): \[\beta = \dfrac{\text{Cov}(Y,Z)}{\text{Cov}(X,Z)}\] When we have multiple plausible IVs and other predictors in a multivariate linear regression model, we can use two-stage least squares (2SLS) to obtain the IV estimator in two steps:

(First stage) Run OLS of the endogenous variable \(s_i\) (the \(X\) above, e.g. schooling) on the exogenous controls \(\boldsymbol{X}_i\) and the instruments \(\boldsymbol{Z}_i\), and obtain the fitted value \(\hat{s}_i\).
(Second stage) Run OLS of \(y_i\) on \(\boldsymbol{X}_i\) and the fitted value \(\hat{s}_i\).
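
As a quick sanity check of the two steps, here is a minimal sketch on simulated data (not the card data). The variable names, sample size, and parameter values are illustrative assumptions. With one instrument and no controls, the ratio of covariances and the two explicit OLS stages should give the same estimate, while plain OLS is pulled away from the true effect by the confounder.

```r
# Simulated data: every name and parameter value here is an illustrative assumption
set.seed(1)
n <- 5000
u <- rnorm(n)                      # unobserved confounder, part of the error term
z <- rbinom(n, 1, 0.5)             # binary instrument, independent of u
x <- 0.8 * z + u + rnorm(n)        # endogenous regressor: depends on z and on u
y <- 1 + 0.5 * x + u + rnorm(n)    # outcome; the true effect of x is 0.5

# OLS ignores that x is correlated with the error (through u)
beta_ols <- unname(coef(lm(y ~ x))["x"])

# IV estimator written as a ratio of sample covariances
beta_iv <- cov(y, z) / cov(x, z)

# The same estimate from two explicit OLS stages
x_hat <- fitted(lm(x ~ z))                          # first stage
beta_2sls <- unname(coef(lm(y ~ x_hat))["x_hat"])   # second stage

print(c(ols = beta_ols, iv = beta_iv, two_stage = beta_2sls))
```

The equality between the covariance ratio and the two-stage estimate is exact only in this simple case; with several instruments or extra controls, the two-stage procedure is the general recipe.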

We shall use the following example to illustrate the idea of IV and 2SLS.

College Proximity as IV for Education

OLS Regression

The dataset card in wooldridge comes from Card (1993), who used wage and education data for a sample of men in 1976 to estimate the return to schooling. The instrument for education is a dummy variable for whether someone grew up near a four-year college (nearc4). In a multivariate linear regression for \(\log(wage)\), he included other standard exogenous controls: experience, a black dummy, dummies for living in an SMSA (standard metropolitan statistical area) and living in the South, a full set of regional dummies, etc.

For a complete description of each variable, see the R documentation for card in wooldridge (?wooldridge::card).

# load `card` dataset directly from `wooldridge`
data('card')
head(card)

Let’s first run an OLS regression as a benchmark, keeping in mind that \(educ\) is potentially an endogenous variable.

# define our regression formula
rhs1 = c("educ", "exper", "expersq", "black", "smsa", "south", "smsa66")
regions = paste("reg", 662:669, sep="")
rhs = c(rhs1, regions)
# put our regression equation together
fmla.ols = reformulate(rhs, "lwage")
# print out and check the formula
print(fmla.ols)

## lwage ~ educ + exper + expersq + black + smsa + south + smsa66 + 
##     reg662 + reg663 + reg664 + reg665 + reg666 + reg667 + reg668 + 
##     reg669

# running a multivariate OLS regression of log(wage) on education and others
fit.ols = lm(fmla.ols, data = card)
summary(fit.ols)

## 
## Call:
## lm(formula = fmla.ols, data = card)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.62326 -0.22141  0.02001  0.23932  1.33340 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.6208068  0.0742327  62.248  < 2e-16 ***
## educ         0.0746933  0.0034983  21.351  < 2e-16 ***
## exper        0.0848320  0.0066242  12.806  < 2e-16 ***
## expersq     -0.0022870  0.0003166  -7.223 6.41e-13 ***
## black       -0.1990123  0.0182483 -10.906  < 2e-16 ***
## smsa         0.1363845  0.0201005   6.785 1.39e-11 ***
## south       -0.1479550  0.0259799  -5.695 1.35e-08 ***
## smsa66       0.0262417  0.0194477   1.349  0.17733    
## reg662       0.0963672  0.0358979   2.684  0.00730 ** 
## reg663       0.1445400  0.0351244   4.115 3.97e-05 ***
## reg664       0.0550756  0.0416573   1.322  0.18623    
## reg665       0.1280248  0.0418395   3.060  0.00223 ** 
## reg666       0.1405174  0.0452469   3.106  0.00192 ** 
## reg667       0.1179810  0.0448025   2.633  0.00850 ** 
## reg668      -0.0564361  0.0512579  -1.101  0.27098    
## reg669       0.1185698  0.0388301   3.054  0.00228 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3723 on 2994 degrees of freedom
## Multiple R-squared:  0.2998, Adjusted R-squared:  0.2963 
## F-statistic: 85.48 on 15 and 2994 DF,  p-value: < 2.2e-16

From the benchmark OLS regression output, we can read that our key variable of interest, educ, is positive and significant with a coefficient of 0.0746933, i.e., “a one-year increase in schooling increases the wage by about 7.5% on average, ‘holding all else constant’”. The quotation marks emphasize that this interpretation reads the marginal effect causally, which could be misleading given that educ is likely to be endogenous.
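
Reading a coefficient in a \(\log(wage)\) regression as a percentage is an approximation. The small sketch below compares the usual "100 times the coefficient" reading with the exact conversion \(100(e^{b} - 1)\). The helper name log_points_to_pct is hypothetical, introduced only for illustration.

```r
# Convert a coefficient from a log(wage) regression into an exact percentage change.
# `log_points_to_pct` is a hypothetical helper name, not part of any package.
log_points_to_pct <- function(b) 100 * (exp(b) - 1)

b_educ_ols <- 0.0746933             # OLS estimate for educ reported above
approx_pct <- 100 * b_educ_ols      # the usual "100 * coefficient" approximation
exact_pct  <- log_points_to_pct(b_educ_ols)

print(round(c(approximate = approx_pct, exact = exact_pct), 2))
```

For small coefficients the two readings are close, and the gap widens as the coefficient grows.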

# running a multivariate 2SLS of log(wage) on education and others, with `nearc4` as IV

# observe that the formula for 2SLS highlights the "two stages" of OLS regression
rhs.1stage = c(rhs[-1],"nearc4")
rhs.2stage = rhs
fmla.1stage = paste(rhs.1stage, collapse = " + ")
fmla.2stage = paste("lwage ~ ", paste(rhs.2stage, collapse = " + "), sep = "")
fmla.2sls = paste(fmla.2stage, fmla.1stage, sep = " | ")
fmla.2sls = as.formula(fmla.2sls)
# print out the formula for 2SLS
print(fmla.2sls)

## lwage ~ educ + exper + expersq + black + smsa + south + smsa66 + 
##     reg662 + reg663 + reg664 + reg665 + reg666 + reg667 + reg668 + 
##     reg669 | exper + expersq + black + smsa + south + smsa66 + 
##     reg662 + reg663 + reg664 + reg665 + reg666 + reg667 + reg668 + 
##     reg669 + nearc4

# running a 2SLS with IV being `nearc4`
fit.2sls = ivreg(fmla.2sls, data = card)
summary(fit.2sls)

## 
## Call:
## ivreg(formula = fmla.2sls, data = card)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.83164 -0.24075  0.02428  0.25208  1.42760 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.6661509  0.9248295   3.964 7.54e-05 ***
## educ         0.1315038  0.0549637   2.393  0.01679 *  
## exper        0.1082711  0.0236586   4.576 4.92e-06 ***
## expersq     -0.0023349  0.0003335  -7.001 3.12e-12 ***
## black       -0.1467757  0.0538999  -2.723  0.00650 ** 
## smsa         0.1118083  0.0316620   3.531  0.00042 ***
## south       -0.1446715  0.0272846  -5.302 1.23e-07 ***
## smsa66       0.0185311  0.0216086   0.858  0.39119    
## reg662       0.1007678  0.0376857   2.674  0.00754 ** 
## reg663       0.1482588  0.0368141   4.027 5.78e-05 ***
## reg664       0.0498971  0.0437398   1.141  0.25406    
## reg665       0.1462719  0.0470639   3.108  0.00190 ** 
## reg666       0.1629029  0.0519096   3.138  0.00172 ** 
## reg667       0.1345722  0.0494023   2.724  0.00649 ** 
## reg668      -0.0830770  0.0593314  -1.400  0.16155    
## reg669       0.1078142  0.0418137   2.578  0.00997 ** 
## 
## Diagnostic tests:
##                   df1  df2 statistic  p-value    
## Weak instruments    1 2994    13.256 0.000276 ***
## Wu-Hausman          1 2993     1.168 0.279973    
## Sargan              0   NA        NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3883 on 2994 degrees of freedom
## Multiple R-Squared: 0.2382,  Adjusted R-squared: 0.2343 
## Wald test: 51.01 on 15 and 2994 DF,  p-value: < 2.2e-16

From the 2SLS, we can see that the estimate for educ almost doubles: a one-year increase in schooling causes a 13.2% increase in wage on average, holding all else equal. We also observe that the standard errors from 2SLS are generally larger than those from OLS, the price we pay for fixing endogeneity. We can call car::compareCoefs to quickly compare the estimates and their respective standard errors:

# print out the formulas again for OLS and 2SLS, respectively
print(fmla.ols)

## lwage ~ educ + exper + expersq + black + smsa + south + smsa66 + 
##     reg662 + reg663 + reg664 + reg665 + reg666 + reg667 + reg668 + 
##     reg669

print(fmla.2sls)

## lwage ~ educ + exper + expersq + black + smsa + south + smsa66 + 
##     reg662 + reg663 + reg664 + reg665 + reg666 + reg667 + reg668 + 
##     reg669 | exper + expersq + black + smsa + south + smsa66 + 
##     reg662 + reg663 + reg664 + reg665 + reg666 + reg667 + reg668 + 
##     reg669 + nearc4

# compare estimates and s.e. between OLS and 2SLS
car::compareCoefs(fit.ols, fit.2sls)

## Calls:
## 1: lm(formula = fmla.ols, data = card)
## 2: ivreg(formula = fmla.2sls, data = card)
## 
##               Model 1   Model 2
## (Intercept)    4.6208    3.6662
## SE             0.0742    0.9248
##                                
## educ           0.0747    0.1315
## SE             0.0035    0.0550
##                                
## exper         0.08483   0.10827
## SE            0.00662   0.02366
##                                
## expersq     -0.002287 -0.002335
## SE           0.000317  0.000333
##                                
## black         -0.1990   -0.1468
## SE             0.0182    0.0539
##                                
## smsa           0.1364    0.1118
## SE             0.0201    0.0317
##                                
## south         -0.1480   -0.1447
## SE             0.0260    0.0273
##                                
## smsa66         0.0262    0.0185
## SE             0.0194    0.0216
##                                
## reg662         0.0964    0.1008
## SE             0.0359    0.0377
##                                
## reg663         0.1445    0.1483
## SE             0.0351    0.0368
##                                
## reg664         0.0551    0.0499
## SE             0.0417    0.0437
##                                
## reg665         0.1280    0.1463
## SE             0.0418    0.0471
##                                
## reg666         0.1405    0.1629
## SE             0.0452    0.0519
##                                
## reg667         0.1180    0.1346
## SE             0.0448    0.0494
##                                
## reg668        -0.0564   -0.0831
## SE             0.0513    0.0593
##                                
## reg669         0.1186    0.1078
## SE             0.0388    0.0418
##

Running 2SLS with “Two Stage of OLS”

The point estimates from 2SLS are identical to those obtained by running the two stages manually with OLS, since this is how 2SLS works. The only difference is that the standard errors from the “manual” second stage are incorrect, and we should rely on ivreg for the correct standard errors. Let’s compare the earlier ivreg fit with a “manual” two-stage least squares using OLS.

## first-stage OLS, `educ ~ controls + nearc4`
fmla.ols1 = reformulate(rhs.1stage, "educ")
print(fmla.ols1)

## educ ~ exper + expersq + black + smsa + south + smsa66 + reg662 + 
##     reg663 + reg664 + reg665 + reg666 + reg667 + reg668 + reg669 + 
##     nearc4

fit.1stage = lm(fmla.ols1, data = card)
summary(fit.1stage)

## 
## Call:
## lm(formula = fmla.ols1, data = card)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.545 -1.370 -0.091  1.278  6.239 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 16.6382529  0.2406297  69.145  < 2e-16 ***
## exper       -0.4125334  0.0336996 -12.241  < 2e-16 ***
## expersq      0.0008686  0.0016504   0.526 0.598728    
## black       -0.9355287  0.0937348  -9.981  < 2e-16 ***
## smsa         0.4021825  0.1048112   3.837 0.000127 ***
## south       -0.0516126  0.1354284  -0.381 0.703152    
## smsa66       0.0254805  0.1057692   0.241 0.809644    
## reg662      -0.0786363  0.1871154  -0.420 0.674329    
## reg663      -0.0279390  0.1833745  -0.152 0.878913    
## reg664       0.1171820  0.2172531   0.539 0.589665    
## reg665      -0.2726165  0.2184204  -1.248 0.212082    
## reg666      -0.3028147  0.2370712  -1.277 0.201590    
## reg667      -0.2168177  0.2343879  -0.925 0.355021    
## reg668       0.5238914  0.2674749   1.959 0.050246 .  
## reg669       0.2102710  0.2024568   1.039 0.299076    
## nearc4       0.3198989  0.0878638   3.641 0.000276 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.941 on 2994 degrees of freedom
## Multiple R-squared:  0.4771, Adjusted R-squared:  0.4745 
## F-statistic: 182.1 on 15 and 2994 DF,  p-value: < 2.2e-16

# obtain the fitted value of educ from the first-stage
card$fitted.educ = fit.1stage$fitted.values

## second-stage OLS, `lwage ~ controls + fitted.educ`
rhs.ols2 = c("fitted.educ", rhs[-1])
fmla.ols2 = reformulate(rhs.ols2, "lwage")
print(fmla.ols2)

## lwage ~ fitted.educ + exper + expersq + black + smsa + south + 
##     smsa66 + reg662 + reg663 + reg664 + reg665 + reg666 + reg667 + 
##     reg668 + reg669

fit.2stage = lm(fmla.ols2, data = card)
summary(fit.2stage)

## 
## Call:
## lm(formula = fmla.ols2, data = card)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.57387 -0.25161  0.01483  0.27229  1.38522 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.6661509  0.9508542   3.856 0.000118 ***
## fitted.educ  0.1315038  0.0565103   2.327 0.020028 *  
## exper        0.1082711  0.0243243   4.451 8.85e-06 ***
## expersq     -0.0023349  0.0003429  -6.810 1.18e-11 ***
## black       -0.1467757  0.0554166  -2.649 0.008125 ** 
## smsa         0.1118083  0.0325530   3.435 0.000601 ***
## south       -0.1446715  0.0280524  -5.157 2.67e-07 ***
## smsa66       0.0185311  0.0222167   0.834 0.404286    
## reg662       0.1007678  0.0387462   2.601 0.009349 ** 
## reg663       0.1482588  0.0378501   3.917 9.17e-05 ***
## reg664       0.0498971  0.0449707   1.110 0.267283    
## reg665       0.1462719  0.0483883   3.023 0.002525 ** 
## reg666       0.1629029  0.0533703   3.052 0.002291 ** 
## reg667       0.1345722  0.0507925   2.649 0.008105 ** 
## reg668      -0.0830770  0.0610009  -1.362 0.173333    
## reg669       0.1078142  0.0429903   2.508 0.012199 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3993 on 2994 degrees of freedom
## Multiple R-squared:  0.1947, Adjusted R-squared:  0.1907 
## F-statistic: 48.25 on 15 and 2994 DF,  p-value: < 2.2e-16

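As we can see, the estimate for educ is identical (0.1315038) under 2SLS and “two-stage OLS”, but the standard errors are not: ivreg reports 0.0549637, whereas the manual second stage reports 0.0565103. The manual second stage computes its residuals from fitted.educ rather than from educ, so its standard errors should not be used.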
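
To see where the discrepancy in standard errors comes from, here is a minimal sketch on simulated data (names and parameter values are illustrative assumptions). It assumes the textbook homoskedastic 2SLS variance formula: the coefficients come from the second stage, but the residual variance is computed from residuals that use the actual regressor rather than its fitted value.

```r
# Simulated data: every name and parameter value here is an illustrative assumption
set.seed(2)
n <- 2000
u <- rnorm(n)                                # unobserved confounder
z <- rnorm(n)                                # instrument
w <- rnorm(n)                                # exogenous control
x <- 0.5 * z + 0.3 * w + u + rnorm(n)        # endogenous regressor
y <- 1 + 0.5 * x + 0.2 * w + u + rnorm(n)    # outcome

stage1 <- lm(x ~ z + w)
x_hat  <- fitted(stage1)
stage2 <- lm(y ~ x_hat + w)
b      <- coef(stage2)                       # 2SLS point estimates

X     <- cbind(1, x, w)                      # design matrix with the actual x
X_hat <- cbind(1, x_hat, w)                  # design matrix with the fitted x

# Naive standard errors: residuals are computed with x_hat
se_naive <- sqrt(diag(vcov(stage2)))

# Corrected standard errors: residuals are computed with the actual x
res_2sls <- as.vector(y - X %*% b)
sigma2   <- sum(res_2sls^2) / (n - ncol(X))
se_2sls  <- sqrt(diag(sigma2 * solve(crossprod(X_hat))))
names(se_2sls) <- names(b)

print(rbind(naive = se_naive, corrected = se_2sls))
```

Both sets of standard errors share the same \((\hat{X}'\hat{X})^{-1}\) term and differ only by the ratio of the two residual standard errors, which matches the pattern in the outputs above (0.3993 versus 0.3883).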

The first-stage result is interesting in its own right, since it tells us whether the instrument is relevant for the endogenous variable. In this case, it is safe to conclude that the proposed IV nearc4 is correlated with the endogenous variable educ, since the coefficient for nearc4 is highly statistically significant with a \(p\)-value of 0.000276. The magnitude of the “correlation” can be read from the estimate itself, 0.32 (note that a coefficient in a multivariate linear regression is not equal to the correlation coefficient \(\rho\)). We can also test the correlation between educ and nearc4 more directly.

# test the significance of correlation between two variables 
cor.test(card$educ, card$nearc4, method = "spearman")

## 
##  Spearman's rank correlation rho
## 
## data:  card$educ and card$nearc4
## S = 3924903877, p-value = 5.522e-14
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.1364632

Since one of the two variables, nearc4, is a dummy (categorical variable), a rank-based test such as Spearman’s is a reasonable choice here. The correlation test above is consistent with what we observed in the first-stage multivariate regression, although it measures the raw correlation without holding the controls fixed.

You might notice that we only tested relevance, without mentioning any test of exclusive exogeneity. Unfortunately, there is no statistical test of whether \(\text{Corr}(Z, \epsilon) = 0\) when we have a single IV for the endogenous variable. Such a test is only possible when we have more IVs than endogenous variables, and even then only under an additional restriction.
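
The difference between a raw correlation and the first-stage coefficient can be made concrete with a small sketch on simulated data (names and parameter values are illustrative assumptions). It partials the control out of both variables, in the spirit of the Frisch-Waugh-Lovell result: the slope between the residualized variables reproduces the first-stage coefficient, and the partial correlation need not agree with the raw one.

```r
# Simulated data: every name and parameter value here is an illustrative assumption
set.seed(3)
n <- 3000
w <- rnorm(n)                                  # exogenous control
z <- as.numeric(0.8 * w + rnorm(n) > 0)        # dummy instrument, correlated with w
x <- 0.3 * z - 0.5 * w + rnorm(n)              # endogenous regressor

raw_cor <- cor(x, z)                           # ignores the control

# Partial the control out of both variables
x_res <- resid(lm(x ~ w))
z_res <- resid(lm(z ~ w))
partial_cor <- cor(x_res, z_res)               # correlation holding w fixed

# The slope on the residualized instrument equals the first-stage coefficient on z
b_fwl         <- unname(coef(lm(x_res ~ z_res))["z_res"])
b_first_stage <- unname(coef(lm(x ~ z + w))["z"])

print(c(raw = raw_cor, partial = partial_cor, fwl = b_fwl, first_stage = b_first_stage))
```

For relevance in a 2SLS with controls, the partial relationship is the one that matters, which is why the first-stage regression is the more informative check.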

Using Multiple IVs in 2SLS

As you might have noticed, in addition to nearc4, card contains another variable, nearc2 (whether someone lived near a two-year college). If we believe that both nearc2 and nearc4 are plausible IVs for educ, we can include both in the 2SLS or, to be more precise, in the first stage of 2SLS.

# running a multivariate 2SLS of log(wage) on education and others, with both `nearc4` and `nearc2` as IVs

# observe that the formula for 2SLS highlights the "two stages" of OLS regression
rhs.1stage.2iv = c(rhs[-1],"nearc4", "nearc2")
fmla.1stage.2iv = paste(rhs.1stage.2iv, collapse = " + ")
# use a separate name so the single-IV formula `fmla.2sls` is not overwritten
fmla.2sls.2iv = paste(fmla.2stage, fmla.1stage.2iv, sep = " | ")
fmla.2sls.2iv = as.formula(fmla.2sls.2iv)
# print out the formula for 2SLS
print(fmla.2sls.2iv)

## lwage ~ educ + exper + expersq + black + smsa + south + smsa66 + 
##     reg662 + reg663 + reg664 + reg665 + reg666 + reg667 + reg668 + 
##     reg669 | exper + expersq + black + smsa + south + smsa66 + 
##     reg662 + reg663 + reg664 + reg665 + reg666 + reg667 + reg668 + 
##     reg669 + nearc4 + nearc2

# running a 2SLS with both `nearc4` and `nearc2` as IVs
fit.2sls.2iv = ivreg(fmla.2sls.2iv, data = card)
summary(fit.2sls.2iv)

## 
## Call:
## ivreg(formula = fmla.2sls.2iv, data = card)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.93841 -0.25068  0.01932  0.26519  1.46998 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.2367108  0.8849118   3.658 0.000259 ***
## educ         0.1570594  0.0525782   2.987 0.002839 ** 
## exper        0.1188149  0.0228061   5.210 2.02e-07 ***
## expersq     -0.0023565  0.0003475  -6.781 1.43e-11 ***
## black       -0.1232778  0.0521500  -2.364 0.018147 *  
## smsa         0.1007530  0.0315193   3.197 0.001405 ** 
## south       -0.1431945  0.0284448  -5.034 5.08e-07 ***
## smsa66       0.0150626  0.0223360   0.674 0.500132    
## reg662       0.1027473  0.0392906   2.615 0.008966 ** 
## reg663       0.1499316  0.0383918   3.905 9.62e-05 ***
## reg664       0.0475676  0.0456013   1.043 0.296977    
## reg665       0.1544801  0.0485628   3.181 0.001482 ** 
## reg666       0.1729728  0.0534164   3.238 0.001216 ** 
## reg667       0.1420356  0.0511219   2.778 0.005497 ** 
## reg668      -0.0950611  0.0609801  -1.559 0.119129    
## reg669       0.1029760  0.0434224   2.371 0.017779 *  
## 
## Diagnostic tests:
##                   df1  df2 statistic  p-value    
## Weak instruments    2 2993     7.893 0.000381 ***
## Wu-Hausman          1 2993     2.926 0.087286 .  
## Sargan              1   NA     1.248 0.263905    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4053 on 2994 degrees of freedom
## Multiple R-Squared: 0.1702,  Adjusted R-squared: 0.166 
## Wald test: 47.07 on 15 and 2994 DF,  p-value: < 2.2e-16

# let's inspect how relevant `nearc4` and `nearc2` are in the first stage
fmla.ols1.2iv = reformulate(rhs.1stage.2iv, "educ")
print(fmla.ols1.2iv)

## educ ~ exper + expersq + black + smsa + south + smsa66 + reg662 + 
##     reg663 + reg664 + reg665 + reg666 + reg667 + reg668 + reg669 + 
##     nearc4 + nearc2

fit.ols1.2iv = lm(fmla.ols1.2iv, data = card)
summary(fit.ols1.2iv)

## 
## Call:
## lm(formula = fmla.ols1.2iv, data = card)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.5851 -1.3845 -0.0823  1.2765  6.2930 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.660e+01  2.415e-01  68.750  < 2e-16 ***
## exper       -4.123e-01  3.369e-02 -12.237  < 2e-16 ***
## expersq      8.479e-04  1.650e-03   0.514 0.607379    
## black       -9.452e-01  9.391e-02 -10.065  < 2e-16 ***
## smsa         4.014e-01  1.048e-01   3.830 0.000131 ***
## south       -4.191e-02  1.355e-01  -0.309 0.757162    
## smsa66       7.825e-05  1.069e-01   0.001 0.999416    
## reg662      -1.002e-01  1.876e-01  -0.534 0.593049    
## reg663      -2.143e-02  1.834e-01  -0.117 0.906981    
## reg664       1.311e-01  2.174e-01   0.603 0.546580    
## reg665      -2.684e-01  2.184e-01  -1.229 0.219228    
## reg666      -3.334e-01  2.378e-01  -1.402 0.160948    
## reg667      -2.087e-01  2.344e-01  -0.891 0.373199    
## reg668       5.508e-01  2.679e-01   2.056 0.039906 *  
## reg669       1.688e-01  2.041e-01   0.827 0.408286    
## nearc4       3.206e-01  8.784e-02   3.650 0.000267 ***
## nearc2       1.230e-01  7.743e-02   1.589 0.112256    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.94 on 2993 degrees of freedom
## Multiple R-squared:  0.4776, Adjusted R-squared:  0.4748 
## F-statistic:   171 on 16 and 2993 DF,  p-value: < 2.2e-16

# we could also jointly test `nearc4 = nearc2 = 0` using an F-test (Wald test of linear restrictions)
linearHypothesis(fit.ols1.2iv, c("nearc4 = 0", "nearc2 = 0"))

From the 2SLS with two IVs (nearc2 and nearc4), we can see that the estimate for educ is somewhat larger than that with one IV (nearc4): a one-year increase in schooling causes a 15.7% increase in wage on average, holding all else equal. To see whether nearc2 and nearc4 are correlated with educ, we ran the first stage of 2SLS, i.e. educ \(\sim\) controls + nearc4 + nearc2. In its regression output, nearc4 is highly significant while nearc2 is not significant at conventional levels (\(p\)-value 0.112). We also ran an F-test of the joint null nearc4 = nearc2 = 0. Not surprisingly, they are jointly significant, in line with the weak-instruments statistic of 7.893 (\(p\)-value 0.000381) reported by ivreg above. Overall, in terms of relevance, nearc4 seems to be more strongly correlated with educ, which partially explains why nearc4 was chosen as the sole IV in Card (1993). Such a test of whether all instruments are jointly relevant is called a weak instrument test, and it extends the single-IV test from the previous section.
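
The joint F-test on the instruments can also be computed by hand from the residual sums of squares of the first stage with and without the instruments. The sketch below uses simulated data (names and parameter values are illustrative assumptions) and checks the hand-computed statistic against base R's anova() for nested models.

```r
# Simulated data: every name and parameter value here is an illustrative assumption
set.seed(4)
n  <- 1000
w  <- rnorm(n)                                   # exogenous control
z1 <- rnorm(n)                                   # first instrument
z2 <- rnorm(n)                                   # second instrument
x  <- 0.3 * z1 + 0.1 * z2 + 0.5 * w + rnorm(n)   # endogenous regressor

unrestricted <- lm(x ~ w + z1 + z2)              # first stage with both instruments
restricted   <- lm(x ~ w)                        # first stage under H0: no instrument matters

rss_u <- sum(resid(unrestricted)^2)
rss_r <- sum(resid(restricted)^2)
q     <- 2                                       # number of restrictions (instruments)
df_u  <- df.residual(unrestricted)

# F statistic for the joint null that both instrument coefficients are zero
f_stat  <- ((rss_r - rss_u) / q) / (rss_u / df_u)
p_value <- pf(f_stat, q, df_u, lower.tail = FALSE)

# Cross-check against the nested-model comparison in base R
f_anova <- anova(restricted, unrestricted)$F[2]

print(c(F = f_stat, F_anova = f_anova, p = p_value))
```

With a single instrument the same statistic reduces to the squared first-stage t statistic, which is why 3.641 squared is roughly the 13.256 reported in the single-IV diagnostics.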

Specification Tests in Two-Stage Least Squares (2SLS)

There are three tests we’re interested in, for “model specification” in 2SLS and IV approach.

Weak instrument test: whether all plausible instruments are jointly correlated with the endogenous variable, as we discussed just now.

If we fail to reject the null, none of our IVs is relevant, or they are not strong enough to represent the variation of the endogenous variable.
If we reject the null, we are glad to know that at least one of the IVs is relevant.

Hausman test for endogeneity: whether the suspect variable is endogenous in the first place, with \(H_0\colon \text{Corr}(X, \epsilon) = 0\) (no endogeneity).

If we fail to reject the null, both OLS and 2SLS estimators are consistent. However, the standard errors from OLS are much smaller than those from 2SLS, so OLS is preferred.
If we reject the null, only 2SLS under a valid IV delivers a consistent estimator.

Over-identification (Sargan) test: whether all plausible IVs are exclusively exogenous, when we have more IVs than we need and assume that at least one of them is exogenous, i.e. \(H_0\colon \text{Corr}(Z_i, \epsilon) = 0\) for all \(i \ge 2\).

If we reject the null, we conclude that at least some of the proposed IVs are not exclusively exogenous. Unfortunately, the test does not tell us which are exogenous and which are not. We need to compare the IV estimators obtained with each plausible IV serving as the sole IV and argue for which one is valid to use.
If we fail to reject the null, we have no evidence against the exogeneity of the plausible IVs.¹
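
The endogeneity test can be sketched by hand in its regression-based (Durbin-Wu-Hausman) form, which is an assumption about the variant used here: add the first-stage residuals to the outcome regression and test their coefficient. The data are simulated and all names and parameter values are illustrative assumptions.

```r
# Simulated data: every name and parameter value here is an illustrative assumption
set.seed(5)
n <- 2000
u <- rnorm(n)                       # unobserved confounder, makes x endogenous
z <- rnorm(n)                       # instrument
x <- 0.6 * z + u + rnorm(n)         # endogenous regressor
y <- 1 + 0.5 * x + u + rnorm(n)     # outcome

stage1 <- lm(x ~ z)
v_hat  <- resid(stage1)             # part of x not explained by the instrument

# Augmented regression: a non-zero coefficient on v_hat signals endogeneity of x
aug <- lm(y ~ x + v_hat)
hausman_row <- summary(aug)$coefficients["v_hat", ]
print(hausman_row)

# Side result: the coefficient on x in the augmented regression equals the 2SLS estimate
x_hat   <- fitted(stage1)
b_2sls  <- unname(coef(lm(y ~ x_hat))["x_hat"])
b_aug_x <- unname(coef(aug)["x"])
print(c(two_stage = b_2sls, augmented = b_aug_x))
```

In this simulation x is endogenous by construction, so the test should reject; with real data the outcome is of course not known in advance.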

All these specification tests are readily provided by ivreg: call summary(2SLS_object, diagnostics = TRUE) on the fitted object. Let’s use the case where both nearc2 and nearc4 serve as IVs for educ. We call this case “over-identified” simply because we have more IVs than we need for our endogenous variable: two IVs versus one endogenous variable.

# conduct specification tests after 2SLS
summary(fit.2sls.2iv, diagnostics = TRUE)

## 
## Call:
## ivreg(formula = fmla.2sls.2iv, data = card)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.93841 -0.25068  0.01932  0.26519  1.46998 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.2367108  0.8849118   3.658 0.000259 ***
## educ         0.1570594  0.0525782   2.987 0.002839 ** 
## exper        0.1188149  0.0228061   5.210 2.02e-07 ***
## expersq     -0.0023565  0.0003475  -6.781 1.43e-11 ***
## black       -0.1232778  0.0521500  -2.364 0.018147 *  
## smsa         0.1007530  0.0315193   3.197 0.001405 ** 
## south       -0.1431945  0.0284448  -5.034 5.08e-07 ***
## smsa66       0.0150626  0.0223360   0.674 0.500132    
## reg662       0.1027473  0.0392906   2.615 0.008966 ** 
## reg663       0.1499316  0.0383918   3.905 9.62e-05 ***
## reg664       0.0475676  0.0456013   1.043 0.296977    
## reg665       0.1544801  0.0485628   3.181 0.001482 ** 
## reg666       0.1729728  0.0534164   3.238 0.001216 ** 
## reg667       0.1420356  0.0511219   2.778 0.005497 ** 
## reg668      -0.0950611  0.0609801  -1.559 0.119129    
## reg669       0.1029760  0.0434224   2.371 0.017779 *  
## 
## Diagnostic tests:
##                   df1  df2 statistic  p-value    
## Weak instruments    2 2993     7.893 0.000381 ***
## Wu-Hausman          1 2993     2.926 0.087286 .  
## Sargan              1   NA     1.248 0.263905    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4053 on 2994 degrees of freedom
## Multiple R-Squared: 0.1702,  Adjusted R-squared: 0.166 
## Wald test: 47.07 on 15 and 2994 DF,  p-value: < 2.2e-16

From the result of specification tests:

We reject the null in the weak instrument test, so we conclude that at least one of the IVs (nearc2 and nearc4) is relevant to the endogenous variable (educ). This is consistent with our previous joint F-test and is a welcome result for our IV approach: at least one IV is relevant!² Note, however, that the statistic of 7.893 is below the common rule-of-thumb value of 10, so the instruments may still be weak in practice.
The Hausman endogeneity test is borderline (\(p\)-value 0.087): we reject exogeneity of educ at the 10% level but not at the 5% level. A conservative analyst would rely on 2SLS rather than OLS, and it is reassuring that educ is significant in both OLS and 2SLS. Otherwise, it would be a hard choice between OLS and 2SLS (remember that standard errors from 2SLS are generally much larger than those from OLS).
We fail to reject the null in the over-identification (Sargan) test (\(p\)-value 0.264), so we find no evidence against nearc2 and nearc4 being exclusively exogenous IVs, assuming that at least one of the two is exogenous (which is untestable).
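
For intuition about the over-identification statistic, here is a minimal sketch of the textbook Sargan computation on simulated data: regress the 2SLS residuals on all instruments and use \(n R^2\), compared against a chi-squared distribution with degrees of freedom equal to the number of instruments minus the number of endogenous variables. All names and parameter values are illustrative assumptions, and this is not claimed to reproduce ivreg's internals line by line.

```r
# Simulated data: every name and parameter value here is an illustrative assumption
set.seed(6)
n  <- 2000
u  <- rnorm(n)                               # unobserved confounder
z1 <- rnorm(n)                               # first instrument (valid by construction)
z2 <- rnorm(n)                               # second instrument (valid by construction)
x  <- 0.5 * z1 + 0.5 * z2 + u + rnorm(n)     # endogenous regressor
y  <- 1 + 0.5 * x + u + rnorm(n)             # outcome

# 2SLS point estimates via two OLS stages
x_hat <- fitted(lm(x ~ z1 + z2))
b     <- unname(coef(lm(y ~ x_hat)))

# 2SLS residuals must use the actual x, not the fitted x
res_2sls <- y - b[1] - b[2] * x

# Auxiliary regression of the residuals on all instruments
aux <- lm(res_2sls ~ z1 + z2)
sargan_stat <- n * summary(aux)$r.squared
df_sargan   <- 2 - 1                         # two instruments, one endogenous variable
p_value     <- pchisq(sargan_stat, df_sargan, lower.tail = FALSE)

print(c(statistic = sargan_stat, df = df_sargan, p = p_value))
```

With exactly as many instruments as endogenous variables the degrees of freedom are zero, which is why the single-IV output earlier shows NA in the Sargan row.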

Supporting Materials

Some good reads for IV and 2SLS in R with more examples: R Bookdown’s The Instrumental Variables (IV) Method.

References

Card, David. 1993. “Using Geographic Variation in College Proximity to Estimate the Return to Schooling.” Working Paper 4483. Working Paper Series. National Bureau of Economic Research. https://doi.org/10.3386/w4483.

¹ You might be curious why we assume that “at least one of them is exogenous” for the over-identification test. Without this assumption, there are two possibilities when we fail to reject the null: either all IVs are exogenous, or none of them is. Let’s use nearc2 and nearc4 for educ as an example. The idea of the over-identification test is to first run 2SLS with either nearc2 or nearc4 as the sole IV and obtain two 2SLS estimators, \(\hat{\beta}_{c2}\) and \(\hat{\beta}_{c4}\), respectively. The test is then based on whether \(H_0\colon \hat{\beta}_{c2} - \hat{\beta}_{c4} = 0\) holds. Therefore, we could be in the (subtle and challenging) case where neither nearc2 nor nearc4 is exclusively exogenous, yet \(\hat{\beta}_{c2}\) and \(\hat{\beta}_{c4}\) are not statistically different from each other. Just as we cannot test whether a single IV is truly exogenous, we cannot test whether at least one of our plausible IVs is exogenous to begin with in the over-identification test.
² From the first-stage OLS regression in the previous section, we saw that nearc4 is much more significant and has a larger coefficient than nearc2.

Causal Inference with Instrumental Variable (IV)

IS5126 Hands-on with Applied Analytics

Yingda Z.

March 10, 2021