R provides a one-stop shop for the instrumental variables (IV) approach and two-stage least squares (2SLS) in the popular ivreg package. It comes with the standard regression functionality: estimate reporting, regression inference, heteroskedasticity-robust and clustered standard errors, prediction, and so on. It also works well with diagnostic packages such as car, sandwich, lmtest, etc. Run install.packages("ivreg") to set up the package locally.
# load necessary library and packages for the analysis
library(dplyr)
library(tidyr)
library(RColorBrewer)
library(ggplot2)
library(knitr)
library(png)
library(animation)
library(gifski)
library(table1)
library(gtsummary)
library(gt)
library(corrplot)
library(xtable)
library(car)
library(DT)
library(AER)
library(plm)
library(stargazer)
library(lattice)
library(wooldridge)
library(ivreg) # IV-regression
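Since ivreg fits plug into the sandwich and lmtest machinery mentioned above, robust inference takes only one extra line. Below is a minimal sketch under a deliberately stripped-down IV specification for the card data used later in this section (the object name fit.demo and the HC1 flavor are illustrative choices, not part of the analysis below):
# a quick sketch of heteroskedasticity-robust inference for an ivreg fit
library(sandwich)
library(lmtest)
data("card", package = "wooldridge")
fit.demo = ivreg(lwage ~ educ | nearc4, data = card) # toy model: one IV, no controls
coeftest(fit.demo, vcov = vcovHC(fit.demo, type = "HC1"))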
We say a factor \(X\) has a causal interpretation for an outcome variable \(Y\) if a change in \(X\) has a significant effect on \(Y\), holding all other factors constant or random. Two basic requirements must hold for \(X\) to be a causal factor for \(Y\): (1) \(X\) must move with \(Y\), i.e., the two are correlated; and (2) that co-movement must not be driven by anything else, i.e., all other factors are held constant or vary randomly.
Establishing causality is challenging, especially with observational (i.e., historically observed) data. The key variable of interest \(X\) is often endogenous: something left in the unobserved error term jointly affects both \(X\) and the outcome variable \(Y\), which undermines the claim that the change in \(X\) alone causes the change in \(Y\). In the language of econometric modeling, \(X\) is endogenous if \(\text{Corr}(X,\epsilon) \ne 0\).
The way to escape endogeneity is to look for a separate source of exogeneity. An additional variable \(Z\) that is correlated with \(X\) (so its variation can stand in for that of \(X\)) and uncorrelated with the unobserved error term \(\epsilon\) is called an instrumental variable (IV).
The criteria for picking an IV are therefore: (1) relevance, \(\text{Corr}(X,Z) \ne 0\); and (2) (exclusive) exogeneity, \(\text{Corr}(Z,\epsilon) = 0\).
Observe that the instrument \(Z\) may well be correlated with the outcome variable \(Y\), i.e., \(\text{Corr}(Y,Z) \ne 0\). However, the only channel for \(\text{Corr}(Y,Z) \ne 0\) is \(\text{Corr}(X,Z) \ne 0\). This is apparent from the IV estimator of the causal effect of \(X\) on \(Y\): \[\beta = \dfrac{\text{Cov}(Y,Z)}{\text{Cov}(X,Z)}\] which follows from \(Y = \beta X + \epsilon\), since \(\text{Cov}(Y,Z) = \beta\,\text{Cov}(X,Z) + \text{Cov}(\epsilon,Z)\) and the second term vanishes for an exogenous instrument. When we have multiple plausible IVs and other predictors in a multivariate linear regression model, we use two-stage least squares (2SLS) to obtain the IV estimators.
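As a quick sanity check on this formula, here is a minimal sketch computing its sample analogue on the card data analyzed below (single instrument, no controls, so it coincides with a simple IV regression):
# sample analogue of beta = Cov(Y,Z)/Cov(X,Z), with Y = lwage, X = educ, Z = nearc4
data("card", package = "wooldridge")
with(card, cov(lwage, nearc4) / cov(educ, nearc4))
# the same number as coef(ivreg(lwage ~ educ | nearc4, data = card))["educ"]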
We shall use the following example to illustrate the idea of IV and 2SLS.
The dataset card in wooldridge (see Card (1993) for details) contains wage and education data for a sample of men in 1976, used to estimate the return to schooling. The instrument for education is a dummy variable for whether someone grew up near a four-year college (nearc4). In a \(\log(wage)\) multivariate linear regression, Card included other standard exogenous controls: experience, a black dummy, dummies for living in an SMSA (standard metropolitan statistical area) and living in the South, a full set of regional dummies, etc.
For a complete description of each variable, see the R documentation under wooldridge/card.
Let’s first run a plain OLS regression as a benchmark. Keep in mind that educ is potentially an endogenous variable.
# define our regression formula
rhs1 = c("educ", "exper", "expersq", "black", "smsa", "south", "smsa66")
regions = paste("reg", 662:669, sep="")
rhs = c(rhs1, regions)
# put our regression equation together
fmla.ols = reformulate(rhs, "lwage")
# print out and check the formula
print(fmla.ols)
## lwage ~ educ + exper + expersq + black + smsa + south + smsa66 +
## reg662 + reg663 + reg664 + reg665 + reg666 + reg667 + reg668 +
## reg669
# running a multivariate OLS regression of log(wage) on education and others
fit.ols = lm(fmla.ols, data = card)
summary(fit.ols)
##
## Call:
## lm(formula = fmla.ols, data = card)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.62326 -0.22141 0.02001 0.23932 1.33340
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.6208068 0.0742327 62.248 < 2e-16 ***
## educ 0.0746933 0.0034983 21.351 < 2e-16 ***
## exper 0.0848320 0.0066242 12.806 < 2e-16 ***
## expersq -0.0022870 0.0003166 -7.223 6.41e-13 ***
## black -0.1990123 0.0182483 -10.906 < 2e-16 ***
## smsa 0.1363845 0.0201005 6.785 1.39e-11 ***
## south -0.1479550 0.0259799 -5.695 1.35e-08 ***
## smsa66 0.0262417 0.0194477 1.349 0.17733
## reg662 0.0963672 0.0358979 2.684 0.00730 **
## reg663 0.1445400 0.0351244 4.115 3.97e-05 ***
## reg664 0.0550756 0.0416573 1.322 0.18623
## reg665 0.1280248 0.0418395 3.060 0.00223 **
## reg666 0.1405174 0.0452469 3.106 0.00192 **
## reg667 0.1179810 0.0448025 2.633 0.00850 **
## reg668 -0.0564361 0.0512579 -1.101 0.27098
## reg669 0.1185698 0.0388301 3.054 0.00228 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3723 on 2994 degrees of freedom
## Multiple R-squared: 0.2998, Adjusted R-squared: 0.2963
## F-statistic: 85.48 on 15 and 2994 DF, p-value: < 2.2e-16
From the benchmark OLS regression output, our key variable of interest educ is significantly positive, with a coefficient of 0.0747, i.e., “a one-year increase in schooling increases the wage by about 7.5% on average, holding all else constant”. The quotation marks emphasize that this interpretation suggests a causal marginal effect, which could be misleading given that educ is likely to be endogenous.
# running a multivariate 2SLS of log(wage) on education and others, with `nearc4` as IV
# observe that the formula for 2SLS highlights the "two stages" of OLS regression
rhs.1stage = c(rhs[-1], "nearc4") # drop the endogenous educ, append the instrument
rhs.2stage = rhs
fmla.1stage = paste(rhs.1stage, collapse = " + ")
fmla.2stage = paste("lwage ~ ", paste(rhs.2stage, collapse = " + "), sep = "")
fmla.2sls = paste(fmla.2stage, fmla.1stage, sep = " | ")
fmla.2sls = as.formula(fmla.2sls)
# print out the formula for 2SLS
print(fmla.2sls)
## lwage ~ educ + exper + expersq + black + smsa + south + smsa66 +
## reg662 + reg663 + reg664 + reg665 + reg666 + reg667 + reg668 +
## reg669 | exper + expersq + black + smsa + south + smsa66 +
## reg662 + reg663 + reg664 + reg665 + reg666 + reg667 + reg668 +
## reg669 + nearc4
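# fit the 2SLS with `nearc4` as the sole IV; ivreg's summary also reports
# the IV diagnostics (weak instruments, Wu-Hausman, Sargan) by default
fit.2sls = ivreg(fmla.2sls, data = card)
summary(fit.2sls)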
##
## Call:
## ivreg(formula = fmla.2sls, data = card)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.83164 -0.24075 0.02428 0.25208 1.42760
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6661509 0.9248295 3.964 7.54e-05 ***
## educ 0.1315038 0.0549637 2.393 0.01679 *
## exper 0.1082711 0.0236586 4.576 4.92e-06 ***
## expersq -0.0023349 0.0003335 -7.001 3.12e-12 ***
## black -0.1467757 0.0538999 -2.723 0.00650 **
## smsa 0.1118083 0.0316620 3.531 0.00042 ***
## south -0.1446715 0.0272846 -5.302 1.23e-07 ***
## smsa66 0.0185311 0.0216086 0.858 0.39119
## reg662 0.1007678 0.0376857 2.674 0.00754 **
## reg663 0.1482588 0.0368141 4.027 5.78e-05 ***
## reg664 0.0498971 0.0437398 1.141 0.25406
## reg665 0.1462719 0.0470639 3.108 0.00190 **
## reg666 0.1629029 0.0519096 3.138 0.00172 **
## reg667 0.1345722 0.0494023 2.724 0.00649 **
## reg668 -0.0830770 0.0593314 -1.400 0.16155
## reg669 0.1078142 0.0418137 2.578 0.00997 **
##
## Diagnostic tests:
## df1 df2 statistic p-value
## Weak instruments 1 2994 13.256 0.000276 ***
## Wu-Hausman 1 2993 1.168 0.279973
## Sargan 0 NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3883 on 2994 degrees of freedom
## Multiple R-Squared: 0.2382, Adjusted R-squared: 0.2343
## Wald test: 51.01 on 15 and 2994 DF, p-value: < 2.2e-16
From the 2SLS output, we can see that the estimate for educ almost doubles: a one-year increase in schooling now increases the wage by about 13.2% on average, holding all else equal. We also observe that the 2SLS standard errors are generally larger than the OLS ones, the price we pay for fixing endogeneity. We can call car::compareCoefs to compare the estimates and their respective standard errors side by side:
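# compare the OLS and 2SLS fits coefficient by coefficient
compareCoefs(fit.ols, fit.2sls)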
## lwage ~ educ + exper + expersq + black + smsa + south + smsa66 +
## reg662 + reg663 + reg664 + reg665 + reg666 + reg667 + reg668 +
## reg669
## lwage ~ educ + exper + expersq + black + smsa + south + smsa66 +
## reg662 + reg663 + reg664 + reg665 + reg666 + reg667 + reg668 +
## reg669 | exper + expersq + black + smsa + south + smsa66 +
## reg662 + reg663 + reg664 + reg665 + reg666 + reg667 + reg668 +
## reg669 + nearc4
## Calls:
## 1: lm(formula = fmla.ols, data = card)
## 2: ivreg(formula = fmla.2sls, data = card)
##
## Model 1 Model 2
## (Intercept) 4.6208 3.6662
## SE 0.0742 0.9248
##
## educ 0.0747 0.1315
## SE 0.0035 0.0550
##
## exper 0.08483 0.10827
## SE 0.00662 0.02366
##
## expersq -0.002287 -0.002335
## SE 0.000317 0.000333
##
## black -0.1990 -0.1468
## SE 0.0182 0.0539
##
## smsa 0.1364 0.1118
## SE 0.0201 0.0317
##
## south -0.1480 -0.1447
## SE 0.0260 0.0273
##
## smsa66 0.0262 0.0185
## SE 0.0194 0.0216
##
## reg662 0.0964 0.1008
## SE 0.0359 0.0377
##
## reg663 0.1445 0.1483
## SE 0.0351 0.0368
##
## reg664 0.0551 0.0499
## SE 0.0417 0.0437
##
## reg665 0.1280 0.1463
## SE 0.0418 0.0471
##
## reg666 0.1405 0.1629
## SE 0.0452 0.0519
##
## reg667 0.1180 0.1346
## SE 0.0448 0.0494
##
## reg668 -0.0564 -0.0831
## SE 0.0513 0.0593
##
## reg669 0.1186 0.1078
## SE 0.0388 0.0418
##
The 2SLS result should match what we get from running stages 1 and 2 as separate OLS regressions, since that is how 2SLS works. The only difference is that the standard errors from “manually” running the two OLS stages are incorrect, and we should rely on ivreg for the correct computation of standard errors. Let’s compare ivreg from before against “manually” running two-stage least squares with OLS.
## first-stage OLS, `educ ~ controls + nearc4`
fmla.ols1 = reformulate(rhs.1stage, "educ")
print(fmla.ols1)
## educ ~ exper + expersq + black + smsa + south + smsa66 + reg662 +
## reg663 + reg664 + reg665 + reg666 + reg667 + reg668 + reg669 +
## nearc4
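# fit the first stage by OLS; `fit.1stage` supplies the fitted values below
fit.1stage = lm(fmla.ols1, data = card)
summary(fit.1stage)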
##
## Call:
## lm(formula = fmla.ols1, data = card)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.545 -1.370 -0.091 1.278 6.239
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.6382529 0.2406297 69.145 < 2e-16 ***
## exper -0.4125334 0.0336996 -12.241 < 2e-16 ***
## expersq 0.0008686 0.0016504 0.526 0.598728
## black -0.9355287 0.0937348 -9.981 < 2e-16 ***
## smsa 0.4021825 0.1048112 3.837 0.000127 ***
## south -0.0516126 0.1354284 -0.381 0.703152
## smsa66 0.0254805 0.1057692 0.241 0.809644
## reg662 -0.0786363 0.1871154 -0.420 0.674329
## reg663 -0.0279390 0.1833745 -0.152 0.878913
## reg664 0.1171820 0.2172531 0.539 0.589665
## reg665 -0.2726165 0.2184204 -1.248 0.212082
## reg666 -0.3028147 0.2370712 -1.277 0.201590
## reg667 -0.2168177 0.2343879 -0.925 0.355021
## reg668 0.5238914 0.2674749 1.959 0.050246 .
## reg669 0.2102710 0.2024568 1.039 0.299076
## nearc4 0.3198989 0.0878638 3.641 0.000276 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.941 on 2994 degrees of freedom
## Multiple R-squared: 0.4771, Adjusted R-squared: 0.4745
## F-statistic: 182.1 on 15 and 2994 DF, p-value: < 2.2e-16
# obtain the fitted value of educ from the first-stage
card$fitted.educ = fit.1stage$fitted.values
## second-stage OLS, `lwage ~ controls + fitted.educ`
rhs.ols2 = c("fitted.educ", rhs[-1])
fmla.ols2 = reformulate(rhs.ols2, "lwage")
print(fmla.ols2)
## lwage ~ fitted.educ + exper + expersq + black + smsa + south +
## smsa66 + reg662 + reg663 + reg664 + reg665 + reg666 + reg667 +
## reg668 + reg669
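# fit the second stage by OLS (the object name `fit.ols2` is our choice)
fit.ols2 = lm(fmla.ols2, data = card)
summary(fit.ols2)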
##
## Call:
## lm(formula = fmla.ols2, data = card)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.57387 -0.25161 0.01483 0.27229 1.38522
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6661509 0.9508542 3.856 0.000118 ***
## fitted.educ 0.1315038 0.0565103 2.327 0.020028 *
## exper 0.1082711 0.0243243 4.451 8.85e-06 ***
## expersq -0.0023349 0.0003429 -6.810 1.18e-11 ***
## black -0.1467757 0.0554166 -2.649 0.008125 **
## smsa 0.1118083 0.0325530 3.435 0.000601 ***
## south -0.1446715 0.0280524 -5.157 2.67e-07 ***
## smsa66 0.0185311 0.0222167 0.834 0.404286
## reg662 0.1007678 0.0387462 2.601 0.009349 **
## reg663 0.1482588 0.0378501 3.917 9.17e-05 ***
## reg664 0.0498971 0.0449707 1.110 0.267283
## reg665 0.1462719 0.0483883 3.023 0.002525 **
## reg666 0.1629029 0.0533703 3.052 0.002291 **
## reg667 0.1345722 0.0507925 2.649 0.008105 **
## reg668 -0.0830770 0.0610009 -1.362 0.173333
## reg669 0.1078142 0.0429903 2.508 0.012199 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3993 on 2994 degrees of freedom
## Multiple R-squared: 0.1947, Adjusted R-squared: 0.1907
## F-statistic: 48.25 on 15 and 2994 DF, p-value: < 2.2e-16
As we can see, the point estimate for educ is identical between 2SLS and the “two-stage OLS” (0.1315), while the standard errors differ slightly (0.0550 from ivreg vs. 0.0565 from the manual second stage); only the former is correct.
The first-stage result is interesting in its own right, since it tells us whether the instrument is relevant for the endogenous variable. Here it is safe to conclude that the proposed IV nearc4 is correlated with the endogenous variable educ: the coefficient on nearc4 is highly statistically significant, with a \(p\)-value of 0.000276, and its estimate, 0.32, gives a sense of the magnitude of the relationship (note that a coefficient in a multivariate linear regression is not equal to the correlation coefficient \(\rho\)). We can also test the correlation between educ and nearc4 via a more direct approach.
# test the significance of correlation between two variables
cor.test(card$educ, card$nearc4, method = "spearman")
##
## Spearman's rank correlation rho
##
## data: card$educ and card$nearc4
## S = 3924903877, p-value = 5.522e-14
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.1364632
Since one of the two variables, nearc4, is a dummy (categorical) variable, the Spearman rank test makes more sense here. The correlation test above is consistent with our observation from the multivariate first-stage regression.
You might notice that we have only tested relevance, without any mention of a test of exclusive exogeneity. Unfortunately, no statistical test of \(\text{Corr}(Z, \epsilon) = 0\) is available when we have a single IV for the endogenous variable. Such a test is only possible when we have more IVs than we need for the endogenous variable, and even then only under a restriction (see the over-identification test below).
In addition to nearc4, there is another dummy, nearc2 (living near a two-year college), in card. If we believe that both nearc2 and nearc4 are plausible IVs for educ, we can include both in the 2SLS (more precisely, in its first stage).
# running a multivariate 2SLS of log(wage) on education and others, with both `nearc4` and `nearc2` as IVs
# observe that the formula for 2SLS highlights the "two stages" of OLS regression
rhs.1stage.2iv = c(rhs[-1],"nearc4", "nearc2")
fmla.1stage.2iv = paste(rhs.1stage.2iv, collapse = " + ")
fmla.2sls = paste(fmla.2stage, fmla.1stage.2iv, sep = " | ")
fmla.2sls.2iv = as.formula(fmla.2sls)
# print out the formula for 2SLS
print(fmla.2sls.2iv)
## lwage ~ educ + exper + expersq + black + smsa + south + smsa66 +
## reg662 + reg663 + reg664 + reg665 + reg666 + reg667 + reg668 +
## reg669 | exper + expersq + black + smsa + south + smsa66 +
## reg662 + reg663 + reg664 + reg665 + reg666 + reg667 + reg668 +
## reg669 + nearc4 + nearc2
# running a 2SLS with both `nearc4` and `nearc2` as IVs
fit.2sls.2iv = ivreg(fmla.2sls.2iv, data = card)
summary(fit.2sls.2iv)
##
## Call:
## ivreg(formula = fmla.2sls.2iv, data = card)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.93841 -0.25068 0.01932 0.26519 1.46998
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.2367108 0.8849118 3.658 0.000259 ***
## educ 0.1570594 0.0525782 2.987 0.002839 **
## exper 0.1188149 0.0228061 5.210 2.02e-07 ***
## expersq -0.0023565 0.0003475 -6.781 1.43e-11 ***
## black -0.1232778 0.0521500 -2.364 0.018147 *
## smsa 0.1007530 0.0315193 3.197 0.001405 **
## south -0.1431945 0.0284448 -5.034 5.08e-07 ***
## smsa66 0.0150626 0.0223360 0.674 0.500132
## reg662 0.1027473 0.0392906 2.615 0.008966 **
## reg663 0.1499316 0.0383918 3.905 9.62e-05 ***
## reg664 0.0475676 0.0456013 1.043 0.296977
## reg665 0.1544801 0.0485628 3.181 0.001482 **
## reg666 0.1729728 0.0534164 3.238 0.001216 **
## reg667 0.1420356 0.0511219 2.778 0.005497 **
## reg668 -0.0950611 0.0609801 -1.559 0.119129
## reg669 0.1029760 0.0434224 2.371 0.017779 *
##
## Diagnostic tests:
## df1 df2 statistic p-value
## Weak instruments 2 2993 7.893 0.000381 ***
## Wu-Hausman 1 2993 2.926 0.087286 .
## Sargan 1 NA 1.248 0.263905
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4053 on 2994 degrees of freedom
## Multiple R-Squared: 0.1702, Adjusted R-squared: 0.166
## Wald test: 47.07 on 15 and 2994 DF, p-value: < 2.2e-16
# let's inspect how relevant `nearc4` and `nearc2` are in the first stage
fmla.ols1.2iv = reformulate(rhs.1stage.2iv, "educ")
print(fmla.ols1.2iv)
## educ ~ exper + expersq + black + smsa + south + smsa66 + reg662 +
## reg663 + reg664 + reg665 + reg666 + reg667 + reg668 + reg669 +
## nearc4 + nearc2
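# fit the two-IV first stage by OLS; `fit.ols1.2iv` is reused in the joint test below
fit.ols1.2iv = lm(fmla.ols1.2iv, data = card)
summary(fit.ols1.2iv)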
##
## Call:
## lm(formula = fmla.ols1.2iv, data = card)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.5851 -1.3845 -0.0823 1.2765 6.2930
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.660e+01 2.415e-01 68.750 < 2e-16 ***
## exper -4.123e-01 3.369e-02 -12.237 < 2e-16 ***
## expersq 8.479e-04 1.650e-03 0.514 0.607379
## black -9.452e-01 9.391e-02 -10.065 < 2e-16 ***
## smsa 4.014e-01 1.048e-01 3.830 0.000131 ***
## south -4.191e-02 1.355e-01 -0.309 0.757162
## smsa66 7.825e-05 1.069e-01 0.001 0.999416
## reg662 -1.002e-01 1.876e-01 -0.534 0.593049
## reg663 -2.143e-02 1.834e-01 -0.117 0.906981
## reg664 1.311e-01 2.174e-01 0.603 0.546580
## reg665 -2.684e-01 2.184e-01 -1.229 0.219228
## reg666 -3.334e-01 2.378e-01 -1.402 0.160948
## reg667 -2.087e-01 2.344e-01 -0.891 0.373199
## reg668 5.508e-01 2.679e-01 2.056 0.039906 *
## reg669 1.688e-01 2.041e-01 0.827 0.408286
## nearc4 3.206e-01 8.784e-02 3.650 0.000267 ***
## nearc2 1.230e-01 7.743e-02 1.589 0.112256
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.94 on 2993 degrees of freedom
## Multiple R-squared: 0.4776, Adjusted R-squared: 0.4748
## F-statistic: 171 on 16 and 2993 DF, p-value: < 2.2e-16
# we could also jointly test `nearc4 = nearc2 = 0` with an F-test
linearHypothesis(fit.ols1.2iv, c("nearc4 = 0", "nearc2 = 0"))
From the 2SLS with two IVs (nearc2 and nearc4), we can see that the estimate for educ is slightly larger than with one IV (nearc4): a one-year increase in schooling now increases the wage by about 15.7% on average, holding all else equal. To see whether nearc2 and nearc4 are correlated with educ, we ran the first stage of the 2SLS, i.e., educ \(\sim\) controls + nearc4 + nearc2. From its output, nearc4 is highly significant while nearc2 is not (it is borderline). We also conducted a joint F-test of nearc2 = nearc4 = 0; not surprisingly, they are jointly significant. Overall, in terms of relevance, nearc4 appears more strongly correlated with educ, which partly explains why nearc4 was chosen as the sole IV in Card (1993). Such a joint test of whether all instruments are relevant is called the weak-instruments test, an extension of the single-IV relevance test in the previous section.
There are three “model specification” tests we are interested in for the IV/2SLS approach: (1) the weak-instruments test, of whether the IVs are jointly relevant; (2) the Wu-Hausman test, of whether the suspect regressor is in fact endogenous; and (3) the Sargan over-identification test, of whether the surplus IVs are exogenous.
All these specification tests are readily provided in ivreg: call summary(2SLS_object, diagnostics = TRUE) right after ivreg. Let’s use the case where both nearc2 and nearc4 serve as IVs for educ. The reason we call it “over-identification” is simply that we have more IVs than we need for our endogenous variable: two IVs versus one endogenous variable.
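# re-display the two-IV 2SLS summary, requesting the diagnostics explicitly
summary(fit.2sls.2iv, diagnostics = TRUE)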
##
## Call:
## ivreg(formula = fmla.2sls.2iv, data = card)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.93841 -0.25068 0.01932 0.26519 1.46998
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.2367108 0.8849118 3.658 0.000259 ***
## educ 0.1570594 0.0525782 2.987 0.002839 **
## exper 0.1188149 0.0228061 5.210 2.02e-07 ***
## expersq -0.0023565 0.0003475 -6.781 1.43e-11 ***
## black -0.1232778 0.0521500 -2.364 0.018147 *
## smsa 0.1007530 0.0315193 3.197 0.001405 **
## south -0.1431945 0.0284448 -5.034 5.08e-07 ***
## smsa66 0.0150626 0.0223360 0.674 0.500132
## reg662 0.1027473 0.0392906 2.615 0.008966 **
## reg663 0.1499316 0.0383918 3.905 9.62e-05 ***
## reg664 0.0475676 0.0456013 1.043 0.296977
## reg665 0.1544801 0.0485628 3.181 0.001482 **
## reg666 0.1729728 0.0534164 3.238 0.001216 **
## reg667 0.1420356 0.0511219 2.778 0.005497 **
## reg668 -0.0950611 0.0609801 -1.559 0.119129
## reg669 0.1029760 0.0434224 2.371 0.017779 *
##
## Diagnostic tests:
## df1 df2 statistic p-value
## Weak instruments 2 2993 7.893 0.000381 ***
## Wu-Hausman 1 2993 2.926 0.087286 .
## Sargan 1 NA 1.248 0.263905
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4053 on 2994 degrees of freedom
## Multiple R-Squared: 0.1702, Adjusted R-squared: 0.166
## Wald test: 47.07 on 15 and 2994 DF, p-value: < 2.2e-16
From the results of the specification tests:
We reject the null in the weak-instruments test, so we can conclude that at least one of the IVs (nearc2 and nearc4) is relevant for the endogenous variable (educ). This is consistent with our earlier joint F-test and is a welcome result for the IV approach: at least one IV is relevant!2
The Wu-Hausman endogeneity test is borderline (p ≈ 0.087), so the evidence that educ is endogenous is only suggestive. A conservative analyst would rely on 2SLS rather than OLS, and it is reassuring that educ is significant in both OLS and 2SLS. Otherwise it would be a hard decision to make between OLS and 2SLS (remember that standard errors from 2SLS are generally much larger than those from OLS).
We fail to reject the null in the over-identification (Sargan) test and happily conclude that both nearc2 and nearc4 are exclusively exogenous IVs, under the assumption that at least one of the two is exogenous (itself untestable).
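To make the footnote’s intuition about the Sargan test concrete, here is a minimal sketch that refits the 2SLS with each instrument alone and compares the two educ estimates; the controls are dropped for brevity, so the numbers merely illustrate the comparison behind the test, not the test statistic itself.
# the Sargan test effectively asks whether single-IV 2SLS estimates agree
fit.c4 = ivreg(lwage ~ educ | nearc4, data = card)
fit.c2 = ivreg(lwage ~ educ | nearc2, data = card)
c(coef(fit.c4)["educ"], coef(fit.c2)["educ"])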
Some good reads on IV and 2SLS in R with more examples: R Bookdown’s The Instrumental Variables (IV) Method.
Card, David. 1993. “Using Geographic Variation in College Proximity to Estimate the Return to Schooling.” Working Paper 4483. Working Paper Series. National Bureau of Economic Research. https://doi.org/10.3386/w4483.
You might be curious why we assume that “at least one of them is exogenous” for the over-identification test. Without that assumption, there are two possibilities when we fail to reject the null: either all IVs are exogenous, or none of them is. Let’s use nearc2 and nearc4 as IVs for educ as the example. The nature of the over-identification test is to first run 2SLS with nearc2 and then with nearc4 as the sole IV, obtaining two 2SLS estimators, \(\hat{\beta}_{c2}\) and \(\hat{\beta}_{c4}\) respectively, and then to test \(H_0\colon \hat{\beta}_{c2} - \hat{\beta}_{c4} = 0\). It could therefore be the (subtle and challenging) case that neither nearc2 nor nearc4 is exclusively exogenous, yet \(\hat{\beta}_{c2}\) and \(\hat{\beta}_{c4}\) are not statistically different from each other as two 2SLS estimators. Just as we cannot test whether a single IV is truly exogenous, we cannot test whether at least one of our plausible IVs is exogenous to begin with for the over-identification test here.↩
From the first-stage OLS regression in the previous section, we saw that nearc4 is much more significant and has a larger coefficient than nearc2.↩