An instrumental variable (IV) is a variable that influences the outcome only through its effect on an endogenous explanatory variable; it has no direct effect on the outcome variable. IVs are used to address the endogeneity problem, i.e. a correlation between the error term and one or more independent variables in an ordinary least squares (OLS) regression. Endogeneity is most commonly attributed to the omission of a relevant variable from the OLS model.
Suppose we wish to measure the impact of a voluntary farmer training programme on farmer-level outcomes such as income. To measure the impact, we might compare the outcomes of those who participated with those who did not, using the OLS model below.
\[\begin{equation} Y=\alpha +\beta_1P +\beta_2X +\epsilon \end{equation}\]
where: \[\begin{equation} P=\begin{cases} 1,& \text{If person participates in training}\\ 0, & \text{If person does not participate in training} \end{cases} \end{equation}\]
\[\begin{equation} X=\text{Control variables/exogenous/observed} \\ Y=\text{Outcome variable} \end{equation}\]
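However, the equation above is likely to suffer from an endogeneity problem because participation is voluntary: unobserved characteristics that influence the decision to participate (an omitted variable) may also affect the outcome, so that P is correlated with the error term \(\epsilon\).
In this case the OLS estimates will be biased. Notably, the Wu-Hausman test can be used to check for an endogeneity problem in an OLS model.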
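The bias can be illustrated with a small simulation. This is a minimal sketch on made-up data, not the programme data: the variable `motivation`, the true effect of 1 and all other numbers are assumptions chosen only for illustration.

```r
# Simulated illustration of omitted-variable bias (all names and numbers are hypothetical)
set.seed(123)
n <- 5000

motivation <- rnorm(n)                       # unobserved by the analyst
# Voluntary participation: more motivated farmers are more likely to enrol
P <- as.integer(motivation + rnorm(n) > 0)
# The true programme effect is 1; motivation also raises the outcome directly
Y <- 2 + 1 * P + 1.5 * motivation + rnorm(n)

# OLS that omits motivation: P is correlated with the error term
naive <- unname(coef(lm(Y ~ P))["P"])
# Infeasible benchmark that controls for the normally unobserved variable
oracle <- unname(coef(lm(Y ~ P + motivation))["P"])

round(c(naive = naive, oracle = oracle), 2)
```

The naive coefficient lands well above the true effect because it also picks up the effect of motivation, while the benchmark regression recovers a value close to 1.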
We solve this problem by finding an instrument Z that:
1. Is closely related to the participation variable P.
2. Does not directly affect people's outcomes Y, except through its effect on participation.
Suppose the programme had community workers who encouraged farmers to enroll in the programme, and the data (from a randomly selected sample) contain both farmers who interacted with community workers and farmers who did not. We thus have an encouragement variable Z defined as
\[\begin{equation}
Z=\begin{cases}
1,& \text{If person was randomly chosen to receive the encouragement visit from a community worker}\\
0, & \text{If person was randomly chosen not to receive the encouragement visit from a community worker}
\end{cases}
\end{equation}\]
In this case:
1) \(Corr (Z,P) >0\)
Those who receive the encouragement visit are more likely to participate in the programme.
2) \(Corr (Z ,\epsilon) = 0\)
Receiving a visit is unrelated to the benefit from the programme, apart from the effect of the visit on participation.
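To see why these two conditions are enough, consider the simplest case with no control variables X (a simplifying assumption made only for this sketch). Taking the covariance of Z with the outcome equation and using condition 2 gives the coefficient \(\beta_1\) on P as a ratio of two quantities that can be estimated from the data; condition 1 ensures that the denominator is not zero. For a binary Z the ratio reduces to a difference in mean outcomes divided by a difference in participation rates:

\[\begin{equation}
Cov(Z,Y)=\beta_1 Cov(Z,P)+Cov(Z,\epsilon)=\beta_1 Cov(Z,P)
\quad\Longrightarrow\quad
\beta_1=\frac{Cov(Z,Y)}{Cov(Z,P)}=\frac{E[Y\mid Z=1]-E[Y\mid Z=0]}{E[P\mid Z=1]-E[P\mid Z=0]}
\end{equation}\]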
The IV estimate is obtained by two-stage least squares (2SLS):
* First stage: regress the treatment variable (P) on the instrumental variable Z and the other exogenous variables X
\[\begin{equation}
P=\delta_0 +\delta_1X+\delta_2Z+\tau
\end{equation}\]
* Calculate and save the predicted values \(\hat{P}\)
* Second stage: regress Y on the predicted values \(\hat{P}\) and the other exogenous variables X
\[\begin{equation}
Y=\alpha+\beta_1\hat{P}+\beta_2X+\epsilon
\end{equation}\]
In the second-stage equation, \(\beta_1\) is the IV estimate of the Local Average Treatment Effect (LATE), i.e. the impact of the farmer training programme on the outcome indicator Y for those whose participation is changed by the instrument. Unlike the OLS estimate, it is consistent when Z satisfies the two conditions above.
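The two stages can be reproduced by hand with `lm()`. The sketch below uses simulated data in which the true effect of participation is set to 1; the data-generating process, the variable `motivation` and all coefficients are assumptions for illustration only. Note that the standard errors reported by a manual second stage are not valid, which is one reason to prefer a dedicated 2SLS routine in practice.

```r
# Manual two-stage least squares on simulated data (all names and numbers are hypothetical)
set.seed(42)
n <- 20000

X <- rnorm(n)                                # observed control variable
motivation <- rnorm(n)                       # unobserved confounder
Z <- rbinom(n, 1, 0.5)                       # randomised encouragement visit
# Participation depends on the visit, the control and motivation
P <- as.integer(1.5 * Z + 0.3 * X + motivation + rnorm(n) > 0.4)
# The true effect of P on Y is 1; motivation also affects Y directly
Y <- 2 + 1 * P + 0.5 * X + 1.5 * motivation + rnorm(n)

# Stage 1: regress P on the instrument and the control, save fitted values
stage1 <- lm(P ~ X + Z)
P_hat <- fitted(stage1)

# Stage 2: regress Y on the fitted values and the control
stage2 <- lm(Y ~ P_hat + X)

beta_ols  <- unname(coef(lm(Y ~ P + X))["P"])
beta_2sls <- unname(coef(stage2)["P_hat"])
round(c(ols = beta_ols, two_stage = beta_2sls), 2)
```

In this setup the OLS coefficient is pushed upwards by the omitted confounder, while the two-stage coefficient should sit close to the true value of 1.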
Suppose we need to assess the impact of a youth training programme on participants' income after a two-year period. To encourage more youths to enroll, the programme ran multiple campaign methods such as radio adverts, community health workers and posters. In this case we have:
1. Outcome variable - income
2. Treatment (endogenous) variable - treatment (1 beneficiary, 0 non-beneficiary)
3. Instrumental variable (IV) - An index measuring access to information. It is anticipated that beneficiaries were more exposed to information than non-beneficiaries. The index is created using principal component analysis.
4. Other regressors - demographic variables
NB: The IV is assumed to affect enrollment in the programme but to have no direct effect on the outcome (income).
The attached data set is simulated and contains the outcome variable (income), the treatment variable (Treatment), the instrumental variable (info_index) and control variables, i.e. gender, age, education level and marital status.
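As a rough idea of how such an index might be built, the sketch below runs a principal component analysis on three exposure indicators and keeps the first component. The indicator names (`radio`, `worker`, `poster`) and the simulated values are hypothetical; the actual variables and steps used to construct `info_index` are not shown in this document.

```r
# Hypothetical construction of an information-access index via PCA
set.seed(1)
n <- 200

access <- rnorm(n)                           # latent access to information (simulated)
# Made-up exposure indicators (1 = reached through that channel)
exposure <- data.frame(
  radio  = as.integer(access + rnorm(n) > 0),
  worker = as.integer(access + rnorm(n) > 0.3),
  poster = as.integer(access + rnorm(n) > -0.2)
)

# Standardise the indicators and extract principal components
pca <- prcomp(exposure, center = TRUE, scale. = TRUE)
info_index_demo <- pca$x[, 1]                # scores on the first component

# The sign of a principal component is arbitrary; orient it so that
# higher values mean more exposure
if (cor(info_index_demo, rowSums(exposure)) < 0) {
  info_index_demo <- -info_index_demo
}

# Share of total variance captured by the first component
summary(pca)$importance["Proportion of Variance", "PC1"]
```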
library(haven)
library(tidyverse)
library(AER)
library(knitr)
library(kableExtra)
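# Load the simulated data (SPSS format); adjust the path to your local copy
mydata <- read_sav("E:/IV/mydata.sav")
# Preview the first 10 rows as a styled HTML table
# (attach() is dropped: every model call below passes data = mydata explicitly)
head(mydata, 10) %>%
  kable("html") %>%
  kable_styling(font_size = 12, bootstrap_options = c("striped", "hover", "condensed", "responsive"))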
| Treatment | age | sex | education | marital | information | income | newvar | info_index |
|---|---|---|---|---|---|---|---|---|
| 1 | 75 | 1 | 3 | 4 | 1 | 3000 | 1.45 | 2.2967648 |
| 0 | 62 | 2 | 2 | 4 | 1 | 3000 | 1.00 | -0.6487551 |
| 1 | 60 | 1 | 1 | 4 | 1 | 11000 | 1.45 | 1.5254940 |
| 1 | 60 | 2 | 2 | 4 | 1 | 1000 | 1.45 | 1.5254940 |
| 1 | 57 | 2 | 1 | 4 | 1 | 1000 | 1.45 | 1.3712398 |
| 0 | 55 | 2 | 1 | 4 | 1 | 4000 | 1.00 | -0.7515912 |
| 0 | 50 | 2 | 1 | 4 | 1 | 3000 | 1.00 | -0.8250455 |
| 1 | 48 | 2 | 2 | 4 | 1 | 2000 | 1.45 | 0.9084774 |
| 0 | 60 | 1 | 2 | 3 | 1 | 2000 | 1.00 | -0.6781368 |
| 1 | 51 | 2 | 2 | 3 | 1 | 3000 | 1.45 | 1.0627315 |
In R, the ivreg function from the AER package performs two-stage least squares regression. In the first stage, the treatment variable is regressed on the instrument and the control variables, and the predicted values are saved. In the second stage, the outcome variable (income) is regressed on these predicted values and the control variables.
Notably, in our case income (the outcome) is log-transformed, so that the treatment coefficient can be read as the approximate proportional difference in income between beneficiaries and non-beneficiaries.
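# Formula layout: outcome ~ treatment + controls | instrument + controls
fit <- ivreg(log(income) ~ Treatment + age + sex + education + marital |
               info_index + age + sex + education + marital, data = mydata)
summary(fit, vcov = sandwich, diagnostics = TRUE)  # robust SEs and diagnostic tests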
##
## Call:
## ivreg(formula = log(income) ~ Treatment + age + sex + education +
## marital | info_index + age + sex + education + marital, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.38002 -0.41863 0.08297 0.50117 1.43570
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.0418840 0.1534787 52.397 < 2e-16 ***
## Treatment 0.0237302 0.0578414 0.410 0.68175
## age -0.0002928 0.0018587 -0.158 0.87487
## sex -0.1260021 0.0559872 -2.251 0.02474 *
## education 0.0884873 0.0319361 2.771 0.00575 **
## marital 0.0056362 0.0338013 0.167 0.86762
##
## Diagnostic tests:
## df1 df2 statistic p-value
## Weak instruments 1 661 5036.957 <2e-16 ***
## Wu-Hausman 1 660 0.035 0.852
## Sargan 0 NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6904 on 661 degrees of freedom
## Multiple R-Squared: 0.02546, Adjusted R-squared: 0.01809
## Wald test: 3.696 on 5 and 661 DF, p-value: 0.002632
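From the findings above, the income of beneficiaries is estimated to be about 2.4 percent higher than that of non-beneficiaries (a coefficient of 0.0237 on log income), although the difference is not statistically significant (p = 0.68).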
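The percentage reading of a coefficient on a log outcome can be checked with a couple of lines. The value of `b` is copied from the Treatment row of the output above; the variable names are introduced here for illustration.

```r
# Convert a coefficient on log(income) into a percentage difference
b <- 0.0237302                        # Treatment coefficient from the ivreg output

approx_pct <- 100 * b                 # small-coefficient approximation
exact_pct  <- 100 * (exp(b) - 1)      # exact conversion for a binary regressor

round(c(approx = approx_pct, exact = exact_pct), 2)
```

For a coefficient this small the two versions are nearly identical; the gap widens as the coefficient grows.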
It is worth noting that:
1. Weak instruments: the null hypothesis is that the instrument is weak, so rejecting it is good news. In this example the p-value is below 0.0001 (F = 5036.96), implying that the instrument is not weak.
2. Wu-Hausman: a test of whether OLS is consistent. The null hypothesis is that the treatment variable is exogenous; rejecting it means OLS is not consistent, suggesting that endogeneity is present. In this example we fail to reject the null hypothesis (p = 0.852), which implies that endogeneity was not a big issue.
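The logic behind both diagnostics can be sketched with plain `lm()` calls. The code below follows the usual textbook constructions (an F-test on the instrument in the first stage, and a regression-based Wu-Hausman test that adds the first-stage residuals to the outcome equation) on simulated data. The data and variable names are hypothetical, and the figures reported by `summary()` for an `ivreg` fit may be computed differently in detail.

```r
# Hand-rolled versions of the two diagnostics on simulated data (hypothetical setup)
set.seed(7)
n <- 5000

X <- rnorm(n)                                # observed control
u <- rnorm(n)                                # unobserved confounder
Z <- rbinom(n, 1, 0.5)                       # binary instrument
P <- as.integer(1.5 * Z + 0.3 * X + u + rnorm(n) > 0.4)   # endogenous treatment
Y <- 1 + 1 * P + 0.5 * X + 1.5 * u + rnorm(n)

# Weak-instrument check: F-test for excluding Z from the first stage
stage1_full    <- lm(P ~ X + Z)
stage1_reduced <- lm(P ~ X)
weak_F <- anova(stage1_reduced, stage1_full)$F[2]

# Regression-based Wu-Hausman: add first-stage residuals to the outcome equation;
# a significant coefficient on v_hat indicates that P is endogenous
v_hat <- resid(stage1_full)
augmented <- lm(Y ~ P + X + v_hat)
hausman_p <- summary(augmented)$coefficients["v_hat", "Pr(>|t|)"]

c(weak_F = weak_F, hausman_p = hausman_p)
```

Because endogeneity is built into this simulation, the Wu-Hausman p-value comes out small here, unlike in the example above where the null hypothesis is not rejected.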