An instrumental variable (IV) is a variable that is correlated with an endogenous regressor but has no direct effect on the outcome variable. IVs are used to address endogeneity, i.e. a correlation between the error term and one or more independent variables in an ordinary least squares (OLS) regression. Endogeneity is commonly attributed to the omission of a relevant variable from the OLS model.

Example from World Bank

Suppose we wish to measure the impact of a voluntary farmer training programme on farmer-level outcomes such as income. To measure the impact, we might decide to compare the outcomes of those who participated in the programme with those who did not, using the OLS model below.

\[\begin{equation} Y=\alpha +\beta_1P +\beta_2X +\epsilon \end{equation}\]

where: \[\begin{equation} P=\begin{cases} 1,& \text{If person participates in training}\\ 0, & \text{If person does not participate in training} \end{cases} \end{equation}\]

\[\begin{equation} X=\text{Control variables/exogenous/observed} \\ Y=\text{Outcome variable} \end{equation}\]

Problem

However, with the equation above we are likely to encounter an endogeneity problem because:

  • The decision to participate in training is endogenous (e.g. based on an “unmeasurable” characteristic of a person).
  • Consequently, the OLS model omits variables that influence both participation and the outcome, and their effect ends up in the error term.

In this case the OLS estimates will be biased. Notably, the Wu-Hausman test can be used to check for endogeneity in an OLS model.
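The bias described above can be illustrated with a small simulation. This is only a sketch under assumed values: the variable `ability`, the sample size and every coefficient are hypothetical and are not taken from the World Bank example.

```r
# Simulated illustration of omitted-variable bias (all numbers are assumptions).
set.seed(42)
n <- 5000
ability <- rnorm(n)   # "unmeasurable" characteristic, never observed in practice
x <- rnorm(n)         # observed control variable X

# Farmers with higher ability are more likely to join the training.
p <- as.integer(0.8 * ability + rnorm(n) > 0)

# Outcome: the assumed true effect of participation is 1.
true_effect <- 1
y <- 2 + true_effect * p + 0.5 * x + 1.5 * ability + rnorm(n)

naive  <- lm(y ~ p + x)            # ability is omitted, as in the OLS model
oracle <- lm(y ~ p + x + ability)  # infeasible benchmark that observes ability

b_naive  <- unname(coef(naive)["p"])
b_oracle <- unname(coef(oracle)["p"])
print(c(naive = b_naive, oracle = b_oracle, truth = true_effect))
```

In this setup the naive coefficient on `p` absorbs part of the effect of `ability`, so it overstates the assumed effect, while the benchmark regression recovers it only because the simulation lets us observe what would normally be hidden.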

Solving the problem

We solve this problem by finding an instrument Z that is:
1. Closely related to the participation variable P (relevance).
2. Does not directly affect people’s outcomes Y, except through its effect on participation (exclusion).

Identifying an instrumental variable

Suppose the programme had community workers who encouraged farmers to enroll, and the data (a randomly selected sample) contain both farmers who received a visit from a community worker and farmers who did not. We thus have an encouragement variable Z defined as
\[\begin{equation} Z=\begin{cases} 1,& \text{If person was randomly chosen to receive the encouragement visit from a community worker}\\ 0, & \text{If person was randomly chosen not to receive the encouragement visit from a community worker} \end{cases} \end{equation}\]

In this case:
1) \(Corr (Z,P) >0\)
People who receive the encouragement visit are more likely to participate than those who do not.
2) \(Corr (Z ,\epsilon) = 0\)
Receiving a visit is not correlated with the benefit from the programme, apart from the effect of the visit on participation.
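With a binary instrument and no controls, the two conditions lead to the simple Wald estimator: the difference in mean outcomes between visited and non-visited farmers, \(E[Y|Z=1]-E[Y|Z=0]\), divided by the difference in participation rates, \(E[P|Z=1]-E[P|Z=0]\). The sketch below applies it to simulated data; `ability` and all coefficients are assumptions made for illustration only.

```r
# Wald estimator on a simulated encouragement design (hypothetical values).
set.seed(1)
n <- 10000
ability <- rnorm(n)        # unobserved trait that drives both P and Y
z <- rbinom(n, 1, 0.5)     # randomly assigned encouragement visit

# The visit raises the chance of participating; ability does too.
p <- as.integer(-0.5 + 1.2 * z + 0.8 * ability + rnorm(n) > 0)

# Z does not enter the outcome equation directly; assumed effect of P is 1.
y <- 2 + 1 * p + 1.5 * ability + rnorm(n)

relevance    <- cor(z, p)                            # condition 1: Corr(Z, P) > 0
reduced_form <- mean(y[z == 1]) - mean(y[z == 0])    # effect of Z on Y
first_stage  <- mean(p[z == 1]) - mean(p[z == 0])    # effect of Z on P
wald         <- reduced_form / first_stage

print(c(relevance = relevance, first_stage = first_stage, wald = wald))
```

Condition 2 cannot be checked from the data because \(\epsilon\) is unobserved; here it holds by construction, since `z` is drawn independently of `ability`.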

Estimating impact using two-stage least squares (2SLS)

Step 1

Regress the treatment variable (P) on the instrumental variable Z and the other exogenous variables X \[\begin{equation} P=\delta_0 +\delta_1X+\delta_2Z+\tau \end{equation}\]
* Calculate and save the predicted values \(\hat{P}\)

Step 2

Regress Y on the predicted values \(\hat{P}\) and the other exogenous variables X

\[\begin{equation} Y=\alpha+\beta_1\hat{P}+\beta_2X+\epsilon \end{equation}\]
In the equation above, \(\beta_1\) provides a consistent estimate of the Local Average Treatment Effect (LATE), i.e. the impact of the farmer training programme on outcome indicator Y for those whose participation was changed by the encouragement.
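The two steps above can be sketched by hand with two calls to `lm()`. The data are simulated and every name and coefficient (including the unobserved `ability`) is an assumption for illustration, not part of the World Bank example.

```r
# Manual two-stage least squares on simulated data (hypothetical values).
set.seed(2024)
n <- 10000
ability <- rnorm(n)        # unobserved confounder
x <- rnorm(n)              # observed exogenous control X
z <- rbinom(n, 1, 0.5)     # instrument Z: random encouragement visit
p <- as.integer(-0.5 + 1.2 * z + 0.3 * x + 0.8 * ability + rnorm(n) > 0)
y <- 2 + 1 * p + 0.5 * x + 1.5 * ability + rnorm(n)   # assumed effect of P is 1

# Step 1: regress P on Z and X, then save the predicted values.
stage1 <- lm(p ~ x + z)
p_hat  <- fitted(stage1)

# Step 2: regress Y on the predicted participation and X.
stage2 <- lm(y ~ p_hat + x)

b_2sls <- unname(coef(stage2)["p_hat"])
b_ols  <- unname(coef(lm(y ~ p + x))["p"])   # naive OLS for comparison
print(c(ols = b_ols, two_stage = b_2sls))
```

The point estimate from the second `lm()` is the 2SLS estimate, but its reported standard errors are not valid because they ignore that `p_hat` was itself estimated. A dedicated routine such as `ivreg` from the AER package corrects for this.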

Own example

Suppose an impact assessment is needed for the effect of a youth training programme on participants' income after a two-year period. To encourage more youths to enroll, the programme ran multiple campaign methods such as radio adverts, community health workers and posters. In this case we have:
1. Outcome (dependent) variable - income
2. Treatment (endogenous) variable - treatment (1 beneficiary, 0 non-beneficiary)
3. Instrumental variable (IV) - An index measuring access to information. It is anticipated that beneficiaries were more exposed to information than non-beneficiaries. The index is created using principal component analysis.
4. Other regressors - demographic variables

NB: The IV is assumed to affect enrollment in the programme but to have no effect on the outcome other than through enrollment.
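As a rough sketch of how an information index could be built with principal component analysis, the code below takes the first principal component of three binary exposure items. The item names (`radio`, `chw`, `poster`) and the simulated values are hypothetical; the source does not say which items went into the actual index.

```r
# Building an information-access index with PCA (hypothetical items and data).
set.seed(7)
n <- 500
exposure <- rnorm(n)   # latent exposure to the campaign, used only to simulate items

# Three yes/no items: heard the radio advert, met a community health worker, saw a poster.
items <- data.frame(
  radio  = as.integer(exposure + rnorm(n) > 0),
  chw    = as.integer(exposure + rnorm(n) > 0.3),
  poster = as.integer(exposure + rnorm(n) > -0.2)
)

# Standardise the items and extract the principal components.
pca <- prcomp(items, center = TRUE, scale. = TRUE)
info_index <- pca$x[, 1]   # scores on the first component

# The sign of a component is arbitrary, so orient it so that higher means more exposure.
if (cor(info_index, rowSums(items)) < 0) info_index <- -info_index

share_pc1 <- pca$sdev[1]^2 / sum(pca$sdev^2)   # variance explained by the first component
print(round(share_pc1, 2))
```

Whether such an index is a valid instrument still rests on the assumption in the note above, which the PCA step itself does nothing to guarantee.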

Data

The attached data set is simulated and contains the outcome variable (income), the treatment variable (Treatment), the instrumental variable (info_index) and the control variables, i.e. gender, age, education level and marital status.

library(haven)       # read_sav() for SPSS files
library(tidyverse)   # provides the %>% pipe
library(AER)         # ivreg()
library(knitr)       # kable()
library(kableExtra)  # kable_styling()

# Load the simulated data set.
mydata <- read_sav("E:/IV/mydata.sav")

# Show the first 10 rows as a styled HTML table.
head(mydata, 10) %>%
  kable("html") %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    font_size = 12
  )
Treatment  age  sex  education  marital  information  income  newvar  info_index
        1   75    1          3        4            1    3000    1.45   2.2967648
        0   62    2          2        4            1    3000    1.00  -0.6487551
        1   60    1          1        4            1   11000    1.45   1.5254940
        1   60    2          2        4            1    1000    1.45   1.5254940
        1   57    2          1        4            1    1000    1.45   1.3712398
        0   55    2          1        4            1    4000    1.00  -0.7515912
        0   50    2          1        4            1    3000    1.00  -0.8250455
        1   48    2          2        4            1    2000    1.45   0.9084774
        0   60    1          2        3            1    2000    1.00  -0.6781368
        1   51    2          2        3            1    3000    1.45   1.0627315

In R, the ivreg function from the AER package fits the two-stage regression. In its formula, the regressors of the outcome equation go before the | and the instrument together with the exogenous controls go after it. Conceptually, the first stage regresses the treatment variable on the instrument and the control variables and keeps the predicted values. In the second stage, the outcome variable (income) is regressed on those predicted values and the controls.

Notably, in our case income (the outcome) is log-transformed, so the treatment coefficient can be read as an approximate proportional difference in income between beneficiaries and non-beneficiaries.

# Regressors before the |, instrument and exogenous controls after it.
fit <- ivreg(log(income) ~ Treatment + age + sex + education + marital |
               info_index + age + sex + education + marital, data = mydata)

summary(fit, vcov = sandwich, diagnostics = TRUE)
## 
## Call:
## ivreg(formula = log(income) ~ Treatment + age + sex + education + 
##     marital | info_index + age + sex + education + marital, data = mydata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.38002 -0.41863  0.08297  0.50117  1.43570 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.0418840  0.1534787  52.397  < 2e-16 ***
## Treatment    0.0237302  0.0578414   0.410  0.68175    
## age         -0.0002928  0.0018587  -0.158  0.87487    
## sex         -0.1260021  0.0559872  -2.251  0.02474 *  
## education    0.0884873  0.0319361   2.771  0.00575 ** 
## marital      0.0056362  0.0338013   0.167  0.86762    
## 
## Diagnostic tests:
##                  df1 df2 statistic p-value    
## Weak instruments   1 661  5036.957  <2e-16 ***
## Wu-Hausman         1 660     0.035   0.852    
## Sargan             0  NA        NA      NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6904 on 661 degrees of freedom
## Multiple R-Squared: 0.02546, Adjusted R-squared: 0.01809 
## Wald test: 3.696 on 5 and 661 DF,  p-value: 0.002632

Summary of findings

From the findings above, the income of beneficiaries is about 2.4 percent higher than that of non-beneficiaries, although the difference is not statistically significant (p = 0.68).
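Because the outcome is log(income), a coefficient \(b\) on Treatment corresponds to a percentage difference of \(100(e^{b}-1)\). The short sketch below applies this to the Treatment estimate and standard error printed in the output above. The 95 percent interval uses a normal approximation, which is an assumption of this sketch, whereas the summary reports t-based p-values.

```r
# Converting the log-scale Treatment coefficient into a percentage difference.
b  <- 0.0237302   # Treatment estimate from the ivreg summary
se <- 0.0578414   # its robust standard error from the same summary

pct <- 100 * (exp(b) - 1)                     # percentage difference in income

# Approximate 95% interval on the log scale, then converted to percent.
ci_log <- b + c(-1, 1) * qnorm(0.975) * se
ci_pct <- 100 * (exp(ci_log) - 1)

print(round(c(estimate = pct, lower = ci_pct[1], upper = ci_pct[2]), 1))
```

The interval spans zero, which is consistent with the non-significant p-value reported for Treatment.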
It is worth noting that:
1. Weak instruments: The null hypothesis is that the instrument is weak, so rejecting it means the instrument is not weak, which is good. In this example the p-value is below 0.0001, implying that the instrument is not weak.
2. Wu-Hausman: A test of whether OLS is consistent. The null hypothesis is that the treatment variable is exogenous, so rejecting it means OLS is not consistent and endogeneity is present. In this example we fail to reject the null hypothesis (p = 0.852), which suggests that endogeneity was not a major issue.
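To show what these two diagnostics measure, the sketch below reproduces their logic by hand on simulated data: an F test on the instrument in the first stage, and a regression-based endogeneity check that adds the first-stage residuals to the outcome regression. This follows the general idea of the tests and is not claimed to match the exact computations inside `ivreg`. The data, the variable `ability` and all coefficients are assumptions.

```r
# Hand-rolled versions of the weak-instrument and endogeneity checks (simulated data).
set.seed(99)
n <- 5000
ability <- rnorm(n)        # unobserved confounder, so P is endogenous by construction
x <- rnorm(n)              # exogenous control
z <- rbinom(n, 1, 0.5)     # instrument
p <- as.integer(-0.5 + 1.2 * z + 0.3 * x + 0.8 * ability + rnorm(n) > 0)
y <- 2 + 1 * p + 0.5 * x + 1.5 * ability + rnorm(n)

# Weak-instrument check: does adding Z improve the first stage?
stage1_restricted <- lm(p ~ x)
stage1_full       <- lm(p ~ x + z)
f_test <- anova(stage1_restricted, stage1_full)
f_stat <- f_test$F[2]      # first-stage F statistic for the instrument

# Endogeneity check: add the first-stage residuals to the outcome regression.
# A significant coefficient on v_hat indicates that P is endogenous.
v_hat <- resid(stage1_full)
augmented <- lm(y ~ p + x + v_hat)
p_value_endog <- summary(augmented)$coefficients["v_hat", "Pr(>|t|)"]

print(c(first_stage_F = f_stat, endogeneity_p = p_value_endog))
```

Unlike the example above, this simulation builds endogeneity in on purpose, so the second check is expected to reject its null hypothesis here.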