This is a simulation to demonstrate the power of instrumental variables. This code is based on the simulation shown on this blog.
The purpose of doing a simulation is to show you how instrumental variables work. Because, if you’re like me, you thought they were a little too magical when you first saw this.
Let’s say we have some outcome variable \(Y\). It is a function of two variables, \(X\) and \(W\). We are interested in knowing the causal effect of \(X\) on \(Y\).
Our model is such:
\[ E(Y) = \beta_0 + \beta_1 X + \beta_2 W \]
Now, let’s assume that \(X\) and \(W\) are correlated with one another because \(W\) is also a cause of \(X\). However, for whatever reason we don’t actually observed or measure \(W\). If we were omniscient, we would have collected \(W\) and we could then just fit a regression of \(Y\) on \(X\) and \(W\) and get an unbiased estimated of \(X\). However, we don’t have \(W\), so we just fit regress \(Y\) on \(X\) using lm(Y ~ X).
This model impies that are observations of \(Y\) are such that:
\[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\]
But our problem is that our error term is a function of the missing \(W\) and residual error, \(e_i\). \[ \epsilon_i = f(W_i, e_i)\]
Because of this \(X_i\) and \(e_i\) are correlated and we’ll end up biased estimates of \(\beta_1\), which is the causal effect of \(X\) on \(Y\).
What we need is an instrumental variable, \(Z\), that is a partial cause of \(X\) but has no direct relationship with \(Y\) only an indirect relationship through \(X\).
\[ E(X) = \alpha_0 + \alpha_1 X_i^{*} + \alpha_2\]
If we estimate, the equations:
\[ Y_i = \beta_0 + \beta_1 X + e_1\] \[ X_i = \alpha_0 + \alpha_1 Z + e_2\]
and allow the \(Cov(e_1, e_2)\) to be freely estimated, we’ll get an unbiased estimate of \(\beta_1\).
Let’s try this out with a simulation. First we’ll generate the data.
Now, if we fit the following path model, we’ll get an unbiased estimate of
set.seed(523123)
## Load the required libraries ---------------
library("lavaan")
library("MASS")
library("semPlot")
# Generate X* and W as correlated normal random variables ------------
pred.vars <- mvrnorm(10000, c(20, 15), matrix(c(1, 0.75, 0.75, 1), 2, 2))
x.star <- pred.vars[, 1]
W <- pred.vars[, 2]
# Next generate the instrumental variable -----------------
Z <- rnorm(10000)
X <- x.star + Z
# Note the x.star part is the part of X that is unexplained by Z and contains
# the residual error.
# Finally, let's generate our outcome variable Y -----------------
# Note, it's a function of:
# - beta0 = 1
# - beta1 = 1
# - beta2 = 1
# - residual ~ N(0, 0.5^2)
Y <- 1 + X + W + rnorm(10000, 0, 0.5)
# Then we save our data set
iv.data <- data.frame(X, Y, Z, W)
First, let’s view the true model in lavaan
And let’s view the model output.
summary(fit.true)
lavaan (0.5-20) converged normally after 24 iterations
Number of observations 10000
Estimator ML
Minimum Function Test Statistic 0.000
Degrees of freedom 0
Minimum Function Value 0.0000000000000
Parameter Estimates:
Information Expected
Standard Errors Standard
Regressions:
Estimate Std.Err Z-value P(>|z|)
X ~
W 0.760 0.012 63.365 0.000
Y ~
X 0.990 0.004 239.396 0.000
W 0.997 0.006 169.882 0.000
Variances:
Estimate Std.Err Z-value P(>|z|)
X 1.458 0.021 70.711 0.000
Y 0.249 0.004 70.711 0.000
What would our estimate of \(\beta_1\) be without W? Our estimate is 0.99 for the true parameter of 1.
lm(Y ~ X, data = iv.data)
Call:
lm(formula = Y ~ X, data = iv.data)
Coefficients:
(Intercept) X
8.686 1.366
Ouch! 1.366 is much further from 1 than 0.99!
Let’s see if we can do better with our instrumental variable Z.
summary(fit.iv)
lavaan (0.5-20) converged normally after 18 iterations
Number of observations 10000
Estimator ML
Minimum Function Test Statistic 0.000
Degrees of freedom 0
Minimum Function Value 0.0000000000000
Parameter Estimates:
Information Expected
Standard Errors Standard
Regressions:
Estimate Std.Err Z-value P(>|z|)
X ~
Z 1.002 0.010 98.975 0.000
Y ~
X 0.992 0.011 89.044 0.000
Covariances:
Estimate Std.Err Z-value P(>|z|)
X ~~
Y 0.764 0.018 42.718 0.000
Variances:
Estimate Std.Err Z-value P(>|z|)
X 1.032 0.015 70.711 0.000
Y 1.254 0.025 51.019 0.000
The estimate from the true model was .990 and from our IV model was 0.992. This is much closer to the truth!
Now let’s run this as a full blown simulation with 2000 replicates!
summary(true.fit)
Length Class Mode
[1,] 1 lavaan S4
[2,] 1 lavaan S4