In this post, I show that correlation is not transitive, and include the R code for a simulation so that you can try it for yourself.
Regression is an important tool in data science, not only for prediction but also for elucidating causality between variables. OLS regression alone is insufficient to establish causality: if a model suffers from endogeneity, i.e. an explanatory variable is correlated with the error term, then the OLS estimates are biased. Hence, when endogeneity is suspected, OLS should be complemented with IV regression and the Durbin-Wu-Hausman test, to determine which of the two estimators is appropriate for the model at hand.
Mathematically, suppose we have a linear model of the form:
\[y_i = \beta_0 + \beta_1x_i + u_i\] Suppose that the explanatory variable \(x_i\) is correlated with the error term \(u_i\). Then the OLS estimates will be biased. Of course, we don’t know this a priori, because the error term \(u_i\) is unobserved. We can test for it by comparing the OLS estimates with those from an IV regression, which is what the Durbin-Wu-Hausman test does.
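As a sketch of what that comparison looks like in R (the data frame `df` and its columns are hypothetical; the AER package is assumed, and z is a candidate instrument, defined next):

```r
library(AER)  # assumed; provides ivreg() and IV diagnostics

# Hypothetical data frame df with outcome y, suspect regressor x,
# and candidate instrument z
ols = lm(y ~ x, data = df)         # biased if cov(x, u) != 0
iv  = ivreg(y ~ x | z, data = df)  # two-stage least squares, z instruments x

# With diagnostics = TRUE the summary includes the Wu-Hausman test;
# rejecting its null hypothesis indicates x is endogenous, so IV is preferred
summary(iv, diagnostics = TRUE)
```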
IV regression uses instrumental variables. A variable z is an instrument for x if it satisfies the following properties:

- Relevance: z is correlated with the endogenous explanatory variable x.
- Exogeneity: z is uncorrelated with the error term u.
This seems counterintuitive. Surely if z is correlated with x, and x is correlated with u, then z must be correlated with u. We would expect correlation to be transitive, just like equality:
If z = x and x = u, then z = u.
So it seems intuitive to say: if z is correlated with x and x is correlated with u, then z is correlated with u. Mathematically:
\[cov(z,x)\neq 0\ \text{ and }\ cov(x,u)\neq 0 \Rightarrow cov(z,u)\neq 0\] However, for correlation, this is not true. In fact, IV regression is possible precisely because correlation is NOT transitive. If it were, then an instrumental variable z correlated with x, which in turn is correlated with u, would itself be correlated with u. But then z would not satisfy the exogeneity condition for being an instrumental variable, which is a contradiction:
\[cov(z,x)\neq 0\ \text{ and }\ cov(x,u)\neq 0 \Rightarrow cov(z,u)\neq0 \quad\Rightarrow\Leftarrow\]
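To see concretely that correlation is not transitive, here is a minimal numeric counterexample (the vectors are invented for illustration and are unrelated to the simulation later in the post):

```r
# Three vectors chosen so that the pairwise correlations break transitivity
u = c(-1, 0, 1)
x = c( 0, -2, 2)
z = c( 1, -2, 1)

cor(x, u)  # 0.5  : x is correlated with u
cor(z, x)  # 0.87 : z is correlated with x...
cor(z, u)  # 0    : ...yet z is exactly uncorrelated with u
```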
It is therefore interesting to see under what conditions we can have both endogeneity and a valid instrumental variable z:
\[cov(z,x)\neq 0\ \text{ and }\ cov(x,u)\neq 0\ \text{ and }\ cov(z,u)=0\] More precisely, given variables x and u that satisfy:
\[cov(x,u)\neq 0...(1)\]
find variable z that satisfies the conditions for being an instrument:
\[cov(z,x)\neq0\ldots(2)\] \[cov(z,u)=0\ldots(3)\]
These conditions can be expressed through the equivalent linear equations:
\[x=\alpha_0+\alpha_1u+\varepsilon_x...(4)\] \[z=\gamma_0+\gamma_1x+\varepsilon_z...(5)\]
Without loss of generality, we can assume that equation (4) satisfies the ceteris paribus condition, i.e. \(cov(u,\varepsilon_x)=0\): the only restriction on x and u is condition (1), which does not require \(cov(u,\varepsilon_x)\neq0\). In contrast, we cannot assume that equation (5) satisfies \(cov(x,\varepsilon_z)=0\), because z is subject to more restrictions, namely (1), (2) and (3), which together may imply a restriction on \(cov(x,\varepsilon_z)\).
In addition, for conditions (1) and (2) to hold, we must necessarily have:
\[\alpha_1\neq 0\ \text{ and }\ \gamma_1\neq0\] Indeed, from (4) and \(cov(u,\varepsilon_x)=0\) we get \(cov(x,u)=\alpha_1\sigma_u^2\), which is nonzero only if \(\alpha_1\neq0\). Substitute x from (4) into (5): \[z=\gamma_0+\gamma_1(\alpha_0+\alpha_1u+\varepsilon_x)+\varepsilon_z\]
\[\therefore z=\gamma_0+\gamma_1\alpha_0+\gamma_1\alpha_1u+\gamma_1\varepsilon_x+\varepsilon_z...(6)\] Given \(cov(z,u)=0\):
\[cov(\gamma_0+\gamma_1\alpha_0+\gamma_1\alpha_1u+\gamma_1\varepsilon_x+\varepsilon_z,u)=0\] \[\therefore\gamma_1\alpha_1\sigma_u^2+\gamma_1cov(\varepsilon_x,u)+cov(\varepsilon_z,u)=0\] Since \(cov(u,\varepsilon_x)=0\):
\[cov(\varepsilon_z,u)=-\gamma_1\alpha_1\sigma_u^2...(7)\] Therefore \(cov(\varepsilon_z,u)\neq 0\), since \(\gamma_1\alpha_1\neq0\); this is consistent with our earlier refusal to assume that \(\varepsilon_z\) is unrestricted. We can write the relation between \(\varepsilon_z\) and u as:
\[\varepsilon_z=\delta_0+\delta_1u+\mu...(8)\] where \(\delta_1\neq 0\) and \(\mu\) is a stochastic error term independent of u. From (8):
\[cov(\varepsilon_z,u) = \delta_1\sigma_u^2\]
Equating with (7) we get:
\[\delta_1=-\gamma_1\alpha_1\] Substituting \(\delta_1\) into (8) yields:
\[\varepsilon_z=\delta_0-\gamma_1\alpha_1u+\mu...(9)\] Substitute \(\varepsilon_z\) from (9) into (6):
\[z=\gamma_0+\gamma_1\alpha_0+\gamma_1\alpha_1u+\gamma_1\varepsilon_x+(\delta_0-\gamma_1\alpha_1u+\mu)\] The terms in u cancel, so: \[\therefore z=\gamma_0+\gamma_1\alpha_0+\delta_0+\gamma_1\varepsilon_x+\mu...(10)\] Equation (10) is the data generating process (DGP) for z. In this equation, z does not depend on u, and hence \(cov(z,u)=0\). However, x does not appear in (10) either; so how is z correlated with x? The answer is that z is generated from the same error term \(\varepsilon_x\) that generates x from u.
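To make this explicit, we can compute \(cov(z,x)\) directly from (10) and (4), using \(cov(u,\varepsilon_x)=0\) and the independence of \(\mu\) (the constants drop out of the covariance):

\[\begin{aligned} cov(z,x) &= cov(\gamma_1\varepsilon_x+\mu,\ \alpha_1u+\varepsilon_x)\\ &= \gamma_1\alpha_1\,cov(\varepsilon_x,u)+\gamma_1\sigma_{\varepsilon_x}^2+cov(\mu,\alpha_1u+\varepsilon_x)\\ &= \gamma_1\sigma_{\varepsilon_x}^2\neq 0 \end{aligned}\]

since \(\gamma_1\neq0\).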
We test this with a simulation. The R code is included.
# Packages
library(ggplot2)    # scatterplots
library(gridExtra)  # arranging several plots in a grid

# Parameters
sd.ex = 2   # standard deviation of ex, the error term of x
sd.mu = 0.5 # standard deviation of mu, the error term of z
n = 100     # sample size

# The unobserved error term u
set.seed(0)
u = sort(runif(n, 0, 10))

# DGP for x, following equation (4) with alpha_0 = alpha_1 = 1
set.seed(0)
ex = rnorm(n, sd=sd.ex)
x = 1 + u + ex

# DGP for z, following equation (10) with the constant collapsed to 2
# and gamma_1 = 1. Note: resetting the seed makes mu proportional to ex
# rather than an independent draw; a single set.seed call at the top
# would avoid this.
set.seed(0)
mu = rnorm(n, sd=sd.mu)
z = 2 + ex + mu
# Scatterplots of x vs u, z vs x, and z vs u, each with a fitted OLS line
g1 = ggplot(mapping = aes(y=x, x=u)) + geom_point(size=1) + geom_smooth(method = "lm")
g2 = ggplot(mapping = aes(y=z, x=x)) + geom_point(size=1) + geom_smooth(method = "lm")
g3 = ggplot(mapping = aes(y=z, x=u)) + geom_point(size=1) + geom_smooth(method = "lm")
grid.arrange(g1, g2, g3, ncol=2)
# Correlations
cor(x,u)
## [1] 0.8381772
cor(z,x)
## [1] 0.5509224
cor(z,u)
## [1] 0.00660519
We can see that \(r_{x,u}\) and \(r_{z,x}\) are significantly different from zero, while \(r_{z,u}\) is almost equal to zero.
Result: The simulation supports the proof.
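As a closing illustration (not part of the derivation above), the simulated variables can be fed into an actual IV regression. The outcome y and its coefficients are invented for this example, and the AER package is assumed:

```r
library(AER)  # assumed; provides ivreg()

# A hypothetical outcome whose error term is the endogenous u
y = 1 + 2*x + u

coef(lm(y ~ x))         # OLS slope is biased, because cov(x, u) != 0
coef(ivreg(y ~ x | z))  # IV slope, using z as instrument, should be close to 2
```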
I hope you enjoyed this demonstration and that it is a useful contribution to our community.