This example is largely inspired by the Multivariable Regression lecture in the Johns Hopkins University course Regression Models, available on Coursera.
Simpson’s Paradox is an interesting phenomenon that can happen when new variables are added to a model. In the following example, there are two explanatory variables (x1 and x2) and a response variable (y).
Let’s start with the explanatory variables. x1 increases linearly from 1 to 1000. x2 is roughly 100 times x1, with some uniform noise added. In other words, both increase together in a positive direction.
x1 <- 1:1000                          # integers 1 through 1000
x2 <- x1*100 + runif(1000,-2000,2000) # roughly 100 times x1, plus uniform noise
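Because x2 is essentially 100 times x1 plus a little noise, the two predictors are almost perfectly collinear. As a quick sanity check (a minimal sketch using the vectors defined above; the exact value will vary from run to run):

cor(x1, x2) # should be very close to 1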
Now for the response variable. y is simply x2 minus x1, plus a little noise. Note that this relationship (x2 minus x1) is precisely what the model will be trying to recover.
y <- x2-x1+runif(1000,-100,100) # true relationship: x2 minus x1, plus noise
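Since the noise term is bounded, a quick check (again just a sketch, using the objects above) confirms that y never strays more than 100 from x2 minus x1:

range(y - (x2-x1)) # both values should fall within [-100, 100]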
Finally, for ease of use, let’s combine them all into a data frame…
df <- data.frame(x1=x1,x2=x2,y=y)
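A quick peek at the first few rows (output omitted here, since the noise terms are random):

head(df)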
First, let’s plot y as a function of x1 so we can see that there is a linear relationship between the two.
library(ggplot2)
ggplot(df,aes(x=x1,y=y))+geom_point()+geom_smooth(method="lm")
Let’s fit the model and grab the coefficient for x1.
fit <- lm(y~x1,df)
summary(fit)$coef[,1][2]
## x1
## 98.94849
We see that when the model only has x1 to go on, x1 is very strongly and positively associated with the response y. This makes sense: substituting the definition of x2, we get y ≈ 100*x1 - x1 + noise = 99*x1 + noise, so a marginal coefficient near 99 is exactly what we’d expect.
Now, let’s do the same thing, but with both variables (x1 and x2). Recall that the actual response variable is simply x2 minus x1.
fit <- lm(y~x1+x2,df)
summary(fit)$coef[,1][2:3]
## x1 x2
## -1.444986 1.004497
The coefficient for x1 is now negative! In the first model, with only x1 available, there was nothing to indicate that x1 actually contributes negatively to the response. Only after including x2, which accounts for the bulk of the positive variance, could the model recover x1’s true negative effect on the response. Note that the estimates land close to the true coefficients of -1 and 1.
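One way to visualize why the sign flips is an added-variable (partial regression) plot: remove x2’s influence from both y and x1, then look at the relationship between the leftover residuals. This is a sketch using the objects defined above; by the Frisch-Waugh-Lovell theorem, the slope of this residual-on-residual regression equals the x1 coefficient from the full model.

# Residuals of y and x1 after removing the effect of x2
ey  <- resid(lm(y ~ x2, df))
ex1 <- resid(lm(x1 ~ x2, df))
# The residual relationship slopes downward, matching the negative coefficient
ggplot(data.frame(ex1, ey), aes(x=ex1, y=ey)) + geom_point() + geom_smooth(method="lm")
coef(lm(ey ~ ex1))[2] # matches the x1 coefficient from the full model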
Regression models are great at explaining the variance of an outcome given a specific set of variables. However, the importance and impact of those variables can only be judged in relation to the other variables given to the model. Adding a new variable (or new set of variables) that accounts for much more of the variance can drastically change the coefficients and impact of the existing variables.