Residuals May Correlate with Explanatory Variables

Purpose

We show that the residuals of a linear regression can be correlated with the explanatory variables when the model is fit without an intercept. The response y is a function of two explanatory variables, x1 and x2.

set.seed(0)
x1 <- runif(100)
x2 <- runif(100)
y <- 2 * x1 + x2^2 + 1  # intentionally nonlinear in x2, so a linear fit cannot be perfect

Allow an intercept in linear regression

When the model includes an intercept, the sample correlation between the residuals and each explanatory variable is zero (up to floating-point error), and the residuals sum to zero.
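A brief sketch of why this holds, using the least-squares normal equations. The notation is an assumption introduced here, not taken from the text: X is the n-by-3 design matrix with columns (1, x1, x2), e is the residual vector, and bars denote sample means.

```latex
% Normal equations: the residuals are orthogonal to every column of X
\[
X^\top e = 0
\;\Longrightarrow\;
\sum_{i=1}^{n} e_i = 0,
\qquad
\sum_{i=1}^{n} x_{ij}\, e_i = 0 \quad (j = 1, 2)
\]

% Sample covariance between the residuals and x_j
\[
\widehat{\operatorname{Cov}}(e, x_j)
= \frac{1}{n-1}\Bigl(\sum_{i=1}^{n} x_{ij}\, e_i - n\,\bar{x}_j\,\bar{e}\Bigr)
= 0
\]
```

Both terms in the parentheses vanish only because the column of ones is part of X. That column is what forces the residual mean to be zero.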

df <- data.frame(x1, x2, y)
fit <- lm(y ~ x1 + x2, data = df)  # named `fit` to avoid masking the lm() function
df$y_hat <- predict(fit, df)
df$residuals <- df$y - df$y_hat
cor1 <- cor(df$residuals, df$x1)
cor2 <- cor(df$residuals, df$x2)
sum_residual <- sum(df$residuals)

head(df)
##          x1        x2        y    y_hat   residuals
## 1 0.8966972 0.6049333 3.159339 3.248314 -0.08897483
## 2 0.2655087 0.6547239 1.959681 1.995757 -0.03607655
## 3 0.3721239 0.3531973 1.868996 1.905783 -0.03678651
## 4 0.5728534 0.2702601 2.218747 2.235086 -0.01633843
## 5 0.9082078 0.9926841 3.801837 3.670992  0.13084524
## 6 0.2016819 0.6334933 1.804678 1.842076 -0.03739840

The correlation coefficients between the residuals and x1 and x2 are 2.5381673 × 10^{-15} and -5.0735936 × 10^{-16}, respectively. The sum of all residuals is 4.5075055 × 10^{-14}. All of these values are zero up to floating-point error.
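The near-zero values can also be checked directly against the design matrix. The following is a minimal sketch that regenerates the same data. The names `fit`, `X`, and `e` are chosen here for illustration, and the tolerance 1e-8 is an arbitrary assumption.

```r
# Regenerate the data used above
set.seed(0)
x1 <- runif(100)
x2 <- runif(100)
y <- 2 * x1 + x2^2 + 1

fit <- lm(y ~ x1 + x2)
X <- model.matrix(fit)  # columns: (Intercept), x1, x2
e <- residuals(fit)

# t(X) %*% e: one entry per column of X, each zero up to floating-point error
crossprod(X, e)

# Assumed tolerance; stops with an error if any entry is not near zero
stopifnot(all(abs(crossprod(X, e)) < 1e-8))
```

The first entry of the cross-product is the sum of the residuals, since the intercept column of X is all ones.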

Now we will see what happens when we do not allow an intercept in the linear fit.

Don’t allow the intercept in linear regression (use the same data)

df <- data.frame(x1, x2, y)
fit <- lm(y ~ x1 + x2 + 0, data = df)  # the `+ 0` term removes the intercept from the model
df$y_hat <- predict(fit, df)
df$residuals <- df$y - df$y_hat
cor1 <- cor(df$residuals, df$x1)
cor2 <- cor(df$residuals, df$x2)
sum_residual <- sum(df$residuals)

head(df)
##          x1        x2        y    y_hat  residuals
## 1 0.8966972 0.6049333 3.159339 3.461635 -0.3022961
## 2 0.2655087 0.6547239 1.959681 1.824276  0.1354047
## 3 0.3721239 0.3531973 1.868996 1.608236  0.2607600
## 4 0.5728534 0.2702601 2.218747 2.016174  0.2025737
## 5 0.9082078 0.9926841 3.801837 4.144671 -0.3428341
## 6 0.2016819 0.6334933 1.804678 1.614562  0.1901154

When we force the intercept to be 0, the correlation coefficients between the residuals and x1 and x2 are -0.692422 and -0.6843629, respectively. The sum of all residuals is 9.3365199. None of these values is close to zero.
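One way to see where the correlation comes from: without an intercept, least squares still forces sum(x1 * e) and sum(x2 * e) to zero, but it no longer forces the residuals to sum to zero. The sample covariance subtracts a term proportional to mean(x) * mean(e), and that term is what remains. The sketch below regenerates the same data. The names `fit0`, `e`, and `cov_by_hand` are introduced here for illustration only.

```r
# Regenerate the data used above
set.seed(0)
x1 <- runif(100)
x2 <- runif(100)
y <- 2 * x1 + x2^2 + 1

fit0 <- lm(y ~ x1 + x2 + 0)  # no intercept
e <- residuals(fit0)
n <- length(e)

# Uncentered cross-products are still zero up to floating-point error
c(sum(x1 * e), sum(x2 * e))

# The residual mean is no longer forced to zero
mean(e)

# Sample covariance written out: only the mean-product term survives
cov_by_hand <- (sum(x1 * e) - n * mean(x1) * mean(e)) / (n - 1)
all.equal(cov_by_hand, cov(e, x1))
```

Because mean(x1) and mean(e) are both positive for this data, the covariance is negative, which is consistent with the negative correlations reported above.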