Algebra Introduction
Suppose we have a linear relationship between \(y\) and four independent variables, \(x_1\), \(x_2\), \(x_3\), and \(x_4\):
\[x_1 = \begin{bmatrix} x_{11}\\ x_{12}\\ \vdots\\ x_{1n} \end{bmatrix}, \quad x_2 = \begin{bmatrix} x_{21}\\ x_{22}\\ \vdots\\ x_{2n} \end{bmatrix}, \quad x_3 = \begin{bmatrix} x_{31}\\ x_{32}\\ \vdots\\ x_{3n} \end{bmatrix}, \quad x_4 = \begin{bmatrix} x_{41}\\ x_{42}\\ \vdots\\ x_{4n} \end{bmatrix}\]
And we are trying to estimate \(y\): \[y = \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_n \end{bmatrix}\]
We know we can estimate this with the equation \(y = X\hat{\beta} + \hat{u}\), where \(\hat{u}\) is the vector of residuals.
It would look a little like this:
\[y = \begin{bmatrix} 1 & x_{11} & x_{21} & x_{31} & x_{41}\\ 1 & x_{12} & x_{22} & x_{32} & x_{42}\\ \vdots & \vdots & \vdots & \vdots & \vdots\\ 1 & x_{1n} & x_{2n} & x_{3n} & x_{4n} \end{bmatrix} \begin{bmatrix} \hat{\beta_0}\\ \hat{\beta_1}\\ \hat{\beta_2}\\ \hat{\beta_3}\\ \hat{\beta_4}\\ \end{bmatrix} + \begin{bmatrix} \hat{u_1}\\ \hat{u_2}\\ \vdots\\ \hat{u_n}\\ \end{bmatrix}\]
We also know we can get the beta estimator through the equation \(\hat{\beta} = (X'X)^{-1}X'y\).
How do we know whether or not a regression estimator is biased? Well, let's do a little experiment!
Suppose the true value of the output \(y\) is \(y_i = 1 + 3x_{1i} + 2x_{2i} + 0x_{3i} + x_{4i}\), plus some random noise. Suppose also that \(x_1, x_2, x_3\) are normally distributed, but \(x_4 = 2x_1 + x_2\) plus a normally distributed random component.
What happens when we run a linear regression on all our X’s?
generate_data <- function() {
  # x1, x2, and x3 are independent normals; x4 is a linear combination of
  # x1 and x2 plus its own normal noise, so it is correlated with both
  x1 <- rnorm(n = 1000, mean = 0, sd = 1)
  x2 <- rnorm(n = 1000, mean = 2, sd = 2)
  x3 <- rnorm(n = 1000, mean = 1, sd = 3)
  x4 <- 2*x1 + x2 + rnorm(n = 1000, mean = 0, sd = 1)
  # The leading column of ones will act as the intercept column of X
  data.frame(1, x1, x2, x3, x4)
}
generate_y <- function(X) {
  # True model: y = 1 + 3*x1 + 2*x2 + 0*x3 + 1*x4 + noise
  3*X$x1 + 2*X$x2 + X$x4 + 1 + rnorm(n = 1000, mean = 0, sd = 1)
}
X <- generate_data()
y <- generate_y(X)
colnames(X) <- c("x0", "x1", "x2", "x3", "x4")
find_beta_hat <- function(X, y) {
  # Applying the beta hat formula specified above: (X'X)^{-1} X'y
  X <- as.matrix(X)
  XtX <- t(X) %*% X
  # solve() with a single matrix argument returns its inverse
  XtX_inv <- solve(XtX)
  Xty <- t(X) %*% y
  XtX_inv %*% Xty
}
beta_hat <- find_beta_hat(X, y)
beta_hat
##          [,1]
## x0 1.01378609
## x1 3.00610732
## x2 2.00077800
## x3 0.01076265
## x4 0.99450500
Whoa, cool! We get very close to the true coefficients. (Note: the reason they aren't exactly on is the random noise we added.)
Keep in mind a couple of things:
1. We have data for every variable that impacts y.
2. Every variable was placed in the regression.
R has a much more comprehensive linear regression routine, lm(), so let's use it to check our work.
data <- data.frame(y, X)
model_1 <- lm(y ~ x1 + x2 + x3 + x4, data=data)
summary(model_1)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3035 -0.6690 0.0298 0.6912 2.7427
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.01379 0.04620 21.946 <2e-16 ***
## x1 3.00611 0.07433 40.443 <2e-16 ***
## x2 2.00078 0.03649 54.837 <2e-16 ***
## x3 0.01076 0.01093 0.985 0.325
## x4 0.99450 0.03301 30.126 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.023 on 995 degrees of freedom
## Multiple R-squared: 0.9834, Adjusted R-squared: 0.9833
## F-statistic: 1.475e+04 on 4 and 995 DF, p-value: < 2.2e-16
We were able to verify our answers and get significance estimates! As expected, \(x_1, x_2\), and \(x_4\) are all statistically significant, while \(x_3\), which is not a factor in y at all, does not pass the significance test.
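If you want an explicit check (a small sketch, assuming the beta_hat and model_1 objects computed above), the hand-rolled estimates and lm()'s coefficients agree up to numerical tolerance:
# Compare the hand-rolled OLS solution with lm()'s coefficients;
# both are in the order (intercept, x1, x2, x3, x4)
all.equal(as.numeric(beta_hat), as.numeric(coef(model_1)), tolerance = 1e-6)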
Obviously at this point, there are no omitted variables. But let’s get technical.
Omitted Variable Bias
I'll be following this Wikipedia page for most of this section. Let's define two conditions:
1. The omitted variable is a determinant of y.
2. The omitted variable is correlated with one of the explanatory variables.
In our case, \(x_4\) is a determinant of y (it’s explicitly in the equation for y), and it is also correlated with both \(x_1\) and \(x_2\).
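A quick numerical check of those correlations (a small sketch using the simulated data frame X from above):
# x4 should be strongly correlated with x1 and x2, and essentially
# uncorrelated with x3
round(cor(X[, c("x1", "x2", "x3", "x4")]), 2)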
par(mfrow=c(1,3))
plot(X$x4,X$x1, main="x1 against x4")
plot(X$x4,X$x2, main="x2 against x4")
plot(X$x4,X$x3, main="x3 against x4")
So what happens when we omit \(x_4\) from our regression? (We'll use the lm() function for this.)
model_2 <- lm(y ~ x1 + x2 + x3, data=data)
summary(model_2)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5860 -0.9641 0.0403 0.9945 4.0901
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.106526 0.063704 17.370 <2e-16 ***
## x1 5.033324 0.043631 115.360 <2e-16 ***
## x2 2.979822 0.022922 129.996 <2e-16 ***
## x3 -0.003896 0.015090 -0.258 0.796
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.414 on 996 degrees of freedom
## Multiple R-squared: 0.9683, Adjusted R-squared: 0.9682
## F-statistic: 1.014e+04 on 3 and 996 DF, p-value: < 2.2e-16
Both the estimates of \(x_1\) and \(x_2\) were revised upward! By omitting an independent variable that is correlated with the other regressors, we now have biased estimates of \(\hat{\beta_1}\) and \(\hat{\beta_2}\). The size of the shift is no accident: since \(x_4 = 2x_1 + x_2\) (plus noise) and enters y with a coefficient of 1, \(x_1\) picks up roughly \(2 \times 1 = 2\) extra and \(x_2\) roughly \(1 \times 1 = 1\) extra, which is why the new estimates land near 5 and 3.
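We can even reconstruct the shift directly: the short-regression coefficients equal the long-regression coefficients plus the estimated coefficient on the omitted \(x_4\) times the coefficients from regressing \(x_4\) on the included variables (the sample analogue of the \((X'X)^{-1}X'Z\delta\) term derived in the next section). A small sketch, assuming model_1, model_2, and data from above:
# Auxiliary regression of the omitted variable on the included ones;
# its coefficients should be roughly (0, 2, 1, 0)
aux <- lm(x4 ~ x1 + x2 + x3, data = data)
coef(aux)
# Long-regression coefficients plus beta_hat_x4 * auxiliary coefficients...
coef(model_1)[1:4] + coef(model_1)["x4"] * coef(aux)
# ...should match the short regression that omits x4
coef(model_2)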
Omitted Variable Bias: The Algebra
First, the statistical definition of bias in an estimator: an estimator is biased if \(E[\hat{\beta}|X] \neq \beta\).
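Since bias is a statement about an expectation, we can approximate \(E[\hat{\beta}|X]\) by simulation: hold X fixed, redraw only the noise, re-estimate, and average. A small sketch (the seed and the 500 replications are arbitrary choices; it reuses X and generate_y() from above):
set.seed(42)
# Re-estimate the regression that omits x4 on 500 fresh draws of the noise
short_coefs <- replicate(500, coef(lm(generate_y(X) ~ x1 + x2 + x3, data = X)))
# The x1 and x2 averages sit near 5 and 3 rather than the true 3 and 2,
# so the bias does not average away: E[beta_hat | X] != beta
rowMeans(short_coefs)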
Now for the algebra. Let's rewrite the linear equation as \(y_i = x_i\beta + z_i\delta + u_i\), where \(z_i\) is the variable we might omit. Recall that \(\hat{\beta} = (X'X)^{-1}X'y\). If we substitute \(X\beta + Z\delta + U\) in for \(y\):
\[ \begin{aligned}\hat{\beta} &= (X'X)^{-1}X'(X\beta + Z\delta + U)\\ &= ((X'X)^{-1}X'X)\beta + (X'X)^{-1}X'Z\delta + (X'X)^{-1}X'U\\ &= \beta + (X'X)^{-1}X'Z\delta + (X'X)^{-1}X'U \end{aligned} \]
If we take the expected value of \(\hat{\beta}\) conditional on \(X\), using the standard exogeneity assumption \(E[U|X] = 0\)… \[ \begin{aligned} E[\hat{\beta}|X] &= E[\beta|X] + (X'X)^{-1}X'E[Z|X]\delta + (X'X)^{-1}X'E[U|X] \\ &= \beta + (X'X)^{-1}X'E[Z|X]\delta + 0 \end{aligned} \]
If \(\delta=0\) (the omitted variable is not actually a predictor of y), the second term reduces to zero, and we have \(E[\hat{\beta}|X] = \beta\), an unbiased estimator. The term also reduces to zero if Z is uncorrelated with X!
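We can sanity-check the \(\delta = 0\) case with our simulated data: \(x_3\) has a true coefficient of zero (and was drawn independently of the other regressors), so dropping it should leave the remaining estimates essentially unchanged. A small sketch, assuming model_1 and data from above:
# Omitting x3 (true coefficient zero, uncorrelated with the rest) is harmless
model_no_x3 <- lm(y ~ x1 + x2 + x4, data = data)
coef(model_no_x3)
# Compare with the full regression: essentially the same estimates
coef(model_1)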
Implications
Omitted Variable Bias has strong implications for work in data science, since it directly affects the weights returned by a linear regression or a maximum likelihood estimation. In a real-world scenario, it is very unlikely that you have data on every variable in the true equation for your dependent variable. Thus, you will almost certainly run into Omitted Variable Bias in your adventures.
Despite the dramatic effect OVB had on the coefficients \(\hat{\beta_1}\) and \(\hat{\beta_2}\), the \(R^2\) of the two models barely moved: 0.9834 for the first model versus 0.9683 for the second. The prediction quality on new data does change as well, as we can see from the mean squared error calculations below.
# Testing the impact of omitted variable bias on predicting with new data
X_new <- generate_data()
y_new <- generate_y(X_new)
calculate_prediction_error <- function(X_data, y_data, model) {
  # Mean squared error of the model's predictions on the new data
  predictions <- predict(model, X_data)
  sum((y_data - predictions)^2) / nrow(X_data)
}
error_model_1 <- calculate_prediction_error(X_new, y_new, model_1)
error_model_2 <- calculate_prediction_error(X_new, y_new, model_2)
data.frame(list("Model"=c("Model 1", "Model 2"),
"MSE"=c(error_model_1, error_model_2)))## Model MSE
## 1 Model 1 0.9357506
## 2 Model 2 1.9650807
Conclusion
With the advent of open-source technology and the massive amount of data available to everyone, it is becoming easier than ever to run advanced statistics. But just because you can run something, and just because it spits out an output, doesn't mean the output is correct. At its worst, Omitted Variable Bias can contaminate the results of a regression so badly that you cannot reliably tell whether a given independent variable affects the dependent one positively or negatively.
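To see how bad it can get, here is a small constructed sketch (not part of the example above; the variables w1 and w2 are made up for illustration) in which the omitted variable works against the included one and the estimated sign flips:
set.seed(2)
w1 <- rnorm(1000)
w2 <- 2*w1 + rnorm(1000)              # strongly correlated with w1
y_flip <- 1 + w1 - w2 + rnorm(1000)   # true effect of w1 is +1
coef(lm(y_flip ~ w1 + w2))            # w1 coefficient near +1
coef(lm(y_flip ~ w1))                 # omit w2: w1 coefficient near 1 - 2 = -1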
For pure prediction, the damage is less dramatic: you can still get a model that fits rather well. But a model that fits well does not imply that its coefficients are correct.