The estimator is what we use to estimate or predict an outcome based on sample data; here, that is our linear regression model. Bias is how far our estimate tends to be, on average, from the true value, and it can be positive or negative. The bias of an estimator affects the accuracy of our model's results and can lead to conclusions that are off the mark or unreliable.
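Formally, the bias of an estimator \(\hat{\beta}\) is \(E[\hat{\beta}] - \beta\): the difference between the estimator's expected value and the true parameter value it estimates.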
Will bias go away if we increase the sample size or add more variables?
Since omitted variable bias arises from the correlation between the omitted variable and both the dependent variable and the included independent variables, simply increasing the sample size will not mitigate it; the estimator remains biased no matter how much data we collect.
For additional variables to mitigate omitted variable bias, they would need to act as proxies for the omitted variable, incorporating its correlation into the model in a roundabout way. We would also need to watch for any multicollinearity violation this might cause.
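One common diagnostic for that violation is the variance inflation factor. A minimal sketch, assuming the car package is installed and using hypothetical variable names (y, x1, proxy_x, df):
# VIFs near 1 suggest little collinearity; values above roughly 5-10 are a common red flag
library(car)                              # provides vif()
model <- lm(y ~ x1 + proxy_x, data = df)  # hypothetical model with a proxy variable added
vif(model)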
Example of OVB
Data Set
The mtcars data set is a data frame with 32 observations on 11 numeric variables. The data comes from the 1974 Motor Trend US magazine and covers fuel economy (miles per gallon) and 10 other attributes of automobile design and performance for 32 different car makes/models (1973–74 model years). Below are the variables in the data set that I will focus on in this analysis.
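As a minimal setup sketch (assuming the subset keeps the five variables that appear later in this analysis: mpg, wt, cyl, gear, and qsec):
# Load the built-in mtcars data and keep the variables used in this analysis
data(mtcars)
my_data <- mtcars[, c("mpg", "wt", "cyl", "gear", "qsec")]
head(my_data)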
We know that Omitted Variable Bias is a concern when the omitted variable is correlated with both the independent variables and dependent variable. Next, I will check for correlation.
# Create a correlation matrix of the data
corr_matrix <- cor(my_data)
corr_matrix
\(X\) is correlated with the omitted variable - From the correlation matrix above, we can see that the omitted variable cyl is positively correlated with the key x variable wt, with a correlation coefficient of 0.78.
The omitted variable is a determinant of the dependent variable \(Y\) - We also see that cyl is negatively correlated with the y variable mpg, with a correlation coefficient of -0.85.
The direction of the bias follows a simple sign rule: it is the product of the sign of the correlation between the omitted variable and the included x variable and the sign of the omitted variable's effect on y. Because cyl and wt are positively correlated and cyl has a negative effect on mpg, the bias here is negative. With negative bias in a model, the estimate tends to fall below the true value.
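A minimal sketch of that comparison, assuming my_data is the mtcars subset defined above (the commented values are what the full mtcars data produce):
# Short model: cyl is omitted, so part of its effect loads onto wt
short_model <- lm(mpg ~ wt, data = my_data)
# Full model: cyl is included
full_model <- lm(mpg ~ wt + cyl, data = my_data)
coef(short_model)["wt"]   # roughly -5.34
coef(full_model)["wt"]    # roughly -3.19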
Since the bias is negative, we can see in the comparison above that the estimated key coefficient in the short model is larger in absolute value than its true unknown value. In other words, the coefficient on wt is more negative because of the negative bias present in the short model.
The omitted variable bias formula works because omitting a relevant variable from a statistical model causes that variable's effect to be incorrectly attributed to the key variable that remains in the model. This ultimately biases the estimated relationships between variables.
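A sketch of that formula at work, reusing the short_model and full_model objects from the comparison above: for OLS, the short model's wt coefficient equals the full model's wt coefficient plus a bias term, where the bias is the cyl coefficient times the slope from regressing cyl on wt.
# Auxiliary regression: how the omitted variable moves with the included x
aux_model <- lm(cyl ~ wt, data = my_data)
# Bias on wt = (effect of cyl in the full model) x (slope of cyl on wt); negative here
bias <- coef(full_model)["cyl"] * coef(aux_model)["wt"]
# For OLS this identity holds exactly: short = full + bias
coef(full_model)["wt"] + bias   # equals coef(short_model)["wt"]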
Bonus
# Add a variable to the regression that does not impact y (is uncorrelated with y) but
# is correlated with the key x variable, and show that the point estimate will not change (significantly).
full_model2 <- lm(mpg ~ wt + cyl + gear, data = my_data)
summary(full_model2)
Call:
lm(formula = mpg ~ wt + cyl + gear, data = my_data)
Residuals:
Min 1Q Median 3Q Max
-4.8443 -1.5455 -0.3932 1.4220 5.9416
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 42.3864 4.3790 9.679 1.97e-10 ***
wt -3.3921 0.8208 -4.133 0.000294 ***
cyl -1.5280 0.4198 -3.640 0.001093 **
gear -0.5229 0.7789 -0.671 0.507524
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.592 on 28 degrees of freedom
Multiple R-squared: 0.8329, Adjusted R-squared: 0.815
F-statistic: 46.53 on 3 and 28 DF, p-value: 5.262e-11
After adding the x variable gear, which is more strongly correlated with the key x variable wt than with the y variable mpg, the summary above shows that the point estimate for wt is now -3.3921. Under the original full model it was -3.1910, so adding gear to the model did not change the coefficient estimate for wt significantly.
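That claim about the correlations can be spot-checked directly (the commented values are from the full mtcars data):
# gear correlates more strongly (in absolute value) with wt than with mpg
cor(my_data$gear, my_data$wt)    # roughly -0.58
cor(my_data$gear, my_data$mpg)   # roughly  0.48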
# Add a variable to the regression that impacts y (is correlated with y) but
# is not correlated with the key x variable, and show that the point estimate
# will not change (significantly).
full_model3 <- lm(mpg ~ wt + cyl + qsec, data = my_data)
summary(full_model3)
Call:
lm(formula = mpg ~ wt + cyl + qsec, data = my_data)
Residuals:
Min 1Q Median 3Q Max
-4.5937 -1.5621 -0.3595 1.2097 5.5500
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.4291 8.1912 3.593 0.001238 **
wt -3.8616 0.9138 -4.226 0.000229 ***
cyl -0.9277 0.6113 -1.518 0.140280
qsec 0.4945 0.3863 1.280 0.211061
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.54 on 28 degrees of freedom
Multiple R-squared: 0.8396, Adjusted R-squared: 0.8224
F-statistic: 48.86 on 3 and 28 DF, p-value: 2.979e-11
After adding the x variable qsec, which is more strongly correlated with the y variable mpg than with the key x variable wt, the summary above shows that the point estimate for wt is now -3.8616. Under the original full model it was -3.1910, so adding qsec to the model did not change the coefficient estimate for wt significantly.
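The analogous spot-check for qsec (again, commented values are from the full mtcars data):
# qsec correlates more strongly with mpg than with wt
cor(my_data$qsec, my_data$mpg)   # roughly  0.42
cor(my_data$qsec, my_data$wt)    # roughly -0.17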