# Clear environment
rm(list=ls())
# Load in the data
data(iris)
Omitted Variable Bias
Part I
What is the bias of an estimator?
The bias of an estimator is the difference between the expected value of the estimator and the true value of the parameter it estimates.
An estimator is unbiased if its expected value equals the parameter's true value.
If the expected value of the estimator is greater than the true parameter value, the estimator has positive bias; if the expected value is less than the true parameter value, the bias is negative.
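The definition above can be written compactly. The notation is an assumption introduced here for illustration: \(\theta\) is the true parameter and \(\hat{\theta}\) is its estimator.

\[\operatorname{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta\]

The estimator is unbiased when this quantity is zero, positively biased when it is greater than zero, and negatively biased when it is less than zero.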
Part II
In terms of omitted variable bias, will the bias go away if we increase the sample size or add more variables?
Bias from omitting a relevant variable does not go away when we increase the sample size, and adding other variables removes it only partially at best, although both steps may reduce some of the problems caused by OVB.
Increasing the sample size may lead to more precise point estimates for the variables included in a regression. However, it does not remove the bias in those point estimates: they become more precise around the wrong value.
Adding more variables reduces the bias in the coefficient estimates for the existing variables only if the added variables are correlated with the omitted variable. If this is the case, the additional variables will absorb some of the bias that was influencing the other variables in the model.
In summary, increasing the sample size and adding more variables can help limit the problems associated with OVB, but the only way to truly solve this problem is to include the omitted variable(s) in the regression model.
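The point about sample size can be illustrated with a small simulation. This is a minimal sketch on made-up data, not part of the iris analysis: the variable names (`x1`, `x2`, `y`), the helper `short_slope`, and all parameter values are assumptions chosen for illustration. The true effect of `x1` is 2, and `x2` is the omitted variable.

```r
# Simulated illustration; every name and parameter value here is hypothetical.
set.seed(42)

# Slope on x1 from a regression that wrongly omits x2.
short_slope <- function(n) {
  x1 <- rnorm(n)
  x2 <- 0.8 * x1 + rnorm(n)               # omitted variable, correlated with x1
  y  <- 1 + 2 * x1 + 1.5 * x2 + rnorm(n)  # true effect of x1 is 2
  unname(coef(lm(y ~ x1))["x1"])
}

sample_sizes <- c(100, 1000, 10000, 100000)
estimates <- sapply(sample_sizes, short_slope)

# The estimates settle near 2 + 1.5 * 0.8 = 3.2, not near the true value 2.
data.frame(n = sample_sizes, short_estimate = round(estimates, 3))
```

As the sample grows, the estimate becomes more stable, but it stabilizes around the biased value rather than the true one.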
Part III
1. Full model
I will use the iris data set from the datasets package that ships with base R to examine the effect of omitted variable bias. This is a cross-sectional data set with 150 observations of Iris flowers and 5 variables for each observation. I have loaded the data set and described each of the five variables below.
Variable Description
Sepal.Length: The sepal length measured in cm
Sepal.Width: The sepal width measured in cm
Petal.Length: The petal length measured in cm
Petal.Width: The petal width measured in cm
Species: A categorical variable indicating the species of each Iris flower (setosa, versicolor, virginica)
The full, correct model is shown below, with petal length as the dependent variable and the other three quantitative variables as the independent variables. Petal width is the key independent variable whose coefficient I will study under omitted variable bias. \[\text{Petal.Length}_i = \beta_0 + \beta_1 \text{Petal.Width}_i + \beta_2 \text{Sepal.Length}_i + \beta_3 \text{Sepal.Width}_i + u_i\]
full_reg <- lm(data = iris, formula = Petal.Length ~ Petal.Width + Sepal.Length + Sepal.Width)
2. Incorrect Model
Suppose that I have accidentally omitted the variable Sepal.Length. The shortened estimating equation is then: \[\text{Petal.Length}_i = \beta_0 + \beta_1 \text{Petal.Width}_i + \beta_3 \text{Sepal.Width}_i + u_i\]
short_reg <- lm(data = iris, formula = Petal.Length ~ Petal.Width + Sepal.Width)
3. OVB Formula
For an omitted variable to cause bias, the following conditions must be met:
- The omitted variable is correlated with one of the other predictor variables
- The omitted variable is correlated with the dependent variable
Therefore, for the iris data set, the omitted variable Sepal.Length must be correlated with both the dependent variable Petal.Length and at least one of the other predictors, Petal.Width or Sepal.Width.
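For reference, the standard OVB relationship for this model can be sketched as follows. The notation is an assumption introduced here: \(\tilde{\beta}_1\) is the Petal.Width coefficient from the short regression, \(\hat{\beta}_1\) and \(\hat{\beta}_2\) are the Petal.Width and Sepal.Length coefficients from the full regression, and \(\hat{\delta}_1\) is the Petal.Width coefficient from an auxiliary regression of Sepal.Length on Petal.Width and Sepal.Width.

\[\tilde{\beta}_1 = \hat{\beta}_1 + \hat{\beta}_2 \hat{\delta}_1\]

The bias term is the product \(\hat{\beta}_2 \hat{\delta}_1\). It is non-zero only when the omitted variable matters for the dependent variable (\(\hat{\beta}_2 \neq 0\)) and is related to the retained predictor (\(\hat{\delta}_1 \neq 0\)), which is what the two conditions above describe.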
Condition 1: Sepal.Length is correlated with one of the independent variables.
I have tested the correlation between Sepal.Length and Petal.Width below, along with its significance. The two variables are highly correlated, with a coefficient of about \(0.82\), and the correlation is statistically significant: the reported p-value is below 2.2e-16, which is effectively zero.
cor_test1 <- cor.test(iris$Sepal.Length, iris$Petal.Width)
cor_test1
Pearson's product-moment correlation
data: iris$Sepal.Length and iris$Petal.Width
t = 17.296, df = 148, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.7568971 0.8648361
sample estimates:
cor
0.8179411
This confirms the first condition that the omitted variable Sepal.Length is correlated with one of the predictor variables.
Condition 2: Sepal.Length is correlated with the dependent variable Petal.Length
I tested the correlation between these two variables below, along with its significance. There is a high correlation between the two, with a coefficient of about 0.87. This correlation is statistically significant: the reported p-value is below 2.2e-16, which is effectively zero.
cor_test2 <- cor.test(iris$Sepal.Length, iris$Petal.Length)
cor_test2
Pearson's product-moment correlation
data: iris$Sepal.Length and iris$Petal.Length
t = 21.646, df = 148, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8270363 0.9055080
sample estimates:
cor
0.8717538
This confirms that the omitted variable is correlated with the dependent variable. Therefore, both conditions are met, meaning that omitting Sepal.Length leads to bias in our coefficient estimates for the shortened model.
4. Positive or Negative Bias?
Removing Sepal.Length from the regression produces positive bias, since Sepal.Length is positively correlated with both the dependent variable Petal.Length and the independent variable Petal.Width. The 2 by 2 matrix below shows the direction of the bias for each combination of signs.
| | Sepal.Length has positive effect on dependent variable Petal.Length | Sepal.Length has negative effect on dependent variable Petal.Length |
|---|---|---|
| Sepal.Length and the key independent variable Petal.Width are positively correlated | Positive Bias (overestimate). This is the case in this example. | Negative Bias (underestimate) |
| Sepal.Length and the key independent variable Petal.Width are negatively correlated | Negative Bias (underestimate) | Positive Bias (overestimate) |
5. Comparison of the regressions
Below is a table comparing the regression results from the correct full model and the model that contains omitted variable bias.
library(stargazer)
Please cite as:
Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
stargazer(full_reg, short_reg, type = "text", title = "Regression Comparison")
Regression Comparison
=========================================================================
                                     Dependent variable:
                    -----------------------------------------------------
                                        Petal.Length
                               (1)                        (2)
-------------------------------------------------------------------------
Petal.Width                  1.447***                   2.156***
                             (0.068)                    (0.053)
Sepal.Length                 0.729***
                             (0.058)
Sepal.Width                 -0.646***                  -0.355***
                             (0.068)                    (0.092)
Constant                      -0.263                    2.258***
                             (0.297)                    (0.314)
-------------------------------------------------------------------------
Observations                   150                        150
R2                            0.968                      0.934
Adjusted R2                   0.967                      0.933
Residual Std. Error      0.319 (df = 146)           0.457 (df = 147)
F Statistic         1,472.726*** (df = 3; 146) 1,036.172*** (df = 2; 147)
=========================================================================
Note:                                         *p<0.1; **p<0.05; ***p<0.01
From the table above, we can confirm the positive bias that the OVB formula predicted.
In the first regression, the coefficient for petal width is about 1.447. However, after omitting sepal length in the second regression, the petal width coefficient increases to 2.156.
Therefore, the second regression overestimates the effect that petal width has on petal length when sepal length is omitted from the model, confirming that there is indeed positive bias just as we concluded using the OVB formula.
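The gap between the two Petal.Width coefficients (about 2.156 - 1.447 = 0.709) can also be decomposed directly. The sketch below uses the fact that, for OLS, the short-regression coefficient equals the full-regression coefficient plus the omitted variable's full-model coefficient times its slope in an auxiliary regression on the retained predictors. The object names `aux_reg`, `gap`, and `predicted` are assumptions introduced here; both models are refit so the block runs on its own.

```r
# Refit both models so this block is self-contained.
full_reg  <- lm(Petal.Length ~ Petal.Width + Sepal.Length + Sepal.Width, data = iris)
short_reg <- lm(Petal.Length ~ Petal.Width + Sepal.Width, data = iris)

# Auxiliary regression (hypothetical name): omitted variable on the retained predictors.
aux_reg <- lm(Sepal.Length ~ Petal.Width + Sepal.Width, data = iris)

# Observed change in the Petal.Width coefficient when Sepal.Length is dropped.
gap <- coef(short_reg)["Petal.Width"] - coef(full_reg)["Petal.Width"]

# Change implied by the OVB decomposition.
predicted <- coef(full_reg)["Sepal.Length"] * coef(aux_reg)["Petal.Width"]

# The two quantities should agree up to floating-point error.
all.equal(unname(gap), unname(predicted))
```

Both factors in `predicted` are positive in this data set, which is why the short regression overestimates the Petal.Width coefficient.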
6. Why does the OVB formula work?
The OVB formula works because it takes into account the relationship that the omitted variable has with the other independent variables and the dependent variable.
In my example, removing sepal length from the model means that a main predictor of petal length is now absent, and its effect must be absorbed by the other predictors that are still present. This is why bias exists in the coefficient estimates when sepal length is omitted.
This bias is positive because sepal length is positively correlated with both petal width and petal length. So when sepal length is omitted from the model, its positive relationship with both variables is absorbed by the coefficient of petal width, which then overestimates the true effect that petal width has on petal length.
Hypothetically, if an omitted variable is negatively related to exactly one of the two (the dependent variable or the independent variable), then the bias absorbed by the independent variable is negative and reduces the coefficient estimate, leading to an underestimate. If it is negatively related to both, the two negatives cancel and the bias is positive again.
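The hypothetical underestimate case can be sketched with simulated data, since the iris example only covers positive bias. Everything in this block is an assumption made for illustration: the variables `x1`, `x2`, and `y`, the sample size, and the coefficients. Here the omitted variable `x2` has a positive effect on `y` but is negatively correlated with the retained predictor `x1`.

```r
# Simulated illustration; every name and parameter value here is hypothetical.
set.seed(1)
n  <- 5000
x1 <- rnorm(n)
x2 <- -0.8 * x1 + rnorm(n)               # omitted variable, negatively correlated with x1
y  <- 1 + 2 * x1 + 1.5 * x2 + rnorm(n)   # x2 has a positive effect on y

full_sim  <- lm(y ~ x1 + x2)   # correctly specified model
short_sim <- lm(y ~ x1)        # model that omits x2

# The full model recovers a slope near 2; the short model lands near
# 2 + 1.5 * (-0.8) = 0.8, an underestimate.
slopes <- c(full  = unname(coef(full_sim)["x1"]),
            short = unname(coef(short_sim)["x1"]))
round(slopes, 3)
```

Flipping the sign of the `-0.8` coefficient in the line that generates `x2` reproduces the overestimate case seen with Sepal.Length.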