Bias is the difference between the average estimated value and the true value:
\[ Bias(\hat{\beta}) = E[\hat{\beta}] - \beta \]
No. OVB does not go away with more data; you just get a more precise but still biased estimate.
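A small simulation can illustrate this point. The sketch below is not based on any real dataset: the parameter values (`beta1 = -4`, `beta2 = -0.03`, the slope of 50 linking `z` to `x`) and the helper name `simulate_short_slope` are assumptions chosen only for illustration. With these values the short regression should settle near \(-4 + (-0.03)(50) = -5.5\) rather than the true \(-4\), while its standard error keeps shrinking as the sample grows.

```r
# Simulated illustration; every parameter value here is made up for the sketch.
set.seed(42)

simulate_short_slope <- function(n, beta1 = -4, beta2 = -0.03) {
  x <- rnorm(n)                      # key regressor
  z <- 50 * x + rnorm(n, sd = 50)    # omitted variable, correlated with x
  y <- 37 + beta1 * x + beta2 * z + rnorm(n)
  fit <- lm(y ~ x)                   # short regression: z is left out
  c(estimate  = unname(coef(fit)["x"]),
    std_error = summary(fit)$coefficients["x", "Std. Error"])
}

sizes <- c(100L, 10000L, 100000L)
results <- t(sapply(sizes, simulate_short_slope))
rownames(results) <- sizes           # label each row by its sample size
results
```

The estimate column should stay close to -5.5 at every sample size, while the standard error column shrinks: more data buys precision, not a smaller bias.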
I am using the built-in mtcars dataset.
mpg: miles per gallon (dependent variable \(Y_i\))
wt: car weight (key independent variable \(X_i\))
hp: horsepower (omitted variable \(Z_i\))
What is the effect of car weight on fuel efficiency?
Key independent variable: \[ X_i = wt_i \]
Omitted variable: \[ Z_i = hp_i \]
\[ mpg_i = \beta_0 + \beta_1 wt_i + \beta_2 hp_i + u_i \]
\[ mpg_i = \alpha_0 + \alpha_1 wt_i + v_i \]
\[ Bias(\hat{\alpha}_1) = \beta_2 \cdot \frac{Cov(wt_i, hp_i)}{Var(wt_i)} \]
\[ \beta_2 \neq 0 \]
Horsepower must affect mpg.
\[ Cov(wt_i, hp_i) \neq 0 \]
Horsepower must be correlated with weight.
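In this example, horsepower lowers fuel efficiency (\(\beta_2 < 0\)) and heavier cars tend to have more horsepower (\(Cov(wt_i, hp_i) > 0\)), while \(Var(wt_i)\) is always positive. Therefore: \[ Bias(\hat{\alpha}_1) < 0 \]
The bias is negative: the short model should make the weight coefficient more negative than it is in the full model.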
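As a sketch of where the bias formula comes from, write the linear projection of horsepower on weight as follows. The symbols \(\delta_0\), \(\delta_1\) and \(e_i\) are notation introduced here, not part of the original setup, and the argument assumes \(u_i\) is uncorrelated with both regressors.

\[ hp_i = \delta_0 + \delta_1 wt_i + e_i, \qquad \delta_1 = \frac{Cov(wt_i, hp_i)}{Var(wt_i)} \]

Substituting this into the full model gives

\[ mpg_i = (\beta_0 + \beta_2 \delta_0) + (\beta_1 + \beta_2 \delta_1) wt_i + (u_i + \beta_2 e_i) \]

Matching terms with the short model, \(\alpha_1 = \beta_1 + \beta_2 \delta_1\), so

\[ Bias(\hat{\alpha}_1) = \alpha_1 - \beta_1 = \beta_2 \cdot \frac{Cov(wt_i, hp_i)}{Var(wt_i)} \]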
if (!require(stargazer)) install.packages("stargazer")
library(stargazer)
data(mtcars)
df <- mtcars[, c("mpg", "wt", "hp")]
summary(df)
## mpg wt hp
## Min. :10.40 Min. :1.513 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:2.581 1st Qu.: 96.5
## Median :19.20 Median :3.325 Median :123.0
## Mean :20.09 Mean :3.217 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:3.610 3rd Qu.:180.0
## Max. :33.90 Max. :5.424 Max. :335.0
cor.test(df$hp, df$mpg)
##
## Pearson's product-moment correlation
##
## data: df$hp and df$mpg
## t = -6.7424, df = 30, p-value = 1.788e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.8852686 -0.5860994
## sample estimates:
## cor
## -0.7761684
cor.test(df$hp, df$wt)
##
## Pearson's product-moment correlation
##
## data: df$hp and df$wt
## t = 4.7957, df = 30, p-value = 4.146e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4025113 0.8192573
## sample estimates:
## cor
## 0.6587479
full_model <- lm(mpg ~ wt + hp, data = df)
short_model <- lm(mpg ~ wt, data = df)
summary(full_model)
##
## Call:
## lm(formula = mpg ~ wt + hp, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.941 -1.600 -0.182 1.050 5.854
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.22727 1.59879 23.285 < 2e-16 ***
## wt -3.87783 0.63273 -6.129 1.12e-06 ***
## hp -0.03177 0.00903 -3.519 0.00145 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.593 on 29 degrees of freedom
## Multiple R-squared: 0.8268, Adjusted R-squared: 0.8148
## F-statistic: 69.21 on 2 and 29 DF, p-value: 9.109e-12
summary(short_model)
##
## Call:
## lm(formula = mpg ~ wt, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5432 -2.3647 -0.1252 1.4096 6.8727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
## wt -5.3445 0.5591 -9.559 1.29e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
## F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
stargazer(short_model, full_model,
          type = "html",
          title = "Short vs Full Model",
          column.labels = c("Short Model", "Full Model"),
          dep.var.labels = "MPG",
          covariate.labels = c("Weight", "Horsepower"),
          digits = 3)
| Dependent variable: MPG | Short Model (1) | Full Model (2) |
| --- | --- | --- |
| Weight | -5.344*** | -3.878*** |
| | (0.559) | (0.633) |
| Horsepower | | -0.032*** |
| | | (0.009) |
| Constant | 37.285*** | 37.227*** |
| | (1.878) | (1.599) |
| Observations | 32 | 32 |
| R2 | 0.753 | 0.827 |
| Adjusted R2 | 0.745 | 0.815 |
| Residual Std. Error | 3.046 (df = 30) | 2.593 (df = 29) |
| F Statistic | 91.375*** (df = 1; 30) | 69.211*** (df = 2; 29) |

Note: *p<0.1; **p<0.05; ***p<0.01
coef(short_model)["wt"]
## wt
## -5.344472
coef(full_model)["wt"]
## wt
## -3.877831
Comparing the short and full models shows evidence of omitted variable bias: the estimated effect of weight changes once horsepower is included. In the short model the coefficient on weight is -5.344; in the full model it is -3.878. The short model therefore makes weight look more harmful to fuel efficiency than it is once horsepower is held constant. The full model provides a more accurate estimate because it accounts for an important omitted variable. Since horsepower reduces fuel efficiency and is positively correlated with weight, leaving it out causes the weight coefficient to capture part of horsepower's negative effect. Intuitively, heavier cars tend to have more powerful engines, and those engines use more fuel, so if horsepower is ignored, weight takes the blame for both effects and its estimated impact becomes too large in magnitude.
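The gap between the two weight coefficients can be checked against the bias formula directly. The sketch below refits both models so it runs on its own; the names `observed_gap` and `formula_gap` are introduced here for illustration. In sample, the two quantities should agree up to floating-point error, because \(Cov(wt_i, hp_i) / Var(wt_i)\) is the slope from regressing hp on wt.

```r
data(mtcars)

full_model  <- lm(mpg ~ wt + hp, data = mtcars)
short_model <- lm(mpg ~ wt, data = mtcars)

# Observed change in the weight coefficient when hp is omitted
observed_gap <- unname(coef(short_model)["wt"] - coef(full_model)["wt"])

# Bias formula: estimated beta_2 times sample Cov(wt, hp) / Var(wt)
formula_gap <- unname(coef(full_model)["hp"]) *
  cov(mtcars$wt, mtcars$hp) / var(mtcars$wt)

c(observed = observed_gap, formula = formula_gap)
all.equal(observed_gap, formula_gap)
```

Both values equal the difference between the coefficients reported above, \(-5.344472 - (-3.877831) \approx -1.467\), which matches the negative sign predicted for the bias.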
Full model: \[ mpg_i = \beta_0 + \beta_1 wt_i + \beta_2 hp_i + u_i \]
Short model: \[ mpg_i = \alpha_0 + \alpha_1 wt_i + v_i \]
Both OVB conditions are satisfied.
Direction of bias: \[ (-) \times (+) = (-) \]
Omitting horsepower causes negative bias in the estimated effect of weight.