What is the bias of an estimator?

Bias is the difference between the expected value of an estimator (its average over repeated samples) and the true value:

\[ Bias(\hat{\beta}) = E[\hat{\beta}] - \beta \]

Will bias go away if we increase sample size or add more variables?

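Not from a larger sample alone. OVB does not go away with more data; you just get a more precise but still biased estimate. Adding more variables helps only if they include the omitted variable itself (or a good proxy for it).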
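A small simulation can illustrate this point. The sketch below is not part of the mtcars example: the data-generating process, the coefficient values, and the helper name simulate_short are all assumptions chosen for illustration. The true effect of x is 2, but because z is omitted and correlated with x, the short regression should settle near 2 + 3 × 0.5 = 3.5 while its standard error shrinks as n grows.

```r
set.seed(42)

# Hypothetical data-generating process (all values are assumptions):
#   y = 1 + 2*x + 3*z + u,  with  z = 0.5*x + e
# The short regression of y on x alone targets 2 + 3 * 0.5 = 3.5,
# not the true coefficient 2.
simulate_short <- function(n) {
  x <- rnorm(n)
  z <- 0.5 * x + rnorm(n)
  y <- 1 + 2 * x + 3 * z + rnorm(n)
  fit <- summary(lm(y ~ x))$coefficients
  c(n = n,
    estimate  = fit["x", "Estimate"],
    std_error = fit["x", "Std. Error"])
}

# One row per sample size: the estimate stays away from 2,
# while the standard error gets smaller.
results <- t(sapply(c(100, 1000, 100000), simulate_short))
print(results)
```

The estimate column should hover around 3.5 at every sample size, while the std_error column falls: more data buys precision, not unbiasedness.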

Omitted Variable Bias Example Using mtcars

I am using the built-in mtcars dataset.

Variables

  • mpg: miles per gallon (dependent variable \(Y_i\))
  • wt: car weight (key independent variable \(X_i\))
  • hp: horsepower (omitted variable \(Z_i\))

Research Question

What is the effect of car weight on fuel efficiency?

Key independent variable: \[ X_i = wt_i \]

Omitted variable: \[ Z_i = hp_i \]

Full Model

\[ mpg_i = \beta_0 + \beta_1 wt_i + \beta_2 hp_i + u_i \]

Short Model

\[ mpg_i = \alpha_0 + \alpha_1 wt_i + v_i \]

OVB Formula

\[ Bias(\hat{\alpha}_1) = \beta_2 \cdot \frac{Cov(wt_i, hp_i)}{Var(wt_i)} \]
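A short sketch of where this formula comes from may help. It assumes the standard linear projection of horsepower on weight; the symbols \(\delta_0\), \(\delta_1\), and \(e_i\) are introduced here for illustration only.

Auxiliary projection: \[ hp_i = \delta_0 + \delta_1 wt_i + e_i, \qquad \delta_1 = \frac{Cov(wt_i, hp_i)}{Var(wt_i)} \]

Substituting into the full model: \[ mpg_i = (\beta_0 + \beta_2 \delta_0) + (\beta_1 + \beta_2 \delta_1) wt_i + (u_i + \beta_2 e_i) \]

Matching terms with the short model: \[ \alpha_1 = \beta_1 + \beta_2 \delta_1 \quad \Rightarrow \quad \alpha_1 - \beta_1 = \beta_2 \cdot \frac{Cov(wt_i, hp_i)}{Var(wt_i)} \]

So the short regression targets \(\alpha_1\) rather than \(\beta_1\), and the gap between them is exactly the bias term above.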

Two Conditions for OVB

Condition 1

\[ \beta_2 \neq 0 \]

Horsepower must affect mpg.

Condition 2

\[ Cov(wt_i, hp_i) \neq 0 \]

Horsepower must be correlated with weight.

Expected Direction of Bias

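More horsepower lowers fuel efficiency, so we expect \(\beta_2 < 0\). Heavier cars tend to have more powerful engines, so we expect \(Cov(wt_i, hp_i) > 0\).

Therefore: \[ Bias(\hat{\alpha}_1) = (-) \times (+) < 0 \]

The bias is negative: the short model pushes the weight coefficient further below zero than the true effect.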

R Code

if (!require(stargazer)) install.packages("stargazer")
library(stargazer)

data(mtcars)

df <- mtcars[, c("mpg", "wt", "hp")]

Summary Statistics

summary(df)
##       mpg              wt              hp       
##  Min.   :10.40   Min.   :1.513   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:2.581   1st Qu.: 96.5  
##  Median :19.20   Median :3.325   Median :123.0  
##  Mean   :20.09   Mean   :3.217   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:3.610   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :5.424   Max.   :335.0

Check OVB Conditions

Condition 1: Correlation between hp and mpg

cor.test(df$hp, df$mpg)
## 
##  Pearson's product-moment correlation
## 
## data:  df$hp and df$mpg
## t = -6.7424, df = 30, p-value = 1.788e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.8852686 -0.5860994
## sample estimates:
##        cor 
## -0.7761684

Condition 2: Correlation between hp and wt

cor.test(df$hp, df$wt)
## 
##  Pearson's product-moment correlation
## 
## data:  df$hp and df$wt
## t = 4.7957, df = 30, p-value = 4.146e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4025113 0.8192573
## sample estimates:
##       cor 
## 0.6587479
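The OVB formula uses the covariance scaled by the variance of weight rather than the correlation itself. As a rough sketch, that ratio can be recovered as the slope from regressing hp on wt; the names aux_model, delta_slope, and delta_ratio are introduced here for illustration.

```r
data(mtcars)

# Auxiliary regression of the omitted variable on the included one
aux_model <- lm(hp ~ wt, data = mtcars)
delta_slope <- unname(coef(aux_model)["wt"])

# Same quantity computed directly as Cov(wt, hp) / Var(wt)
delta_ratio <- cov(mtcars$wt, mtcars$hp) / var(mtcars$wt)

print(c(slope = delta_slope, ratio = delta_ratio))
```

The two numbers should agree, and a positive value is the sample counterpart of Condition 2 holding with a positive sign.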

Run Regressions

full_model <- lm(mpg ~ wt + hp, data = df)
short_model <- lm(mpg ~ wt, data = df)

summary(full_model)
## 
## Call:
## lm(formula = mpg ~ wt + hp, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.941 -1.600 -0.182  1.050  5.854 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
## wt          -3.87783    0.63273  -6.129 1.12e-06 ***
## hp          -0.03177    0.00903  -3.519  0.00145 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.593 on 29 degrees of freedom
## Multiple R-squared:  0.8268, Adjusted R-squared:  0.8148 
## F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12
summary(short_model)
## 
## Call:
## lm(formula = mpg ~ wt, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

Side-by-Side Table

stargazer(short_model, full_model,
          type = "html",
          title = "Short vs Full Model",
          column.labels = c("Short Model", "Full Model"),
          dep.var.labels = "MPG",
          covariate.labels = c("Weight", "Horsepower"),
          digits = 3)
Short vs Full Model

                              Dependent variable: MPG
                          Short Model               Full Model
                              (1)                       (2)
Weight                     -5.344***                 -3.878***
                            (0.559)                   (0.633)
Horsepower                                           -0.032***
                                                      (0.009)
Constant                   37.285***                 37.227***
                            (1.878)                   (1.599)
Observations                  32                        32
R2                           0.753                     0.827
Adjusted R2                  0.745                     0.815
Residual Std. Error     3.046 (df = 30)           2.593 (df = 29)
F Statistic         91.375*** (df = 1; 30)    69.211*** (df = 2; 29)
Note: *p<0.1; **p<0.05; ***p<0.01

Compare Coefficients

coef(short_model)["wt"]
##        wt 
## -5.344472
coef(full_model)["wt"]
##        wt 
## -3.877831
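The gap between these two coefficients can be checked against the OVB formula. The sketch below refits both models on mtcars so that it runs on its own; delta_hat, implied_bias, and observed_gap are names introduced here for illustration. In a sample, the short-model coefficient equals the full-model coefficient plus the hp coefficient times Cov(wt, hp) / Var(wt), so the two quantities should match up to floating-point error.

```r
data(mtcars)

full_model  <- lm(mpg ~ wt + hp, data = mtcars)
short_model <- lm(mpg ~ wt, data = mtcars)

# Sample analogue of Cov(wt, hp) / Var(wt)
delta_hat <- cov(mtcars$wt, mtcars$hp) / var(mtcars$wt)

# Bias implied by the OVB formula, using the full-model hp coefficient
implied_bias <- unname(coef(full_model)["hp"]) * delta_hat

# Observed gap between the short and full weight coefficients
observed_gap <- unname(coef(short_model)["wt"] - coef(full_model)["wt"])

print(c(implied_bias = implied_bias, observed_gap = observed_gap))
```

Both values should be about -1.467, which is the difference between -5.344 and -3.878 reported above.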

Interpretation

The comparison between the short and full models shows evidence of omitted variable bias because the estimated effect of weight changes once horsepower is included. In the short model, the coefficient on weight is more negative, suggesting that weight has a larger effect on reducing miles per gallon than it actually does. The full model, which includes horsepower, provides a more accurate estimate because it accounts for an important omitted variable. Since horsepower reduces fuel efficiency and is positively correlated with weight, leaving it out causes the weight coefficient to capture part of horsepower’s negative effect. As a result, the short model overstates the impact of weight. Intuitively, heavier cars tend to have more powerful engines, and those engines use more fuel, so if horsepower is ignored, weight ends up taking the blame for both effects and its estimated impact becomes too large in magnitude.

Conclusion

Full model: \[ mpg_i = \beta_0 + \beta_1 wt_i + \beta_2 hp_i + u_i \]

Short model: \[ mpg_i = \alpha_0 + \alpha_1 wt_i + v_i \]

Both OVB conditions are satisfied.

Direction of bias: \[ (-) \times (+) = (-) \]

Omitting horsepower causes negative bias in the estimated effect of weight: the short-model coefficient (-5.344) is more negative than the full-model coefficient (-3.878).