Omitted Variable Bias (OVB)

1. Bias of an Estimator

Bias is the difference between the expected value of an estimator and the true parameter: Bias(β̂) = E(β̂) − β. An unbiased estimator hits the true value on average across repeated samples, like a rifle scope that doesn’t systematically aim left or right. OLS is unbiased when the Gauss-Markov (GM) assumptions hold, in particular the zero conditional mean assumption E(u | X) = 0. Violate it and your estimates are systematically pulled away from the truth.
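The "on average across repeated samples" idea can be checked with a small simulation. This is a minimal sketch in base R, not part of the original analysis: the true slope, sample size, and number of replications are made-up values chosen for illustration.

```r
# Simulation sketch: OLS slope estimates average out to the true slope
# when the error term is independent of the regressor.
set.seed(42)        # arbitrary seed, for reproducibility only
true_beta <- 2      # hypothetical true slope
n_reps    <- 2000   # number of repeated samples
n_obs     <- 50     # observations per sample

slope_estimates <- replicate(n_reps, {
  x <- rnorm(n_obs)
  y <- 1 + true_beta * x + rnorm(n_obs)  # error is independent of x
  coef(lm(y ~ x))[["x"]]                 # keep only the slope estimate
})

# Estimated bias: average estimate minus the true parameter
estimated_bias <- mean(slope_estimates) - true_beta
estimated_bias  # should be close to zero
```

Any single estimate can miss the target; it is the average over many samples that lands on the true value.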

2. Does OVB go away with more data or more variables?

More data: no. Bigger samples reduce variance but not bias, so you just get a more precise wrong answer. More variables: only if you add the right one. The right one is the omitted variable itself, which must be correlated with both your key X and Y. Including that variable is the cure; adding unrelated controls does not help.
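A simulation makes the "more precise wrong answer" point concrete. The sketch below uses an invented data-generating process (the coefficients −1, 2, and −0.5 are assumptions, not estimates from any dataset). Under these assumptions the short regression converges to −1.8 rather than the true −1, whatever the sample size.

```r
# Simulation sketch: the short regression stays biased as n grows;
# only its sampling variability shrinks.
set.seed(123)  # arbitrary seed

simulate_short_slope <- function(n_obs, n_reps = 200) {
  replicate(n_reps, {
    z <- rnorm(n_obs)                   # omitted variable
    x <- -0.5 * z + rnorm(n_obs)        # key regressor, correlated with z
    y <- 1 - 1 * x + 2 * z + rnorm(n_obs)  # true slope on x is -1
    coef(lm(y ~ x))[["x"]]              # short regression leaves out z
  })
}

small <- simulate_short_slope(100)
large <- simulate_short_slope(10000)

results <- data.frame(
  n_obs         = c(100, 10000),
  mean_estimate = c(mean(small), mean(large)),  # both near -1.8, not -1
  sd_estimate   = c(sd(small), sd(large))       # shrinks as n grows
)
results
```

The spread of the estimates falls sharply with the larger sample, while the average stays about 0.8 below the true slope.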

data(bwght, package = "wooldridge")
head(bwght)
  faminc cigtax cigprice bwght fatheduc motheduc parity male white cigs
1   13.5   16.5    122.3   109       12       12      1    1     1    0
2    7.5   16.5    122.3   133        6       12      2    1     0    0
3    0.5   16.5    122.3   129       NA       12      2    0     0    0
4   15.5   16.5    122.3   126       12       12      2    1     0    0
5   27.5   16.5    122.3   134       14       12      2    1     1    0
6    7.5   16.5    122.3   118       12       14      6    1     0    0
    lbwght bwghtlbs packs    lfaminc
1 4.691348   6.8125     0  2.6026897
2 4.890349   8.3125     0  2.0149031
3 4.859812   8.0625     0 -0.6931472
4 4.836282   7.8750     0  2.7408400
5 4.897840   8.3750     0  3.3141861
6 4.770685   7.3750     0  2.0149031
dim(bwght)
[1] 1388   14
help("bwght", package = "wooldridge")

OVB conditions

cor.test(bwght$bwght, bwght$faminc)

    Pearson's product-moment correlation

data:  bwght$bwght and bwght$faminc
t = 4.0799, df = 1386, p-value = 4.762e-05
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.05664503 0.16063261
sample estimates:
      cor 
0.1089368 
# Condition 2: omitted variable (faminc) vs. key regressor (cigs)
cor.test(bwght$cigs, bwght$faminc)

Both conditions for OVB are met. Condition 1 requires that the omitted variable be correlated with the dependent variable. Here, family income (faminc) and birth weight (bwght) have a correlation of +0.109, significant at the 0.1% level: richer families tend to have heavier babies, plausibly because of better nutrition, better prenatal care, and lower stress. Condition 2 requires that the omitted variable be correlated with the key independent variable. Family income and cigarettes smoked (cigs) have a correlation of −0.173, also significant at the 0.1% level: lower-income mothers smoke more on average. Since both conditions are met, omitting faminc will bias our estimate of the effect of smoking on birth weight.

Regressions

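library(stargazer)  # formats regression tables

# Short model omits family income; full model controls for it
short_reg <- lm(bwght ~ cigs, data = bwght)
full_reg <- lm(bwght ~ cigs + faminc, data = bwght)

stargazer(short_reg, full_reg, type = "text")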

=====================================================================
                                   Dependent variable:               
                    -------------------------------------------------
                                          bwght                      
                              (1)                      (2)           
---------------------------------------------------------------------
cigs                       -0.514***                -0.463***        
                            (0.090)                  (0.092)         
                                                                     
faminc                                               0.093***        
                                                     (0.029)         
                                                                     
Constant                   119.772***               116.974***       
                            (0.572)                  (1.049)         
                                                                     
---------------------------------------------------------------------
Observations                 1,388                    1,388          
R2                           0.023                    0.030          
Adjusted R2                  0.022                    0.028          
Residual Std. Error    20.129 (df = 1386)       20.063 (df = 1385)   
F Statistic         32.235*** (df = 1; 1386) 21.274*** (df = 2; 1385)
=====================================================================
Note:                                     *p<0.1; **p<0.05; ***p<0.01

In the short model (1), the coefficient on cigs is −0.514, meaning each additional cigarette per day is associated with a 0.514 ounce reduction in birth weight. In the full model (2), once we control for family income, the coefficient shrinks in magnitude to −0.463. The short model was overstating the negative effect of smoking by about 0.05 ounces per cigarette. This is precisely the negative bias we predicted from the bottom left cell of the 2×2 matrix: faminc has a positive effect on bwght and a negative correlation with cigs, and a positive effect combined with a negative correlation yields negative bias.
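The size and sign of the bias can be worked out from the reported numbers. The sketch below uses only the rounded coefficients from the table and the reported correlation of −0.173. The name implied_delta is introduced here for illustration: it stands for the slope from regressing faminc on cigs, backed out from the standard OVB identity (short coefficient = full coefficient + coefficient on the omitted variable × that slope). Because the inputs are rounded, the results are approximate.

```r
# Coefficients as reported (rounded) in the regression table
b_short_cigs  <- -0.514  # cigs, short model (1)
b_full_cigs   <- -0.463  # cigs, full model (2)
b_full_faminc <-  0.093  # faminc, full model (2)
cor_cigs_faminc <- -0.173  # correlation reported in the text

# Estimated bias in the short model: short minus full
bias_estimate <- b_short_cigs - b_full_cigs

# Slope of faminc on cigs implied by the OVB identity (approximate)
implied_delta <- bias_estimate / b_full_faminc

# Sign of the bias: sign of faminc's effect on bwght
# times the sign of its correlation with cigs
bias_sign <- sign(b_full_faminc) * sign(cor_cigs_faminc)

bias_estimate
implied_delta
bias_sign
```

A negative bias_sign matches the direction of the gap between the two cigs coefficients.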

Why does it happen?

In the short model, cigs is doing two jobs at once: capturing the true effect of smoking and accidentally absorbing the effect of low income, since poorer mothers both smoke more and have lighter babies. The model can’t tell the two effects apart, so it attributes all of it to smoking, making cigarettes look more harmful than they actually are. Once we add faminc to the full model, it takes over its own portion of the explanation, and the smoking coefficient moves closer to its true value (other omitted variables may still remain). This is the essence of OVB: the key variable gets “credit” for effects that don’t actually belong to it.
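The "two jobs at once" story corresponds to an exact in-sample identity for OLS: the short coefficient equals the full coefficient plus the omitted variable's coefficient times the slope from regressing the omitted variable on the key regressor. The sketch below checks that identity on simulated data. The variable names and every number in the data-generating process are hypothetical and only loosely mimic the smoking and income setting; none of them are taken from the bwght data.

```r
# Simulation sketch: decompose the short-regression coefficient
# into the "own" effect plus the absorbed effect of the omitted variable.
set.seed(2024)  # arbitrary seed
n <- 1000

income  <- rnorm(n, mean = 30, sd = 10)                  # omitted variable
smoking <- pmax(0, 5 - 0.1 * income + rnorm(n, sd = 3))  # poorer -> more smoking
weight  <- 115 - 0.4 * smoking + 0.1 * income + rnorm(n, sd = 15)

short_fit <- lm(weight ~ smoking)           # omits income
full_fit  <- lm(weight ~ smoking + income)  # controls for income
aux_fit   <- lm(income ~ smoking)           # how income moves with smoking

short_coef <- coef(short_fit)[["smoking"]]
absorbed   <- coef(full_fit)[["income"]] * coef(aux_fit)[["smoking"]]
rebuilt    <- coef(full_fit)[["smoking"]] + absorbed

c(short = short_coef, rebuilt = rebuilt, absorbed = absorbed)
```

The absorbed term is the share of the income effect that the short model wrongly credits to smoking; adding income to the regression hands that share back to the variable it belongs to.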