Week 3_Omitted Variable Bias_Corinne Willis

What is bias of an estimator?

An estimator is the rule we apply to sample data to estimate an unknown quantity; here, it is the set of coefficient estimates from our linear regression model. Bias is how far the estimator's average value, taken over repeated samples, falls from the true value. Bias can be positive or negative. A biased estimator reduces the accuracy of our model’s results and can lead to conclusions that are off the mark or unreliable.
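In symbols, a minimal statement of this definition, writing \(\hat{\beta}\) for the estimator and \(\beta\) for the true parameter (notation assumed here for illustration), is:

\[ \text{Bias}(\hat{\beta}) = E[\hat{\beta}] - \beta \]

An estimator is unbiased when this quantity equals zero, positively biased when it is above zero, and negatively biased when it is below zero.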

Will bias go away if sample size or variables are increased?

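No. Omitted variable bias arises because the omitted variable is correlated with both the dependent variable and an included independent variable. Collecting more observations does not change that correlation, so simply increasing the sample size will not mitigate the bias. Adding more independent variables that are unrelated to the omitted variable will not mitigate it either.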

For additional variable(s) to mitigate the omitted variable bias, they would need to be proxies for the omitted variable, so that its correlation is incorporated into the model in a roundabout way. We would need to watch for any multicollinearity that this might introduce.
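To illustrate the sample size point, here is a minimal simulation sketch. The data-generating process is entirely hypothetical (the names `z`, `x`, `y` and every numeric value are assumptions chosen for illustration, not part of the mtcars analysis). The true slope on `x` is 2, but the short model omits `z`.

```r
# Simulated illustration: every number below is an assumption made for this sketch.
set.seed(42)

simulate_short_slope <- function(n) {
  z <- rnorm(n)                      # omitted variable
  x <- 0.8 * z + rnorm(n)            # key regressor, correlated with z
  y <- 1 + 2 * x + 3 * z + rnorm(n)  # true slope on x is 2
  unname(coef(lm(y ~ x))["x"])       # slope from the short model (z omitted)
}

sample_sizes <- c(100, 1000, 10000, 100000)
short_slopes <- sapply(sample_sizes, simulate_short_slope)

# One row per sample size, with the short-model slope estimate
data.frame(n = sample_sizes, short_slope = short_slopes)
```

Under these assumptions, the short-model slope should settle near 2 + 3 × (0.8 / 1.64) ≈ 3.46 as n grows, rather than near the true value of 2. A larger sample makes the estimate more precise, but precise around the wrong number.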

Example of OVB

Data Set

The mtcars data set is a data frame with 32 observations on 11 numeric variables. The data come from the 1974 Motor Trend US magazine and cover fuel economy (miles per gallon) and 10 other attributes of automobile design and performance for 32 car makes/models (1973–74 models). Below are the variables in the data set that I will focus on in this analysis.

  1. mpg: Miles/(US) gallon

  2. wt: Weight (1000 lbs)

  3. cyl: Number of cylinders

my_data <- mtcars
my_data
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Full Estimating Model

The target (dependent) variable for this analysis is miles per gallon (mpg). The key independent variable is weight (wt), and the number of cylinders (cyl) is included as a control.

\[ mpg_i = \beta_0 + \beta_1 wt_i + \beta_2 cyl_i + \epsilon_i \]

Short Estimating Model

The target (dependent) variable is again miles per gallon (mpg). The number of cylinders (cyl) variable has been omitted, leaving weight (wt) as the only independent variable.

\[ mpg_i = \beta_0 + \beta_1 wt_i + \epsilon_i \]

We know that omitted variable bias is a concern when the omitted variable is correlated with both the key independent variable and the dependent variable. Next, I will check these correlations.

# Create a correlation matrix of the data
corr_matrix <- cor(my_data)
corr_matrix
            mpg        cyl       disp         hp        drat         wt
mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.68117191 -0.8676594
cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.69993811  0.7824958
disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.71021393  0.8879799
hp   -0.7761684  0.8324475  0.7909486  1.0000000 -0.44875912  0.6587479
drat  0.6811719 -0.6999381 -0.7102139 -0.4487591  1.00000000 -0.7124406
wt   -0.8676594  0.7824958  0.8879799  0.6587479 -0.71244065  1.0000000
qsec  0.4186840 -0.5912421 -0.4336979 -0.7082234  0.09120476 -0.1747159
vs    0.6640389 -0.8108118 -0.7104159 -0.7230967  0.44027846 -0.5549157
am    0.5998324 -0.5226070 -0.5912270 -0.2432043  0.71271113 -0.6924953
gear  0.4802848 -0.4926866 -0.5555692 -0.1257043  0.69961013 -0.5832870
carb -0.5509251  0.5269883  0.3949769  0.7498125 -0.09078980  0.4276059
            qsec         vs          am       gear        carb
mpg   0.41868403  0.6640389  0.59983243  0.4802848 -0.55092507
cyl  -0.59124207 -0.8108118 -0.52260705 -0.4926866  0.52698829
disp -0.43369788 -0.7104159 -0.59122704 -0.5555692  0.39497686
hp   -0.70822339 -0.7230967 -0.24320426 -0.1257043  0.74981247
drat  0.09120476  0.4402785  0.71271113  0.6996101 -0.09078980
wt   -0.17471588 -0.5549157 -0.69249526 -0.5832870  0.42760594
qsec  1.00000000  0.7445354 -0.22986086 -0.2126822 -0.65624923
vs    0.74453544  1.0000000  0.16834512  0.2060233 -0.56960714
am   -0.22986086  0.1683451  1.00000000  0.7940588  0.05753435
gear -0.21268223  0.2060233  0.79405876  1.0000000  0.27407284
carb -0.65624923 -0.5696071  0.05753435  0.2740728  1.00000000
# Plot the correlation matrix in a correlogram
library(corrplot)
corrplot 0.92 loaded
corrplot(corr_matrix, type = "upper",
         method = "square",
         addCoef.col = "black", number.cex = 0.5,
         tl.col = "black", tl.srt = 45, tl.cex=0.5)
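Since only three variables matter for the OVB check, a narrower look can be easier to read than the full matrix. The sketch below uses the built-in mtcars data directly; the names `focus_vars` and `focus_corr` are introduced here for illustration.

```r
# Keep only the three variables used in this analysis (built-in mtcars data).
focus_vars <- mtcars[, c("mpg", "wt", "cyl")]
focus_corr <- round(cor(focus_vars), 2)
focus_corr

# The two correlations that matter for the OVB conditions
focus_corr["cyl", "wt"]   # omitted variable vs. key x variable
focus_corr["cyl", "mpg"]  # omitted variable vs. y variable
```

These two values should match the cyl row of the full correlation matrix printed above, rounded to two decimals.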

Two Conditions for OVB

  1. \(X\) is correlated with the omitted variable - From the correlogram above, we can see that the omitted variable cyl is positively correlated with the key x variable wt, with a correlation coefficient of 0.78.

  2. The omitted variable is a determinant of the dependent variable \(Y\) - We also see that cyl is negatively correlated with the y variable mpg, with a correlation coefficient of -0.85. This is consistent with cyl having a negative effect on mpg.

By the sign rule for omitted variable bias, because cyl and wt are positively correlated and cyl has a negative effect on mpg, the bias is negative. With negative bias, the short model is likely to underestimate the wt coefficient, that is, to report a value below (more negative than) the true one.
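As a sketch of where the sign of the bias comes from, write \(\tilde{\beta}_1\) for the wt coefficient in the short model and \(\delta_1\) for the slope from an auxiliary regression of cyl on wt (both symbols are notation introduced here for illustration). Assuming the error in the full model is uncorrelated with wt and cyl, the standard omitted variable bias result is:

\[ cyl_i = \delta_0 + \delta_1 wt_i + v_i \]

\[ E[\tilde{\beta}_1] = \beta_1 + \beta_2 \delta_1 \]

The bias term is \(\beta_2 \delta_1\). Here \(\beta_2 < 0\) (more cylinders, lower mpg) and \(\delta_1 > 0\) (heavier cars tend to have more cylinders), so the product is negative.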

Model Comparison

full_model <- lm(mpg ~ wt + cyl, data=my_data)
summary(full_model)

Call:
lm(formula = mpg ~ wt + cyl, data = my_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2893 -1.5512 -0.4684  1.5743  6.1004 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  39.6863     1.7150  23.141  < 2e-16 ***
wt           -3.1910     0.7569  -4.216 0.000222 ***
cyl          -1.5078     0.4147  -3.636 0.001064 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.568 on 29 degrees of freedom
Multiple R-squared:  0.8302,    Adjusted R-squared:  0.8185 
F-statistic: 70.91 on 2 and 29 DF,  p-value: 6.809e-12
short_model <- lm(mpg ~ wt, data=my_data)
summary(short_model)

Call:
lm(formula = mpg ~ wt, data = my_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,    Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
# Compare the two models side-by-side
library(stargazer)

Please cite as: 
 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer 
stargazer(full_model, short_model,
          type = "text", covariate.labels = c("Weight(1000 lbs)", "Number of Cylinders", "Constant"))

=================================================================
                                 Dependent variable:             
                    ---------------------------------------------
                                         mpg                     
                             (1)                    (2)          
-----------------------------------------------------------------
Weight(1000 lbs)          -3.191***              -5.344***       
                           (0.757)                (0.559)        
                                                                 
Number of Cylinders       -1.508***                              
                           (0.415)                               
                                                                 
Constant                  39.686***              37.285***       
                           (1.715)                (1.878)        
                                                                 
-----------------------------------------------------------------
Observations                  32                     32          
R2                          0.830                  0.753         
Adjusted R2                 0.819                  0.745         
Residual Std. Error    2.568 (df = 29)        3.046 (df = 30)    
F Statistic         70.908*** (df = 2; 29) 91.375*** (df = 1; 30)
=================================================================
Note:                                 *p<0.1; **p<0.05; ***p<0.01

Conclusion

Because the bias is negative, the comparison above shows the wt coefficient in the short model (-5.344) sitting below its value in the full model (-3.191). In other words, the estimated key coefficient is larger in absolute value than its true unknown value: the short model makes the effect of weight look more negative because part of the effect of cyl is loaded onto wt.

The omitted variable bias formula captures this mechanism. When a relevant variable is left out of a model, the part of its effect on y that moves together with the key variable is incorrectly attributed to the key variable, which biases the estimated relationship between the key variable and y.
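The formula can also be checked numerically on this data set. The sketch below refits the two models on the built-in mtcars data and adds an auxiliary regression of cyl on wt; the names `aux_model`, `delta1`, and `implied_bias` are introduced here for illustration. In an OLS sample, the short-model slope equals the full-model slope plus the cyl coefficient times the auxiliary slope.

```r
# Refit the models from the comparison above, plus an auxiliary regression.
full_model  <- lm(mpg ~ wt + cyl, data = mtcars)
short_model <- lm(mpg ~ wt, data = mtcars)
aux_model   <- lm(cyl ~ wt, data = mtcars)  # omitted variable regressed on key x

beta1_full  <- unname(coef(full_model)["wt"])
beta2_full  <- unname(coef(full_model)["cyl"])
beta1_short <- unname(coef(short_model)["wt"])
delta1      <- unname(coef(aux_model)["wt"])

# Bias implied by the formula: (effect of cyl on mpg) x (slope of cyl on wt)
implied_bias <- beta2_full * delta1

c(short = beta1_short,
  full_plus_bias = beta1_full + implied_bias,
  implied_bias = implied_bias)

# Check that the decomposition holds up to floating-point tolerance
all.equal(beta1_short, beta1_full + implied_bias)
```

The implied bias is the gap between the two wt coefficients reported in the comparison table, and its negative sign agrees with the sign reasoning in the previous section.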

Bonus

# Add a variable to the regression that does not impact y (is uncorrelated with y) but 
# is correlated with the key x variable, and show that the point estimate will not change (significantly).
full_model2 <- lm(mpg ~ wt + cyl + gear, data=my_data)
summary(full_model2)

Call:
lm(formula = mpg ~ wt + cyl + gear, data = my_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.8443 -1.5455 -0.3932  1.4220  5.9416 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  42.3864     4.3790   9.679 1.97e-10 ***
wt           -3.3921     0.8208  -4.133 0.000294 ***
cyl          -1.5280     0.4198  -3.640 0.001093 ** 
gear         -0.5229     0.7789  -0.671 0.507524    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.592 on 28 degrees of freedom
Multiple R-squared:  0.8329,    Adjusted R-squared:  0.815 
F-statistic: 46.53 on 3 and 28 DF,  p-value: 5.262e-11

Adding the x-variable gear, which is more strongly correlated with the key x-variable wt (-0.58) than with the y-variable mpg (0.48), moves the point estimate for wt from -3.1910 in the original full model to -3.3921. This change of about 0.20 is well within one standard error of the estimate (0.8208), and the gear coefficient itself is not statistically significant (p = 0.51), so adding gear to the model did not change the coefficient estimate for wt significantly.

# Add a variable to the regression that impacts y (is correlated with y) but 
# is not correlated with the key x variable, and show that the point estimate 
# will not change (significantly).
full_model3 <- lm(mpg ~ wt + cyl + qsec, data=my_data)
summary(full_model3)

Call:
lm(formula = mpg ~ wt + cyl + qsec, data = my_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5937 -1.5621 -0.3595  1.2097  5.5500 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  29.4291     8.1912   3.593 0.001238 ** 
wt           -3.8616     0.9138  -4.226 0.000229 ***
cyl          -0.9277     0.6113  -1.518 0.140280    
qsec          0.4945     0.3863   1.280 0.211061    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.54 on 28 degrees of freedom
Multiple R-squared:  0.8396,    Adjusted R-squared:  0.8224 
F-statistic: 48.86 on 3 and 28 DF,  p-value: 2.979e-11

Adding the x-variable qsec, which is more strongly correlated with the y-variable mpg (0.42) than with the key x-variable wt (-0.17), moves the point estimate for wt from -3.1910 in the original full model to -3.8616. This change of about 0.67 is smaller than one standard error of the estimate (0.9138), so adding qsec to the model did not change the coefficient estimate for wt significantly.
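To make these comparisons easier to see in one place, the sketch below collects the wt estimate and its standard error from all four models. It refits the models on the built-in mtcars data; the list name `models`, its labels, and `wt_summary` are introduced here for illustration.

```r
# Refit the four models discussed above on the built-in mtcars data.
models <- list(
  short     = lm(mpg ~ wt, data = mtcars),
  full      = lm(mpg ~ wt + cyl, data = mtcars),
  plus_gear = lm(mpg ~ wt + cyl + gear, data = mtcars),
  plus_qsec = lm(mpg ~ wt + cyl + qsec, data = mtcars)
)

# For each model, pull the wt point estimate and its standard error.
wt_summary <- t(sapply(models, function(m) {
  coefs <- summary(m)$coefficients
  c(estimate  = coefs["wt", "Estimate"],
    std_error = coefs["wt", "Std. Error"])
}))

round(wt_summary, 4)
```

Reading down the estimate column, the large jump is between the short and full models, where cyl is omitted. The moves from adding gear or qsec are small relative to the standard errors.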