Intuitively, this means that the right-hand side
variables are not only correlated with the dependent variable but also
likely with each other. If they are included in the model, this does not
pose issues since we can accurately calculate the individual variables’
impact on the dependent variable. However, if one of the variables is
omitted from the model, we will face omitted variable bias (OVB). OVB violates
the Zero Conditional Mean condition, according to which \(E[u \mid x] = 0\).
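Since one of the variables
in the model is now correlated with a variable omitted from the model,
and the omitted variable affects our dependent variable, the omitted
variable's influence ends up in the error term \(u\). The error term is
then correlated with an included regressor, so our estimates will be biased.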
Calculating the size of the bias can be
complicated, but we can figure out its direction by simple logic. Since
bias is the difference between the expected value of the estimator and the
true parameter, we can write the estimate as the parameter plus a bias term:
\[
\hat{\beta}_{1} = \beta_{1} + \frac{\rho(x,u)\,\theta_{x}\,\theta_{u}}{\theta_{x}\,\theta_{x}}
\]
where \(\hat{\beta}_{1}\) is the sample estimate,
\(\beta_{1}\) is the parameter slope,
\(\rho(x,u)\) is the correlation between the included variable \(x\) and
the error term \(u\) (which now contains the omitted variable),
\(\theta_{x}\) and \(\theta_{u}\) are their standard deviations,
and \(\frac{\rho(x,u)\,\theta_{x}\,\theta_{u}}{\theta_{x}\,\theta_{x}}\) is the bias
term. A closer inspection of the formula shows that the numerator is the
covariance between \(x\) and \(u\) and the denominator is the variance of
\(x\), so the bias term has the form of a slope estimator. Its sign
depends on the correlation between the omitted variable and the variable
included in the model, and on the effect of the omitted variable on the
dependent variable. This means we can determine the direction of the bias
by asking two questions: “Are the omitted variable and the included
independent variable positively or negatively correlated?” and “What
impact would the omitted variable have on the dependent variable?”
Multiplying these two signs by each other, e.g., positive times positive
or negative times positive, gives the direction of the
omitted variable bias.
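The two-question sign rule can be checked on simulated data. The sketch below is only an illustration: the variable names, coefficients, and sample size are made up and are not part of the analysis in this report. It builds an omitted variable `x2` that is positively correlated with `x1` and has a positive effect on `y`, then compares the predicted direction with the shift actually seen when `x2` is dropped.

```r
# Simulated illustration of the sign rule (all names and numbers are hypothetical).
set.seed(42)
n  <- 5000
x1 <- rnorm(n)                          # included regressor
x2 <- 0.5 * x1 + rnorm(n)               # omitted variable, positively correlated with x1
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)    # x2 has a positive effect on y

long  <- lm(y ~ x1 + x2)                # model that includes x2
short <- lm(y ~ x1)                     # model that omits x2

sign_cor    <- sign(cor(x1, x2))          # question 1: sign of the correlation
sign_effect <- sign(coef(long)[["x2"]])   # question 2: sign of the omitted effect
predicted_direction <- sign_cor * sign_effect

# Shift in the x1 coefficient caused by dropping x2
observed_shift <- coef(short)[["x1"]] - coef(long)[["x1"]]
c(predicted = predicted_direction, observed = sign(observed_shift))
```

Here both signs are positive, so the rule predicts an upward bias, and the coefficient on `x1` in the short model moves up accordingly.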
Increasing the sample size or
including more variables can help address OVB, but neither is guaranteed
to. A larger sample on its own does not remove the correlation between an
included variable and the error term, and if the sample itself is biased
or the variables added to the model are not correlated with our endogenous
variables, then OVB can remain even after taking those steps.
The first dataset I am using is ChickWeight. This is a
pre-installed dataset that tracks fifty chickens over the 21 days
after their birth and provides information on their weight and the
diet they are on. The variables include a chick identifier, a time
variable, a diet variable, and a weight variable. The chickens were on
four possible diets, so I created four binary variables, each
corresponding to one of the diets.
## Grouped Data: weight ~ Time | Chick
##    weight Time Chick Diet
## 1      42    0     1    1
## 2      51    2     1    1
## 3      59    4     1    1
## 4      64    6     1    1
## 5      76    8     1    1
## 6      93   10     1    1
## 7     106   12     1    1
## 8     125   14     1    1
## 9     149   16     1    1
## 10    171   18     1    1

##      weight           Time           Chick     Diet
##  Min.   : 35.0   Min.   : 0.00   13     : 12   1:220
##  1st Qu.: 63.0   1st Qu.: 4.00   9      : 12   2:120
##  Median :103.0   Median :10.00   20     : 12   3:120
##  Mean   :121.8   Mean   :10.72   10     : 12   4:118
##  3rd Qu.:163.8   3rd Qu.:16.00   17     : 12
##  Max.   :373.0   Max.   :21.00   19     : 12
##                                  (Other):506
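The code that creates the four binary diet variables is not shown in this report. One possible construction is sketched below. It assumes the resulting data frame is the `ChickWBinary` object used in the later regressions and that the indicators are named `Diet1` to `Diet4`. The loop itself is my own guess at the approach.

```r
# Assumed construction of the diet indicators; the original code is not shown.
ChickWBinary <- as.data.frame(ChickWeight)   # plain data frame copy of the built-in data

for (d in levels(ChickWBinary$Diet)) {
  # Diet1 ... Diet4: 1 if the observation belongs to diet d, 0 otherwise
  ChickWBinary[[paste0("Diet", d)]] <- as.integer(ChickWBinary$Diet == d)
}

# Observations per diet; should match the Diet counts in the summary above
diet_counts <- colSums(ChickWBinary[, paste0("Diet", 1:4)])
diet_counts
```

Each row has exactly one indicator equal to 1, so only three of the four indicators can enter a regression alongside an intercept.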
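The second dataset I am using is mtcars, another pre-installed dataset. Its variables are described below.

| Variable | Description |
|---|---|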
| mpg | Miles/(US) gallon |
| cyl | Number of cylinders |
| disp | Displacement (cu.in.) |
| hp | Gross horsepower |
| drat | Rear axle ratio |
| wt | Weight (1000 lbs) |
| qsec | 1/4 mile time |
| vs | Engine (0 = V-shaped, 1 = straight) |
| am | Transmission (0 = automatic, 1 = manual) |
| gear | Number of forward gears |
| carb | Number of carburetors |
##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

##       mpg             cyl             disp             hp
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0
##       drat             wt             qsec             vs
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000
##        am              gear            carb
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000
##  Median :0.0000   Median :4.000   Median :2.000
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
Dependent variable: weight

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 41.16 | 33.14 – 49.18 | <0.001 |
| Time | 8.75 | 8.31 – 9.19 | <0.001 |
| Diet1 | -30.23 | -38.30 – -22.17 | <0.001 |
| Diet2 | -14.07 | -23.23 – -4.90 | 0.003 |
| Diet3 | 6.27 | -2.90 – 15.43 | 0.180 |
| Observations | 578 | | |
| R² / adjusted R² | 0.745 / 0.744 | | |

Dependent variable: mpg

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 39.05 | 35.22 – 42.88 | <0.001 |
| wt | -3.16 | -4.69 – -1.62 | <0.001 |
| cyl | -1.02 | -2.20 – 0.15 | 0.085 |
| hp | -0.01 | -0.04 – 0.02 | 0.481 |
| carb | -0.28 | -1.18 – 0.63 | 0.537 |
| Observations | 32 | | |
| R² / adjusted R² | 0.845 / 0.822 | | |
As we can see, the only statistically significant variable
in the mpg model is wt, the car’s weight in thousands of pounds.
The interpretation is that each additional thousand pounds of
car weight is associated with a decrease of 3.16 mpg on average,
holding all else constant.
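The code that fits the two baseline models is not shown. A plausible version, inferred only from the predictors listed in the two tables above, is sketched here. The object names `Chicks` and `Cars` are taken from the later `tab_model()` calls. The data preparation step is an assumption.

```r
# Assumed specification, inferred from the predictors listed in the tables.
ChickWBinary <- as.data.frame(ChickWeight)
for (d in levels(ChickWBinary$Diet)) {
  ChickWBinary[[paste0("Diet", d)]] <- as.integer(ChickWBinary$Diet == d)
}

# Diet4 is left out as the reference category to avoid perfect collinearity.
Chicks <- lm(weight ~ Time + Diet1 + Diet2 + Diet3, data = ChickWBinary)
Cars   <- lm(mpg ~ wt + cyl + hp + carb, data = mtcars)

round(coef(Chicks), 2)   # compare with the weight table
round(coef(Cars), 2)     # compare with the mpg table
```

With Diet4 as the reference category, each diet coefficient is read as the average weight difference relative to chicks on diet 4 at the same age.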
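# Re-estimate the chick model without Time (the omitted variable)
ChicksOVB <- lm(weight ~ Diet1 + Diet2 + Diet3, data = ChickWBinary)
# Show both models side by side (tab_model() is from the sjPlot package)
tab_model(Chicks, ChicksOVB)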

Dependent variable: weight. The first three result columns are the Chicks model, the last three the ChicksOVB model.

| Predictors | Chicks Estimates | Chicks CI | Chicks p | ChicksOVB Estimates | ChicksOVB CI | ChicksOVB p |
|---|---|---|---|---|---|---|
| (Intercept) | 41.16 | 33.14 – 49.18 | <0.001 | 135.26 | 122.73 – 147.80 | <0.001 |
| Time | 8.75 | 8.31 – 9.19 | <0.001 | | | |
| Diet1 | -30.23 | -38.30 – -22.17 | <0.001 | -32.62 | -48.15 – -17.08 | <0.001 |
| Diet2 | -14.07 | -23.23 – -4.90 | 0.003 | -12.65 | -30.30 – 5.01 | 0.160 |
| Diet3 | 6.27 | -2.90 – 15.43 | 0.180 | 7.69 | -9.97 – 25.34 | 0.393 |
| Observations | 578 | | | 578 | | |
| R² / adjusted R² | 0.745 / 0.744 | | | 0.053 / 0.049 | | |

In my first regression suffering from OVB, I omitted the
time variable from the regression model. The omission of time has caused
the Diet1 coefficient to become more negative (-30.23 to -32.62), the
Diet2 coefficient to become less negative (-14.07 to -12.65), and the
Diet3 coefficient to become more positive (6.27 to 7.69). Additionally,
Diet2 is no longer statistically significant. The loss of significance
can be explained by looking at the change in R squared. The model with OVB
only explains about 5 percent of the variation in the chickens’ weight,
while the original one explains nearly 75 percent. Because the fit is so
much worse, the sum of squared residuals is much higher. The standard
error of a slope coefficient in a regression is calculated by the
formula:
\[
se(\hat{\beta}_{j})=\frac{\hat{\theta}}{\sqrt{SST_{j}\,(1-R_{j}^{2})}}
\]
where \(SST_{j}\) is the total sample variation in \(x_{j}\), \(R_{j}^{2}\)
is the R squared from regressing \(x_{j}\) on the other regressors, and
\[
\hat{\theta}^{2}=\frac{\sum\hat{u}_{i}^{2}}{n-k-1}
\]
is the estimated error variance for a model with \(k\) regressors.
An increase in the residuals will therefore lead to an
increase in the standard errors, which will, in turn, decrease the
statistical significance of the estimate.
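The link between residual size and standard errors can be checked numerically. The sketch below uses the textbook multiple-regression expression for a slope's standard error: the residual standard error divided by the square root of the regressor's variation that the other regressors do not explain. The notation may differ from the formula above. The data preparation and model specifications are assumptions inferred from the tables.

```r
# Assumed data preparation and models (inferred from the tables in this report).
ChickWBinary <- as.data.frame(ChickWeight)
for (d in levels(ChickWBinary$Diet)) {
  ChickWBinary[[paste0("Diet", d)]] <- as.integer(ChickWBinary$Diet == d)
}
Chicks    <- lm(weight ~ Time + Diet1 + Diet2 + Diet3, data = ChickWBinary)
ChicksOVB <- lm(weight ~ Diet1 + Diet2 + Diet3, data = ChickWBinary)

# Residual standard error: sqrt(sum of squared residuals / residual df)
c(full = sigma(Chicks), omitted = sigma(ChicksOVB))

# Hand-computed standard error of the Diet2 coefficient in the short model
aux   <- lm(Diet2 ~ Diet1 + Diet3, data = ChickWBinary)  # Diet2 on the other regressors
sst_2 <- sum((ChickWBinary$Diet2 - mean(ChickWBinary$Diet2))^2)
r2_2  <- summary(aux)$r.squared
se_manual <- sigma(ChicksOVB) / sqrt(sst_2 * (1 - r2_2))

# Standard error reported by summary() for comparison
se_reported <- summary(ChicksOVB)$coefficients["Diet2", "Std. Error"]
c(manual = se_manual, reported = se_reported)
```

The denominator for Diet2 depends only on the diet indicators, so it is nearly unchanged between the two models. The larger standard error in the short model comes almost entirely from the larger residual standard error.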
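The direction of the bias can be
explained by asking the two questions I described in the first
section. Time has a positive effect on weight, so the remaining question
is the sign of the relationship between Time and each diet indicator,
which the covariances below provide. Time and Diet1 covary negatively,
which points to a downward bias on Diet1, while Time and Diet2 covary
positively, which points to an upward bias on Diet2. Both match the
changes in the table.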
# Covariance between the omitted variable (Time) and each diet indicator
TD1 <- cov(ChickWBinary$Time, ChickWBinary$Diet1)
TD1
## [1] -0.09004935
TD2 <- cov(ChickWBinary$Time, ChickWBinary$Diet2)
TD2
## [1] 0.0413186
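The size of each shift can also be recovered exactly, not just its sign. For OLS, the change in each included coefficient when one variable is dropped equals the omitted variable's coefficient in the full model times that regressor's coefficient in an auxiliary regression of the omitted variable on the included ones. The sketch below applies this identity to the chick models. The data preparation and model specifications are assumptions inferred from the tables, and `aux` is a helper name introduced here.

```r
# Assumed data preparation and models (inferred from the tables in this report).
ChickWBinary <- as.data.frame(ChickWeight)
for (d in levels(ChickWBinary$Diet)) {
  ChickWBinary[[paste0("Diet", d)]] <- as.integer(ChickWBinary$Diet == d)
}
Chicks    <- lm(weight ~ Time + Diet1 + Diet2 + Diet3, data = ChickWBinary)
ChicksOVB <- lm(weight ~ Diet1 + Diet2 + Diet3, data = ChickWBinary)

# Auxiliary regression: the omitted variable on the included regressors
aux <- lm(Time ~ Diet1 + Diet2 + Diet3, data = ChickWBinary)

diets <- c("Diet1", "Diet2", "Diet3")
# Shift implied by the identity: (effect of Time) x (auxiliary slope)
implied_shift  <- coef(Chicks)[["Time"]] * coef(aux)[diets]
# Shift actually observed between the two fitted models
observed_shift <- coef(ChicksOVB)[diets] - coef(Chicks)[diets]

round(rbind(implied_shift, observed_shift), 3)
```

The two rows agree, and the signs line up with the table: Diet1 moves down while Diet2 and Diet3 move up.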
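# Re-estimate the car model with wt only, omitting cyl, hp, and carb
CarsOVB <- lm(mpg ~ wt, data = mtcars)
# Show both models side by side (tab_model() is from the sjPlot package)
tab_model(Cars, CarsOVB)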

Dependent variable: mpg. The first three result columns are the Cars model, the last three the CarsOVB model.

| Predictors | Cars Estimates | Cars CI | Cars p | CarsOVB Estimates | CarsOVB CI | CarsOVB p |
|---|---|---|---|---|---|---|
| (Intercept) | 39.05 | 35.22 – 42.88 | <0.001 | 37.29 | 33.45 – 41.12 | <0.001 |
| wt | -3.16 | -4.69 – -1.62 | <0.001 | -5.34 | -6.49 – -4.20 | <0.001 |
| cyl | -1.02 | -2.20 – 0.15 | 0.085 | | | |
| hp | -0.01 | -0.04 – 0.02 | 0.481 | | | |
| carb | -0.28 | -1.18 – 0.63 | 0.537 | | | |
| Observations | 32 | | | 32 | | |
| R² / adjusted R² | 0.845 / 0.822 | | | 0.753 / 0.745 | | |

# Covariance between wt (the included variable) and each omitted variable
Car1 <- cov(mtcars$wt, mtcars$cyl)
Car1
## [1] 1.367371
Car2 <- cov(mtcars$wt, mtcars$hp)
Car2
## [1] 44.19266
Car3 <- cov(mtcars$wt, mtcars$carb)
Car3
## [1] 0.6757903
In the second regression suffering from OVB, I decided to
remove all of the control variables (cyl, hp, and carb). I could do this
because all of the omitted variables covary positively with my only
remaining variable, weight. Since all of the omitted
variables are positively correlated with weight and they all have a
negative estimated effect on miles per gallon, I conclude that the
coefficient on wt will be biased downwards. This is consistent with the
regression tables, where the wt coefficient falls from -3.16 to -5.34.
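The same reasoning can be turned into an arithmetic check. When `wt` is the only regressor kept, each omitted variable contributes its full-model coefficient times its slope on `wt`, which is cov(wt, z) / var(wt), and the contributions sum to the total shift in the `wt` coefficient. The sketch below assumes the two model specifications shown in the tables. The helper names `omitted`, `delta`, and `contribution` are introduced here for illustration.

```r
# Assumed specifications, matching the predictors in the mpg tables.
Cars    <- lm(mpg ~ wt + cyl + hp + carb, data = mtcars)
CarsOVB <- lm(mpg ~ wt, data = mtcars)

omitted <- c("cyl", "hp", "carb")

# Slope from regressing each omitted variable on wt: cov(wt, z) / var(wt)
delta <- sapply(omitted, function(z) cov(mtcars$wt, mtcars[[z]]) / var(mtcars$wt))

# Each omitted variable contributes (its coefficient) x (its slope on wt)
contribution <- coef(Cars)[omitted] * delta
contribution

implied_shift  <- sum(contribution)                          # total implied shift
observed_shift <- coef(CarsOVB)[["wt"]] - coef(Cars)[["wt"]] # shift between the models
c(implied = implied_shift, observed = observed_shift)
```

Because every slope in `delta` is positive and every omitted coefficient in the table is negative, each contribution pushes the `wt` coefficient further below zero.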