1 Bias of an Estimator

      To establish what the bias of an estimator is, we must first recall what an estimator is. In a simple linear regression, the slope estimator is calculated as:
\[ \hat{\beta}_{1}=\frac{Cov(x, y)}{Var(x)} \]      where \(Cov(x,y)\) determines the sign of the slope coefficient and \(Var(x)\) standardizes the estimate. A single beta coefficient is sufficient if the dependent variable is affected by only this one independent variable, as in a randomized controlled trial; in that case, however, a regression is hardly necessary in the first place. In the more likely case where a set of independent variables determines the dependent variable's variation, a multiple linear regression (MLR) must be used, which takes the form:

\[ y = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \dots + \beta_{k}x_{k} + \epsilon \]



      Intuitively, this means that the right-hand side variables are not only correlated with the dependent variable but likely also with each other. As long as they are all included in the model, this poses no problem, since we can still estimate each variable's individual impact on the dependent variable. However, if one of these variables is omitted from the model, we face omitted variable bias (OVB). OVB violates the zero conditional mean assumption, according to which \(E[u \mid x] = 0\).
      Since one of the variables in the model is now correlated with a variable omitted from the model, and the omitted variable affects our dependent variable, our estimates will now be biased.
      Calculating the size of the bias can be complicated, but we can figure out its direction by simple logic. Since the bias is the difference between the sample estimate and the true parameter, we can write:
\[ \hat{\beta}_{1}=\beta_{1} + \frac{\rho(x, u)\,\sigma_{x}\,\sigma_{u}}{\sigma_{x}\,\sigma_{x}} \]
      Where \(\hat{\beta}_{1}\) is the sample estimate, \(\beta_{1}\) is the true slope parameter, and \(\frac{\rho(x, u)\,\sigma_{x}\,\sigma_{u}}{\sigma_{x}\,\sigma_{x}}\) is the bias term. A closer inspection of the formula shows that the bias term contains the correlation between the omitted variable and the variable included in the model, and that it simplifies to \(\frac{Cov(x,u)}{Var(x)}\), i.e., something resembling a slope estimator. This means we can determine the direction of the bias by asking two questions: "Are the omitted variable and the included independent variable positively or negatively correlated?" and "What impact does the omitted variable have on the dependent variable?" Multiplying these two signs by each other, i.e., positive times positive or negative times positive, gives the direction of the omitted variable bias.
      Increasing the sample size or including more variables may help address OVB, but neither is guaranteed to: if the sample itself is biased, or if the added variables are not the ones correlated with both the included regressors and the dependent variable, the bias can remain even after taking those steps.
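      To make the sign logic concrete, here is a minimal simulation sketch, added purely for illustration; the variable names and coefficients are made up. The omitted variable u is positively correlated with x and has a positive effect on y, so the short regression should be upward biased:

set.seed(1)
n <- 1000
x <- rnorm(n)
u <- 0.5 * x + rnorm(n)                # omitted variable, positively correlated with x
y <- 2 + 1 * x + 3 * u + rnorm(n)      # u also has a positive effect on y
coef(lm(y ~ x + u))["x"]               # close to the true slope of 1
coef(lm(y ~ x))["x"]                   # upward biased: positive correlation times positive effect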

2 Regression Models

2.1 Datasets Description & Modifications


      The first dataset I am using is ChickWeight. This is a pre-installed R dataset which tracks fifty chicks over the first 21 days after birth and records their weight and the diet they are on. The chicks were on one of four possible diets, so I created four binary variables, each corresponding to one of the diets. The original variables are a chick identifier, a time variable, a diet variable, and a weight variable.
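      A sketch of how the four diet dummies in the ChickWBinary object (used in the regressions below) can be created; my exact preprocessing code may differ slightly. The printed output that follows shows the original ChickWeight data:

ChickWBinary <- as.data.frame(ChickWeight)
ChickWBinary$Diet1 <- as.numeric(ChickWBinary$Diet == 1)   # 1 if the chick is on diet 1
ChickWBinary$Diet2 <- as.numeric(ChickWBinary$Diet == 2)
ChickWBinary$Diet3 <- as.numeric(ChickWBinary$Diet == 3)
ChickWBinary$Diet4 <- as.numeric(ChickWBinary$Diet == 4)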

## Grouped Data: weight ~ Time | Chick
##    weight Time Chick Diet
## 1      42    0     1    1
## 2      51    2     1    1
## 3      59    4     1    1
## 4      64    6     1    1
## 5      76    8     1    1
## 6      93   10     1    1
## 7     106   12     1    1
## 8     125   14     1    1
## 9     149   16     1    1
## 10    171   18     1    1
##      weight           Time           Chick     Diet   
##  Min.   : 35.0   Min.   : 0.00   13     : 12   1:220  
##  1st Qu.: 63.0   1st Qu.: 4.00   9      : 12   2:120  
##  Median :103.0   Median :10.00   20     : 12   3:120  
##  Mean   :121.8   Mean   :10.72   10     : 12   4:118  
##  3rd Qu.:163.8   3rd Qu.:16.00   17     : 12          
##  Max.   :373.0   Max.   :21.00   19     : 12          
##                                  (Other):506

      My second dataset, mtcars, is also pre-installed in R. It is a cross-sectional dataset containing information on 32 different cars taken from the 1974 Motor Trend US magazine.
Variable   Description
mpg        Miles/(US) gallon
cyl        Number of cylinders
disp       Displacement (cu.in.)
hp         Gross horsepower
drat       Rear axle ratio
wt         Weight (1000 lbs)
qsec       1/4 mile time
vs         Engine (0 = V-shaped, 1 = straight)
am         Transmission (0 = automatic, 1 = manual)
gear       Number of forward gears
carb       Number of carburetors
##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000

2.2 Correct Regressions

Regression I.

      I have decided to investigate the impact that each diet has on chickens’ weight.
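      The model regresses weight on Time and three of the diet dummies, leaving Diet4 as the baseline category. A sketch of the call producing the table below (the object name Chicks reappears in section 2.3; the exact options may differ slightly):

library(sjPlot)   # provides tab_model()
Chicks <- lm(weight ~ Time + Diet1 + Diet2 + Diet3, data = ChickWBinary)
tab_model(Chicks)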
                    weight
Predictors          Estimates   CI                 p
(Intercept)         41.16       33.14 – 49.18      <0.001
Time                8.75        8.31 – 9.19        <0.001
Diet1               -30.23      -38.30 – -22.17    <0.001
Diet2               -14.07      -23.23 – -4.90     0.003
Diet3               6.27        -2.90 – 15.43      0.180
Observations        578
R2 / R2 adjusted    0.745 / 0.744

      As we can see, Time, Diet1, and Diet2 produce statistically significant results. On average, for each additional day of a chick's life, it gains 8.75 grams, ceteris paribus. We also see that chicks on diet one weigh, on average, 30.23 grams less than chicks on diet four, ceteris paribus, and chicks on diet two weigh, on average, 14.07 grams less than chicks on diet four, ceteris paribus. The coefficient on diet three is not statistically significant.
Regression II.


      In my second regression, I investigate the impact that some performance-relevant variables have on a car's fuel consumption. The dependent variable is miles per gallon, and the regressors are gross horsepower, weight, number of cylinders, and number of carburetors.
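      A sketch of the call producing the table below (the object name Cars reappears in section 2.3; the exact options may differ slightly):

Cars <- lm(mpg ~ wt + cyl + hp + carb, data = mtcars)
tab_model(Cars)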

                    mpg
Predictors          Estimates   CI                p
(Intercept)         39.05       35.22 – 42.88     <0.001
wt                  -3.16       -4.69 – -1.62     <0.001
cyl                 -1.02       -2.20 – 0.15      0.085
hp                  -0.01       -0.04 – 0.02      0.481
carb                -0.28       -1.18 – 0.63      0.537
Observations        32
R2 / R2 adjusted    0.845 / 0.822


      As we can see, the only statistically significant variable is the car's weight, measured in thousands of pounds. The interpretation is that for each additional thousand pounds of car weight, the car's fuel economy decreases, on average, by 3.16 miles per gallon, holding all else constant.

2.3 Regressions with OVB


      Now, I will run these regressions without some key control variables to demonstrate the issues posed by omitted variable bias.
Regression I.


ChicksOVB <- lm(weight ~ Diet1 + Diet2 + Diet3, data = ChickWBinary)
tab_model(Chicks, ChicksOVB)
                    weight (full model)                      weight (OVB model)
Predictors          Estimates   CI                p          Estimates   CI                 p
(Intercept)         41.16       33.14 – 49.18     <0.001     135.26      122.73 – 147.80    <0.001
Time                8.75        8.31 – 9.19       <0.001
Diet1               -30.23      -38.30 – -22.17   <0.001     -32.62      -48.15 – -17.08    <0.001
Diet2               -14.07      -23.23 – -4.90    0.003      -12.65      -30.30 – 5.01      0.160
Diet3               6.27        -2.90 – 15.43     0.180      7.69        -9.97 – 25.34      0.393
Observations        578                                      578
R2 / R2 adjusted    0.745 / 0.744                            0.053 / 0.049


      In my first regression suffering from OVB, I omitted the time variable from the model. The omission of Time has caused the Diet1 coefficient to become more negative, the Diet2 coefficient to become less negative, and the Diet3 coefficient to become more positive. Additionally, Diet2 is no longer statistically significant. The loss of significance can be explained by the change in R squared: the model with OVB explains only about 5 percent of the variation in the chicks' weight, while the original model explains nearly 75 percent. Such a drop in fit means the sum of squared residuals is much larger. Since, in the simple regression case, the standard error of the slope coefficient is calculated as \[ se(\hat{\beta}_{1})=\frac{\hat{\sigma}}{\sqrt{\sum_{i}(x_{i}-\bar{x})^{2}}} \qquad \text{where} \qquad \hat{\sigma}^{2}=\frac{\sum_{i}\hat{u}_{i}^{2}}{n-2}, \]
      an increase in the residuals leads to an increase in the standard errors, which will, in turn, decrease the statistical significance of the estimates.
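      For instance, the increase in the residual standard error can be read directly off the two model objects (an illustrative check, using the Chicks and ChicksOVB objects defined in this section):

summary(Chicks)$sigma      # residual standard error of the full model
summary(ChicksOVB)$sigma   # noticeably larger once Time is omitted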
      The bias can be explained by asking the two questions I described in the first section.

TD1 <- cov(ChickWBinary$Time,ChickWBinary$Diet1)
TD1
## [1] -0.09004935
TD2 <- cov(ChickWBinary$Time,ChickWBinary$Diet2)
TD2
## [1] 0.0413186

      The output above shows that the covariance, and therefore the correlation, between Time and Diet1 is negative. We also know that Time has a positive impact on a chick's weight. From this we can conclude that the Diet1 estimate is downward biased once Time is removed. On the other hand, the covariance of Time and Diet2 is positive, so the Diet2 estimate will be upward biased. Looking at the regression results, we can confirm both predictions.
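      Since only the sign matters, the same check can be run on a standardized scale with cor(); this is an extra verification sketched here, not output from the original analysis:

cor(ChickWBinary$Time, ChickWBinary$Diet1)   # negative, matching the covariance above
cor(ChickWBinary$Time, ChickWBinary$Diet2)   # positive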
Regression II.


CarsOVB <- lm(mpg ~ wt, data = mtcars)

tab_model(Cars, CarsOVB)
                    mpg (full model)                         mpg (OVB model)
Predictors          Estimates   CI                p          Estimates   CI                p
(Intercept)         39.05       35.22 – 42.88     <0.001     37.29       33.45 – 41.12     <0.001
wt                  -3.16       -4.69 – -1.62     <0.001     -5.34       -6.49 – -4.20     <0.001
cyl                 -1.02       -2.20 – 0.15      0.085
hp                  -0.01       -0.04 – 0.02      0.481
carb                -0.28       -1.18 – 0.63      0.537
Observations        32                                       32
R2 / R2 adjusted    0.845 / 0.822                            0.753 / 0.745

Car1 <- cov(mtcars$wt,mtcars$cyl)
Car1
## [1] 1.367371
Car2 <- cov(mtcars$wt,mtcars$hp)
Car2
## [1] 44.19266
Car3 <- cov(mtcars$wt,mtcars$carb)
Car3
## [1] 0.6757903


      In the second regression suffering from OVB, I decided to remove all of the control variables at once. I can still predict the direction of the resulting bias because all of the omitted variables are positively correlated with my only remaining regressor, weight, and all of them have a negative effect on miles per gallon. I therefore conclude that the coefficient on weight will be downward biased, which is confirmed by its decrease from -3.16 to -5.34 in the regression tables above.
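      As an added check on this reasoning (a sketch included here for illustration, not part of the original output), the short-regression slope on weight can be reconstructed from the full model: it equals the full-model coefficient on wt plus each omitted coefficient multiplied by the slope from regressing that omitted variable on wt, the multivariate analogue of the bias formula from section 1.

b_full <- coef(lm(mpg ~ wt + cyl + hp + carb, data = mtcars))
d_cyl  <- coef(lm(cyl  ~ wt, data = mtcars))["wt"]   # each omitted variable regressed on wt
d_hp   <- coef(lm(hp   ~ wt, data = mtcars))["wt"]
d_carb <- coef(lm(carb ~ wt, data = mtcars))["wt"]
b_full["wt"] + b_full["cyl"] * d_cyl + b_full["hp"] * d_hp + b_full["carb"] * d_carb
# reproduces the short-regression slope on wt of about -5.34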