1. What is the bias of an estimator?

The bias of an estimator is the expected difference between the estimator and the true value of the parameter being estimated. Mathematically, for an estimator $\hat{\theta}$ of a parameter $\theta$, the bias $B(\hat{\theta})$ is defined as: $B(\hat{\theta}) = E[\hat{\theta}] - \theta$

where $E[\hat{\theta}]$ denotes the expected value of the estimator $\hat{\theta}$, and $\theta$ is the true value of the parameter being estimated.

Bias Equal to Zero: An estimator is said to be unbiased if $B(\hat{\theta}) = 0$, meaning its expected value equals the true parameter value, so on average it neither systematically overestimates nor underestimates the parameter. For example, the sample mean $\bar{X}$ is an unbiased estimator of the population mean $\mu$, since $E[\bar{X}] = \mu$.

Positive Bias: If $B(\hat{\theta}) > 0$, the estimator tends to overestimate the true parameter value.

Negative Bias: If $B(\hat{\theta}) < 0$, the estimator tends to underestimate the true parameter value.

Bias matters because it affects the accuracy of estimates: a biased estimator systematically produces results that are off from the true value, which can lead to incorrect conclusions in statistical tests and parameter estimation.
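As a quick illustration (a minimal simulation sketch, not part of the assignment; it assumes an arbitrary normal population with variance 4), the classic example is the variance estimator that divides by $n$, which is biased downward, while dividing by $n - 1$ (what R's var() does) is unbiased:

# Minimal simulation: the "divide by n" variance estimator is biased,
# while the "divide by n-1" estimator (R's var()) is not.
set.seed(42)
true_var <- 4    # population variance (sd = 2)
n        <- 10
reps     <- 10000

var_n  <- replicate(reps, { x <- rnorm(n, 0, 2); mean((x - mean(x))^2) })
var_n1 <- replicate(reps, var(rnorm(n, 0, 2)))

mean(var_n)  - true_var   # ~ -0.4 = -true_var / n  (negative bias)
mean(var_n1) - true_var   # ~ 0                     (unbiased)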

2. In terms of omitted variable bias, will the bias go away if we increase the sample size or add more variables?

Omitted variable bias occurs when a relevant variable that should be included in a regression model is left out. This omission can lead to biased and inconsistent estimates of the coefficients of other variables in the model.

Impact of Sample Size:

Increasing the sample size generally does not eliminate omitted variable bias, because the bias arises from the systematic exclusion of a relevant variable, not from the size of the sample itself. A larger sample does improve precision, meaning the standard errors of the coefficient estimates decrease. But that does not address the bias; it only gives more precise estimates of the still-biased coefficients.

Impact of Adding More Variables:

Adding more variables can mitigate omitted variable bias if the new variables are correlated with the omitted variable and are correctly specified: they can absorb some of the variation that was previously loading onto the other coefficients, reducing the bias in those estimates. However, adding variables does not guarantee that the bias is completely eliminated, especially if the omitted variable remains influential and is correlated with the included variables in a complex way.

In short: increasing the sample size improves precision but does not eliminate bias, while adding relevant variables can reduce bias by capturing some of the omitted variable's effect, with effectiveness depending on the correlation structure and correct specification, as the simulation below illustrates.
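This is a small simulation sketch with made-up coefficients, not tied to any real dataset: the true model is y = 1·x + 1·z + e with x correlated with z. Omitting z biases the slope on x no matter how large n is, while including z removes the bias:

# Simulated OVB: bias persists as n grows, but vanishes once z is included
set.seed(1)
ovb_sim <- function(n) {
  z <- rnorm(n)
  x <- 0.7 * z + rnorm(n)          # included x is correlated with omitted z
  y <- 1 * x + 1 * z + rnorm(n)    # true coefficient on x is 1
  c(short = unname(coef(lm(y ~ x))["x"]),      # omits z -> biased
    full  = unname(coef(lm(y ~ x + z))["x"]))  # includes z -> unbiased
}
sapply(c(100, 10000, 1000000), ovb_sim)
# short stays near 1 + 0.7/1.49 ~ 1.47 regardless of n;
# full stays near the true value of 1.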

3. Give me 1 distinct example of OVB (on a different dataset). You will choose a dataset, describe the variables in it, and give us the full / correct model (be sure to write out the estimating equation in R markdown, and pay attention to the subscripts as well). Tell us what is your key independent variable that you are interested in studying.

# necessary packages
library(tidyverse)
# Load the mtcars dataset
data(mtcars)

glimpse(mtcars)
## Rows: 32
## Columns: 11
## $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
## $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
## $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
## $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
## $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
## $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…

Description of the variables in the mtcars dataset:

mpg: Miles per gallon (fuel efficiency, dependent variable).

cyl: Number of cylinders.

disp: Displacement (engine displacement, in cubic inches).

hp: Gross horsepower.

drat: Rear axle ratio.

wt: Weight (in 1000 lbs).

qsec: Quarter mile time (in seconds).

vs: Engine type (0 = V-shaped, 1 = straight).

am: Transmission type (0 = automatic, 1 = manual).

gear: Number of forward gears.

carb: Number of carburetors.

Of the 11 variables listed above, mpg is the dependent variable and the remaining ten are candidate independent variables. Two of them (vs and am) are categorical and are already encoded as binary (0/1) dummy variables. The key independent variable I am interested in studying is horsepower (hp) and its effect on fuel efficiency (mpg).

# necessary packages

library(stargazer)
# Load the mtcars dataset 
data(mtcars)

# Use stargazer to display regression results
stargazer(mtcars, 
          type="text", 
          title = "Summary Statistics", 
          covariate.labels = c("MPG", "Cylinders", "Displacement","Horsepower", "Rear-axle ratio", "Weight", "Quarter-mile time", "Engine", "Transmission", "Gears", "Carburetors")
          )
## 
## Summary Statistics
## ====================================================
## Statistic         N   Mean   St. Dev.  Min     Max  
## ----------------------------------------------------
## MPG               32 20.091   6.027   10.400 33.900 
## Cylinders         32  6.188   1.786     4       8   
## Displacement      32 230.722 123.939  71.100 472.000
## Horsepower        32 146.688  68.563    52     335  
## Rear-axle ratio   32  3.597   0.535   2.760   4.930 
## Weight            32  3.217   0.978   1.513   5.424 
## Quarter-mile time 32 17.849   1.787   14.500 22.900 
## Engine            32  0.438   0.504     0       1   
## Transmission      32  0.406   0.499     0       1   
## Gears             32  3.688   0.738     3       5   
## Carburetors       32  2.812   1.615     1       8   
## ----------------------------------------------------
# Create a histogram of mpg
hist(mtcars$mpg,
     xlab = "Miles per Gallon (mpg)",
     main = "Histogram")

We use the log of mpg because the mpg variable is right-skewed.

# Create a histogram of log(mpg)
hist(log(mtcars$mpg),
     xlab = "Log Miles per Gallon",
     main = "Histogram")
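As a quick numeric check of the skewness claim (a hand-rolled moment formula, assuming we want to avoid extra packages):

# Moment-based skewness, no additional packages needed
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
skewness(mtcars$mpg)       # positive -> right-skewed
skewness(log(mtcars$mpg))  # closer to zero after the log transform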

  2. Now, suppose you were running the short / incorrect model where you omitted a variable “by mistake”. Write the estimating equation out as well.

Full Regression

$$\log(\text{MPG}_i) = \beta_0 + \beta_1 \text{Horsepower}_i + \beta_2 \text{Weight}_i + \epsilon_i$$

Full_regression <- lm(data    = mtcars, 
                      formula = log(mpg) ~ hp + wt
                      )

summary(Full_regression)
## 
## Call:
## lm(formula = log(mpg) ~ hp + wt, data = mtcars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.18744 -0.07540 -0.02440  0.06244  0.28562 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.8291030  0.0686807  55.752  < 2e-16 ***
## hp          -0.0015435  0.0003879  -3.979 0.000423 ***
## wt          -0.2005368  0.0271810  -7.378 3.96e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1114 on 29 degrees of freedom
## Multiple R-squared:  0.8691, Adjusted R-squared:   0.86 
## F-statistic: 96.23 on 2 and 29 DF,  p-value: 1.577e-13

For Short Regression:

We omit the weight (wt) variable and rerun the same regression:

$$\log(\text{MPG}_i) = \beta_0 + \beta_1 \text{Horsepower}_i + \epsilon_i$$

Short_regression <- lm(data    = mtcars, 
                      formula = log(mpg) ~ hp
                      )

summary(Short_regression)
## 
## Call:
## lm(formula = log(mpg) ~ hp, data = mtcars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.41577 -0.06583 -0.01737  0.09827  0.39621 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.4604669  0.0785838  44.035  < 2e-16 ***
## hp          -0.0034287  0.0004867  -7.045 7.85e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1858 on 30 degrees of freedom
## Multiple R-squared:  0.6233, Adjusted R-squared:  0.6107 
## F-statistic: 49.63 on 1 and 30 DF,  p-value: 7.853e-08

3. From the OVB formula, tell us whether the omitted variable will cause bias or not, i.e., are the two conditions for OVB met or not?
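For reference, the textbook OVB decomposition for this setup (a standard OLS result, stated here since the text invokes "the OVB formula" without writing it out): if the auxiliary regression of the omitted variable on the included one is $\text{Weight}_i = \delta_0 + \delta_1 \text{Horsepower}_i + u_i$, then the short-regression slope satisfies

$$\tilde{\beta}_1 = \hat{\beta}_1 + \hat{\beta}_2 \, \hat{\delta}_1$$

so the bias $\hat{\beta}_2 \, \hat{\delta}_1$ is nonzero exactly when $\hat{\delta}_1 \neq 0$ (the omitted variable is correlated with the included regressor) and $\hat{\beta}_2 \neq 0$ (the omitted variable is a determinant of the outcome).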

For OVB to occur in the mtcars example, the following two conditions must both hold:

Condition 1: X is correlated with the omitted variable.

mtcars_cor <- round(cor(mtcars[, c("mpg", "hp", "wt")]), 3)
mtcars_cor
##        mpg     hp     wt
## mpg  1.000 -0.776 -0.868
## hp  -0.776  1.000  0.659
## wt  -0.868  0.659  1.000
cor_independent <- cor.test(mtcars$hp, mtcars$wt)
cor_independent
## 
##  Pearson's product-moment correlation
## 
## data:  mtcars$hp and mtcars$wt
## t = 4.7957, df = 30, p-value = 4.146e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4025113 0.8192573
## sample estimates:
##       cor 
## 0.6587479

There is a positive correlation (0.659) between the omitted variable “weight” and the independent variable “horsepower”, and the correlation is statistically significant (p-value = 4.146e-05).

mtcars$lnmpg <- log(mtcars$mpg) 

corr_mat <- round(cor(mtcars[, c("lnmpg", "hp", "wt")]), 3)
corr_mat
##        lnmpg     hp     wt
## lnmpg  1.000 -0.789 -0.893
## hp    -0.789  1.000  0.659
## wt    -0.893  0.659  1.000

Transforming the dependent variable with the log makes little difference to the correlations; the relationships are essentially unchanged.

Condition 2: The omitted variable is a determinant of the dependent variable Y.

weight_coeff_Full_regression <- Full_regression$coefficients[3]
weight_coeff_Full_regression
##         wt 
## -0.2005368
cor_condition_2 <- cor.test(mtcars$lnmpg, mtcars$wt)
cor_condition_2
## 
##  Pearson's product-moment correlation
## 
## data:  mtcars$lnmpg and mtcars$wt
## t = -10.872, df = 30, p-value = 6.31e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9468891 -0.7905477
## sample estimates:
##        cor 
## -0.8930611

The effect of “weight” on log(mpg) in the full regression is negative (-0.2), and weight is strongly correlated with log(mpg) (-0.893), so condition 2 holds as well. Both OVB conditions are met, so the omitted variable will cause bias.

|                              | A and B are positively correlated | A and B are negatively correlated |
|------------------------------|-----------------------------------|-----------------------------------|
| B has a positive effect on Y | Positive bias                     | Negative bias                     |
| B has a negative effect on Y | **Negative bias**                 | Positive bias                     |

According to the table above, because hp and wt are positively correlated and wt has a negative effect on mpg, there is negative bias (the bolded cell).

4. Furthermore, in what direction will the OVB be (positive/negative bias)? Which case/cell in the 2-by-2 matrix that lists the 2 OVB conditions?

The independent variable (hp) is positively correlated with the omitted variable (wt), and the omitted variable has a negative effect on the dependent variable (log mpg). This puts us in the “positively correlated / negative effect on Y” cell of the matrix: the direction of the OVB is negative, i.e., there is a negative bias.

  5. Show the two regressions side by side (you can use the stargazer command) and confirm the bias is in the direction the OVB formula predicted.
stargazer(Full_regression, Short_regression, 
          type = "text",
          covariate.labels = c("hp", "wt", "Constant")
          )
## 
## =================================================================
##                                  Dependent variable:             
##                     ---------------------------------------------
##                                       log(mpg)                   
##                              (1)                    (2)          
## -----------------------------------------------------------------
## hp                        -0.002***              -0.003***       
##                            (0.0004)               (0.0005)       
##                                                                  
## wt                        -0.201***                              
##                            (0.027)                               
##                                                                  
## Constant                   3.829***               3.460***       
##                            (0.069)                (0.079)        
##                                                                  
## -----------------------------------------------------------------
## Observations                  32                     32          
## R2                          0.869                  0.623         
## Adjusted R2                 0.860                  0.611         
## Residual Std. Error    0.111 (df = 29)        0.186 (df = 30)    
## F Statistic         96.232*** (df = 2; 29) 49.632*** (df = 1; 30)
## =================================================================
## Note:                                 *p<0.1; **p<0.05; ***p<0.01
hp_coeff_Full_regression  <- Full_regression$coefficients[2]
hp_coeff_Short_regression <- Short_regression$coefficients[2]
hp_coeff_Full_regression  > hp_coeff_Short_regression
##   hp 
## TRUE

The models above confirm the negative bias: the hp coefficient in the short regression (-0.003) is more negative than in the full regression (-0.002), exactly as the OVB formula predicts.
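As a further check, the OVB decomposition can be verified numerically. This is a short sketch reusing the Full_regression and Short_regression objects from above: the gap between the short- and full-model hp coefficients should equal the full model’s wt coefficient times the slope from the auxiliary regression of wt on hp.

# Auxiliary regression: omitted variable (wt) on included variable (hp)
delta_hat <- coef(lm(wt ~ hp, data = mtcars))["hp"]

# Bias predicted by the OVB formula: beta_wt * delta_hat (~ -0.0019)
coef(Full_regression)["wt"] * delta_hat

# Actual bias: short-model hp slope minus full-model hp slope (~ -0.0019)
coef(Short_regression)["hp"] - coef(Full_regression)["hp"]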

Try to provide some intuition for why the OVB formula works / biases your results in a certain direction in this example.

The positive correlation between the independent variable “Horsepower” and the omitted variable “Weight” means that when “Horsepower” increases, “Weight” tends to increase as well: more powerful cars are usually heavier.

In the full regression model, the effect of “Weight” on “MPG” is negative (-0.2): as a car’s weight increases, its miles per gallon (MPG) tend to decrease, so heavier cars are less fuel-efficient.

When I omitted “Weight” from the short regression, the model attributed some of the impact of “Weight” to “Horsepower”: it mistakenly assumes that the negative effect of weight on MPG is due to “Horsepower” alone.

All of this results in the coefficient for “Horsepower” being more negative, because the model credits “Horsepower” with part of the effect that actually belongs to “Weight”.