Discussion W3-2


1. What is bias of an estimator?

The bias of an estimator is the difference between the expectation of the estimator and the true value of the parameter being estimated.
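In symbols, writing \(\hat{\theta}\) for the estimator and \(\theta\) for the true parameter (this notation is assumed here for illustration), the definition reads:

\[ \mathrm{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta \]

An estimator is unbiased when this difference is zero.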

2. In terms of omitted variable bias, will the bias go away if we increase the sample size or add more variables?

Omitted variable bias does not go away if we increase the sample size. The bias comes from leaving out a relevant variable, not from having too few observations, so a larger sample only makes the estimate more precise around the wrong value. Adding more variables does not necessarily eliminate or reduce the bias; it helps only when a relevant omitted variable is among those added.
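A small simulation can illustrate the sample-size point. This is only a sketch on made-up data: the variable names (x, z, y), the coefficients, and the sample sizes are all assumptions chosen for illustration and have nothing to do with the diamonds example.

```r
# Simulated illustration; every name and parameter value here is hypothetical.
set.seed(42)

simulate_short_slope <- function(n) {
  x <- rnorm(n)                       # included regressor
  z <- 0.5 * x + rnorm(n)             # omitted variable, correlated with x
  y <- 1 + 2 * x + 3 * z + rnorm(n)   # true slope on x is 2
  unname(coef(lm(y ~ x))["x"])        # short regression that leaves out z
}

sample_sizes <- c(100, 10000, 100000)
short_slopes <- sapply(sample_sizes, simulate_short_slope)

# Compare the short-model slope across sample sizes with the true value of 2
data.frame(n = sample_sizes, short_slope = short_slopes)
```

Under these assumed values the short-model slope should settle near 2 + 3 * 0.5 = 3.5 as n grows, rather than near the true value of 2: more data tightens the estimate but does not move it toward the truth.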

Example of OVB

1. Dataset and Variables

Diamonds Dataset -

library(ggplot2)
data("diamonds")

Dependent Variable: price (price in USD)

Key Independent Variable: carat (weight of diamond)

Potentially Omitted Variable: table (width of the diamond’s top as a percentage of its widest point)

Estimating Equation:

Omitted Variable Bias: Full Model

The full regression model is:

\[ price_i = \beta_0 + \beta_1 carat_i + \beta_2 table_i + u_i \]

2. Short/Incorrect Model

Estimating Equation:

\[ price_i = \alpha_0 + \alpha_1 carat_i + v_i \]

where table is omitted.

3. OVB Formula and Conditions

OVB occurs if:

  1. The omitted variable (table) is related to the dependent variable (price), meaning its coefficient in the full model is nonzero.
  2. The omitted variable (table) is correlated with the included independent variable (carat).
cor.test(diamonds$table, diamonds$price)

    Pearson's product-moment correlation

data:  diamonds$table and diamonds$price
t = 29.768, df = 53938, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1188223 0.1354277
sample estimates:
      cor 
0.1271339 
cor.test(diamonds$table, diamonds$carat)

    Pearson's product-moment correlation

data:  diamonds$table and diamonds$carat
t = 42.893, df = 53938, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1734443 0.1897658
sample estimates:
      cor 
0.1816175 

Both correlations are statistically significant, so both conditions for OVB appear to be satisfied.
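For reference, the OVB formula behind these two conditions can be written as follows. The \(\delta\) notation is introduced here for illustration: \(\delta_1\) is the slope from an auxiliary regression of table on carat.

\[ \alpha_1 = \beta_1 + \beta_2 \delta_1 \]

\[ \text{Bias} = \alpha_1 - \beta_1 = \beta_2 \delta_1 \]

The bias term is zero only if \(\beta_2 = 0\) (the first condition fails) or \(\delta_1 = 0\) (the second condition fails).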

4. OVB Direction

cor_mat <- cor(diamonds[, c("price", "carat", "table")])
cor_mat
          price     carat     table
price 1.0000000 0.9215913 0.1271339
carat 0.9215913 1.0000000 0.1816175
table 0.1271339 0.1816175 1.0000000
# Full model (includes both carat and table)
full_model <- lm(price ~ carat + table, data = diamonds)

# Short model (only carat)
short_model <- lm(price ~ carat, data = diamonds)

full_model

Call:
lm(formula = price ~ carat + table, data = diamonds)

Coefficients:
(Intercept)        carat        table  
     1962.0       7820.0        -74.3  
short_model

Call:
lm(formula = price ~ carat, data = diamonds)

Coefficients:
(Intercept)        carat  
      -2256         7756  

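The OVB will be negative. In the full model the coefficient on the omitted variable (table) is negative (-74.3), while table is positively correlated with the included variable (carat, r = 0.18). A negative effect on price combined with a positive correlation with carat gives a negative bias, which is the bottom left “Negative Bias” cell of a standard 2x2 OVB matrix. Note that the raw correlation between table and price is positive (0.13); the sign that matters is the partial effect of table once carat is held fixed.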
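The 2x2 sign logic can be sketched in a few lines. The helper `ovb_direction()` and the auxiliary regression of table on carat are not part of the original analysis; they are introduced here only to make the sign rule explicit.

```r
library(ggplot2)
data("diamonds")

# Hypothetical helper: direction of the bias from the signs of its two components
ovb_direction <- function(beta2, delta1) {
  s <- sign(beta2) * sign(delta1)
  if (s > 0) "positive bias" else if (s < 0) "negative bias" else "no bias"
}

# Partial effect of table on price, holding carat fixed
beta2 <- coef(lm(price ~ carat + table, data = diamonds))["table"]

# Slope of table on carat (auxiliary regression, assumed here for illustration)
delta1 <- coef(lm(table ~ carat, data = diamonds))["carat"]

direction <- ovb_direction(beta2, delta1)
direction
```

Given the negative coefficient on table and the positive correlation between table and carat reported above, this should return "negative bias".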

5. Regressions Side by Side

# Full model: includes both carat and table
full_model <- lm(price ~ carat + table, data = diamonds)
# Short model: omits table
short_model <- lm(price ~ carat, data = diamonds)

library(stargazer)

Please cite as: 
 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer 
stargazer(
  full_model, short_model,
  type = "text",
  covariate.labels = c("Carat", "Table", "Constant")
)

=================================================================================
                                         Dependent variable:                     
                    -------------------------------------------------------------
                                                price                            
                                 (1)                            (2)              
---------------------------------------------------------------------------------
Carat                        7,820.038***                   7,756.426***         
                               (14.225)                       (14.067)           
                                                                                 
Table                         -74.301***                                         
                               (3.018)                                           
                                                                                 
Constant                     1,961.992***                  -2,256.361***         
                              (171.811)                       (13.055)           
                                                                                 
---------------------------------------------------------------------------------
Observations                    53,940                         53,940            
R2                              0.851                          0.849             
Adjusted R2                     0.851                          0.849             
Residual Std. Error     1,539.946 (df = 53937)         1,548.562 (df = 53938)    
F Statistic         154,034.600*** (df = 2; 53937) 304,050.900*** (df = 1; 53938)
=================================================================================
Note:                                                 *p<0.1; **p<0.05; ***p<0.01

The coefficient on “carat” increases from 7,756.43 in the short model to 7,820.04 in the full model, a difference of about 63.6. Omitting “table” therefore biased the estimated effect of “carat” on “price” downward, i.e. a negative bias. This matches the direction the OVB formula predicts from a negative coefficient on “table” and a positive correlation between “table” and “carat.”
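The size of the gap can also be checked against the OVB formula. The sketch below refits the two models and adds an auxiliary regression of table on carat; `aux_model` and the two `*_bias` names are introduced here for illustration. For OLS, the difference between the short and full carat coefficients equals the full-model coefficient on table times the auxiliary slope.

```r
library(ggplot2)
data("diamonds")

full_model  <- lm(price ~ carat + table, data = diamonds)
short_model <- lm(price ~ carat, data = diamonds)

# Auxiliary regression of the omitted variable on the included one
aux_model <- lm(table ~ carat, data = diamonds)

# Bias observed directly: short-model coefficient minus full-model coefficient
observed_bias <- unname(coef(short_model)["carat"] - coef(full_model)["carat"])

# Bias implied by the OVB formula: (effect of table) x (slope of table on carat)
predicted_bias <- unname(coef(full_model)["table"] * coef(aux_model)["carat"])

c(observed = observed_bias, predicted = predicted_bias)
all.equal(observed_bias, predicted_bias)
```

Both numbers should agree up to floating-point error and be negative, in line with the roughly -63.6 gap visible in the table.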

6. Intuition: Why does the OVB formula work and bias your results in this direction?

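OVB happens when a relevant variable, such as “table” in this scenario, is left out of a model and that variable is related to both the key independent variable (“carat”) and the outcome (“price”). Leaving it out causes the regression to wrongly attribute some of the effect of the missing variable to the variables that are kept. The formula works because it shows mathematically that the size and direction of the bias depend on two things: how strongly the omitted variable affects the outcome and how strongly it is related to the included variable. Here larger diamonds tend to have larger tables, and a larger table is associated with a lower price once carat is held fixed, so the short model folds that negative effect into the carat coefficient and understates it.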
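The same intuition can be traced algebraically. As a sketch, suppose table is related to carat through a linear auxiliary equation (the \(\delta\) coefficients and the error \(e_i\) are notation assumed here):

\[ table_i = \delta_0 + \delta_1 carat_i + e_i \]

Substituting this into the full model gives:

\[ price_i = (\beta_0 + \beta_2 \delta_0) + (\beta_1 + \beta_2 \delta_1) carat_i + (u_i + \beta_2 e_i) \]

Matching terms with the short model shows that its slope picks up \(\alpha_1 = \beta_1 + \beta_2 \delta_1\), so the part of table that moves with carat is credited to carat. With \(\beta_2 < 0\) and \(\delta_1 > 0\), the extra term is negative.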