library(ggplot2)
data("diamonds")
Discussion W3-2
1. What is the bias of an estimator?
The bias of an estimator is the difference between the expectation of the estimator and the true value of the parameter being estimated.
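In symbols, writing \(\hat{\theta}\) for the estimator and \(\theta\) for the true parameter (notation introduced here for illustration), the definition reads:
\[ \mathrm{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta \]
The estimator is unbiased when this difference is zero.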
2. In terms of omitted variable bias, will the bias go away if we increase the sample size or add more variables?
Omitted variable bias does not go away if we increase the sample size. The bias comes from leaving out a relevant variable, not from having too few observations, so a larger sample only makes the biased estimate more precise. Adding more variables does not necessarily eliminate or reduce the bias; it helps only if the added variable is the relevant one that was omitted.
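A small simulation can illustrate the sample-size point. This is a minimal sketch on made-up data, not part of the diamonds example: the function name `simulate_short_slope`, the coefficients (`beta1 = 2`, `beta2 = -1`, `delta1 = 0.5`), and the sample sizes are all assumptions chosen for illustration.

```r
# Simulated OVB: y depends on x and z, but z is left out of the regression.
# All coefficients and sample sizes below are illustrative assumptions.
set.seed(42)

simulate_short_slope <- function(n, beta1 = 2, beta2 = -1, delta1 = 0.5) {
  x <- rnorm(n)
  z <- delta1 * x + rnorm(n)                 # omitted variable, correlated with x
  y <- 1 + beta1 * x + beta2 * z + rnorm(n)  # true model includes both x and z
  unname(coef(lm(y ~ x))["x"])               # slope from the short regression
}

sample_sizes <- c(100, 10000, 100000)
short_slopes <- sapply(sample_sizes, simulate_short_slope)

# In large samples the short-regression slope approaches
# beta1 + beta2 * delta1 = 1.5, not the true beta1 = 2.
data.frame(n = sample_sizes, short_slope = short_slopes)
```

As n grows, the estimates should cluster more tightly, but around the wrong value: more data shrinks the noise, not the bias.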
Example of OVB
1. Dataset and Variables
Diamonds Dataset -
Dependent Variable: price (price in USD)
Key Independent Variable: carat (weight of diamond)
Potentially Omitted Variable: table (width of the diamond’s top as a percentage of its widest point)
Estimating Equation:
Omitted Variable Bias: Full Model
The full regression model is:
\[ price_i = \beta_0 + \beta_1 carat_i + \beta_2 table_i + u_i \]
2. Short/Incorrect Model
Estimating Equation:
\[ price_i = \alpha_0 + \alpha_1 carat_i + v_i \]
where table is omitted.
3. OVB Formula and Conditions
OVB occurs if:
- The omitted variable (table) is related to the dependent variable (price), so that its coefficient in the full model is nonzero (\(\beta_2 \neq 0\)).
- The omitted variable (table) is correlated with the included independent variable (carat).
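As a sketch of the formula behind these two conditions, suppose table is related to carat through an auxiliary regression (the symbols \(\delta_0\), \(\delta_1\), and \(e_i\) are notation introduced here, not part of the original write-up):
\[ table_i = \delta_0 + \delta_1 carat_i + e_i \]
The slope of the short model then satisfies
\[ \alpha_1 = \beta_1 + \beta_2 \delta_1 \]
The bias term \(\beta_2 \delta_1\) is nonzero only when both \(\beta_2 \neq 0\) and \(\delta_1 \neq 0\), and its sign is the product of the two signs.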
cor.test(diamonds$table, diamonds$price)
Pearson's product-moment correlation
data: diamonds$table and diamonds$price
t = 29.768, df = 53938, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.1188223 0.1354277
sample estimates:
cor
0.1271339
cor.test(diamonds$table, diamonds$carat)
Pearson's product-moment correlation
data: diamonds$table and diamonds$carat
t = 42.893, df = 53938, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.1734443 0.1897658
sample estimates:
cor
0.1816175
Both conditions for OVB are satisfied.
4. OVB Direction
cor_mat <- cor(diamonds[, c("price", "carat", "table")])
cor_mat
price carat table
price 1.0000000 0.9215913 0.1271339
carat 0.9215913 1.0000000 0.1816175
table 0.1271339 0.1816175 1.0000000
# Full model (includes both carat and table)
full_model <- lm(price ~ carat + table, data = diamonds)
# Short model (only carat)
short_model <- lm(price ~ carat, data = diamonds)
full_model
Call:
lm(formula = price ~ carat + table, data = diamonds)
Coefficients:
(Intercept) carat table
1962.0 7820.0 -74.3
short_model
Call:
lm(formula = price ~ carat, data = diamonds)
Coefficients:
(Intercept) carat
-2256 7756
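The OVB will be negative. In our results, the omitted variable (table) is positively correlated with the included independent variable (carat), and its estimated effect on price in the full model is negative (the coefficient on table is -74.3). The sign that matters here is this partial effect from the full model, not the raw correlation between table and price, which is positive (0.127). A positive correlation combined with a negative effect gives a negative bias; in a standard 2x2 OVB matrix this is the bottom left “Negative Bias” cell.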
5. Regressions Side by Side
# Full model: includes both carat and table
full_model <- lm(price ~ carat + table, data = diamonds)
# Short model: omits table
short_model <- lm(price ~ carat, data = diamonds)
library(stargazer)
Please cite as:
Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
stargazer(
  full_model, short_model,
  type = "text",
  covariate.labels = c("Carat", "Table", "Constant")
)
==================================================================================
                                       Dependent variable:
                    --------------------------------------------------------------
                                              price
                                 (1)                            (2)
----------------------------------------------------------------------------------
Carat                       7,820.038***                   7,756.426***
                              (14.225)                       (14.067)

Table                        -74.301***
                              (3.018)

Constant                    1,961.992***                   -2,256.361***
                             (171.811)                       (13.055)

----------------------------------------------------------------------------------
Observations                   53,940                         53,940
R2                              0.851                          0.849
Adjusted R2                     0.851                          0.849
Residual Std. Error    1,539.946 (df = 53937)         1,548.562 (df = 53938)
F Statistic         154,034.600*** (df = 2; 53937) 304,050.900*** (df = 1; 53938)
==================================================================================
Note:                                                  *p<0.1; **p<0.05; ***p<0.01
The coefficient on “carat” rises from 7,756.43 in the short model to 7,820.04 in the full model, a difference of about 63.61. The short model therefore underestimates the effect of “carat” on “price”: omitting the “table” variable produces a negative bias. This matches the direction the OVB formula predicts from the negative coefficient on “table” and the positive correlation between “table” and “carat.”
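The size of the gap can also be checked against the OVB formula directly. The sketch below refits both models so it runs on its own, then adds an auxiliary regression of table on carat. The names `aux_model`, `beta2`, `delta1`, `bias_direct`, and `bias_formula` are introduced here for illustration and are not part of the original analysis.

```r
library(ggplot2)  # provides the diamonds dataset
data("diamonds")

full_model  <- lm(price ~ carat + table, data = diamonds)
short_model <- lm(price ~ carat, data = diamonds)
# Auxiliary regression of the omitted variable on the included one
# (aux_model is a hypothetical name introduced for this check)
aux_model   <- lm(table ~ carat, data = diamonds)

beta2  <- coef(full_model)[["table"]]  # effect of table on price, given carat
delta1 <- coef(aux_model)[["carat"]]   # slope of table on carat

# Bias measured directly as the gap between the two carat coefficients
bias_direct  <- coef(short_model)[["carat"]] - coef(full_model)[["carat"]]
# Bias implied by the OVB formula
bias_formula <- beta2 * delta1

c(direct = bias_direct, formula = bias_formula)
```

For OLS fitted on the same sample, the two numbers agree up to floating-point error, because the decomposition is an algebraic identity and not an approximation.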
6. Intuition: Why does the OVB formula work and bias the results in this direction?
OVB happens when a relevant variable, such as “table” in this scenario, is left out of a model and that variable is related to both the key independent variable (“carat”) and the outcome (“price”). Leaving it out causes the regression to wrongly attribute some of the effect of the missing variable to the variables that are kept. The formula works because it shows mathematically that the size and direction of the bias depend on how strongly the omitted variable relates to the outcome and to the included variable.
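One way to see the mechanics, sketched under the assumption that table is linearly related to carat as \(table_i = \delta_0 + \delta_1 carat_i + e_i\) (notation introduced here for illustration), is to substitute that relation into the full model:
\[ price_i = (\beta_0 + \beta_2 \delta_0) + (\beta_1 + \beta_2 \delta_1) carat_i + (\beta_2 e_i + u_i) \]
The regression of price on carat alone picks up the whole bracket \(\beta_1 + \beta_2 \delta_1\) as its slope. In this example \(\beta_2\) is negative and \(\delta_1\) is positive, so the extra term \(\beta_2 \delta_1\) is negative and pulls the carat coefficient down.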