rm(list = ls()) # Clear all objects from your environment
gc() # Clear unused memory
          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells  581592 31.1    1326975 70.9         NA   669431 35.8
Vcells 1068585  8.2    8388608 64.0      16384  1852345 14.2
cat("\f") # Clear the console
graphics.off() # Clear all graphs
Part 1) What is bias of an estimator?
The bias of an estimator is the difference between the expected value of the estimator and the true value of the parameter being estimated (mean, median, etc.). An estimator is unbiased when its expected value equals the true parameter value.
Bias measures how far the average of the estimates from repeated samples is from the actual parameter.
Negative bias means you’re consistently underestimating the true effect (your estimates are, on average, smaller than they should be).
The expected value of the estimator is less than the true value of the parameter.
Positive bias means you’re consistently overestimating the true effect (your estimates are, on average, larger than they should be).
The expected value of the estimator is greater than the true value of the parameter.
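A small simulation can make the definition concrete. The sketch below is an illustration added here, not part of the original analysis: it compares the variance estimator that divides by n (biased) with the one that divides by n - 1 (unbiased). The values of `n`, `true_var`, and `reps` are made-up assumptions.

```r
set.seed(42)                 # reproducible draws
n        <- 10               # hypothetical sample size
true_var <- 4                # hypothetical true parameter (population variance)
reps     <- 10000            # number of repeated samples

# For each repeated sample, compute both estimators of the variance
est <- replicate(reps, {
  x <- rnorm(n, mean = 0, sd = sqrt(true_var))
  c(divide_by_n         = mean((x - mean(x))^2),  # biased estimator
    divide_by_n_minus_1 = var(x))                 # unbiased estimator
})

# Bias = average of the estimates minus the true parameter value
bias <- rowMeans(est) - true_var
print(round(bias, 3))
```

Under these assumptions the first estimator has a theoretical bias of -true_var / n = -0.4 (negative bias), while the second should average out close to zero.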
Part 2) Omitted Variable Bias (OVB)
What is OVB?
OVB occurs when an important explanatory variable is left out of the model, which biases the estimated coefficients of the variables that remain in the model. The bias arises when the omitted variable both affects the dependent variable and is correlated with one or more of the included independent variables.
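For the simplest case, with one included regressor and one omitted variable, the bias has a well-known closed form. The notation below (x for the included variable, z for the omitted one, and the coefficients β and δ) is introduced here for illustration and does not come from the original text.

```latex
\begin{aligned}
\text{True model:}\quad & y = \beta_0 + \beta_1 x + \beta_2 z + u \\
\text{Short model ($z$ omitted):}\quad & y = \tilde{\beta}_0 + \tilde{\beta}_1 x + v \\
\text{Auxiliary regression:}\quad & z = \delta_0 + \delta_1 x + w \\
\text{Bias:}\quad & E[\tilde{\beta}_1] - \beta_1 = \beta_2 \, \delta_1
\end{aligned}
```

The bias is nonzero only when both β2 (the effect of the omitted variable on y) and δ1 (its association with x) are nonzero, and its sign is the sign of their product.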
Does it go away as you increase the sample size?
No, it will not go away as you increase the sample size. A larger sample improves the precision of the estimates, but it does not address the systematic error caused by omitting a relevant variable.
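This can be checked with simulated data. The sketch below uses a hypothetical data-generating process (the variables `x`, `z`, `y` and all coefficients are assumptions chosen for illustration) in which the true effect of `x` is 1 and the omitted variable `z` is correlated with `x`.

```r
set.seed(1)

# Hypothetical data-generating process: y depends on x and on z,
# and z is correlated with x. The true effect of x is 1.
short_model_slope <- function(n) {
  z <- rnorm(n)
  x <- 0.5 * z + rnorm(n)
  y <- 1 * x + 2 * z + rnorm(n)
  unname(coef(lm(y ~ x))["x"])   # z is omitted from this regression
}

sizes  <- c(100, 1000, 100000)
slopes <- sapply(sizes, short_model_slope)
print(data.frame(n = sizes, slope_on_x = round(slopes, 3)))
```

In this setup the short-model slope settles near 1 + 2 * 0.4 = 1.8 rather than the true value of 1 as n grows: the estimate becomes more precise, but it is precise around the wrong number.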
Does it go away as you add more variables?
Not necessarily. Adding more variables indiscriminately will not fix the bias; it goes away only if the added variables are the relevant ones that were previously omitted.
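A minimal sketch of this point, again on simulated data: all variable names and coefficients below are assumptions for illustration. Adding an irrelevant variable (`noise`) leaves the slope on `x` biased, while adding the variable that was actually omitted (`z`) recovers a value close to the true effect of 1.

```r
set.seed(2)
n <- 5000

z     <- rnorm(n)                   # the relevant variable that gets omitted
x     <- 0.5 * z + rnorm(n)         # key regressor, correlated with z
noise <- rnorm(n)                   # irrelevant variable, unrelated to x, z, and y
y     <- 1 * x + 2 * z + rnorm(n)   # true effect of x is 1

slope_short      <- coef(lm(y ~ x))["x"]          # z omitted
slope_irrelevant <- coef(lm(y ~ x + noise))["x"]  # wrong variable added
slope_long       <- coef(lm(y ~ x + z))["x"]      # omitted variable added

print(round(c(short           = unname(slope_short),
              with_irrelevant = unname(slope_irrelevant),
              with_omitted    = unname(slope_long)), 3))
```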
Part 3) Applying Omitted Variable Bias on a data set
Section 3.1 - Bringing in the data and creating new variables
library("AER")
data("CigarettesSW")
cig <- CigarettesSW
# excluding taxes as price in data already includes taxes, want to isolate it
cig$price_ex_taxes <- cig$price - cig$taxs
# per pack metrics
cig$price_per_pack <- cig$price_ex_taxes / cig$packs
cig$tax_per_pack <- cig$taxs / cig$packs
# summary
head(cig)
ovb_cig_lm <- lm(packs ~ income + price_per_pack, data = cig)
Section 3.3 - Understanding the conditions
Omitted variable bias has 2 conditions:
The omitted variable (tax per pack) must be correlated with the dependent variable (packs)
The omitted variable (tax per pack) must be correlated with the key independent variable (price per pack)
Based on the graph below, tax per pack is strongly correlated with both the dependent variable (packs -0.8) and the key independent variable (price per pack -0.92), satisfying both conditions.
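The two conditions can also be checked numerically. The sketch below is one possible way to do it, assuming the reported figures are Pearson correlations and that the derived variables are built as in Section 3.1; `cor_mat` is a name introduced here.

```r
library("AER")
data("CigarettesSW")
cig <- CigarettesSW

# Same derived variables as in Section 3.1
cig$price_per_pack <- (cig$price - cig$taxs) / cig$packs
cig$tax_per_pack   <- cig$taxs / cig$packs

# Pairwise Pearson correlations between the three variables of interest
cor_mat <- cor(cig[, c("packs", "price_per_pack", "tax_per_pack")])
print(round(cor_mat, 2))
```

The `tax_per_pack` row of the matrix holds the two correlations relevant to the conditions above.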
As seen above, the price coefficient is more negative when tax is omitted than after the tax variable is introduced, meaning the effect of price on cigarette consumption appears stronger in the short model.
This indicates negative bias when tax is omitted from the model.
When tax is not included, the price coefficient is biased downward, overstating the negative effect of price on cigarette consumption.
In this case, tax is an important determinant of consumption. Cigarette companies have often tried to lower prices to offset taxes, but the results show that taxes have an important influence on overall pack consumption. Failing to control for tax misattributes some of its effect to price (as the two are positively correlated), which causes the bias.
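The comparison described above can be sketched as follows. The short model uses the same specification as `ovb_cig_lm`. The long model's specification is an assumption, since the exact formula is not shown here; `short_lm`, `long_lm`, `price_short`, and `price_long` are names introduced for this example.

```r
library("AER")
data("CigarettesSW")
cig <- CigarettesSW
cig$price_per_pack <- (cig$price - cig$taxs) / cig$packs
cig$tax_per_pack   <- cig$taxs / cig$packs

# Short model: tax omitted (same specification as ovb_cig_lm)
short_lm <- lm(packs ~ income + price_per_pack, data = cig)
# Long model: tax included (assumed specification)
long_lm  <- lm(packs ~ income + price_per_pack + tax_per_pack, data = cig)

# Difference in the price coefficient = estimated bias from omitting tax
price_short <- coef(short_lm)["price_per_pack"]
price_long  <- coef(long_lm)["price_per_pack"]
print(c(short      = unname(price_short),
        long       = unname(price_long),
        difference = unname(price_short - price_long)))
```

A negative `difference` would correspond to the negative bias discussed above.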