Omitted variable bias (OVB) arises when an important variable is left out of a statistical model. In the context of linear regression, it refers specifically to excluding a relevant independent variable from the model.
Here’s a more detailed explanation:
Assumption of the Linear Regression Model:
In linear regression, we assume that the relationship between the dependent variable (Y) and the independent variables (X) can be captured by a linear equation: \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \epsilon\).
The \(\epsilon\) term represents the error, which is assumed to be mean-independent of the regressors: \(E[\epsilon \mid X_1, X_2, \ldots] = 0\).
Omitted Variable Bias:
Omitted variable bias occurs when -
there is a relevant variable that influences both the dependent variable and at least one of the included independent variables, but
that variable is not included in the model.
This omitted variable leads to bias in the estimated coefficients of the included variables.
Consequences of OVB:
Omitted variable bias can lead to incorrect estimates of the coefficients of the included variables. The bias may cause coefficients to be overestimated or underestimated, and standard errors may be incorrect as well.
The estimated relationship between the included variables and the dependent variable may be spurious, as the omitted variable might be the actual driver of the observed relationship.
The lighter example from class: owning a lighter predicts getting lung cancer, but only because smoking, the omitted variable, drives both lighter ownership and cancer risk (see the simulation after this list).
Unfortunately, OVB does not go away as the sample size increases.
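A minimal simulation of the lighter example (all coefficients made up, purely illustrative):
set.seed(1)
n <- 100000 # a large sample does not make the bias go away
smoker  <- rbinom(n, 1, 0.3)                    # the omitted variable
lighter <- rbinom(n, 1, 0.1 + 0.6 * smoker)     # smokers own lighters more often
cancer  <- 0.5 * smoker + rnorm(n)              # lighter has NO direct effect
coef(lm(cancer ~ lighter))["lighter"]           # spurious positive "effect"
coef(lm(cancer ~ lighter + smoker))["lighter"]  # roughly 0 once smoker is included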
Conditions for OVB:
For omitted variable bias to be a concern, the omitted variable must be correlated with the included independent variable(s) and must itself affect the dependent variable.
Correlation does not imply causation (\(Correlation \not\Rightarrow Causation\)), but causation implies correlation (\(Causation \Rightarrow Correlation\)).
Correlation does not imply causation i.e. \(Correlation \not\Rightarrow Causation\) :
This means that just because two variables are correlated (i.e., there is a statistical association between them), it doesn’t necessarily mean that one variable is causing the other.
Correlation indicates that as one variable changes, the other tends to change as well, but it tells us nothing about the direction of the relationship or about cause and effect.
Causation implies correlation i.e. \(Causation \Rightarrow Correlation\) :
This part of the statement emphasizes that if there is a true causal relationship between two variables, there should be some level of correlation.
Causation implies that changes in one variable directly lead to changes in another. Therefore, if there is a cause-and-effect relationship, it should manifest as a correlation.
Omitting a relevant variable violates the zero conditional mean (exogeneity) assumption behind the Gauss-Markov theorem, so the OLS estimates are biased and inconsistent.
Addressing OVB:
To mitigate omitted variable bias, researchers need to carefully consider which variables should be included in the model.
Understanding the underlying theory and potential relationships is crucial for selecting relevant variables.
Sometimes you simply have to collect more data, i.e., new variables (columns), not more observations (rows), since increasing the sample size does not help.
Change the study design: an experiment, difference-in-differences (Diff-in-Diff), regression discontinuity design (RDD), instrumental variables (IV), fixed effects, …
Techniques such as sensitivity analysis and robustness checks can be employed to assess the impact of potential omitted variables on the results (a sketch follows this list).
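One concrete tool for the last point (my suggestion, not from these notes) is the sensemakr package, which implements the Cinelli and Hazlett (2020) sensitivity analysis. A sketch, assuming the package is installed and using the wage1 data introduced below:
# install.packages("sensemakr")
library(sensemakr)
library(wooldridge)
data(wage1)
# An illustrative wage model (specification chosen only for this sketch)
m <- lm(lwage ~ female + tenure + educ + exper, data = wage1)
# How strong would an omitted confounder need to be to overturn the female coefficient?
sens <- sensemakr(model = m, treatment = "female", benchmark_covariates = "educ")
summary(sens)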
In summary, omitted variable bias is a concern in regression analysis when important variables are left out of the model. It emphasizes the importance of careful model specification and consideration of potential confounding factors to ensure accurate and unbiased estimates.
# Clear the workspace
rm(list = ls()) # Remove all objects from your environment
gc() # Clear unused memory
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 525518 28.1 1166847 62.4 NA 669265 35.8
## Vcells 968590 7.4 8388608 64.0 32768 1840568 14.1
cat("\f") # Clear the console
graphics.off() # Clear all graphs
Let's load the wage1 data.
It is a data.frame with 526 rows and 24 variables (individual demographics, region of residence, industry/occupation, etc.):
wage. average hourly earnings
educ. years of education
exper. years potential experience
tenure. years with current employer
nonwhite. =1 if nonwhite
female. =1 if female
married. =1 if married
numdep. number of dependents
smsa. =1 if live in SMSA
northcen. =1 if live in north central U.S
south. =1 if live in southern region
west. =1 if live in western region
construc. =1 if work in construc. indus.
ndurman. =1 if in nondur. manuf. indus.
trcommpu. =1 if in trans, commun, pub ut
trade. =1 if in wholesale or retail
services. =1 if in services indus.
profserv. =1 if in prof. serv. indus.
profocc. =1 if in profess. occupation
clerocc. =1 if in clerical occupation
servocc. =1 if in service occupation
lwage. log(wage)
expersq. exper^2
tenursq. tenure^2
# install.packages("wooldridge")
library("wooldridge")
data(wage1) # comes from wooldridge dataset
library("tidyverse")
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
glimpse(wage1)
## Rows: 526
## Columns: 24
## $ wage <dbl> 3.10, 3.24, 3.00, 6.00, 5.30, 8.75, 11.25, 5.00, 3.60, 18.18,…
## $ educ <int> 11, 12, 11, 8, 12, 16, 18, 12, 12, 17, 16, 13, 12, 12, 12, 16…
## $ exper <int> 2, 22, 2, 44, 7, 9, 15, 5, 26, 22, 8, 3, 15, 18, 31, 14, 10, …
## $ tenure <int> 0, 2, 0, 28, 2, 8, 7, 3, 4, 21, 2, 0, 0, 3, 15, 0, 0, 10, 0, …
## $ nonwhite <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ female <int> 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1…
## $ married <int> 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0…
## $ numdep <int> 2, 3, 2, 0, 1, 0, 0, 0, 2, 0, 0, 0, 2, 0, 1, 1, 0, 0, 3, 0, 0…
## $ smsa <int> 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ northcen <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ south <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ west <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ construc <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ndurman <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ trcommpu <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ trade <int> 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ services <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ profserv <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1…
## $ profocc <int> 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1…
## $ clerocc <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0…
## $ servocc <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
## $ lwage <dbl> 1.1314021, 1.1755733, 1.0986123, 1.7917595, 1.6677068, 2.1690…
## $ expersq <int> 4, 484, 4, 1936, 49, 81, 225, 25, 676, 484, 64, 9, 225, 324, …
## $ tenursq <int> 0, 4, 0, 784, 4, 64, 49, 9, 16, 441, 4, 0, 0, 9, 225, 0, 0, 1…
library("stargazer")
##
## Please cite as:
##
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
stargazer(wage1,
type="text",
title = "Summary Statistics",
covariate.labels = c("Wage", "Education", "Experience")
)
##
## Summary Statistics
## =============================================
## Statistic N Mean St. Dev. Min Max
## ---------------------------------------------
## Wage 526 5.896 3.693 0.530 24.980
## Education 526 12.563 2.769 0 18
## Experience 526 17.017 13.572 1 51
## tenure 526 5.105 7.224 0 44
## nonwhite 526 0.103 0.304 0 1
## female 526 0.479 0.500 0 1
## married 526 0.608 0.489 0 1
## numdep 526 1.044 1.262 0 6
## smsa 526 0.722 0.448 0 1
## northcen 526 0.251 0.434 0 1
## south 526 0.356 0.479 0 1
## west 526 0.169 0.375 0 1
## construc 526 0.046 0.209 0 1
## ndurman 526 0.114 0.318 0 1
## trcommpu 526 0.044 0.205 0 1
## trade 526 0.287 0.453 0 1
## services 526 0.101 0.301 0 1
## profserv 526 0.259 0.438 0 1
## profocc 526 0.367 0.482 0 1
## clerocc 526 0.167 0.374 0 1
## servocc 526 0.141 0.348 0 1
## lwage 526 1.623 0.532 -0.635 3.218
## expersq 526 473.435 616.045 1 2,601
## tenursq 526 78.150 199.435 0 1,936
## ---------------------------------------------
Note that the wage variable is right-skewed, so we will use the log of wages.
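One quick numeric check of the skew (the mean appears in the summary table above; the median is computed fresh):
mean(wage1$wage)   # 5.896, well above the median => right skew
median(wage1$wage)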
hist(wage1$wage,
xlab = "Hourly Wages",
main = "Histogram"
)
hist(log(wage1$wage),
xlab = "Log Hourly Wages",
main = "Histogram"
)
Pick up how to write equations in R Markdown using LaTeX here.
full_model <- lm(data = wage1,
formula = log(wage) ~ female + tenure
)
summary(full_model)
##
## Call:
## lm(formula = log(wage) ~ female + tenure, data = wage1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.00085 -0.28200 -0.06232 0.31200 1.57325
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.688842 0.034368 49.141 < 2e-16 ***
## female -0.342132 0.042267 -8.095 4.06e-15 ***
## tenure 0.019265 0.002925 6.585 1.11e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4747 on 523 degrees of freedom
## Multiple R-squared: 0.2055, Adjusted R-squared: 0.2025
## F-statistic: 67.64 on 2 and 523 DF, p-value: < 2.2e-16
tenure_coeff_full_model <- full_model$coefficients[3] # coefficient on tenure
female_coeff_full_model <- full_model$coefficients[2] # coefficient on female
We will omit the tenure variable and rerun the regression:
short_model <- lm(data = wage1,
formula = log(wage) ~ female
)
summary(short_model)
##
## Call:
## lm(formula = log(wage) ~ female, data = wage1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.05123 -0.31774 -0.04889 0.35548 1.65773
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.81357 0.02981 60.830 <2e-16 ***
## female -0.39722 0.04307 -9.222 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4935 on 524 degrees of freedom
## Multiple R-squared: 0.1396, Adjusted R-squared: 0.138
## F-statistic: 85.04 on 1 and 524 DF, p-value: < 2.2e-16
female_coeff_short_model <- short_model$coefficients[2]
Check whether the two conditions for OVB are met if you omit tenure when running a regression of log(wage) on gender.
For omitted variable bias to occur, two conditions must be fulfilled:
X is correlated with the omitted variable.
The omitted variable is a determinant of the dependent variable Y.
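Both conditions can be eyeballed directly in the data; a quick sketch (correlations only, as rough evidence):
cor(wage1$tenure, wage1$female) # condition 1: omitted variable vs. included regressor
cor(wage1$tenure, wage1$wage)   # condition 2: omitted variable vs. outcome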
Furthermore, we can also determine the direction of the OVB.
Note that the omitted variable, tenure, should have a positive impact on the outcome variable, wage; indeed, its sample correlation with wages is positive: 0.3255379.
In practice, the direction of the bias is usually established by argument. In our case, we can estimate the full regression and thus observe the actual effect.
Full Model: \[Y = \beta_0 + \beta_1 A + \beta_2 B + u\]
Short Model: \[Y = \alpha_0 + \alpha_1 A + v\]
|                              | A and B are positively correlated | A and B are negatively correlated |
|------------------------------|-----------------------------------|-----------------------------------|
| B has a positive effect on Y | Positive bias                     | Negative bias                     |
| B has a negative effect on Y | Negative bias                     | Positive bias                     |
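The table follows from the standard OVB algebra: substitute the auxiliary regression of B on A,
\[ B = \delta_0 + \delta_1 A + w, \]
into the full model to get
\[ \alpha_1 = \beta_1 + \beta_2 \, \delta_1 , \]
so the bias \(\beta_2 \delta_1\) takes the sign of the product of B's effect on Y (\(\beta_2\)) and the A-B association (\(\delta_1\)).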
Finally: since the correlation between tenure (the omitted variable) and female (the key variable of interest) is negative (-0.198), and the effect of tenure on wage is positive (0.0192) in the full regression model, omitted variable bias will exist!
Per the table above, the bias is negative. Thus, in our short model, the estimated coefficient on female will be larger in absolute value than its true (unknown) value, i.e., more negative!
?stargazer
stargazer(full_model, short_model,
type = "text",
covariate.labels = c("Female", "Tenure", "Constant")
)
##
## ===================================================================
## Dependent variable:
## -----------------------------------------------
## log(wage)
## (1) (2)
## -------------------------------------------------------------------
## Female -0.342*** -0.397***
## (0.042) (0.043)
##
## Tenure 0.019***
## (0.003)
##
## Constant 1.689*** 1.814***
## (0.034) (0.030)
##
## -------------------------------------------------------------------
## Observations 526 526
## R2 0.206 0.140
## Adjusted R2 0.202 0.138
## Residual Std. Error 0.475 (df = 523) 0.494 (df = 524)
## F Statistic 67.642*** (df = 2; 523) 85.044*** (df = 1; 524)
## ===================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
tenure_coeff_full_model
## tenure
## 0.01926476
female_coeff_full_model > female_coeff_short_model
## female
## TRUE
The Stata output for the two regressions on page 4 is the same.
The unbiased (full-model) estimate plus the bias term gives exactly the coefficient from the short model (which suffers from OVB): \(\hat{\alpha}_1 = \hat{\beta}_1 + \hat{\beta}_2 \hat{\delta}_1\), where \(\hat{\delta}_1\) is the slope from the auxiliary regression of tenure on female.
female_coeff_short_model
## female
## -0.3972175
# delta1: slope from the auxiliary regression of the omitted variable (tenure) on female
delta1 <- lm(tenure ~ female, data = wage1)$coefficients[2]
female_coeff_full_model + tenure_coeff_full_model * delta1
## female
## -0.3972175
We would expect student test scores on English and Math to be higher on average if the student teacher ratio (STR) is low (more resources per student).
\[ TestScore = \beta_0 + \beta_1 \, STR \]
A data frame containing 420 observations on 14 variables.
district. character. District code.
school. character. School name.
county. factor indicating county.
grades. factor indicating grade span of district.
students. Total enrollment.
teachers. Number of teachers.
calworks. Percent qualifying for CalWorks (income assistance).
lunch. Percent qualifying for reduced-price lunch.
computer. Number of computers.
expenditure. Expenditure per student.
income. District average income (in USD 1,000).
english. Percent of English learners.
read. Average reading score.
math. Average math score.
# load the AER package
library(AER)
## Loading required package: car
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: sandwich
## Loading required package: survival
?CASchools
# load the data set
data(CASchools)
stargazer(CASchools,
type = "text")
##
## =======================================================
## Statistic N Mean St. Dev. Min Max
## -------------------------------------------------------
## students 420 2,628.793 3,913.105 81 27,176
## teachers 420 129.067 187.913 4.850 1,429.000
## calworks 420 13.246 11.455 0.000 78.994
## lunch 420 44.705 27.123 0.000 100.000
## computer 420 303.383 441.341 0 3,324
## expenditure 420 5,312.408 633.937 3,926.070 7,711.507
## income 420 15.317 7.226 5.335 55.328
## english 420 15.768 18.286 0.000 85.540
## read 420 654.970 20.108 604.500 704.000
## math 420 653.343 18.754 605.400 709.500
## -------------------------------------------------------
# define variables
## y variable
CASchools$score <- (CASchools$read + CASchools$math) / 2
## x variable
CASchools$STR <- CASchools$students/CASchools$teachers
short <- lm(score ~ STR, data = CASchools)
summary(short)
##
## Call:
## lm(formula = score ~ STR, data = CASchools)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.727 -14.251 0.483 12.822 48.540
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 698.9329 9.4675 73.825 < 2e-16 ***
## STR -2.2798 0.4798 -4.751 2.78e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.58 on 418 degrees of freedom
## Multiple R-squared: 0.05124, Adjusted R-squared: 0.04897
## F-statistic: 22.58 on 1 and 418 DF, p-value: 2.783e-06
\[ TestScore = \beta_0 + \beta_1 \, STR + \beta_2 \, PctEL \]
A highly relevant variable could be the percentage of English learners in the school district: the ability to speak, read, and write English is plausibly an important factor for successful learning, so students who are still learning English are likely to perform worse on tests than native speakers.
# compute correlations
cor(CASchools$score, CASchools$english)
## [1] -0.6441238
Correlation is not causation, but it does suggest that PctEL has an effect on TestScore: the omitted variable has a negative impact on the outcome.
Also, it is conceivable that the share of English learning students is bigger in school districts where class sizes are relatively large: think of poor urban districts where a lot of immigrants live.
The omitted variable is positively correlated with the key independent variable:
cor(CASchools$STR, CASchools$english)
## [1] 0.1876424
Thus, we will have a negative bias in our short regression if we omit the variable.
long <- lm(score ~ STR + english,
data = CASchools
)
stargazer(long, short,
type="text"
)
##
## ====================================================================
## Dependent variable:
## ------------------------------------------------
## score
## (1) (2)
## --------------------------------------------------------------------
## STR -1.101*** -2.280***
## (0.380) (0.480)
##
## english -0.650***
## (0.039)
##
## Constant 686.032*** 698.933***
## (7.411) (9.467)
##
## --------------------------------------------------------------------
## Observations 420 420
## R2 0.426 0.051
## Adjusted R2 0.424 0.049
## Residual Std. Error 14.464 (df = 417) 18.581 (df = 418)
## F Statistic 155.014*** (df = 2; 417) 22.575*** (df = 1; 418)
## ====================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
We indeed have a negative bias: the short-model STR coefficient (-2.280) is more negative than the long-model coefficient (-1.101).
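As a final check, the same exact in-sample identity from the wage example applies here; this sketch should reproduce the short-model STR coefficient of -2.280:
# Auxiliary regression: omitted variable (english) on the included regressor (STR)
delta1_ca <- lm(english ~ STR, data = CASchools)$coefficients[2]
# Unbiased (long-model) estimate plus the bias term
coef(long)["STR"] + coef(long)["english"] * delta1_ca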