Omitted Variable Bias

OVB

Omitted variable bias (OVB) is a form of bias that arises in statistical models when an important variable is left out of the analysis. In linear regression, it refers to the situation where a relevant independent variable is excluded from the model.

Here’s a more detailed explanation:

  1. Assumption of the Linear Regression Model:

    • In linear regression, we assume that the relationship between the dependent variable (Y) and the independent variables (X) can be captured by a linear equation: \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \epsilon\).

    • The \(\epsilon\) term represents the error, which is assumed to be independent of the independent variables (formally, \(E[\epsilon \mid X] = 0\)).

  2. Omitted Variable Bias:

    • Omitted variable bias occurs when:

      • there is a relevant variable that affects the dependent variable and is correlated with one of the included independent variables, but

      • that relevant variable is not included in the model.

    • This omitted variable leads to bias in the estimated coefficients of the included variables.

  3. Consequences of OVB:

    • Omitted variable bias can lead to incorrect estimates of the coefficients of the included variables. The bias may cause coefficients to be overestimated or underestimated, and standard errors may be incorrect as well.

    • The estimated relationship between the included variables and the dependent variable may be spurious, as the omitted variable might be the actual driver of the observed relationship.

      • Recall the in-class example of owning a lighter and getting lung cancer, where being a smoker was the omitted variable.

      • Unfortunately, OVB does not go away as the sample size increases (see the simulation sketch at the end of this section).

  4. Conditions for OVB:

    • For omitted variable bias to be a concern, the omitted variable must be correlated with the included independent variable(s) and must itself affect the dependent variable.

      • Correlation does not imply causation (\(Correlation \not\Rightarrow Causation\)), but causation implies correlation (\(Causation \Rightarrow Correlation\)).

        • Correlation does not imply causation, i.e., \(Correlation \not\Rightarrow Causation\):

          • This means that just because two variables are correlated (i.e., there is a statistical association between them), it doesn’t necessarily mean that one variable is causing the other.

          • Correlation indicates a relationship in the sense that as one variable changes, the other tends to change, but it doesn’t tell us anything about the direction or cause-and-effect relationship.

        • Causation implies correlation, i.e., \(Causation \Rightarrow Correlation\):

          • This part of the statement emphasizes that if there is a true causal relationship between two variables, there should be some level of correlation.

          • Causation implies that changes in one variable directly lead to changes in another. Therefore, if there is a cause-and-effect relationship, it should manifest as a correlation.

    • Omitting a relevant variable violates the zero conditional mean assumption (\(E[u \mid X] = 0\)), one of the conditions behind the Gauss-Markov theorem, so the OLS estimates are biased and inconsistent.

  5. Addressing OVB:

    • To mitigate omitted variable bias, researchers need to carefully consider which variables should be included in the model.

    • Understanding the underlying theory and potential relationships is crucial for selecting relevant variables.

      • Sometimes you just have to collect more data, meaning new variables (columns), not more observations (rows, i.e., a larger sample).

      • Change the methodology to a different study design, such as an experiment, difference-in-differences (Diff-in-Diff), regression discontinuity (RDD), instrumental variables (IV), fixed effects, …

    • Techniques such as sensitivity analysis and robustness checks can be employed to assess the impact of potential omitted variables on the results.
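
One common robustness check is to re-estimate the model with candidate controls added one at a time and watch how the key coefficient moves. Here is a minimal sketch using the built-in mtcars data (the variables are purely illustrative, not part of this lab):

# Track the coefficient on wt as plausible controls are added one by one.
specs <- list(mpg ~ wt,
              mpg ~ wt + hp,
              mpg ~ wt + hp + qsec)
sapply(specs, function(f) coef(lm(f, data = mtcars))[["wt"]])

If the key coefficient moves substantially as controls enter, omitted variables were likely biasing the shorter specifications.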

In summary, omitted variable bias is a concern in regression analysis when important variables are left out of the model. It emphasizes the importance of careful model specification and consideration of potential confounding factors to ensure accurate and unbiased estimates.
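
To see why more observations cannot fix OVB, here is a minimal simulation sketch (the data-generating process below is a hypothetical illustration, not part of the lab data): z confounds the x-y relationship, and the short regression that omits z stays biased no matter how large the sample gets.

# Hypothetical DGP: the true effect of x on y is 1;
# z affects y and is also correlated with x (a classic confounder).
set.seed(42)
ovb_sim <- function(n) {
  z <- rnorm(n)                           # confounder (think: smoker)
  x <- 0.5 * z + rnorm(n)                 # correlated with z (think: owns a lighter)
  y <- 1 * x + 2 * z + rnorm(n)           # z also drives y (think: lung cancer)
  c(short = coef(lm(y ~ x))[["x"]],       # omits z -> biased
    full  = coef(lm(y ~ x + z))[["x"]])   # includes z -> consistent
}
sapply(c(100, 10000, 1000000), ovb_sim)

At every sample size the short-model slope hovers around \(1 + 2 \times 0.5 / 1.25 = 1.8\) rather than the true 1, while the full-model slope recovers the truth.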

# Clear the workspace
  rm(list = ls()) # Clear all files from your environment
  gc()            # Clear unused memory
          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells  597436 32.0    1356840 72.5         NA   700240 37.4
Vcells 1104716  8.5    8388608 64.0      49152  1963330 15.0
  cat("\f")       # Clear the console
  graphics.off()  # Clear all graphs

EXAMPLE: Gender Discrimination in Wages

Data

Let's load the wage1 data.

It is a data.frame with 526 rows and 24 variables (individual demographics, the region where they live, their industry/occupation, etc.):

  • wage. average hourly earnings

  • educ. years of education

  • exper. years potential experience

  • tenure. years with current employer

  • nonwhite. =1 if nonwhite

  • female. =1 if female

  • married. =1 if married

  • numdep. number of dependents

  • smsa. =1 if live in SMSA

  • northcen. =1 if live in north central U.S

  • south. =1 if live in southern region

  • west. =1 if live in western region

  • construc. =1 if work in construc. indus.

  • ndurman. =1 if in nondur. manuf. indus.

  • trcommpu. =1 if in trans, commun, pub ut

  • trade. =1 if in wholesale or retail

  • services. =1 if in services indus.

  • profserv. =1 if in prof. serv. indus.

  • profocc. =1 if in profess. occupation

  • clerocc. =1 if in clerical occupation

  • servocc. =1 if in service occupation

  • lwage. log(wage)

  • expersq. exper^2

  • tenursq. tenure^2

Data dictionary.

# install.packages("wooldridge")
library("wooldridge") 
data(wage1) # comes from wooldridge dataset

library("tidyverse")
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
glimpse(wage1)
Rows: 526
Columns: 24
$ wage     <dbl> 3.10, 3.24, 3.00, 6.00, 5.30, 8.75, 11.25, 5.00, 3.60, 18.18,…
$ educ     <int> 11, 12, 11, 8, 12, 16, 18, 12, 12, 17, 16, 13, 12, 12, 12, 16…
$ exper    <int> 2, 22, 2, 44, 7, 9, 15, 5, 26, 22, 8, 3, 15, 18, 31, 14, 10, …
$ tenure   <int> 0, 2, 0, 28, 2, 8, 7, 3, 4, 21, 2, 0, 0, 3, 15, 0, 0, 10, 0, …
$ nonwhite <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ female   <int> 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1…
$ married  <int> 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0…
$ numdep   <int> 2, 3, 2, 0, 1, 0, 0, 0, 2, 0, 0, 0, 2, 0, 1, 1, 0, 0, 3, 0, 0…
$ smsa     <int> 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ northcen <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ south    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ west     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ construc <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ ndurman  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ trcommpu <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ trade    <int> 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ services <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ profserv <int> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1…
$ profocc  <int> 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1…
$ clerocc  <int> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0…
$ servocc  <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
$ lwage    <dbl> 1.1314021, 1.1755733, 1.0986123, 1.7917595, 1.6677068, 2.1690…
$ expersq  <int> 4, 484, 4, 1936, 49, 81, 225, 25, 676, 484, 64, 9, 225, 324, …
$ tenursq  <int> 0, 4, 0, 784, 4, 64, 49, 9, 16, 441, 4, 0, 0, 9, 225, 0, 0, 1…
library("stargazer")

Please cite as: 

 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer 
stargazer(wage1, 
          type="text", 
          title = "Summary Statistics", 
          covariate.labels = c("Wage", "Education", "Experience")
          )

Summary Statistics
=============================================
Statistic   N   Mean   St. Dev.  Min    Max  
---------------------------------------------
Wage       526  5.896   3.693   0.530  24.980
Education  526 12.563   2.769     0      18  
Experience 526 17.017   13.572    1      51  
tenure     526  5.105   7.224     0      44  
nonwhite   526  0.103   0.304     0      1   
female     526  0.479   0.500     0      1   
married    526  0.608   0.489     0      1   
numdep     526  1.044   1.262     0      6   
smsa       526  0.722   0.448     0      1   
northcen   526  0.251   0.434     0      1   
south      526  0.356   0.479     0      1   
west       526  0.169   0.375     0      1   
construc   526  0.046   0.209     0      1   
ndurman    526  0.114   0.318     0      1   
trcommpu   526  0.044   0.205     0      1   
trade      526  0.287   0.453     0      1   
services   526  0.101   0.301     0      1   
profserv   526  0.259   0.438     0      1   
profocc    526  0.367   0.482     0      1   
clerocc    526  0.167   0.374     0      1   
servocc    526  0.141   0.348     0      1   
lwage      526  1.623   0.532   -0.635 3.218 
expersq    526 473.435 616.045    1    2,601 
tenursq    526 78.150  199.435    0    1,936 
---------------------------------------------

Note that there is skewness in the wage variable, so we will use the log of wages.

hist(wage1$wage, 
     xlab = "Hourly Wages",
     main = "Histogram"
     )

hist(log(wage1$wage),
     xlab = "Log Hourly Wages",
     main = "Histogram"
     )

Full Regression

See how to write equations in R Markdown using LaTeX here.

  • \(\log(wage)_i = \beta_0 + \beta_1 \, female_i + \beta_2 \, tenure_i + u_i\)
full_model <- lm(data = wage1, 
                 formula = log(wage) ~ female + tenure
                 )

summary(full_model)

Call:
lm(formula = log(wage) ~ female + tenure, data = wage1)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.00085 -0.28200 -0.06232  0.31200  1.57325 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.688842   0.034368  49.141  < 2e-16 ***
female      -0.342132   0.042267  -8.095 4.06e-15 ***
tenure       0.019265   0.002925   6.585 1.11e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4747 on 523 degrees of freedom
Multiple R-squared:  0.2055,    Adjusted R-squared:  0.2025 
F-statistic: 67.64 on 2 and 523 DF,  p-value: < 2.2e-16
tenure_coeff_full_model <- full_model$coefficients[3]
female_coeff_full_model <- full_model$coefficients[2]

Short Regression

We will omit the tenure variable and rerun the same regression:

  • \(\log(wage)_i = \beta_0 + \beta_1 \, female_i + u_i\)
short_model <- lm(data = wage1, 
                  formula = log(wage) ~ female 
                  )

summary(short_model)

Call:
lm(formula = log(wage) ~ female, data = wage1)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.05123 -0.31774 -0.04889  0.35548  1.65773 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.81357    0.02981  60.830   <2e-16 ***
female      -0.39722    0.04307  -9.222   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4935 on 524 degrees of freedom
Multiple R-squared:  0.1396,    Adjusted R-squared:  0.138 
F-statistic: 85.04 on 1 and 524 DF,  p-value: < 2.2e-16
female_coeff_short_model <- short_model$coefficients[2]

Check for OVB

Check whether the two conditions for OVB are met when tenure is omitted from a regression of log(wage) on gender.

For omitted variable bias to occur, two conditions must be fulfilled:

  1. X is correlated with the omitted variable.

  2. The omitted variable is a determinant of the dependent variable Y.

Furthermore, we can get the direction of OVB as well.

Condition I. X is correlated with the omitted variable.

Note that the omitted variable (tenure) is negatively correlated with the key independent variable / variable of interest (female).

cor(wage1$tenure, wage1$female) # correlation between the omitted variable and the key regressor
[1] -0.1979103

The correlation is not close to 0, so Condition I is satisfied.

cor(wage1[,c(1,4,6)]) # correlation coefficient of wage, tenure and female 
             wage     tenure     female
wage    1.0000000  0.3468896 -0.3400979
tenure  0.3468896  1.0000000 -0.1979103
female -0.3400979 -0.1979103  1.0000000
wage1$lnwage <- log(wage1$wage) 

corr_mat <- cor(wage1[,c(25,4,6)]) # replace wage with ln(wage)
corr_mat
           lnwage     tenure     female
lnwage  1.0000000  0.3255379 -0.3736774
tenure  0.3255379  1.0000000 -0.1979103
female -0.3736774 -0.1979103  1.0000000

ASIDE: Note that the correlation between the variables and wages / log(wages) (its transformation) is pretty much the same. Taking the log does not really change the relationship.

Condition II. The omitted variable is a determinant of the dependent variable Y.

Note that the omitted variable (tenure) should have a positive effect on the dependent/outcome variable (wage).

cor(wage1$lnwage, wage1$tenure) # correlation between the omitted variable and the outcome
[1] 0.3255379

In practice, this condition is usually established by argument. In our case, we treat the full regression as the true model, so we can read off the actual effect.

Full Model: \[Y = \beta_0 + \beta_1 \, A + \beta_2 \, B + u\]

Short Model: \[Y = \alpha_0 + \alpha_1 \, A + v\]

=====================================================================
                              A, B positively      A, B negatively
                              correlated           correlated
---------------------------------------------------------------------
B has a positive effect on Y  Positive bias        Negative bias*
B has a negative effect on Y  Negative bias        Positive bias
=====================================================================
* our case: B (tenure) raises Y (wage) and is negatively correlated with A (female).
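
This table is just the standard OVB decomposition: if the full model above is the true model, the short-model slope satisfies

\[\alpha_1 = \beta_1 + \beta_2 \, \delta_1, \qquad \delta_1 = \frac{Cov(A, B)}{Var(A)},\]

where \(\delta_1\) is the slope from the auxiliary regression of B on A. The sign of the bias term \(\beta_2 \, \delta_1\) is the product of the sign of B's effect on Y (\(\beta_2\)) and the sign of the A-B correlation (which determines the sign of \(\delta_1\)), exactly as the table encodes.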

Finally -

  • Since the correlation between tenure (the omitted variable) and female (the key variable of interest) is negative (-0.198), and the effect of tenure on wage is positive (0.0192) in the full regression model, omitted variable bias will exist!

    • Furthermore, we can say the bias will be negative!
  • Thus, in our short model, the estimated key coefficient will be larger in absolute value than the full-model estimate, i.e., it will be more negative!

?stargazer
stargazer(full_model, short_model, 
          type = "text",
          covariate.labels = c("Female", "Tenure", "Constant")
          )

===================================================================
                                  Dependent variable:              
                    -----------------------------------------------
                                       log(wage)                   
                              (1)                     (2)          
-------------------------------------------------------------------
Female                     -0.342***               -0.397***       
                            (0.042)                 (0.043)        
                                                                   
Tenure                     0.019***                                
                            (0.003)                                
                                                                   
Constant                   1.689***                1.814***        
                            (0.034)                 (0.030)        
                                                                   
-------------------------------------------------------------------
Observations                  526                     526          
R2                           0.206                   0.140         
Adjusted R2                  0.202                   0.138         
Residual Std. Error    0.475 (df = 523)        0.494 (df = 524)    
F Statistic         67.642*** (df = 2; 523) 85.044*** (df = 1; 524)
===================================================================
Note:                                   *p<0.1; **p<0.05; ***p<0.01
tenure_coeff_full_model
    tenure 
0.01926476 
female_coeff_full_model > female_coeff_short_model
female 
  TRUE 

Stata output for the same two regressions is on page 4.

Note: we can visualize the full-model coefficient estimates and their confidence intervals with coefplot.
# install.packages("coefplot")
library(coefplot)
coefplot(full_model)

Predicting Biased Coefficient Value

Unbiased estimate plus the bias term gives us the exact coefficient from the short model (which suffers from OVB).

  • Biased Estimate (Short Model)
female_coeff_short_model 
    female 
-0.3972175 
  • Create the biased estimate from the full model: add the bias term \(\hat{\beta}_2 \, \hat{\delta}_1\), where \(\hat{\delta}_1\) is the slope from the auxiliary regression of the omitted variable (tenure) on the included variable (female).
aux_model <- lm(tenure ~ female, data = wage1) # auxiliary regression of tenure on female
delta_1 <- aux_model$coefficients[2]

female_coeff_full_model + tenure_coeff_full_model * delta_1 
    female 
-0.3972175 
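
Because this OLS decomposition is an exact algebraic identity, the reconstructed value matches the short-model coefficient to machine precision:

all.equal(female_coeff_full_model + tenure_coeff_full_model * delta_1,
          female_coeff_short_model)
[1] TRUE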

EXAMPLE: Test Scores and Student Teacher Ratio

We would expect student test scores on English and Math to be higher on average if the student teacher ratio (STR) is low (more resources per student).

\[ TestScore = \beta_0 + \beta_1 \, STR \]

A data frame containing 420 observations on 14 variables.

  • district. character. District code.

  • school. character. School name.

  • county. factor indicating county.

  • grades. factor indicating grade span of district.

  • students. Total enrollment.

  • teachers. Number of teachers.

  • calworks. Percent qualifying for CalWorks (income assistance).

  • lunch. Percent qualifying for reduced-price lunch.

  • computer. Number of computers.

  • expenditure. Expenditure per student.

  • income. District average income (in USD 1,000).

  • english. Percent of English learners.

  • read. Average reading score.

  • math. Average math score.

# load the AER package
library(AER)
Loading required package: car
Loading required package: carData

Attaching package: 'car'
The following object is masked from 'package:dplyr':

    recode
The following object is masked from 'package:purrr':

    some
Loading required package: lmtest
Loading required package: zoo

Attaching package: 'zoo'
The following objects are masked from 'package:base':

    as.Date, as.Date.numeric
Loading required package: sandwich
Loading required package: survival
?CASchools
# load the data set
data(CASchools) 



stargazer(CASchools, 
          type = "text")

=======================================================
Statistic    N    Mean    St. Dev.     Min       Max   
-------------------------------------------------------
students    420 2,628.793 3,913.105    81      27,176  
teachers    420  129.067   187.913    4.850   1,429.000
calworks    420  13.246    11.455     0.000    78.994  
lunch       420  44.705    27.123     0.000    100.000 
computer    420  303.383   441.341      0       3,324  
expenditure 420 5,312.408  633.937  3,926.070 7,711.507
income      420  15.317     7.226     5.335    55.328  
english     420  15.768    18.286     0.000    85.540  
read        420  654.970   20.108    604.500   704.000 
math        420  653.343   18.754    605.400   709.500 
-------------------------------------------------------
# define variables
## y variable
CASchools$score <- (CASchools$read + CASchools$math) / 2 
## x variable
CASchools$STR   <- CASchools$students/CASchools$teachers       


short <- lm(score ~ STR, data = CASchools)
summary(short)

Call:
lm(formula = score ~ STR, data = CASchools)

Residuals:
    Min      1Q  Median      3Q     Max 
-47.727 -14.251   0.483  12.822  48.540 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 698.9329     9.4675  73.825  < 2e-16 ***
STR          -2.2798     0.4798  -4.751 2.78e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.58 on 418 degrees of freedom
Multiple R-squared:  0.05124,   Adjusted R-squared:  0.04897 
F-statistic: 22.58 on 1 and 418 DF,  p-value: 2.783e-06

English Language Ability as an Omitted Variable

\[ TestScore = \beta_0 + \beta_1 \, STR + \beta_2 \, PctEL \]

A highly relevant variable could be the percentage of English learners in the school district: it is plausible that the ability to speak, read, and write English is an important factor for successful learning. Therefore, students who are still learning English are likely to perform worse on tests than native speakers.

# compute correlations
cor(CASchools$score, CASchools$english)
[1] -0.6441238

Correlation is not causation, but it does suggest that PctEL has an effect on TestScore.

The omitted variable has a negative impact on the dependent variable (Condition II).

Also, it is conceivable that the share of English-learning students is larger in school districts where class sizes are relatively large: think of poor urban districts where many immigrants live.

The omitted variable is positively correlated with the key independent variable (Condition I).

cor(CASchools$STR, CASchools$english)
[1] 0.1876424

Thus, if we omit the variable, the short regression will have a negative bias (a positive correlation with the key regressor times a negative effect on the outcome).

long <- lm(score ~ STR + english, 
           data = CASchools
           )
stargazer(long, short, 
          type="text"
          )

====================================================================
                                  Dependent variable:               
                    ------------------------------------------------
                                         score                      
                              (1)                      (2)          
--------------------------------------------------------------------
STR                        -1.101***                -2.280***       
                            (0.380)                  (0.480)        
                                                                    
english                    -0.650***                                
                            (0.039)                                 
                                                                    
Constant                   686.032***              698.933***       
                            (7.411)                  (9.467)        
                                                                    
--------------------------------------------------------------------
Observations                  420                      420          
R2                           0.426                    0.051         
Adjusted R2                  0.424                    0.049         
Residual Std. Error    14.464 (df = 417)        18.581 (df = 418)   
F Statistic         155.014*** (df = 2; 417) 22.575*** (df = 1; 418)
====================================================================
Note:                                    *p<0.1; **p<0.05; ***p<0.01

Indeed, we have a negative bias! The short-model STR coefficient (-2.280) is more negative than the long-model estimate (-1.101).
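
As in the wage example, the OVB decomposition lets us reconstruct the short-model coefficient from the long model; here is a sketch using the objects defined above:

aux <- lm(english ~ STR, data = CASchools) # auxiliary regression of the omitted variable on STR
long$coefficients["STR"] + long$coefficients["english"] * aux$coefficients["STR"]

This sum reproduces the short-model STR coefficient of about -2.280 from column (2); the bias term contributes roughly -1.18 (= -2.280 - (-1.101)).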