Omitted Variable Bias

Author

Betty Wang

Quarto

Quarto enables you to weave together content and executable code into a finished document. To learn more about Quarto see https://quarto.org.

Running Code

When you click the Render button a document will be generated that includes both content and the output of embedded code. You can embed code like this:

1 + 1
[1] 2

You can add options to executable code like this

[1] 4

The echo: false option disables the printing of code (only output is displayed).

# Load the dataset
data(mtcars)

# View the first few rows of the dataset
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
# Description of the mtcars dataset
# Dependent variable: mpg Miles per gallon (fuel efficiency)
# Independent variables: cyl Number of cylinders
                       # disp Displacement (in cubic inches)
                       # hp Horsepower
                       # drat Rear axle ratio
                       # wt Weight (in 1000 lbs)
                       # qsec 1/4 mile time (seconds)
                       # vs Engine type (0 = V-shaped, 1 = straight)
                       # am Transmission type (0 = automatic, 1 = manual)
                       # gear Number of forward gears
                       # carb Number of carburetors

# Correct model
# Run the full multivariate linear regression model
full_model <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear +
                       carb, data = mtcars)
# View the summary of the model
summary(full_model)

Call:
lm(formula = mpg ~ cyl + disp + hp + drat + wt + qsec + vs + 
    am + gear + carb, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4506 -1.6044 -0.1196  1.2193  4.6271 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept) 12.30337   18.71788   0.657   0.5181  
cyl         -0.11144    1.04502  -0.107   0.9161  
disp         0.01334    0.01786   0.747   0.4635  
hp          -0.02148    0.02177  -0.987   0.3350  
drat         0.78711    1.63537   0.481   0.6353  
wt          -3.71530    1.89441  -1.961   0.0633 .
qsec         0.82104    0.73084   1.123   0.2739  
vs           0.31776    2.10451   0.151   0.8814  
am           2.52023    2.05665   1.225   0.2340  
gear         0.65541    1.49326   0.439   0.6652  
carb        -0.19942    0.82875  -0.241   0.8122  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.65 on 21 degrees of freedom
Multiple R-squared:  0.869, Adjusted R-squared:  0.8066 
F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

Estimating Equation \[ mpg_i = \beta_0 + \beta_1 cyl_i +\beta_2 disp_i + \beta_3 hp_i + \beta_4 drat_i + \beta_5 wt_i + \beta_6 qsec_i + \beta_7 vs_i + \beta_8 am_i + \beta_9 gear_i + \\beta_{10} carb_i + \epsilon_i \]

# Short / Incorrect model: omit the predictor variable "gear"
# Calculate the correlation between "mpg" and "gear"
correlation_result <- cor(mtcars$mpg, mtcars$gear)
# Print the correlation
print(correlation_result)
[1] 0.4802848
# Test the statistical significance of the correlation
cor_test_result <- cor.test(mtcars$mpg, mtcars$gear)
# Print the result of the correlation test
print(cor_test_result)

    Pearson's product-moment correlation

data:  mtcars$mpg and mtcars$gear
t = 2.9992, df = 30, p-value = 0.005401
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1580618 0.7100628
sample estimates:
      cor 
0.4802848 
# The outcome variable "mpg" and the omitted predictor variable
# "gear" are correlated because the p-value = 0.005401 < alpha = 1%
# Calculate the correlation between "cyl" and "gear"
correlation_result_2 <- cor(mtcars$cyl, mtcars$gear)
# Print the correlation
print(correlation_result_2)
[1] -0.4926866
# Test the statistical significance of the correlation
cor_test_result_2 <- cor.test(mtcars$cyl, mtcars$gear)
# Print the result of the correlation test
print(cor_test_result_2)

    Pearson's product-moment correlation

data:  mtcars$cyl and mtcars$gear
t = -3.1011, df = 30, p-value = 0.004173
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.7180260 -0.1738615
sample estimates:
       cor 
-0.4926866 
# The included independent variable "cyl" and the omitted predictor variable
# "gear" are correlated because the p-value = 0.004173 < alpha = 1%
# The omitted variable will cause bias (i.e. the two conditions for OVB are met)
# because 1) The omitted variable "gear" is a determinant of the dependent
# variable "mpg" and 2) The omitted variable is correlated with at least one of
# the included independent variables.

Estimating Equation \[ mpg_i = \beta_0 + \beta_1 cyl_i +\beta_2 disp_i + \beta_3 hp_i + \beta_4 drat_i + \beta_5 wt_i + \beta_6 qsec_i + \beta_7 vs_i + \beta_8 am_i + + \beta_9 carb_i + \epsilon_i \]

# Calculate correlations
correlations <- cor(mtcars)

# Print correlations involving "gear", "mpg", and other predictor variables
cor_gear_mpg <- correlations["gear", "mpg"]
cor_gear_other <- correlations["gear", ]

# Print results
print(paste("Correlation between gear and mpg:", round(cor_gear_mpg, 2)))
[1] "Correlation between gear and mpg: 0.48"
print("Correlation of gear with other variables:")
[1] "Correlation of gear with other variables:"
print(round(cor_gear_other, 2))
  mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb 
 0.48 -0.49 -0.56 -0.13  0.70 -0.58 -0.21  0.21  0.79  1.00  0.27 
# Check OVB conditions
if (cor_gear_mpg > 0 && any(cor_gear_other[c("cyl", "disp", "hp", "drat", "wt",
                                             "qsec", "vs", "am", "carb")] > 0))
  {print("Omitting gear may cause positive OVB.")}
[1] "Omitting gear may cause positive OVB."
if (cor_gear_mpg > 0 && any(cor_gear_other[c("cyl", "disp", "hp", "drat", "wt",
                                             "qsec", "vs", "am", "carb")] < 0))
  {print("Omitting gear may cause negative OVB.")}
[1] "Omitting gear may cause negative OVB."
if (cor_gear_mpg < 0 && any(cor_gear_other[c("cyl", "disp", "hp", "drat", "wt",
                                             "qsec", "vs", "am", "carb")] < 0))
  {print("Omitting gear may cause positive OVB.")}

if (cor_gear_mpg < 0 && any(cor_gear_other[c("cyl", "disp", "hp", "drat", "wt",
                                             "qsec", "vs", "am", "carb")] > 0))
  {print("Omitting gear may cause negative OVB.")}
# Calculate correlations
correlations <- cor(mtcars)

# Print correlations involving gear and mpg
cor_gear_mpg <- correlations["gear", "mpg"]
cor_gear_other <- correlations["gear", ]

# Print results
print(paste("Correlation between gear and mpg:", round(cor_gear_mpg, 2)))
[1] "Correlation between gear and mpg: 0.48"
print("Correlation of gear with other variables:")
[1] "Correlation of gear with other variables:"
print(round(cor_gear_other, 2))
  mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb 
 0.48 -0.49 -0.56 -0.13  0.70 -0.58 -0.21  0.21  0.79  1.00  0.27 
# Identify potential OVB direction
if (cor_gear_mpg > 0 && any(cor_gear_other[c("cyl", "disp", "hp", "drat", "wt",
                                             "qsec", "vs", "am", "carb")] < 0))
  {
    print("Positive OVB expected.")
} else if (cor_gear_mpg < 0 && any(cor_gear_other[c("cyl", "disp", "hp", "drat",
                                                    "wt", "qsec", "vs", "am",
                                                    "carb")] > 0))
  {
    print("Negative OVB expected.")
} else {
    print("OVB direction is unclear based on correlations.")
}
[1] "Positive OVB expected."
# Direction of the omitted variable bias
# Thus, the OVB of "gear" will be in positive direction.

# Compare the 2 regressions: full, correct model vs. omitted variable "gear"
# Load necessary packages
library(stargazer)

Please cite as: 
 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer 
# Fit the full model (including "gear"), with all the independent variables
full_model <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am
                 + gear + carb, data = mtcars)
# Fit the model without the "gear" variable, one omitted variable
omitted_model <- lm(mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am
                    + carb, data = mtcars)
# Check the 2 models
summary(full_model)

Call:
lm(formula = mpg ~ cyl + disp + hp + drat + wt + qsec + vs + 
    am + gear + carb, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4506 -1.6044 -0.1196  1.2193  4.6271 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept) 12.30337   18.71788   0.657   0.5181  
cyl         -0.11144    1.04502  -0.107   0.9161  
disp         0.01334    0.01786   0.747   0.4635  
hp          -0.02148    0.02177  -0.987   0.3350  
drat         0.78711    1.63537   0.481   0.6353  
wt          -3.71530    1.89441  -1.961   0.0633 .
qsec         0.82104    0.73084   1.123   0.2739  
vs           0.31776    2.10451   0.151   0.8814  
am           2.52023    2.05665   1.225   0.2340  
gear         0.65541    1.49326   0.439   0.6652  
carb        -0.19942    0.82875  -0.241   0.8122  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.65 on 21 degrees of freedom
Multiple R-squared:  0.869, Adjusted R-squared:  0.8066 
F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07
summary(omitted_model)

Call:
lm(formula = mpg ~ cyl + disp + hp + drat + wt + qsec + vs + 
    am + carb, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5183 -1.4339 -0.3922  1.2020  4.5924 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept) 15.64181   16.78606   0.932   0.3615  
cyl         -0.27315    0.95981  -0.285   0.7786  
disp         0.01395    0.01747   0.798   0.4332  
hp          -0.02063    0.02128  -0.969   0.3429  
drat         0.84089    1.60057   0.525   0.6046  
wt          -3.86609    1.82850  -2.114   0.0461 *
qsec         0.79507    0.71495   1.112   0.2781  
vs           0.35800    2.06357   0.173   0.8639  
am           2.80345    1.91663   1.463   0.1577  
carb        -0.04506    0.73653  -0.061   0.9518  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.601 on 22 degrees of freedom
Multiple R-squared:  0.8678,    Adjusted R-squared:  0.8137 
F-statistic: 16.05 on 9 and 22 DF,  p-value: 9.885e-08
dim(mtcars)
[1] 32 11
colSums(is.na(mtcars))
 mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
   0    0    0    0    0    0    0    0    0    0    0 
# Use stargazer to show the regression results side by side
stargazer(full_model, omitted_model, 
          title = "Regression Results: Full Model vs. Omitted Gear Model", 
          type = "text",
          column.labels = c("Full Model", "Omitted Gear Model"),
          dep.var.labels = "Miles per Gallon (mpg)",
          model.numbers = FALSE,
          single.row = TRUE,
          digits = 3)

Regression Results: Full Model vs. Omitted Gear Model
==================================================================
                                 Dependent variable:              
                    ----------------------------------------------
                                Miles per Gallon (mpg)            
                          Full Model          Omitted Gear Model  
------------------------------------------------------------------
cyl                     -0.111 (1.045)          -0.273 (0.960)    
disp                     0.013 (0.018)          0.014 (0.017)     
hp                      -0.021 (0.022)          -0.021 (0.021)    
drat                     0.787 (1.635)          0.841 (1.601)     
wt                      -3.715* (1.894)        -3.866** (1.828)   
qsec                     0.821 (0.731)          0.795 (0.715)     
vs                       0.318 (2.105)          0.358 (2.064)     
am                       2.520 (2.057)          2.803 (1.917)     
gear                     0.655 (1.493)                            
carb                    -0.199 (0.829)          -0.045 (0.737)    
Constant                12.303 (18.718)        15.642 (16.786)    
------------------------------------------------------------------
Observations                  32                      32          
R2                           0.869                  0.868         
Adjusted R2                  0.807                  0.814         
Residual Std. Error     2.650 (df = 21)        2.601 (df = 22)    
F Statistic         13.932*** (df = 10; 21) 16.048*** (df = 9; 22)
==================================================================
Note:                                  *p<0.1; **p<0.05; ***p<0.01
# The direction of the OVB is confirmed as it is predicted in the formula.