OVB

Author

Song

1. Data Set and Variables

The trees data set has 31 observations on 3 variables:

Girth: Tree diameter in inches

Height: Height in ft

Volume: Volume of timber in cubic ft

The independent variables are Height and Volume, and the dependent variable is Girth. We are interested in how Height and Volume relate to Girth.

df <- trees
# Given a data frame, lm() regresses the first column (Girth) on the remaining columns
lm(df)


Call:
lm(formula = df)

Coefficients:
(Intercept)       Height       Volume  
   10.81637     -0.04548      0.19518

Running the linear regression gives the estimated equation:

Girth = -0.04548 * Height + 0.19518 * Volume + 10.81637

2. Omitting a Variable

In this step, we intentionally omit the variable Volume.

ovb_model <- lm( Girth ~ Height, df)

ovb_model


Call:
lm(formula = Girth ~ Height, data = df)

Coefficients:
(Intercept)       Height  
    -6.1884       0.2557

This gives us the estimating equation with an omitted variable:

Girth = 0.2557 * Height - 6.1884

3. Omitted Variable Bias

Two conditions must both hold for the new equation to suffer from omitted variable bias:

Height is correlated with the omitted variable: Volume
Volume is a determinant of Girth

cor_height_volume <- cor(df$Height, df$Volume)

print(paste("Correlation between Height and Volume:", cor_height_volume))

[1] "Correlation between Height and Volume: 0.598249651991782"

cor.test(df$Height, df$Volume)


    Pearson's product-moment correlation

data:  df$Height and df$Volume
t = 4.0205, df = 29, p-value = 0.0003784
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.3095235 0.7859756
sample estimates:
      cor 
0.5982497

Here, the correlation is about 0.598 and cor.test returns a p-value of 0.0003784, well below 0.05. This suggests a moderate positive correlation between the independent variable Height and the omitted variable Volume, and the correlation is statistically significant.

cor_girth_volume <- cor(df$Girth, df$Volume)

print(paste("Correlation between Girth and Volume:", cor_girth_volume))

[1] "Correlation between Girth and Volume: 0.967119368255631"

cor.test(df$Girth, df$Volume)


    Pearson's product-moment correlation

data:  df$Girth and df$Volume
t = 20.478, df = 29, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9322519 0.9841887
sample estimates:
      cor 
0.9671194

Here, the correlation is about 0.967 and cor.test returns a p-value below 2.2e-16. This suggests a strong positive correlation between the dependent variable Girth and the omitted variable Volume, and the correlation is statistically significant.

Thus both conditions are satisfied, and the new estimating equation suffers from omitted variable bias.
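The second condition concerns Volume's effect on Girth with Height held fixed, which a raw correlation can only suggest. A minimal sketch of a more direct check, assuming the full model from Section 1, reads the Volume coefficient and its t-test from the regression summary. The names full_model and volume_row are introduced here for illustration only.

```r
# Full model from Section 1, written with an explicit formula.
# (full_model and volume_row are illustrative names, not from the original analysis.)
full_model <- lm(Girth ~ Height + Volume, data = trees)

# Coefficient-table row for Volume: estimate, std. error, t value, p-value.
volume_row <- summary(full_model)$coefficients["Volume", ]
print(volume_row)

# Condition 2 is supported if the estimate is nonzero and its p-value is small.
volume_row[["Pr(>|t|)"]] < 0.05
```

The estimate printed here should match the Volume coefficient of 0.19518 reported in Section 1.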

4. Bias Direction

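Because both correlations in the omitted variable bias conditions are positive, the bias is positive. In the usual two-by-two table of bias directions, this is the cell where “A and B are positively correlated” and “B is positively correlated with Y”, where A is the included variable (Height), B is the omitted variable (Volume), and Y is the dependent variable (Girth).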
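The sign rule can be made explicit with the standard omitted variable bias decomposition. The sketch below is not taken from the original analysis: the symbols (the betas, the deltas, u and v) are notation assumed here, and the auxiliary regression of Volume on Height is introduced only to express how the two regressors move together.

```latex
\begin{align*}
\text{Girth}  &= \beta_0 + \beta_1\,\text{Height} + \beta_2\,\text{Volume} + u
  && \text{(full model)} \\
\text{Volume} &= \delta_0 + \delta_1\,\text{Height} + v
  && \text{(auxiliary regression)} \\
\text{Girth}  &= (\beta_0 + \beta_2\delta_0) + (\beta_1 + \beta_2\delta_1)\,\text{Height} + (u + \beta_2 v)
  && \text{(Volume omitted)}
\end{align*}
% Bias in the Height coefficient when Volume is omitted:
\[
\text{bias} = \beta_2\,\delta_1,
\qquad
\operatorname{sign}(\text{bias}) = \operatorname{sign}(\beta_2)\cdot\operatorname{sign}(\delta_1)
\]
```

Since the slope of the auxiliary regression has the same sign as the correlation between Height and Volume, a positive correlation combined with a positive Volume coefficient gives a positive bias.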

5. Regression Comparison

library(stargazer)


Please cite as:

 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.

 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer

# Both models have Girth as the dependent variable; the covariates are Height and Volume
stargazer(lm(df), ovb_model, type = "text",
          title = "Regression Comparison",
          column.labels = c("Original Model", "Omit Volume"),
          covariate.labels = c("Height", "Volume"),
          dep.var.labels = "Girth",
          out = "regression_comparison.txt")


Regression Comparison
==================================================================
                                 Dependent variable:              
                    ----------------------------------------------
                                        Girth                     
                        Original Model           Omit Volume      
                              (1)                    (2)          
------------------------------------------------------------------
Height                      -0.045                 0.256***       
                            (0.028)                (0.078)        
                                                                  
Volume                     0.195***                               
                            (0.011)                               
                                                                  
Constant                   10.816***                -6.188        
                            (1.973)                (5.960)        
                                                                  
------------------------------------------------------------------
Observations                  31                      31          
R2                           0.941                  0.270         
Adjusted R2                  0.937                  0.244         
Residual Std. Error     0.790 (df = 28)        2.728 (df = 29)    
F Statistic         222.471*** (df = 2; 28) 10.707*** (df = 1; 29)
==================================================================
Note:                                  *p<0.1; **p<0.05; ***p<0.01

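Since Volume is positively correlated with both Girth and Height, omitting Volume leads to a positive bias in the coefficient of Height. The Height coefficient moves from -0.045 (not statistically significant) in the original model to 0.256 (significant at the 1% level) once Volume is dropped, so the effect of Height on Girth is overestimated when Volume is omitted.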

6. OVB Formula Reasoning

The correlation between the omitted variable (Volume) and the included independent variable (Height) measures how the two move together: as Volume increases, Height also tends to increase.

The correlation between the omitted variable and the dependent variable (Girth) measures how those two move together: as Volume increases, Girth also tends to increase.

So when Volume is omitted, the linear regression attributes part of Volume's effect on Girth to the correlated variable Height, scaling up the Height coefficient to make up for the missing information.
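This reasoning can be checked numerically. For OLS, the Height coefficient in the model without Volume equals the Height coefficient in the full model plus the Volume coefficient times the slope from regressing Volume on Height. The sketch below assumes that standard identity; the auxiliary regression and all object names (long_model, short_model, aux_model, beta1_long, beta2, delta1) are introduced here for illustration and are not part of the original analysis.

```r
# Uses the built-in trees data set (datasets package, attached by default).
df <- trees

long_model  <- lm(Girth ~ Height + Volume, data = df)  # full model
short_model <- lm(Girth ~ Height, data = df)           # Volume omitted
aux_model   <- lm(Volume ~ Height, data = df)          # auxiliary regression (assumed here)

beta1_long  <- coef(long_model)[["Height"]]   # Height effect with Volume controlled
beta2       <- coef(long_model)[["Volume"]]   # Volume effect on Girth
delta1      <- coef(aux_model)[["Height"]]    # how Volume moves with Height
beta1_short <- coef(short_model)[["Height"]]  # Height coefficient once Volume is dropped

# Omitted variable bias: Volume's effect times its co-movement with Height.
bias <- beta2 * delta1
print(c(beta1_long = beta1_long, bias = bias, beta1_short = beta1_short))

# The identity holds up to floating-point error.
all.equal(beta1_short, beta1_long + bias)
```

Both factors of the bias are positive here, which is why the Height coefficient is pushed upward when Volume is left out.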