The trees data set has 31 observations on 3 variables:
Girth: Tree diameter in inches
Height: Height in ft
Volume: Volume of timber in cubic ft
The independent variables are Height and Volume, and the dependent variable is Girth. We want to study how Height and Volume affect Girth.
df <- trees
lm(df)
Call:
lm(formula = df)
Coefficients:
(Intercept) Height Volume
10.81637 -0.04548 0.19518
After running the linear regression, we get the estimated equation:
Girth = -0.04548 * Height + 0.19518 * Volume + 10.81637
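Calling lm() on a data frame relies on R converting the data frame to a formula with the first column as the response. A minimal sketch of the equivalent explicit call, assuming the column order Girth, Height, Volume of the built-in trees data (the name full_model is introduced here for illustration only):

```r
# Built-in data set; columns are Girth, Height, Volume (in that order)
df <- trees

# Explicit formula; full_model is a name introduced for this sketch
full_model <- lm(Girth ~ Height + Volume, data = df)

# Compare with the coefficients from the data-frame shortcut lm(df)
same_fit <- all.equal(unname(coef(full_model)), unname(coef(lm(df))))
same_fit
```

Writing the formula out makes it explicit which variable is the response, which is easier to read when the model is later compared with a reduced one.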
In this question, we intentionally omit the variable Volume.
ovb_model <- lm(Girth ~ Height, df)
ovb_model
Call:
lm(formula = Girth ~ Height, data = df)
Coefficients:
(Intercept) Height
-6.1884 0.2557
This gives us the estimating equation with an omitted variable:
Girth = 0.2557 * Height - 6.1884
Two conditions must be met for this reduced model to suffer from omitted variable bias:
Height is correlated with the omitted variable: Volume
Volume is a determinant of Girth
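These two conditions correspond to the two factors in the standard omitted variable bias identity. As a sketch, with symbols introduced here for illustration (β for the full-model coefficients, δ for an auxiliary regression of Volume on Height, hats for full-model and auxiliary estimates, and a tilde for the estimate from the model that omits Volume):

```latex
\begin{aligned}
\text{Girth}  &= \beta_0 + \beta_1\,\text{Height} + \beta_2\,\text{Volume} + u \\
\text{Volume} &= \delta_0 + \delta_1\,\text{Height} + v \\
\tilde{\beta}_1 &= \hat{\beta}_1 + \hat{\beta}_2\,\hat{\delta}_1
\end{aligned}
```

The bias term is the product of the Volume coefficient and the slope of Volume on Height. It vanishes if Volume has no effect on Girth or if Height and Volume are uncorrelated, which is why both conditions are needed.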
cor_height_volume <- cor(df$Height, df$Volume)
print(paste("Correlation between Height and Volume:", cor_height_volume))
[1] "Correlation between Height and Volume: 0.598249651991782"
cor.test(df$Height, df$Volume)
Pearson's product-moment correlation
data: df$Height and df$Volume
t = 4.0205, df = 29, p-value = 0.0003784
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.3095235 0.7859756
sample estimates:
cor
0.5982497
Here, cor() returns 0.598 and cor.test() returns a p-value of 0.0003784, well below 0.05. This suggests a positive correlation between the independent variable Height and the omitted variable Volume, and the correlation is statistically significant.
cor_girth_volume <- cor(df$Girth, df$Volume)
print(paste("Correlation between Girth and Volume:", cor_girth_volume))
[1] "Correlation between Girth and Volume: 0.967119368255631"
cor.test(df$Girth, df$Volume)
Pearson's product-moment correlation
data: df$Girth and df$Volume
t = 20.478, df = 29, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.9322519 0.9841887
sample estimates:
cor
0.9671194
Here, cor() returns 0.967 and cor.test() returns a p-value below 2.2e-16. This suggests a strong positive correlation between the dependent variable Girth and the omitted variable Volume, and the correlation is statistically significant.
Thus the two conditions are satisfied and we have an omitted variable bias in the new estimating equation.
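The same two checks can be made programmatically instead of being read off the printed output. A minimal sketch, assuming a 0.05 significance level; positively_correlated is a hypothetical helper written for this example and is not part of any package:

```r
df <- trees

# Hypothetical helper: TRUE when the Pearson correlation is positive
# and significantly different from zero at level alpha
positively_correlated <- function(x, y, alpha = 0.05) {
  ct <- cor.test(x, y)
  unname(ct$estimate) > 0 && ct$p.value < alpha
}

cond1 <- positively_correlated(df$Height, df$Volume)  # included vs. omitted variable
cond2 <- positively_correlated(df$Girth, df$Volume)   # dependent vs. omitted variable
cond1 && cond2
```

A correlation test is used here as a proxy for "Volume is a determinant of Girth", following the approach in the text; a coefficient test in the full model would be another way to check that condition.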
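Because both correlations in the omitted variable bias conditions are positive, the bias is positive. In the table of bias directions, this is the corner where “A and B are positively correlated” and “B is positively correlated with Y”, with A the included variable (Height), B the omitted variable (Volume), and Y the dependent variable (Girth).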
library(stargazer)
Please cite as:
Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
stargazer(lm(df), ovb_model, type = "text",
          title = "Regression Comparison",
          column.labels = c("Original Model", "Omit Volume"),
          covariate.labels = c("Height", "Volume"),
          dep.var.labels = "Girth",
          out = "regression_comparison.txt")
Regression Comparison
==================================================================
                                 Dependent variable:
                    ----------------------------------------------
                                        Girth
                        Original Model           Omit Volume
                             (1)                     (2)
------------------------------------------------------------------
Height                      -0.045                0.256***
                           (0.028)                (0.078)

Volume                     0.195***
                           (0.011)

Constant                  10.816***                -6.188
                           (1.973)                (5.960)

------------------------------------------------------------------
Observations                  31                     31
R2                          0.941                  0.270
Adjusted R2                 0.937                  0.244
Residual Std. Error    0.790 (df = 28)        2.728 (df = 29)
F Statistic         222.471*** (df = 2; 28) 10.707*** (df = 1; 29)
==================================================================
Note:                                  *p<0.1; **p<0.05; ***p<0.01
Since Volume is positively correlated with both Girth and Height, omitting Volume leads to a positive bias in the coefficient of Height. This means the effect of Height on Girth is overestimated when Volume is omitted.
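The size of this overestimate can be checked numerically. A standard OLS identity says that the Height coefficient in the model without Volume equals the Height coefficient in the full model plus the Volume coefficient times the slope from regressing Volume on Height. A minimal sketch of that check; the names full_model, short_model, aux_model and bias are introduced here for illustration:

```r
df <- trees

full_model  <- lm(Girth ~ Height + Volume, data = df)  # includes Volume
short_model <- lm(Girth ~ Height, data = df)           # omits Volume
aux_model   <- lm(Volume ~ Height, data = df)          # omitted variable on included variable

# Bias implied by the identity:
# (effect of Volume on Girth) x (slope of Volume on Height)
bias <- coef(full_model)[["Volume"]] * coef(aux_model)[["Height"]]

# Full-model coefficient plus bias should reproduce the short-model coefficient
c(full  = coef(full_model)[["Height"]],
  bias  = bias,
  short = coef(short_model)[["Height"]])
```

Both factors of the bias are positive for this data set, so the bias is positive and the Height coefficient moves upward once Volume is dropped.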
The correlation between the omitted variable and the included independent variable describes how the two move together. Here it is positive: as Volume increases, Height also tends to increase.
The correlation between the omitted variable and the dependent variable describes how those two move together. It is also positive: as Volume increases, Girth also tends to increase.
So omitting Volume causes the linear regression to inflate the coefficient on the correlated independent variable, Height, to make up for the missing information.