The bias of an estimator is the difference between the estimator’s expected value and the true value of the parameter being estimated. Common sources of bias in regression estimates include omitted variables, sample selection, model specification errors, and reverse causality. Each of these can cause the model to systematically overestimate or underestimate a coefficient.
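In symbols, writing \(\theta\) for the true parameter and \(\hat{\theta}\) for its estimator (notation assumed here for illustration), the definition reads:
\[ \mathrm{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta \]
An estimator is unbiased when this difference is zero.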
Omitted variable bias occurs when a model leaves out a variable that is correlated with the key explanatory variable and also affects the dependent variable. This bias does not go away as the sample size increases, because more observations cannot compensate for the effect of the missing variable. However, adding new variables can reduce the bias if those variables relate to the key explanatory variable and the dependent variable in a similar way to the omitted variable.
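The claim that a larger sample does not remove the bias can be checked with a small simulation. The sketch below uses made-up data, not the Boston data: the variable names, the coefficients (true slope 2 on `x`, 3 on the omitted `z`), and the sample sizes are all assumptions chosen for illustration. With these values the slope from the short regression should settle near 2 + 3 × 0.5 / 1.25 = 3.2 rather than the true value of 2.

```r
# Simulated illustration of omitted variable bias (all values are assumptions).
set.seed(42)

simulate_short_slope <- function(n) {
  z <- rnorm(n)                      # variable that will be omitted
  x <- 0.5 * z + rnorm(n)            # key regressor, correlated with z
  y <- 1 + 2 * x + 3 * z + rnorm(n)  # true slope on x is 2
  unname(coef(lm(y ~ x))["x"])       # slope from the model that omits z
}

sample_sizes <- c(100, 10000, 100000)
short_slopes <- sapply(sample_sizes, simulate_short_slope)

# The estimate becomes more precise as n grows, but it converges to the
# biased value, not to the true slope.
print(data.frame(n = sample_sizes, slope_on_x = short_slopes))
```

Increasing `n` only tightens the estimate around the biased value; the gap to the true slope stays.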
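# Clear the workspace first so the data loaded below is not removed again
rm(list = ls())
library(MASS)  # provides the Boston housing data
data(Boston)
gc()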
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 552826 29.6 1236295 66.1 686414 36.7
## Vcells 1005991 7.7 8388608 64.0 1875897 14.4
cat("\f")
graphics.off()
\[ medv_i = \beta_0 + \beta_1 rm_i + \beta_2 ptratio_i \]
The key independent variable is rm, the average number of rooms per dwelling.
full_model <- lm(data=Boston,
formula = medv ~ rm + ptratio
)
summary(full_model)
##
## Call:
## lm(formula = medv ~ rm + ptratio, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.672 -2.821 0.102 2.770 39.819
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.5612 4.1889 -0.611 0.541
## rm 7.7141 0.4136 18.650 <2e-16 ***
## ptratio -1.2672 0.1342 -9.440 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.104 on 503 degrees of freedom
## Multiple R-squared: 0.5613, Adjusted R-squared: 0.5595
## F-statistic: 321.7 on 2 and 503 DF, p-value: < 2.2e-16
\[ medv_i = \beta_0 + \beta_1 rm_i \]
short_model <- lm(data=Boston,
formula = medv ~ rm
)
summary(short_model)
##
## Call:
## lm(formula = medv ~ rm, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.346 -2.547 0.090 2.986 39.433
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -34.671 2.650 -13.08 <2e-16 ***
## rm 9.102 0.419 21.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.616 on 504 degrees of freedom
## Multiple R-squared: 0.4835, Adjusted R-squared: 0.4825
## F-statistic: 471.8 on 1 and 504 DF, p-value: < 2.2e-16
library("stargazer")
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
stargazer(full_model,short_model,
type = "text",
covariate.labels = c("rm","ptratio","constant")
)
##
## =====================================================================
## Dependent variable:
## -------------------------------------------------
## medv
## (1) (2)
## ---------------------------------------------------------------------
## rm 7.714*** 9.102***
## (0.414) (0.419)
##
## ptratio -1.267***
## (0.134)
##
## constant -2.561 -34.671***
## (4.189) (2.650)
##
## ---------------------------------------------------------------------
## Observations 506 506
## R2 0.561 0.484
## Adjusted R2 0.560 0.483
## Residual Std. Error 6.104 (df = 503) 6.616 (df = 504)
## F Statistic 321.724*** (df = 2; 503) 471.847*** (df = 1; 504)
## =====================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
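Based on the above results, both conditions for omitted variable bias are satisfied. The omitted variable “ptratio” is statistically significant in explaining the dependent variable medv (coefficient -1.267, p < 0.01), and it is correlated with the key variable rm: the coefficient on rm changes from 7.714 to 9.102 when ptratio is dropped, which would not happen if the two variables were uncorrelated in the sample.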
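Both conditions can also be checked directly rather than read off the regression table. The sketch below is one way to do that, assuming the Boston data from MASS as used above; the name `rm_ptratio_cor` is introduced here for illustration only.

```r
library(MASS)  # provides the Boston data

# Condition 1: the omitted variable is correlated with the key regressor.
rm_ptratio_cor <- cor(Boston$rm, Boston$ptratio)
print(rm_ptratio_cor)
print(cor.test(Boston$rm, Boston$ptratio))

# Condition 2: the omitted variable helps explain medv once rm is held fixed.
full_model <- lm(medv ~ rm + ptratio, data = Boston)
print(coef(summary(full_model))["ptratio", ])
```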
ptratio (the pupil-teacher ratio) is negatively related to both medv and rm. When the model omits ptratio, it credits rm with the higher home values that actually come from lower pupil-teacher ratios, so the coefficient on rm is overestimated (9.102 instead of 7.714).
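The size of the overestimate can be reproduced from the standard omitted variable bias identity for OLS: the short-model coefficient on rm equals the full-model coefficient on rm plus the full-model coefficient on ptratio multiplied by the slope from regressing ptratio on rm. The sketch below checks this on the Boston data. The auxiliary regression `aux_model` and the names `delta_rm` and `bias_term` are not part of the analysis above and are introduced only for illustration.

```r
library(MASS)  # provides the Boston data

full_model  <- lm(medv ~ rm + ptratio, data = Boston)
short_model <- lm(medv ~ rm, data = Boston)
aux_model   <- lm(ptratio ~ rm, data = Boston)  # auxiliary regression (illustrative)

beta_rm_full <- coef(full_model)[["rm"]]
beta_ptratio <- coef(full_model)[["ptratio"]]
delta_rm     <- coef(aux_model)[["rm"]]   # how ptratio moves with rm

# Bias in the short model's rm coefficient: negative times negative is positive.
bias_term <- beta_ptratio * delta_rm

print(c(short_rm       = coef(short_model)[["rm"]],
        full_plus_bias = beta_rm_full + bias_term,
        bias_term      = bias_term))
```

The two first entries should agree up to floating-point error, and the bias term should match the gap between the two rm coefficients reported in the table (9.102 - 7.714 = 1.388, up to rounding).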