Part 1

The bias of an estimator is the difference between the estimator’s expected value and the true value of the parameter being estimated. Common sources of bias include omitted variables, sample selection, model misspecification, and reverse causality. Each of these can cause the model to overestimate or underestimate the coefficient of interest.
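In symbols, writing \(\theta\) for the true parameter and \(\hat{\theta}\) for its estimator (notation assumed here for illustration), the definition reads:

\[ \mathrm{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta \]

The estimator is unbiased when this difference is zero. A positive value means it overestimates on average, and a negative value means it underestimates.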

Part 2

Omitted variable bias occurs when a model omits a variable that affects the dependent variable and is correlated with the key explanatory variable. Omitted variable bias does not go away as the sample size increases, because more observations cannot compensate for the effect of the missing variable. However, adding new variables can reduce the bias if they relate to the key explanatory variable and the dependent variable in a similar way to the omitted variable, that is, if they act as proxies for it.
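A small simulation can illustrate why a larger sample does not help. This is only a sketch on made-up data: the variable names, the coefficients (a true slope of 2 on x1, 3 on x2), and the sample sizes are all assumptions chosen for illustration. Under this setup the short regression slope should settle near 2 + 3 × 0.5 = 3.5 rather than the true value of 2.

```r
# Simulated data only; all names and numbers are illustrative assumptions.
set.seed(42)

short_slope <- function(n) {
  x1 <- rnorm(n)                         # key explanatory variable
  x2 <- 0.5 * x1 + rnorm(n)              # omitted variable, correlated with x1
  y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)   # true effect of x1 is 2
  unname(coef(lm(y ~ x1))["x1"])         # slope from the model that omits x2
}

sizes  <- c(100, 10000, 100000)
slopes <- sapply(sizes, short_slope)

# The estimate becomes more precise as n grows, but it stays away from 2.
data.frame(n = sizes, short_slope = slopes)
```

Increasing n shrinks the sampling noise around the biased value; it does not move the estimate toward the true slope.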

Part 3

library(MASS)
data(Boston)

1. Clear data

rm(list = ls())
gc()  
##           used (Mb) gc trigger (Mb) max used (Mb)
## Ncells  552826 29.6    1236295 66.1   686414 36.7
## Vcells 1005991  7.7    8388608 64.0  1875897 14.4
cat("\f")
graphics.off()

2. Full Model

\[ medv_i = \beta_0 + \beta_1 rm_i + \beta_2 ptratio_i \]

The key independent variable is rm.

full_model <- lm(data=Boston,
              formula = medv ~ rm + ptratio
              )
summary(full_model)
## 
## Call:
## lm(formula = medv ~ rm + ptratio, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.672  -2.821   0.102   2.770  39.819 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.5612     4.1889  -0.611    0.541    
## rm            7.7141     0.4136  18.650   <2e-16 ***
## ptratio      -1.2672     0.1342  -9.440   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.104 on 503 degrees of freedom
## Multiple R-squared:  0.5613, Adjusted R-squared:  0.5595 
## F-statistic: 321.7 on 2 and 503 DF,  p-value: < 2.2e-16

3. Short Model

\[ medv_i = \beta_0 + \beta_1 rm_i \]

short_model <- lm(data=Boston,
              formula = medv ~ rm 
              )
summary(short_model)
## 
## Call:
## lm(formula = medv ~ rm, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.346  -2.547   0.090   2.986  39.433 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -34.671      2.650  -13.08   <2e-16 ***
## rm             9.102      0.419   21.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.616 on 504 degrees of freedom
## Multiple R-squared:  0.4835, Adjusted R-squared:  0.4825 
## F-statistic: 471.8 on 1 and 504 DF,  p-value: < 2.2e-16

4. Check for OVB

Condition 1: X is correlated with the omitted variable

cor(Boston[,c(6,11,14)])
##                 rm    ptratio       medv
## rm       1.0000000 -0.3555015  0.6953599
## ptratio -0.3555015  1.0000000 -0.5077867
## medv     0.6953599 -0.5077867  1.0000000

This table shows the Pearson correlation coefficients among the three variables. rm and ptratio are negatively correlated, and ptratio is negatively correlated with medv. Because the product of these two negative relationships is positive, I expect omitting ptratio to bias the coefficient on rm upward (a positive bias).
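The same correlation matrix can be requested by column name, which is easier to read than numeric positions and does not break if the column order changes. The sketch below also runs `cor.test()` on the rm and ptratio pair to check whether that correlation is distinguishable from zero; the helper name `vars` is an assumption introduced here.

```r
library(MASS)  # provides the Boston data set

# Select columns by name instead of by position (6, 11, 14)
vars <- c("rm", "ptratio", "medv")
cor(Boston[, vars])

# Test whether the rm-ptratio correlation differs from zero
cor.test(Boston$rm, Boston$ptratio)
```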

Condition 2: The omitted variable is a determinant of the dependent variable medv

library("stargazer")
## 
## Please cite as:
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
stargazer(full_model,short_model,
          type = "text",
          covariate.labels = c("rm","ptratio","constant")
          )
## 
## =====================================================================
##                                    Dependent variable:               
##                     -------------------------------------------------
##                                           medv                       
##                               (1)                      (2)           
## ---------------------------------------------------------------------
## rm                          7.714***                 9.102***        
##                             (0.414)                  (0.419)         
##                                                                      
## ptratio                    -1.267***                                 
##                             (0.134)                                  
##                                                                      
## constant                     -2.561                 -34.671***       
##                             (4.189)                  (2.650)         
##                                                                      
## ---------------------------------------------------------------------
## Observations                  506                      506           
## R2                           0.561                    0.484          
## Adjusted R2                  0.560                    0.483          
## Residual Std. Error     6.104 (df = 503)         6.616 (df = 504)    
## F Statistic         321.724*** (df = 2; 503) 471.847*** (df = 1; 504)
## =====================================================================
## Note:                                     *p<0.1; **p<0.05; ***p<0.01

Based on the above results, both conditions are satisfied. The omitted variable “ptratio” is a statistically significant determinant of the dependent variable, and it is correlated with the key variable rm.

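ptratio is negatively correlated with both medv and rm. When the model omits ptratio, the coefficient on rm absorbs part of ptratio’s negative effect on medv: a larger rm goes with a lower ptratio, which in turn goes with a higher medv. The short model therefore overestimates the beta of rm (9.102 versus 7.714 in the full model).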
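The size of the bias can be checked with the standard omitted variable bias decomposition for OLS. Writing \(\delta_1\) for the slope from an auxiliary regression of ptratio on rm (the symbol and the name `aux_model` are assumptions introduced here), the short-model coefficient on rm equals the full-model coefficient plus \(\beta_2 \delta_1\):

\[ \hat{\beta}_1^{short} = \hat{\beta}_1^{full} + \hat{\beta}_2^{full} \hat{\delta}_1 \]

A minimal sketch that refits the models and compares both sides:

```r
library(MASS)  # provides the Boston data set

full_model  <- lm(medv ~ rm + ptratio, data = Boston)
short_model <- lm(medv ~ rm, data = Boston)
aux_model   <- lm(ptratio ~ rm, data = Boston)  # auxiliary regression (name assumed)

b1_short <- coef(short_model)[["rm"]]
b1_full  <- coef(full_model)[["rm"]]
b2_full  <- coef(full_model)[["ptratio"]]
delta1   <- coef(aux_model)[["rm"]]

# Bias term: effect of ptratio on medv times the rm-ptratio slope
bias <- b2_full * delta1

# The first two entries should agree up to floating-point error
c(short = b1_short, full_plus_bias = b1_full + bias, bias = bias)
```

Both factors of the bias term are negative, so their product is positive, which matches the upward bias seen in the regression table.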

Advanced bonus question: adding a variable that does not impact y but is correlated with the key x

1. Full model and Short model

Full model:

\[ medv_i = \beta_0 + \beta_1 rm_i + \beta_2 lstat_i + \beta_3 indus_i \]

Short model:

\[ medv_i = \beta_0 + \beta_1 rm_i + \beta_2 lstat_i \]

rm is the key x and indus is the added variable.

2. First condition: X is correlated with the omitted variable

cor(Boston[,c(3,6,13)])
##            indus         rm      lstat
## indus  1.0000000 -0.3916759  0.6037997
## rm    -0.3916759  1.0000000 -0.6138083
## lstat  0.6037997 -0.6138083  1.0000000

The Pearson correlation coefficient between indus and rm is -0.3916759, which indicates that indus is negatively correlated with the key variable.

3. Second condition: The omitted variable is a determinant of the dependent variable Y

xr_fullmodel <- lm(data=Boston,
              formula = medv ~ rm + lstat + indus
              )
xr_shortmodel <- lm(data=Boston,
              formula = medv ~ rm + lstat 
              )
stargazer(xr_fullmodel,xr_shortmodel,
          type = "text",
          covariate.labels = c("rm","lstat","indus"," constant")
          )
## 
## =====================================================================
##                                    Dependent variable:               
##                     -------------------------------------------------
##                                           medv                       
##                               (1)                      (2)           
## ---------------------------------------------------------------------
## rm                          5.074***                 5.095***        
##                             (0.444)                  (0.444)         
##                                                                      
## lstat                      -0.607***                -0.642***        
##                             (0.050)                  (0.044)         
##                                                                      
## indus                        -0.064                                  
##                             (0.045)                                  
##                                                                      
## constant                     -0.969                   -1.358         
##                             (3.182)                  (3.173)         
##                                                                      
## ---------------------------------------------------------------------
## Observations                  506                      506           
## R2                           0.640                    0.639          
## Adjusted R2                  0.638                    0.637          
## Residual Std. Error     5.535 (df = 502)         5.540 (df = 503)    
## F Statistic         297.471*** (df = 3; 502) 444.331*** (df = 2; 503)
## =====================================================================
## Note:                                     *p<0.1; **p<0.05; ***p<0.01

Compared with the full model, the coefficient on the key variable rm (5.074 versus 5.095) and R^2 (0.640 versus 0.639) barely change in the short model.

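Even though indus is correlated with the key x variable, indus does not have a statistically significant effect on Y once rm and lstat are controlled for. Because the second condition fails, leaving indus out does not create a meaningful omitted variable bias in this model.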
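One way to back up this conclusion is a nested-model F-test with `anova()`, which tests the restriction that the coefficient on indus is zero. The sketch below refits the two models; the object names `f_test`, `p_f`, and `p_t` are assumptions introduced here. With a single added variable, the F-test p-value should coincide with the t-test p-value on indus in the full model.

```r
library(MASS)  # provides the Boston data set

xr_fullmodel  <- lm(medv ~ rm + lstat + indus, data = Boston)
xr_shortmodel <- lm(medv ~ rm + lstat, data = Boston)

# F-test of the restriction that the indus coefficient is zero
f_test <- anova(xr_shortmodel, xr_fullmodel)
f_test

# Compare the F-test p-value with the t-test p-value on indus
p_f <- f_test[["Pr(>F)"]][2]
p_t <- summary(xr_fullmodel)$coefficients["indus", "Pr(>|t|)"]
c(p_f = p_f, p_t = p_t)

# Change in the rm coefficient when indus is dropped
coef(xr_shortmodel)[["rm"]] - coef(xr_fullmodel)[["rm"]]
```

A large p-value together with a near-zero change in the rm coefficient is consistent with indus not acting as an omitted variable here.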