Weekly Discussion: Omitted Variable Bias

Author

Will Brewster

I.

The bias of an estimator is the difference between the expected value of the estimator and the parameter that it is estimating. For example, the bias of \(\hat\mu_{y}\) is \(E(\hat\mu_{y}) - \mu_{y}\).
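
As a minimal sketch of this definition, the bias of the sample mean can be approximated by simulation in base R. The population mean `mu_y`, the standard deviation, and the number of replications are all assumed values chosen for illustration; none of them come from the dataset used later.

```r
# Monte Carlo sketch: approximate the bias of the sample mean as an estimator of mu_y.
set.seed(1)        # arbitrary seed, for reproducibility only
mu_y   <- 3        # hypothetical true population mean
n      <- 55       # sample size per replication
n_reps <- 10000    # number of simulated samples

# Each replication draws one sample and records its mean (the estimate of mu_y).
mu_hat <- replicate(n_reps, mean(rnorm(n, mean = mu_y, sd = 0.5)))

# Estimated bias: average of the estimates minus the true parameter.
bias_hat <- mean(mu_hat) - mu_y
print(bias_hat)
```

Because the sample mean is unbiased, `bias_hat` should be close to zero, differing from it only by simulation noise.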

II.

Omitted variable bias does not go away even in large samples, because the included regressor X is correlated with the omitted variable and the omitted variable is itself a determinant of the dependent variable Y. Adding the omitted factors as control variables, so that they are held constant, can eliminate this bias.
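
The point about larger samples can be illustrated with a small simulation. This is only a sketch under assumed values: the data-generating process (true coefficient of 2 on `x`, 3 on the omitted variable `z`, and `z` built as `0.5 * x` plus noise) and the helper name `simulate_short_long` are hypothetical.

```r
# Sketch: omitted variable bias does not shrink as the sample size grows.
set.seed(42)  # arbitrary seed

simulate_short_long <- function(n) {
  x <- rnorm(n)
  z <- 0.5 * x + rnorm(n)              # omitted variable, correlated with x
  y <- 1 + 2 * x + 3 * z + rnorm(n)    # z also determines y
  c(n     = n,
    short = unname(coef(lm(y ~ x))["x"]),      # z omitted
    long  = unname(coef(lm(y ~ x + z))["x"]))  # z included as a control
}

# One row per sample size: the short and long estimates of the coefficient on x.
results <- t(sapply(c(100, 1000, 100000), simulate_short_long))
print(results)
```

Under these assumptions the short regression settles near 2 + 3 × 0.5 = 3.5 instead of the true value of 2, while the regression that controls for `z` settles near 2.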

III.

  1. I chose the dataset “gpa” from the openintro package. It is a survey of 55 Duke students on GPA, hours spent studying per week, hours of sleep per night, number of nights they go out, and gender. I will make GPA the dependent variable and hours studying, sleep per night, and number of nights spent going out the independent variables:

    Rows: 55
    Columns: 5
    $ gpa        <dbl> 3.890, 3.900, 3.750, 3.600, 4.000, 3.150, 3.250, 3.925, 3.4…
    $ studyweek  <int> 50, 15, 15, 10, 25, 20, 15, 10, 12, 2, 10, 30, 30, 21, 10, …
    $ sleepnight <dbl> 6.0, 6.0, 7.0, 6.0, 7.0, 7.0, 6.0, 8.0, 8.0, 8.0, 8.0, 6.0,…
    $ out        <dbl> 3.0, 1.0, 1.0, 4.0, 3.0, 3.0, 1.0, 3.0, 2.0, 4.0, 1.0, 2.0,…
    $ gender     <fct> female, female, female, male, female, male, female, female,…

\[ gpa = \beta_{0}+ \beta_{1}studyweek + \beta_{2}sleepnight + \beta_{3}out \]


Call:
lm(formula = gpa ~ studyweek + sleepnight + out, data = gpa)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.7334 -0.2231  0.0180  0.2428  1.0418 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 3.434822   0.347601   9.882 1.93e-13 ***
studyweek   0.001408   0.003812   0.369    0.713    
sleepnight  0.006501   0.049439   0.131    0.896    
out         0.043793   0.050161   0.873    0.387    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3417 on 51 degrees of freedom
Multiple R-squared:  0.02116,   Adjusted R-squared:  -0.03642 
F-statistic: 0.3675 on 3 and 51 DF,  p-value: 0.7768
  1. I will omit the hours of sleep per night:

    \[ gpa = \beta_{0}+ \beta_{1}studyweek + \beta_{3}out \]

    
    Call:
    lm(formula = gpa ~ studyweek + out, data = gpa)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -0.74245 -0.21880  0.01709  0.24122  1.03550 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept) 3.477065   0.131506  26.440   <2e-16 ***
    studyweek   0.001325   0.003723   0.356    0.723    
    out         0.046295   0.045972   1.007    0.319    
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 0.3384 on 52 degrees of freedom
    Multiple R-squared:  0.02083,   Adjusted R-squared:  -0.01683 
    F-statistic: 0.553 on 2 and 52 DF,  p-value: 0.5786
  2. Computing the correlations between all the variables (except the categorical variable gender), hours of sleep per night is positively correlated with gpa (about 0.06), although the correlation is weak and does not appear to be statistically significant:

```
                          gpa          studyweek        sleepnight out
gpa                         1                                         
studyweek  0.0416040278004677                  1                      
sleepnight 0.0609830832119484 -0.173834225013165                 1    
out         0.135802634202299 -0.052716305410692 0.382163376100019   1
```

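
As a rough check on the significance claim, the t statistic for a correlation can be computed in base R from the sleepnight and gpa correlation printed above and the 55 observations. This is a sketch that assumes the usual test of zero correlation; the names `r`, `n`, `t_stat`, and `p_value` are illustrative.

```r
# Sketch: test whether the sleepnight-gpa correlation differs from zero.
r <- 0.0609830832119484   # correlation taken from the matrix above
n <- 55                   # number of students in the survey

# Under the null of zero correlation, t = r * sqrt(n - 2) / sqrt(1 - r^2)
# follows a t distribution with n - 2 degrees of freedom.
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

print(c(t_stat = t_stat, p_value = p_value))
```

A t statistic this far below 2 is consistent with the reading that the correlation is not statistically significant at conventional levels.
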
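  1. We see that the OVB on the coefficient for nights out is in the positive direction: the estimate rises from 0.0438 to 0.0463 once hours of sleep are omitted, which is what we expect when the omitted variable has a positive coefficient (0.0065) and is positively correlated with out (0.38). The coefficient on studyweek moves slightly the other way (0.0014 to 0.0013), consistent with its negative correlation with sleepnight (-0.17). The adjusted \(R^2\) also becomes less negative (-0.036 to -0.017), but that reflects the penalty for an extra regressor rather than the direction of the bias. All of these changes are small relative to the standard errors.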

  2. Using the stargazer command to show the regressions:

    
    =========================================================
                                 Dependent variable:         
                        -------------------------------------
                                         gpa                 
                               (1)                (2)        
    ---------------------------------------------------------
    studyweek                 0.001              0.001       
                             (0.004)            (0.004)      
    
    sleepnight                0.007                          
                             (0.049)                         
    
    out                       0.044              0.046       
                             (0.050)            (0.046)      
    
    Constant                 3.435***           3.477***     
                             (0.348)            (0.132)      
    
    ---------------------------------------------------------
    Observations                55                 55        
    R2                        0.021              0.021       
    Adjusted R2               -0.036             -0.017      
    Residual Std. Error  0.342 (df = 51)    0.338 (df = 52)  
    F Statistic         0.367 (df = 3; 51) 0.553 (df = 2; 52)
    =========================================================
    Note:                         *p<0.1; **p<0.05; ***p<0.01
  1. In terms of the intuition behind the OVB formula, omitting a variable removes some of the variation that goes into determining the outcome, in this case gpa. For example, we see that there is much less variation in hours of sleep per night (standard deviation of about 1.03) than in hours of studying per week (about 12.39). So, omitting hours of sleep should make less of a difference than omitting hours of study per week would.

    sd(gpa$studyweek)
    [1] 12.3864
    sd(gpa$out)
    [1] 1.003194
    sd(gpa$sleepnight)
    [1] 1.032143
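
The OVB formula referred to above can also be checked numerically. In a sample, each coefficient in the short regression equals the corresponding coefficient in the long regression plus the long-regression coefficient on the omitted variable times the coefficient from an auxiliary regression of the omitted variable on the included ones. The sketch below verifies this on simulated data; the helper `ovb_decompose`, the column names `y`, `x1`, `x2`, `z`, and every number in the simulation are hypothetical and are not taken from the gpa data.

```r
# Sketch: decompose the change in coefficients when one regressor is omitted.
ovb_decompose <- function(data, y, included, omitted) {
  long  <- lm(reformulate(c(included, omitted), response = y), data = data)
  short <- lm(reformulate(included, response = y), data = data)
  # Auxiliary regression: omitted variable on the included regressors.
  aux   <- lm(reformulate(included, response = omitted), data = data)
  b_omitted <- coef(long)[[omitted]]
  data.frame(
    term         = included,
    long         = unname(coef(long)[included]),
    short        = unname(coef(short)[included]),
    implied_bias = unname(b_omitted * coef(aux)[included])
  )
}

# Simulated illustration (assumed data-generating process).
set.seed(7)
n   <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$z <- 0.4 * sim$x1 - 0.2 * sim$x2 + rnorm(n)          # omitted variable
sim$y <- 1 + 0.5 * sim$x1 + 0.3 * sim$x2 + 0.8 * sim$z + rnorm(n)

res <- ovb_decompose(sim, y = "y", included = c("x1", "x2"), omitted = "z")
print(res)
```

In each row, `short - long` should match `implied_bias` up to floating-point error. With the openintro package loaded, the same helper could be applied to the survey data as `ovb_decompose(gpa, "gpa", c("studyweek", "out"), "sleepnight")`.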