Rows: 55
Columns: 5
$ gpa <dbl> 3.890, 3.900, 3.750, 3.600, 4.000, 3.150, 3.250, 3.925, 3.4…
$ studyweek <int> 50, 15, 15, 10, 25, 20, 15, 10, 12, 2, 10, 30, 30, 21, 10, …
$ sleepnight <dbl> 6.0, 6.0, 7.0, 6.0, 7.0, 7.0, 6.0, 8.0, 8.0, 8.0, 8.0, 6.0,…
$ out <dbl> 3.0, 1.0, 1.0, 4.0, 3.0, 3.0, 1.0, 3.0, 2.0, 4.0, 1.0, 2.0,…
$ gender <fct> female, female, female, male, female, male, female, female,…
Weekly Discussion: Omitted Variable Bias
I.
The bias of an estimator is the difference between the expected value of the estimator and the parameter that it is estimating. For example, the bias of \(\hat\mu_{y}\) is \(E(\hat\mu_{y}) - \mu_{y}\).
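As a concrete case of the definition, take the sample mean \(\bar{Y}\) as the estimator of \(\mu_{y}\), assuming every observation \(Y_{i}\) has the same mean \(\mu_{y}\) (an assumption added here for illustration):
\[ E(\bar{Y}) - \mu_{y} = \frac{1}{n}\sum_{i=1}^{n} E(Y_{i}) - \mu_{y} = \mu_{y} - \mu_{y} = 0 \]
so under that assumption the sample mean is an unbiased estimator of \(\mu_{y}\).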
II.
In terms of omitted variable bias, the bias does not go away even in larger samples: as long as the regressor X is correlated with the omitted variable and the omitted variable helps determine the dependent variable Y, the OLS estimator converges to the wrong value rather than to the true coefficient. Adding the omitted factors to the regression as control variables, so that they are held constant, can eliminate this bias.
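For a single regressor X and a single omitted variable Z with coefficient \(\beta_{2}\) (Z is notation introduced here), the short-regression slope converges to \(\beta_{1} + \beta_{2}\,Cov(X, Z)/Var(X)\) rather than to \(\beta_{1}\). The following is a minimal simulation sketch of that point. It uses made-up data and made-up parameter values, not the gpa dataset, and the function name `simulate_short_slope` is hypothetical.

```r
# Simulated illustration (not the gpa data): y depends on x and z,
# and x is correlated with z. All parameter values are made up.
set.seed(1)

simulate_short_slope <- function(n, beta_x = 1, beta_z = 2, rho = 0.5) {
  z <- rnorm(n)
  x <- rho * z + rnorm(n)                  # x is correlated with the omitted z
  y <- beta_x * x + beta_z * z + rnorm(n)
  coef(lm(y ~ x))[["x"]]                   # slope from the regression that omits z
}

# Limit implied by the OVB formula: beta_x + beta_z * Cov(x, z) / Var(x),
# where Cov(x, z) = rho and Var(x) = rho^2 + 1 in this setup.
ovb_limit <- 1 + 2 * 0.5 / (0.5^2 + 1)

# Short-regression slope at increasing sample sizes
slopes <- sapply(c(100, 10000, 100000), simulate_short_slope)
print(round(slopes, 3))
print(ovb_limit)
```

With more observations the estimated slope settles near `ovb_limit` instead of the true value of 1, which is the sense in which a larger sample does not remove the bias.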
III.
I chose the dataset “gpa” from the openintro package. It is a survey of 55 Duke students on their GPA, how many hours they study per week, how many hours they sleep per night, the number of nights they go out, and gender. I will make the dependent variable GPA and the independent variables hours studying, sleep per night, and number of nights spent going out:
\[ gpa = \beta_{0} + \beta_{1}studyweek + \beta_{2}sleepnight + \beta_{3}out + u \]
Call:
lm(formula = gpa ~ studyweek + sleepnight + out, data = gpa)
Residuals:
Min 1Q Median 3Q Max
-0.7334 -0.2231 0.0180 0.2428 1.0418
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.434822 0.347601 9.882 1.93e-13 ***
studyweek 0.001408 0.003812 0.369 0.713
sleepnight 0.006501 0.049439 0.131 0.896
out 0.043793 0.050161 0.873 0.387
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3417 on 51 degrees of freedom
Multiple R-squared: 0.02116, Adjusted R-squared: -0.03642
F-statistic: 0.3675 on 3 and 51 DF, p-value: 0.7768
I will omit the hours of sleep per night:
\[ gpa = \beta_{0} + \beta_{1}studyweek + \beta_{3}out + u \]
Call:
lm(formula = gpa ~ studyweek + out, data = gpa)
Residuals:
Min 1Q Median 3Q Max
-0.74245 -0.21880 0.01709 0.24122 1.03550
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.477065 0.131506 26.440 <2e-16 ***
studyweek 0.001325 0.003723 0.356 0.723
out 0.046295 0.045972 1.007 0.319
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3384 on 52 degrees of freedom
Multiple R-squared: 0.02083, Adjusted R-squared: -0.01683
F-statistic: 0.553 on 2 and 52 DF, p-value: 0.5786
Computing the correlations among all the variables (except for the categorical variable gender), hours of sleep per night appears to be weakly positively correlated with gpa (about 0.06), although the relationship doesn’t appear to be statistically significant:
::: {.cell}
::: {.cell-output .cell-output-stdout}
```
gpa studyweek sleepnight out
gpa 1
studyweek 0.0416040278004677 1
sleepnight 0.0609830832119484 -0.173834225013165 1
out 0.135802634202299 -0.052716305410692 0.382163376100019 1
```
:::
:::
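We see that the OVB on the coefficient of out is in the positive direction: sleepnight has a positive estimated coefficient (0.0065) and is positively correlated with out (0.38), so omitting it pushes the out coefficient up, from 0.0438 to 0.0463. For studyweek the bias is slightly negative (0.00141 to 0.00133), because sleepnight and studyweek are negatively correlated (-0.17). The \(R^2\) also falls slightly once hours of sleep are omitted, from 0.02116 to 0.02083, but that reflects lost explanatory power rather than the direction of the bias. All of these shifts are small relative to the standard errors.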
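A rough check on how much the coefficients move when sleepnight is dropped, using only the estimates reported above. In OLS the coefficient from the regression without sleepnight equals the coefficient from the full regression plus \(\hat\beta_{2}\) times the slope from an auxiliary regression of sleepnight on studyweek and out. That auxiliary regression is not reported here, so the \(\hat\delta\) values below are notation introduced for this sketch and are backed out from the two sets of estimates rather than estimated directly:
\[ \hat\beta_{3}^{short} - \hat\beta_{3}^{long} = 0.046295 - 0.043793 = 0.002502 = 0.006501\,\hat\delta_{out} \;\Rightarrow\; \hat\delta_{out} \approx 0.385 \]
\[ \hat\beta_{1}^{short} - \hat\beta_{1}^{long} = 0.001325 - 0.001408 = -0.000083 = 0.006501\,\hat\delta_{studyweek} \;\Rightarrow\; \hat\delta_{studyweek} \approx -0.013 \]
The implied signs agree with the correlation table: positive for out and negative for studyweek.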
Using the stargazer command to show the regressions:
=========================================================
                          Dependent variable:
                    -------------------------------------
                                     gpa
                           (1)                (2)
---------------------------------------------------------
studyweek                 0.001              0.001
                         (0.004)            (0.004)
sleepnight                0.007
                         (0.049)
out                       0.044              0.046
                         (0.050)            (0.046)
Constant                3.435***           3.477***
                         (0.348)            (0.132)
---------------------------------------------------------
Observations               55                 55
R2                        0.021              0.021
Adjusted R2              -0.036             -0.017
Residual Std. Error  0.342 (df = 51)    0.338 (df = 52)
F Statistic         0.367 (df = 3; 51) 0.553 (df = 2; 52)
=========================================================
Note:                         *p<0.1; **p<0.05; ***p<0.01
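A table like the one above can be produced along the following lines. This is a sketch rather than the exact code behind this post: the object names `long` and `short` are made up here, and it assumes the openintro and stargazer packages are installed (the code skips quietly if they are not).

```r
# Sketch: fit both regressions on openintro's gpa data and tabulate them.
# `long` and `short` are illustrative names, not taken from the original code.
if (requireNamespace("openintro", quietly = TRUE)) {
  gpa <- openintro::gpa

  long  <- lm(gpa ~ studyweek + sleepnight + out, data = gpa)  # full model
  short <- lm(gpa ~ studyweek + out, data = gpa)               # omits sleepnight

  if (requireNamespace("stargazer", quietly = TRUE)) {
    # type = "text" prints an ASCII table to the console
    stargazer::stargazer(long, short, type = "text")
  }
}
```

Passing both fitted models to one call puts them in adjacent columns, which makes the change in the studyweek and out coefficients easy to read off.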
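In terms of the intuition behind the OVB formula, the bias from omitting a variable is the product of that variable's effect on the outcome (here gpa) and its association with the included regressors. The estimated effect of sleep is very small (0.0065), so omitting it barely moves the other coefficients even though sleepnight is moderately correlated with out. The standard deviations below put this in context: hours of sleep per night vary far less (about 1.03) than hours of studying per week (about 12.39), so a one-standard-deviation change in sleepnight shifts predicted gpa by only about 0.007 points, compared with about 0.017 for studyweek. So, eliminating hours of sleep should not make such a difference compared to eliminating hours of study per week.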
sd(gpa$studyweek)
[1] 12.3864
sd(gpa$out)
[1] 1.003194
sd(gpa$sleepnight)
[1] 1.032143
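One way to compare the regressors on a common scale is to multiply each coefficient from the full model by its regressor's standard deviation. A small sketch using only the numbers reported above (the vector names `coefs`, `sds`, and `per_sd_effect` are made up for this example):

```r
# Estimates and standard deviations copied from the output above.
coefs <- c(studyweek = 0.001408, sleepnight = 0.006501, out = 0.043793)
sds   <- c(studyweek = 12.3864,  sleepnight = 1.032143, out = 1.003194)

# Predicted change in gpa for a one-standard-deviation change in each regressor
per_sd_effect <- coefs * sds
print(round(per_sd_effect, 4))
```

This only rescales the point estimates; it says nothing about their precision, and none of the three coefficients is statistically significant in the full model.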