Karim Naguib (Boston University)
9/29/2013
Recall that our regression model was
\[ TestScore_i = \beta_0 + \beta_1 \times STR_i + u_i,~i=1,\dots,n \]
cor(test.score.data$str, test.score.data$el.pct)
[1] 0.1876
What OLS assumption does OVB violate?
It violates the OLS assumption A.1: \( E[u_i|X_i] = 0 \).
Because the omitted variable is captured by \( u_i \), and it both determines \( Y_i \) and is correlated with \( X_i \), we have \( E[u_i|X_i] \ne 0 \). This causes a bias in the estimator that does not vanish even in very large samples.
Since there is a correlation between \( u_i \) and \( X_i \) \[ Corr(X_i, u_i) = \rho_{Xu} \ne 0 \]
The OLS estimator has the probability limit \[ \hat{\beta}_1 \overset{p}{\longrightarrow} \beta_1 + \rho_{Xu}\frac{\sigma_u}{\sigma_X} \] which means that \( \hat{\beta}_1 \) gets arbitrarily close to the right-hand-side value, with probability approaching one, as the sample size grows. The bias term \( \rho_{Xu}\frac{\sigma_u}{\sigma_X} \) does not disappear.
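To illustrate this limit, here is a small simulation sketch with made-up values (not part of the California data): \( X_i \) and \( u_i \) are drawn with a chosen correlation \( \rho_{Xu} \), and the OLS slope from a large sample is compared with \( \beta_1 + \rho_{Xu}\frac{\sigma_u}{\sigma_X} \).

set.seed(123)
beta1 <- 2; rho.xu <- 0.5; sd.x <- 1; sd.u <- 2       # illustrative values
n <- 100000
x <- rnorm(n, sd = sd.x)
# u is constructed to have standard deviation sd.u and correlation rho.xu with x
u <- rho.xu * (sd.u/sd.x) * x + rnorm(n, sd = sd.u * sqrt(1 - rho.xu^2))
y <- 1 + beta1 * x + u
coef(lm(y ~ x))["x"]            # OLS slope in a large sample
beta1 + rho.xu * sd.u/sd.x      # its probability limit: beta1 + rho * sigma_u / sigma_X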
In a multiple regression model we allow for more than one regressor. This allows us to isolate the effect of a particular variable holding all others constant.
The population regression line (function) with two regressors would be
\[ E[Y_i|X_{1i} = x_1, X_{2i} = x_2] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \]
For simplicity let us write the population regression line as
\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 \]
Suppose we change \( X_1 \) by an amount \( \Delta X_1 \) while holding \( X_2 \) constant, which would cause \( Y \) to change to \( Y + \Delta Y \).
\[ \begin{align*} Y &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 \\ Y + \Delta Y &= \beta_0 + \beta_1 (X_1 + \Delta X_1) + \beta_2 X_2 \end{align*} \]
Subtracting the first equation from the second gives
\[ \begin{align*} \Delta Y &= \beta_1 \Delta X_1 \\ \beta_1 &= \frac{\Delta Y}{\Delta X_1} \end{align*} \]
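As a quick numerical check with illustrative coefficients (\( \beta_0 = 1 \), \( \beta_1 = 2 \), \( \beta_2 = 3 \); these are not estimates from the data), changing \( X_1 \) while holding \( X_2 \) fixed recovers \( \beta_1 \) exactly:

pop.line <- function(x1, x2) 1 + 2 * x1 + 3 * x2     # illustrative population regression line
(pop.line(10.5, 5) - pop.line(10, 5))/0.5             # Delta Y / Delta X1 = beta1 = 2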
These other unobserved factors are captured by the error term \( u_i \) in the population multiple regression model
\[ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + u_i,~i=1,\dots,n \]
(In general, we can have any number \( k \) of regressors, as shown above.)
As with the case of a single regressor model, the population multiple regression model can be either homoskedastic or heteroskedastic. It is homoskedastic if
\[ Var(u_i|X_{1i},\dots,X_{ki}) \]
is constant for all \( i=1,\dots,n \). Otherwise, it is heteroskedastic.
We do this by minimizing the sum of squared differences between the observed dependent variable and its predicted value
\[ \min_{b_0,\dots,b_k} \sum_i (Y_i - b_0 - b_1 X_{1i} - \cdots - b_k X_{ki})^2 \]
The predicted values would be
\[ \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \cdots + \hat{\beta}_k X_{ki},~i=1,\dots,n \]
The OLS residuals would be
\[ \hat{u}_i = Y_i - \hat{Y}_i,~i=1,\dots,n \]
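As a sketch of how this minimization works out in practice, the well-known closed-form solution \( \hat{\beta} = (X'X)^{-1}X'Y \) (assumed here, not derived in these slides) lets us compute the coefficients, predicted values, and residuals directly from the test.score.data data frame used elsewhere in these slides, and compare them with lm():

X <- cbind(1, test.score.data$str, test.score.data$el.pct)   # regressor matrix; first column is the intercept
Y <- test.score.data$testscr
beta.hat <- solve(t(X) %*% X, t(X) %*% Y)   # OLS coefficients (b0, b1, b2)
Y.hat <- X %*% beta.hat                     # predicted values
u.hat <- Y - Y.hat                          # OLS residuals
beta.hat   # should match coef(lm(testscr ~ str + el.pct, data = test.score.data))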
Recall that, using observations from 420 school districts, we regressed student test scores on STR and got \[ \widehat{TestScore} = 698.9 - 2.28 \times STR \]
However, there was concern about the possibility of OVB due to the exclusion of the percentage of English learners in a district, since it both influences test scores and is correlated with STR.
We can now address this concern by including the percentage of English learners in our model
\[ TestScore_i = \beta_0 + \beta_1 \times STR_i + \beta_2 \times PctEL_i + u_i \]
where \( PctEL_i \) is the percentage of English learners in school district \( i \).
(Notice that we are using heteroskedasticity-robust standard errors)
library(sandwich)   # for vcovHC()
library(lmtest)     # for coeftest()

regress.results <- lm(testscr ~ str + el.pct, data = test.score.data)
het.se <- vcovHC(regress.results)   # heteroskedasticity-robust covariance matrix
coeftest(regress.results, vcov. = het.se)
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 686.0322 8.8122 77.85 <2e-16 ***
str -1.1013 0.4371 -2.52 0.012 *
el.pct -0.6498 0.0313 -20.76 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In the single regressor case our estimates were \[ \widehat{TestScore} = 698.9 - 2.28 \times STR \] and with the added regressor we have \[ \widehat{TestScore} = 686.0 - 1.10 \times STR - 0.65 \times PctEL \]
Similar to the single regressor case, except for the modified adjustment for the degrees of freedom, the \( SER \) is
\[ SER = s_{\hat{u}}\text{ where }s_{\hat{u}}^2 = \frac{\sum_i \hat{u}^2_i}{n - k - 1} = \frac{SSR}{n - k - 1} \]
Instead of adjusting for the two degrees of freedom used to estimate two coefficients, we now need to adjust for the \( k+1 \) estimated coefficients.
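As a sketch of this formula (using the regress.results object estimated above), the \( SER \) can be computed directly from the residuals with the \( n-k-1 \) adjustment and compared with the sigma reported by summary():

u.hat <- residuals(regress.results)        # OLS residuals
n <- length(u.hat)
k <- length(coef(regress.results)) - 1     # number of regressors, excluding the intercept
sqrt(sum(u.hat^2)/(n - k - 1))             # SER; same value as summary(regress.results)$sigma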
\[ R^2 = \frac{ESS}{TSS} = 1 - \frac{SSR}{TSS} \]
In order to address the inflation problem of the \( R^2 \), we can calculate an “adjusted” version that corrects for it \[ \bar{R}^2 = 1 - \frac{n-1}{n-k-1}\frac{SSR}{TSS} = 1 - \frac{s_{\hat{u}}^2}{s_Y^2} \]
From our multiple regression of the test scores on STR and the percentage of English learners we have the \( R^2 \), \( \bar{R}^2 \), and \( SER \)
regress.summary <- summary(regress.results)
regress.summary$r.squared
[1] 0.4264
regress.summary$adj.r.squared
[1] 0.4237
regress.summary$sigma
[1] 14.46
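These values can also be reproduced directly from the definitions above (a sketch using the same regress.results object and data frame):

u.hat <- residuals(regress.results)
y <- test.score.data$testscr
n <- length(y); k <- length(coef(regress.results)) - 1
ssr <- sum(u.hat^2)                        # sum of squared residuals
tss <- sum((y - mean(y))^2)                # total sum of squares
1 - ssr/tss                                # R^2
1 - (n - 1)/(n - k - 1) * ssr/tss          # adjusted R^2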
For multiple regression we have four assumptions: three are updated versions of the single regressor assumptions, and one is new.