Karim Naguib (Boston University)
10/3/2013
A two-sided test for any parameter \( \beta_j \) would be \[ \begin{align*} H_0&: \beta_j = \beta_{j,0} \\ H_1&: \beta_j \ne \beta_{j,0} \end{align*} \]
For example, suppose the coefficient for \( STR \) is \( \beta_j \) and we want to test the hypothesis that it is equal to zero. In that case, we would have \( \beta_{j,0} = 0 \)
The \( t \)-statistic would be \[ t = \frac{\hat{\beta}_j - \beta_{j,0}}{SE(\hat{\beta}_j)} \]
The \( p \)-value would be \[ p\text{-value} = 2\Phi(-|t^{act}|) \] where \( t^{act} \) is the value of \( t \) for the observed sample.
The method for constructing a confidence interval is the same as with a single regression. The \( (1-\alpha)\times 100\% \) confidence interval for coefficient \( \beta_j \) is
\[ [\hat{\beta}_j - z_{\alpha/2}SE(\hat{\beta}_j), \hat{\beta}_j + z_{\alpha/2}SE(\hat{\beta}_j)] \]
Recall our results from regressing \( TestScore \) on \( STR \) and \( PctEL \): the estimated coefficient on \( STR \) was \( -1.10 \) with a standard error of \( 0.43 \).
To test the hypothesis that the coefficient on \( STR \) is 0, we compute the \( t \)-statistic \[ t = \frac{-1.10 - 0}{0.43} = -2.54 \] and the associated \( p \)-value \[ p\text{-value} = 2\Phi(-2.54) = 0.011 \] We can reject the hypothesis at the 5% significance level but not at the 1% level.
To calculate the 95% confidence interval for the coefficient on \( STR \)
\[ -1.10 \pm 1.96 \times 0.43 = (-1.95, -0.26) \]
And in response to an increase in the \( STR \) of 2, a 95% confidence interval for the effect on test scores is
\[ (-1.95\times 2, -0.26\times 2) = (-3.90, -0.52) \]
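As a quick check, these numbers can be reproduced in R from the reported estimate and standard error (small discrepancies with the values above are due to rounding):
# Reproduce the p-value and 95% confidence interval from the rounded
# estimate (-1.10) and standard error (0.43) reported above
2 * pnorm(-abs(-2.54))                    # two-sided p-value, approx 0.011
-1.10 + c(-1, 1) * qnorm(0.975) * 0.43    # 95% CI, approx (-1.94, -0.26)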
Suppose we now want to also estimate the effect of expenditure per student. We want to know whether budget cuts would be a good idea. We add this new regressor, \( Expn \), to the two we already have
\[ TestScore_i = \beta_0 + \beta_1 STR + \beta_2 Expn + \beta_3 PctEL + u_i \]
library(lmtest)    # provides coeftest()
library(sandwich)  # provides the heteroskedasticity-robust vcovHC()
test.score.data$expn.per.1k <- test.score.data$expn.stu/1000  # expenditure per student in $1000s
regress.results <- lm(testscr ~ str + expn.per.1k + el.pct, data = test.score.data)
coeftest(regress.results, vcov.=vcovHC(regress.results))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 649.5779 15.6686 41.46 <2e-16 ***
str -0.2864 0.4875 -0.59 0.557
expn.per.1k 3.8679 1.6074 2.41 0.017 *
el.pct -0.6560 0.0321 -20.43 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The regression results can be restated as \[ \widehat{TestScore} = 649.58 - 0.29\,STR + 3.87\,Expn - 0.66\,PctEL \] Note that the coefficient on \( STR \) is now much smaller and no longer statistically significant. To see why, consider the correlation between the student-teacher ratio and expenditure per student
cor(test.score.data$str, test.score.data$expn.stu)
[1] -0.62
and hence we are seeing the effect of imperfect multicollinearity: because the two regressors are highly correlated, it is difficult to estimate either coefficient precisely while holding the other fixed.
\[ TestScore_i = \beta_0 + \beta_1 STR + \beta_2 Expn + u_i \]
Suppose that in the test score/STR analysis, an angry taxpayer hypothesizes that neither the STR nor expenditure per student has an effect on test scores
\[ \begin{align*} H_0&: \beta_1 = 0 \text{ and } \beta_2 = 0 \\ H_1&: \beta_1 \ne 0 \text{ and/or } \beta_2 \ne 0 \end{align*} \]
(As a matter of terminology, here we see the null hypothesis imposing two restrictions)
\( H_0: \beta_j = \beta_{j,0}, \beta_m = \beta_{m,0},\dots \) for \( q \) restrictions
\( H_1: \) one or more of the \( q \) restrictions under \( H_0 \) does not hold
For example, suppose we wanted to test the null hypothesis that the \( 2^{nd}, 4^{th}, \text{and }5^{th} \) coefficients are zero. In that case we would have the \( q=3 \) restrictions \( \beta_2 = 0, \beta_4 = 0, \text{and }\beta_5 = 0 \)
Consider the special case where \( t_1 \) and \( t_2 \), the \( t \)-statistics for the two restrictions, are uncorrelated. In that case the \( F \)-statistic is simply the average of the squared \( t \)-statistics \[ F = \frac{1}{2}\left(t_1^2 + t_2^2\right) \]
We can use the large-sample \( F_{q,\infty} \) approximation to calculate the \( p \)-value for an observed \( F^{act} \) \[ p\text{-value} = Pr[F_{q,\infty} > F^{act}] \]
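In R this tail probability can be computed with pf(), using df2 = Inf for the large-sample approximation (the values 5.26 and q = 2 below are taken from the joint test reported later in this section):
# p-value for an observed F-statistic under the F(q, Inf) approximation
F.act <- 5.26   # observed F-statistic (from the output below)
q     <- 2      # number of restrictions
pf(F.act, df1 = q, df2 = Inf, lower.tail = FALSE)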
We can now test the null hypothesis that the coefficients on \( STR \) and \( Expn \) are zero, against the alternative that at least one of them is nonzero, holding \( PctEL \) fixed. In R we can calculate the heteroskedasticity-robust \( F \)-statistic and its \( p \)-value, for the restrictions \( \beta_1 = 0 \) and \( \beta_2 = 0 \)
library(car)  # provides lht(), an abbreviation of linearHypothesis()
regress.results <- lm(testscr ~ str + expn.per.1k + el.pct, data = test.score.data)
# Heteroskedasticity-robust F-test of the joint null beta1 = 0 and beta2 = 0
lht(regress.results, c('str = 0', 'expn.per.1k = 0'), test='F', vcov.=vcovHC(regress.results))
Linear hypothesis test
Hypothesis:
str = 0
expn.per.1k = 0
Model 1: restricted model
Model 2: testscr ~ str + expn.per.1k + el.pct
Note: Coefficient covariance matrix supplied.
Res.Df Df F Pr(>F)
1 418
2 416 2 5.26 0.0055 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Suppose we now want to test the null hypothesis (\( q = 1 \)) that two of the parameters are equal
\[ \begin{align*} H_0&: \beta_1 = \beta_2 \\ H_1&: \beta_1 \ne \beta_2 \end{align*} \]
There are two approaches to doing this
Transform the Regression Consider the model \[ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i \] We can transform it by adding and subtracting \( \beta_2 X_{1i} \) \[ \begin{align*} Y_i &= \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + u_i \\ &= \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_2 X_{1i} - \beta_2 X_{1i} + u_i \\ &= \beta_0 + (\underbrace{\beta_1 - \beta_2}_{\gamma_1}) X_{1i} + \beta_2 (\underbrace{X_{1i} + X_{2i}}_{W_i}) + u_i \\ &= \beta_0 + \gamma_1 X_{1i} + \beta_2 W_i + u_i \end{align*} \]
The test simplifies to testing \( H_0: \gamma_1 = 0 \) vs \( H_1: \gamma_1 \ne 0 \).
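A minimal sketch of this transformed-regression approach, using simulated data (the variables x1, x2, y, and w below are purely illustrative and are not part of the test score data set), with the same coeftest/vcovHC tools used above:
# Simulate a model in which beta1 = beta2, so gamma1 = beta1 - beta2 = 0
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 + 0.5 * x2 + rnorm(n)
w  <- x1 + x2                       # W_i = X_1i + X_2i
transformed <- lm(y ~ x1 + w)       # coefficient on x1 is gamma1 = beta1 - beta2
coeftest(transformed, vcov. = vcovHC(transformed))  # t-test on x1 tests gamma1 = 0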
Consider testing \( \beta_1 = \beta_3 \) in the test scores model
# Robust F-test (q = 1) of the single restriction str = el.pct, i.e. beta1 = beta3
regress.results <- lm(testscr ~ str + expn.per.1k + el.pct, data = test.score.data)
lht(regress.results, 'str = el.pct', test='F', vcov.=vcovHC(regress.results))
Linear hypothesis test
Hypothesis:
str - el.pct = 0
Model 1: restricted model
Model 2: testscr ~ str + expn.per.1k + el.pct
Note: Coefficient covariance matrix supplied.
Res.Df Df F Pr(>F)
1 417
2 416 1 0.56 0.45
Consider the test scores example
Now, in order to control for the possible OVB in the test scores example due to “outside learning opportunities” available to richer households, we add the regressor \( LchPct \): the percentage of students receiving a free or subsidized school lunch.
# Add the lunch-program control (meal.pct corresponds to LchPct in the text)
regress.results <- lm(testscr ~ str + el.pct + meal.pct,
                      data = test.score.data)
coeftest(regress.results, vcov.=vcovHC(regress.results))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 700.1500 5.6410 124.12 < 2e-16 ***
str -0.9983 0.2738 -3.65 0.00030 ***
el.pct -0.1216 0.0332 -3.66 0.00029 ***
meal.pct -0.5473 0.0243 -22.51 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
To explain conditional mean independence, consider a regression with two regressors: \( X_{1i} \), the variable of interest, and \( X_{2i} \), the control variable. Conditional mean independence requires
\[ E[u_i|X_{1i}, X_{2i}] = E[u_i|X_{2i}] \]
Under conditional mean independence, the coefficient \( \beta_1 \) on the variable of interest has a causal interpretation, but the coefficient \( \beta_2 \) on the control variable in general does not.
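A small simulation sketch of this condition, with a hypothetical data-generating process (not the test score data): the error depends on the control \( X_{2i} \) but, conditional on \( X_{2i} \), not on \( X_{1i} \).
# Conditional mean independence: E[u|x1, x2] = E[u|x2]
set.seed(1)
n  <- 10000
x2 <- rnorm(n)                  # control variable
x1 <- 0.5 * x2 + rnorm(n)       # variable of interest, correlated with x2
u  <- 0.8 * x2 + rnorm(n)       # error depends on x2 only, so CMI holds
y  <- 1 + 2 * x1 + 1 * x2 + u   # causal effects: 2 for x1, 1 for x2
coef(lm(y ~ x1 + x2))           # x1 coefficient is close to 2 (causal);
                                # x2 coefficient is close to 1.8, not 1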
As mentioned before, we must be careful not to rely too heavily on these measures when choosing a specification. There are several problems with \( R^2 \) and \( \bar{R}^2 \).
As we've done before, we want to run a multiple regression to determine the effect of the student-teacher ratio on average district test scores. We explained earlier why we are concerned about OVB and why we need to control for students' background characteristics. Some controls we will consider: