Karim Naguib (Boston University)
9/14/2013
Recall that our model is
\[ Y_i = \beta_0 + \beta_1 X_i + u_i \]
We know the sampling distribution of the estimator \( \hat{\beta}_1 \)
\[ \hat{\beta}_1 \sim N\left(\beta_1, \frac{Var[(X_i - \mu_X)u_i]}{n[Var(X_i)]^2}\right) \]
We would like to carry out inference on the population parameter \( \beta_1 \)
Recall that to test any hypothesis we rely on the \( t \)-statistic
\[ t = \frac{\text{estimator} - \text{hypothesized value}}{\text{standard error of the estimator}} \]
From that we calculate the \( p \)-value, which tells us the smallest significance level at which \( H_0 \) can be rejected
\[ p\text{-value} = 2 \Phi(-|t|) \]
To test a two-sided hypothesis about \( \beta_1 \) against a hypothesized value \( \beta_{1,0} \), the hypotheses are \[ \begin{align*} H_0&: \beta_1 = \beta_{1,0} \\ H_1&: \beta_1 \ne \beta_{1,0} \end{align*} \] and the \( t \)-statistic is
\[ t = \frac{\hat{\beta}_1 - \beta_{1,0}}{SE(\hat{\beta}_1)} \]
To calculate the \( t \)-statistic, the first thing we need is the standard error of \( \hat{\beta}_1 \). Recall that the standard error is an estimate of \( \sigma_{\hat{\beta}_1} \) (the standard deviation of \( \hat{\beta}_1 \)) \[ SE(\hat{\beta}_1) = \sqrt{\hat{\sigma}_{\hat{\beta}_1}^2} \] where \[ \hat{\sigma}_{\hat{\beta}_1}^2 = \frac{1}{n}\times \frac{\overbrace{\frac{1}{n-2}\sum_i(X_i - \bar{X})^2\hat{u}_i^2}^{\text{sample variance of }(X_i - \mu_X)u_i}}{\big[\underbrace{\frac{1}{n - 1}\sum_i(X_i - \bar{X})^2}_{\text{sample variance of }X_i}\big]^2} \]
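As a rough illustration, this formula can be computed directly in R. The sketch below uses simulated x and y (the names fit, u.hat, etc. are made up for this example), not the actual test score data:
# A minimal sketch of the variance formula above, on simulated data
set.seed(42)
n <- 100
x <- rnorm(n, mean = 20, sd = 2)      # hypothetical regressor
y <- 700 - 2 * x + rnorm(n, sd = 15)  # hypothetical outcome
fit <- lm(y ~ x)
u.hat <- resid(fit)                   # residuals
num <- sum((x - mean(x))^2 * u.hat^2) / (n - 2)  # sample variance of (X_i - mu_X)u_i
den <- (sum((x - mean(x))^2) / (n - 1))^2        # squared sample variance of X_i
sqrt(num / (n * den))                            # SE(beta1-hat)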
To calculate the \( p \)-value \[ \begin{align*} p\text{-value} &= Pr\left[|\hat{\beta}_1 - \beta_{1,0}| > |\hat{\beta}_1^{act} - \beta_{1,0}|\big|H_0\right] \\ &= Pr\left[\left|\frac{\hat{\beta}_1 - \beta_{1,0}}{SE(\hat{\beta}_1)}\right| > \left|\frac{\hat{\beta}_1^{act} - \beta_{1,0}}{SE(\hat{\beta}_1)}\right|\big|H_0\right] \\ &= Pr[|t| > |t^{act}|\big|H_0] \end{align*} \] And using a large sample, \( t \) can be approximated by a standard normal distribution, and therefore \[ p\text{-value} = Pr(|Z| > |t^{act}|) = 2\Phi(-|t^{act}|) \] where \( Z \sim N(0,1) \)
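As a small sketch, the \( t \)-statistic and two-sided \( p \)-value can be computed directly; the numbers below are placeholders chosen to mirror the regression reported next:
beta1.hat <- -2.28   # hypothetical slope estimate
se.beta1 <- 0.48     # hypothetical standard error
beta1.0 <- 0         # value of beta_1 under H_0
t.act <- (beta1.hat - beta1.0) / se.beta1
2 * pnorm(-abs(t.act))   # large-sample (normal) two-sided p-value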
regress.results <- lm(testscr ~ str, data = test.score.data)
summary(regress.results)
Call:
lm(formula = testscr ~ str, data = test.score.data)
Residuals:
Min 1Q Median 3Q Max
-47.73 -14.25 0.48 12.82 48.54
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 698.93 9.47 73.82 < 2e-16 ***
str -2.28 0.48 -4.75 2.8e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 18.6 on 418 degrees of freedom
Multiple R-squared: 0.0512, Adjusted R-squared: 0.049
F-statistic: 22.6 on 1 and 418 DF, p-value: 2.78e-06
For the test scores/STR model we saw that \( SE(\hat{\beta}_0) = 9.4675 \) and \( SE(\hat{\beta}_1) = 0.4798 \). A compact way to report regression results is \[ \begin{align*} \widehat{TestScore} = &698.9330 - 2.2798 \times STR\\ (&9.4675)~ ~ ~ ~ ~(0.4798) \\ \\ R^2 = &0.05124,~SER = 18.58 \end{align*} \] And the reported \( t \)-statistic is calculated as \[ t^{act} = \frac{-2.2798 - 0}{0.4798} = -4.751 \] Since \( |t^{act}| > z_{0.025} = 1.96 \) we can reject \( H_0 \) at the 5% significance level.
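These quantities can also be pulled out of the fitted model instead of being typed by hand; a short sketch, assuming the regress.results object created above:
coef.table <- coef(summary(regress.results))   # estimates, standard errors, t values, p values
t.act <- coef.table["str", "Estimate"] / coef.table["str", "Std. Error"]
t.act                    # about -4.75
2 * pnorm(-abs(t.act))   # normal-approximation p-value, close to the reported Pr(>|t|)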
Alternatively, we can have a one-sided test for \( \beta_1 \) such as \[ \begin{align*} H_0&: \beta_1 = \beta_{1,0} \\ H_1&: \beta_1 < \beta_{1,0} \end{align*} \] In this case, our decision rule would be to reject \( H_0 \) if \[ t^{act} < -z_{\alpha} \] and \[ p\text{-value} = Pr(Z < t^{act}) = \Phi(t^{act}) \] (If the direction of the inequality in \( H_1 \) were reversed, the decision rule would be to reject if \( t^{act} > z_{\alpha} \), and \( p\text{-value} = 1 - \Phi(t^{act}) \))
Consider the hypotheses
\[ \begin{align*} H_0&: \beta_{ClassSize} = 0 \\ H_1&: \beta_{ClassSize} < 0 \end{align*} \]
Since we know from the regression output that
\[ t^{act} = -4.751 < -z_{0.05} = -1.645 \]
we can reject \( H_0 \) at the 5% significance level.
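A minimal sketch of this one-sided test in R, using the \( t \)-statistic reported above:
t.act <- -4.751
qnorm(0.05)          # critical value -z_0.05 = -1.645
t.act < qnorm(0.05)  # TRUE, so reject H_0 at the 5% level
pnorm(t.act)         # one-sided p-value Pr(Z < t.act)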
The \( (1-\alpha)\times 100 \% \) confidence interval for the parameter \( \beta_1 \) is \[ [\hat{\beta}_1 - z_{\alpha/2}SE(\hat{\beta}_1), \hat{\beta}_1 + z_{\alpha/2}SE(\hat{\beta}_1)] \]
The 95% confidence interval for \( \beta_{ClassSize} \) is \[ \begin{align*} &[\hat{\beta}_{ClassSize} - z_{0.025}SE(\hat{\beta}_{ClassSize}),\hat{\beta}_{ClassSize} + z_{0.025}SE(\hat{\beta}_{ClassSize})] = \\ &[-2.28 - 1.96\times 0.48,-2.28 + 1.96\times 0.48] = \\ &[-3.22, -1.34] \end{align*} \]
regress.results <- lm(testscr ~ str, data = test.score.data)
confint(regress.results, "str", level = 0.95)
2.5 % 97.5 %
str -3.223 -1.337
The confidence interval for the predicted change in the dependent variable \( Y \) in response to a change \( \Delta x \) in \( X \) is \[ \left[ [\hat{\beta}_1 - z_{\alpha/2}SE(\hat{\beta}_1)]\times \Delta x, [\hat{\beta}_1 + z_{\alpha/2}SE(\hat{\beta}_1)]\times \Delta x \right] \]
Suppose we want to calculate the confidence interval for the predicted change in test scores in response to a change of \( \Delta x = 2 \) in the \( STR \) \[ \begin{align*} &\left[ (-2.28 - 1.96\times 0.48)\times 2,(-2.28 + 1.96\times 0.48)\times 2\right] = \\ &[-6.45, -2.67] \end{align*} \] (For a reduction of 2, i.e. \( \Delta x = -2 \), the endpoints switch sign: \( [2.67, 6.45] \).)
confint(regress.results, 'str', level = 0.95) * 2
2.5 % 97.5 %
str -6.446 -2.673
Now suppose the regressor is a binary (dummy) variable \( D_i \), so the model is \[ Y_i = \beta_0 + \beta_1 D_i + u_i \] In order to interpret the meaning of the coefficient \( \beta_1 \), we need to consider the two possible values of \( D_i \): when \( D_i = 0 \), \( E[Y_i|D_i = 0] = \beta_0 \), and when \( D_i = 1 \), \( E[Y_i|D_i = 1] = \beta_0 + \beta_1 \).
Therefore, \( \beta_1 \) can be viewed as the difference between the two means \[ E[Y_i|D_i = 1] - E[Y_i|D_i = 0] = (\beta_0 + \beta_1) - \beta_0 = \beta_1 \]
To test whether the two means are different, we use the hypotheses \[ \begin{align*} H_0&: \beta_1 = 0 \\ H_1&: \beta_1 \ne 0 \end{align*} \]
test.score.data$d <- ifelse(test.score.data$str < 20, 1, 0)
str(test.score.data$d)
num [1:420] 1 0 1 1 1 0 1 0 1 0 ...
regression.results <- lm(testscr ~ d, data = test.score.data)
summary(regression.results)
Call:
lm(formula = testscr ~ d, data = test.score.data)
Residuals:
Min 1Q Median 3Q Max
-50.60 -14.05 -0.45 12.84 49.40
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 649.98 1.39 468 < 2e-16 ***
d 7.37 1.84 4 7.5e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 18.7 on 418 degrees of freedom
Multiple R-squared: 0.0369, Adjusted R-squared: 0.0345
F-statistic: 16 on 1 and 418 DF, p-value: 7.52e-05
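We can check the difference-in-means interpretation of the coefficient on d directly; a sketch, assuming test.score.data and d as constructed above:
group.means <- tapply(test.score.data$testscr, test.score.data$d, mean)
group.means                           # mean test score when d = 0 and when d = 1
group.means["1"] - group.means["0"]   # should match the coefficient on d (about 7.37)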
The error term \( u_i \) is said to be homoskedastic if the variance of the conditional distribution of \( u_i \) given \( X_i \) \[ Var(u_i|X_i = x), \] is constant for all \( i \) and does not depend on \( x \). Otherwise, the error term is said to be heteroskedastic.
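One informal way to look for heteroskedasticity is to plot the residuals against the regressor and see whether their spread changes with x; a rough sketch, assuming the regress.results object fitted earlier:
plot(test.score.data$str, resid(regress.results),
     xlab = "str", ylab = "residual")   # does the spread of residuals change with str?
abline(h = 0, lty = 2)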
Suppose we are interested in the gender gap in income of college graduates in the US. Our model is
\[ Earnings_i = \beta_0 + \beta_1 MALE_i + u_i,~i=1,\dots,n \]
where \( MALE_i \) is a dummy variable that is equal to 1 for males and 0 for females. \( \beta_1 \) would be the difference in mean earnings between men and women.
It is useful to write this model as two equations
| Subpopulation | Regression equation |
|---|---|
| Women | \( Earnings_i = \beta_0 + u_i \) |
| Men | \( Earnings_i = \beta_0 + \beta_1 + u_i \) |
If the error terms \( u_i \) are homoskedastic, we can simplify the formula for the estimate of the variance of \( \hat{\beta}_1 \) to be \[ \tilde{\sigma}_{\hat{\beta}_1}^2 = \frac{s_{\hat{u}}^2}{\sum_i (X_i - \bar{X})^2} \] and the homoskedasticity-only standard error of \( \hat{\beta}_1 \) would be \[ SE(\hat{\beta}_1) = \sqrt{\tilde{\sigma}_{\hat{\beta}_1}^2} \]
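A minimal sketch of this homoskedasticity-only standard error, computed by hand for the test score regression (assuming regress.results from before); it should be close to the 'Std. Error' reported by summary():
u.hat <- resid(regress.results)
x <- test.score.data$str
n <- length(x)
s.u2 <- sum(u.hat^2) / (n - 2)     # s_u-hat^2
sqrt(s.u2 / sum((x - mean(x))^2))  # homoskedasticity-only SE(beta1-hat), about 0.48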
The reason why some of the standard errors we obtained earlier in R differed from those in the textbook is that, by default, R assumes the error terms are homoskedastic, while the textbook does all of its calculations using heteroskedasticity-robust standard errors.
To use heteroskedasticity-robust standard errors in R
library(sandwich)  # provides vcovHC() for robust variance-covariance estimates
library(lmtest)    # provides coeftest()
regress.results <- lm(testscr ~ str, data = test.score.data)
heterosk.robust.se <- vcovHC(regress.results)  # heteroskedasticity-robust variance-covariance matrix
coeftest(regress.results, vcov. = heterosk.robust.se)  # coefficient tests using the robust standard errors
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 698.933 10.461 66.82 < 2e-16 ***
str -2.280 0.524 -4.35 1.7e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
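Note that vcovHC() uses the HC3 variant by default; to get an estimate closer to the textbook's robust formula one can request HC1 explicitly (a sketch; the standard errors will differ slightly from those above):
heterosk.robust.se.hc1 <- vcovHC(regress.results, type = "HC1")
coeftest(regress.results, vcov. = heterosk.robust.se.hc1)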