In a regression model, heteroskedasticity occurs when the variance of
the error terms is not constant across levels of the independent
variable. In other words, the spread of the residuals changes along the
range of the independent variable.
The presence of heteroskedasticity violates the assumption of constant
variance in the error terms. When such a violation is present, the
standard errors of the regression coefficients are affected: they may
be biased, which leads to invalid conclusions about statistical
significance. Furthermore, the coefficient estimators become less
precise (they are no longer efficient), which makes confidence
intervals built from them unreliable.
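To make this concrete, below is a minimal simulated sketch (hypothetical data, not the dataset analyzed in this post) in which the standard deviation of the errors grows with the independent variable, producing the characteristic fan shape in the scatter plot:

set.seed(42)
n <- 200
x <- runif(n, min = 0, max = 10)
# the error standard deviation increases with x, so Var(e | x) is not constant
e <- rnorm(n, mean = 0, sd = 0.5 * x)
y <- 2 + 3 * x + e
plot(x, y) # the vertical spread of y widens as x increases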
White Test
Null Hypothesis - Heteroskedasticity is not present. The variance of the error terms is constant for all levels of the independent variable.
Alternative Hypothesis - Heteroskedasticity is present. The variance of the error terms is not constant for all levels of the independent variable.
The logic behind the White test is to look for a relationship between the variance of the residuals and the independent variables by running an auxiliary regression: the squared residuals are regressed on the original regressors, their squares, and their pairwise interactions. The test conclusion is then drawn from the p-value. If the p-value is less than the significance level, the null hypothesis of homoskedasticity is rejected (heteroskedasticity is present).
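The test statistic is the sample size multiplied by the \(R^2\) of the auxiliary regression; under the null hypothesis it follows a chi-squared distribution with degrees of freedom equal to the number of regressors in the auxiliary regression (excluding the intercept):

\[ LM = n R^2_{\text{aux}} \sim \chi^2_{k} \]

For the three-regressor model fitted below, \(k = 9\): three regressors, three squared terms, and three interaction terms.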
data("MASchools")
df <- MASchools
The MASchools dataset in the AER package records test
performance, school characteristics and student demographic backgrounds
for school districts in Massachusetts.
skim(df)
| Name | df |
|---|---|
| Number of rows | 220 |
| Number of columns | 16 |
| Column type frequency: | |
| character | 2 |
| numeric | 14 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| district | 0 | 1 | 1 | 3 | 0 | 220 | 0 |
| municipality | 0 | 1 | 3 | 18 | 0 | 220 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| expreg | 0 | 1.00 | 4605.46 | 880.25 | 2905.00 | 4065.00 | 4488.50 | 4971.75 | 8759.00 | ▃▇▂▁▁ |
| expspecial | 0 | 1.00 | 8900.73 | 3511.70 | 3832.23 | 7442.30 | 8353.79 | 9721.92 | 53569.24 | ▇▁▁▁▁ |
| expbil | 0 | 1.00 | 3037.31 | 20259.26 | 0.00 | 0.00 | 0.00 | 0.00 | 295140.00 | ▇▁▁▁▁ |
| expocc | 0 | 1.00 | 1104.21 | 2732.45 | 0.00 | 0.00 | 0.00 | 0.00 | 15088.00 | ▇▁▁▁▁ |
| exptot | 0 | 1.00 | 5370.25 | 977.04 | 3465.00 | 4730.25 | 5155.00 | 5788.75 | 9868.00 | ▃▇▂▁▁ |
| scratio | 9 | 0.96 | 8.11 | 2.84 | 2.30 | 6.10 | 7.80 | 9.80 | 18.40 | ▃▇▅▁▁ |
| special | 0 | 1.00 | 15.97 | 3.54 | 8.10 | 13.37 | 15.45 | 17.92 | 34.30 | ▃▇▂▁▁ |
| lunch | 0 | 1.00 | 15.32 | 15.06 | 0.40 | 5.30 | 10.55 | 20.02 | 76.20 | ▇▂▁▁▁ |
| stratio | 0 | 1.00 | 17.34 | 2.28 | 11.40 | 15.80 | 17.10 | 19.03 | 27.00 | ▂▇▆▁▁ |
| income | 0 | 1.00 | 18.75 | 5.81 | 9.69 | 15.22 | 17.13 | 20.38 | 46.85 | ▇▆▂▁▁ |
| score4 | 0 | 1.00 | 709.83 | 15.13 | 658.00 | 701.00 | 711.00 | 720.00 | 740.00 | ▁▁▆▇▃ |
| score8 | 40 | 0.82 | 698.41 | 21.05 | 641.00 | 685.00 | 698.00 | 712.00 | 747.00 | ▁▃▇▅▂ |
| salary | 25 | 0.89 | 35.99 | 3.19 | 24.97 | 33.80 | 35.88 | 37.96 | 44.49 | ▁▂▇▅▂ |
| english | 0 | 1.00 | 1.12 | 2.90 | 0.00 | 0.00 | 0.00 | 0.89 | 24.49 | ▇▁▁▁▁ |
The variables with missing data (scratio, score8, salary) will not be used in our model, so we can safely ignore them.
Dependent Variable:
score4 - 4th grade score (math + English + science)
Independent Variables:
exptot - expenditures per pupil, total
stratio - student teacher ratio
income - per capita income
Regression Equation:
\[ \text{score4} = \beta_0 + \beta_1\,\text{exptot} + \beta_2\,\text{stratio} + \beta_3\,\text{income} + \epsilon \]
# fit the baseline OLS model
lm_mod <- lm(data = df, formula = score4 ~ exptot + stratio + income)
summary(lm_mod)
##
## Call:
## lm(formula = score4 ~ exptot + stratio + income, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.025 -8.336 0.641 6.765 32.373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.354e+02 9.677e+00 75.994 < 2e-16 ***
## exptot -4.722e-03 9.329e-04 -5.062 8.88e-07 ***
## stratio -1.971e+00 3.734e-01 -5.278 3.17e-07 ***
## income 1.812e+00 1.392e-01 13.016 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.01 on 216 degrees of freedom
## Multiple R-squared: 0.4772, Adjusted R-squared: 0.4699
## F-statistic: 65.72 on 3 and 216 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2)) # arrange the four diagnostic plots in a 2x2 grid
plot(lm_mod)
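To focus on the relevant panel alone, the plot method for lm objects accepts a which argument (which = 3 selects the scale-location plot):

# draw only the scale-location plot
plot(lm_mod, which = 3)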
The scale-location plot is used to detect heteroskedasticity. In this case, the assumption of homoskedasticity does appear to be violated: the residuals are not equally spread across the range of fitted values and the red trend line is not approximately horizontal. We can assess this formally using the White test.
# White test from the skedastic package; interactions = TRUE adds the
# pairwise interaction terms to the auxiliary regression
white_lm_mod <- white(mainlm = lm_mod, interactions = TRUE)
white_lm_mod
## # A tibble: 1 × 5
## statistic p.value parameter method alternative
## <dbl> <dbl> <dbl> <chr> <chr>
## 1 31.1 0.000286 9 White's Test greater
The test statistic has a value of 31.09 and the p-value is 0.00029. Since the p-value is below the 0.05 significance level, we reject the null hypothesis of homoskedasticity and conclude that heteroskedasticity is present.
# extract the residuals from the fitted model
df$residuals <- resid(object = lm_mod)
summary(df$residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -34.0253 -8.3361 0.6415 0.0000 6.7654 32.3730
# square the residuals for use in the auxiliary regression
df$squared_residuals <- (df$residuals)^2
summary(df$squared_residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0016 9.2648 53.8435 119.0766 152.9238 1157.7216
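As a quick visual check (not part of the White test itself), we can plot the squared residuals against the fitted values; under homoskedasticity there should be no systematic pattern:

# squared residuals vs fitted values: a fan or trend suggests heteroskedasticity
plot(fitted(lm_mod), df$squared_residuals,
     xlab = "fitted values", ylab = "squared residuals")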
Let’s check whether running the auxiliary regression manually gives us the same result.
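For our three-regressor model, the auxiliary regression takes the squared residuals as the response and includes the regressors, their squares, and all pairwise interactions:

\[ \begin{aligned} \hat{\epsilon}^2 = {} & \alpha_0 + \alpha_1\,\text{exptot} + \alpha_2\,\text{stratio} + \alpha_3\,\text{income} \\ & + \alpha_4\,\text{exptot}^2 + \alpha_5\,\text{stratio}^2 + \alpha_6\,\text{income}^2 \\ & + \alpha_7\,\text{exptot}\cdot\text{stratio} + \alpha_8\,\text{stratio}\cdot\text{income} + \alpha_9\,\text{exptot}\cdot\text{income} + v \end{aligned} \]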
white_auxreg_lm <- lm(formula = squared_residuals ~ exptot + stratio + income +
                        I(exptot^2) + I(stratio^2) + I(income^2) +
                        exptot:stratio + stratio:income + exptot:income,
                      data = df)
white_auxreg_lm_summary <- summary(white_auxreg_lm)
white_auxreg_lm_summary
##
## Call:
## lm(formula = squared_residuals ~ exptot + stratio + income +
## I(exptot^2) + I(stratio^2) + I(income^2) + exptot:stratio +
## stratio:income + exptot:income, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -284.41 -95.28 -40.89 42.18 915.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.511e+02 9.041e+02 -0.499 0.6184
## exptot 5.049e-02 1.704e-01 0.296 0.7673
## stratio 6.183e+01 6.288e+01 0.983 0.3266
## income -6.775e+00 3.059e+01 -0.221 0.8249
## I(exptot^2) 6.192e-06 9.823e-06 0.630 0.5292
## I(stratio^2) -1.527e+00 1.492e+00 -1.024 0.3072
## I(income^2) 1.154e+00 2.656e-01 4.344 2.18e-05 ***
## exptot:stratio 1.251e-03 4.879e-03 0.256 0.7979
## stratio:income -1.079e+00 1.317e+00 -0.819 0.4136
## exptot:income -5.584e-03 2.740e-03 -2.038 0.0428 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 163.7 on 210 degrees of freedom
## Multiple R-squared: 0.1413, Adjusted R-squared: 0.1045
## F-statistic: 3.84 on 9 and 210 DF, p-value: 0.0001618
# calculate the p-value: under the null, n * R^2 from the auxiliary regression
# follows a chi-squared distribution with 9 degrees of freedom
# (3 regressors + 3 squared terms + 3 interaction terms)
chisq_p_value <- pchisq(q = white_auxreg_lm_summary$r.squared * nobs(white_auxreg_lm),
                        df = 9, lower.tail = FALSE)
chisq_p_value
## [1] 0.000285544
# calculate the test statistic: the number of observations multiplied by
# the R-squared of the auxiliary regression
test_stat <- white_auxreg_lm_summary$r.squared * nobs(white_auxreg_lm)
test_stat
## [1] 31.09129
The White test and the manual auxiliary regression produce the same
test statistic and p-value. We therefore reach the same conclusion:
heteroskedasticity is present, since the p-value is less than 0.05. The
\(R^2\) of the auxiliary regression is 0.14, which is relatively low.
This suggests that the variation in the squared residuals is only
weakly explained by the independent variables included in the model.
Note that a low \(R^2\) can still yield a significant test statistic,
because the statistic scales with the sample size: here,
220 × 0.1413 ≈ 31.09.