Heteroskedasticity is the existence of variance of residuals across different segments of the population. In other words, the unobserved factors’ variance differs at different levels of the independent variable. Heteroskedasticity in the model results puts into question the standard errors of the estimators, which means the hypothesis tests and confidence intervals related to the significance of coefficients of independent variables may be unreliable.
In both the Breusch-Pagan and White tests, the null hypothesis is that the assumption of homoskedasticity holds, and the alternative is heteroskedasticity. The Breusch-Pagan test uses an auxillary regression wherein the sum of squared residuals is estimated using the independent variables. If any of the independent variables’ coefficients are significantly different from zero, that indicates heteroskedasticity. Instead of regressing squared residuals on all independent variables, the White test estimates it using the dependent variable and its squared version.
From what I have read, the White test is more general and better suited for testing for heteroskedasticity with no specific cause in mind, whereas the Breusch-Pagan test is more suitable for models where you suspect a particular variable that is causing the problem.
data("MASchools")
I chose the MASchools dataset, which contains 220 observations of these 16 variables:
Variable | Definition |
---|---|
district | District code. |
municipality | Municipality name |
expreg | Expenditures per pupil, regular |
expspecial | Expenditures per pupil, special |
expbil | Expenditures per pupil, bilingual |
expocc | Expenditures per pupil, occupational |
exptot | Expenditures per pupil, total |
scratio | Students per computer |
special | Special ed students |
lunch | Expenditures per pupil, occupational |
stratio | student-teacher ratio |
income | Per capita income |
score4 | 4th grade scores |
score8 | 8th grade scores |
salary | Average teacher salary |
english | Percent of English learners |
I’d like to see the relationship between 8th grade scores and 4th grade scores, expenditure per student, and per capita income of the municipality. First I removed a few variables with higher missingness (salary and scratio), then looked at the distributions of the variables to determine if any skew is present.
## Warning: Removed 40 rows containing non-finite values (`stat_bin()`).
##
## Summary Statistics
## --------------------------------------------------------
## Statistic N Mean St. Dev. Min Max
## ========================================================
## expreg 180 4,709.656 867.316 3,023 8,759
## expspecial 180 8,919.599 3,767.501 3,832.230 53,569.240
## expbil 180 3,677.628 22,353.440 0 295,140
## expocc 180 1,349.589 2,966.767 0 15,088
## exptot 180 5,464.761 961.078 3,868 9,868
## special 180 16.053 3.265 10.400 26.000
## lunch 180 16.057 15.951 0.400 76.200
## stratio 180 17.274 2.209 11.400 27.000
## income 180 18.739 5.619 9.686 46.855
## score4 180 709.000 15.798 658 740
## score8 180 698.411 21.053 641 747
## english 180 1.258 3.072 0.000 24.494
## ========================================================
While the test scores appear normally distributed, the remaining variables (both explanatory) are skewed. For this reason, I chose a level log approach when building the simple model.
eighth_grade_scores <- lm(score8 ~ log(score4) + log(exptot) + log(income), data = MASchools)
summary(eighth_grade_scores)
##
## Call:
## lm(formula = score8 ~ log(score4) + log(exptot) + log(income),
## data = MASchools)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.0143 -5.4394 -0.4751 5.9285 27.6716
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2466.738 298.385 -8.267 3.24e-14 ***
## log(score4) 479.323 44.470 10.779 < 2e-16 ***
## log(exptot) -10.970 4.725 -2.322 0.0214 *
## log(income) 39.170 4.145 9.449 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.69 on 176 degrees of freedom
## Multiple R-squared: 0.8325, Adjusted R-squared: 0.8296
## F-statistic: 291.5 on 3 and 176 DF, p-value: < 2.2e-16
residualPlots(eighth_grade_scores)
## Test stat Pr(>|Test stat|)
## log(score4) -0.7526 0.4527
## log(exptot) 0.2923 0.7704
## log(income) -1.0154 0.3113
## Tukey test -1.1215 0.2621
plot(eighth_grade_scores)
Visually, the model I’ve created does not appear to have heteroskedasticity. I will test that with the skedastic package’s White and Breusch-Pagan test functions.
white_test <- skedastic::white(mainlm = eighth_grade_scores, interactions = TRUE)
print(white_test)
## # A tibble: 1 × 5
## statistic p.value parameter method alternative
## <dbl> <dbl> <dbl> <chr> <chr>
## 1 11.0 0.273 9 White's Test greater
With a p-value of 0.27, we reject to fail the null hypothesis of homoskedasticity.
MASchools$residuals <- resid(object = eighth_grade_scores)
summary(MASchools$residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -22.0143 -5.4394 -0.4751 0.0000 5.9285 27.6716
#mean is zero, which makes sense
MASchools$squared_residuals <- MASchools$residuals^2
summary(MASchools$squared_residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0008 7.7515 31.6603 73.8365 109.0621 765.7162
Now that we have the data required, we can run the auxillary regression.
auxillary_regression <- lm(squared_residuals ~ log(score4) + log(exptot) + log(income) + I(log(score4)^2) + I(log(exptot)^2) + I(log(income)^2) + log(score4):log(exptot) + log(score4):log(income) + log(exptot):log(income), data = MASchools)
summary(auxillary_regression)
##
## Call:
## lm(formula = squared_residuals ~ log(score4) + log(exptot) +
## log(income) + I(log(score4)^2) + I(log(exptot)^2) + I(log(income)^2) +
## log(score4):log(exptot) + log(score4):log(income) + log(exptot):log(income),
## data = MASchools)
##
## Residuals:
## Min 1Q Median 3Q Max
## -128.50 -61.25 -37.28 28.70 688.79
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1821952.0 1109424.4 1.642 0.1024
## log(score4) -573739.5 331251.0 -1.732 0.0851 .
## log(exptot) -790.6 25166.4 -0.031 0.9750
## log(income) 43075.9 26286.0 1.639 0.1031
## I(log(score4)^2) 44926.5 24879.9 1.806 0.0727 .
## I(log(exptot)^2) -204.2 286.9 -0.712 0.4776
## I(log(income)^2) 42.1 212.0 0.199 0.8428
## log(score4):log(exptot) 545.7 3563.4 0.153 0.8785
## log(score4):log(income) -6928.7 3942.8 -1.757 0.0807 .
## log(exptot):log(income) 246.9 385.7 0.640 0.5228
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 108 on 170 degrees of freedom
## Multiple R-squared: 0.06129, Adjusted R-squared: 0.0116
## F-statistic: 1.233 on 9 and 170 DF, p-value: 0.2777
The low r-squared value demonstrates that the squared residuals are not well explained by the independent variables. This suggests that there is not a significant pattern of heteroskedasticity. Combining this auxillary analysis with the visual test of the plot of residuals presents strong evidence that the assumption of homoskedasticity holds.
pchisq(q = summary(auxillary_regression)$r.squared * nobs(auxillary_regression), df = 9, lower.tail = FALSE)
## [1] 0.2734816
print(white_test)
## # A tibble: 1 × 5
## statistic p.value parameter method alternative
## <dbl> <dbl> <dbl> <chr> <chr>
## 1 11.0 0.273 9 White's Test greater
P value from the chi-square test statistic matches the p-value from the white() function.