1. What is heteroskedasticity? Breusch-Pagan and White tests

Heteroskedasticity means the variance of the residuals is not constant across observations. In other words, the unobserved factors’ variance differs at different levels of the independent variables. Heteroskedasticity puts the standard errors of the estimators into question, which means the hypothesis tests and confidence intervals for the significance of the coefficients of the independent variables may be unreliable.
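To see what this looks like, here is a minimal simulated illustration (my own toy example, not part of the exercise data): the error standard deviation grows with x, so the residuals fan out as the fitted values increase.

set.seed(42)
n <- 200
x <- runif(n, min = 1, max = 10)
u <- rnorm(n, mean = 0, sd = 0.5 * x)  # error sd grows with x: heteroskedastic by construction
y <- 2 + 3 * x + u
het_fit <- lm(y ~ x)
plot(fitted(het_fit), resid(het_fit),
     xlab = "Fitted values", ylab = "Residuals")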

In both the Breusch-Pagan and White tests, the null hypothesis is that the assumption of homoskedasticity holds, and the alternative is heteroskedasticity. The Breusch-Pagan test uses an auxiliary regression in which the squared residuals (not their sum) are regressed on the independent variables. If any of the independent variables’ coefficients are significantly different from zero, that indicates heteroskedasticity. The White test instead regresses the squared residuals on the independent variables plus their squares and pairwise interactions; a simplified version regresses them on the fitted values and the squared fitted values.
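The Breusch-Pagan auxiliary regression is easy to run by hand. A minimal sketch, assuming a fitted model fit with two regressors x1 and x2 in a data frame dat (all hypothetical names):

u2 <- resid(fit)^2                               # squared residuals, not their sum
aux <- lm(u2 ~ x1 + x2, data = dat)              # auxiliary regression on the original regressors
lm_stat <- nobs(aux) * summary(aux)$r.squared    # LM statistic = n * R^2
pchisq(lm_stat, df = 2, lower.tail = FALSE)      # df = number of regressors (2 here)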

From what I have read, the White test is more general and better suited to testing for heteroskedasticity with no specific cause in mind, whereas the Breusch-Pagan test is better suited to models where you suspect a particular variable is causing the problem.

2. Heteroskedasticity exercise

data("MASchools")

I chose the MASchools dataset, which contains 220 observations of these 16 variables:

Variable     Definition
district     District code
municipality Municipality name
expreg       Expenditures per pupil, regular
expspecial   Expenditures per pupil, special needs
expbil       Expenditures per pupil, bilingual
expocc       Expenditures per pupil, occupational
exptot       Expenditures per pupil, total
scratio      Students per computer
special      Percent of special education students
lunch        Percent qualifying for reduced-price lunch
stratio      Student-teacher ratio
income       Per capita income
score4       4th grade scores
score8       8th grade scores
salary       Average teacher salary
english      Percent of English learners

I’d like to see the relationship between 8th grade scores and 4th grade scores, expenditure per pupil, and the municipality’s per capita income. First I removed the two variables with the most missingness (salary and scratio), then looked at the distributions of the remaining variables to determine if any skew is present.
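The preparation code was not echoed above; here is a sketch of that step (the tidyr/ggplot2 reshaping is my assumption about how the faceted histograms were produced):

library(ggplot2)
library(tidyr)

# Drop the two columns with the most missing values
MASchools <- subset(MASchools, select = -c(salary, scratio))

# Long format so each numeric variable gets its own histogram panel
ma_long <- pivot_longer(MASchools, cols = where(is.numeric),
                        names_to = "variable", values_to = "value")
ggplot(ma_long, aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ variable, scales = "free")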

## Warning: Removed 40 rows containing non-finite values (`stat_bin()`).

(Faceted histograms of each variable appear here in the rendered post; the warning reflects the rows with missing score8.)

## 
## Summary Statistics
## --------------------------------------------------------
## Statistic   N    Mean     St. Dev.     Min       Max    
## ========================================================
## expreg     180 4,709.656  867.316     3,023     8,759   
## expspecial 180 8,919.599 3,767.501  3,832.230 53,569.240
## expbil     180 3,677.628 22,353.440     0      295,140  
## expocc     180 1,349.589 2,966.767      0       15,088  
## exptot     180 5,464.761  961.078     3,868     9,868   
## special    180  16.053     3.265     10.400     26.000  
## lunch      180  16.057     15.951     0.400     76.200  
## stratio    180  17.274     2.209     11.400     27.000  
## income     180  18.739     5.619      9.686     46.855  
## score4     180  709.000    15.798      658       740    
## score8     180  698.411    21.053      641       747    
## english    180   1.258     3.072      0.000     24.494  
## ========================================================

While the test scores appear normally distributed, the remaining variables, all candidates for the explanatory side, are right-skewed. For this reason, I chose a level-log specification for the simple model: score8 stays in levels while the regressors enter in logs.

eighth_grade_scores <- lm(score8 ~ log(score4) + log(exptot) + log(income), data = MASchools)
summary(eighth_grade_scores)
## 
## Call:
## lm(formula = score8 ~ log(score4) + log(exptot) + log(income), 
##     data = MASchools)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.0143  -5.4394  -0.4751   5.9285  27.6716 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2466.738    298.385  -8.267 3.24e-14 ***
## log(score4)   479.323     44.470  10.779  < 2e-16 ***
## log(exptot)   -10.970      4.725  -2.322   0.0214 *  
## log(income)    39.170      4.145   9.449  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.69 on 176 degrees of freedom
## Multiple R-squared:  0.8325, Adjusted R-squared:  0.8296 
## F-statistic: 291.5 on 3 and 176 DF,  p-value: < 2.2e-16
library(car)  # residualPlots() is from the car package
residualPlots(eighth_grade_scores)

##             Test stat Pr(>|Test stat|)
## log(score4)   -0.7526           0.4527
## log(exptot)    0.2923           0.7704
## log(income)   -1.0154           0.3113
## Tukey test    -1.1215           0.2621
plot(eighth_grade_scores)

Visually, the model I’ve created does not appear to suffer from heteroskedasticity. I will test that formally with the skedastic package’s White and Breusch-Pagan test functions.

white_test <- skedastic::white(mainlm = eighth_grade_scores, interactions = TRUE)
print(white_test)
## # A tibble: 1 × 5
##   statistic p.value parameter method       alternative
##       <dbl>   <dbl>     <dbl> <chr>        <chr>      
## 1      11.0   0.273         9 White's Test greater

With a p-value of 0.27, we fail to reject the null hypothesis of homoskedasticity.
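For completeness, here is the Breusch-Pagan side of the comparison. The skedastic package provides breusch_pagan() (I am assuming its interface mirrors white()’s mainlm argument); lmtest::bptest() is a widely used equivalent.

bp_test <- skedastic::breusch_pagan(mainlm = eighth_grade_scores)
print(bp_test)
lmtest::bptest(eighth_grade_scores)  # same null hypothesis of homoskedasticity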

MASchools$residuals <- resid(object = eighth_grade_scores)

summary(MASchools$residuals)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -22.0143  -5.4394  -0.4751   0.0000   5.9285  27.6716
#mean is zero, which makes sense
MASchools$squared_residuals <- MASchools$residuals^2 

summary(MASchools$squared_residuals)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0008   7.7515  31.6603  73.8365 109.0621 765.7162

Now that we have the data required, we can run the auxiliary regression: the squared residuals regressed on the regressors, their squares, and their pairwise interactions, mirroring what white() does with interactions = TRUE.

auxiliary_regression <- lm(
  squared_residuals ~ log(score4) + log(exptot) + log(income) +
    I(log(score4)^2) + I(log(exptot)^2) + I(log(income)^2) +
    log(score4):log(exptot) + log(score4):log(income) + log(exptot):log(income),
  data = MASchools
)

summary(auxiliary_regression)
## 
## Call:
## lm(formula = squared_residuals ~ log(score4) + log(exptot) + 
##     log(income) + I(log(score4)^2) + I(log(exptot)^2) + I(log(income)^2) + 
##     log(score4):log(exptot) + log(score4):log(income) + log(exptot):log(income), 
##     data = MASchools)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -128.50  -61.25  -37.28   28.70  688.79 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)  
## (Intercept)             1821952.0  1109424.4   1.642   0.1024  
## log(score4)             -573739.5   331251.0  -1.732   0.0851 .
## log(exptot)                -790.6    25166.4  -0.031   0.9750  
## log(income)               43075.9    26286.0   1.639   0.1031  
## I(log(score4)^2)          44926.5    24879.9   1.806   0.0727 .
## I(log(exptot)^2)           -204.2      286.9  -0.712   0.4776  
## I(log(income)^2)             42.1      212.0   0.199   0.8428  
## log(score4):log(exptot)     545.7     3563.4   0.153   0.8785  
## log(score4):log(income)   -6928.7     3942.8  -1.757   0.0807 .
## log(exptot):log(income)     246.9      385.7   0.640   0.5228  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 108 on 170 degrees of freedom
## Multiple R-squared:  0.06129,    Adjusted R-squared:  0.0116 
## F-statistic: 1.233 on 9 and 170 DF,  p-value: 0.2777

The low R-squared shows that the squared residuals are not well explained by the regressors, their squares, or their interactions, which points to no systematic pattern of heteroskedasticity. Combined with the visual inspection of the residual plots, this auxiliary analysis is strong evidence that the homoskedasticity assumption holds. Since White’s test statistic is n times this R-squared, distributed chi-squared with degrees of freedom equal to the number of auxiliary regressors (9 here), we can recompute the p-value by hand:

# White's statistic is n * R^2 from the auxiliary regression,
# chi-squared with df = 9 (the number of auxiliary regressors)
pchisq(q = summary(auxiliary_regression)$r.squared * nobs(auxiliary_regression), df = 9, lower.tail = FALSE)
## [1] 0.2734816
print(white_test)
## # A tibble: 1 × 5
##   statistic p.value parameter method       alternative
##       <dbl>   <dbl>     <dbl> <chr>        <chr>      
## 1      11.0   0.273         9 White's Test greater

The p-value computed from the chi-squared test statistic matches the p-value from the white() function, confirming that the manual auxiliary regression reproduces White’s test.
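Had the tests rejected homoskedasticity instead, the standard remedy would be heteroskedasticity-robust (White) standard errors rather than re-estimating the coefficients; a sketch with the sandwich and lmtest packages:

library(lmtest)
library(sandwich)

# Identical point estimates, but HC1 (White) robust standard errors
coeftest(eighth_grade_scores, vcov = vcovHC(eighth_grade_scores, type = "HC1"))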