1. What is heteroskedasticity? Breusch-Pagan and White tests

Heteroskedasticity means the variance of the residuals is not constant across observations. In other words, the unobserved factors’ variance differs at different levels of the independent variables. Heteroskedasticity puts the standard errors of the estimators into question, which means the hypothesis tests and confidence intervals for the significance of the coefficients of the independent variables may be unreliable.
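To see what this looks like, here is a minimal simulated illustration (my own toy example, not part of the exercise data): the error standard deviation grows with x, so the residuals fan out as the fitted values increase.

set.seed(42)
n <- 200
x <- runif(n, min = 1, max = 10)
u <- rnorm(n, mean = 0, sd = 0.5 * x)  # error sd grows with x: heteroskedastic by construction
y <- 2 + 3 * x + u
het_fit <- lm(y ~ x)
plot(fitted(het_fit), resid(het_fit),
     xlab = "Fitted values", ylab = "Residuals")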

In both the Breusch-Pagan and White tests, the null hypothesis is that the assumption of homoskedasticity holds, and the alternative is heteroskedasticity. The Breusch-Pagan test uses an auxiliary regression in which the squared residuals (not their sum) are regressed on the independent variables. If any of the independent variables’ coefficients are significantly different from zero, that indicates heteroskedasticity. The White test instead regresses the squared residuals on the independent variables plus their squares and pairwise interactions; a simplified version regresses them on the fitted values and the squared fitted values.
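The Breusch-Pagan auxiliary regression is easy to run by hand. A minimal sketch, assuming a fitted model fit with two regressors x1 and x2 in a data frame dat (all hypothetical names):

u2 <- resid(fit)^2                               # squared residuals, not their sum
aux <- lm(u2 ~ x1 + x2, data = dat)              # auxiliary regression on the original regressors
lm_stat <- nobs(aux) * summary(aux)$r.squared    # LM statistic = n * R^2
pchisq(lm_stat, df = 2, lower.tail = FALSE)      # df = number of regressors (2 here)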

From what I have read, the White test is more general and better suited to testing for heteroskedasticity with no specific cause in mind, whereas the Breusch-Pagan test is better suited to models where you suspect a particular variable is causing the problem.

2. Heteroskedasticity exercise

data("MASchools")

I chose the MASchools dataset, which contains 220 observations of these 16 variables:

Variable     Definition
district     District code
municipality Municipality name
expreg       Expenditures per pupil, regular
expspecial   Expenditures per pupil, special needs
expbil       Expenditures per pupil, bilingual
expocc       Expenditures per pupil, occupational
exptot       Expenditures per pupil, total
scratio      Students per computer
special      Percent of special education students
lunch        Percent qualifying for reduced-price lunch
stratio      Student-teacher ratio
income       Per capita income
score4       4th grade scores
score8       8th grade scores
salary       Average teacher salary
english      Percent of English learners

I’d like to see the relationship between 8th grade scores and 4th grade scores, expenditure per pupil, and the municipality’s per capita income. First I removed the two variables with the most missingness (salary and scratio), then looked at the distributions of the remaining variables to determine if any skew is present.
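The preparation code was not echoed above; here is a sketch of that step (the tidyr/ggplot2 reshaping is my assumption about how the faceted histograms were produced):

library(ggplot2)
library(tidyr)

# Drop the two columns with the most missing values
MASchools <- subset(MASchools, select = -c(salary, scratio))

# Long format so each numeric variable gets its own histogram panel
ma_long <- pivot_longer(MASchools, cols = where(is.numeric),
                        names_to = "variable", values_to = "value")
ggplot(ma_long, aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ variable, scales = "free")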

## Warning: Removed 40 rows containing non-finite values (`stat_bin()`).

(Faceted histograms of each variable appear here in the rendered post; the warning reflects the rows with missing score8.)

## 
## Summary Statistics
## --------------------------------------------------------
## Statistic   N    Mean     St. Dev.     Min       Max    
## ========================================================
## expreg     180 4,709.656  867.316     3,023     8,759   
## expspecial 180 8,919.599 3,767.501  3,832.230 53,569.240
## expbil     180 3,677.628 22,353.440     0      295,140  
## expocc     180 1,349.589 2,966.767      0       15,088  
## exptot     180 5,464.761  961.078     3,868     9,868   
## special    180  16.053     3.265     10.400     26.000  
## lunch      180  16.057     15.951     0.400     76.200  
## stratio    180  17.274     2.209     11.400     27.000  
## income     180  18.739     5.619      9.686     46.855  
## score4     180  709.000    15.798      658       740    
## score8     180  698.411    21.053      641       747    
## english    180   1.258     3.072      0.000     24.494  
## ========================================================

While the test scores appear normally distributed, the remaining variables, all candidates for the explanatory side, are right-skewed. For this reason, I chose a level-log specification for the simple model: score8 stays in levels while the regressors enter in logs.

eighth_grade_scores <- lm(score8 ~ log(score4) + log(exptot) + log(income), data = MASchools)
summary(eighth_grade_scores)
## 
## Call:
## lm(formula = score8 ~ log(score4) + log(exptot) + log(income), 
##     data = MASchools)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22.0143  -5.4394  -0.4751   5.9285  27.6716 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2466.738    298.385  -8.267 3.24e-14 ***
## log(score4)   479.323     44.470  10.779  < 2e-16 ***
## log(exptot)   -10.970      4.725  -2.322   0.0214 *  
## log(income)    39.170      4.145   9.449  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.69 on 176 degrees of freedom
## Multiple R-squared:  0.8325, Adjusted R-squared:  0.8296 
## F-statistic: 291.5 on 3 and 176 DF,  p-value: < 2.2e-16
library(car)  # residualPlots() is from the car package
residualPlots(eighth_grade_scores)

##             Test stat Pr(>|Test stat|)
## log(score4)   -0.7526           0.4527
## log(exptot)    0.2923           0.7704
## log(income)   -1.0154           0.3113
## Tukey test    -1.1215           0.2621
plot(eighth_grade_scores)

Visually, the model I’ve created does not appear to suffer from heteroskedasticity. I will test that formally with the skedastic package’s White and Breusch-Pagan test functions.

white_test <- skedastic::white(mainlm = eighth_grade_scores, interactions = TRUE)
print(white_test)
## # A tibble: 1 × 5
##   statistic p.value parameter method       alternative
##       <dbl>   <dbl>     <dbl> <chr>        <chr>      
## 1      11.0   0.273         9 White's Test greater

With a p-value of 0.27, we fail to reject the null hypothesis of homoskedasticity.
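For completeness, here is the Breusch-Pagan side of the comparison. The skedastic package provides breusch_pagan() (I am assuming its interface mirrors white()’s mainlm argument); lmtest::bptest() is a widely used equivalent.

bp_test <- skedastic::breusch_pagan(mainlm = eighth_grade_scores)
print(bp_test)
lmtest::bptest(eighth_grade_scores)  # same null hypothesis of homoskedasticity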

MASchools$residuals <- resid(object = eighth_grade_scores)

summary(MASchools$residuals)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -22.0143  -5.4394  -0.4751   0.0000   5.9285  27.6716
#mean is zero, which makes sense
MASchools$squared_residuals <- MASchools$residuals^2 

summary(MASchools$squared_residuals)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0008   7.7515  31.6603  73.8365 109.0621 765.7162

Now that we have the data required, we can run the auxiliary regression: the squared residuals regressed on the regressors, their squares, and their pairwise interactions, mirroring what white() does with interactions = TRUE.

auxiliary_regression <- lm(
  squared_residuals ~ log(score4) + log(exptot) + log(income) +
    I(log(score4)^2) + I(log(exptot)^2) + I(log(income)^2) +
    log(score4):log(exptot) + log(score4):log(income) + log(exptot):log(income),
  data = MASchools
)

summary(auxiliary_regression)
## 
## Call:
## lm(formula = squared_residuals ~ log(score4) + log(exptot) + 
##     log(income) + I(log(score4)^2) + I(log(exptot)^2) + I(log(income)^2) + 
##     log(score4):log(exptot) + log(score4):log(income) + log(exptot):log(income), 
##     data = MASchools)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -128.50  -61.25  -37.28   28.70  688.79 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)  
## (Intercept)             1821952.0  1109424.4   1.642   0.1024  
## log(score4)             -573739.5   331251.0  -1.732   0.0851 .
## log(exptot)                -790.6    25166.4  -0.031   0.9750  
## log(income)               43075.9    26286.0   1.639   0.1031  
## I(log(score4)^2)          44926.5    24879.9   1.806   0.0727 .
## I(log(exptot)^2)           -204.2      286.9  -0.712   0.4776  
## I(log(income)^2)             42.1      212.0   0.199   0.8428  
## log(score4):log(exptot)     545.7     3563.4   0.153   0.8785  
## log(score4):log(income)   -6928.7     3942.8  -1.757   0.0807 .
## log(exptot):log(income)     246.9      385.7   0.640   0.5228  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 108 on 170 degrees of freedom
## Multiple R-squared:  0.06129,    Adjusted R-squared:  0.0116 
## F-statistic: 1.233 on 9 and 170 DF,  p-value: 0.2777

The low R-squared shows that the squared residuals are not well explained by the regressors, their squares, or their interactions, which points to no systematic pattern of heteroskedasticity. Combined with the visual inspection of the residual plots, this auxiliary analysis is strong evidence that the homoskedasticity assumption holds. Since White’s test statistic is n times this R-squared, distributed chi-squared with degrees of freedom equal to the number of auxiliary regressors (9 here), we can recompute the p-value by hand:

# White's statistic is n * R^2 from the auxiliary regression,
# chi-squared with df = 9 (the number of auxiliary regressors)
pchisq(q = summary(auxiliary_regression)$r.squared * nobs(auxiliary_regression), df = 9, lower.tail = FALSE)
## [1] 0.2734816
print(white_test)
## # A tibble: 1 × 5
##   statistic p.value parameter method       alternative
##       <dbl>   <dbl>     <dbl> <chr>        <chr>      
## 1      11.0   0.273         9 White's Test greater

The p-value computed from the chi-squared test statistic matches the p-value from the white() function, confirming that the manual auxiliary regression reproduces White’s test.
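Had the tests rejected homoskedasticity instead, the standard remedy would be heteroskedasticity-robust (White) standard errors rather than re-estimating the coefficients; a sketch with the sandwich and lmtest packages:

library(lmtest)
library(sandwich)

# Identical point estimates, but HC1 (White) robust standard errors
coeftest(eighth_grade_scores, vcov = vcovHC(eighth_grade_scores, type = "HC1"))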