Issue Summary

  1. In a regression model, heteroskedasticity occurs when the variance of the error terms is not constant across the levels of the independent variables. In other words, the spread of the residuals changes along the range of the independent variable.

    The presence of heteroskedasticity violates the assumption of constant variance in the error terms. When this assumption is violated, the usual standard errors of the regression coefficients are biased, which leads to invalid conclusions about statistical significance. Furthermore, the OLS estimator is no longer efficient, so the confidence intervals become unreliable. A short simulation after this list illustrates the pattern.

  2. White Test

    • Null Hypothesis - Heteroskedasticity is not present. The variance of the error terms is constant for all levels of the independent variable.

    • Alternative Hypothesis - Heteroskedasticity is present. The variance of the error terms is not constant for all levels of the independent variable.

    The logic behind the White test is to look for a relationship between the variance of the residuals and the independent variables by running an auxiliary regression: the squared residuals are regressed on the regressors, their squares, and their cross products. The conclusion is drawn from the p-value of the resulting test statistic (shown just below): if the p-value is less than the significance level, the null hypothesis of homoskedasticity is rejected (heteroskedasticity is present).
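Formally, the White test statistic is the sample size multiplied by the \(R^2\) of the auxiliary regression; under the null hypothesis it is asymptotically chi-squared distributed, with degrees of freedom equal to the number of regressors in the auxiliary regression (excluding the intercept):

\[ W = n R^2_{aux} \overset{a}{\sim} \chi^2_k \]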
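To make the idea concrete, here is a minimal simulation sketch (an illustration only, not part of the analysis that follows) in which the error spread grows with the regressor, producing the classic fan-shaped residual pattern:

# hypothetical example: errors whose standard deviation grows with x
set.seed(123)
x <- runif(200, min = 1, max = 10)
y <- 2 + 3 * x + rnorm(200, sd = x)    # error sd proportional to x
sim_mod <- lm(y ~ x)
plot(fitted(sim_mod), resid(sim_mod))  # residual spread widens with fitted values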

Coding

Import Packages & Dataset

data("MASchools")
df <- MASchools

Overview of Dataset

The MASchools dataset in the AER package records test performance, school characteristics and student demographic backgrounds for school districts in Massachusetts.

skim(df)
Data summary
Name df
Number of rows 220
Number of columns 16
_______________________
Column type frequency:
character 2
numeric 14
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
district 0 1 1 3 0 220 0
municipality 0 1 3 18 0 220 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
expreg 0 1.00 4605.46 880.25 2905.00 4065.00 4488.50 4971.75 8759.00 ▃▇▂▁▁
expspecial 0 1.00 8900.73 3511.70 3832.23 7442.30 8353.79 9721.92 53569.24 ▇▁▁▁▁
expbil 0 1.00 3037.31 20259.26 0.00 0.00 0.00 0.00 295140.00 ▇▁▁▁▁
expocc 0 1.00 1104.21 2732.45 0.00 0.00 0.00 0.00 15088.00 ▇▁▁▁▁
exptot 0 1.00 5370.25 977.04 3465.00 4730.25 5155.00 5788.75 9868.00 ▃▇▂▁▁
scratio 9 0.96 8.11 2.84 2.30 6.10 7.80 9.80 18.40 ▃▇▅▁▁
special 0 1.00 15.97 3.54 8.10 13.37 15.45 17.92 34.30 ▃▇▂▁▁
lunch 0 1.00 15.32 15.06 0.40 5.30 10.55 20.02 76.20 ▇▂▁▁▁
stratio 0 1.00 17.34 2.28 11.40 15.80 17.10 19.03 27.00 ▂▇▆▁▁
income 0 1.00 18.75 5.81 9.69 15.22 17.13 20.38 46.85 ▇▆▂▁▁
score4 0 1.00 709.83 15.13 658.00 701.00 711.00 720.00 740.00 ▁▁▆▇▃
score8 40 0.82 698.41 21.05 641.00 685.00 698.00 712.00 747.00 ▁▃▇▅▂
salary 25 0.89 35.99 3.19 24.97 33.80 35.88 37.96 44.49 ▁▂▇▅▂
english 0 1.00 1.12 2.90 0.00 0.00 0.00 0.89 24.49 ▇▁▁▁▁

The variables with missing data will not be used in our model, so we can safely ignore the missing values.

Regression Model

Dependent Variable:

score4 - 4th grade score (math + English + science)

Independent Variables:

exptot - expenditures per pupil, total

stratio - student teacher ratio

income - per capita income

Regression Equation:

\[ \text{score4} = \beta_0 + \beta_1\,\text{exptot} + \beta_2\,\text{stratio} + \beta_3\,\text{income} + \epsilon \]

lm_mod <- lm(data = df, formula = score4 ~ exptot + stratio + income)

summary(lm_mod)
## 
## Call:
## lm(formula = score4 ~ exptot + stratio + income, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.025  -8.336   0.641   6.765  32.373 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.354e+02  9.677e+00  75.994  < 2e-16 ***
## exptot      -4.722e-03  9.329e-04  -5.062 8.88e-07 ***
## stratio     -1.971e+00  3.734e-01  -5.278 3.17e-07 ***
## income       1.812e+00  1.392e-01  13.016  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.01 on 216 degrees of freedom
## Multiple R-squared:  0.4772, Adjusted R-squared:  0.4699 
## F-statistic: 65.72 on 3 and 216 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2 x 2 grid

plot(lm_mod)          # regression diagnostics, including the scale-location plot

The scale-location plot is used to detect heteroskedasticity. In this case, the assumption of homoskedasticity does appear to be violated: the residuals are not equally spread across the range of fitted values, and the red line is not approximately horizontal. We can assess this formally using the White test; a manual version of the plot is sketched below for reference.
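The scale-location panel can be reproduced by hand. This sketch (an illustration, not part of the original diagnostics) plots the square root of the absolute standardized residuals against the fitted values, with a lowess smoother standing in for the red trend line:

# manual version of the scale-location plot
plot(fitted(lm_mod), sqrt(abs(rstandard(lm_mod))),
     xlab = "Fitted values", ylab = "sqrt(|standardized residuals|)")
lines(lowess(fitted(lm_mod), sqrt(abs(rstandard(lm_mod)))), col = "red")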

White Test

white_lm_mod <- white(mainlm = lm_mod, interactions = TRUE)  # interactions = TRUE includes the cross products

white_lm_mod
## # A tibble: 1 × 5
##   statistic  p.value parameter method       alternative
##       <dbl>    <dbl>     <dbl> <chr>        <chr>      
## 1      31.1 0.000286         9 White's Test greater

The test statistic is 31.09 and the p-value is 0.00029. Since the p-value is below the 0.05 significance level, we reject the null hypothesis of homoskedasticity.
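As a quick cross-check (not part of the original analysis), the studentized Breusch-Pagan test from the lmtest package examines the same null hypothesis, though its auxiliary regression uses only the original regressors, without the squares and cross products:

library(lmtest)

# Breusch-Pagan test on the same fitted model
bptest(lm_mod)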

df$residuals <- resid(object = lm_mod)

summary(df$residuals)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -34.0253  -8.3361   0.6415   0.0000   6.7654  32.3730
df$squared_residuals <- (df$residuals)^2 

summary(df$squared_residuals)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##    0.0016    9.2648   53.8435  119.0766  152.9238 1157.7216

Auxiliary Regression

Let’s check whether running the auxiliary regression by hand reproduces the same test statistic and p-value.

white_auxreg_lm <- lm(formula = squared_residuals ~ exptot + stratio + income +
                        I(exptot^2) + I(stratio^2) + I(income^2) +
                        exptot:stratio + stratio:income + exptot:income,
                      data = df)

white_auxreg_lm_summary <- summary(white_auxreg_lm)
white_auxreg_lm_summary
## 
## Call:
## lm(formula = squared_residuals ~ exptot + stratio + income + 
##     I(exptot^2) + I(stratio^2) + I(income^2) + exptot:stratio + 
##     stratio:income + exptot:income, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -284.41  -95.28  -40.89   42.18  915.34 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -4.511e+02  9.041e+02  -0.499   0.6184    
## exptot          5.049e-02  1.704e-01   0.296   0.7673    
## stratio         6.183e+01  6.288e+01   0.983   0.3266    
## income         -6.775e+00  3.059e+01  -0.221   0.8249    
## I(exptot^2)     6.192e-06  9.823e-06   0.630   0.5292    
## I(stratio^2)   -1.527e+00  1.492e+00  -1.024   0.3072    
## I(income^2)     1.154e+00  2.656e-01   4.344 2.18e-05 ***
## exptot:stratio  1.251e-03  4.879e-03   0.256   0.7979    
## stratio:income -1.079e+00  1.317e+00  -0.819   0.4136    
## exptot:income  -5.584e-03  2.740e-03  -2.038   0.0428 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 163.7 on 210 degrees of freedom
## Multiple R-squared:  0.1413, Adjusted R-squared:  0.1045 
## F-statistic:  3.84 on 9 and 210 DF,  p-value: 0.0001618
# test statistic: the auxiliary regression's R-squared multiplied by
# the number of observations
test_stat <- white_auxreg_lm_summary$r.squared * nobs(white_auxreg_lm)

test_stat
## [1] 31.09129
# p-value from the chi-squared distribution with df = 9, the number of
# regressors in the auxiliary regression (excluding the intercept)
chisq_p_value <- pchisq(q = test_stat, df = 9, lower.tail = FALSE)

chisq_p_value
## [1] 0.000285544
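Written out, the manual calculation matches the value reported by white():

\[ W = n R^2_{aux} = 220 \times 0.1413 \approx 31.09, \qquad p = P\big(\chi^2_9 \ge 31.09\big) \approx 0.00029 \]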


The White test and the auxiliary regression produce the same test statistic and p-value, so we reach the same conclusion: heteroskedasticity is present, since the p-value is less than 0.05. The \(R^2\) of the auxiliary regression is 0.14, which is relatively low, meaning the auxiliary regressors explain little of the variation in the squared residuals; nevertheless, with \(n = 220\) observations even this modest \(R^2\) produces a significant \(nR^2\) statistic.
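Given this result, a common next step (not performed above, sketched here only as an illustration) is to re-estimate the standard errors with a heteroskedasticity-consistent (White) covariance matrix from the sandwich package. The coefficient estimates are unchanged; only the standard errors, and hence the t-tests and confidence intervals, are adjusted:

library(lmtest)
library(sandwich)

# heteroskedasticity-robust (HC1) standard errors for the original model
coeftest(lm_mod, vcov = vcovHC(lm_mod, type = "HC1"))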