How to identify and correct for heteroskedasticity

This Vignette will demonstrate how to identify heteroskedasticity within a data set, and the different options that can be taken to correct this.

Heteroskedasticity is the situation whereby the variance of a variable changes across the range of observed values. This is important, as one of the assumptions of linear regression is that the residuals have a constant variance. As such, heteroskedasticity can invalidate the results of a linear regression model if it is present and not corrected.

Identifying heteroskedasticity

We are going to identify heteroskedasticity both visually and by using the ‘gvlma’ package. To do this we are going to be using packages, which I will introduce as we are using these. We are also going to need a regression model, for this example I will be using the built in ‘cars’ data available in R, attempting to explain ‘dist’ from the variable ‘speed’.

lmcars <- lm(dist~speed, data = cars)

We will begin by identifying heteroskedasticity graphically. We will be using the ‘ggplot’ package to help achieve this.

library(ggplot2)
plot(lmcars)

The important plots are the first and third ones. If heteroskedasticity were not present, both plots would have a flat horizontal red line indicating the average residuals of this regression. As the residuals are clumped in areas, we can say that heteroskedasticity is evident in this data set.

The second method we will use is the global validation of linear models assumptions under package ‘gvlma’.

library(gvlma)
gvlma(lmcars)

## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Coefficients:
## (Intercept)        speed  
##     -17.579        3.932  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = lmcars) 
## 
##                     Value  p-value                   Decision
## Global Stat        15.801 0.003298 Assumptions NOT satisfied!
## Skewness            6.528 0.010621 Assumptions NOT satisfied!
## Kurtosis            1.661 0.197449    Assumptions acceptable.
## Link Function       2.329 0.126998    Assumptions acceptable.
## Heteroscedasticity  5.283 0.021530 Assumptions NOT satisfied!

This tests for the linear model assumptions and helpfully provides information on other assumptions. In this case we are going to look at the heteroskedasticity decisions, which has been identified as not being satisfied. We therefore reject the null hypothesis and state that there is heteroskedasticity in this model at the 5% significance level.

Correcting heteroskedasticity

In order to correct for heteroskedasticity, we will be using Box Cox Transformation. This will transform our dependent variable, ‘dist’, to correct our problem. To do this we will be using the ‘caret’ package as below.

library(caret)

## Loading required package: lattice

BCdist <- BoxCoxTrans(cars$dist)

This gives us our model that we will be using. We will use this to create a transformed ‘dist’ and add it to the ‘cars’ database.

cars <- cbind(cars, dist_BC = predict(BCdist, cars$dist))
head(cars)

##   speed dist   dist_BC
## 1     4    2 0.8284271
## 2     4   10 4.3245553
## 3     7    4 2.0000000
## 4     7   22 7.3808315
## 5     8   16 6.0000000
## 6     9   10 4.3245553

We will now build a new regression model using the newly created ‘dist_BC’ variable and test for heteroskedasticity using the ‘gvlma’ package.

BClmcars <- lm(dist_BC ~ speed, data = cars)
gvlma(BClmcars)

## 
## Call:
## lm(formula = dist_BC ~ speed, data = cars)
## 
## Coefficients:
## (Intercept)        speed  
##      0.5541       0.6448  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = BClmcars) 
## 
##                      Value p-value                Decision
## Global Stat        3.03592 0.55183 Assumptions acceptable.
## Skewness           2.81284 0.09351 Assumptions acceptable.
## Kurtosis           0.04915 0.82455 Assumptions acceptable.
## Link Function      0.09269 0.76078 Assumptions acceptable.
## Heteroscedasticity 0.08124 0.77563 Assumptions acceptable.

As we can see this now shows that our assumptions are acceptable, and therefore we state that heteroskedasticity is not present in this model.

Finally, I will demonstrate that we have corrected the model graphically. As per previously, the first and third plots are the ones to look at. You will see that the red lines are much more flat, and that the residuals are no longer clumped in locations. Therefore we say that we have corrected for heteroskedasticity in the model.

plot(BClmcars)