Introduction

Heteroskedasticity occurs when the variance is not the same for all observations in a data set. In this demonstration, we examine the consequences of heteroskedasticity, look at ways to detect it, and see how we can correct for it using regression with robust standard errors and weighted least squares regression.

Consequences of Heteroskedasticity

As mentioned previously, heteroskedasticity occurs when the variance is not the same for all observations in a data set. Conversely, when the variance is equal for all observations, we call that homoskedasticity. Why should we care about heteroskedasticity? Because it violates the ordinary least squares assumption that $\text{var}(y_i) = \text{var}(e_i) = \sigma^2$. In the presence of heteroskedasticity, there are two main consequences for the least squares estimators:

  1. The least squares estimator is still a linear and unbiased estimator, but it is no longer best. That is, there is another estimator with a smaller variance.
  2. The standard errors computed for the least squares estimators are incorrect. This can affect confidence intervals and hypothesis testing that use those standard errors, which could lead to misleading conclusions.

Most real-world data will probably be heteroskedastic to some degree. Even so, one can still use ordinary least squares without correcting for heteroskedasticity: if the sample size is large enough, the variance of the least squares estimator may still be sufficiently small to obtain precise estimates.

Detecting Heteroskedasticity

Residual Plots

One informal way of detecting heteroskedasticity is to create a residual plot, where you plot the least squares residuals against the explanatory variable, or against $\hat{y}$ if it's a multiple regression. If there is an evident pattern in the plot, then heteroskedasticity is likely present. Let's work through a basic example using a household food expenditure dataset, where our dependent variable is household monthly food expenditure and our independent variable is income.

food <- read.csv('/Users/cyobero/Documents/food.csv')
head(food)
##   food_exp income
## 1   115.22   3.69
## 2   135.98   4.39
## 3   119.34   4.75
## 4   114.96   6.03
## 5   187.05  12.47
## 6   243.92  12.98
summary(food)
##     food_exp         income     
##  Min.   :109.7   Min.   : 3.69  
##  1st Qu.:200.4   1st Qu.:17.11  
##  Median :264.5   Median :20.03  
##  Mean   :283.6   Mean   :19.60  
##  3rd Qu.:363.3   3rd Qu.:24.40  
##  Max.   :587.7   Max.   :33.40
food.ols <- lm(food_exp ~ income, data = food)
food$resi <- food.ols$residuals
library(ggplot2)
ggplot(data = food, aes(y = resi, x = income)) + geom_point(col = 'blue') + geom_abline(slope = 0)

There is no single obvious pattern, but there does appear to be more variation in food expenditures for households with higher levels of income, which is a hint of heteroskedasticity.

The Breusch-Pagan Test

A more formal, mathematical way of detecting heteroskedasticity is what is known as the Breusch-Pagan test. It involves specifying a variance function and using a $\chi^2$ test of the null hypothesis that heteroskedasticity is not present (i.e. the errors are homoskedastic) against the alternative hypothesis that heteroskedasticity is present.

To start, we need a variance function: a function that relates the variance to a set of explanatory variables $z_{i1}, z_{i2}, \ldots, z_{iS}$ that are potentially different from $x_{i1}, x_{i2}, \ldots, x_{iK}$. A general form of the variance function is

$$\text{var}(y_i) = \sigma_i^2 = E(e_i^2) = h(\alpha_1 + \alpha_2 z_{i2} + \alpha_3 z_{i3} + \cdots + \alpha_S z_{iS})$$

Notice in the above equation that the variance of $y_i$ changes for each observation depending on the values of the $z$'s. If $\alpha_2 = \alpha_3 = \cdots = \alpha_S = 0$, then we have constant variance and heteroskedasticity is not present. The Breusch-Pagan test therefore uses the following null and alternative hypotheses

$$H_0: \alpha_2 = \alpha_3 = \cdots = \alpha_S = 0 \qquad H_1: \text{at least one of the } \alpha\text{'s is not zero}$$

We reject the null hypothesis in favor of the alternative if $\chi^2 \geq \chi^2_{(1-\alpha,\, S-1)}$ (we'll get to this a bit later). To obtain a test statistic for our hypothesis test, we consider the linear variance function $h(\alpha_1 + \alpha_2 z_{i2} + \cdots + \alpha_S z_{iS}) = \alpha_1 + \alpha_2 z_{i2} + \cdots + \alpha_S z_{iS}$ and substitute it into $\text{var}(y_i) = \sigma_i^2 = E(e_i^2)$ to obtain

$$\sigma_i^2 = E(e_i^2) = \alpha_1 + \alpha_2 z_{i2} + \cdots + \alpha_S z_{iS}$$

Then, let $v_i = e_i^2 - E(e_i^2)$ denote the difference between a squared error term and its mean. From the above equation, we can write

$$e_i^2 = E(e_i^2) + v_i = \alpha_1 + \alpha_2 z_{i2} + \cdots + \alpha_S z_{iS} + v_i$$

Because the dependent variable $e_i^2$ is unobservable, we replace it with its least squares estimate $\hat{e}_i^2$, the squared residual. We can then rewrite the above equation as

$$\hat{e}_i^2 = \alpha_1 + \alpha_2 z_{i2} + \cdots + \alpha_S z_{iS} + v_i$$

We are interested in whether the variables $z_{i2}, z_{i3}, \ldots, z_{iS}$ help explain the variation in the squared residuals $\hat{e}_i^2$. Since $R^2$ measures the proportion of the variance in $\hat{e}_i^2$ explained by the $z$'s (i.e. the proportion due to the regression), it is a natural candidate for a test statistic. When $H_0$ is true, the sample size $N$ multiplied by $R^2$ has a $\chi^2$ distribution with $S-1$ degrees of freedom. That is,

$$\chi^2 = N \times R^2 \sim \chi^2_{S-1}$$

There are two ways we can conduct the Breusch-Pagan test in R: the easy way and the hard way. Let's try the hard way first to get a better understanding of the concept behind it. We'll use the same food expenditure data we've been using so far and a significance level of $\alpha = 0.05$ for our hypothesis test. We will test

$$H_0: \alpha_2 = 0 \qquad H_1: \alpha_2 \neq 0$$

var.func <- lm(resi^2 ~ income, data = food)
summary(var.func)
## 
## Call:
## lm(formula = resi^2 ~ income, data = food)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -14654  -5990  -1426   2811  38843 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -5762.4     4823.5  -1.195  0.23963   
## income         682.2      232.6   2.933  0.00566 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9947 on 38 degrees of freedom
## Multiple R-squared:  0.1846, Adjusted R-squared:  0.1632 
## F-statistic: 8.604 on 1 and 38 DF,  p-value: 0.005659

Our $R^2$ is 0.1846 with $N = 40$ observations, making our test statistic $\chi^2 = 40 \times 0.1846 = 7.384$. Let's get our critical value.

qchisq(.95, df = 1)
## [1] 3.841459

Since 7.384 > 3.841, we reject the null hypothesis and conclude that heteroskedasticity is present.
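As a shortcut for the arithmetic above, the same statistic and p-value can be computed directly from the auxiliary regression we just fit (a small convenience sketch; var.func is the model estimated above):

N <- nrow(food)
bp.stat <- N * summary(var.func)$r.squared    # N x R^2
bp.stat
pchisq(bp.stat, df = 1, lower.tail = FALSE)   # p-value under the chi-square distribution with S - 1 = 1 df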

That was the “hard” way of conducting the Breusch-Pagan test. Well, it wasn't really hard, but it did involve multiple steps. There's an “easier” way to conduct the Breusch-Pagan test that involves fewer steps: load the lmtest package and call the bptest function on our fitted model. This is how we do it:

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
bptest(food.ols)
## 
##  studentized Breusch-Pagan test
## 
## data:  food.ols
## BP = 7.3844, df = 1, p-value = 0.006579

While bptest doesn't report a critical value to compare the test statistic against, all you need to look at is the p-value to determine whether or not to reject the null. If the p-value is less than the level of significance (in this case, less than $\alpha = 0.05$), then you reject the null hypothesis. Since 0.006579 < 0.05, we reject the null hypothesis.
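Two quick notes on bptest, offered as asides: the studentized version it runs by default (Koenker's variant) has a statistic equal to $N \times R^2$ from the auxiliary regression, which is why BP = 7.3844 matches our hand computation above, and you can pass a varformula argument to specify a different set of $z$ variables than the model's own regressors. For example (the result is the same here, since income is our only regressor):

bptest(food.ols, varformula = ~ income, data = food)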

Resolving Heteroskedasticity

Now that we’ve identified the presence of heteroskedasticity in our data, what can we do about it? Recall that the two main consequences of heteroskedasticity are 1) ordinary least squares no longer produces the best estimators and 2) standard errors computed using least squares can be incorrect and misleading. Let’s first deal with the issue of incorrect standard errors.

Regression With Robust Standard Errors

If we're willing to accept that ordinary least squares no longer produces the best linear unbiased estimators (BLUE), we can still carry out our regression analysis and correct the issue of incorrect standard errors so that our interval estimates and hypothesis tests are valid. We do this by using heteroskedasticity-consistent standard errors, or simply robust standard errors, an approach introduced by Halbert White.

To begin, note that the formula for the variance of the ordinary least squares estimator $b_2$ is

$$\text{var}(b_2) = \frac{\sum_{i=1}^{N}\left[(x_i - \bar{x})^2 \sigma_i^2\right]}{\left[\sum_{i=1}^{N}(x_i - \bar{x})^2\right]^2}$$

The robust standard error for $b_2$ suggested by White is obtained from the above equation by replacing $\sigma_i^2$ with the squared least squares residuals $\hat{e}_i^2 = (y_i - b_1 - b_2 x_i)^2$ and including a degrees-of-freedom adjustment $N/(N-K)$, where $K$ is the number of parameters in your model. Since our food expenditure model has only two parameters, $K = 2$. Thus, the robust variance estimate is

$$\widehat{\text{var}}(b_2) = \frac{N}{N-K} \cdot \frac{\sum_{i=1}^{N}\left[(x_i - \bar{x})^2 \hat{e}_i^2\right]}{\left[\sum_{i=1}^{N}(x_i - \bar{x})^2\right]^2}$$
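For those who like to see the formula at work, here is a minimal sketch of the same HC1 (“White”) calculation done by hand in matrix form; the object names are my own, and the result should agree with the sandwich package output shown further below.

X <- model.matrix(food.ols)        # columns: intercept and income
e <- residuals(food.ols)           # least squares residuals
N <- nrow(X)
K <- ncol(X)                       # K = 2 parameters
XtX.inv <- solve(crossprod(X))     # (X'X)^{-1}
meat <- crossprod(X * e^2, X)      # sum over i of e_i^2 * x_i x_i'
V.hc1 <- (N / (N - K)) * XtX.inv %*% meat %*% XtX.inv
sqrt(diag(V.hc1))                  # robust (HC1) standard errors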

First, let’s check the standard errors of our estimators for our original model.

summary(food.ols)
## 
## Call:
## lm(formula = food_exp ~ income, data = food)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -223.025  -50.816   -6.324   67.879  212.044 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   83.416     43.410   1.922   0.0622 .  
## income        10.210      2.093   4.877 1.95e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 89.52 on 38 degrees of freedom
## Multiple R-squared:  0.385,  Adjusted R-squared:  0.3688 
## F-statistic: 23.79 on 1 and 38 DF,  p-value: 1.946e-05

The standard errors for b1 (intercept) and b2 are 43.41 and 2.09, respectively. Now, let’s compare them with robust standard errors. To do so, you’ll first need to install the sandwich package.

library(lmtest)
library(sandwich)
coeftest(food.ols, vcov = vcovHC(food.ols, "HC1"))   # HC1 gives us the White standard errors
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)  83.4160    27.4637  3.0373  0.004299 ** 
## income       10.2096     1.8091  5.6436 1.755e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Notice how different our standard errors are! Our robust standard errors for $b_1$ and $b_2$ are 27.46 and 1.81, respectively. The sizeable difference is probably due to our small sample size; recall that a sufficiently large sample size could result in more precise standard errors. Using a confidence level of 0.95, notice the discrepancy in our confidence intervals for $b_2$:

$$\text{White: } b_2 \pm t_c \, \text{se}(b_2) = 10.21 \pm 2.024 \times 1.81 = [6.55,\ 13.87]$$
$$\text{OLS: } b_2 \pm t_c \, \text{se}(b_2) = 10.21 \pm 2.024 \times 2.09 = [5.97,\ 14.45]$$
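If you'd rather not look up the critical value $t_c$ by hand, the two intervals can also be reproduced in R from objects we've already created (a convenience sketch using the packages loaded above):

tc <- qt(0.975, df = food.ols$df.residual)          # critical t value with N - K = 38 df
confint(food.ols, level = 0.95)["income", ]         # OLS interval for b2
rob <- coeftest(food.ols, vcov = vcovHC(food.ols, "HC1"))
rob["income", "Estimate"] + c(-1, 1) * tc * rob["income", "Std. Error"]   # White interval for b2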

Regressing with robust standard errors addresses the issue of computing incorrect interval estimates or incorrect values for our test statistics. However, it doesn't address the other consequence of heteroskedasticity: the least squares estimators are no longer best. As I mentioned before, this may not be too consequential. If you have a sufficiently large sample size (which is generally the case in real-world applications), the variance of your estimators may still be small enough to get precise estimates.

Generalized Least Squares With Unknown Form of Variance

When heteroskedasticity is present, the best linear unbiased estimator depends on the unknown $\sigma_i^2$. This estimator is referred to as the generalized least squares estimator. Since the ordinary least squares estimator is no longer BLUE, we can solve this problem by transforming the model into one with homoskedastic errors. Leaving the structure of the model intact, it is possible to turn the heteroskedastic model into a homoskedastic one.

To begin, let’s introduce a general specification of the variance function, which can be written as

$$\text{var}(e_i) = \sigma_i^2 = \sigma^2 x_i^{\gamma}$$

where $\gamma$ is an unknown parameter we must estimate before we can proceed with the transformation. Notice that the variance function depends on a constant term $\sigma^2$ and increases as $x_i$ increases. It's more convenient to consider a framework more general than the above equation. To introduce this framework, let's start by taking the natural log of both sides of the above equation so that we get

$$\ln(\sigma_i^2) = \ln(\sigma^2) + \gamma \ln(x_i)$$

Then, we take the anti-log of both sides

$$\sigma_i^2 = \exp\left[\ln(\sigma^2) + \gamma \ln(x_i)\right] = \exp(\alpha_1 + \alpha_2 z_i)$$

where $\alpha_1 = \ln(\sigma^2)$, $\alpha_2 = \gamma$, and $z_i = \ln(x_i)$. Writing the variance function in this form is convenient because it shows how the variance can be related to any explanatory variable $z_i$. Also, if we believe the variance is likely to depend on more than one explanatory variable, say $z_{i2}, z_{i3}, \ldots, z_{iS}$, we can extend the equation to

$$\sigma_i^2 = \exp(\alpha_1 + \alpha_2 z_{i2} + \cdots + \alpha_S z_{iS})$$

The exponential function is convenient because it ensures that we get positive values for the variances $\sigma_i^2$ for all possible values of the parameters $\alpha_1, \alpha_2, \ldots, \alpha_S$. Returning to the equation $\sigma_i^2 = \exp(\alpha_1 + \alpha_2 z_i)$, we can rewrite it as

$$\ln(\sigma_i^2) = \alpha_1 + \alpha_2 z_i$$

We now have an equation from which we can estimate the unknown parameters $\alpha_1$ and $\alpha_2$, in the same way we obtain estimates of the parameters $\beta_1$ and $\beta_2$ in a simple regression model $y_i = \beta_1 + \beta_2 x_i + e_i$ using ordinary least squares. We do this by using the squares of our least squares residuals $\hat{e}_i^2$ as our observations. That is, we can write the above equation as

$$\ln(\hat{e}_i^2) = \ln(\sigma_i^2) + v_i = \alpha_1 + \alpha_2 z_i + v_i$$

We can now apply least squares to get our parameter estimates. Let’s use our food expenditure data that we’ve been working with so far.

food.ols <- lm(food_exp ~ income, data = food) # Fit our model to get our residuals.
food$resi <- food.ols$residuals
varfunc.ols <- lm(log(resi^2) ~ log(income), data = food)
summary(varfunc.ols)
## 
## Call:
## lm(formula = log(resi^2) ~ log(income), data = food)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5318 -0.5367  0.4727  1.0833  2.4339 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.9378     1.5831   0.592 0.557107    
## log(income)   2.3292     0.5413   4.303 0.000114 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.721 on 38 degrees of freedom
## Multiple R-squared:  0.3276, Adjusted R-squared:  0.3099 
## F-statistic: 18.51 on 1 and 38 DF,  p-value: 0.0001139

The least squares estimate for our variance function is

$$\widehat{\ln(\sigma_i^2)} = 0.9378 + 2.329\, z_i$$

The next step is to transform the observations in such a way that the transformed model has a constant error variance. To do so, we can obtain variance estimates from

$$\hat{\sigma}_i^2 = \exp(\hat{\alpha}_1 + \hat{\alpha}_2 z_i)$$

and then divide both sides of the regression model $y_i = \beta_1 + \beta_2 x_i + e_i$ by $\sigma_i$. Doing so yields the following equation

$$\left(\frac{y_i}{\sigma_i}\right) = \beta_1\left(\frac{1}{\sigma_i}\right) + \beta_2\left(\frac{x_i}{\sigma_i}\right) + \left(\frac{e_i}{\sigma_i}\right)$$

The variance of the transformed error is homoskedastic because

$$\text{var}\!\left(\frac{e_i}{\sigma_i}\right) = \frac{1}{\sigma_i^2}\,\text{var}(e_i) = \frac{1}{\sigma_i^2}\,\sigma_i^2 = 1$$

Using the estimates $\hat{\sigma}_i^2$ from our variance function in place of $\sigma_i^2$ to obtain the generalized least squares estimators of $\beta_1$ and $\beta_2$, we define the transformed variables as

$$y_i^* = \left(\frac{y_i}{\hat{\sigma}_i}\right) \qquad x_{i1}^* = \left(\frac{1}{\hat{\sigma}_i}\right) \qquad x_{i2}^* = \left(\frac{x_i}{\hat{\sigma}_i}\right)$$

and apply weighted least squares to the equation

$$y_i^* = \beta_1 x_{i1}^* + \beta_2 x_{i2}^* + e_i^*$$

Here’s how we can do it using R.

food.ols <- lm(food_exp ~ income, data = food)
food$resi <- food.ols$residuals
varfunc.ols <- lm(log(resi^2) ~ log(income), data = food)
food$varfunc <- exp(varfunc.ols$fitted.values)
food.gls <- lm(food_exp ~ income, weights = 1/sqrt(varfunc), data = food)
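A brief remark on the choice of weights: lm() uses its weights to multiply the squared residuals in the fitting criterion, so the transformation derived above (dividing every term by $\hat{\sigma}_i$) corresponds to weights of $1/\hat{\sigma}_i^2$, i.e. 1/varfunc. The call above uses 1/sqrt(varfunc) instead, which still downweights the high-variance observations, just less aggressively. If you want the weighting that matches the derivation exactly, an alternative fit (my own addition; its output is not shown below) would be:

food.gls2 <- lm(food_exp ~ income, weights = 1/varfunc, data = food)   # weights = 1 / sigma_i^2 hat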

Let’s compare the estimators resulting from ordinary least squares to the estimators using generalized least squares. Ordinary least squares:

summary(food.ols)
## 
## Call:
## lm(formula = food_exp ~ income, data = food)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -223.025  -50.816   -6.324   67.879  212.044 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   83.416     43.410   1.922   0.0622 .  
## income        10.210      2.093   4.877 1.95e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 89.52 on 38 degrees of freedom
## Multiple R-squared:  0.385,  Adjusted R-squared:  0.3688 
## F-statistic: 23.79 on 1 and 38 DF,  p-value: 1.946e-05

Generalized least squares:

summary(food.gls)
## 
## Call:
## lm(formula = food_exp ~ income, data = food, weights = 1/sqrt(varfunc))
## 
## Weighted Residuals:
##      Min       1Q   Median       3Q      Max 
## -27.6182  -7.2624  -0.7894   9.4541  23.4988 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   78.070     21.275    3.67 0.000743 ***
## income        10.487      1.301    8.06  9.5e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.49 on 38 degrees of freedom
## Multiple R-squared:  0.631,  Adjusted R-squared:  0.6212 
## F-statistic: 64.97 on 1 and 38 DF,  p-value: 9.498e-10

Our fitted models are

$$\text{OLS: } \widehat{FOODEXP} = 83.42 + 10.21\, INCOME$$
$$\text{GLS: } \widehat{FOODEXP} = 78.07 + 10.49\, INCOME$$

Which model is better? In this case, it's the generalized least squares model, since it has much smaller standard errors for our estimators and a much higher $R^2$. Let's visualize how our fitted lines differ.

library(ggplot2)
g <- ggplot(data = food, aes(y = food_exp, x = income)) + geom_point(col = 'blue')
g + geom_abline(slope = food.ols$coefficients[2], intercept = food.ols$coefficients[1], col = 'red') + geom_abline(slope = food.gls$coefficients[2], intercept = food.gls$coefficients[1], col = 'green')

The green line represents the fitted GLS regression line and the red line represents the fitted OLS regression line. Visually, there doesn't seem to be much difference, but referring to the comparison of our models above, the GLS fit is more precise, with smaller standard errors and a higher $R^2$.
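As a final sanity check (my own addition, using the weighted.residuals() helper from base R's stats package), we can plot the weighted residuals from the GLS fit against income; if the transformation worked, the fan shape from the earlier residual plot should be much less pronounced.

food$wres <- weighted.residuals(food.gls)
ggplot(data = food, aes(y = wres, x = income)) + geom_point(col = 'blue') + geom_abline(slope = 0)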

Conclusion

Heteroskedasticity occurs when the variance is not the same for all observations. To detect heteroskedasticity, one can plot the least squares residuals $\hat{e}_i$ against the independent variable $x_i$ (or $\hat{y}_i$ if it's a multiple regression model). If there is a distinguishable pattern, then heteroskedasticity might be present. A more formal way of identifying heteroskedasticity is the Breusch-Pagan test, where we estimate a variance function that depends on the independent variable(s) and test the null hypothesis that heteroskedasticity is not present against the alternative that it is present. The test statistic for the Breusch-Pagan test is obtained by multiplying the $R^2$ of the estimated variance function by the number of observations $N$. If the null hypothesis is true, this statistic $N \times R^2$ has a $\chi^2$ distribution with $S-1$ degrees of freedom.

There are two main consequences of heteroskedasticity. First, the ordinary least squares estimators are still linear and unbiased, but they are no longer best; another estimator has smaller variances. Second, the standard errors may be misleading and incorrect, which can affect interval estimation and hypothesis testing. To correct for the first consequence, we use generalized least squares to obtain our parameter estimates. This involves keeping the functional form intact but transforming the heteroskedastic model into a homoskedastic one. To do this, we estimated a variance function and used its fitted values to construct weights for the regression, resulting in smaller standard errors and more precise estimators. To correct for the second consequence of misleading and incorrect standard errors, we used ordinary least squares regression with robust standard errors. Regressing with robust standard errors doesn't change our estimators, but it corrects the misleading and incorrect standard errors.

In our food expenditure example, we detected the presence of heteroskedasticity and corrected for it. Comparing OLS with GLS, we saw that correcting for heteroskedasticity led to smaller standard errors and a much higher $R^2$, while regressing with robust standard errors left the estimates unchanged but corrected their standard errors. However, not correcting for heteroskedasticity may not be a grave sin. If the sample size is sufficiently large, the variance of your estimators may be small enough to still produce precise estimates.