Introduction

Heteroskedasticity occurs when the variance is not the same for all observations in a data set. In this demonstration, we examine the consequences of heteroskedasticity, look at ways to detect it, and see how we can correct for it using regression with robust standard errors and weighted least squares regression.

Consequences of Heteroskedasticity

As mentioned previously, heteroskedasticity occurs when the variance is not the same for all observations in a data set. Conversely, when the variance is equal for all observations, we call that homoskedasticity. Why should we care about heteroskedasticity? Because it violates the ordinary least squares assumption that $\text{var}(y_i) = \text{var}(e_i) = \sigma^2$. In the presence of heteroskedasticity, there are two main consequences for the least squares estimators:

  1. The least squares estimator is still a linear and unbiased estimator, but it is no longer best. That is, there is another estimator with a smaller variance.
  2. The standard errors computed for the least squares estimators are incorrect. This can affect confidence intervals and hypothesis testing that use those standard errors, which could lead to misleading conclusions.

Most real-world data will probably be heteroskedastic to some degree. Even so, one can still use ordinary least squares without correcting for heteroskedasticity: if the sample size is large enough, the variance of the least squares estimator may still be sufficiently small to obtain precise estimates.

Detecting Heteroskedasticity

Residual Plots

One informal way of detecting heteroskedasticity is to create a residual plot, where you plot the least squares residuals against the explanatory variable, or against $\hat{y}$ if it's a multiple regression. If there is an evident pattern in the plot, then heteroskedasticity is likely present. Let's work through a basic example using a household food expenditure dataset, where our dependent variable is household monthly food expenditure and our independent variable is income.

food <- read.csv('/Users/cyobero/Documents/food.csv')
head(food)
##   food_exp income
## 1   115.22   3.69
## 2   135.98   4.39
## 3   119.34   4.75
## 4   114.96   6.03
## 5   187.05  12.47
## 6   243.92  12.98
summary(food)
##     food_exp         income     
##  Min.   :109.7   Min.   : 3.69  
##  1st Qu.:200.4   1st Qu.:17.11  
##  Median :264.5   Median :20.03  
##  Mean   :283.6   Mean   :19.60  
##  3rd Qu.:363.3   3rd Qu.:24.40  
##  Max.   :587.7   Max.   :33.40
food.ols <- lm(food_exp ~ income, data = food)
food$resi <- food.ols$residuals
library(ggplot2)
ggplot(data = food, aes(y = resi, x = income)) + geom_point(col = 'blue') + geom_abline(slope = 0)

There is no single obvious pattern, but there does appear to be more variation in food expenditures for households with higher levels of income, which is a hint of heteroskedasticity.

The Breusch-Pagan Test

A more formal, mathematical way of detecting heteroskedasticity is what is known as the Breusch-Pagan test. It involves specifying a variance function and using a $\chi^2$ test of the null hypothesis that heteroskedasticity is not present (i.e. the errors are homoskedastic) against the alternative hypothesis that heteroskedasticity is present.

To start, we need a variance function: a function that relates the variance to a set of explanatory variables $z_{i1}, z_{i2}, \ldots, z_{iS}$ that are potentially different from $x_{i1}, x_{i2}, \ldots, x_{iK}$. A general form of the variance function is

$$\text{var}(y_i) = \sigma_i^2 = E(e_i^2) = h(\alpha_1 + \alpha_2 z_{i2} + \alpha_3 z_{i3} + \cdots + \alpha_S z_{iS})$$

Notice in the above equation that the variance of $y_i$ changes for each observation depending on the values of the $z$'s. If $\alpha_2 = \alpha_3 = \cdots = \alpha_S = 0$, then we have constant variance and heteroskedasticity is not present. The Breusch-Pagan test therefore uses the following null and alternative hypotheses

$$H_0: \alpha_2 = \alpha_3 = \cdots = \alpha_S = 0 \qquad H_1: \text{at least one of the } \alpha\text{'s is not zero}$$

We reject the null hypothesis in favor of the alternative if $\chi^2 \geq \chi^2_{(1-\alpha,\, S-1)}$ (we'll get to this a bit later). To obtain a test statistic for our hypothesis test, we consider the linear variance function $h(\alpha_1 + \alpha_2 z_{i2} + \cdots + \alpha_S z_{iS}) = \alpha_1 + \alpha_2 z_{i2} + \cdots + \alpha_S z_{iS}$ and substitute it into $\text{var}(y_i) = \sigma_i^2 = E(e_i^2)$ to obtain

$$\sigma_i^2 = E(e_i^2) = \alpha_1 + \alpha_2 z_{i2} + \cdots + \alpha_S z_{iS}$$

Then, let $v_i = e_i^2 - E(e_i^2)$ denote the difference between a squared error term and its mean. From the above equation, we can write

$$e_i^2 = E(e_i^2) + v_i = \alpha_1 + \alpha_2 z_{i2} + \cdots + \alpha_S z_{iS} + v_i$$

Because the dependent variable $e_i^2$ is unobservable, we replace it with its least squares estimate $\hat{e}_i^2$, the squared residual. We can then rewrite the above equation as

$$\hat{e}_i^2 = \alpha_1 + \alpha_2 z_{i2} + \cdots + \alpha_S z_{iS} + v_i$$

We are interested in whether the variables $z_{i2}, z_{i3}, \ldots, z_{iS}$ help explain the variation in the squared residuals $\hat{e}_i^2$. Since $R^2$ measures the proportion of the variance in $\hat{e}_i^2$ explained by the $z$'s (i.e. the proportion due to the regression), it is a natural candidate for a test statistic. When $H_0$ is true, the sample size $N$ multiplied by $R^2$ has a $\chi^2$ distribution with $S-1$ degrees of freedom. That is,

$$\chi^2 = N \times R^2 \sim \chi^2_{S-1}$$

There are two ways we can conduct the Breusch-Pagan test in R: the easy way and the hard way. Let's try the hard way first to get a better understanding of the concept behind it. We'll use the same food expenditure data we've been using so far and a significance level of $\alpha = 0.05$ for our hypothesis test. We will test

$$H_0: \alpha_2 = 0 \qquad H_1: \alpha_2 \neq 0$$

var.func <- lm(resi^2 ~ income, data = food)
summary(var.func)
## 
## Call:
## lm(formula = resi^2 ~ income, data = food)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -14654  -5990  -1426   2811  38843 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -5762.4     4823.5  -1.195  0.23963   
## income         682.2      232.6   2.933  0.00566 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9947 on 38 degrees of freedom
## Multiple R-squared:  0.1846, Adjusted R-squared:  0.1632 
## F-statistic: 8.604 on 1 and 38 DF,  p-value: 0.005659

Our $R^2$ is 0.1846 with $N = 40$ observations, making our test statistic $\chi^2 = 40 \times 0.1846 = 7.384$. Let's get our critical value.

qchisq(.95, df = 1)
## [1] 3.841459

Since 7.384 > 3.841, we reject the null hypothesis and conclude that heteroskedasticity is present.
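As a shortcut for the arithmetic above, the same statistic and p-value can be computed directly from the auxiliary regression we just fit (a small convenience sketch; var.func is the model estimated above):

N <- nrow(food)
bp.stat <- N * summary(var.func)$r.squared    # N x R^2
bp.stat
pchisq(bp.stat, df = 1, lower.tail = FALSE)   # p-value under the chi-square distribution with S - 1 = 1 df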

That was the “hard” way of conducting the Breusch-Pagan test. Well, it wasn't really hard, but it did involve multiple steps. There's an “easier” way to conduct the Breusch-Pagan test that involves fewer steps: load the lmtest package and call the bptest function on our fitted model. This is how we do it:

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
bptest(food.ols)
## 
##  studentized Breusch-Pagan test
## 
## data:  food.ols
## BP = 7.3844, df = 1, p-value = 0.006579

While bptest doesn't report a critical value to compare the test statistic against, all you need to look at is the p-value to determine whether or not to reject the null. If the p-value is less than the level of significance (in this case, less than $\alpha = 0.05$), then you reject the null hypothesis. Since 0.006579 < 0.05, we reject the null hypothesis.
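Two quick notes on bptest, offered as asides: the studentized version it runs by default (Koenker's variant) has a statistic equal to $N \times R^2$ from the auxiliary regression, which is why BP = 7.3844 matches our hand computation above, and you can pass a varformula argument to specify a different set of $z$ variables than the model's own regressors. For example (the result is the same here, since income is our only regressor):

bptest(food.ols, varformula = ~ income, data = food)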

Resolving Heteroskedasticity

Now that we’ve identified the presence of heteroskedasticity in our data, what can we do about it? Recall that the two main consequences of heteroskedasticity are 1) ordinary least squares no longer produces the best estimators and 2) standard errors computed using least squares can be incorrect and misleading. Let’s first deal with the issue of incorrect standard errors.

Regression With Robust Standard Errors

If we're willing to accept that ordinary least squares no longer produces the best linear unbiased estimators (BLUE), we can still carry out our regression analysis and correct the issue of incorrect standard errors so that our interval estimates and hypothesis tests are valid. We do this by using heteroskedasticity-consistent standard errors, or simply robust standard errors, an approach introduced by Halbert White.

To begin, note that the formula for the variance of the ordinary least squares estimator $b_2$ is

$$\text{var}(b_2) = \frac{\sum_{i=1}^{N}\left[(x_i - \bar{x})^2 \sigma_i^2\right]}{\left[\sum_{i=1}^{N}(x_i - \bar{x})^2\right]^2}$$

The robust standard error for $b_2$ suggested by White is obtained from the above equation by replacing $\sigma_i^2$ with the squared least squares residuals $\hat{e}_i^2 = (y_i - b_1 - b_2 x_i)^2$ and including a degrees-of-freedom adjustment $N/(N-K)$, where $K$ is the number of parameters in your model. Since our food expenditure model has only two parameters, $K = 2$. Thus, the robust variance estimate is

$$\widehat{\text{var}}(b_2) = \frac{N}{N-K} \cdot \frac{\sum_{i=1}^{N}\left[(x_i - \bar{x})^2 \hat{e}_i^2\right]}{\left[\sum_{i=1}^{N}(x_i - \bar{x})^2\right]^2}$$
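For those who like to see the formula at work, here is a minimal sketch of the same HC1 (“White”) calculation done by hand in matrix form; the object names are my own, and the result should agree with the sandwich package output shown further below.

X <- model.matrix(food.ols)        # columns: intercept and income
e <- residuals(food.ols)           # least squares residuals
N <- nrow(X)
K <- ncol(X)                       # K = 2 parameters
XtX.inv <- solve(crossprod(X))     # (X'X)^{-1}
meat <- crossprod(X * e^2, X)      # sum over i of e_i^2 * x_i x_i'
V.hc1 <- (N / (N - K)) * XtX.inv %*% meat %*% XtX.inv
sqrt(diag(V.hc1))                  # robust (HC1) standard errors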

First, let’s check the standard errors of our estimators for our original model.

summary(food.ols)
## 
## Call:
## lm(formula = food_exp ~ income, data = food)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -223.025  -50.816   -6.324   67.879  212.044 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   83.416     43.410   1.922   0.0622 .  
## income        10.210      2.093   4.877 1.95e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 89.52 on 38 degrees of freedom
## Multiple R-squared:  0.385,  Adjusted R-squared:  0.3688 
## F-statistic: 23.79 on 1 and 38 DF,  p-value: 1.946e-05

The standard errors for b1 (intercept) and b2 are 43.41 and 2.09, respectively. Now, let’s compare them with robust standard errors. To do so, you’ll first need to install the sandwich package.

library(lmtest)
library(sandwich)
coeftest(food.ols, vcov = vcovHC(food.ols, "HC1"))   # HC1 gives us the White standard errors
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)  83.4160    27.4637  3.0373  0.004299 ** 
## income       10.2096     1.8091  5.6436 1.755e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Notice how different our standard errors are! Our robust standard errors for $b_1$ and $b_2$ are 27.46 and 1.81, respectively. The sizeable difference is probably due to our small sample size; recall that a sufficiently large sample size could result in more precise standard errors. Using a confidence level of 0.95, notice the discrepancy in our confidence intervals for $b_2$:

$$\text{White: } b_2 \pm t_c \, \text{se}(b_2) = 10.21 \pm 2.024 \times 1.81 = [6.55,\ 13.87]$$
$$\text{OLS: } b_2 \pm t_c \, \text{se}(b_2) = 10.21 \pm 2.024 \times 2.09 = [5.97,\ 14.45]$$
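If you'd rather not look up the critical value $t_c$ by hand, the two intervals can also be reproduced in R from objects we've already created (a convenience sketch using the packages loaded above):

tc <- qt(0.975, df = food.ols$df.residual)          # critical t value with N - K = 38 df
confint(food.ols, level = 0.95)["income", ]         # OLS interval for b2
rob <- coeftest(food.ols, vcov = vcovHC(food.ols, "HC1"))
rob["income", "Estimate"] + c(-1, 1) * tc * rob["income", "Std. Error"]   # White interval for b2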

Regressing with robust standard errors addresses the issue of computing incorrect interval estimates or incorrect values for our test statistics. However, it doesn't address the other consequence of heteroskedasticity: the least squares estimators are no longer best. As I mentioned before, this may not be too consequential. If you have a sufficiently large sample size (which is generally the case in real-world applications), the variance of your estimators may still be small enough to get precise estimates.

Generalized Least Squares With Unknown Form of Variance

When heteroskedasticity is present, the best linear unbiased estimator depends on the unknown $\sigma_i^2$. This estimator is referred to as the generalized least squares estimator. Since the ordinary least squares estimator is no longer BLUE, we can solve this problem by transforming the model into one with homoskedastic errors. Leaving the structure of the model intact, it is possible to turn the heteroskedastic model into a homoskedastic one.

To begin, let’s introduce a general specification of the variance function, which can be written as

$$\text{var}(e_i) = \sigma_i^2 = \sigma^2 x_i^{\gamma}$$

where $\gamma$ is an unknown parameter we must estimate before we can proceed with the transformation. Notice that the variance function depends on a constant term $\sigma^2$ and increases as $x_i$ increases. It's more convenient to consider a framework more general than the above equation. To introduce this framework, let's start by taking the natural log of both sides of the above equation so that we get

$$\ln(\sigma_i^2) = \ln(\sigma^2) + \gamma \ln(x_i)$$

Then, we take the anti-log of both sides

$$\sigma_i^2 = \exp\left[\ln(\sigma^2) + \gamma \ln(x_i)\right] = \exp(\alpha_1 + \alpha_2 z_i)$$

where $\alpha_1 = \ln(\sigma^2)$, $\alpha_2 = \gamma$, and $z_i = \ln(x_i)$. Writing the variance function in this form is convenient because it shows how the variance can be related to any explanatory variable $z_i$. Also, if we believe the variance is likely to depend on more than one explanatory variable, say $z_{i2}, z_{i3}, \ldots, z_{iS}$, we can extend the equation to

$$\sigma_i^2 = \exp(\alpha_1 + \alpha_2 z_{i2} + \cdots + \alpha_S z_{iS})$$

The exponential function is convenient because it ensures that we get positive values for the variances $\sigma_i^2$ for all possible values of the parameters $\alpha_1, \alpha_2, \ldots, \alpha_S$. Returning to the equation $\sigma_i^2 = \exp(\alpha_1 + \alpha_2 z_i)$, we can rewrite it as

$$\ln(\sigma_i^2) = \alpha_1 + \alpha_2 z_i$$

We now have an equation from which we can estimate the unknown parameters $\alpha_1$ and $\alpha_2$, in the same way we obtain estimates of the parameters $\beta_1$ and $\beta_2$ in a simple regression model $y_i = \beta_1 + \beta_2 x_i + e_i$ using ordinary least squares. We do this by using the squares of our least squares residuals $\hat{e}_i^2$ as our observations. That is, we can write the above equation as

$$\ln(\hat{e}_i^2) = \ln(\sigma_i^2) + v_i = \alpha_1 + \alpha_2 z_i + v_i$$

We can now apply least squares to get our parameter estimates. Let’s use our food expenditure data that we’ve been working with so far.

food.ols <- lm(food_exp ~ income, data = food) # Fit our model to get our residuals.
food$resi <- food.ols$residuals
varfunc.ols <- lm(log(resi^2) ~ log(income), data = food)
summary(varfunc.ols)
## 
## Call:
## lm(formula = log(resi^2) ~ log(income), data = food)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5318 -0.5367  0.4727  1.0833  2.4339 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.9378     1.5831   0.592 0.557107    
## log(income)   2.3292     0.5413   4.303 0.000114 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.721 on 38 degrees of freedom
## Multiple R-squared:  0.3276, Adjusted R-squared:  0.3099 
## F-statistic: 18.51 on 1 and 38 DF,  p-value: 0.0001139

The least squares estimate for our variance function is

$$\widehat{\ln(\sigma_i^2)} = 0.9378 + 2.329\, z_i$$

The next step is to transform the observations in such a way that the transformed model has a constant error variance. To do so, we can obtain variance estimates from

$$\hat{\sigma}_i^2 = \exp(\hat{\alpha}_1 + \hat{\alpha}_2 z_i)$$

and then divide both sides of the regression model $y_i = \beta_1 + \beta_2 x_i + e_i$ by $\sigma_i$. Doing so yields the following equation

$$\left(\frac{y_i}{\sigma_i}\right) = \beta_1\left(\frac{1}{\sigma_i}\right) + \beta_2\left(\frac{x_i}{\sigma_i}\right) + \left(\frac{e_i}{\sigma_i}\right)$$

The variance of the transformed error is homoskedastic because

$$\text{var}\!\left(\frac{e_i}{\sigma_i}\right) = \frac{1}{\sigma_i^2}\,\text{var}(e_i) = \frac{1}{\sigma_i^2}\,\sigma_i^2 = 1$$

Using the estimates $\hat{\sigma}_i^2$ from our variance function in place of $\sigma_i^2$ to obtain the generalized least squares estimators of $\beta_1$ and $\beta_2$, we define the transformed variables as

$$y_i^* = \left(\frac{y_i}{\hat{\sigma}_i}\right) \qquad x_{i1}^* = \left(\frac{1}{\hat{\sigma}_i}\right) \qquad x_{i2}^* = \left(\frac{x_i}{\hat{\sigma}_i}\right)$$

and apply weighted least squares to the equation

$$y_i^* = \beta_1 x_{i1}^* + \beta_2 x_{i2}^* + e_i^*$$

Here’s how we can do it using R.

food.ols <- lm(food_exp ~ income, data = food)
food$resi <- food.ols$residuals
varfunc.ols <- lm(log(resi^2) ~ log(income), data = food)
food$varfunc <- exp(varfunc.ols$fitted.values)
food.gls <- lm(food_exp ~ income, weights = 1/sqrt(varfunc), data = food)
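A brief remark on the choice of weights: lm() uses its weights to multiply the squared residuals in the fitting criterion, so the transformation derived above (dividing every term by $\hat{\sigma}_i$) corresponds to weights of $1/\hat{\sigma}_i^2$, i.e. 1/varfunc. The call above uses 1/sqrt(varfunc) instead, which still downweights the high-variance observations, just less aggressively. If you want the weighting that matches the derivation exactly, an alternative fit (my own addition; its output is not shown below) would be:

food.gls2 <- lm(food_exp ~ income, weights = 1/varfunc, data = food)   # weights = 1 / sigma_i^2 hat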

Let’s compare the estimators resulting from ordinary least squares to the estimators using generalized least squares. Ordinary least squares:

summary(food.ols)
## 
## Call:
## lm(formula = food_exp ~ income, data = food)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -223.025  -50.816   -6.324   67.879  212.044 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   83.416     43.410   1.922   0.0622 .  
## income        10.210      2.093   4.877 1.95e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 89.52 on 38 degrees of freedom
## Multiple R-squared:  0.385,  Adjusted R-squared:  0.3688 
## F-statistic: 23.79 on 1 and 38 DF,  p-value: 1.946e-05

Generalized least squares:

summary(food.gls)
## 
## Call:
## lm(formula = food_exp ~ income, data = food, weights = 1/sqrt(varfunc))
## 
## Weighted Residuals:
##      Min       1Q   Median       3Q      Max 
## -27.6182  -7.2624  -0.7894   9.4541  23.4988 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   78.070     21.275    3.67 0.000743 ***
## income        10.487      1.301    8.06  9.5e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.49 on 38 degrees of freedom
## Multiple R-squared:  0.631,  Adjusted R-squared:  0.6212 
## F-statistic: 64.97 on 1 and 38 DF,  p-value: 9.498e-10

Our fitted models are

$$\text{OLS: } \widehat{FOODEXP} = 83.42 + 10.21\, INCOME$$
$$\text{GLS: } \widehat{FOODEXP} = 78.07 + 10.49\, INCOME$$

Which model is better? In this case, it's the generalized least squares model, since it has much smaller standard errors for our estimators and a much higher $R^2$. Let's visualize how our fitted lines differ.

library(ggplot2)
g <- ggplot(data = food, aes(y = food_exp, x = income)) + geom_point(col = 'blue')
g + geom_abline(slope = food.ols$coefficients[2], intercept = food.ols$coefficients[1], col = 'red') + geom_abline(slope = food.gls$coefficients[2], intercept = food.gls$coefficients[1], col = 'green')

The green line represents the fitted GLS regression line and the red line represents the fitted OLS regression line. Visually, there doesn't seem to be much difference, but referring to the comparison of our models above, the GLS fit is more precise, with smaller standard errors and a higher $R^2$.
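As a final sanity check (my own addition, using the weighted.residuals() helper from base R's stats package), we can plot the weighted residuals from the GLS fit against income; if the transformation worked, the fan shape from the earlier residual plot should be much less pronounced.

food$wres <- weighted.residuals(food.gls)
ggplot(data = food, aes(y = wres, x = income)) + geom_point(col = 'blue') + geom_abline(slope = 0)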

Conclusion

Heteroskedasticity occurs when the variance is not the same for all observations. To detect heteroskedasticity, one can plot the least squares residuals $\hat{e}_i$ against the independent variable $x_i$ (or $\hat{y}_i$ if it's a multiple regression model). If there is a distinguishable pattern, then heteroskedasticity might be present. A more formal way of identifying heteroskedasticity is the Breusch-Pagan test, where we estimate a variance function that depends on the independent variable(s) and test the null hypothesis that heteroskedasticity is not present against the alternative that it is present. The test statistic for the Breusch-Pagan test is obtained by multiplying the $R^2$ of the estimated variance function by the number of observations $N$. If the null hypothesis is true, this statistic $N \times R^2$ has a $\chi^2$ distribution with $S-1$ degrees of freedom.

There are two main consequences of heteroskedasticity. First, the ordinary least squares estimators are still linear and unbiased, but they are no longer best; another estimator has smaller variances. Second, the standard errors may be misleading and incorrect, which can affect interval estimation and hypothesis testing. To correct for the first consequence, we use generalized least squares to obtain our parameter estimates. This involves keeping the functional form intact but transforming the heteroskedastic model into a homoskedastic one. To do this, we estimated a variance function and used its fitted values to construct weights for the regression, resulting in smaller standard errors and more precise estimators. To correct for the second consequence of misleading and incorrect standard errors, we used ordinary least squares regression with robust standard errors. Regressing with robust standard errors doesn't change our estimators, but it corrects the misleading and incorrect standard errors.

In our food expenditure example, we detected the presence of heteroskedasticity and corrected for it. Comparing OLS with GLS, we saw that correcting for heteroskedasticity led to smaller standard errors and a much higher $R^2$, while regressing with robust standard errors left the estimates unchanged but corrected their standard errors. However, not correcting for heteroskedasticity may not be a grave sin. If the sample size is sufficiently large, the variance of your estimators may be small enough to still produce precise estimates.