Although heteroskedasticity does not produce biased OLS estimates, it leads to a bias in the variance-covariance matrix. This means that standard model testing methods such as t tests or F tests cannot be relied on any longer. This post provides an intuitive illustration of heteroskedasticity and covers the calculation of standard errors that are robust to it.
A popular illustration of heteroskedasticity is the relationship between saving and income, which is shown in the following graph. The dataset is contained the wooldridge package.
# Clear the workspace
rm(list = ls()) # Clear environment
gc() # Clear unused memory
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 527850 28.2 1173508 62.7 NA 669268 35.8
## Vcells 970766 7.5 8388608 64.0 32768 1840524 14.1
cat("\f") # Clear the console
# Load packages
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(wooldridge)
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
# Load the sample
data("saving")
stargazer(saving, type="text")
##
## ================================================
## Statistic N Mean St. Dev. Min Max
## ------------------------------------------------
## sav 100 1,582.510 3,284.902 -5,577 25,405
## inc 100 9,941.240 5,583.998 750 32,080
## size 100 4.350 1.493 2 10
## educ 100 11.580 3.435 2 20
## age 100 38.770 7.399 26 54
## black 100 0.070 0.256 0 1
## cons 100 8,358.730 5,729.535 -13,055 30,280
## ------------------------------------------------
Observations, where variable inc is larger than 20,000
or variable sav is negative or larger than inc
are dropped from the sample.
# Only use positive values of saving, which are smaller than income
saving <- saving %>%
filter(sav > 0,
inc < 20000,
sav < inc)
# Plot
ggplot(data = saving,
mapping = aes(x = inc, y = sav)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(x = "Annual income", y = "Annual savings")
## `geom_smooth()` using formula = 'y ~ x'
The regression line in the graph shows a clear positive relationship between saving and income. However, as income increases, the differences between the observations and the regression line become larger. This means that there is higher uncertainty about the estimated relationship between the two variables at higher income levels. This is an example of heteroskedasticity.
Since standard model testing methods rely on the assumption that there is no correlation between the independent variables and the variance of the dependent variable, the usual standard errors are not very reliable in the presence of heteroskedasticity. Fortunately, the calculation of robust standard errors can help to mitigate this problem.
The regression line above was derived from the model
\[ sav_i=\beta_0 + \beta_1 \ inc_i+ \epsilon_i , \]
for which the following code produces the standard R output:
# Estimate the model
model <- lm(formula = sav ~ inc,
data = saving
)
# Print estimates and standard test statistics
summary(model)
##
## Call:
## lm(formula = sav ~ inc, data = saving)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2667.8 -874.5 -302.7 431.1 4606.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 316.19835 462.06882 0.684 0.49595
## inc 0.14052 0.04672 3.007 0.00361 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1413 on 73 degrees of freedom
## Multiple R-squared: 0.1102, Adjusted R-squared: 0.09805
## F-statistic: 9.044 on 1 and 73 DF, p-value: 0.003613
par(mfrow = c(2, 2))
plot(model)
library(skedastic)
skedastic::white(model)
## # A tibble: 1 × 5
## statistic p.value parameter method alternative
## <dbl> <dbl> <dbl> <chr> <chr>
## 1 4.31 0.116 2 White's Test greater
# library(MASS)
# model2 <- rlm(formula = sav ~ inc,
# data = saving
# )
#
# stargazer(model, model2, type = "text")
Since we already know that the model above suffers from
heteroskedasticity, we want to obtain heteroskedasticity robust standard
errors and their corresponding t values. In R the function
coeftest from the lmtest package can be used in
combination with the function vcovHC from the
sandwich package to do this.
The first argument of the coeftest function contains the
output of the lm function and calculates the t test based
on the variance-covariance matrix provided in the vcov
argument. The vcovHC function produces that matrix and
allows to obtain several types of heteroskedasticity robust versions of
it. In our case we obtain a simple White standard error, which is
indicated by type = "HC0". Other, more sophisticated
methods are described in the documentation of the function,
?vcovHC
?vcovHC
## No documentation for 'vcovHC' in specified packages and libraries:
## you could try '??vcovHC'
# Load libraries
library("lmtest")
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library("sandwich")
# Robust t test
model0 <- coeftest(model, vcov = vcovHC(model, type = "HC0"))
# Robust t test
model1 <- coeftest(model, vcov = vcovHC(model, type = "HC1"))
# Robust t test
model2 <- coeftest(model, vcov = vcovHC(model, type = "HC2"))
# Robust t test
model3 <- coeftest(model, vcov = vcovHC(model, type = "HC3"))
library(stargazer)
stargazer(model, model1, model2, model3, type = "text")
##
## =======================================================================
## Dependent variable:
## ---------------------------------------------------
## sav
## OLS coefficient
## test
## (1) (2) (3) (4)
## -----------------------------------------------------------------------
## inc 0.141*** 0.141*** 0.141*** 0.141***
## (0.047) (0.049) (0.051) (0.052)
##
## Constant 316.198 316.198 316.198 316.198
## (462.069) (420.371) (428.596) (443.298)
##
## -----------------------------------------------------------------------
## Observations 75
## R2 0.110
## Adjusted R2 0.098
## Residual Std. Error 1,413.472 (df = 73)
## F Statistic 9.044*** (df = 1; 73)
## =======================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
F test is presented as a method to test the joint significance of
multiple regressors. The following example adds two new regressors on
education and age to the above model and calculates the corresponding
(non-robust) F test using the anova function.
# Estimate unrestricted model
model_unres <- lm(sav ~ inc + size + educ + age, data = saving)
# F test
anova(model, model_unres)
## Analysis of Variance Table
##
## Model 1: sav ~ inc
## Model 2: sav ~ inc + size + educ + age
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 73 145846877
## 2 70 144286605 3 1560272 0.2523 0.8594
For a heteroskedasticity robust F test we perform a Wald test using
the waldtest function, which is also contained in the
lmtest package. It can be used in a similar way as the
anova function, i.e., it uses the output of the restricted
and unrestricted model and the robust variance-covariance matrix as
argument vcov. Based on the variance-covariance matrix of
the unrestriced model we, again, calculate White standard errors.
waldtest(model, model_unres, vcov = vcovHC(model_unres, type = "HC0"))
## Wald test
##
## Model 1: sav ~ inc
## Model 2: sav ~ inc + size + educ + age
## Res.Df Df F Pr(>F)
## 1 73
## 2 70 3 0.3625 0.7803
The second method to check for heteroscedasticity among residuals in R is by performing the Breusch-Pagan test. This test checks whether the variance of the residuals depends on the value of the independent variable.
Therefore, the hypothesis of the Breusch-Pagan Test is:
NULL - Homoskedasticty (Residuals are distributed with equal variance)
ALTERNATIVE - Heteroskedastcity (Residuals are not distributed with equal variance)
In R, you can perform the Breusch-Pagan test in different ways, for instance with:
The bptest function from the lmtest package,
The ncvTest function from the car package,
The plmtest functionfrom the plm package, or
The breusch_pagan function from the skedastic package.
In this example, we use the bptest function from the lmtest package which requires just a fitted “lm”-object as its argument (i.e., a linear regression model).
https://www.codingprof.com/3-easy-ways-to-test-for-heteroscedasticity-in-r-examples/
library(lmtest)
lmtest::bptest(model)
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 4.0869, df = 1, p-value = 0.04322
library(lmtest)
model <- lm(mpg~hp, data = mtcars)
plot(y = mtcars$mpg,
x = mtcars$hp)
lmtest::bptest(model)
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 0.049298, df = 1, p-value = 0.8243
library(lmtest)
model <- lm(Volume~Height, data = trees)
plot(y = trees$Volume, x = trees$Height)
lmtest::bptest(model)
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 12.207, df = 1, p-value = 0.0004762
https://rdocumentation.org/packages/skedastic/versions/1.0.4/topics/white_lm
library(skedastic)
model <- lm(mpg~wt, data = mtcars)
skedastic::white(model)
## # A tibble: 1 × 5
## statistic p.value parameter method alternative
## <dbl> <dbl> <dbl> <chr> <chr>
## 1 1.37 0.505 2 White's Test greater
library(skedastic)
model <- lm(Volume~Height, data = trees)
skedastic::white(model)
## # A tibble: 1 × 5
## statistic p.value parameter method alternative
## <dbl> <dbl> <dbl> <chr> <chr>
## 1 17.1 0.000197 2 White's Test greater