Heteroskedasticity is an issue where the variance of the errors (residuals) in a regression model is not constant across observations, typically varying with one or more of the predictors. It can be caused by outliers in the data or by omitted-variable bias. Heteroskedasticity leaves the OLS point estimates unbiased but less precise, while the usual standard-error estimates become biased, invalidating conventional inference.
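To make this concrete, here is a minimal simulated sketch (the names x, y, and sim_fit are illustrative, not part of the analysis below): the error standard deviation grows with the regressor, which produces the classic fan shape in a residual plot.
set.seed(42)
x <- runif(500, 1, 10)
y <- 2 + 3 * x + rnorm(500, sd = x)  # error sd grows with x: heteroskedastic by construction
sim_fit <- lm(y ~ x)
plot(sim_fit, which = 1)  # residuals fan out as fitted values increase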
White's test:
Null Hypothesis: Homoskedasticity (fail to reject when p > threshold, e.g., 0.05)
Alternative Hypothesis: Heteroskedasticity (reject the null when p < 0.05)
The test logic is sound: by regressing the squared residuals on the regressors, we can see whether the variation in the errors is attributable to the regressors used in the model. Under homoskedasticity, we should observe coefficient estimates near zero (other than the constant) and a small R^2 in this auxiliary regression.
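Formally, White's test statistic is the number of observations times the R^2 of that auxiliary regression, which under the null of homoskedasticity is asymptotically chi-squared with degrees of freedom equal to the number of auxiliary regressors excluding the constant (here k = 9: three regressors, three squares, three interactions):
\[ nR^2_{aux} \sim \chi^2_{k} \]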
library(AER)
library(dplyr)
library(stargazer)
library(ggplot2)
library(rmarkdown)
library(skedastic)
library(MASS)
library(tidyverse)
library(psych)
data("Guns")
subset_Guns <- Guns[, c(2, 4, 5, 8)]  # keep violent, robbery, prisoners, male
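Since MASS masks dplyr::select(), an equivalent name-based selection (a sketch, selecting the same four columns) would be:
subset_Guns <- dplyr::select(Guns, violent, robbery, prisoners, male)  # same columns, selected by name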
\[ violent_i = \beta_0 + \beta_1\,robbery_i + \beta_2\,male_i + \beta_3\,prisoners_i + \epsilon_i \]
violent: violent crime rate (incidents per 100,000 residents)
robbery: robbery rate (incidents per 100,000 residents)
male: percent of the state population that is male, ages 10 to 29
prisoners: incarceration rate in the state in the previous year (sentenced prisoners per 100,000 residents)
OLS_1 <- lm(violent ~ ., data = subset_Guns)
stargazer(OLS_1, type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## violent
## -----------------------------------------------
## robbery 1.458***
## (0.025)
##
## prisoners 0.546***
## (0.027)
##
## male 4.776**
## (2.284)
##
## Constant 66.654*
## (39.150)
##
## -----------------------------------------------
## Observations 1,173
## R2 0.876
## Adjusted R2 0.875
## Residual Std. Error 118.052 (df = 1169)
## F Statistic 2,742.709*** (df = 3; 1169)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
stargazer(subset_Guns, type = "text")
##
## =================================================
## Statistic N Mean St. Dev. Min Max
## -------------------------------------------------
## violent 1,173 503.075 334.277 47.000 2,921.800
## robbery 1,173 161.820 170.510 6.400 1,635.100
## prisoners 1,173 226.580 178.888 19 1,913
## male 1,173 16.081 1.732 12.214 22.353
## -------------------------------------------------
white(mainlm = OLS_1, interactions = TRUE)
## # A tibble: 1 × 5
## statistic p.value parameter method alternative
## <dbl> <dbl> <dbl> <chr> <chr>
## 1 284. 5.05e-56 9 White's Test greater
The p-value of 5.05e-56 is far below the 0.05 threshold, so we reject the null hypothesis of homoskedasticity: the model's errors are heteroskedastic.
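As a quick cross-check, the studentized Breusch-Pagan test from lmtest (loaded along with AER) tests the closely related null that the error variance does not depend on the regressors; a minimal sketch:
bptest(OLS_1)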
plot(OLS_1, which = 1)
In the residuals vs. fitted plot above, there is a strong pattern in the data: the smoothed trend line is far from flat, and the residuals are tightly clustered at low fitted values and widely dispersed at high fitted values, with one outlier flagged by the plot (point 189). This fan shape points toward heteroskedasticity.
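A scale-location plot offers a complementary diagnostic: if the spread of the standardized residuals grows with the fitted values, its trend line slopes upward rather than staying flat.
plot(OLS_1, which = 3)  # sqrt(|standardized residuals|) vs. fitted values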
subset_Guns$residuals <- resid(object = OLS_1)  # store OLS residuals for the auxiliary regression
summary(subset_Guns$residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -492.20 -84.08 -15.52 0.00 64.36 488.30
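The auxiliary regression below follows White's specification, regressing the squared residuals on the regressors, their squares, and their pairwise interactions:
\[ \hat{\epsilon}_i^2 = \alpha_0 + \alpha_1 robbery_i + \alpha_2 male_i + \alpha_3 prisoners_i + \alpha_4 robbery_i^2 + \alpha_5 male_i^2 + \alpha_6 prisoners_i^2 + \alpha_7 robbery_i\,male_i + \alpha_8 robbery_i\,prisoners_i + \alpha_9 male_i\,prisoners_i + u_i \]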
subset_Guns$squared_residuals <- subset_Guns$residuals^2  # squared residuals are the dependent variable
Auxiliary_test <- lm(squared_residuals ~ robbery + male + prisoners + I(robbery^2) + I(male^2) + I(prisoners^2) + robbery:male + robbery:prisoners + male:prisoners, data = subset_Guns )
stargazer(Auxiliary_test, type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## squared_residuals
## -----------------------------------------------
## robbery 246.017***
## (78.102)
##
## male -33,530.860***
## (8,559.157)
##
## prisoners -183.813*
## (99.396)
##
## I(robbery2) 0.149***
## (0.014)
##
## I(male2) 980.787***
## (243.591)
##
## I(prisoners2) 0.127***
## (0.020)
##
## robbery:male -14.708***
## (4.753)
##
## robbery:prisoners -0.274***
## (0.029)
##
## male:prisoners 11.682**
## (5.903)
##
## Constant 291,238.600***
## (75,138.460)
##
## -----------------------------------------------
## Observations 1,173
## R2 0.243
## Adjusted R2 0.237
## Residual Std. Error 23,159.340 (df = 1163)
## F Statistic 41.376*** (df = 9; 1163)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Based on the auxiliary regression, the R^2 is 0.243 (adjusted R^2 = 0.237), which tells us that roughly 24% of the variation in the squared residuals is explained by the regressors, their squares, and their interactions. Because the squared residuals are this predictable from the regressors, the model exhibits heteroskedasticity.
Auxiliary_test_summary <- summary(Auxiliary_test)
# White's test statistic is n * R^2 from the auxiliary regression, asymptotically
# chi-squared with df = 9 (the number of auxiliary regressors)
chisq_p_value <- pchisq(q = Auxiliary_test_summary$r.squared * nobs(Auxiliary_test), df = 9, lower.tail = FALSE)
chisq_p_value
## [1] 5.052167e-56
The p-value computed above matches the p-value produced by the white() test from the 'skedastic' package.
# Test statistic (n * R^2)
Auxiliary_test_summary$r.squared * nobs(Auxiliary_test)
## [1] 284.4949
This replicates the test statistic reported by skedastic::white() as well.
Based on the tests run above, it is safe to say that this model's errors are heteroskedastic: the p-value of 5.052167e-56 is extremely small, and the R^2 of 0.243 from White's auxiliary regression shows that 24.3% of the variation in the squared residuals is explained by the regressors, their squares, and their interactions. The test statistic of 284.4949, judged against a chi-squared distribution with only 9 degrees of freedom, leaves no real doubt that the error variance depends on the regressors in the model.
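Because the point estimates remain unbiased, a common remedy is to keep the OLS coefficients but replace the conventional standard errors with heteroskedasticity-robust (White) ones. A minimal sketch using the sandwich and lmtest functions that load with AER (HC1 is one common variant, not the only choice):
coeftest(OLS_1, vcov = vcovHC(OLS_1, type = "HC1"))  # robust t-tests for the same coefficients
# robust SEs can also be passed to stargazer:
# stargazer(OLS_1, se = list(sqrt(diag(vcovHC(OLS_1, type = "HC1")))), type = "text")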