Heteroskedasticity means that when you’re trying to predict or explain something using a set of variables, the amount by which you can be wrong in your predictions is not always the same. Sometimes you’ll be more wrong than other times. This can mess up your analysis and make it hard to draw reliable conclusions.
The Breusch-Pagan/White test is a method that helps check whether a certain type of statistical variation (called heteroskedasticity) is present in a set of data or not. It does this by running a test that compares how much the data varies from its average value across all observations. This test is widely used in the field of economics to help improve the accuracy and reliability of statistical models.
library(ggplot2)
data(diamonds)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble 3.2.1 ✔ dplyr 1.1.0
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.4 ✔ forcats 1.0.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(psych)
##
## Attaching package: 'psych'
##
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(dplyr)
glimpse(diamonds)
## Rows: 53,940
## Columns: 10
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
## $ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
## $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…
describe(diamonds)
## vars n mean sd median trimmed mad min max
## carat 1 53940 0.80 0.47 0.70 0.73 0.47 0.2 5.01
## cut* 2 53940 3.90 1.12 4.00 4.04 1.48 1.0 5.00
## color* 3 53940 3.59 1.70 4.00 3.55 1.48 1.0 7.00
## clarity* 4 53940 4.05 1.65 4.00 3.91 1.48 1.0 8.00
## depth 5 53940 61.75 1.43 61.80 61.78 1.04 43.0 79.00
## table 6 53940 57.46 2.23 57.00 57.32 1.48 43.0 95.00
## price 7 53940 3932.80 3989.44 2401.00 3158.99 2475.94 326.0 18823.00
## x 8 53940 5.73 1.12 5.70 5.66 1.38 0.0 10.74
## y 9 53940 5.73 1.14 5.71 5.66 1.36 0.0 58.90
## z 10 53940 3.54 0.71 3.53 3.49 0.85 0.0 31.80
## range skew kurtosis se
## carat 4.81 1.12 1.26 0.00
## cut* 4.00 -0.72 -0.40 0.00
## color* 6.00 0.19 -0.87 0.01
## clarity* 7.00 0.55 -0.39 0.01
## depth 36.00 -0.08 5.74 0.01
## table 52.00 0.80 2.80 0.01
## price 18497.00 1.62 2.18 17.18
## x 10.74 0.38 -0.62 0.00
## y 58.90 2.43 91.20 0.00
## z 31.80 1.52 47.08 0.00
diamonds <- ggplot2:: diamonds
sub <- diamonds[, c(1,5,6,7)]
data(sub)
## Warning in data(sub): data set 'sub' not found
sub <- sub[1:100,]
lm.mod <- lm(formula = carat ~ .,
data = sub )
summary(lm.mod )
##
## Call:
## lm(formula = carat ~ ., data = sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.06737 -0.03401 -0.01153 0.03333 0.16539
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.385e-01 2.833e-01 -0.842 0.4020
## depth 6.269e-03 3.096e-03 2.025 0.0457 *
## table 5.989e-04 2.242e-03 0.267 0.7900
## price 2.103e-04 6.784e-06 31.005 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04684 on 96 degrees of freedom
## Multiple R-squared: 0.9118, Adjusted R-squared: 0.909
## F-statistic: 330.7 on 3 and 96 DF, p-value: < 2.2e-16
plot(lm.mod, which = 1)
par(mfrow = c(2, 2))
# Create a residual plot
plot(lm.mod)
par(mfrow = c(1, 1))
Null (H0): Homoscedasticity is present. Alternative (HA): Heteroscedasticity is present.
Test statistic is 6141. P-value is less than 0.05, we reject the null hypothesis. Heteroscedasticity is present.
options(repos = c(CRAN = "http://cran.rstudio.com/"))
# install.packages("parallel")
# library(parallel)
# num_cores <- detectCores()-1
# cl <- makeCluster(num_cores)
# ?parLapply
# result <- parLapply(cl, lm.mod, white(mainlm = lm.mod,
# interactions = TRUE
# )
# )
install.packages("skedastic", dependencies = c("Depends", "Imports"))
##
## The downloaded binary packages are in
## /var/folders/_2/ds2qp0zn11v05n6h2k954t800000gn/T//RtmpQJyqZ7/downloaded_packages
library(skedastic)
skedastic_package_white <- white(mainlm = lm.mod,
interactions = TRUE
)
skedastic_package_white
## # A tibble: 1 × 5
## statistic p.value parameter method alternative
## <dbl> <dbl> <dbl> <chr> <chr>
## 1 66.5 7.33e-11 9 White's Test greater
install.packages("lmtest")
##
## The downloaded binary packages are in
## /var/folders/_2/ds2qp0zn11v05n6h2k954t800000gn/T//RtmpQJyqZ7/downloaded_packages
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bptest(lm.mod)
##
## studentized Breusch-Pagan test
##
## data: lm.mod
## BP = 26.694, df = 3, p-value = 6.824e-06
?resid
sub$residuals <- resid(object = lm.mod)
summary(sub$residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.06737 -0.03401 -0.01153 0.00000 0.03333 0.16539
sub$squared_residuals <- (sub$residuals)^2
summary(sub$squared_residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.230e-07 4.714e-04 1.151e-03 2.106e-03 2.683e-03 2.736e-02
From the coefficients, all the variables seem to demonstrate a weak relationship, <0.2. But while depth and price have a positive relationship, the rest (carat & depth, carat & price) have a negative relationship.
The R-squared value of 1 indicates that the regression predictions perfectly fit the data.
?resid
sub$residuals <- resid(object = lm.mod)
summary(sub$residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.06737 -0.03401 -0.01153 0.00000 0.03333 0.16539
sub$squared_residuals <- (sub$residuals)^2
summary(sub$squared_residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.230e-07 4.714e-04 1.151e-03 2.106e-03 2.683e-03 2.736e-02
white_auxillary_reg <- lm(formula = squared_residuals ~ table + depth + price +
I(table^2) + I(depth^2) + I(price^2)
+ table:depth + depth:price + price:table,
data = sub
)
white_auxillary_reg_summary <- summary(white_auxillary_reg)
white_auxillary_reg_summary
##
## Call:
## lm(formula = squared_residuals ~ table + depth + price + I(table^2) +
## I(depth^2) + I(price^2) + table:depth + depth:price + price:table,
## data = sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0044031 -0.0013611 -0.0001797 0.0008708 0.0129491
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.977e+00 4.527e-01 4.366 3.37e-05 ***
## table -3.454e-02 9.336e-03 -3.700 0.000371 ***
## depth -3.101e-02 9.282e-03 -3.341 0.001216 **
## price -7.036e-05 1.602e-05 -4.393 3.04e-05 ***
## I(table^2) 1.502e-04 4.940e-05 3.040 0.003099 **
## I(depth^2) 1.200e-04 6.454e-05 1.859 0.066299 .
## I(price^2) -1.559e-09 1.197e-09 -1.302 0.196194
## table:depth 2.731e-04 6.886e-05 3.966 0.000146 ***
## depth:price 5.944e-07 1.535e-07 3.873 0.000203 ***
## table:price 6.807e-07 1.443e-07 4.716 8.75e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.002168 on 90 degrees of freedom
## Multiple R-squared: 0.6651, Adjusted R-squared: 0.6316
## F-statistic: 19.86 on 9 and 90 DF, p-value: < 2.2e-16
?pchisq
chisq_p_value <- pchisq(q = white_auxillary_reg_summary$r.squared * nobs(white_auxillary_reg),
df = 9,
lower.tail = FALSE )
chisq_p_value
## [1] 7.327895e-11
white_auxillary_reg_summary$r.squared * nobs(white_auxillary_reg)
## [1] 66.51057
qchisq(p = .05, df = 9)
## [1] 3.325113
qchisq(p = white_auxillary_reg_summary$r.squared * nobs(white_auxillary_reg),
df = 9,
lower.tail = FALSE
)
## Warning in qchisq(p = white_auxillary_reg_summary$r.squared *
## nobs(white_auxillary_reg), : NaNs produced
## [1] NaN