Heteroskedasticity means that when you’re trying to predict or explain something using a set of variables, the amount by which you can be wrong in your predictions is not always the same. Sometimes you’ll be more wrong than other times. This can mess up your analysis and make it hard to draw reliable conclusions.
The Breusch-Pagan/White test is a method that helps check whether a certain type of statistical variation (called heteroskedasticity) is present in a set of data or not. It does this by running a test that compares how much the data varies from its average value across all observations. This test is widely used in the field of economics to help improve the accuracy and reliability of statistical models.
library(ggplot2)
data(diamonds)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble 3.2.1 ✔ dplyr 1.1.0
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.4 ✔ forcats 1.0.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(psych)
##
## Attaching package: 'psych'
##
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(dplyr)
glimpse(diamonds)
## Rows: 53,940
## Columns: 10
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
## $ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
## $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…
describe(diamonds)
## vars n mean sd median trimmed mad min max
## carat 1 53940 0.80 0.47 0.70 0.73 0.47 0.2 5.01
## cut* 2 53940 3.90 1.12 4.00 4.04 1.48 1.0 5.00
## color* 3 53940 3.59 1.70 4.00 3.55 1.48 1.0 7.00
## clarity* 4 53940 4.05 1.65 4.00 3.91 1.48 1.0 8.00
## depth 5 53940 61.75 1.43 61.80 61.78 1.04 43.0 79.00
## table 6 53940 57.46 2.23 57.00 57.32 1.48 43.0 95.00
## price 7 53940 3932.80 3989.44 2401.00 3158.99 2475.94 326.0 18823.00
## x 8 53940 5.73 1.12 5.70 5.66 1.38 0.0 10.74
## y 9 53940 5.73 1.14 5.71 5.66 1.36 0.0 58.90
## z 10 53940 3.54 0.71 3.53 3.49 0.85 0.0 31.80
## range skew kurtosis se
## carat 4.81 1.12 1.26 0.00
## cut* 4.00 -0.72 -0.40 0.00
## color* 6.00 0.19 -0.87 0.01
## clarity* 7.00 0.55 -0.39 0.01
## depth 36.00 -0.08 5.74 0.01
## table 52.00 0.80 2.80 0.01
## price 18497.00 1.62 2.18 17.18
## x 10.74 0.38 -0.62 0.00
## y 58.90 2.43 91.20 0.00
## z 31.80 1.52 47.08 0.00
diamonds <- ggplot2:: diamonds
sub <- diamonds[, c(1,5,7)]
data(sub)
## Warning in data(sub): data set 'sub' not found
lm.mod <- lm(formula = carat ~ .,
data = sub)
summary(lm.mod )
##
## Call:
## lm(formula = carat ~ ., data = sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.35270 -0.11217 -0.02375 0.10264 2.62184
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.101e-01 3.401e-02 -12.06 <2e-16 ***
## depth 1.259e-02 5.504e-04 22.87 <2e-16 ***
## price 1.095e-04 1.976e-07 554.28 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1831 on 53937 degrees of freedom
## Multiple R-squared: 0.8508, Adjusted R-squared: 0.8508
## F-statistic: 1.538e+05 on 2 and 53937 DF, p-value: < 2.2e-16
plot(lm.mod, which = 1)
par(mfrow = c(2, 2))
# Create a residual plot
plot(lm.mod)
par(mfrow = c(1, 1))
Null (H0): Homoscedasticity is present. Alternative (HA): Heteroscedasticity is present.
Test statistic is 6141. P-value is less than 0.05, we reject the null hypothesis. Heteroscedasticity is present.
options(repos = c(CRAN = "http://cran.rstudio.com/"))
install.packages("lmtest")
##
## The downloaded binary packages are in
## /var/folders/_2/ds2qp0zn11v05n6h2k954t800000gn/T//RtmpMhJwfT/downloaded_packages
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bptest(lm.mod)
##
## studentized Breusch-Pagan test
##
## data: lm.mod
## BP = 6141, df = 2, p-value < 2.2e-16
?resid
sub$residuals <- resid(object = lm.mod)
summary(sub$residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.35270 -0.11217 -0.02375 0.00000 0.10264 2.62184
sub$squared_residuals <- (sub$residuals)^2
summary(sub$squared_residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.003159 0.011824 0.033528 0.027131 6.874055
From the coefficients, all the variables seem to demonstrate a weak relationship, <0.2. But while depth and price have a positive relationship, the rest (carat & depth, carat & price) have a negative relationship.
The R-squared value of 1 indicates that the regression predictions perfectly fit the data.
?resid
sub$residuals <- resid(object = lm.mod)
summary(sub$residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.35270 -0.11217 -0.02375 0.00000 0.10264 2.62184
sub$squared_residuals <- (sub$residuals)^2
summary(sub$squared_residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000000 0.003159 0.011824 0.033528 0.027131 6.874055
white_auxillary_reg <- lm(formula = squared_residuals ~ carat + depth + price +
I(carat^2) + I(depth^2) + I(price^2) +
carat:depth + depth:price + price:carat ,
data = sub
)
white_auxillary_reg_summary <- summary(white_auxillary_reg)
white_auxillary_reg_summary
##
## Call:
## lm(formula = squared_residuals ~ carat + depth + price + I(carat^2) +
## I(depth^2) + I(price^2) + carat:depth + depth:price + price:carat,
## data = sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.410e-11 -1.000e-15 1.000e-15 3.000e-15 2.931e-11
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.682e-01 1.253e-12 1.342e+11 <2e-16 ***
## carat 8.202e-01 3.491e-13 2.349e+12 <2e-16 ***
## depth -1.032e-02 4.070e-14 -2.536e+11 <2e-16 ***
## price -8.985e-05 4.411e-17 -2.037e+12 <2e-16 ***
## I(carat^2) 1.000e+00 2.011e-14 4.973e+13 <2e-16 ***
## I(depth^2) 1.584e-04 3.324e-16 4.766e+11 <2e-16 ***
## I(price^2) 1.200e-08 3.212e-22 3.737e+13 <2e-16 ***
## carat:depth -2.517e-02 5.706e-15 -4.412e+12 <2e-16 ***
## depth:price 2.758e-06 7.168e-19 3.847e+12 <2e-16 ***
## carat:price -2.191e-04 4.622e-18 -4.740e+13 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.244e-13 on 53930 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 3.553e+26 on 9 and 53930 DF, p-value: < 2.2e-16
?pchisq
chisq_p_value <- pchisq(q = white_auxillary_reg_summary$r.squared * nobs(white_auxillary_reg),
df = 9,
lower.tail = FALSE )
chisq_p_value
## [1] 0
white_auxillary_reg_summary$r.squared * nobs(white_auxillary_reg)
## [1] 53940
qchisq(p = .05, df = 9)
## [1] 3.325113
qchisq(p = white_auxillary_reg_summary$r.squared * nobs(white_auxillary_reg),
df = 9,
lower.tail = FALSE
)
## Warning in qchisq(p = white_auxillary_reg_summary$r.squared *
## nobs(white_auxillary_reg), : NaNs produced
## [1] NaN