Quarto enables you to weave together content and executable code into a finished document. To learn more about Quarto see https://quarto.org.
Running Code
When you click the Render button a document will be generated that includes both content and the output of embedded code. You can embed code like this:
library(skedastic) # for White's test
library(lmtest)
Loading required package: zoo
Attaching package: 'zoo'
The following objects are masked from 'package:base':
as.Date, as.Date.numeric
# Load the dataset
data(mtcars)
# Display the structure of the dataset
str(mtcars)
# Fit the main linear regression model
model <- lm(mpg ~ hp + wt + cyl, data = mtcars)
# Display the summary of the regression model
summary(model)
Call:
lm(formula = mpg ~ hp + wt + cyl, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-3.9290 -1.5598 -0.5311 1.1850 5.8986
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 38.75179 1.78686 21.687 < 2e-16 ***
hp -0.01804 0.01188 -1.519 0.140015
wt -3.16697 0.74058 -4.276 0.000199 ***
cyl -0.94162 0.55092 -1.709 0.098480 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.512 on 28 degrees of freedom
Multiple R-squared: 0.8431, Adjusted R-squared: 0.8263
F-statistic: 50.17 on 3 and 28 DF, p-value: 2.184e-11
# Perform White's test for heteroskedasticity
# bptest() from lmtest runs a studentized Breusch-Pagan test; regressing on the
# fitted values and their squares makes it the special form of White's test
white_test <- bptest(model, ~ fitted(model) + I(fitted(model)^2), data = mtcars)
# Display the results of White's test
print(white_test)
studentized Breusch-Pagan test
data: model
BP = 5.8332, df = 2, p-value = 0.05412
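Since the skedastic package was loaded above for White's test, the same check can also be run through that package directly. A minimal sketch, assuming a recent version of skedastic in which the function is called white() (older releases exported it as white_lm()) and accepts the fitted lm object, with an interactions argument controlling whether cross terms enter the auxiliary regression:

# White's test via skedastic on the same fitted model
white(model, interactions = FALSE) # squares of the regressors only
white(model, interactions = TRUE)  # squares plus pairwise cross terms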
# Optional: plot residuals to visually check for heteroskedasticity
par(mfrow = c(2, 2)) # Arrange plots in a 2x2 grid
plot(model)
White’s Test Results
Test Statistic (BP): 5.8332
Degrees of Freedom (df): 2
p-value: 0.05412
Interpretation
Null Hypothesis: The residuals have constant variance (homoskedasticity).
Alternative Hypothesis: The residuals do not have constant variance (heteroskedasticity exists).
p-value = 0.05412: This is slightly above the typical significance threshold of 0.05. At the 5% significance level, we fail to reject the null hypothesis of homoskedasticity. At a more lenient 10% significance level, the p-value would indicate some evidence of heteroskedasticity.
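To make this decision reproducible rather than read off by eye, the p-value can be pulled out of the test object (bptest() returns a standard htest list) and compared with the chosen significance level. A minimal sketch:

# Compare the test's p-value against the two thresholds discussed above
alpha_5 <- 0.05
alpha_10 <- 0.10
white_test$p.value < alpha_5  # FALSE: fail to reject H0 at the 5% level
white_test$p.value < alpha_10 # TRUE: some evidence at the 10% level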
Conclusion
At the 5% significance level, the test suggests that there is no statistically significant evidence of heteroskedasticity in the residuals of the model. Therefore, we conclude that the residual variance is likely constant, and the assumption of homoskedasticity holds for this model.
Using the residual plots, we can visually check for heteroskedasticity.
Residuals vs Fitted Values Plot: The points scatter randomly with no discernible pattern, indicating homoskedasticity (constant variance of the errors).
Scale-Location Plot: The points again form a random scatter, suggesting homoskedasticity.
Normal Q-Q Plot: The spread of the residuals is roughly constant across quantiles; note, though, that this plot primarily assesses the normality of the residuals rather than their variance.
Residuals vs Leverage Plot: Points far from the center (high leverage) do not also have large residuals, so no single observation dominates the fit.
White's test result is consistent with what the residual plots show.
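Beyond the default diagnostic panel, a direct visual check is to plot the absolute residuals against the fitted values: a roughly flat trend line suggests constant variance, while a funnel shape would suggest heteroskedasticity. A minimal sketch using base graphics:

# Plot |residuals| against fitted values with a lowess trend line
plot(fitted(model), abs(residuals(model)),
     xlab = "Fitted values", ylab = "|Residuals|",
     main = "Absolute residuals vs fitted values")
lines(lowess(fitted(model), abs(residuals(model))), col = "red")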
# Ensure mtcars has only 32 rows
nrow(mtcars) # Check the number of rows in the dataset
[1] 32
# Create squares of the original regressors
mtcars$hp2 <- mtcars$hp^2   # Square of horsepower
mtcars$wt2 <- mtcars$wt^2   # Square of weight
mtcars$cyl2 <- mtcars$cyl^2 # Square of cylinders
# Create cross terms of the original regressors
mtcars$hp_wt <- mtcars$hp * mtcars$wt   # Interaction between horsepower and weight
mtcars$hp_cyl <- mtcars$hp * mtcars$cyl # Interaction between horsepower and cylinders
mtcars$wt_cyl <- mtcars$wt * mtcars$cyl # Interaction between weight and cylinders
# View the first few rows of the dataset to confirm the new variables
head(mtcars)
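As an aside, the same auxiliary design can be specified without adding columns to mtcars by writing the squares and cross terms directly into the model formula. A minimal sketch (the name aux_inline is just for illustration); its R-squared should match the manual construction below:

# (hp + wt + cyl)^2 expands to the three main effects plus all pairwise
# interactions; I() protects the pure squares inside the formula
aux_inline <- lm(residuals(model)^2 ~ (hp + wt + cyl)^2 +
                   I(hp^2) + I(wt^2) + I(cyl^2), data = mtcars)
summary(aux_inline)$r.squared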
# Extract residuals from the main regression
residuals_main <- residuals(model)
# Square the residuals to use as the dependent variable in the auxiliary regression
squared_residuals <- residuals_main^2
# Run the auxiliary regression (squared residuals as the dependent variable)
auxiliary_model <- lm(squared_residuals ~ hp + wt + cyl + hp2 + wt2 + cyl2 + hp_wt + hp_cyl + wt_cyl, data = mtcars)
# Display the results of the auxiliary regression
summary(auxiliary_model)
Call:
lm(formula = squared_residuals ~ hp + wt + cyl + hp2 + wt2 +
cyl2 + hp_wt + hp_cyl + wt_cyl, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-13.7670 -3.3051 -0.8702 2.7699 20.9129
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.314e+01 3.821e+01 1.391 0.178
hp -2.971e-01 3.060e-01 -0.971 0.342
wt 1.107e+00 1.342e+01 0.082 0.935
cyl -8.987e+00 1.549e+01 -0.580 0.568
hp2 -8.104e-05 8.549e-04 -0.095 0.925
wt2 -5.660e+00 5.241e+00 -1.080 0.292
cyl2 3.271e-01 2.371e+00 0.138 0.892
hp_wt 1.503e-01 1.323e-01 1.136 0.268
hp_cyl -2.542e-02 8.616e-02 -0.295 0.771
wt_cyl 2.292e+00 6.875e+00 0.333 0.742
Residual standard error: 8.698 on 22 degrees of freedom
Multiple R-squared: 0.2795, Adjusted R-squared: -0.01525
F-statistic: 0.9483 on 9 and 22 DF, p-value: 0.5056
Interpreting the R-squared Value
In the auxiliary regression, the R-squared value is 0.2795, meaning that only 27.95% of the variation in the squared residuals is explained by the regressors (the original variables, their squares, and their cross terms).
This R-squared is low: the auxiliary regressors explain little of the variation in the squared residuals, suggesting there is no strong relationship between the residual variance and the independent variables, i.e., no significant heteroskedasticity.
If the R-squared were high, it would suggest that the independent variables are explaining more of the variation in the squared residuals, indicating the presence of heteroskedasticity (i.e., the variance of the residuals depends on the independent variables).
We can formally test if the R-squared value is statistically significant using a chi-squared test.
Null hypothesis: There is no heteroskedasticity (the R-squared of the auxiliary regression is not significant).
Alternative hypothesis: There is heteroskedasticity (the R-squared is significant, meaning the auxiliary regressors explain the variation in the squared residuals).
The test statistic for the chi-squared test is calculated as:
Test Statistic = R^2 × number of observations = 0.2795 × 32 = 8.944
where: R^2 = 0.2795 (from the auxiliary regression), number of observations = 32
Compute the critical value from the chi-squared distribution and compare it with the test statistic. We use the qchisq() function in R to find the critical value at a significance level of 5% (α = 0.05). The degrees of freedom equal the number of regressors in the auxiliary regression, excluding the intercept: the three original variables, their three squares, and the three cross terms, so df = 9. (This matches the bptest() result above, where two auxiliary regressors gave df = 2.)
# Define the R-squared value and number of observations
R_squared <- 0.2795
n_obs <- 32
df <- 9 # regressors in the auxiliary regression, excluding the intercept
# Calculate the test statistic
test_statistic <- R_squared * n_obs
test_statistic
[1] 8.944
# Find the critical value from the chi-squared distribution
alpha <- 0.05
critical_value <- qchisq(1 - alpha, df)
# Compute the p-value of the test statistic
p_value <- 1 - pchisq(test_statistic, df)
# Display results
test_statistic
[1] 8.944
critical_value
[1] 16.91898
p_value
[1] 0.4424594
Interpretation of the Chi-Squared Test Results
Test Statistic: The test statistic is 8.944.
Critical Value: The critical value for the chi-squared distribution with 9 degrees of freedom at a 5% significance level (α = 0.05) is 16.91898.
p-value: The p-value is approximately 0.4425.
The test statistic (8.944) is less than the critical value (16.91898). The p-value (≈ 0.4425) is greater than the significance level (α = 0.05).
Decision Rule: If the test statistic is greater than the critical value, we reject the null hypothesis. If the p-value is less than the significance level (α = 0.05), we also reject the null hypothesis.
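Expressed in code, the decision rule is a pair of logical comparisons on the quantities computed above; a minimal sketch:

# Either comparison gives the same decision; TRUE would mean reject H0
test_statistic > critical_value # FALSE here
p_value < alpha                 # FALSE here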
Conclusion
Fail to Reject the Null Hypothesis: Since the test statistic is smaller than the critical value and the p-value is larger than 0.05, we fail to reject the null hypothesis.
Interpretation: There is no significant evidence of heteroskedasticity in the model. The variance of the residuals is not significantly related to the independent variables, consistent with the conclusion from White's test above.