Issue Summary

What is Heteroskedasticity

Heteroskedasticity means that the variability of errors in a regression model is different across observations. This can lead to incorrect estimates of the standard errors of the regression coefficients and affect hypothesis testing. In simpler terms, it means that the variability in the data is not constant, which can cause problems in statistical analysis.Two major statistical problems are related to this, biased or inefficient estimates of the standard errors of the regression coefficients and incorrect hypothesis testing results.

##Breusch-Pagan test and White’s Test Thee BPL and White tests are used to check if the errors in a regression model are the same or different across observations. The null hypothesis assumes that the errors are the same, while the alternative hypothesis assumes that they are different. The tests use different methods to check for this difference. Since the BPL test uses a single variable to capture the sources of heteroskedasticity, the White test uses multiple variables, to choose between these two tests would be depending on the data that is being analyzed.

Estimiate a multivariate regression

Choosing data

This data set contains statistics, in arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973

remove(list=ls())
library("skedastic")
library("stargazer")
## 
## Please cite as:
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
library("psych")
library("lmtest")
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library("AER")
## Loading required package: car
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:psych':
## 
##     logit
## Loading required package: sandwich
## Loading required package: survival
data("USArrests")


model <- lm(log(Murder) ~ ., data = USArrests)

stargazer(model, type = "text")
## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                             log(Murder)        
## -----------------------------------------------
## Assault                      0.006***          
##                               (0.001)          
##                                                
## UrbanPop                      -0.003           
##                               (0.005)          
##                                                
## Rape                           0.015           
##                               (0.009)          
##                                                
## Constant                     0.799***          
##                               (0.281)          
##                                                
## -----------------------------------------------
## Observations                    50             
## R2                             0.650           
## Adjusted R2                    0.627           
## Residual Std. Error       0.416 (df = 46)      
## F Statistic           28.509*** (df = 3; 46)   
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

Next, we will conduct the residual plot analysis

par(mfrow = c(2, 2))
plot(model)

  1. Since this is a relatively small dataset, the first residual vs fitted graph displays a approximately equal relationship between the posiitve and the negative residuals, which indicates a relatively significant linear relationship
  2. The Q-Q plot almost follows a straight line ascending, which indicates the normality of the values are satisfied.
  3. When assessing equal variables, the third graph of scale location indicates that the variance in residuals appears equal (constant) for all values of X/predicted Y, which suggests that the data is homoscedastic.
  4. The last graph indicates the data has outliers.

Testing with Packges

White’s Test

skedastic_package_white  <- white(mainlm =  model, 
                                  interactions = TRUE
                                  )
skedastic_package_white
## # A tibble: 1 × 5
##   statistic p.value parameter method       alternative
##       <dbl>   <dbl>     <dbl> <chr>        <chr>      
## 1      7.62   0.573         9 White's Test greater

The test statistic has a value of 7.62, follows a chi-squared distribution with parameter (the number of regressors without the constant in the model) degrees of freedom, and it has a p-value of approximately 0.573 - thus we fail to reject the null of homoskedasticity.

Manually Testing ” Auxillary Regression”

We will find the residuals from our model and put it in our data frame.

In heteroskedastcity regression tests, the dependent variable is residual squared - so we must construct it manually. Note, the residual is coming from the model for which you want to test for heteroskedasticity, and you can use the resid command to create them easily. The y variable is not income anymore.

USArrests$residuals <- resid(object = model)
summary(USArrests$residuals)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -1.248238 -0.320083 -0.002855  0.000000  0.330991  0.785010

Mean is Zero, confirmed.

USArrests$squared_residuals <- (USArrests$residuals)^2 
squared_residuals <- (USArrests$residuals)^2
  summary(USArrests$squared_residuals)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000322 0.0322093 0.1085053 0.1590068 0.1986654 1.5580987

No negative values detected

Creating X values

white_auxillary_reg <- lm(formula = squared_residuals ~  Assault+ UrbanPop  + Rape + 
                                                     I(Assault^2)  + I(UrbanPop^2) + I(Rape^2) +
                                                   Assault:UrbanPop + Assault:Rape + UrbanPop:Rape,
                          data = USArrests)
white_auxillary_reg_summary <- summary(white_auxillary_reg)
white_auxillary_reg_summary
## 
## Call:
## lm(formula = squared_residuals ~ Assault + UrbanPop + Rape + 
##     I(Assault^2) + I(UrbanPop^2) + I(Rape^2) + Assault:UrbanPop + 
##     Assault:Rape + UrbanPop:Rape, data = USArrests)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.32363 -0.12121 -0.04277  0.06287  1.14093 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)       9.502e-01  6.436e-01   1.476    0.148
## Assault          -2.062e-03  4.164e-03  -0.495    0.623
## UrbanPop         -9.713e-03  2.252e-02  -0.431    0.669
## Rape             -1.808e-02  3.749e-02  -0.482    0.632
## I(Assault^2)     -2.015e-08  9.384e-06  -0.002    0.998
## I(UrbanPop^2)     2.776e-05  1.898e-04   0.146    0.884
## I(Rape^2)         1.947e-06  6.053e-04   0.003    0.997
## Assault:UrbanPop  7.803e-06  5.618e-05   0.139    0.890
## Assault:Rape      5.349e-05  1.542e-04   0.347    0.730
## UrbanPop:Rape     1.004e-04  3.743e-04   0.268    0.790
## 
## Residual standard error: 0.2423 on 40 degrees of freedom
## Multiple R-squared:  0.1524, Adjusted R-squared:  -0.03831 
## F-statistic: 0.7991 on 9 and 40 DF,  p-value: 0.6192

Low R-squared suggests no heteroskedasticity.

chisq_p_value <-  pchisq(q = white_auxillary_reg_summary$r.squared * nobs(white_auxillary_reg), 
                         df = 9,       # degrees of freedom is the number of parameters estimated in the model minus 1 (for constant term) i.e.  equal to the number of variables in the auxillary regression
                         lower.tail = FALSE  )
chisq_p_value
## [1] 0.5728579

The Chisquare P-Value matches between Hand and Package

white_auxillary_reg_summary$r.squared * nobs(white_auxillary_reg)
## [1] 7.619895

The Critical Value also matches

Breusch-Pagan Test

The Breusch-Pagan test fits a linear regression model to the residuals of a linear regression model (by default the same explanatory variables are taken as in the main regression model) and rejects if too much of the variance is explained by the additional explanatory variables.

lmtest_package_bp <- bptest(formula = model,  # fitted "lm" object, or run the regression
                         studentize = TRUE   # by default, so this can be commented out
)
skedastic_package_bp <- skedastic::breusch_pagan(model)
# View the test results
lmtest_package_bp
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 5.2704, df = 3, p-value = 0.153
skedastic_package_bp
## # A tibble: 1 × 5
##   statistic p.value parameter method                alternative
##       <dbl>   <dbl>     <dbl> <chr>                 <chr>      
## 1      5.27   0.153         3 Koenker (studentised) greater

lmtest::bptest gives the exact same outpout as skedastic::breusch_pagan, as expected.

The test statistic value is 5.27, follows a chi-squared distribution with parameter (the number of regressors without the constant in the model) degrees of freedom, and it has a p-value of approximately 0.153 thus we fail to reject the null of homoskedasticity.

Conclusion

The BP test is a widely-used test that is easy to understand and apply. It’s based on a regression of the squared residuals from a regression model on the independent variables in the model. However, under certain conditions, it may not be as powerful as other tests, such as when the true heteroskedasticity is non-linear.

The White test is a more general test that allows for a wide range of forms of heteroskedasticity. It can be applied even when there are cross-correlations between the independent variables in the model. It’s more flexible and powerful than the BP test, but it may also be more complex to understand and apply, especially when the model has many variables.

Ultimately, the choice of test depends on the specific characteristics of the data and the research question at hand, as well as personal preferences and familiarity with the tests. It may be useful to try both tests and compare their results for a more complete picture of the presence of heteroskedasticity in the data.