What is heteroskedasticity?
What is the null and alternative hypothesis in BP or White test?
# Load data
diwali <- read.csv("/Users/pin.lyu/Desktop/Tidy_Tuesday/Data/Diwali_Sales_Data.csv")
Data Description:
# Drop variables we don't need
diwali <- diwali |>
select(-Status, -unnamed1)
# Fix inputs
diwali$Product_ID <- gsub("P", "", diwali$Product_ID) |>
as.numeric()
# Drop missing values
diwali <- na.omit(diwali)
# Summary stats
stargazer(data = diwali, type = "text", title = "Table 1: Data Summary Statistics")
##
## Table 1: Data Summary Statistics
## ====================================================================
## Statistic N Mean St. Dev. Min Max
## --------------------------------------------------------------------
## User_ID 11,239 1,003,004.000 1,716.039 1,000,001 1,006,040
## Product_ID 11,239 175,116.500 101,222.500 142 370,642
## Age 11,239 35.410 12.754 12 92
## Marital_Status 11,239 0.420 0.494 0 1
## Orders 11,239 2.490 1.115 1 4
## Amount 11,239 9,453.611 5,222.356 188.000 23,952.000
## --------------------------------------------------------------------
Regression:
R <- lm(Amount ~ Marital_Status + Age , data = diwali)
# Summary
stargazer(R, type = 'text', title = 'Summary of Two Linear Regression')
##
## Summary of Two Linear Regression
## ===============================================
## Dependent variable:
## ---------------------------
## Amount
## -----------------------------------------------
## Marital_Status -181.073*
## (99.760)
##
## Age 12.583***
## (3.861)
##
## Constant 9,084.099***
## (151.697)
##
## -----------------------------------------------
## Observations 11,239
## R2 0.001
## Adjusted R2 0.001
## Residual Std. Error 5,219.555 (df = 11236)
## F Statistic 7.032*** (df = 2; 11236)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
# Set printing direction
par(mfrow = c(2, 2))
# graph
plot(R)
Comments: Based on the distribution of the residuals on the “Residual vs. Fitted” graph, we can see that homoskedasticity exist in this regression. The variance of the residuals are completely evenly distributed around the fitted line. This means that our regression suffers from severe homoskedasticity problem.
# Run white test for heteroskedasticity
white_test <- white(mainlm = R,
interactions = TRUE)
white_test
## # A tibble: 1 × 5
## statistic p.value parameter method alternative
## <dbl> <dbl> <dbl> <chr> <chr>
## 1 0.739 0.981 5 White's Test greater
Comment: The test statistic roughly has a value of 0.739. The P-value is about 98% , thus we would fail to reject the null hypothesis. This means that we have homoskedasticity in our regression!
# Extract residuals from the model
diwali$residuals <- resid(object = R)
# Print
summary(diwali$residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -9523 -3925 -1292 0 3186 14627
# Square the inputs
diwali$residuals <- (diwali$residuals)^2
summary(diwali$residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 3349968 13112280 27236480 39875241 213936590
white_auxillary_reg <- lm(formula = residuals ~
Marital_Status +
Age +
I(Marital_Status^2) +
I(Age^2) +
Marital_Status:Age,
data = diwali
)
stargazer(white_auxillary_reg, type = 'text', title = 'Table 2: White Auxillary Regression Summary Statistics')
##
## Table 2: White Auxillary Regression Summary Statistics
## ===============================================
## Dependent variable:
## ---------------------------
## residuals
## -----------------------------------------------
## Marital_Status 863,532.800
## (1,868,209.000)
##
## Age 25,155.730
## (99,649.230)
##
## I(Marital_Status2)
##
##
## I(Age2) 10.238
## (1,091.990)
##
## Marital_Status:Age -20,896.710
## (49,761.400)
##
## Constant 26,277,673.000***
## (2,156,055.000)
##
## -----------------------------------------------
## Observations 11,239
## R2 0.0001
## Adjusted R2 -0.0003
## Residual Std. Error 32,941,854.000 (df = 11234)
## F Statistic 0.185 (df = 4; 11234)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
# Restore regression into summary
white_auxillary_reg_summary <- white_auxillary_reg
Comments: The R squared is low. This suggests that we have homoskedasticity in our model.
Calculate the Chi-Square test statistic: \[χ2=n∗R^2new\] where
n: The total number of observations
R^2: The R-squared of the new regression model that used the squared residuals as the response values
# Chi-squared test
chisq_p_value <- pchisq(q = white_auxillary_reg_summary$r.squared * nobs(white_auxillary_reg),
df = 5, # degrees of freedom is the number of parameters estimated in the model minus 1 (for constant term) i.e. equal to the number of variables in the auxillary regression
lower.tail = FALSE )
# Print result
chisq_p_value
## numeric(0)
Comments: Here it is giving me a result of 0. Because R wants to emphasize that it is an numeric answer of 0, not an logic or categorical 0. That is why it is printing out the answer in numeric(0) form.
Based on our stargazer table, we can see that R squared is basically 0 as well. So we have obtained the same results.
BP_test <- bptest(formula = R ,
studentize = F
)
# Print result
BP_test
##
## Breusch-Pagan test
##
## data: R
## BP = 0.41071, df = 2, p-value = 0.8144
Comments: The null hypothesis of Breush-Pagan test is the same as white test which is that there is no difference between the coefficients of the predicted regressors (Homoskedasticity). The P-value is 0.814 which would result us to fail to reject the null. This means that homoskedasticity exist.