1. Issue Summary

What is heteroskedasticity?
- Heteroskedasticity refers to a situation where the variance of the residuals is unequal over a range of measured values. If heteroskedasticity exists, the population used in the regression contains unequal variance, the analysis results may be invalid.
What is the null and alternative hypothesis in BP or White test?
- The null hypothesis of both these test for heteroskedasticity is that the coefficient for the fitted values of regressors are equal to 0. The alternative hypothesis is that they are not equal to 0. If null is rejected, then it means that there is heteroskedasticity.

2. Model

# Load data
diwali <- read.csv("/Users/pin.lyu/Desktop/Tidy_Tuesday/Data/Diwali_Sales_Data.csv")

Data Description:
- Diwali, the festival of lights! An widely celebrated Indian fesitival. The data comes from sales data for a retail store during the Diwali festival period in India. The data is shared on Kaggle by Saad Haroon.

# Drop variables we don't need
diwali <- diwali |>
  select(-Status, -unnamed1)

# Fix inputs
diwali$Product_ID <- gsub("P", "", diwali$Product_ID) |>
  as.numeric()

# Drop missing values
diwali <- na.omit(diwali)

# Summary stats
stargazer(data = diwali, type = "text", title = "Table 1: Data Summary Statistics")

## 
## Table 1: Data Summary Statistics
## ====================================================================
## Statistic        N        Mean       St. Dev.      Min       Max    
## --------------------------------------------------------------------
## User_ID        11,239 1,003,004.000  1,716.039  1,000,001 1,006,040 
## Product_ID     11,239  175,116.500  101,222.500    142     370,642  
## Age            11,239    35.410       12.754       12         92    
## Marital_Status 11,239     0.420        0.494        0         1     
## Orders         11,239     2.490        1.115        1         4     
## Amount         11,239   9,453.611    5,222.356   188.000  23,952.000
## --------------------------------------------------------------------

Regression:
- \[ \widehat{Expenditure} = \beta_0 + \beta_1 Marriage + \beta_2Age + \epsilon \]

R <- lm(Amount ~ Marital_Status + Age , data = diwali)

# Summary
stargazer(R, type = 'text', title = 'Summary of Two Linear Regression')

## 
## Summary of Two Linear Regression
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                               Amount           
## -----------------------------------------------
## Marital_Status               -181.073*         
##                              (99.760)          
##                                                
## Age                          12.583***         
##                               (3.861)          
##                                                
## Constant                   9,084.099***        
##                              (151.697)         
##                                                
## -----------------------------------------------
## Observations                  11,239           
## R2                             0.001           
## Adjusted R2                    0.001           
## Residual Std. Error   5,219.555 (df = 11236)   
## F Statistic          7.032*** (df = 2; 11236)  
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

3. Residual Analysis

# Set printing direction
par(mfrow = c(2, 2))

# graph
plot(R)

Comments: Based on the distribution of the residuals on the “Residual vs. Fitted” graph, we can see that homoskedasticity exist in this regression. The variance of the residuals are completely evenly distributed around the fitted line. This means that our regression suffers from severe homoskedasticity problem.

4. White Test

# Run white test for heteroskedasticity 
white_test  <- white(mainlm =  R, 
                     interactions = TRUE)

white_test

## # A tibble: 1 × 5
##   statistic p.value parameter method       alternative
##       <dbl>   <dbl>     <dbl> <chr>        <chr>      
## 1     0.739   0.981         5 White's Test greater

Comment: The test statistic roughly has a value of 0.739. The P-value is about 98% , thus we would fail to reject the null hypothesis. This means that we have homoskedasticity in our regression!

5. Auxiliary Regression

# Extract residuals from the model
diwali$residuals <- resid(object = R)

# Print
summary(diwali$residuals)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -9523   -3925   -1292       0    3186   14627

# Square the inputs
diwali$residuals <- (diwali$residuals)^2 

summary(diwali$residuals)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##         1   3349968  13112280  27236480  39875241 213936590

white_auxillary_reg <- lm(formula = residuals ~  
                            Marital_Status + 
                            Age +
                            I(Marital_Status^2) +
                            I(Age^2) +
                            Marital_Status:Age,
                          
                         data = diwali
                   ) 

stargazer(white_auxillary_reg, type = 'text', title = 'Table 2: White Auxillary Regression Summary Statistics')

## 
## Table 2: White Auxillary Regression Summary Statistics
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                              residuals         
## -----------------------------------------------
## Marital_Status              863,532.800        
##                           (1,868,209.000)      
##                                                
## Age                         25,155.730         
##                            (99,649.230)        
##                                                
## I(Marital_Status2)                             
##                                                
##                                                
## I(Age2)                       10.238           
##                             (1,091.990)        
##                                                
## Marital_Status:Age          -20,896.710        
##                            (49,761.400)        
##                                                
## Constant                 26,277,673.000***     
##                           (2,156,055.000)      
##                                                
## -----------------------------------------------
## Observations                  11,239           
## R2                            0.0001           
## Adjusted R2                   -0.0003          
## Residual Std. Error 32,941,854.000 (df = 11234)
## F Statistic            0.185 (df = 4; 11234)   
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

# Restore regression into summary
white_auxillary_reg_summary <- white_auxillary_reg

Comments: The R squared is low. This suggests that we have homoskedasticity in our model.

6. Chi-Square Test

Calculate the Chi-Square test statistic: \[χ2=n∗R^2new\] where

n: The total number of observations
R^2: The R-squared of the new regression model that used the squared residuals as the response values

# Chi-squared test
chisq_p_value <-  pchisq(q = white_auxillary_reg_summary$r.squared * nobs(white_auxillary_reg), 
                         df = 5,       # degrees of freedom is the number of parameters estimated in the model minus 1 (for constant term) i.e.  equal to the number of variables in the auxillary regression
                         lower.tail = FALSE  )
# Print result
chisq_p_value

## numeric(0)

Comments: Here it is giving me a result of 0. Because R wants to emphasize that it is an numeric answer of 0, not an logic or categorical 0. That is why it is printing out the answer in numeric(0) form.

Based on our stargazer table, we can see that R squared is basically 0 as well. So we have obtained the same results.

7. Breusch-Pagan Test

BP_test <- bptest(formula = R ,  
                  studentize = F  
)
                      
# Print result
BP_test

## 
##  Breusch-Pagan test
## 
## data:  R
## BP = 0.41071, df = 2, p-value = 0.8144

Comments: The null hypothesis of Breush-Pagan test is the same as white test which is that there is no difference between the coefficients of the predicted regressors (Homoskedasticity). The P-value is 0.814 which would result us to fail to reject the null. This means that homoskedasticity exist.

ENMT_DIS#13

Pin Lyu

2023-12-03