Weekly Discussion

Author

Zixuan Yin

1. Issue Summary

1.1 What is heteroskedasticity, and what econometric issues does it cause?

Heteroskedasticity refers to the situation where the variance of the error term is not constant across observations. Although the errors still have an expected value of zero, their standard deviation can differ at different values of the independent variables.

Under heteroskedasticity, OLS coefficient estimates remain unbiased, but the usual standard-error formulas are no longer valid. Predictions become less reliable at values of the independent variables where the error variance is large, and inference built on those standard errors, particularly the t-test, can be misleading.
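To make this concrete, here is a minimal simulation sketch of my own (the names x, y, and sim_mod are illustrative, not part of the assignment): the error standard deviation grows with x, so the residuals fan out in a residuals-versus-fitted plot.

set.seed(42)
x <- runif(200, min = 0, max = 10)
e <- rnorm(200, mean = 0, sd = 0.5 * x)  # error sd grows with x: heteroskedastic by construction
y <- 1 + 2 * x + e
sim_mod <- lm(y ~ x)
plot(sim_mod, which = 1)  # residual spread widens as fitted values increase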

1.2 What are the null and alternative hypotheses in the BP and White tests?

Null hypothesis: \(H_0: \delta_1 = \delta_2 = \cdots = \delta_p = 0\) (homoskedasticity), where \(\delta_1, \ldots, \delta_p\) are the slope coefficients of an auxiliary regression of the squared residuals on functions of the regressors.

Alternative hypothesis: \(H_1:\) at least one \(\delta_j \neq 0\) (heteroskedasticity).
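Concretely, both tests regress the squared OLS residuals on functions of the regressors. With the three regressors used below, the White auxiliary regression (the standard textbook form, written out here to fix the notation) is:

\(\hat{u}_i^2 = \delta_0 + \delta_1 x_{i1} + \delta_2 x_{i2} + \delta_3 x_{i3} + \delta_4 x_{i1}^2 + \delta_5 x_{i2}^2 + \delta_6 x_{i3}^2 + \delta_7 x_{i1} x_{i2} + \delta_8 x_{i1} x_{i3} + \delta_9 x_{i2} x_{i3} + v_i\)

The Breusch-Pagan test keeps only the level terms \(x_{i1}, x_{i2}, x_{i3}\).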

2. Code

Regression Equation:

\(Sepal.Length_i = \beta_0 + \beta_1 \, Sepal.Width_{i} + \beta_2 \, Petal.Length_{i} + \beta_3 \, Petal.Width_{i} + \epsilon_{i}\)

rm(list = ls())  # Clear the workspace
# Load necessary libraries

library(tidyverse)
library(psych)

library(lmtest)
library(skedastic)
# Load the Iris dataset (it is part of R's built-in datasets)
data(iris)

# Glimpse of the dataset
glimpse(iris)
Rows: 150
Columns: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
$ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
$ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
$ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
# Describe the dataset
describe(iris)
             vars   n mean   sd median trimmed  mad min max range  skew
Sepal.Length    1 150 5.84 0.83   5.80    5.81 1.04 4.3 7.9   3.6  0.31
Sepal.Width     2 150 3.06 0.44   3.00    3.04 0.44 2.0 4.4   2.4  0.31
Petal.Length    3 150 3.76 1.77   4.35    3.76 1.85 1.0 6.9   5.9 -0.27
Petal.Width     4 150 1.20 0.76   1.30    1.18 1.04 0.1 2.5   2.4 -0.10
Species*        5 150 2.00 0.82   2.00    2.00 1.48 1.0 3.0   2.0  0.00
             kurtosis   se
Sepal.Length    -0.61 0.07
Sepal.Width      0.14 0.04
Petal.Length    -1.42 0.14
Petal.Width     -1.36 0.06
Species*        -1.52 0.07
# Main regression model specification (Sepal Length as dependent variable)
# Sepal.Length = β0 + β1 Sepal.Width + β2 Petal.Length + β3 Petal.Width + ε
subset_iris <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]  # Keep the four numeric columns (drop Species)

# Linear regression model
lm_mod <- lm(formula = Sepal.Length ~ ., data = subset_iris)

# Summary of the regression model
summary(lm_mod)  # review coefficient estimates and overall fit

Call:
lm(formula = Sepal.Length ~ ., data = subset_iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.82816 -0.21989  0.01875  0.19709  0.84570 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.85600    0.25078   7.401 9.85e-12 ***
Sepal.Width   0.65084    0.06665   9.765  < 2e-16 ***
Petal.Length  0.70913    0.05672  12.502  < 2e-16 ***
Petal.Width  -0.55648    0.12755  -4.363 2.41e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3145 on 146 degrees of freedom
Multiple R-squared:  0.8586,    Adjusted R-squared:  0.8557 
F-statistic: 295.5 on 3 and 146 DF,  p-value: < 2.2e-16
# Heteroskedasticity tests

## Residual Analysis
# Plot "residuals versus fitted values" chart
plot(lm_mod, which = 1)

# Set up the plotting region to have four subplots
par(mfrow = c(2, 2))
plot(lm_mod)

par(mfrow = c(1, 1))

# Formal tests with packages for heteroskedasticity
# White test for heteroskedasticity

# Conduct the White test for heteroskedasticity using the skedastic package
white_test_result <- white(mainlm = lm_mod, interactions = TRUE)

# View the test results
white_test_result
# A tibble: 1 × 5
  statistic p.value parameter method       alternative
      <dbl>   <dbl>     <dbl> <chr>        <chr>      
1      10.4   0.323         9 White's Test greater    
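The p-value (0.323) is well above 5%, so the White test fails to reject homoskedasticity. As a quick programmatic check (a small addition of my own):

# Compare the White test p-value to alpha = 0.05
white_test_result$p.value > 0.05  # TRUE: fail to reject H0 of homoskedasticity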
# Auxiliary regression for heteroskedasticity (White Test)
# Create the residuals squared variable
subset_iris$residuals <- resid(lm_mod)
summary(subset_iris$residuals)  # OLS residuals have mean zero by construction
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.82816 -0.21989  0.01875  0.00000  0.19709  0.84570 
subset_iris$squared_residuals <- (subset_iris$residuals)^2
summary(subset_iris$squared_residuals)  # should not have any negative values
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.0000003 0.0105026 0.0461078 0.0963027 0.1386987 0.7152021 
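As an optional visual check (my own addition, not required for the tests), plotting the squared residuals against the fitted values previews the relationship the auxiliary regression formalizes:

# Visual check: does the squared-residual spread vary with the fitted values?
plot(fitted(lm_mod), subset_iris$squared_residuals,
     xlab = "Fitted values", ylab = "Squared residuals")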
# Create X variables: original predictors, squared terms, and interaction terms
white_auxiliary_reg <- lm(formula = squared_residuals ~ Sepal.Width + Petal.Length + Petal.Width +
                                                    I(Sepal.Width^2) + I(Petal.Length^2) + I(Petal.Width^2) +
                                                    Sepal.Width:Petal.Length + Petal.Length:Petal.Width + Petal.Width:Sepal.Width,
                          data = subset_iris)

# Summary of the auxiliary regression
white_auxiliary_reg_summary <- summary(white_auxiliary_reg)
white_auxiliary_reg_summary

Call:
lm(formula = squared_residuals ~ Sepal.Width + Petal.Length + 
    Petal.Width + I(Sepal.Width^2) + I(Petal.Length^2) + I(Petal.Width^2) + 
    Sepal.Width:Petal.Length + Petal.Length:Petal.Width + Petal.Width:Sepal.Width, 
    data = subset_iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.16002 -0.07431 -0.03493  0.04494  0.58292 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
(Intercept)              -0.47099    0.56592  -0.832    0.407
Sepal.Width               0.23798    0.30375   0.783    0.435
Petal.Length              0.06787    0.24563   0.276    0.783
Petal.Width               0.12842    0.58120   0.221    0.825
I(Sepal.Width^2)         -0.02434    0.04284  -0.568    0.571
I(Petal.Length^2)         0.01458    0.03580   0.407    0.684
I(Petal.Width^2)          0.10404    0.18452   0.564    0.574
Sepal.Width:Petal.Length -0.02351    0.06392  -0.368    0.714
Petal.Length:Petal.Width -0.08037    0.16138  -0.498    0.619
Sepal.Width:Petal.Width  -0.01972    0.15493  -0.127    0.899

Residual standard error: 0.126 on 140 degrees of freedom
Multiple R-squared:  0.06901,   Adjusted R-squared:  0.009162 
F-statistic: 1.153 on 9 and 140 DF,  p-value: 0.33
# BP test (Breusch-Pagan test) for heteroskedasticity
# Perform Breusch-Pagan test for heteroskedasticity
bp_test_result <- bptest(formula = lm_mod, studentize = TRUE)

# View the test results
bp_test_result

    studentized Breusch-Pagan test

data:  lm_mod
BP = 6.9605, df = 3, p-value = 0.07317
# Breusch-Pagan test using skedastic package
bp_skedastic_result <- skedastic::breusch_pagan(lm_mod)

# View both results (they should match)
bp_test_result

    studentized Breusch-Pagan test

data:  lm_mod
BP = 6.9605, df = 3, p-value = 0.07317
bp_skedastic_result
# A tibble: 1 × 5
  statistic p.value parameter method                alternative
      <dbl>   <dbl>     <dbl> <chr>                 <chr>      
1      6.96  0.0732         3 Koenker (studentised) greater    
# Conclusion of Breusch-Pagan Test
bp_test_statistic <- bp_test_result$statistic
bp_p_value <- bp_test_result$p.value
bp_test_statistic
      BP 
6.960452 
bp_p_value
        BP 
0.07316904 
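To make the decision rule explicit (a small sketch of my own; alpha is an illustrative name), the p-value can be compared to the significance level in code:

# Decision rule at the 5% significance level
alpha <- 0.05
if (bp_p_value < alpha) {
  cat("Reject H0: evidence of heteroskedasticity\n")
} else {
  cat("Fail to reject H0: no evidence of heteroskedasticity at the 5% level\n")
}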

The R-squared of the auxiliary regression is 0.06901, meaning the regressors, their squares, and their interactions explain only about 6.9% of the variation in the squared residuals. Such a small value already suggests that heteroskedasticity is unlikely to be a problem in this model.

Next, I conduct the chi-square (LM) version of the White test at \(\alpha = 5\%\): the statistic is \(nR^2\), compared against a chi-square critical value with degrees of freedom equal to the number of auxiliary regressors (9 here).

# Given R^2 and sample size n
R_squared <- 0.06901
n <- 150

# Calculate the chi-square test statistic (χ²)
test_statistic <- R_squared * n

# Print the test statistic
cat("Test Statistic (χ²): ", test_statistic, "\n")
Test Statistic (χ²):  10.3515 
qchisq(1 - 0.05, df = 9)  # critical value: chi-square with 9 df (one per auxiliary regressor)
[1] 16.91898
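For completeness, the implied p-value of this LM statistic can be computed directly; it should match the skedastic::white() result above (about 0.32):

# Implied p-value under the chi-square distribution with 9 df
1 - pchisq(test_statistic, df = 9)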

Since the test statistic is less than the critical value, we cannot reject the null hypothesis, which also indicates that the model does not have heteroscedasticity issues.