Regresssion Discussion 12

Introduction

Heteroskedasticity is an issue where the is non constant variance among residuals for a regression analysis. This causes issues because it is a violation of OLS assumptions, it ends up causing differences in standard errors, coefficients and it also leads to the coefficients no longer being efficient. The R squared value will also not be as reliable as a metric for goodness of fit of the model.

Both tests evaluate the same null hypothesis: that there is constant variance (homoskedasticity) in the residuals, however, the key difference is in the assumptions about the nature of heteroskedasticity. The White test is more flexible because it does not assume that non-constant variance is only linear and so it also accounts for non-linear relationships between the residuals and the predictors. Whereas, the BP test assumes that heteroskedasticity follows a linear relationship with the predictors.

A main OLS will be run which will then be used to determine if there is a violation of the homoskedasticity assumption. In order to test this residual plots will be used but then White test will be used to confirm - looking at both the test stat witht he crit value and the pvalue.

Data Loading

remove(list=ls())

# install.packages("MASS")
library(MASS)
  help(painters)
  
str(painters) # 64 rows and 5 columns.
'data.frame':   54 obs. of  5 variables:
 $ Composition: int  10 15 8 12 0 15 8 15 4 17 ...
 $ Drawing    : int  8 16 13 16 15 16 17 16 12 18 ...
 $ Colour     : int  16 4 16 9 8 4 4 7 10 12 ...
 $ Expression : int  3 14 7 8 0 14 8 6 4 18 ...
 $ School     : Factor w/ 8 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
painters <- MASS::painters
library(tidyverse)
glimpse(painters)
Rows: 54
Columns: 5
$ Composition <int> 10, 15, 8, 12, 0, 15, 8, 15, 4, 17, 10, 13, 10, 15, 13, 12…
$ Drawing     <int> 8, 16, 13, 16, 15, 16, 17, 16, 12, 18, 13, 15, 15, 14, 14,…
$ Colour      <int> 16, 4, 16, 9, 8, 4, 4, 7, 10, 12, 8, 8, 6, 7, 10, 5, 6, 12…
$ Expression  <int> 3, 14, 7, 8, 0, 14, 8, 6, 4, 18, 8, 8, 6, 10, 9, 8, 10, 6,…
$ School      <fct> A, A, A, A, A, A, A, A, A, A, B, B, B, B, B, B, C, C, C, C…
library(psych)
describe(painters)
            vars  n  mean   sd median trimmed  mad min max range  skew kurtosis
Composition    1 54 11.56 4.09   12.5   11.86 3.71   0  18    18 -0.66    -0.31
Drawing        2 54 12.46 3.46   13.5   12.66 3.71   6  18    12 -0.46    -1.04
Colour         3 54 10.94 4.65   10.0   11.09 5.93   0  18    18 -0.14    -1.14
Expression     4 54  7.67 4.80    6.0    7.52 2.97   0  18    18  0.39    -0.72
School*        5 54  4.07 2.26    4.0    4.00 2.97   1   8     7  0.16    -1.19
              se
Composition 0.56
Drawing     0.47
Colour      0.63
Expression  0.65
School*     0.31
#Removing school to predict Drawing 
painters <- painters[, !(colnames(painters) == "School")]

Estimating Equation

\[ Drawing_i = \beta_0 + \beta_1Compostiton_i + \beta_2Colour_i+ \beta_3Expression_i + \epsilon_i \]

library(stargazer)

Please cite as: 
 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer 
lm_model <- lm(Drawing ~., data = painters)

stargazer(lm_model, type = "text")

===============================================
                        Dependent variable:    
                    ---------------------------
                              Drawing          
-----------------------------------------------
Composition                    0.078           
                              (0.112)          
                                               
Colour                       -0.314***         
                              (0.076)          
                                               
Expression                   0.309***          
                              (0.097)          
                                               
Constant                     12.626***         
                              (1.378)          
                                               
-----------------------------------------------
Observations                    54             
R2                             0.503           
Adjusted R2                    0.473           
Residual Std. Error       2.510 (df = 50)      
F Statistic           16.849*** (df = 3; 50)   
===============================================
Note:               *p<0.1; **p<0.05; ***p<0.01

Residual Plots

plot(lm_model, which = 1)

Does not look like there is issues of heteroskedasticity but to confirm this we will use other tests.

par(mfrow = c(2, 2))

# Create a residual plot
plot(lm_model)

Formal Tests with packages

#install.packages("skedastic")

#Used for White's Test 
library(skedastic)


#install.packages("lmtest") 

#Used for Breush Pagan Test 
require("lmtest")

White Test

Null Hypothesis: The residuals have constant variance (homoskedasticity)

Alternative Hypothesis: There is non constant variance among residuals (heteroskedasticity)

# Conduct White test for heteroscedasticity
skedastic_package_white  <- white(mainlm =  lm_model, 
                                  interactions = TRUE)         

# View the test results
skedastic_package_white
# A tibble: 1 × 5
  statistic p.value parameter method       alternative
      <dbl>   <dbl>     <dbl> <chr>        <chr>      
1      11.0   0.275         9 White's Test greater    

How do you interpret the output i.e. what is the test statistic, and p-value? What is the interpretation i.e. do you reject/fail to reject the null, and what is the conclusion?

The test stat for the White test is 11.004, and the p value is 0.275 which is > 0.05 so we fail to reject the null, there is not enough evidence that suggests heteroskedasticity in the model. Therefore the White test suggests that there is constant variance among residuals .

?white

Auxiliary Regression

?resid 
painters$residuals <- resid(object = lm_model)

summary(painters$residuals) # mean should be zero
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-6.6045 -1.9018  0.2017  0.0000  1.5221  5.2319 
painters$squared_residuals <- (painters$residuals)^2 

summary(painters$squared_residuals) # should not have any negative values
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
 0.00531  0.48669  2.63056  5.83321  6.34556 43.61878 
white_auxillary_reg <- lm(formula = squared_residuals ~ Composition + Colour + Expression+ I(Composition^2)  + I(Colour^2)   + I(Expression^2) +Composition:Colour + Colour:Expression + Expression:Composition ,data = painters) 

white_auxillary_reg_summary <- summary(white_auxillary_reg)
white_auxillary_reg_summary

Call:
lm(formula = squared_residuals ~ Composition + Colour + Expression + 
    I(Composition^2) + I(Colour^2) + I(Expression^2) + Composition:Colour + 
    Colour:Expression + Expression:Composition, data = painters)

Residuals:
    Min      1Q  Median      3Q     Max 
-14.408  -4.629  -1.856   2.102  33.486 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)   
(Intercept)            38.18866   13.15967   2.902  0.00577 **
Composition            -2.23072    1.94467  -1.147  0.25754   
Colour                 -3.06727    1.63489  -1.876  0.06728 . 
Expression             -0.91416    1.84796  -0.495  0.62328   
I(Composition^2)       -0.02698    0.10272  -0.263  0.79403   
I(Colour^2)             0.05732    0.06355   0.902  0.37197   
I(Expression^2)        -0.06420    0.10545  -0.609  0.54577   
Composition:Colour      0.16843    0.11773   1.431  0.15961   
Colour:Expression       0.02581    0.08854   0.292  0.77203   
Composition:Expression  0.11989    0.15025   0.798  0.42918   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.019 on 44 degrees of freedom
Multiple R-squared:  0.2038,    Adjusted R-squared:  0.04091 
F-statistic: 1.251 on 9 and 44 DF,  p-value: 0.2901

Interpret the R-squared value - what does it tell you about heteroscedasticity?

R squared is 0.2 here which is low (compared to 0.5 main regression), suggesting no heteroskedasticity.

Chi Squared Test

You will have to compute the test statistic by taking the R-squared of the auxiliary regression above and multiplying it by the number of observations in your auxiliary regression.

54*0.2038
[1] 11.0052
?pchisq

chisq_p_value <-  pchisq(q = white_auxillary_reg_summary$r.squared * nobs(white_auxillary_reg), df = 9, lower.tail = FALSE  )
chisq_p_value
[1] 0.2754446

Same as White

white_auxillary_reg_summary$r.squared * nobs(white_auxillary_reg)
[1] 11.00386

Same test stat

Critical Value

?qchisq()

qchisq(0.95, 9, ncp = 0, lower.tail = TRUE, log.p = FALSE)
[1] 16.91898

The critical value is 16.9 and the test statistic is 11.003, therefore our test stat is less than crit which means we fail to reject the null. There is not enough evidence to suggest there is heteroskedasticity in our model. The assumption of homoskedasticity stands.

The pvalue suggests the same results, p value of 0.275>0.05 therefore we fail to reject the null.