Discussion 4

0.0.1 ISSUE SUMMARY

0.0.1.1 What is “heteroskedasticity”, and the econometric issue it causes (affects point estimates or standard errors)? Do not confuse heteroskedasticity with other terms like multicollinearity, serial correlation, et cetra (2-3 sentences in your own words - EG do not copy/paste directly from web.)

Heteroskedasticity means that when you’re trying to predict or explain something using a set of variables, the amount by which you can be wrong in your predictions is not always the same. Sometimes you’ll be more wrong than other times. This can mess up your analysis and make it hard to draw reliable conclusions.

0.0.1.2 What is the null and alternative hypothesis in BP or White test? The hypothesis are the same, but the auxiliary regression specification is slightly different. Do you agree with the test logic? (2-3 sentences)

The Breusch-Pagan/White test is a method that helps check whether a certain type of statistical variation (called heteroskedasticity) is present in a set of data or not. It does this by running a test that compares how much the data varies from its average value across all observations. This test is widely used in the field of economics to help improve the accuracy and reliability of statistical models.

0.0.2 Residual analysis

library(ggplot2)
data(diamonds)
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble  3.2.1     ✔ dplyr   1.1.0
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.4     ✔ forcats 1.0.0
## ✔ purrr   1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(psych)

## 
## Attaching package: 'psych'
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

library(dplyr)
glimpse(diamonds)

## Rows: 53,940
## Columns: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…

describe(diamonds)

##          vars     n    mean      sd  median trimmed     mad   min      max
## carat       1 53940    0.80    0.47    0.70    0.73    0.47   0.2     5.01
## cut*        2 53940    3.90    1.12    4.00    4.04    1.48   1.0     5.00
## color*      3 53940    3.59    1.70    4.00    3.55    1.48   1.0     7.00
## clarity*    4 53940    4.05    1.65    4.00    3.91    1.48   1.0     8.00
## depth       5 53940   61.75    1.43   61.80   61.78    1.04  43.0    79.00
## table       6 53940   57.46    2.23   57.00   57.32    1.48  43.0    95.00
## price       7 53940 3932.80 3989.44 2401.00 3158.99 2475.94 326.0 18823.00
## x           8 53940    5.73    1.12    5.70    5.66    1.38   0.0    10.74
## y           9 53940    5.73    1.14    5.71    5.66    1.36   0.0    58.90
## z          10 53940    3.54    0.71    3.53    3.49    0.85   0.0    31.80
##             range  skew kurtosis    se
## carat        4.81  1.12     1.26  0.00
## cut*         4.00 -0.72    -0.40  0.00
## color*       6.00  0.19    -0.87  0.01
## clarity*     7.00  0.55    -0.39  0.01
## depth       36.00 -0.08     5.74  0.01
## table       52.00  0.80     2.80  0.01
## price    18497.00  1.62     2.18 17.18
## x           10.74  0.38    -0.62  0.00
## y           58.90  2.43    91.20  0.00
## z           31.80  1.52    47.08  0.00

diamonds <- ggplot2:: diamonds
sub <- diamonds[, c(1,5,7)]

data(sub)

## Warning in data(sub): data set 'sub' not found

lm.mod <- lm(formula = carat ~ .,
         data    = sub)
summary(lm.mod )

## 
## Call:
## lm(formula = carat ~ ., data = sub)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.35270 -0.11217 -0.02375  0.10264  2.62184 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.101e-01  3.401e-02  -12.06   <2e-16 ***
## depth        1.259e-02  5.504e-04   22.87   <2e-16 ***
## price        1.095e-04  1.976e-07  554.28   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1831 on 53937 degrees of freedom
## Multiple R-squared:  0.8508, Adjusted R-squared:  0.8508 
## F-statistic: 1.538e+05 on 2 and 53937 DF,  p-value: < 2.2e-16

plot(lm.mod, which = 1)

par(mfrow = c(2, 2))

# Create a residual plot
plot(lm.mod)

par(mfrow = c(1, 1))

0.0.2.1 Install the “skedastic” package (installation help) and compute White’s test for your fitted model/main regression. How do you interpret the output i.e. what is the test statistic, and p-value? What is the interpretation i.e. do you reject/fail to reject the null, and what is the conclusion i.e. is there heteroskedasticity or not?

Null (H0): Homoscedasticity is present. Alternative (HA): Heteroscedasticity is present.

Test statistic is 6141. P-value is less than 0.05, we reject the null hypothesis. Heteroscedasticity is present.

0.0.3 White’s Test

options(repos = c(CRAN = "http://cran.rstudio.com/"))

install.packages("lmtest")

## 
## The downloaded binary packages are in
##  /var/folders/_2/ds2qp0zn11v05n6h2k954t800000gn/T//RtmpMhJwfT/downloaded_packages

library(lmtest)

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

bptest(lm.mod)

## 
##  studentized Breusch-Pagan test
## 
## data:  lm.mod
## BP = 6141, df = 2, p-value < 2.2e-16

?resid
sub$residuals <- resid(object = lm.mod)
summary(sub$residuals)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.35270 -0.11217 -0.02375  0.00000  0.10264  2.62184

sub$squared_residuals <- (sub$residuals)^2 
summary(sub$squared_residuals)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.003159 0.011824 0.033528 0.027131 6.874055

0.0.3.1 Now, run the auxiliary regression and interpret the R-squared value - what does it tell you about heteroscedasticity? To compute the auxiliary regression, you will first have to find the residuals from the main regression, square them and this vector will be your dependent variable.

From the coefficients, all the variables seem to demonstrate a weak relationship, <0.2. But while depth and price have a positive relationship, the rest (carat & depth, carat & price) have a negative relationship.

The R-squared value of 1 indicates that the regression predictions perfectly fit the data.

?resid
sub$residuals <- resid(object = lm.mod)

  summary(sub$residuals)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.35270 -0.11217 -0.02375  0.00000  0.10264  2.62184

  sub$squared_residuals <- (sub$residuals)^2 

  summary(sub$squared_residuals)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000000 0.003159 0.011824 0.033528 0.027131 6.874055

white_auxillary_reg <- lm(formula = squared_residuals ~ carat     +   depth     +  price +
                                                     I(carat^2)  + I(depth^2)   + I(price^2) +
                                                   carat:depth + depth:price + price:carat ,
                             data = sub
                   ) 

white_auxillary_reg_summary <- summary(white_auxillary_reg)
white_auxillary_reg_summary

## 
## Call:
## lm(formula = squared_residuals ~ carat + depth + price + I(carat^2) + 
##     I(depth^2) + I(price^2) + carat:depth + depth:price + price:carat, 
##     data = sub)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -9.410e-11 -1.000e-15  1.000e-15  3.000e-15  2.931e-11 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)  1.682e-01  1.253e-12  1.342e+11   <2e-16 ***
## carat        8.202e-01  3.491e-13  2.349e+12   <2e-16 ***
## depth       -1.032e-02  4.070e-14 -2.536e+11   <2e-16 ***
## price       -8.985e-05  4.411e-17 -2.037e+12   <2e-16 ***
## I(carat^2)   1.000e+00  2.011e-14  4.973e+13   <2e-16 ***
## I(depth^2)   1.584e-04  3.324e-16  4.766e+11   <2e-16 ***
## I(price^2)   1.200e-08  3.212e-22  3.737e+13   <2e-16 ***
## carat:depth -2.517e-02  5.706e-15 -4.412e+12   <2e-16 ***
## depth:price  2.758e-06  7.168e-19  3.847e+12   <2e-16 ***
## carat:price -2.191e-04  4.622e-18 -4.740e+13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.244e-13 on 53930 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 3.553e+26 on 9 and 53930 DF,  p-value: < 2.2e-16

?pchisq

chisq_p_value <-  pchisq(q = white_auxillary_reg_summary$r.squared * nobs(white_auxillary_reg), 
                         df = 9,      
                         lower.tail = FALSE  )
chisq_p_value

## [1] 0

white_auxillary_reg_summary$r.squared * nobs(white_auxillary_reg)

## [1] 53940

qchisq(p = .05, df = 9)

## [1] 3.325113

qchisq(p  = white_auxillary_reg_summary$r.squared * nobs(white_auxillary_reg), 
       df = 9, 
       lower.tail = FALSE
       )

## Warning in qchisq(p = white_auxillary_reg_summary$r.squared *
## nobs(white_auxillary_reg), : NaNs produced

## [1] NaN