1. What is “heteroskedasticity”, and the econometric issue it causes (affects point estimates or standard errors)? Do not confuse heteroskedasticity with other terms like multicollinearity, serial correlation, et cetera (2-3 sentences in your own words - e.g. do not copy/paste directly from the web.)
Heteroskedasticity occurs in a regression model when the variance of the error term is not constant across observations, for example when it grows with the level of a predictor. It does not bias the OLS point estimates, but it invalidates the usual standard errors (and OLS is no longer the minimum-variance, i.e., efficient, estimator), so hypothesis tests and confidence intervals based on those standard errors can be misleading. This is distinct from multicollinearity (correlation among the regressors) and serial correlation (correlation of the errors across observations).
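As an illustration (not part of the question; the object names below, such as sim.mod, are made up for the sketch), a small simulation in which the error standard deviation grows with the regressor shows the typical “fan-shaped” residual pattern, while the slope estimate itself remains close to its true value:
# Minimal simulation sketch: the error sd grows with x, so the errors are heteroskedastic
set.seed(123)
n <- 500
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 * x)   # error variance increases with x
sim.mod <- lm(y ~ x)
coef(sim.mod)                             # point estimates still close to the true (1, 2)
plot(x, resid(sim.mod))                   # residual spread fans out as x grows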
2. What are the null and alternative hypotheses in the BP or White test? The hypothesis is the same, but the auxiliary regression specification is slightly different. Do you agree with the test logic? (2-3 sentences)
In both the Breusch-Pagan (BP) and White tests, the null hypothesis (H_0) is that the variance of the residuals is constant (homoskedasticity). The alternative hypothesis (H_1) is that the variance of the residuals depends on the values of the predictor variables (heteroskedasticity). Yes, the logic makes sense: these tests use auxiliary regressions to see if the residual variance can be explained by the predictor variables or their combinations (squares and cross terms in the White test), and if it is explained significantly, that implies heteroskedasticity.
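Continuing the small simulation sketch above (again purely illustrative), the BP logic can be written out by hand: regress the squared residuals on the original regressor(s) and check whether n times the R-squared of that auxiliary regression is large relative to a chi-squared distribution.
# BP test logic by hand, applied to the simulated sim.mod from the sketch above
u2  <- resid(sim.mod)^2                    # squared residuals from the main regression
aux <- lm(u2 ~ x)                          # auxiliary regression on the original regressor
LM  <- nobs(aux) * summary(aux)$r.squared  # LM statistic; chi-squared with 1 df under H0
pchisq(LM, df = 1, lower.tail = FALSE)     # a small p-value is evidence of heteroskedasticity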
2. CODING
1. Choose a dataset, specify your linear regression, and estimate the regression in R. Please keep at least 3 independent variables in your regression. This is your main regression.
library(MASS)        # MASS provides the UScrime dataset
library(tidyverse)
library(psych)
library(stargazer)
# Load the UScrime dataset
data("UScrime")
# Overview of the dataset
str(UScrime)
'data.frame': 47 obs. of 16 variables:
$ M : int 151 143 142 136 141 121 127 131 157 140 ...
$ So : int 1 0 1 0 0 0 1 1 1 0 ...
$ Ed : int 91 113 89 121 121 110 111 109 90 118 ...
$ Po1 : int 58 103 45 149 109 118 82 115 65 71 ...
$ Po2 : int 56 95 44 141 101 115 79 109 62 68 ...
$ LF : int 510 583 533 577 591 547 519 542 553 632 ...
$ M.F : int 950 1012 969 994 985 964 982 969 955 1029 ...
$ Pop : int 33 13 18 157 18 25 4 50 39 7 ...
$ NW : int 301 102 219 80 30 44 139 179 286 15 ...
$ U1 : int 108 96 94 102 91 84 97 79 81 100 ...
$ U2 : int 41 36 33 39 20 29 38 35 28 24 ...
$ GDP : int 394 557 318 673 578 689 620 472 421 526 ...
$ Ineq: int 261 194 250 167 174 126 168 206 239 174 ...
$ Prob: num 0.0846 0.0296 0.0834 0.0158 0.0414 ...
$ Time: num 26.2 25.3 24.3 29.9 21.3 ...
$ y : int 791 1635 578 1969 1234 682 963 1555 856 705 ...
# Subset the dataset with the dependent variable and at least three independent variables
subset_USCrime <- UScrime[, c("y", "M", "Ed", "Po1")]
# Fit a linear regression model
lm.mod <- lm(formula = y ~ ., data = subset_USCrime)
# Display regression summary using stargazer for better formatting
stargazer(lm.mod, type = "text")
# Set up multiple plots in the same window
par(mfrow = c(2, 2))
# Create the residual diagnostic plots
plot(lm.mod)
# Reset to one plot per window
par(mfrow = c(1, 1))
# Summary of the linear regression model
summary(lm.mod)
Call:
lm(formula = y ~ ., data = subset_USCrime)
Residuals:
Min 1Q Median 3Q Max
-559.57 -168.63 -4.43 143.05 565.36
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2210.879 833.612 -2.652 0.01116 *
M 12.300 3.833 3.209 0.00252 **
Ed 4.737 4.242 1.117 0.27031
Po1 10.718 1.569 6.829 2.27e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 260.9 on 43 degrees of freedom
Multiple R-squared: 0.5748, Adjusted R-squared: 0.5451
F-statistic: 19.37 on 3 and 43 DF, p-value: 4.25e-08
Here y, the crime rate, is the dependent variable, and M (percentage of males), Ed (education), and Po1 (police expenditure) are the independent variables.
2. Install the “skedastic” package (installation help) and compute White’s test for your fitted model/main regression. How do you interpret the output, i.e., what is the test statistic, and p-value? What is the interpretation, i.e., do you reject/fail to reject the null, and what is the conclusion, i.e., is there heteroskedasticity or not?
Null Hypothesis (H0): The residuals have constant variance (homoskedasticity). This means that the variance of the error terms is constant across all levels of the independent variables.
Alternative Hypothesis (H1): The residuals do not have constant variance (heteroskedasticity). This implies that the variance of the residuals depends on the values of the independent variables, indicating heteroskedasticity is present.
# Install skedastic package (if not already installed)
# install.packages("skedastic")
library(skedastic)
# Conduct White's test for heteroskedasticity
white_test_result <- white(mainlm = lm.mod, interactions = TRUE)
# View the test results
white_test_result
# A tibble: 1 × 5
statistic p.value parameter method alternative
<dbl> <dbl> <dbl> <chr> <chr>
1 25.7 0.00233 9 White's Test greater
The test statistic is 25.7 with 9 degrees of freedom (the three regressors, their three squares, and their three cross terms), and the p-value is 0.00233. If the p-value were greater than the significance level (typically 0.05), we would fail to reject the null hypothesis of homoskedasticity. Here the p-value is well below 0.05, so we reject the null hypothesis and conclude that there is evidence of heteroskedasticity.
3. Now, run the auxiliary regression and interpret the R-squared value - what does it tell you about heteroskedasticity? To compute the auxiliary regression, you will first have to find the residuals from the main regression and square them; this vector of squared residuals will be your dependent variable. Your independent variables will be:
A. The original regressors in the main regression (you should know what regressors mean as we have talked about this in class).
B. Squares of the original regressors. These are easy to create in R with the as-is (I()) operator.
C. Cross terms of the original regressors with each other. These are also easily created in R with the colon operator (:).
So, you will have to create new variables for parts B and C above.
For example, if X_1 is the only regressor in your main regression, then just create (X_1)^2. However, if you had X_1 and X_2 in your main regression, create (X_1)^2 and (X_2)^2 (square terms) and X_1 * X_2 (cross term).
If you have three Xs, like the question asks, create/compute (X_1)^2, (X_2)^2, (X_3)^2 (square terms) and X_1 * X_2, X_2 * X_3, X_3 * X_1 (cross terms). Of course, you can try 4 or more regressors too for fun.
# Compute residuals from the main regression model
subset_USCrime$residuals <- resid(lm.mod)
# Create the squared residuals variable
subset_USCrime$squared_residuals <- (subset_USCrime$residuals)^2
summary(subset_USCrime$squared_residuals)
Min. 1st Qu. Median Mean 3rd Qu. Max.
17.2 8838.8 28295.2 62255.8 93423.8 319636.9
# Auxiliary regression with original predictors, squared terms, and interaction terms
aux_reg <- lm(squared_residuals ~ M + Ed + Po1 + I(M^2) + I(Ed^2) + I(Po1^2) + M:Ed + Ed:Po1 + Po1:M,
              data = subset_USCrime)
# Summary of the auxiliary regression
summary(aux_reg)
Call:
lm(formula = squared_residuals ~ M + Ed + Po1 + I(M^2) + I(Ed^2) +
I(Po1^2) + M:Ed + Ed:Po1 + Po1:M, data = subset_USCrime)
Residuals:
Min 1Q Median 3Q Max
-120388 -35045 -4579 16662 173160
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.591e+06 4.627e+06 -0.992 0.3275
M 2.598e+04 4.050e+04 0.641 0.5252
Ed 3.910e+04 3.752e+04 1.042 0.3042
Po1 1.903e+04 9.969e+03 1.909 0.0640 .
I(M^2) -7.670e+01 9.254e+01 -0.829 0.4125
I(Ed^2) -2.566e+02 1.285e+02 -1.998 0.0532 .
I(Po1^2) -1.753e+01 1.445e+01 -1.214 0.2326
M:Ed 7.180e+01 1.482e+02 0.484 0.6310
Ed:Po1 4.912e+01 5.748e+01 0.855 0.3983
M:Po1 -1.438e+02 5.983e+01 -2.404 0.0214 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 62780 on 37 degrees of freedom
Multiple R-squared: 0.5458, Adjusted R-squared: 0.4353
F-statistic: 4.94 on 9 and 37 DF, p-value: 0.0002225
The R-squared value from this auxiliary regression represents the proportion of the variance in the squared residuals that can be explained by the predictors.
A low R-squared would indicate that the regressors, their squares, and their cross terms explain little of the variation in the squared residuals, i.e., little evidence of heteroskedasticity; a high R-squared indicates that the residual variance moves systematically with the regressors. In our example the R-squared is about 0.546, so a substantial share of the variance of the squared residuals is explained by the predictors, which again points to heteroskedasticity.
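As an aside (not required by the question), the same auxiliary regression can be specified more compactly using R's formula expansion: (M + Ed + Po1)^2 expands to the main effects plus all pairwise interactions, and only the I() square terms need to be added by hand. A sketch, reusing the objects created above:
# Equivalent, more compact specification of the same auxiliary regression
aux_reg2 <- lm(squared_residuals ~ (M + Ed + Po1)^2 + I(M^2) + I(Ed^2) + I(Po1^2),
               data = subset_USCrime)
summary(aux_reg2)$r.squared   # should equal summary(aux_reg)$r.squared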
4. Again, do not forget to answer what low/high R-squared values suggest in terms of heteroskedasticity intuitively. We can formally test if this R-squared value is statistically large or small, given the sample size, using a chi-squared test.
In the lecture, we said we can use the F test that all coefficients in the auxiliary regression are jointly equal to 0, or alternatively test whether the R-squared of the auxiliary regression is large enough.
1. You will have to compute the test statistic by taking the R-squared of the auxiliary regression above and multiplying it by the number of observations in your auxiliary regression. You should get the same test statistic that the skedastic command shows you.
2. Your critical value will be based on the alpha/significance level (5% is fine, as it corresponds to a 95% confidence interval) and the degrees of freedom (the number of coefficients you estimate in your auxiliary regression minus 1, for the intercept). See the ?qchisq() function in R.
3. If the test statistic is greater than the critical value (more extreme, to the right here), you will reject the null and conclude heteroskedasticity. Equivalently, you can calculate the p-value of your test statistic and compare it with alpha to determine if you have heteroskedasticity or not: a small p-value (say, less than 5%) suggests heteroskedasticity.
# Calculate the chi-square test statistic
chi_sq_stat <- summary(aux_reg)$r.squared * nobs(aux_reg)
aux_reg
Call:
lm(formula = squared_residuals ~ M + Ed + Po1 + I(M^2) + I(Ed^2) +
I(Po1^2) + M:Ed + Ed:Po1 + Po1:M, data = subset_USCrime)
Coefficients:
(Intercept) M Ed Po1 I(M^2) I(Ed^2)
-4.591e+06 2.598e+04 3.910e+04 1.903e+04 -7.670e+01 -2.567e+02
I(Po1^2) M:Ed Ed:Po1 M:Po1
-1.753e+01 7.180e+01 4.912e+01 -1.438e+02
# Degrees of freedom is the number of coefficients estimated minus 1 (for the constant term)
df <- length(coefficients(aux_reg)) - 1
# Calculate the critical value at 5% significance level
critical_value <- qchisq(0.95, df)
# Compare the test statistic to the critical value
chi_sq_stat
[1] 25.653
critical_value
[1] 16.91898
# p-value for the chi-square test
p_value <- pchisq(chi_sq_stat, df, lower.tail = FALSE)
p_value
[1] 0.002327705
The test statistic (25.653) exceeds the critical value (16.919), and the p-value is 0.00233, which is less than 0.05, indicating significant evidence of heteroskedasticity. This matches the statistic and p-value reported by skedastic's white() test above.
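As a quick sanity check (just a sketch using the objects computed above), the hand-computed statistic and p-value can be compared directly with the skedastic output and with the critical value:
# The manual n * R-squared statistic should agree with skedastic's white() output up to rounding
c(manual = chi_sq_stat, skedastic = white_test_result$statistic)
c(manual = p_value, skedastic = white_test_result$p.value)
chi_sq_stat > critical_value   # TRUE -> reject H0 at the 5% level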
# Install and load lmtest package
# install.packages("lmtest")
library(lmtest)
# Conduct Breusch-Pagan test
bp_test_result <- bptest(lm.mod)
# View the test results
bp_test_result
studentized Breusch-Pagan test
data: lm.mod
BP = 19.687, df = 3, p-value = 0.0001971
If the p-value were greater than 0.05, we would fail to reject the null hypothesis of homoskedasticity. Here the p-value is 0.0001971, far below 0.05, so we again reject the null and conclude that heteroskedasticity is present.
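Given that both tests point to heteroskedasticity, one common follow-up (not asked for in this assignment, so the code below is only a hedged sketch) is to re-report the main regression with heteroskedasticity-robust (White/HC) standard errors from the sandwich package; the coefficient estimates are unchanged, only the standard errors and t-statistics are adjusted.
# Sketch: heteroskedasticity-robust standard errors for the main regression
# (assumes the sandwich package is installed; lmtest is already loaded above)
library(sandwich)
coeftest(lm.mod, vcov = vcovHC(lm.mod, type = "HC1"))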