1. What is “heteroskedasticity”, and the econometric issue it causes (affects point estimates or standard errors)? Do not confuse heteroskedasticity with other terms like multicollinearity, serial correlation, et cetera (2-3 sentences in your own words - e.g. do not copy/paste directly from the web.)
Heteroskedasticity occurs in a regression model when the variance of the error term is not constant across observations, for example when it grows with the level of a predictor. It does not bias the OLS point estimates, but it invalidates the usual standard errors (and OLS is no longer the minimum-variance, i.e., efficient, estimator), so hypothesis tests and confidence intervals based on those standard errors can be misleading. This is distinct from multicollinearity (correlation among the regressors) and serial correlation (correlation of the errors across observations).
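As an illustration (not part of the question; the object names below, such as sim.mod, are made up for the sketch), a small simulation in which the error standard deviation grows with the regressor shows the typical “fan-shaped” residual pattern, while the slope estimate itself remains close to its true value:
# Minimal simulation sketch: the error sd grows with x, so the errors are heteroskedastic
set.seed(123)
n <- 500
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 * x)   # error variance increases with x
sim.mod <- lm(y ~ x)
coef(sim.mod)                             # point estimates still close to the true (1, 2)
plot(x, resid(sim.mod))                   # residual spread fans out as x grows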
2. What are the null and alternative hypotheses in the BP or White test? The hypothesis is the same, but the auxiliary regression specification is slightly different. Do you agree with the test logic? (2-3 sentences)
In both the Breusch-Pagan (BP) and White tests, the null hypothesis (H_0) is that the variance of the residuals is constant (homoskedasticity). The alternative hypothesis (H_1) is that the variance of the residuals depends on the values of the predictor variables (heteroskedasticity). Yes, the logic makes sense: these tests use auxiliary regressions to see if the residual variance can be explained by the predictor variables or their combinations (squares and cross terms in the White test), and if it is explained significantly, that implies heteroskedasticity.
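Continuing the small simulation sketch above (again purely illustrative), the BP logic can be written out by hand: regress the squared residuals on the original regressor(s) and check whether n times the R-squared of that auxiliary regression is large relative to a chi-squared distribution.
# BP test logic by hand, applied to the simulated sim.mod from the sketch above
u2  <- resid(sim.mod)^2                    # squared residuals from the main regression
aux <- lm(u2 ~ x)                          # auxiliary regression on the original regressor
LM  <- nobs(aux) * summary(aux)$r.squared  # LM statistic; chi-squared with 1 df under H0
pchisq(LM, df = 1, lower.tail = FALSE)     # a small p-value is evidence of heteroskedasticity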
2. CODING
1. Choose a dataset, specify your linear regression, and estimate the regression in R. Please keep at least 3 independent variables in your regression. This is your main regression.
library(MASS)        # MASS provides the UScrime dataset
library(tidyverse)
library(psych)
library(stargazer)
# Load the UScrime dataset
data("UScrime")
# Overview of the dataset
str(UScrime)
'data.frame': 47 obs. of 16 variables:
$ M : int 151 143 142 136 141 121 127 131 157 140 ...
$ So : int 1 0 1 0 0 0 1 1 1 0 ...
$ Ed : int 91 113 89 121 121 110 111 109 90 118 ...
$ Po1 : int 58 103 45 149 109 118 82 115 65 71 ...
$ Po2 : int 56 95 44 141 101 115 79 109 62 68 ...
$ LF : int 510 583 533 577 591 547 519 542 553 632 ...
$ M.F : int 950 1012 969 994 985 964 982 969 955 1029 ...
$ Pop : int 33 13 18 157 18 25 4 50 39 7 ...
$ NW : int 301 102 219 80 30 44 139 179 286 15 ...
$ U1 : int 108 96 94 102 91 84 97 79 81 100 ...
$ U2 : int 41 36 33 39 20 29 38 35 28 24 ...
$ GDP : int 394 557 318 673 578 689 620 472 421 526 ...
$ Ineq: int 261 194 250 167 174 126 168 206 239 174 ...
$ Prob: num 0.0846 0.0296 0.0834 0.0158 0.0414 ...
$ Time: num 26.2 25.3 24.3 29.9 21.3 ...
$ y : int 791 1635 578 1969 1234 682 963 1555 856 705 ...
# Subset the dataset with the dependent variable and at least three independent variables
subset_USCrime <- UScrime[, c("y", "M", "Ed", "Po1")]
# Fit a linear regression model
lm.mod <- lm(formula = y ~ ., data = subset_USCrime)
# Display regression summary using stargazer for better formatting
stargazer(lm.mod, type = "text")
# Set up multiple plots in the same window
par(mfrow = c(2, 2))
# Create the residual diagnostic plots
plot(lm.mod)
# Reset to one plot per window
par(mfrow = c(1, 1))
# Summary of the linear regression model
summary(lm.mod)
Call:
lm(formula = y ~ ., data = subset_USCrime)
Residuals:
Min 1Q Median 3Q Max
-559.57 -168.63 -4.43 143.05 565.36
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2210.879 833.612 -2.652 0.01116 *
M 12.300 3.833 3.209 0.00252 **
Ed 4.737 4.242 1.117 0.27031
Po1 10.718 1.569 6.829 2.27e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 260.9 on 43 degrees of freedom
Multiple R-squared: 0.5748, Adjusted R-squared: 0.5451
F-statistic: 19.37 on 3 and 43 DF, p-value: 4.25e-08
Here y, the crime rate, is the dependent variable, and M (percentage of males), Ed (education), and Po1 (police expenditure) are the independent variables.
2. Install the “skedastic” package (installation help) and compute White’s test for your fitted model/main regression. How do you interpret the output, i.e., what is the test statistic, and p-value? What is the interpretation, i.e., do you reject/fail to reject the null, and what is the conclusion, i.e., is there heteroskedasticity or not?
Null Hypothesis (H0): The residuals have constant variance (homoskedasticity). This means that the variance of the error terms is constant across all levels of the independent variables.
Alternative Hypothesis (H1): The residuals do not have constant variance (heteroskedasticity). This implies that the variance of the residuals depends on the values of the independent variables, indicating heteroskedasticity is present.
# Install skedastic package (if not already installed)
# install.packages("skedastic")
library(skedastic)
# Conduct White's test for heteroskedasticity
white_test_result <- white(mainlm = lm.mod, interactions = TRUE)
# View the test results
white_test_result
# A tibble: 1 × 5
statistic p.value parameter method alternative
<dbl> <dbl> <dbl> <chr> <chr>
1 25.7 0.00233 9 White's Test greater
The test statistic is 25.7 with 9 degrees of freedom (the three regressors, their three squares, and their three cross terms), and the p-value is 0.00233. If the p-value were greater than the significance level (typically 0.05), we would fail to reject the null hypothesis of homoskedasticity. Here the p-value is well below 0.05, so we reject the null hypothesis and conclude that there is evidence of heteroskedasticity.
3. Now, run the auxiliary regression and interpret the R-squared value - what does it tell you about heteroskedasticity? To compute the auxiliary regression, you will first have to find the residuals from the main regression and square them; this vector of squared residuals will be your dependent variable. Your independent variables will be:
A. The original regressors in the main regression (you should know what regressors mean as we have talked about this in class).
B. Squares of the original regressors. These are easy to create in R with the as-is (I()) operator.
C. Cross terms of the original regressors with each other. These are also easily created in R with the colon operator (:).
So, you will have to create new variables for parts B and C above.
For example, if X_1 is the only regressor in your main regression, then just create (X_1)^2. However, if you had X_1 and X_2 in your main regression, create (X_1)^2 and (X_2)^2 (square terms) and X_1 * X_2 (cross term).
If you have three Xs, like the question asks, create/compute (X_1)^2, (X_2)^2, (X_3)^2 (square terms) and X_1 * X_2, X_2 * X_3, X_3 * X_1 (cross terms). Of course, you can try 4 or more regressors too for fun.
# Compute residuals from the main regression model
subset_USCrime$residuals <- resid(lm.mod)
# Create the squared residuals variable
subset_USCrime$squared_residuals <- (subset_USCrime$residuals)^2
summary(subset_USCrime$squared_residuals)
Min. 1st Qu. Median Mean 3rd Qu. Max.
17.2 8838.8 28295.2 62255.8 93423.8 319636.9
# Auxiliary regression with original predictors, squared terms, and interaction terms
aux_reg <- lm(squared_residuals ~ M + Ed + Po1 + I(M^2) + I(Ed^2) + I(Po1^2) + M:Ed + Ed:Po1 + Po1:M,
              data = subset_USCrime)
# Summary of the auxiliary regression
summary(aux_reg)
Call:
lm(formula = squared_residuals ~ M + Ed + Po1 + I(M^2) + I(Ed^2) +
I(Po1^2) + M:Ed + Ed:Po1 + Po1:M, data = subset_USCrime)
Residuals:
Min 1Q Median 3Q Max
-120388 -35045 -4579 16662 173160
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.591e+06 4.627e+06 -0.992 0.3275
M 2.598e+04 4.050e+04 0.641 0.5252
Ed 3.910e+04 3.752e+04 1.042 0.3042
Po1 1.903e+04 9.969e+03 1.909 0.0640 .
I(M^2) -7.670e+01 9.254e+01 -0.829 0.4125
I(Ed^2) -2.566e+02 1.285e+02 -1.998 0.0532 .
I(Po1^2) -1.753e+01 1.445e+01 -1.214 0.2326
M:Ed 7.180e+01 1.482e+02 0.484 0.6310
Ed:Po1 4.912e+01 5.748e+01 0.855 0.3983
M:Po1 -1.438e+02 5.983e+01 -2.404 0.0214 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 62780 on 37 degrees of freedom
Multiple R-squared: 0.5458, Adjusted R-squared: 0.4353
F-statistic: 4.94 on 9 and 37 DF, p-value: 0.0002225
The R-squared value from this auxiliary regression represents the proportion of the variance in the squared residuals that can be explained by the predictors.
A low R-squared would indicate that the regressors, their squares, and their cross terms explain little of the variation in the squared residuals, i.e., little evidence of heteroskedasticity; a high R-squared indicates that the residual variance moves systematically with the regressors. In our example the R-squared is about 0.546, so a substantial share of the variance of the squared residuals is explained by the predictors, which again points to heteroskedasticity.
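As an aside (not required by the question), the same auxiliary regression can be specified more compactly using R's formula expansion: (M + Ed + Po1)^2 expands to the main effects plus all pairwise interactions, and only the I() square terms need to be added by hand. A sketch, reusing the objects created above:
# Equivalent, more compact specification of the same auxiliary regression
aux_reg2 <- lm(squared_residuals ~ (M + Ed + Po1)^2 + I(M^2) + I(Ed^2) + I(Po1^2),
               data = subset_USCrime)
summary(aux_reg2)$r.squared   # should equal summary(aux_reg)$r.squared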
4. Again, do not forget to answer what low/high R-squared values suggest in terms of heteroskedasticity intuitively. We can formally test if this R-squared value is statistically large or small, given the sample size, using a chi-squared test.
In the lecture, we said we can use the F test that all coefficients in the auxiliary regression are jointly equal to 0, or alternatively test whether the R-squared of the auxiliary regression is large enough.
1. You will have to compute the test statistic by taking the R-squared of the auxiliary regression above and multiplying it by the number of observations in your auxiliary regression. You should get the same test statistic that the skedastic command shows you.
2. Your critical value will be based on the alpha/significance level (5% is fine, as it corresponds to a 95% confidence interval) and the degrees of freedom (the number of coefficients you estimate in your auxiliary regression minus 1, for the intercept). See the ?qchisq() function in R.
3. If the test statistic is greater than the critical value (more extreme, to the right here), you will reject the null and conclude heteroskedasticity. Equivalently, you can calculate the p-value of your test statistic and compare it with alpha to determine if you have heteroskedasticity or not: a small p-value (say, less than 5%) suggests heteroskedasticity.
# Calculate the chi-square test statistic
chi_sq_stat <- summary(aux_reg)$r.squared * nobs(aux_reg)
aux_reg
Call:
lm(formula = squared_residuals ~ M + Ed + Po1 + I(M^2) + I(Ed^2) +
I(Po1^2) + M:Ed + Ed:Po1 + Po1:M, data = subset_USCrime)
Coefficients:
(Intercept) M Ed Po1 I(M^2) I(Ed^2)
-4.591e+06 2.598e+04 3.910e+04 1.903e+04 -7.670e+01 -2.567e+02
I(Po1^2) M:Ed Ed:Po1 M:Po1
-1.753e+01 7.180e+01 4.912e+01 -1.438e+02
# Degrees of freedom is the number of coefficients estimated minus 1 (for the constant term)
df <- length(coefficients(aux_reg)) - 1
# Calculate the critical value at 5% significance level
critical_value <- qchisq(0.95, df)
# Compare the test statistic to the critical value
chi_sq_stat
[1] 25.653
critical_value
[1] 16.91898
# p-value for the chi-square test
p_value <- pchisq(chi_sq_stat, df, lower.tail = FALSE)
p_value
[1] 0.002327705
The test statistic (25.653) exceeds the critical value (16.919), and the p-value is 0.00233, which is less than 0.05, indicating significant evidence of heteroskedasticity. This matches the statistic and p-value reported by skedastic's white() test above.
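As a quick sanity check (just a sketch using the objects computed above), the hand-computed statistic and p-value can be compared directly with the skedastic output and with the critical value:
# The manual n * R-squared statistic should agree with skedastic's white() output up to rounding
c(manual = chi_sq_stat, skedastic = white_test_result$statistic)
c(manual = p_value, skedastic = white_test_result$p.value)
chi_sq_stat > critical_value   # TRUE -> reject H0 at the 5% level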
# Install and load lmtest package
# install.packages("lmtest")
library(lmtest)
# Conduct Breusch-Pagan test
bp_test_result <- bptest(lm.mod)
# View the test results
bp_test_result
studentized Breusch-Pagan test
data: lm.mod
BP = 19.687, df = 3, p-value = 0.0001971
If the p-value were greater than 0.05, we would fail to reject the null hypothesis of homoskedasticity. Here the p-value is 0.0001971, far below 0.05, so we again reject the null and conclude that heteroskedasticity is present.
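Given that both tests point to heteroskedasticity, one common follow-up (not asked for in this assignment, so the code below is only a hedged sketch) is to re-report the main regression with heteroskedasticity-robust (White/HC) standard errors from the sandwich package; the coefficient estimates are unchanged, only the standard errors and t-statistics are adjusted.
# Sketch: heteroskedasticity-robust standard errors for the main regression
# (assumes the sandwich package is installed; lmtest is already loaded above)
library(sandwich)
coeftest(lm.mod, vcov = vcovHC(lm.mod, type = "HC1"))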