In today’s lecture, we are going to deal with three different tests regarding one of the most common problems in linear models, namely heteroscedasticity. Outline of the lecture is presented below.
The first thing we need to do is to get familiar with the lmtest package hence we need to install lmtest package.
install.packages("lmtest", dependencies = T, repos='http://cran.us.r-project.org' )
##
## The downloaded binary packages are in
## /var/folders/_8/f2m9fdt162ng6nj45842hhsc0000gn/T//RtmpL7iqWw/downloaded_packages
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(AER)
## Loading required package: car
## Loading required package: carData
## Loading required package: sandwich
## Loading required package: survival
This package consists of 16 different diagnostic tests and some data-sets. Among these tests some serve to the purpose of detection of heteroscedasticity and auto-correlation, and others relate to normality, functional forms and linearity. The URL below provides a reference manual for lmtest package.
GQ test is one of the most reliable methods in detecting heteroscedasticity and it orders observations according to their level of dependent variable. After deleting some specified number of central observations, GQ tests returns to a p-value regarding heteroscedasticty. If we need to understand the arguments of GQ test, we can check out the \(19^{th}\) page of reference manual.
Let us continue with Journals data to see the interpretations of the GQ test. Journals dataset contains some academic journals and regarding information such as citation number, price, subscription number, journal title etc. The command we use for GQ test is gqtest() since most R packages are created with a pretty straight-forward approach. First thing we need to specify in the gqtest() command is the formula or linear model object we have defined. First thing to do here is to define a regression. Kleiber and Zeileis defines a variable called “citeprice” which is the ratio of price and citation number and he packs this variable together with subscription number and price(2008, p. 101). They also define the age of the journal and include it in the dataframe. Described process can be done with the following chunk:
data("Journals")
journals <- Journals[, c("subs", "price")]
#journals with lowercase j is our new data frame that we have disected from the Journals data.#
journals$age <- 2019 - Journals$foundingyear
journals$citeprice <- Journals$price/Journals$citations
Let’s continue with model construction and GQ test.
jour_lm <- lm(log(subs)~log(citeprice), data = journals)
summary(jour_lm)
##
## Call:
## lm(formula = log(subs) ~ log(citeprice), data = journals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.72478 -0.53609 0.03721 0.46619 1.84808
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.76621 0.05591 85.25 <2e-16 ***
## log(citeprice) -0.53305 0.03561 -14.97 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7497 on 178 degrees of freedom
## Multiple R-squared: 0.5573, Adjusted R-squared: 0.5548
## F-statistic: 224 on 1 and 178 DF, p-value: < 2.2e-16
Since we have constructed our linear model as Kleiber and Zeileis did we can continue with different test regarding heteroscedasticty and GQ test is the first one of these tests.
gqtest(jour_lm, order.by = ~citeprice, data = journals, fraction = 36)
##
## Goldfeld-Quandt test
##
## data: jour_lm
## GQ = 2.0341, df1 = 70, df2 = 70, p-value = 0.001697
## alternative hypothesis: variance increases from segment 1 to 2
First argument of the gqtest() command is the formula or linear model that we want to test for heteroscedasticity. After specifying formula or lm object we can continue with the order.by argument which denotes the independent variable for which we order our observations. Be careful with order.by argument because it may produce errors if the variable is not attached to the R or dataset is not specified. To abolish this problem we specified the dataset with data argument. The final argument I used is the fraction which sets the number of central observations to be omitted. I generally use 20% of the observations to increase the power of the test but different standards exist regarding this number.
As we can easily observe GQ value returns to the test statistics and p-value returns to the corresponding value with f score tables. If p-value is below 0.05 we reject the null hypothesis and null hypothesis is homoscedasticity in our case. P-value is much lower than 0.05 hence we can reject the null and infer that heteroscedasticity exists in our model.
Breusch-Pagan test is another method we can use in order to investigate the existence of heteroscedasticity. As you have learned from your lectures, BP test firstly constructs the model then obtains the residuals. After obtaining the residuals it creates an auxiliary regression as following; \(\epsilon_{i}^2/\tilde{\sigma^2} = \delta_0 + \delta_1X_{1i}+\delta_2X_{2i}+\delta_3X_{3i}+....+\delta_kX_{ki}\) for k number of regressors. After this step it retains the Estimated Sum of Squares value and calculates \(\frac{ESS}{2}\) value as test statistics. Final step of this test is to compare our results with \(\chi^2\) distribution. Most of us can manually do this calculation with a little bit of research and our existing R knowledge though the package lmtest has a function that returns to p-values of BP test automatically. Now we will explore this function.
bptest(jour_lm, varformula = ~citeprice, data = journals, studentize = F)
##
## Breusch-Pagan test
##
## data: jour_lm
## BP = 11.558, df = 1, p-value = 0.0006745
The only argument that needs our explanation here is the varformula argument. We normally include the variables that are possible suspects for the existence of heteroscedasticty. If we do not specify this argument R uses all explanatory variables in the auxiliary regression.
Test statistic follows a \(\chi^2\) distribution in this test and we can reject the null hypothesis according to p-value. If we observe a p-value less than 0.05, BP test indicates the existence of heteroscedasticity problem. In our example, BP test gives the same result with the GQ test.
BP test is useful for detecting heteroscedasticity though it fails to detect heteroscedasticty if normality assumption is loosened. We use the White Test because of this error. White Test starts with setting up the regression and obtaining the residuals \(\hat{u}_i\) then sets up an auxiliary regression as the following; \(\hat{u}_i= \alpha_1+\alpha_2X_{2i}+\alpha_3X_{3i}+\alpha_4X_{2i}^2+\alpha_5X_{3i}^2+\alpha_6X_{2i}X_{3i}\) for a two regressor case. The it calculates the test statistic \(nR^2\) and checks whether it follows a \(\chi^2_{df}\) distribution or not. We use bptest() command to conduct the White Test though subtle distinctions play a crucial role here.
bptest(jour_lm, varformula = ~citeprice+I(citeprice^2), data = journals, studentize = F)
##
## Breusch-Pagan test
##
## data: jour_lm
## BP = 14.084, df = 2, p-value = 0.0008743
The only difference is in our code is I(citeprice^2) term that includes polynomial squared term to regular BP test. As I have already explained, difference between BP test and White Test is the existence of multiplicative terms and squared terms in the auxiliary regression.
We can set up a multiple regression with journals data by including age as another regressor. This new model would make the application of White Test much more clearer.
jour_lm2 <- lm(subs~citeprice+age, data = journals)
summary(jour_lm2)
##
## Call:
## lm(formula = subs ~ citeprice + age, data = journals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -499.64 -98.37 -41.40 61.89 717.42
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 118.5254 35.3550 3.352 0.00098 ***
## citeprice -19.1390 4.0102 -4.773 3.79e-06 ***
## age 2.4401 0.5407 4.513 1.16e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 176.8 on 177 degrees of freedom
## Multiple R-squared: 0.261, Adjusted R-squared: 0.2527
## F-statistic: 31.26 on 2 and 177 DF, p-value: 2.358e-12
bptest(jour_lm2, varformula = ~citeprice+age, data = journals, studentize = F)
##
## Breusch-Pagan test
##
## data: jour_lm2
## BP = 75.91, df = 2, p-value < 2.2e-16
Last line of code that I wrote gives us the implications of regular BP test with two regressors. As you can easily observe, the only difference here is the inclusion of the age variable to the test command.
bptest(jour_lm2, varformula = ~citeprice+age+I(citeprice^2)+I(age^2)+I(age*citeprice), data = journals)
##
## studentized Breusch-Pagan test
##
## data: jour_lm2
## BP = 36.024, df = 5, p-value = 9.394e-07
This second version of the code gives us the result for White Test since it includes both squared and multiplication terms. If we want to conduct White Test manually we can use the following code:
aux_jour_lm2 <- lm(I(residuals(jour_lm2)^2)~citeprice+age+I(citeprice^2)+I(age^2)+I(age*citeprice), data = journals)
sum_aux_jour_lm2 <- summary(aux_jour_lm2)
teststat_jour_lm2 <- sum_aux_jour_lm2$r.squared*nrow(journals)
pchisq(teststat_jour_lm2, df=5, lower.tail = F)
## [1] 9.393738e-07
In our following lectures, we are going to see the methods of dealing with heteroscedasticity such as Weighted Least Squares Method and White’s HAC robust standard errors.