I am choosing to explore a linear model from a project I have been working on. The data come from census tracts within the City of Pittsburgh. The goal is to explore which variables explain income.
\[ \log(Income_c) = \beta_0 + \beta_1 Black_c + \beta_2 OwnerOccupied_c + \beta_3 Unemployment_c + \beta_4 GRAPI_c + \beta_5 Disability_c + \epsilon_c \]
where c indexes census tracts.
All variables represent population proportions for census tracts:
Black = % of tract that is Black or African American,
Owner Occupied = % of tract that owns their home,
Unemployment = unemployment rate of the tract,
GRAPI = % of tract whose gross rent is 35% or more of household income,
Disability = % of tract population with a disability.
library(stargazer)  # for the text regression table below

model <- lm(log(income) ~ race_b
            + own
            + unemp
            + grapi
            + disab,
            data = data)
stargazer(model, type = 'text')
##
## ===============================================
## Dependent variable:
## ---------------------------
## log(income)
## -----------------------------------------------
## race_b -0.004***
## (0.001)
##
## own 0.008***
## (0.001)
##
## unemp -0.017***
## (0.006)
##
## grapi -0.008***
## (0.002)
##
## disab -0.021***
## (0.005)
##
## Constant 11.221***
## (0.105)
##
## -----------------------------------------------
## Observations 94
## R2 0.683
## Adjusted R2 0.665
## Residual Std. Error 0.257 (df = 88)
## F Statistic 37.887*** (df = 5; 88)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
From a simple eyeball test, the residual plots don't look bad in terms of heteroskedasticity. However, formal tests give a more definitive answer.
par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2x2 grid
plot(model)
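As one more eyeball check (an addition of ours, not part of the default diagnostics), plotting absolute residuals against fitted values makes any variance trend easier to spot:
plot(fitted(model), abs(resid(model)),  # a visible trend would hint at heteroskedasticity
     xlab = "Fitted values", ylab = "|Residuals|")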
The lmtest package implements the Breusch-Pagan test for heteroskedasticity in linear regression models, and the skedastic package implements White's test. White's test is a special case of the method of Breusch and Pagan (1979). Under the null hypothesis of homoskedasticity, the test statistic is asymptotically chi-squared, with degrees of freedom equal to the number of regressors in the auxiliary regression (excluding the constant).
library(skedastic)  # for white()

skedastic_package_white <- white(mainlm = model,
                                 interactions = TRUE)
# View the test results
skedastic_package_white
## # A tibble: 1 × 5
## statistic p.value parameter method alternative
## <dbl> <dbl> <dbl> <chr> <chr>
## 1 19.9 0.467 20 White's Test greater
The test statistic has a value of 19.86 and follows a chi-squared distribution with 20 degrees of freedom (the 5 regressors, their 5 squared terms, and their 10 cross terms, excluding the constant). The p-value is approximately 0.47, so we fail to reject the null of homoskedasticity.
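The same conclusion follows from a critical-value comparison, assuming the conventional 5% level:
qchisq(p = 0.95, df = 20)  # critical value, roughly 31.41; 19.86 < 31.41, so fail to reject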
Now we model the squared residuals as a function of the predictors, their squared terms, and their cross terms. First, we extract the residuals from our model and put them in our data frame. In regression-based heteroskedasticity tests, the dependent variable is the squared residual, so we must construct it manually. Note that the residuals come from the model you want to test for heteroskedasticity, and the resid() function creates them easily. The response variable is no longer income.
data$residuals <- resid(object = model)
summary(data$residuals) # mean should be zero
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.578430 -0.159852 0.002979 0.000000 0.147901 0.610116
data$squared_residuals <- (data$residuals)^2
summary(data$squared_residuals) # should not have any negative values
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000069 0.0055144 0.0231314 0.0619545 0.0689465 0.3722417
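As a quick sanity check (our addition), the mean squared residual ties back to the residual standard error reported in the stargazer table:
# sum(e^2) / df.residual estimates the residual variance, so this
# should recover the residual std. error from the table (~0.257)
sqrt(mean(data$squared_residuals) * nobs(model) / df.residual(model))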
The independent variables will be the original predictors, their squared terms, and their cross terms.
The original predictors in our case were race_b, own, unemp, grapi, and disab.
We can create their squared terms easily with the I() function, without creating new variables in our data frame. In a model formula, operators such as ^, *, and : have special meanings (crossing and interaction), so I(), the "as-is" operator, forces R to evaluate the enclosed expression arithmetically instead. For example, suppose you have a variable x that you want to square. In a formula, x^2 means "x crossed with itself," which reduces to just x; I(x^2) tells R to treat x^2 as a literal arithmetic expression, entering the square of x as a regressor.
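A toy illustration with hypothetical variables x and y (not from the census data):
set.seed(1)
x <- rnorm(50)
y <- x + x^2 + rnorm(50)
coef(lm(y ~ x^2))     # formula ^: x crossed with itself reduces to x alone
coef(lm(y ~ I(x^2)))  # I(): the literal square enters as a regressor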
Cross terms are just interaction terms and can be created with the : operator.
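As a sketch (the name white_aux_compact is ours), the whole right-hand side can also be written compactly: in a formula, (a + b + ...)^2 expands to all main effects plus all pairwise interactions, so only the squares need I() terms.
white_aux_compact <- lm(squared_residuals ~
                          (race_b + own + unemp + grapi + disab)^2 +
                          I(race_b^2) + I(own^2) + I(unemp^2) +
                          I(grapi^2) + I(disab^2),
                        data = data)
Below, the same regression is spelled out term by term.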
white_auxillary_reg <- lm(formula = squared_residuals ~
race_b +
own +
unemp +
grapi +
disab +
I(race_b^2) +
I(own^2) +
I(unemp^2) +
I(grapi^2) +
I(disab^2) +
race_b:own +
race_b:unemp +
race_b:grapi +
race_b:disab +
own:unemp +
own:grapi +
own:disab +
unemp:grapi +
unemp:disab +
grapi:disab,
data = data)
white_auxillary_reg_summary <- summary(white_auxillary_reg)
white_auxillary_reg_summary
##
## Call:
## lm(formula = squared_residuals ~ race_b + own + unemp + grapi +
## disab + I(race_b^2) + I(own^2) + I(unemp^2) + I(grapi^2) +
## I(disab^2) + race_b:own + race_b:unemp + race_b:grapi + race_b:disab +
## own:unemp + own:grapi + own:disab + unemp:grapi + unemp:disab +
## grapi:disab, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.13778 -0.04648 -0.02000 0.02435 0.28809
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.899e-01 1.180e-01 1.609 0.112
## race_b -5.186e-04 2.808e-03 -0.185 0.854
## own -1.421e-03 2.781e-03 -0.511 0.611
## unemp 2.182e-02 1.237e-02 1.764 0.082 .
## grapi -3.774e-03 3.856e-03 -0.979 0.331
## disab -9.824e-03 9.181e-03 -1.070 0.288
## I(race_b^2) -2.369e-05 2.905e-05 -0.815 0.417
## I(own^2) 6.720e-06 2.389e-05 0.281 0.779
## I(unemp^2) -5.843e-04 6.007e-04 -0.973 0.334
## I(grapi^2) 3.514e-05 3.810e-05 0.922 0.359
## I(disab^2) 4.035e-04 3.495e-04 1.155 0.252
## race_b:own -1.969e-06 3.233e-05 -0.061 0.952
## race_b:unemp -1.268e-05 1.274e-04 -0.100 0.921
## race_b:grapi 2.212e-05 4.235e-05 0.522 0.603
## race_b:disab 1.043e-04 1.718e-04 0.607 0.546
## own:unemp -1.094e-04 1.507e-04 -0.726 0.470
## own:grapi 4.171e-05 3.352e-05 1.244 0.217
## own:disab -2.095e-05 1.169e-04 -0.179 0.858
## unemp:grapi 4.108e-05 1.629e-04 0.252 0.802
## unemp:disab -3.673e-04 5.641e-04 -0.651 0.517
## grapi:disab -1.370e-04 1.368e-04 -1.001 0.320
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09423 on 73 degrees of freedom
## Multiple R-squared: 0.2113, Adjusted R-squared: -0.004803
## F-statistic: 0.9778 on 20 and 73 DF, p-value: 0.4978
The low R-squared (0.2113) suggests the predictors explain little of the variation in the squared residuals, i.e., no heteroskedasticity. More formally, we calculate the chi-squared test statistic and compare it with the chi-squared critical value (or compute a p-value) to determine whether the R-squared is significantly large. The test statistic is \[\chi^2 = n \cdot R^2_{aux}\] where n is the total number of observations and \(R^2_{aux}\) is the R-squared of the auxiliary regression that used the squared residuals as the response. Here, \(\chi^2 = 94 \times 0.2113 \approx 19.86\).
chisq_p_value <- pchisq(q = white_auxillary_reg_summary$r.squared * nobs(white_auxillary_reg),
                        df = 20,  # regressors in the auxiliary regression, excluding the constant
                        lower.tail = FALSE)
chisq_p_value
## [1] 0.4666767
We get the exact same p-value from the package and by hand.
We can also replicate the test statistic. Recall:
skedastic_package_white
## # A tibble: 1 × 5
## statistic p.value parameter method alternative
## <dbl> <dbl> <dbl> <chr> <chr>
## 1 19.9 0.467 20 White's Test greater
Now, replicating the test statistic by hand:
white_auxillary_reg_summary$r.squared * nobs(white_auxillary_reg)
## [1] 19.86065
Beautiful!
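For reuse, the by-hand steps can be bundled into a small helper (white_by_hand is a hypothetical function of ours, not from any package); a minimal sketch:
white_by_hand <- function(aux_fit) {
  r2   <- summary(aux_fit)$r.squared
  stat <- nobs(aux_fit) * r2          # n * R^2
  df   <- length(coef(aux_fit)) - 1   # regressors excluding the intercept
  c(statistic = stat, df = df,
    p.value = pchisq(stat, df = df, lower.tail = FALSE))
}
white_by_hand(white_auxillary_reg)  # should reproduce ~19.86 and p ~ 0.467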
The Breusch-Pagan test fits a linear regression model to the squared residuals of the main model (by default using the same explanatory variables as the main regression) and rejects if too much of their variance is explained by those variables. Under \(H_0\), the Breusch-Pagan test statistic follows a chi-squared distribution with degrees of freedom equal to the number of regressors in the model (excluding the constant).
library(lmtest)  # for bptest()

lmtest_package_bp <- bptest(formula = model,  # fitted "lm" object, or run the regression
                            studentize = TRUE)  # the default, so this can be omitted
# View the test results
lmtest_package_bp
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 3.7881, df = 5, p-value = 0.5803
skedastic_package_bp <- skedastic::breusch_pagan(model)
skedastic_package_bp
## # A tibble: 1 × 5
## statistic p.value parameter method alternative
## <dbl> <dbl> <dbl> <chr> <chr>
## 1 3.79 0.580 5 Koenker (studentised) greater
lmtest::bptest gives the exact same output as skedastic::breusch_pagan, as expected. The test statistic has a value of 3.7881, follows a chi-squared distribution with 5 degrees of freedom (the number of regressors in the model, excluding the constant), and has a p-value of approximately 0.5803, so we fail to reject the null of homoskedasticity.
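The studentized (Koenker) statistic can also be replicated by hand in the same spirit as the White test above: regress the squared residuals on the original regressors only, with no squares or cross terms, and take n times the R-squared. A sketch:
bp_aux <- lm(squared_residuals ~ race_b + own + unemp + grapi + disab,
             data = data)
bp_stat <- nobs(bp_aux) * summary(bp_aux)$r.squared
bp_stat                                      # should be ~3.79
pchisq(bp_stat, df = 5, lower.tail = FALSE)  # should be ~0.58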
From Professor Sharma's RPubs and explanation:
Both the Breusch-Pagan (BP) test and the White test are used to test for heteroskedasticity in a linear regression model. However, there is no clear consensus on which test is better; the choice depends on factors such as the sample size, the distribution of the errors, and the specific objectives of the analysis.
In general, the White test is considered more robust than the BP test when the sample size is large or when the errors are not normally distributed, whereas the BP test may perform better when the errors are normally distributed and the sample size is small.
Ultimately, the choice of test should be based on the specific characteristics of the data and the research question being investigated.