library(pacman)
p_load(tidyverse, lfe, robustbase, ggthemes)
df <- education %>% #the education expenditure data that ships with the robustbase package
rename(residents = X1,
per_capita_income = X2,
young_residents = X3,
per_capita_exp = Y,
state = State)
We can make a plot to examine the relationship between our x variable and the errors. To do this, we need to run a regression, recover the residuals, then make a scatterplot. The resid()
function will get the residuals from our regression. We plot the x variable on the x-axis and the residuals on the y-axis.
reg <- lm(per_capita_exp ~ per_capita_income, data = df)
errs <- resid(reg)
ggplot(data = df, aes(x = per_capita_income, y = errs)) +
geom_point(color = "#2D708EFF") +
theme_pander() + geom_hline(yintercept = 0) +
labs(x = "Per Capita Income", y = "Residuals")
What are we looking for here? If our errors are heteroskedastic, then the variance of our errors changes with the x variable. There seems to be some visual evidence of this happening here - at low values of per capita income the residuals are tightly packed around zero, but as we move to larger values of per capita income the residuals are more spread out. Still, eyeballing a graph is not formal evidence of heteroskedasticity, so we need to know how to test for it properly. We will cover the three tests you learned in lecture: Goldfeld-Quandt, Breusch-Pagan, and White.
We want to be sure that our variance isn't changing with our predictors. One way is to compare the errors at different values of a predictor and see how they change. The Goldfeld-Quandt test does exactly that. We pick a fraction of our sample and compare that fraction at the low end of x to the same-sized fraction at the high end. We then take the ratio of the two sums of squared errors (SSE) and see how far it is from 1. If the ratio is close to 1, the errors are roughly the same size at small and large values of x. If the ratio is far from 1, the error variance differs across values of x.
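In symbols, after sorting on the suspect regressor and fitting separate regressions to the low-x and high-x groups, the statistic is \(GQ = SSE_2 / SSE_1\), which under the null of homoskedasticity (and normal errors) follows an \(F\) distribution with the residual degrees of freedom of the two subsample regressions.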
If we wanted to do this by hand, we'd need to follow six steps: sort the data by the suspect variable, split off the low and high groups, run a regression on each group, recover the residuals, compute each group's SSE, and compare their ratio to an F distribution. Note that Goldfeld-Quandt only lets you look at one variable at a time, so let's look at per_capita_income and compare the first 1/4 and the last 1/4 of the sample.
df <- arrange(df, per_capita_income)
head(df$per_capita_income)
## [1] 3448 3680 3724 3764 3817 3825
The nrow() function tells us the total number of observations in the data frame.
n_GQ <- as.integer(nrow(df) * 1 / 4)
n_GQ #as.integer() rounds this down to a whole number of observations - 1/4 of our full sample!
## [1] 12
#on the first 1/4
lm_g1 <- lm(data = head(df, n_GQ), per_capita_exp ~ per_capita_income + residents + young_residents)
#on the last 1/4
lm_g2 <- lm(data = tail(df, n_GQ), per_capita_exp ~ per_capita_income + residents + young_residents)
To get each group's SSE, we use the sum() function, as well as the resid() function.
e_g1 <- resid(lm_g1)
e_g2 <- resid(lm_g2)
sse_g1 <- sum(e_g1 ^ 2) #now, to get SSE, we need to square the residuals, and then sum them.
sse_g2 <- sum(e_g2 ^ 2)
sse_g1
## [1] 4431.2
sse_g2
## [1] 17413.74
stat_GQ <- (sse_g2 / sse_g1)
stat_GQ
## [1] 3.929802
#n-k degrees of freedom, with k = no. of parameters estimated
p_GQ <- pf(q = stat_GQ, df1 = n_GQ-3, df2 = n_GQ-3, lower.tail = F) #pf gives probability from an f-dist.
p_GQ
## [1] 0.02691828
Conclusion: can we reject \(H_0\) (homoskedastic errors)? Yes, at the 5% level. So we have some evidence of heteroskedasticity. Let's see what our other tests conclude.
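As a sanity check on the by-hand calculation, the lmtest package provides a canned gqtest() function. This is an aside rather than one of the six steps, and it assumes lmtest is installed; its degrees-of-freedom bookkeeping may differ slightly from ours, so the p-value need not match exactly.
p_load(lmtest)
#order.by sorts the sample by per_capita_income; fraction = 0.5 asks gqtest to drop
#the middle half so that only the first and last quarters are compared, as we did above
gqtest(per_capita_exp ~ per_capita_income + residents + young_residents,
       order.by = ~ per_capita_income, fraction = 0.5, data = df)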
If you want to see whether your variables are correlated with your errors, why not just run a regression? This is what the Breusch-Pagan test does. We run a regression, recover the residuals, then regress the squared residuals on the x variables.
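The test statistic is \(LM = n \cdot R^2\), where \(R^2\) comes from that auxiliary regression of the squared residuals; under the null of homoskedasticity it is approximately \(\chi^2_k\), with \(k\) the number of regressors in the auxiliary regression (here \(k = 3\)).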
lm_BP <- lm(data = df, per_capita_exp ~ per_capita_income + residents + young_residents)
e_BP <- resid(lm_BP)
lm_BrPa <- lm(data = df, I(e_BP ^ 2) ~ per_capita_income + residents + young_residents)
We then pull the r.squared object from within the lm summary object. We need to know how much variation in the squared errors is explained by the x variables.
r2_BP <- summary(lm_BrPa)$r.squared
LM_BP <- nrow(df) * r2_BP
pchisq(q = LM_BP, df = 3, lower.tail = F) #pchisq gives the upper-tail probability from a chi-squared distribution, which is our BP test p-value
## [1] 0.001376283
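For reference, the lmtest package also provides bptest(), which runs the Breusch-Pagan test directly on the fitted model. This is an aside assuming lmtest is installed; its default (studentized) statistic is the \(n \cdot R^2\) form, so it should line up with the value we just computed by hand.
p_load(lmtest)
#bptest() runs the auxiliary regression of the squared residuals on the model's
#regressors internally and reports the LM statistic with its chi-squared p-value
bptest(lm_BP)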
The White test is very similar to the Breusch-Pagan test, with a few modifications: when regressing the squared errors on the x variables, we also add interaction terms and squared terms. The first two steps of the White test are therefore identical to those in the Breusch-Pagan test.
lm_BP <- lm(data = df, per_capita_exp ~ per_capita_income + residents + young_residents)
e_WhiteTest <- resid(lm_BP)
lm_White <- lm(data = df, I(e_WhiteTest ^ 2) ~ per_capita_income + residents + young_residents +
per_capita_income:residents + per_capita_income:young_residents + residents:young_residents +
I(per_capita_income ^ 2) + I(residents ^ 2) + I(young_residents ^ 2))
r2_White <- summary(lm_White)$r.squared
LM_White <- nrow(df) * r2_White
pchisq(q = LM_White, df = 9, lower.tail = F)
## [1] 0.006961455
The null hypothesis is: \(H_0: \beta_1 = ... = \beta_9 = 0\) and the alternative hypothesis is \(H_a: \text{at least one } \beta_i \neq 0\). We reject the null at the 5% level.
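As an aside, the same auxiliary regression can be handed to bptest() through its varformula argument, which saves computing \(n \cdot R^2\) and the p-value by hand. This assumes the lmtest package is installed; the regressor list below simply mirrors the one we used above.
p_load(lmtest)
bptest(per_capita_exp ~ per_capita_income + residents + young_residents,
       varformula = ~ per_capita_income + residents + young_residents +
         per_capita_income:residents + per_capita_income:young_residents + residents:young_residents +
         I(per_capita_income^2) + I(residents^2) + I(young_residents^2),
       data = df)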
Now that we have tested for heteroskedasticity and have some evidence that we have heteroskedastic errors, we need to know how to deal with it. We need to adjust our standard errors so that we have heteroskedasticity-robust standard errors. To do this, we use felm() from the lfe package instead of lm(). Then, when we summarize, we can add the robust = TRUE option to get heteroskedasticity-robust standard errors.
reg_standard <- lm(data = df, per_capita_exp ~ per_capita_income + residents + young_residents)
reg_consistent <- felm(data = df, per_capita_exp ~ per_capita_income + residents + young_residents)
summary(reg_standard)
##
## Call:
## lm(formula = per_capita_exp ~ per_capita_income + residents +
## young_residents, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -84.878 -26.878 -3.827 22.246 99.243
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.566e+02 1.232e+02 -4.518 4.34e-05 ***
## per_capita_income 7.239e-02 1.160e-02 6.239 1.27e-07 ***
## residents -4.269e-03 5.139e-02 -0.083 0.934
## young_residents 1.552e+00 3.147e-01 4.932 1.10e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40.47 on 46 degrees of freedom
## Multiple R-squared: 0.5913, Adjusted R-squared: 0.5647
## F-statistic: 22.19 on 3 and 46 DF, p-value: 4.945e-09
summary(reg_consistent, robust = TRUE)
##
## Call:
## felm(formula = per_capita_exp ~ per_capita_income + residents + young_residents, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -84.878 -26.878 -3.827 22.246 99.243
##
## Coefficients:
## Estimate Robust s.e t value Pr(>|t|)
## (Intercept) -5.566e+02 1.800e+02 -3.092 0.00338 **
## per_capita_income 7.239e-02 1.638e-02 4.418 6.01e-05 ***
## residents -4.269e-03 5.830e-02 -0.073 0.94194
## young_residents 1.552e+00 4.423e-01 3.509 0.00102 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40.47 on 46 degrees of freedom
## Multiple R-squared(full model): 0.5913 Adjusted R-squared: 0.5647
## Multiple R-squared(proj model): 0.5913 Adjusted R-squared: 0.5647
## F-statistic(full model, *iid*):22.19 on 3 and 46 DF, p-value: 4.945e-09
## F-statistic(proj model): 16.96 on 3 and 46 DF, p-value: 1.472e-07
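If you prefer to stay with lm(), an alternative (assuming the sandwich and lmtest packages are installed) is coeftest() with a heteroskedasticity-consistent covariance matrix from vcovHC(). Depending on the finite-sample correction used, the numbers may differ slightly from felm()'s robust option; HC1 below is a common choice.
p_load(sandwich, lmtest)
#re-test the lm() coefficients using a heteroskedasticity-consistent covariance
#matrix; type = "HC1" applies the usual small-sample degrees-of-freedom correction
coeftest(reg_standard, vcov = vcovHC(reg_standard, type = "HC1"))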