Today, we will be working through some heteroskedasticity examples. In order to do that though, we need two things: some data and our handy dandy pacman package.
library(pacman)
#install.packages("Ecdat")
p_load(Ecdat, ggplot2, sandwich, lmtest, dplyr)
The Ecdat package contains the ‘Caschool’ dataset, which features all sorts of fun heteroskedasticity. How do we inspect a dataset? We can ask R!
?Caschool
We’re going to start with some exploration on our own.
Question1: How many observations are there? What is the timespan the data covers? Answer1: 420. 1998-1999.
Question2: What kind of dataset is it? Answer2: dataframe
Question3: How many columns (variables) are there in the dataset? Answer3: 17
Question4: What variables should we choose as the left-hand side dependent variables here? (Hint: We are trying to understand what impacts educational outcomes here!) Answer4: Test scores seem like a good choice.
Question5: What are the variable names for income, student-to-teacher ratio, and enrollment? Answer5: avginc, str, enrltot
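If you'd rather answer these with code than by reading the help file, a few quick base R calls do the trick:
nrow(Caschool)   # number of observations: 420
class(Caschool)  # it's a data.frame
ncol(Caschool)   # number of variables: 17
names(Caschool)  # variable names, including avginc, str, and enrltot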
Let’s take a peek at the first few rows of the dataset. How do we do this?
head(Caschool)
## distcod county district grspan enrltot teachers
## 1 75119 Alameda Sunol Glen Unified KK-08 195 10.90
## 2 61499 Butte Manzanita Elementary KK-08 240 11.15
## 3 61549 Butte Thermalito Union Elementary KK-08 1550 82.90
## 4 61457 Butte Golden Feather Union Elementary KK-08 243 14.00
## 5 61523 Butte Palermo Union Elementary KK-08 1335 71.50
## 6 62042 Fresno Burrel Union Elementary KK-08 137 6.40
## calwpct mealpct computer testscr compstu expnstu str avginc
## 1 0.5102 2.0408 67 690.80 0.3435898 6384.911 17.88991 22.690001
## 2 15.4167 47.9167 101 661.20 0.4208333 5099.381 21.52466 9.824000
## 3 55.0323 76.3226 169 643.60 0.1090323 5501.955 18.69723 8.978000
## 4 36.4754 77.0492 85 647.70 0.3497942 7101.831 17.35714 8.978000
## 5 33.1086 78.4270 171 640.85 0.1280899 5235.988 18.67133 9.080333
## 6 12.3188 86.9565 25 605.55 0.1824818 5580.147 21.40625 10.415000
## elpct readscr mathscr
## 1 0.000000 691.6 690.0
## 2 4.583333 660.5 661.9
## 3 30.000002 636.3 650.9
## 4 0.000000 651.9 643.5
## 5 13.857677 641.8 639.9
## 6 12.408759 605.7 605.4
I am going to save this dataset as a new object with a name that is easier to type. I am also curious about summary statistics:
school <- Caschool
summary(school)
## distcod county district grspan
## Min. :61382 Sonoma : 29 Lakeside Union Elementary: 3 KK-06: 61
## 1st Qu.:64308 Kern : 27 Mountain View Elementary : 3 KK-08:359
## Median :67760 Los Angeles: 27 Jefferson Elementary : 2
## Mean :67473 Tulare : 24 Liberty Elementary : 2
## 3rd Qu.:70419 San Diego : 21 Ocean View Elementary : 2
## Max. :75440 Santa Clara: 20 Pacific Union Elementary : 2
## (Other) :272 (Other) :406
## enrltot teachers calwpct mealpct
## Min. : 81.0 Min. : 4.85 Min. : 0.000 Min. : 0.00
## 1st Qu.: 379.0 1st Qu.: 19.66 1st Qu.: 4.395 1st Qu.: 23.28
## Median : 950.5 Median : 48.56 Median :10.520 Median : 41.75
## Mean : 2628.8 Mean : 129.07 Mean :13.246 Mean : 44.71
## 3rd Qu.: 3008.0 3rd Qu.: 146.35 3rd Qu.:18.981 3rd Qu.: 66.86
## Max. :27176.0 Max. :1429.00 Max. :78.994 Max. :100.00
##
## computer testscr compstu expnstu
## Min. : 0.0 Min. :605.5 Min. :0.00000 Min. :3926
## 1st Qu.: 46.0 1st Qu.:640.0 1st Qu.:0.09377 1st Qu.:4906
## Median : 117.5 Median :654.5 Median :0.12546 Median :5215
## Mean : 303.4 Mean :654.2 Mean :0.13593 Mean :5312
## 3rd Qu.: 375.2 3rd Qu.:666.7 3rd Qu.:0.16447 3rd Qu.:5601
## Max. :3324.0 Max. :706.8 Max. :0.42083 Max. :7712
##
## str avginc elpct readscr
## Min. :14.00 Min. : 5.335 Min. : 0.000 Min. :604.5
## 1st Qu.:18.58 1st Qu.:10.639 1st Qu.: 1.941 1st Qu.:640.4
## Median :19.72 Median :13.728 Median : 8.778 Median :655.8
## Mean :19.64 Mean :15.317 Mean :15.768 Mean :655.0
## 3rd Qu.:20.87 3rd Qu.:17.629 3rd Qu.:22.970 3rd Qu.:668.7
## Max. :25.80 Max. :55.328 Max. :85.540 Max. :704.0
##
## mathscr
## Min. :605.4
## 1st Qu.:639.4
## Median :652.5
## Mean :653.3
## 3rd Qu.:665.9
## Max. :709.5
##
Question6: What is the range of district average income?
range(school$avginc)
## [1] 5.335 55.328
Question7: What is the range of the student-to-teacher ratio?
range(school$str)
## [1] 14.0 25.8
Let’s figure out the total score (reading + math) for each district and then find its range. How should we do this? We’ll use some tools from our friendly neighborhood tidyverse to solve this.
p_load(tidyverse)
school <- mutate(school, scrtot = readscr + mathscr)
range(school$scrtot)
## [1] 1211.1 1413.5
Let’s run a linear regression of total score on income, student-to-teacher ratio, and enrollment, and explore a little.
ols1 <- lm(data = school, scrtot ~ avginc + str + enrltot)
summary(ols1)
##
## Call:
## lm(formula = scrtot ~ avginc + str + enrltot, data = school)
##
## Residuals:
## Min 1Q Median 3Q Max
## -82.566 -18.354 2.205 17.654 61.413
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.259e+03 1.501e+01 83.825 < 2e-16 ***
## avginc 3.772e+00 1.817e-01 20.759 < 2e-16 ***
## str -1.830e-01 7.268e-01 -0.252 0.801
## enrltot -1.671e-03 3.419e-04 -4.887 1.47e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.99 on 416 degrees of freedom
## Multiple R-squared: 0.538, Adjusted R-squared: 0.5347
## F-statistic: 161.5 on 3 and 416 DF, p-value: < 2.2e-16
On your own, interpret the coefficients. This is the most important part of the game!
Next, create the residuals of the regression you just ran. Hint: you can use the mutate command combined with the resid(your_model) command, which returns a vector of residuals. This is super handy, especially for this section.
school <- mutate(school, resid1 = resid(ols1))
However, we know this is the heteroskedasticity lab, so we need to check for heteroskedasticity. How would we do this? Plot the residuals against each of the three explanatory variables (and, while we’re at it, against the total score itself)!
qplot(school$avginc, school$resid1) #eesh
qplot(school$str, school$resid1) #pretty good
qplot(school$enrltot, school$resid1) #eesh
qplot(school$scrtot, school$resid1) #eesh
Does it appear as though our disturbances are heteroskedastic? Yes, there seems to be more variance in our errors as these variables increase. Why do we care about heteroskedasticity? Does this cause any issues? It violates our assumption of homoskedasticity, which in turn gives us incorrect standard errors, and we can no longer do valid inference.
Let’s run a similar regression, but this time let’s try some different specifications. The reason we want to try multiple specifications first is that misspecification itself can cause heteroskedasticity.
Let’s start with a log-linear specification. How would we go about this?
lm_i <- lm(data = school, log(scrtot) ~ avginc + str + enrltot)
Now, we can use our quick-plot tool to examine our new residuals
qplot(school$avginc, resid(lm_i))
qplot(school$str, resid(lm_i))
qplot(school$enrltot, resid(lm_i))
qplot(school$scrtot, resid(lm_i))
Next, let’s do a log-log specification
lm_ii <- lm(data = school, log(scrtot) ~ log(avginc) + log(str) + log(enrltot))
and qplot again
qplot(school$avginc, resid(lm_ii))
qplot(school$str, resid(lm_ii))
qplot(school$enrltot, resid(lm_ii))
qplot(school$scrtot, resid(lm_ii))
How about lin-log?
lm_iii <- lm(data = school, scrtot ~ log(avginc) + log(str) + log(enrltot))
qplot(log(school$avginc), resid(lm_iii))
qplot(log(school$str), resid(lm_iii))
qplot(log(school$enrltot), resid(lm_iii))
qplot(school$scrtot, resid(lm_iii))
Maybe interaction terms will help us.
lm_iv <- lm(data = school, scrtot ~ avginc + str + enrltot + avginc:str + avginc:enrltot + str:enrltot)
qplot(school$avginc, resid(lm_iv))
qplot(school$str, resid(lm_iv))
qplot(school$enrltot, resid(lm_iv))
Okay. Maybe we really need quadratic terms. We’ll keep our interactions.
# wrap the squared terms in I() so that ^2 means squaring rather than the formula crossing operator
lm_v <- lm(data = school, scrtot ~ avginc + str + enrltot + avginc:str + avginc:enrltot + str:enrltot + I(avginc^2) + I(str^2) + I(enrltot^2))
qplot(school$avginc, resid(lm_v))
qplot(school$str, resid(lm_v))
qplot(school$enrltot, resid(lm_v))
Enough! Okay, it seems like there probably is heteroskedasticity. What can we do from here on out? We need tests. Tests with funny names….
The first is the Goldfeld-Quandt test. What we need to do: compare the variance of the residuals from the first 3/8 of the observations with the variance from the last 3/8, after ranking the observations by the explanatory variable of interest. The test looks at the ratio of these two variances and sees how far it is from 1. Think about it: if the ratio is 1, then the spread of your errors at the beginning and at the end of the ordering is roughly the same!
Note that Goldfeld-Quandt only allows you to look at one variable at a time. Let’s focus on income for the moment. If we wanted to do this by hand, we’d need to follow some steps: 6 steps in total.
First, order your observations by your variable of interest; in this case, let’s use income.
school <- arrange(school, avginc)
head(school$avginc) #we can see it's increasing
## [1] 5.335 5.699 6.216 6.577 6.613 6.983
We can see it’s increasing, which is what we want.
What is 3/8 of our observations?
(nrow(school)/8)*3
## [1] 157.5
Next, split the data into two groups of appropriate size; in our case, each group is three-eighths of the full dataset. There are a couple of ways to get that count:
n_GQ <- as.integer(nrow(school) * 3 / 8) # truncates 420 * 3/8 = 157.5 down to 157
n_GQ <- (nrow(school) %/% 8) * 3 # integer-divides first, giving 52 * 3 = 156 (this overwrites the line above)
n_GQ
## [1] 156
Either way, this gives us a whole-number count that is roughly 3/8 of our full sample (the two approaches differ by at most one observation).
Run separate regressions of y (total score) on our regressors for each of the two 3/8 groups.
lm_g1 <- lm(data = tail(school, n_GQ), scrtot ~ avginc + str + enrltot)
Using the tail command, we run the regression on the last 3/8 (the highest-income districts).
lm_g2 <- lm(data = head(school, n_GQ), scrtot ~ avginc + str + enrltot)
Using head, we do the same for the first 3/8 (the lowest-income districts).
Record the sum of squared errors for each group so we can test.
e_g1 <- resid(lm_g1) #the resid command gets us our models' residuals!
e_g2 <- resid(lm_g2)
Now, to get the SSE, we square the residuals and sum them up. Squaring removes the signs and puts more weight on bigger errors.
sse_g1 <- sum(e_g1 ^ 2)
sse_g2 <- sum(e_g2 ^ 2)
Calculate the G-Q test statistic and compute the p-value.
stat_GQ <- (sse_g2 / sse_g1)
stat_GQ
## [1] 1.489965
Which letter test is this?….. the F one!! Each group gives us n - k degrees of freedom, where n is the group size and k is the number of parameters estimated (here k = 4: the intercept plus three slopes).
p_GQ <- pf(q = stat_GQ, df1 = n_GQ-4, df2 = n_GQ-4, lower.tail = F)
p_GQ
## [1] 0.007222659
pf gives probabilities from an F distribution. Our statistic is the ratio of the two SSEs, and values far above 1 are evidence that the two variances differ, so we want the upper-tail probability; hence lower.tail = F.
State the null hypothesis and draw a conclusion. The null hypothesis of Goldfeld-Quandt is: H0: the variances of the residuals from regressions using the first 3/8 and the last 3/8 of the dataset, ordered by average income, are the same.
The alternative hypothesis of Goldfeld-Quandt is: HA: the variances of the residuals from regressions using the first 3/8 and the last 3/8 of the dataset, ordered by average income, are different. More formally: H0: sigma_1^2 = sigma_2^2, and HA: sigma_1^2 != sigma_2^2. Conclusion: can we reject H0? Yes!
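By the way, the lmtest package we loaded at the top has a canned version of this test, gqtest(). As a rough cross-check of the by-hand work (the exact numbers can differ a bit depending on how the middle observations are dropped and which group ends up in the numerator of the ratio), something like this should do it:
# Goldfeld-Quandt via lmtest: order by avginc and drop the middle 25%, so each tail is about 3/8 of the data
gqtest(scrtot ~ avginc + str + enrltot, data = school,
       order.by = ~ avginc, fraction = 0.25, alternative = "two.sided")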
What’s the problem with this test? What if the heteroskedasticity occurs in the middle, not in the first or last 3/8s? Time for a new test: the Breusch-Pagan test.
If you want to see whether your variables are correlated with your errors, why not just run a regression? Again, we need to follow some steps to run this test. How do we go about this? Another six steps!!
Our first goal is to get the residuals from our regression, so let’s get those. Regress y (scrtot) on an intercept, x1 (avginc), x2 (str), … xk (enrltot).
lm_frst <- lm(data = school, scrtot ~ avginc + str + enrltot)
Record residuals
e_BP <- resid(lm_frst) #use the resid() command again
Regress e^2 on an intercept, x1, x2, … xk
lm_BrPa <- lm(data = school, e_BP ^ 2 ~ avginc + str + enrltot)
Record R^2
r2_BP <- summary(lm_BrPa)$r.squared
This lets us pull R-squared out of the summary object. What is the $ operator doing here? It extracts a single named element from a list.
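If you’re curious what else lives inside that summary object, you can list its pieces:
names(summary(lm_BrPa)) # the summary is a list; $ grabs one named element, such as r.squared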
Compute the Breusch-Pagan statistic (called the Lagrange multiplier statistic) and the p-value.
LM_BP <- nrow(school) * r2_BP
The B-P statistic is n * R^2 and has k degrees of freedom, where k is the number of slope coefficients in the auxiliary regression (here 3). Under the null it follows a chi-squared distribution, which is how we get our B-P test p-value.
pchisq(q = LM_BP, df = 3, lower.tail = F) #upper-tail probability from the chi-squared distribution
## [1] 0.01342699
State the null hypothesis and draw conclusion. What is H0?
H0: b1 = b2 = b3 = 0, where b1, b2, b3 are the slope coefficients of the auxiliary regression: e_BP ^ 2 ~ b0 + b1 * avginc + b2 * str + b3 * enrltot
Note: if even one of these coefficients isn’t equal to 0, we have heteroskedasticity!
What do we conclude? We reject the null hypothesis at the 5% significance level. We have one more test to walk through. The issue with the Breusch-Pagan test is that its auxiliary regression only lets the error variance depend linearly on the regressors, so it can miss nonlinear forms of heteroskedasticity.
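Before we move on: lmtest also has a canned Breusch-Pagan test, bptest(). Its default (Koenker, studentized) version is built from the same n * R^2 auxiliary regression we just ran by hand, so it should give a matching statistic:
bptest(lm_frst) # Breusch-Pagan test: n * R^2 from regressing the squared residuals on the x's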
The White test is very similar to the Breusch-Pagan test, with a few modifications: when regressing the squared errors, in addition to the original regressors, we add interaction terms and squared terms. So the first two steps of the White test are identical to those in the Breusch-Pagan test.
Regress y (scrtot) on an intercept, x1 (avginc), x2 (str), … xk (enrltot)
lm_first <- lm(data = school, scrtot ~ avginc + str + enrltot)
Record residuals
e_WhiteTest <- resid(lm_first)
Regress e^2 on an intercept, x1, x2, … xk AND THE INTERACTION AND SQUARED TERMS. This is where we go off the B-P script.
# use the residuals we just recorded, and wrap the squared terms in I() so they enter as actual squares
lm_White <- lm(data = school, e_WhiteTest ^ 2 ~ avginc + str + enrltot +
  avginc:str + avginc:enrltot + str:enrltot +
  I(avginc^2) + I(str^2) + I(enrltot^2))
Record R^2
r2_White <- summary(lm_White)$r.squared
Compute the White test statistic (again a Lagrange multiplier statistic of the form n * R^2) and the p-value.
LM_White <- nrow(school) * r2_White
pchisq(q = LM_White, df = 9, lower.tail = F)
## [1] 2.411685e-05
pchisq again takes a value and outputs an upper-tail probability from the chi-squared distribution, here with 9 degrees of freedom (3 main effects, 3 interactions, 3 squared terms). So what can we say? We have heteroskedasticity. One issue with this test is that the number of variables you have to include in the auxiliary regression grows quickly. Think about what you would have to include if you had 5 independent variables (x’s): 5 main effects, 5 squared terms, and 10 interactions, for 20 regressors in total.
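As with Breusch-Pagan, there is a shortcut: bptest() will run a White-style test if you hand it the full set of auxiliary regressors yourself. A sketch that mirrors the by-hand version above:
bptest(lm_first, ~ avginc + str + enrltot + avginc:str + avginc:enrltot + str:enrltot +
         I(avginc^2) + I(str^2) + I(enrltot^2), data = school)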
How do we fix the problem? We have some serious heteroskedasticity problems. We can use heteroskedasticity-robust standard errors! We could build these on our own, but R has some built-in functions!
p_load(lfe)
lm_hetero_robust <- lfe::felm(data = school, scrtot ~ avginc + str + enrltot)
Now we can ask for standard errors that are robust to heteroskedasticity:
summary(lm_hetero_robust, robust = T)
##
## Call:
## lfe::felm(formula = scrtot ~ avginc + str + enrltot, data = school)
##
## Residuals:
## Min 1Q Median 3Q Max
## -82.566 -18.354 2.205 17.654 61.413
##
## Coefficients:
## Estimate Robust s.e t value Pr(>|t|)
## (Intercept) 1.259e+03 1.474e+01 85.374 < 2e-16 ***
## avginc 3.772e+00 2.370e-01 15.912 < 2e-16 ***
## str -1.830e-01 7.270e-01 -0.252 0.801
## enrltot -1.671e-03 2.768e-04 -6.037 3.48e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.99 on 416 degrees of freedom
## Multiple R-squared(full model): 0.538 Adjusted R-squared: 0.5347
## Multiple R-squared(proj model): 0.538 Adjusted R-squared: 0.5347
## F-statistic(full model, *iid*):161.5 on 3 and 416 DF, p-value: < 2.2e-16
## F-statistic(proj model): 101.2 on 3 and 416 DF, p-value: < 2.2e-16
summary(lm_hetero_robust, robust = F)
##
## Call:
## lfe::felm(formula = scrtot ~ avginc + str + enrltot, data = school)
##
## Residuals:
## Min 1Q Median 3Q Max
## -82.566 -18.354 2.205 17.654 61.413
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.259e+03 1.501e+01 83.825 < 2e-16 ***
## avginc 3.772e+00 1.817e-01 20.759 < 2e-16 ***
## str -1.830e-01 7.268e-01 -0.252 0.801
## enrltot -1.671e-03 3.419e-04 -4.887 1.47e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.99 on 416 degrees of freedom
## Multiple R-squared(full model): 0.538 Adjusted R-squared: 0.5347
## Multiple R-squared(proj model): 0.538 Adjusted R-squared: 0.5347
## F-statistic(full model):161.5 on 3 and 416 DF, p-value: < 2.2e-16
## F-statistic(proj model): 161.5 on 3 and 416 DF, p-value: < 2.2e-16
Notice that the standard errors differ depending on whether or not you made them robust. Note that the standard error doesn’t change much for ‘str’ because, if we recall, str didn’t seem heteroskedastic!
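We also loaded the sandwich and lmtest packages at the top, which give another common route to robust standard errors. A minimal sketch using our original ols1 model (HC1 is one popular flavor of the robust variance estimator; the numbers should be close to felm’s robust column):
coeftest(ols1, vcov. = vcovHC(ols1, type = "HC1")) # heteroskedasticity-robust (sandwich) standard errors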
Ok! Lots covered here. R makes our life easier, but we still have to know where the problems come from and what they mean so we can use all of R’s helpful tools.