Today, we will be working through some heteroskedasticity examples.
In order to do that though, we need two things: some data and our handy dandy pacman package. We’ll be using the Caschool data from Ecdat to look at some fun data.
First, let’s load pacman. If you need to, run install.packages("pacman") first. If you run into errors, load the packages directly with library() calls (e.g., library(Ecdat) and library(ggplot2)) to get around pacman install concerns. For everybody else,
library(pacman)
#install.packages("Ecdat") if necessary
p_load(Ecdat, tidyverse, ggplot2, sandwich, lmtest)
#Let's explore our dataset, Caschool
The Ecdat package contains the Caschool dataset, which features all sorts of fun heteroskedasticity we can look at.
Let’s do some exploration on our own. How do we inspect something? We can ask R using the ? operator.
?Caschool
Let’s take a peek at the first few rows of the dataset. How do we do this?
#Let's pass in our caschool data to a head command
head(Caschool)
## distcod county district grspan enrltot teachers
## 1 75119 Alameda Sunol Glen Unified KK-08 195 10.90
## 2 61499 Butte Manzanita Elementary KK-08 240 11.15
## 3 61549 Butte Thermalito Union Elementary KK-08 1550 82.90
## 4 61457 Butte Golden Feather Union Elementary KK-08 243 14.00
## 5 61523 Butte Palermo Union Elementary KK-08 1335 71.50
## 6 62042 Fresno Burrel Union Elementary KK-08 137 6.40
## calwpct mealpct computer testscr compstu expnstu str avginc
## 1 0.5102 2.0408 67 690.80 0.3435898 6384.911 17.88991 22.690001
## 2 15.4167 47.9167 101 661.20 0.4208333 5099.381 21.52466 9.824000
## 3 55.0323 76.3226 169 643.60 0.1090323 5501.955 18.69723 8.978000
## 4 36.4754 77.0492 85 647.70 0.3497942 7101.831 17.35714 8.978000
## 5 33.1086 78.4270 171 640.85 0.1280899 5235.988 18.67133 9.080333
## 6 12.3188 86.9565 25 605.55 0.1824818 5580.147 21.40625 10.415000
## elpct readscr mathscr
## 1 0.000000 691.6 690.0
## 2 4.583333 660.5 661.9
## 3 30.000002 636.3 650.9
## 4 0.000000 651.9 643.5
## 5 13.857677 641.8 639.9
## 6 12.408759 605.7 605.4
I think using some function of our test scores would work well as our outcome variable.
Let’s try to figure out what the range of the total score (reading + math) is. How should we do this?
We’ll use some tools from our friendly neighborhood tidyverse to solve this. Remember our mutate command? We’ll be using that.
#We need the tidyverse, so if you haven't already, run the following:
#p_load(tidyverse)
#Generate total scores by adding reading and math scores together.
Caschool <- mutate(Caschool, scrtot = readscr + mathscr)
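To answer the range question, here’s a quick sketch using base R’s range() and summary() on our new scrtot column:
#What does our new total score variable look like?
range(Caschool$scrtot)
summary(Caschool$scrtot)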
Now, we need to do some analysis. Back to regressions. Let’s run the linear regression of total score on income, student-to-teacher ratio, enrollment, and expenditures per student. First, let’s explore a little.
ols1 <- lm(data = Caschool, scrtot ~ avginc + str + enrltot + expnstu)
summary(ols1)
##
## Call:
## lm(formula = scrtot ~ avginc + str + enrltot + expnstu, data = Caschool)
##
## Residuals:
## Min 1Q Median 3Q Max
## -80.657 -17.753 2.213 17.866 62.657
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.313e+03 2.783e+01 47.171 < 2e-16 ***
## avginc 3.864e+00 1.851e-01 20.876 < 2e-16 ***
## str -1.395e+00 8.927e-01 -1.563 0.1188
## enrltot -1.611e-03 3.411e-04 -4.722 3.2e-06 ***
## expnstu -6.052e-03 2.613e-03 -2.316 0.0211 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.86 on 415 degrees of freedom
## Multiple R-squared: 0.5439, Adjusted R-squared: 0.5395
## F-statistic: 123.7 on 4 and 415 DF, p-value: < 2.2e-16
On your own, interpret the coefficients. In particular, think about the coefficient on expnstu: it’s significant at the 5% level. Why might we see a negative sign on expenditures per student?
Ok, now we need to find a way to examine heteroskedasticity. The first step is to recover the errors, so let’s figure out how to do that.
Create the residuals of the regression you just ran. Hint: you can use the mutate command, combined with the resid(your_lm_model) command. This will return a vector of residuals! This is super handy, especially for this section. Let’s create a new column in our dataframe with our errors.
Caschool <- mutate(Caschool, resid1 = resid(ols1))
However, we know this is the heteroskedasticity lab, so we need to check for heteroskedasticity. How would we do this?
Plot residuals against each of the four explanatory variables!
#Let's use our qplot command to graph our X variables against our residuals to see if they seem reasonable. Think a bit about
#what 'reasonable' means.
#Average income
qplot(Caschool$avginc, Caschool$resid1)
#Student-teacher ratio
qplot(Caschool$str, Caschool$resid1)
#Enrollment total
qplot(Caschool$enrltot, Caschool$resid1)
#Expenditures per student
qplot(Caschool$expnstu, Caschool$resid1)
Does it appear as though our disturbances are heteroskedastic? Explain.
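If you prefer the full ggplot() syntax (qplot() is deprecated in recent ggplot2 releases), here is a minimal sketch of the same residual plot for average income; the other three variables follow the same pattern:
#ggplot() version of the residuals-vs-income plot
ggplot(Caschool, aes(x = avginc, y = resid1)) +
  geom_point() +
  labs(x = "Average income", y = "Residuals from ols1")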
Why do we care about heteroskedasticity? What issues does that cause with our assumptions about OLS?
Let’s run a similar regression, but this time let’s use some common transformation techniques. Why do we want to do this?
The reason we want to try different model specifications first is that model misspecification itself can cause heteroskedasticity. Why do log-type transformations sometimes help us?
Let’s start with a log-linear specification. How would we go about this?
#Let's run a regression with the form log(y) = x1 + x2 + ...
lm_i <- lm(data = Caschool, log(scrtot) ~ avginc + str + enrltot + expnstu)
Now, we can use our quick-plot tool to examine our new residuals.
qplot(Caschool$avginc, resid(lm_i))
qplot(Caschool$str, resid(lm_i))
qplot(Caschool$enrltot, resid(lm_i))
qplot(Caschool$expnstu, resid(lm_i))
Next, let’s do a log-log specification.
#our form is now log(Y) = log(X1) + log(X2)...
lm_ii <- lm(data = Caschool, log(scrtot) ~ log(avginc) + log(str) + log(enrltot) + log(expnstu))qplot again
qplot(log(Caschool$avginc), resid(lm_ii))qplot(log(Caschool$str), resid(lm_ii))qplot(log(Caschool$enrltot), resid(lm_ii))qplot(log(Caschool$expnstu), resid(lm_ii))lm_checker <- lm(data = Caschool, (resid(lm_ii)^2) ~ log(avginc) + log(str) + log(enrltot))
summary(lm_checker)$r.squared## [1] 0.0350631
That’s a bit hard to interpret. We might be okay here. But let’s keep looking. (Note: how would you interpret these coefficients?)
How about lin-log?
#y = log(x1)+ log(x2)...
lm_iii <- lm(data = Caschool, scrtot ~ log(avginc) + log(str) + log(enrltot))qplot(log(Caschool$avginc), resid(lm_iii))qplot(log(Caschool$str), resid(lm_iii))qplot(log(Caschool$enrltot), resid(lm_iii))qplot(log(Caschool$enrltot), resid(lm_iii))Maybe interaction terms will help us…
lm_iv <- lm(data = Caschool, scrtot ~ avginc + str + enrltot + expnstu + avginc:str + avginc:enrltot + avginc:expnstu + str:enrltot + str:expnstu + enrltot:expnstu)
Ok, let’s look at our errors.
qplot(Caschool$avginc, resid(lm_iv))
qplot(Caschool$str, resid(lm_iv))
qplot(Caschool$enrltot, resid(lm_iv))
qplot(Caschool$expnstu, resid(lm_iv))
#Note: inside an R formula, x^2 does not add a quadratic term; wrap it in I() to get an actual squared regressor.
lm_v <- lm(data = Caschool, scrtot ~ avginc + str + enrltot + expnstu + avginc:str + avginc:enrltot + avginc:expnstu + str:enrltot + str:expnstu + enrltot:expnstu + I(avginc^2) + I(str^2) + I(enrltot^2) + I(expnstu^2))
qplot(Caschool$avginc, resid(lm_v))
qplot(Caschool$str, resid(lm_v))
qplot(Caschool$enrltot, resid(lm_v))
qplot(Caschool$expnstu, resid(lm_v))
Shoot. Okay, what can we do from here on out? We need tests. Tests with funny names….
Ok. Before we start down this path, let’s try to figure out what exactly we need to look for in a heteroskedasticity test. What exactly are we testing for here? We want to be sure that our variance isn’t changing with our predictors. Well, what’s one way of checking that? How can we do this?
One way is to split the sample by the values of an explanatory variable and compare how large the errors are in each group. That’s what the GQ (Goldfeld-Quandt) test is trying to do! But we need a process. You can actually run all of this code with tools you’ve already seen so far. Let’s do that.
First, we need to pick a fraction of our sample to compare. Let’s look at the first and last 3/8 of our sample and then test if they have the same variances.
What we need to do: compare the variances of the residuals from the first 3/8 and the last 3/8 of the sample, ranked by the explanatory variable of interest.
The test looks at the ratio of those variances and sees how far it is from 1. Think about it: if the ratio is 1, then across large and small values of your variable of interest, your error variance is roughly the same.
Note that Goldfeld-Quandt only allows you to look at one variable at a time. Let’s focus on income for the moment.
If we wanted to do this by hand, we’d need to follow some steps. 6 steps in total.
1.) Order your observations by your variable of interest, in this case, let’s do income. It looked the worst in our graphs, so it makes sense to check here.
#in order to order a variable, all we need to do is sort it, using the arrange command.
Caschool <- arrange(Caschool, avginc)
head(Caschool$avginc)
## [1] 5.335 5.699 6.216 6.577 6.613 6.983
We can see avginc is now sorted in increasing order.
2.) Split the data into two groups of appropriate size. We’ve chosen three-eighths as our group size, so we need to know what 3/8 of our dataset is.
n_GQ <- as.integer(nrow(Caschool) * 3/ 8)
(nrow(Caschool) %/% 8) * 3 #Option 2. This one uses only commands we've already covered. It gives us something close to, but slightly smaller than, the option above. Can you think of why? What does %/% do?
## [1] 156
n_GQ #this gives us the integer closest to 3/8 of our full sample
## [1] 157
3.) Run separate regressions of y (total score) on our explanatory variables for each of the two 3/8 groups. We can do this with clever use of our head and tail commands.
#on the last 3/8
lm_g1 <- lm(data = tail(Caschool, n_GQ), scrtot ~ avginc + str + enrltot + expnstu) #using the tail command, we run the regression on the last n_GQ rows (highest incomes)
#on the first 3/8
lm_g2 <- lm(data = head(Caschool, n_GQ), scrtot ~ avginc + str + enrltot + expnstu) #using head, we run the same regression on the first n_GQ rows (lowest incomes)
4.) Record our sums of squared errors for the test.
We need to recover the SSEs for our test, so we need to build those. We’ll use the sum() command, as well as our resid() command we just learned to do this.
e_g1 <- resid(lm_g1) #the resid command gets us our models' residuals!
e_g2 <- resid(lm_g2)
sse_g1 <- sum(e_g1 ^ 2) #now, to get SSE, we need to square the residuals, and then sum them.
sse_g2 <- sum(e_g2 ^ 2)
5.) Calculate the G-Q test statistic and compute the p-value. How do we do this? Stop for a second and look at Ed’s notes. What kind of test is the GQ test?
stat_GQ <- (sse_g2 / sse_g1)
stat_GQ
## [1] 1.39631
Great. So did you look up the test type? Good. We can use an F-test to produce a p-value for our calculated statistic.
NOTE: I made an error on my last version of these notes that incorrectly specified the degrees of freedom for this test. They should be as below: n_GQ minus k (the number of variables) minus 1. Thanks to Jason for catching this.
#Ok. So. In general, any distribution you can think of has a set of functions in R, prefixed with r, p, q, and d. These correspond to: random draws from that distribution (r), the cumulative probability of seeing a value at or below a given size (p), the quantile function (q), and the density (d). Most often you'll see r and p used, but the others are useful as well.
p_GQ <- pf(q = stat_GQ, df1 = n_GQ-4-1, df2 = n_GQ-4-1, lower.tail = F) #pf gives probability from an f-dist.
p_GQ
## [1] 0.02019347
So what can we say? I’d challenge you to figure out EXACTLY what our null hypothesis is here.
6.) State the null hypothesis and draw a conclusion.
The Null Hypothesis of Goldfeld-Quandt is… H0: The variances of the residuals from regressions using the first 3/8 and the last 3/8 of the dataset, ordered by average income, are the same.
The Alternative Hypothesis of Goldfeld-Quandt is… HA: The variances of the residuals from regressions using the first 3/8 and the last 3/8 of the dataset, ordered by average income, are different.
More formally: H0: σ₁² = σ₂², and HA: σ₁² ≠ σ₂².
Conclusion: can we reject H0?
Yes! At the 5% significance level.
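As a sanity check on our by-hand numbers, the lmtest package we loaded earlier has a built-in gqtest() function. Here’s a minimal sketch, under the assumption that dropping the middle quarter of income-ordered observations (fraction = 0.25) reproduces our first/last 3/8 split; the exact figures may differ slightly from the manual version because of how the split and degrees of freedom are handled.
#Goldfeld-Quandt test via lmtest: order by avginc and omit the central 25% of observations,
#so each tail holds roughly 3/8 of the sample, as in the manual version above
gqtest(scrtot ~ avginc + str + enrltot + expnstu,
       order.by = ~ avginc, fraction = 0.25,
       data = Caschool, alternative = "two.sided")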
Let’s look at another test: the Breusch-Pagan (BP) test. Again, we need to follow some steps to run it. How do we go about this? Another six steps!!
1.) Regress y (scrtot) on an intercept, x1 (avginc), x2 (str), … xk (expnstu)
lm_frst <- lm(data = Caschool, scrtot ~ avginc + str + enrltot + expnstu)
2.) Recover residuals from our original regression.
Let’s use the resid() command again
e_BP <- resid(lm_frst) #use the resid() command again
3.) Regress our squared errors (e^2) on an intercept, x1, x2, … xk (our explanatory variables).
lm_BrPa <- lm(data = Caschool, e_BP ^ 2 ~ avginc + str + enrltot + expnstu)
4.) Record R^2.
Now we need to do something a little strange. We’re going to call the r.squared object from within the lm summary object. When have we done this before? Yep! Dataframes. So we use the same formatting.
r2_BP <- summary(lm_BrPa)$r.squared #this lets us pull out the R-squared
5.) Compute the Breusch-Pagan statistic (called the Lagrange Multiplier) and the p-value.
LM_BP <- nrow(Caschool) * r2_BP
pchisq(q = LM_BP, df = 4, lower.tail = F) #pchisq returns a probability from the chi-squared distribution
## [1] 0.02253599
If you’re wondering why we used 4 degrees of freedom (df in the function above is 4), I encourage you to look up Breusch-Pagan for more details; the general rule of thumb is that BP is a test whose statistic is n*R^2 with k degrees of freedom, where k is the number of regressors in the auxiliary regression.
6.) State the null hypothesis and draw conclusion. What is H0?
H0: b1 = b2 = b3 = b4 = 0, where b1, b2, b3, b4 are the slope coefficients of the regression model: e_BP^2 ~ b0 + b1*avginc + b2*str + b3*enrltot + b4*expnstu
What do we conclude?
We reject the null hypothesis at the 5% significance level. We have one more test to walk through.
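Before moving on, note that lmtest also ships a built-in bptest() you can use to check your hand-rolled numbers. A minimal sketch: by default bptest() studentizes the statistic (the Koenker variant), so set studentize = FALSE if you want the classic n*R^2 version we just computed.
#Breusch-Pagan test via lmtest; studentize = FALSE gives the classic LM = n*R^2 statistic
bptest(lm_frst, studentize = FALSE)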
The White test is very similar to the Breusch-Pagan test, with a few modifications:
When regressing the squared errors in the B-P test, in addition to the original regressors, we add interaction terms and squared terms.
So the first two steps of the White test are identical to those in the Breusch-Pagan test
White test steps! 1.) Regress y (scrtot) on an intercept, x1 (avginc), x2 (str), … xk (expnstu). Same as BP
lm_first <- lm(data = Caschool, scrtot ~ avginc + str + enrltot + expnstu)
2.) Record residuals from our reg. Also the same as BP.
e_WhiteTest <- resid(lm_first)
3.) Regress e^2 on an intercept, x1, x2, … xk AND THE INTERACTION AND SQUARED TERMS. This is where we go off the BP script.
#Use the residuals we just saved (e_WhiteTest), and wrap the squared terms in I() so they actually enter the formula as quadratics
lm_White <- lm(data = Caschool, e_WhiteTest ^ 2 ~ avginc + str + enrltot + expnstu + avginc:str + avginc:enrltot + avginc:expnstu + str:enrltot + str:expnstu + enrltot:expnstu + I(avginc^2) + I(str^2) + I(enrltot^2) + I(expnstu^2))
4.) Record R^2. Back to the BP script, but with a new regression model.
r2_White <- summary(lm_White)$r.squared
5.) Compute the Breusch-Pagan-style statistic (the Lagrange Multiplier) and the p-value.
LM_White <- nrow(Caschool) * r2_White
pchisq(q = LM_White, df = 9, lower.tail = F) #pchisq takes a value and returns a probability from the chi-squared distribution
## [1] 1.717395e-05
So what can we say? What is our H0? What is HA?
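If you want a packaged version of the White test, one common approach is to reuse bptest() and pass the interaction and squared terms through its varformula argument. A minimal sketch, under the same studentize = FALSE assumption as before (the (a + b + c + d)^2 shorthand expands to all main effects plus their pairwise interactions):
#White-style test via bptest(): auxiliary regression with interactions and squared terms
bptest(lm_first,
       varformula = ~ (avginc + str + enrltot + expnstu)^2 +
         I(avginc^2) + I(str^2) + I(enrltot^2) + I(expnstu^2),
       data = Caschool, studentize = FALSE)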
How do we fix our problem? We have some serious heteroskedasticity problems. We could try another specification, but sometimes those don’t work, so we have to turn to heteroskedasticity-robust standard errors.
We could build these on our own, but R has some built-in functions! Let’s see how they work. This next section uses the sandwich and lmtest packages, which we already loaded with p_load above.
#first let's remind ourselves what our coefs and SE look like
lm_orig <- lm(data = Caschool, scrtot ~ avginc + str + enrltot + expnstu)
summary(lm_orig)
##
## Call:
## lm(formula = scrtot ~ avginc + str + enrltot + expnstu, data = Caschool)
##
## Residuals:
## Min 1Q Median 3Q Max
## -80.657 -17.753 2.213 17.866 62.657
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.313e+03 2.783e+01 47.171 < 2e-16 ***
## avginc 3.864e+00 1.851e-01 20.876 < 2e-16 ***
## str -1.395e+00 8.927e-01 -1.563 0.1188
## enrltot -1.611e-03 3.411e-04 -4.722 3.2e-06 ***
## expnstu -6.052e-03 2.613e-03 -2.316 0.0211 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.86 on 415 degrees of freedom
## Multiple R-squared: 0.5439, Adjusted R-squared: 0.5395
## F-statistic: 123.7 on 4 and 415 DF, p-value: < 2.2e-16
#Now let's look at our HR standard errors. Notice a difference?
lm_lets_correct <- lm(data = Caschool, scrtot ~ avginc + str + enrltot + expnstu)
#This command lets us pass a NEW variance-covariance matrix that is robust to heteroskedasticity. The vcovHC command
#lets us build that matrix using a heteroskedasticity-correcting method. You have options HC0 through HC5,
#but Stata generally uses HC1. In general, they all do the job.
coeftest(lm_lets_correct, vcov = vcovHC(lm_lets_correct, "HC1"))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.3129e+03 2.9364e+01 44.7112 < 2.2e-16 ***
## avginc 3.8640e+00 2.3972e-01 16.1192 < 2.2e-16 ***
## str -1.3954e+00 9.3154e-01 -1.4980 0.13490
## enrltot -1.6109e-03 2.7802e-04 -5.7942 1.359e-08 ***
## expnstu -6.0517e-03 2.7936e-03 -2.1663 0.03086 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In particular, look at how different the std errors are for avg income. Why did they change so much?
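If you’d like to see the change numerically instead of comparing two printed tables, here’s a minimal sketch that lines up the conventional and HC1 standard errors side by side (vcov() and vcovHC() both return variance-covariance matrices, so the square roots of their diagonals are the standard errors):
#Compare conventional and heteroskedasticity-robust (HC1) standard errors
se_conventional <- sqrt(diag(vcov(lm_lets_correct)))
se_robust <- sqrt(diag(vcovHC(lm_lets_correct, "HC1")))
round(cbind(conventional = se_conventional, robust_HC1 = se_robust, ratio = se_robust / se_conventional), 4)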
Awesome! We just did some heteroskedasticity. Good job!
This lab was a lot but it’ll be useful for you later. Enjoy your weekend, and I will see you guys next week.