Heteroskedasticity

Today, we will be working through some heteroskedasticity examples. In order to do that though, we need two things: some data and our handy dandy pacman package.

library(pacman)
#install.packages("Ecdat")
p_load(Ecdat, ggplot2, sandwich, lmtest, dplyr)

The Ecdat package contains the ‘Caschool’ dataset, which features all sorts of fun heteroskedasticity. Let’s do some exploration on our own. How do we inspect something? We can ask R!

?Caschool

We’re going to start with some exploration on our own.

Question1: How many observations are there? What time span does the data cover? Answer1: 420. 1998-1999

Question2: What kind of dataset is it? Answer2: dataframe

Question3: How many columns (variables) are there in the dataset? Answer3: 17

Question4: What variables should we choose as the left-hand side dependent variables here? (Hint: We are trying to understand what impacts educational outcomes here!) Answer4: Test scores seem like a good choice.

Question5: What are the variable names for income, student-to-teacher ratio, and enrollment? Answer5: avginc, str, enrltot
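If you’d rather confirm these answers in the console than read them off the help page, a quick sketch using base R:

nrow(Caschool)   #number of observations (420)
class(Caschool)  #what kind of object this is (a data frame)
ncol(Caschool)   #number of variables (17)
names(Caschool)  #variable names: look for avginc, str, and enrltot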

Let’s take a peek at the first few rows of the dataset. How do we do this?

head(Caschool)
##   distcod  county                        district grspan enrltot teachers
## 1   75119 Alameda              Sunol Glen Unified  KK-08     195    10.90
## 2   61499   Butte            Manzanita Elementary  KK-08     240    11.15
## 3   61549   Butte     Thermalito Union Elementary  KK-08    1550    82.90
## 4   61457   Butte Golden Feather Union Elementary  KK-08     243    14.00
## 5   61523   Butte        Palermo Union Elementary  KK-08    1335    71.50
## 6   62042  Fresno         Burrel Union Elementary  KK-08     137     6.40
##   calwpct mealpct computer testscr   compstu  expnstu      str    avginc
## 1  0.5102  2.0408       67  690.80 0.3435898 6384.911 17.88991 22.690001
## 2 15.4167 47.9167      101  661.20 0.4208333 5099.381 21.52466  9.824000
## 3 55.0323 76.3226      169  643.60 0.1090323 5501.955 18.69723  8.978000
## 4 36.4754 77.0492       85  647.70 0.3497942 7101.831 17.35714  8.978000
## 5 33.1086 78.4270      171  640.85 0.1280899 5235.988 18.67133  9.080333
## 6 12.3188 86.9565       25  605.55 0.1824818 5580.147 21.40625 10.415000
##       elpct readscr mathscr
## 1  0.000000   691.6   690.0
## 2  4.583333   660.5   661.9
## 3 30.000002   636.3   650.9
## 4  0.000000   651.9   643.5
## 5 13.857677   641.8   639.9
## 6 12.408759   605.7   605.4

I am going to save this dataset as a new object with an easier name to type. I also am curious about summary statistics:

school<-Caschool
summary(school)
##     distcod              county                         district     grspan   
##  Min.   :61382   Sonoma     : 29   Lakeside Union Elementary:  3   KK-06: 61  
##  1st Qu.:64308   Kern       : 27   Mountain View Elementary :  3   KK-08:359  
##  Median :67760   Los Angeles: 27   Jefferson Elementary     :  2              
##  Mean   :67473   Tulare     : 24   Liberty Elementary       :  2              
##  3rd Qu.:70419   San Diego  : 21   Ocean View Elementary    :  2              
##  Max.   :75440   Santa Clara: 20   Pacific Union Elementary :  2              
##                  (Other)    :272   (Other)                  :406              
##     enrltot           teachers          calwpct          mealpct      
##  Min.   :   81.0   Min.   :   4.85   Min.   : 0.000   Min.   :  0.00  
##  1st Qu.:  379.0   1st Qu.:  19.66   1st Qu.: 4.395   1st Qu.: 23.28  
##  Median :  950.5   Median :  48.56   Median :10.520   Median : 41.75  
##  Mean   : 2628.8   Mean   : 129.07   Mean   :13.246   Mean   : 44.71  
##  3rd Qu.: 3008.0   3rd Qu.: 146.35   3rd Qu.:18.981   3rd Qu.: 66.86  
##  Max.   :27176.0   Max.   :1429.00   Max.   :78.994   Max.   :100.00  
##                                                                       
##     computer         testscr         compstu           expnstu    
##  Min.   :   0.0   Min.   :605.5   Min.   :0.00000   Min.   :3926  
##  1st Qu.:  46.0   1st Qu.:640.0   1st Qu.:0.09377   1st Qu.:4906  
##  Median : 117.5   Median :654.5   Median :0.12546   Median :5215  
##  Mean   : 303.4   Mean   :654.2   Mean   :0.13593   Mean   :5312  
##  3rd Qu.: 375.2   3rd Qu.:666.7   3rd Qu.:0.16447   3rd Qu.:5601  
##  Max.   :3324.0   Max.   :706.8   Max.   :0.42083   Max.   :7712  
##                                                                   
##       str            avginc           elpct           readscr     
##  Min.   :14.00   Min.   : 5.335   Min.   : 0.000   Min.   :604.5  
##  1st Qu.:18.58   1st Qu.:10.639   1st Qu.: 1.941   1st Qu.:640.4  
##  Median :19.72   Median :13.728   Median : 8.778   Median :655.8  
##  Mean   :19.64   Mean   :15.317   Mean   :15.768   Mean   :655.0  
##  3rd Qu.:20.87   3rd Qu.:17.629   3rd Qu.:22.970   3rd Qu.:668.7  
##  Max.   :25.80   Max.   :55.328   Max.   :85.540   Max.   :704.0  
##                                                                   
##     mathscr     
##  Min.   :605.4  
##  1st Qu.:639.4  
##  Median :652.5  
##  Mean   :653.3  
##  3rd Qu.:665.9  
##  Max.   :709.5  
## 

Question6: What is the range of district average income?

range(school$avginc)
## [1]  5.335 55.328

Question7: What is the range of the student-to-teacher ratio?

range(school$str)
## [1] 14.0 25.8

Let’s figure out the range of the total score (reading + math) across districts. How should we do this? We’ll use some tools from our friendly neighborhood tidyverse to solve this.

p_load(tidyverse)
school <- mutate(school, scrtot = readscr + mathscr)
range(school$scrtot)
## [1] 1211.1 1413.5

Lesson 1: Running Regressions! Again!

Let’s run the linear regression of total score on income, student-to-teacher ratio, and enrollment, and explore a little.

ols1 <- lm(data = school, scrtot ~ avginc + str + enrltot)
summary(ols1)
## 
## Call:
## lm(formula = scrtot ~ avginc + str + enrltot, data = school)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -82.566 -18.354   2.205  17.654  61.413 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.259e+03  1.501e+01  83.825  < 2e-16 ***
## avginc       3.772e+00  1.817e-01  20.759  < 2e-16 ***
## str         -1.830e-01  7.268e-01  -0.252    0.801    
## enrltot     -1.671e-03  3.419e-04  -4.887 1.47e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.99 on 416 degrees of freedom
## Multiple R-squared:  0.538,  Adjusted R-squared:  0.5347 
## F-statistic: 161.5 on 3 and 416 DF,  p-value: < 2.2e-16

On your own, interpret the coefficients. This is the most important part of the game! For example, avginc is measured in thousands of dollars, so the estimate of roughly 3.77 says that a $1,000 increase in district average income is associated with about 3.77 more points of total score, holding str and enrltot fixed.

Lesson 2: Retrieving Residuals from an LM model

Create the residuals of the regression you just ran. Hint: you can use the mutate command, combined with the resid(your_model) command. This will return a vector of residuals! This is super handy, especially for this section.

school <- mutate(school, resid1 = resid(ols1))

However, we know this is the heteroskedasticity lab, so we need to check for heteroskedasticity. How would we do this? Plot the residuals against each of the three explanatory variables (and, for good measure, against the total score itself)!

qplot(school$avginc, school$resid1) #eesh

qplot(school$str, school$resid1) #pretty good

qplot(school$enrltot, school$resid1) #eesh

qplot(school$scrtot, school$resid1) #eesh
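If you’d like all four panels in a single figure, here is a sketch using tidyr’s pivot_longer(), which ships with the tidyverse we loaded above (older tidyr versions would use gather() instead):

school %>%
  select(avginc, str, enrltot, scrtot, resid1) %>%
  pivot_longer(cols = c(avginc, str, enrltot, scrtot),
               names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value, y = resid1)) +
  geom_point(alpha = 0.5) +                  #one point per district
  facet_wrap(~ variable, scales = "free_x")  #separate x-axis scale per panel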

Does it appear as though our disturbances are heteroskedastic? Yes, there seems to be more variance in our errors as these variables increase. Why do we care about heteroskedasticity? Does this cause any issues? It violates our assumption of homoskedasticity, which in turn gives us incorrect standard errors, and with incorrect standard errors we can no longer do valid inference.

Let’s run some similar regressions, but this time let’s try a few different specification techniques. The reason we want to try multiple specifications first is that misspecification itself can cause heteroskedasticity.

Let’s start with a log-linear specification. How would we go about this?

lm_i <- lm(data = school, log(scrtot) ~ avginc + str + enrltot)

Now, we can use our quick-plot tool to examine our new residuals

qplot(school$avginc, resid(lm_i))

qplot(school$str, resid(lm_i))

qplot(school$enrltot, resid(lm_i))

qplot(school$scrtot, resid(lm_i))

Next, let’s do a log-log specification

lm_ii <- lm(data = school, log(scrtot) ~ log(avginc) + log(str) + log(enrltot))

and qplot again

qplot(school$avginc, resid(lm_ii))

qplot(school$str, resid(lm_ii))

qplot(school$enrltot, resid(lm_ii))

qplot(school$scrtot, resid(lm_ii))

How about lin-log?

lm_iii <- lm(data = school, scrtot ~ log(avginc) + log(str) + log(enrltot))

qplot(log(school$avginc), resid(lm_iii))

qplot(log(school$str), resid(lm_iii))

qplot(log(school$enrltot), resid(lm_iii))

qplot(school$scrtot, resid(lm_iii))

Maybe interaction terms will help us.

lm_iv <- lm(data = school, scrtot ~ avginc + str + enrltot + avginc:str + avginc:enrltot + str:enrltot)

qplot(school$avginc, resid(lm_iv))

qplot(school$str, resid(lm_iv))

qplot(school$enrltot, resid(lm_iv))

Okay. Maybe we really need quadratic terms. We’ll keep our interactions.

lm_v <- lm(data = school, scrtot ~ avginc + str + enrltot + avginc:str + avginc:enrltot + str:enrltot + I(avginc ^ 2) + I(str ^ 2) + I(enrltot ^ 2)) #inside a formula, squared terms need to be wrapped in I()

qplot(school$avginc, resid(lm_v))

qplot(school$str, resid(lm_v))

qplot(school$enrltot, resid(lm_v))

Enough! Okay, it seems like there probably is heteroskedasticity. What can we do from here on out? We need tests. Tests with funny names….

Lesson 3: Goldfeld-Quandt test!

What we need to do: compare the residual variances of the first 3/8 and the last 3/8 of the observations, ranked by the explanatory variable of interest. The test takes the ratio of the two variances and asks how far it is from 1. Think about it: if the ratio is 1, then the spread of your errors at the beginning and at the end is roughly the same!

Note that Goldfeld-Quandt only lets you look at one variable at a time. Let’s focus on income for the moment. If we wanted to do this by hand, we’d need to follow some steps. Six steps in total.

1.

Order your observations by your variable of interest; in this case, let’s use income.

school <- arrange(school, avginc)
head(school$avginc) #we can see it's increasing
## [1] 5.335 5.699 6.216 6.577 6.613 6.983

We can see it’s increasing, which is what we want.

What is 3/8 of our observations:

(nrow(school)/8)*3
## [1] 157.5

2.

Split the data into two groups of appropriate size. In our case, each group is chosen to be three-eighths of the full dataset. There are a couple of ways to do this (they can differ by an observation or so due to rounding; we end up using the second one):

n_GQ <- as.integer(nrow(school) * 3/ 8)
n_GQ <- (nrow(school)%/%8) * 3
n_GQ 
## [1] 156

This gives us a whole number of observations that is roughly 3/8 of our full sample!

3.

Run separate regressions of y (total score) on our regressors (income, student-to-teacher ratio, and enrollment) for each of the two 3/8 groups.

lm_g1 <- lm(data = tail(school, n_GQ), scrtot ~ avginc + str + enrltot)

Using the tail() command, we run the regression on the last 3/8 of observations (the highest-income districts).

lm_g2 <- lm(data = head(school, n_GQ), scrtot ~ avginc + str + enrltot)

Using head(), we do the same for the first 3/8 (the lowest-income districts).

4.

Record our sums of squared errors for the test.

e_g1 <- resid(lm_g1) #the resid command gets us our models' residuals!
e_g2 <- resid(lm_g2)

Now, to get the SSE, we need to square the residuals: squaring removes the signs and puts more weight on bigger errors.

sse_g1 <- sum(e_g1 ^ 2) 
sse_g2 <- sum(e_g2 ^ 2)

5.

Calculate the G-Q test statistic and compute the p-value.

stat_GQ <- (sse_g2 / sse_g1)
stat_GQ
## [1] 1.489965

Which letter test is this?... the F one! Each group contributes n - k degrees of freedom, where n is the group size and k is the number of parameters estimated (here 4: the intercept plus three slopes).

p_GQ <- pf(q = stat_GQ, df1 = n_GQ-4, df2 = n_GQ-4, lower.tail = F) 
p_GQ
## [1] 0.007222659

pf() gives probabilities from an F distribution. Setting lower.tail = F gives the upper-tail probability: the chance of seeing a variance ratio at least this large if the two variances were in fact equal.

6.

State the null hypothesis and draw a conclusion. The null hypothesis of Goldfeld-Quandt is: H0: the variances of the residuals from the regressions using the first 3/8 and the last 3/8 of the dataset, ordered by average income, are the same.

The alternative hypothesis of Goldfeld-Quandt is: HA: the variances of the residuals from those two regressions are different. More formally: H0: sigma_1^2 = sigma_2^2, and HA: sigma_1^2 != sigma_2^2. Conclusion: can we reject H0? Yes! (As a cross-check, lmtest has a canned version of this test; see the sketch below.)
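The gqtest() function from the lmtest package we loaded at the top automates these steps. A minimal sketch, assuming we order by avginc and drop the central quarter of observations so that each group is roughly 3/8 of the sample (the numbers won’t match our by-hand version exactly because the split is rounded differently):

gqtest(scrtot ~ avginc + str + enrltot, data = school,
       order.by = ~ avginc,          #rank observations by average income
       fraction = 0.25,              #omit the central 25% of observations
       alternative = "two.sided")    #allow a difference in either direction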

What’s the problem with this test? What if the heteroskedasticity occurs in the middle of the distribution, not in the first or last 3/8? Time for a new test.

Lesson 4: Breusch-Pagan test:

If you want to see whether your variables are correlated with your (squared) errors, why not just run a regression? Again, we need to follow some steps to run this test. How do we go about this? Another six steps!

Step 1.

Our first goal is to get the residuals from our regression, so let’s get those. Regress y (scrtot) on an intercept, x1 (avginc), x2 (str), … xk (enrltot).

lm_frst <- lm(data = school, scrtot ~ avginc + str + enrltot)

Step 2.

Record residuals

e_BP <- resid(lm_frst) #use the resid() command again

Step 3.

Regress e^2 on an intercept, x1, x2, … xk

lm_BrPa <- lm(data = school, e_BP ^ 2 ~ avginc + str + enrltot)

Step 4.

Record R^2

r2_BP <- summary(lm_BrPa)$r.squared

This extracts the R-squared from the summary object! What is the $ operator doing here?

Step 5.

Compute the Breusch-Pagan statistic (a Lagrange multiplier statistic) and the p-value.

LM_BP <- nrow(school) * r2_BP

The BP test statistic is n * R^2, and under the null it follows a chi-squared distribution with k degrees of freedom, where k is the number of slope coefficients (here 3). The pchisq() function gives us chi-squared probabilities, which is how we get our BP test p-value.

pchisq(q = LM_BP, df = 3, lower.tail = F) #upper-tail probability from a chi-squared distribution
## [1] 0.01342699

Step 6.

State the null hypothesis and draw a conclusion. What is H0?

H0: b1 = b2 = b3 = 0, where b1, b2, b3 are the slope coefficients of the auxiliary regression, i.e. (e_BP ^ 2 ~ b0 + b1 * avginc + b2 * str + b3 * enrltot)

Note: if just one of these coefficients isn’t equal to 0, we have heteroskedasticity!

What do we conclude? We reject the null hypothesis at the 5% significance level. We have one more test to walk through: the issue with Breusch-Pagan is that the auxiliary regression only lets the error variance depend linearly on the regressors. (Before moving on, lmtest’s built-in bptest() gives a quick cross-check of our by-hand numbers; see the sketch below.)
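As promised, here is the cross-check with bptest() from lmtest. A sketch: by default bptest() reports the studentized (Koenker) version of the statistic, so its number will be close to, but not exactly, our by-hand n * R^2; setting studentize = FALSE reproduces the classic Breusch-Pagan statistic we computed above.

bptest(lm_frst)                      #studentized (Koenker) Breusch-Pagan test
bptest(lm_frst, studentize = FALSE)  #classic BP: should match our n * R^2 statistic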

Lesson 5: White test

The White test is very similar to the Breusch-Pagan test, with one modification: when regressing the squared errors, we add interaction terms and squared terms alongside the original regressors. The first two steps of the White test are therefore identical to those in the Breusch-Pagan test.

Step 1.

Regress y (scrtot) on an intercept, x1 (avginc), x2 (str), … xk (enrltot)

lm_first <- lm(data = school, scrtot ~ avginc + str + enrltot)

Step 2.

Record residuals

e_WhiteTest <- resid(lm_first)

Step 3.

Regress e^2 on an intercept, x1, x2, … xk AND THE INTERACTION AND SQUARED TERMS. This is where we go off the BP script.

lm_White <- lm(data = school, e_WhiteTest ^ 2 ~ avginc + str + enrltot + 
                 avginc:str + avginc:enrltot + str:enrltot +
                 I(avginc ^ 2) + I(str ^ 2) + I(enrltot ^ 2)) #use the White residuals; I() makes the squared terms enter as quadratics

Step 4.

Record R^2

r2_White <- summary(lm_White)$r.squared

Step 5.

Compute the White test statistic (again an n * R^2 Lagrange multiplier) and the p-value.

LM_White <- nrow(school) * r2_White 
pchisq(q = LM_White, df = 9, lower.tail = F)
## [1] 2.411685e-05

Again, pchisq() takes the test statistic and returns a chi-squared p-value. So what can we say? We have heteroskedasticity. One issue with this test is that the number of terms you have to include in the auxiliary regression grows quickly: with k regressors you need k main effects, k squared terms, and k(k-1)/2 interactions. Think about what you would have to include if you had 5 independent variables (20 terms!). A canned shortcut is sketched below.
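The same bptest() function from before can reproduce this by-hand White test if we pass the expanded set of regressors through its varformula argument and turn off studentizing. A sketch (the variance formula is just our White auxiliary regression written out):

bptest(lm_first,
       varformula = ~ avginc + str + enrltot +
         avginc:str + avginc:enrltot + str:enrltot +
         I(avginc ^ 2) + I(str ^ 2) + I(enrltot ^ 2),
       data = school, studentize = FALSE)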

Lesson 6: What next?

How do we fix the problem? We have some serious heteroskedasticity problems. We can use heteroskedasticity-robust standard errors! We could build these on our own (a rough by-hand sketch follows), but R also has some built-in functions!
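Here is that by-hand sketch of White/HC1 robust standard errors for ols1, just to see what the built-in routines are doing (the object names X, e_hat, and vcov_hc1 are ours, chosen only for this illustration):

X <- model.matrix(ols1)                 #design matrix, including the intercept
e_hat <- resid(ols1)                    #OLS residuals
n <- nrow(X); k <- ncol(X)
bread <- solve(t(X) %*% X)              #(X'X)^-1
meat <- t(X) %*% diag(e_hat ^ 2) %*% X  #X' diag(e^2) X
vcov_hc1 <- (n / (n - k)) * bread %*% meat %*% bread #HC1 finite-sample correction
sqrt(diag(vcov_hc1))                    #heteroskedasticity-robust standard errors

Now for the built-in route: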

p_load(lfe)
lm_hetero_robust <- lfe::felm(data = school, scrtot ~ avginc + str + enrltot)

Now we can report standard errors that are robust to heteroskedasticity:

summary(lm_hetero_robust, robust = T)
## 
## Call:
##    lfe::felm(formula = scrtot ~ avginc + str + enrltot, data = school) 
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -82.566 -18.354   2.205  17.654  61.413 
## 
## Coefficients:
##               Estimate Robust s.e t value Pr(>|t|)    
## (Intercept)  1.259e+03  1.474e+01  85.374  < 2e-16 ***
## avginc       3.772e+00  2.370e-01  15.912  < 2e-16 ***
## str         -1.830e-01  7.270e-01  -0.252    0.801    
## enrltot     -1.671e-03  2.768e-04  -6.037 3.48e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.99 on 416 degrees of freedom
## Multiple R-squared(full model): 0.538   Adjusted R-squared: 0.5347 
## Multiple R-squared(proj model): 0.538   Adjusted R-squared: 0.5347 
## F-statistic(full model, *iid*):161.5 on 3 and 416 DF, p-value: < 2.2e-16 
## F-statistic(proj model): 101.2 on 3 and 416 DF, p-value: < 2.2e-16
summary(lm_hetero_robust, robust = F)
## 
## Call:
##    lfe::felm(formula = scrtot ~ avginc + str + enrltot, data = school) 
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -82.566 -18.354   2.205  17.654  61.413 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.259e+03  1.501e+01  83.825  < 2e-16 ***
## avginc       3.772e+00  1.817e-01  20.759  < 2e-16 ***
## str         -1.830e-01  7.268e-01  -0.252    0.801    
## enrltot     -1.671e-03  3.419e-04  -4.887 1.47e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.99 on 416 degrees of freedom
## Multiple R-squared(full model): 0.538   Adjusted R-squared: 0.5347 
## Multiple R-squared(proj model): 0.538   Adjusted R-squared: 0.5347 
## F-statistic(full model):161.5 on 3 and 416 DF, p-value: < 2.2e-16 
## F-statistic(proj model): 161.5 on 3 and 416 DF, p-value: < 2.2e-16

Notice that the standard errors differ depending on whether or not you made them robust. Also note that the standard error barely changes for ‘str’: if we recall our residual plots, str didn’t look especially heteroskedastic!
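Since we loaded the sandwich and lmtest packages at the very top, another common route is vcovHC() plus coeftest(). A sketch that should give robust standard errors very close to (though not necessarily identical to) felm’s robust = T output:

coeftest(ols1, vcov. = vcovHC(ols1, type = "HC1")) #HC1 robust covariance matrix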

OK! We covered a lot here. R makes our lives easier, but we still have to know what the problems are and where they come from so we can use all of R’s helpful tools.