November 15, 2016

Lightning round!

The Plan

Quickly cover:

* Different dummy variable model specifications
* Remember what heteroskedasticity is
* How to test for heteroskedasticity
* How to get R to calculate robust standard errors.

Ways to use dummy variables

Let \(d_i\) be a dummy variable and \(x_i\) be a continuous variable.

Linear model with dummy variable: \[y = \beta_0 + \beta_1 d_1 + \beta_2 x_1 + u\]

Linear Probability model: \[d = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u\]

Interaction term: \[y = \beta_0 + \beta_1 d_1 + \beta_2 x_1 + \beta_3 d_1 * x_1 + u\]
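As a concrete illustration (a minimal sketch using the mtcars data that shows up later in these notes, with the transmission dummy am standing in for \(d_1\) and wt for \(x_1\)):

# linear model with a dummy variable: am shifts the intercept
fit.dummy <- lm(mpg ~ am + wt, data = mtcars)

# linear probability model: the dummy is the dependent variable
fit.lpm <- lm(am ~ mpg + wt, data = mtcars)

# interaction term: am * wt expands to am + wt + am:wt,
# so the dummy shifts both the intercept and the slope on wt
fit.interact <- lm(mpg ~ am * wt, data = mtcars)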

Heteroskedasticity

Recall: With homoskedastic errors (constant variance of residuals) OLS is BLUE. But with heteroskedasticity (non-constant variance of residuals) this is no longer true. OLS is still unbiased, but the usual hypothesis tests are unreliable and we can't be assured that OLS gives us the minimum variance estimates.

Heteroskedasticity

How do we know if there's heteroskedasticity?

First step: look at scatterplots.

* Plot the residuals against variables you think might affect the variance of the residuals.
* Plot the residuals against the fitted values.

Example: regressing percent of income spent on food against income
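A minimal sketch of those two plots (the food-spending data isn't included here, so this reuses the mtcars regression that appears later in these notes):

fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

# residuals against the fitted values
plot(fitted(fit), resid(fit), xlab = "fitted values", ylab = "residuals")
abline(h = 0, lty = 2)

# residuals against a variable we suspect drives the variance
plot(mtcars$wt, resid(fit), xlab = "wt", ylab = "residuals")
abline(h = 0, lty = 2)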

Testing for heteroskedasticity

Tests for heteroskedasticity usually modify the residuals from a model and then regress those modified residuals on the \(x\) variables (or on the fitted values \(\hat{y}\)). If that auxiliary regression explains enough of the variation in the residuals, it indicates that the variance of the original residuals is correlated with the \(x\) variables and therefore there is heteroskedasticity.

For example: \[\log(\hat{u}^2) = \beta_0 + \beta_1 x + v\] If the \(R^2\) is high, then it would seem that the magnitude of the residuals is a function of \(x\) and therefore we have heteroskedasticity.

For the two tests we'll be looking at, we will test the overall significance of such a regression. Because of the structure of the test, the null hypothesis will be no heteroskedasticity.
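As a rough sketch of that idea (the log of the squared residuals is just one way to modify them; the model is again the mtcars example):

fit <- lm(mpg ~ wt, data = mtcars) # some original model
u2 <- resid(fit)^2 # squared residuals

# auxiliary regression: do the (log) squared residuals move with x?
aux <- lm(log(u2) ~ wt, data = mtcars)
summary(aux)$r.squared # a high R^2 suggests heteroskedasticity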

Testing for heteroskedasticity

Breusch-Pagan test

This one is available in the package lmtest.

library(lmtest)
fit1 <- lm(mpg ~ wt + hp + disp, mtcars) # the model we'll be testing
bptest(fit1) # Breusch-Pagan test; the null is homoskedasticity
    studentized Breusch-Pagan test

data:  fit1
BP = 0.9459, df = 3, p-value = 0.8143

Testing for heteroskedasticity

White's test

For no apparent reason, White's test isn't already implemented in R. Fortunately it is easy to implement manually by writing a function.

  1. Estimate OLS model and save squared residuals (\(u^2\)).
  2. Regress \(u^2\) on all the \(x\) variables, their cross products, and their squares.
    • If \(y = f(x_1,x_2,x_3)\) we will regress \(u^2\) on \(x_1,x_2,x_3,x_1^2,x_2^2,x_3^2,x_1*x_2,x_1*x_3\) and \(x_2*x_3\)
  3. Test the null hypothesis that the auxiliary regression is insignificant (no heteroskedasticity): under the null, \(nR^2\) is distributed \(\chi^2_{k-1}\), where \(k\) is the number of parameters in the auxiliary regression.
  4. Reject the null if \(nR^2\) is too big, i.e. if the upper-tail p-value is small. (A high p-value indicates that nothing looks especially fishy compared to the null.)

Implementing White's test

We need to take an lm object (e.g. "fit1" if that's what we've named our model), square its residuals, and regress those squared residuals on all of the \(x\) variables, their squares, and their interactions.

Then we need to get the \(R^2\) from that regression, multiply by \(n\) (sample size), and test that against the Chi squared distribution with \(k-1\) degrees of freedom.

Implementing White's test

white.test <- function(object.lm){
  u <- object.lm$residuals # residuals
  n <- length(u) # sample size
  u2 <- u^2 # squared residuals
  # variable names, but not the name of the intercept term
  # (this assumes the regressors are plain numeric variables, as in fit1):
  vars <- names(object.lm$coefficients)[2:length(names(object.lm$coefficients))]
  # now the tricky part: creating a formula without knowing in advance
  # what variables we're using. (x1 + x2 + ...)^2 expands to all the main effects
  # plus every pairwise interaction, and I(xj^2) adds the squared terms.
  fw <- as.formula(paste("u2 ~ (", paste(vars, collapse = " + "), ")^2 +",
            paste("I(", vars, "^2)", sep = "", collapse = " + ")))
  aux <- lm(fw, object.lm$model) # auxiliary regression
  W <- summary(aux)$r.squared
...

Implementing White's test

...
  W <- summary(aux)$r.squared
  df <- length(aux$coefficients) - 1 # regressors in the auxiliary regression, minus the intercept
  p <- pchisq(n*W, df, lower.tail = FALSE) # upper-tail p-value for n x R^2 from the auxiliary regression
  null.hypothesis <- "no heteroskedasticity" # so I don't forget
  stars <- ifelse(p <= 0.001,"***", # make some stars as long as we're here
                  ifelse(p <=0.01,"**",
                         ifelse(p<=0.05,"*",
                                ifelse(p<=0.1,".",""))))
  out <- c(nR2 = n*W,
           p.value = p,
           df = df,
           null.hypothesis = null.hypothesis,
           significance = stars)
  return(out)
}

Implementing White's test

white.test(fit1)
                    nR2                 p.value                      df 
     "6.42722882543409"     "0.959788988892527"                     "2" 
        null.hypothesis            significance 
"no heteroskedasticity"                      "" 

In this case, no stars is a good thing. (Homoskedasticity means less work!)

What to do with heteroskedasticity

If our sample size is large enough, we can use robust standard errors (a.k.a. heteroskedasticity-consistent standard errors).

library(AER) ; library(stargazer) # AER loads lmtest (coeftest) and sandwich (vcovHC)
coeftest(fit1, vcov = vcovHC(fit1, type = "HC1")) #%>% stargazer(type = "text")
t test of coefficients:

               Estimate  Std. Error t value  Pr(>|t|)    
(Intercept) 37.10550527  2.41789093 15.3462 3.684e-15 ***
wt          -3.80089058  0.96880687 -3.9233  0.000516 ***
hp          -0.03115655  0.00925855 -3.3652  0.002234 ** 
disp        -0.00093701  0.00838404 -0.1118  0.911810    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
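If we want the robust standard errors in a regression table, one common pattern (a sketch; stargazer's se argument accepts a list of replacement standard-error vectors) is:

robust.se <- sqrt(diag(vcovHC(fit1, type = "HC1"))) # robust standard errors for fit1
stargazer(fit1, se = list(robust.se), type = "text")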

What to do with heteroskedasticity

A more advanced method is to build the nature of the heteroskedasticity into our model. This might be as simple as adding a missing variable (e.g. food.spending.pct ~ income + is.foody).

Or we can use a version of weighted least squares which gives different observations different importance in the estimation of our parameters (slope coefficients and standard errors).
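For example, if we were willing to assume that the error variance is proportional to wt (purely an illustration, not something the data tell us), weighted least squares is just the weights argument to lm():

# downweight observations we believe have larger error variance
fit.wls <- lm(mpg ~ wt + hp + disp, data = mtcars, weights = 1/wt)
summary(fit.wls)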

Feasible Generalized Least Squares (FGLS) is a version of GLS for when we don't know what the population variance is. For a large enough sample size, it is consistent and asymptotically efficient.
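A minimal FGLS sketch, assuming (as in the common textbook specification) that the variance is an exponential function of the regressors; the model is again the mtcars example:

fit <- lm(mpg ~ wt + hp + disp, data = mtcars) # original model
aux <- lm(log(resid(fit)^2) ~ wt + hp + disp, data = mtcars) # model the (log) variance
h <- exp(fitted(aux)) # estimated variance for each observation
fit.fgls <- lm(mpg ~ wt + hp + disp, data = mtcars, weights = 1/h)
summary(fit.fgls) # FGLS estimates and standard errors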