November 15, 2016
Quickly cover:

* Different dummy variable model specifications
* Remember what heteroskedasticity is
* How to test for heteroskedasticity
* How to get R to calculate robust standard errors
Let \(d_i\) be a dummy variable and \(x_i\) be a continuous variable.
Linear model with a dummy variable: \[y = \beta_0 + \beta_1 d_1 + \beta_2 x_1 + u\]
Linear Probability model: \[d = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u\]
Interaction term: \[y = \beta_0 + \beta_1 d_1 + \beta_2 x_1 + \beta_3 (d_1 \times x_1) + u\]
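In R's `lm()` formula syntax these specifications look roughly like the following sketch (the data frame `dat` and its columns `y`, `x1`, `x2`, and the 0/1 dummy `d` are made-up names for illustration, not anything defined in these notes):

    # dummy variable shifting the intercept
    fit.dummy <- lm(y ~ d + x1, data = dat)

    # linear probability model: the 0/1 dummy is the outcome
    fit.lpm <- lm(d ~ x1 + x2, data = dat)

    # interaction term: d shifts both the intercept and the slope on x1
    fit.inter <- lm(y ~ d * x1, data = dat)  # d * x1 expands to d + x1 + d:x1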
Recall: With homoskedastic errors (constant variance of residuals) OLS is BLUE. But with heteroskedasticity (non-constant variance of residuals) this is no longer true. OLS is still unbiased, but the usual hypothesis tests are unreliable and we can't be assured that OLS gives us the minimum variance estimates.
How do we know if there's heteroskedasticity?
First step: Look at scatterplots.

* Plot the residuals against variables you think might affect the variance of the residuals.
* Plot the residuals against the fitted values.
Example: regressing percent of income spent on food against income
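Heteroskedasticity of that kind shows up clearly in a residual plot. Here is a small simulated sketch (the numbers and variable names are made up for illustration, not taken from real food-spending data):

    # simulate food spending whose error variance grows with income
    set.seed(1)
    income <- runif(200, 20, 120)                     # income, in $1000s
    food.pct <- 30 - 0.15 * income + rnorm(200, sd = 0.05 * income)
    fit.food <- lm(food.pct ~ income)

    par(mfrow = c(1, 2))
    plot(income, resid(fit.food),
         xlab = "income", ylab = "residuals")         # residuals vs. a suspect variable
    plot(fitted(fit.food), resid(fit.food),
         xlab = "fitted values", ylab = "residuals")  # residuals vs. fitted values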
Tests of heteroskedasticity usually modify the residuals from a model and then regress those modified residuals on the \(x\) variables (or on the fitted values of \(y\)). If that auxiliary regression explains enough of the variation in the residuals, it indicates that the variance of the original residuals is related to the \(x\) variables, and therefore that there is heteroskedasticity.
For example: \[\log(\hat{u}^2) = \delta_0 + \delta_1 x + v\] If the \(R^2\) of this auxiliary regression is high, then the magnitude of the residuals appears to be a function of \(x\), and therefore we have heteroskedasticity.
For the two tests we'll be looking at, we will test the overall significance of such a regression. Because of the structure of the test, the null hypothesis will be no heteroskedasticity.
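To make the recipe concrete, here is the auxiliary regression done by hand (a sketch of the general idea only, not the exact form of either test below; it uses the same mtcars regression that is fit as `fit1` below):

    fit0 <- lm(mpg ~ wt + hp + disp, data = mtcars)
    u2 <- resid(fit0)^2                            # squared residuals stand in for their variance
    aux <- lm(u2 ~ wt + hp + disp, data = mtcars)  # auxiliary regression on the x variables
    summary(aux)  # the overall F test: a significant F rejects "no heteroskedasticity"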
The Breusch-Pagan test is available in the `lmtest` package.
    library(lmtest)
    fit1 <- lm(mpg ~ wt + hp + disp, mtcars)
    bptest(fit1)
        studentized Breusch-Pagan test

    data:  fit1
    BP = 0.9459, df = 3, p-value = 0.8143
For no apparent reason, White's test isn't already implemented in R. Fortunately it is easy to implement manually by writing a function.
We need to take an `lm` object (e.g. `fit1`, if that's what we've named our model), square its residuals, and regress those squared residuals on all of the \(x\) variables, their squares, and their interactions. Then we take the \(R^2\) from that auxiliary regression, multiply it by \(n\) (the sample size), and test that statistic against the chi-squared distribution with \(k-1\) degrees of freedom.
    white.test <- function(object.lm){
      u <- object.lm$residuals  # residuals
      n <- length(u)            # sample size
      u2 <- u^2                 # squared residuals
      X <- object.lm$model[, 2:ncol(object.lm$model)]  # x variables (the data)
      # variable names, but not the name of the intercept term:
      vars <- names(object.lm$coefficients)[2:length(names(object.lm$coefficients))]
      # now the tricky part: creating a formula without knowing in advance
      # what variables we're using. paste(vars, collapse = "*") will take a list of variable
      # names ("X1, X2, etc.") and mash them together separated by "*" ("X1*X2*etc")
      fw <- as.formula(paste("u2",
                             c(paste(vars, collapse = "*"),
                               paste(vars, "^2", sep = "", collapse = "+")),
                             sep = "~"))
      W <- summary(lm(fw, object.lm$model))$r.squared
      df <- length(vars) - 1  # degrees of freedom
      p <- pchisq(n * W, df)  # n x R^2 from the auxiliary regression tested with df
      null.hypothesis <- "no heteroskedasticity"  # so I don't forget
      stars <- ifelse(p <= 0.001, "***",          # make some stars as long as we're here
                      ifelse(p <= 0.01, "**",
                             ifelse(p <= 0.05, "*",
                                    ifelse(p <= 0.1, ".", ""))))
      out <- c(nR2 = n * W, p.value = p, df = df,
               null.hypothesis = null.hypothesis, significance = stars)
      return(out)
    }
    white.test(fit1)
                       nR2                 p.value                  df 
        "6.42722882543409"     "0.959788988892527"                 "2" 
           null.hypothesis            significance 
    "no heteroskedasticity"                      "" 
In this case, no stars is a good thing. (Homoskedasticity means less work!)
If our sample size is large enough, we can use robust standard errors (also known as heteroskedasticity-consistent standard errors).
    library(AER)
    library(stargazer)
    coeftest(fit1, vcov = vcovHC(fit1, type = "HC1")) # %>% stargazer(type = "text")
    t test of coefficients:

                   Estimate  Std. Error t value  Pr(>|t|)    
    (Intercept) 37.10550527  2.41789093 15.3462 3.684e-15 ***
    wt          -3.80089058  0.96880687 -3.9233  0.000516 ***
    hp          -0.03115655  0.00925855 -3.3652  0.002234 ** 
    disp        -0.00093701  0.00838404 -0.1118  0.911810    
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
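The commented-out stargazer call above suggests formatting this as a table. One way to do that (a sketch, not necessarily the approach intended above) is to compute the robust standard errors as a vector and pass them to stargazer's `se` argument:

    robust.se <- sqrt(diag(vcovHC(fit1, type = "HC1")))   # robust SEs as a plain vector
    stargazer(fit1, se = list(robust.se), type = "text")  # replace the default SEs in the table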
A more advanced method is to build the nature of the heteroskedasticity into our model. This might be as simple as adding a missing variable (e.g. `food.spending.pct ~ income + is.foody`).
Or we can use a version of weighted least squares which gives different observations different importance in the estimation of our parameters (slope coefficients and standard errors).
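In R, weighted least squares is just `lm()` with a `weights` argument. As a sketch, suppose (purely an assumption made for this illustration) that the residual variance in the mtcars model were proportional to `wt`; then we would weight each observation by the inverse of that variance:

    # WLS sketch: assume Var(u | x) is proportional to wt, so weight by 1/wt
    fit.wls <- lm(mpg ~ wt + hp + disp, data = mtcars, weights = 1 / wt)
    summary(fit.wls)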
Feasible Generalized Least Squares (FGLS) is a version of GLS for when we don't know what the population variance is. For a large enough sample size, it is consistent and asymptotically efficient.
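One common recipe for FGLS (a sketch of the usual textbook procedure, not the only version) is to estimate the variance function from the OLS residuals and then reweight by its inverse:

    # FGLS sketch: model the (log) squared residuals, then use the fitted variances as weights
    fit.ols <- lm(mpg ~ wt + hp + disp, data = mtcars)
    aux.var <- lm(log(resid(fit.ols)^2) ~ wt + hp + disp, data = mtcars)
    h.hat <- exp(fitted(aux.var))  # estimated error variances
    fit.fgls <- lm(mpg ~ wt + hp + disp, data = mtcars, weights = 1 / h.hat)
    summary(fit.fgls)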