Introduction

This handout introduces you to two ways of handling common problems in OLS model specification: variance inflation factors (VIFs), which help you diagnose multicollinearity, and tests for detecting outliers. It is a simplification and practical amplification of the material covered in Bailey, Chapter 5, especially 5.4; you should re-read this chapter as you move toward (or revise) your data-analysis sections.

Motivation

OLS is a great and remarkably robust workhorse data-analysis tool, but it has several drawbacks. Two of the most common, and the two this handout addresses, are that a handful of outlying observations can drag the estimated coefficients away from the underlying relationship, and that highly correlated predictors (multicollinearity) inflate the variance of the coefficient estimates.

We want to have a data-informed way of handling these problems, since they are not necessarily statistical errors but problems of data analysis.

Outliers

Let’s look at a quick example of an outlier. (For more, see Bailey 76-79.)

x <- c(1:50)
y <- 1.5*x + rnorm(length(x),mean=0,sd=5)
y[5] <- 200 #create an outlier

data <- data.frame(x,y)
colnames(data) <- c("x","y")

lm.outlier <- lm(y~x, data=data)
pred.x.outlier <- data$x
y.hat.outlier <- predict(lm.outlier,
          data.frame(x=data$x),
          interval="confidence")
pch.type <- rep(16,length(x))
pch.type[5] <- 10 # use a different pch for the outlier in the plot below


plot(x,y, pch=pch.type)
lines(pred.x.outlier,y.hat.outlier[,1],col="red",lwd=3)
lines(pred.x.outlier,y.hat.outlier[,2],lty="dotted",col="blue",lwd=2)
lines(pred.x.outlier,y.hat.outlier[,3],lty="dotted",col="blue",lwd=2)

Clearly, one of these points is an outlier. Let’s see how it affects our estimate by re-estimating the line after dropping the outlier:

lm.omit <- lm(y~x,data=data[-5,]) # drops the 5th row
pred.x.omit <- data[-5,]$x
y.hat.omit <- predict(lm.omit,
          data.frame(x= pred.x.omit),
          interval="confidence")
pch.type <- rep(16,length(x))
pch.type[5] <- 10 # again, use a different pch for the outlier in the plot below

plot(x,y, pch=pch.type)
lines(pred.x.outlier,y.hat.outlier[,1],col="red",lwd=3)
lines(pred.x.outlier,y.hat.outlier[,2],lty="dotted",col="blue",lwd=2)
lines(pred.x.outlier,y.hat.outlier[,3],lty="dotted",col="blue",lwd=2)


lines(pred.x.omit,y.hat.omit[,1],col="orange",lwd=3)
lines(pred.x.omit,y.hat.omit[,2],lty="dotted",col="brown",lwd=2)
lines(pred.x.omit,y.hat.omit[,3],lty="dotted",col="brown",lwd=2)


library(stargazer) 
stargazer(lm.outlier,lm.omit,type="html",
          dep.var.labels="Comparing Models",
          covariate.labels=c("X"))
                     Dependent variable: Comparing Models
                     (1) With outlier        (2) Outlier dropped
X                    1.173***                1.552***
                     (0.267)                 (0.040)
Constant             13.147*                 -0.376
                     (7.822)                 (1.188)
Observations         50                      49
R2                   0.287                   0.969
Adjusted R2          0.272                   0.969
Residual Std. Error  27.241 (df = 48)        4.013 (df = 47)
F Statistic          19.293*** (df = 1; 48)  1,493.236*** (df = 1; 47)
Note: *p<0.1; **p<0.05; ***p<0.01

As the table and the plot suggest, this one point was throwing the relationship off quite badly: we know that the true data-generating process is \(Y=1.5X + \epsilon\), but the model with the outlier returned a slope of about 1.17, while the model without it returned about 1.55, right on top of the true value.
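
If you want the slopes themselves rather than reading them off the table, you can pull the fitted coefficients directly (this just re-extracts what the table reports; nothing new is estimated):

coef(lm.outlier) # slope estimated with the outlier included
coef(lm.omit)    # slope estimated with the outlier dropped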

Detecting outliers

In the real world, of course, we don't specify the equations and then find the outliers. We have to infer the presence of outliers from the data. We will use a nifty diagnostic tool from the car package (install it with install.packages("car")).

state.df <- data.frame(state.x77)
state.df$State <- rownames(state.x77)
lm.state <- lm(Income~HS.Grad,
                data=state.df)
plot(state.df$HS.Grad,state.df$Income,pch=16,
      xlab="High School Graduation Rate",
     ylab="Income")

predict.x.state <- seq(min(state.df$HS.Grad),
                           max(state.df$HS.Grad),
                           length=length(state.df$HS.Grad))
y.hat.state <- predict(lm.state,
          data.frame(HS.Grad= predict.x.state),
          interval="confidence")
lines(predict.x.state,y.hat.state[,1],col="red",lwd=3)
lines(predict.x.state,y.hat.state[,2],lty="dotted",col="blue",lwd=2)
lines(predict.x.state,y.hat.state[,3],lty="dotted",col="blue",lwd=2)

                     Dependent variable: Income
HS.Grad              47.162***
                     (8.616)
Constant             1,931.105***
                     (462.739)
Observations         50
R2                   0.384
Adjusted R2          0.371
Residual Std. Error  487.144 (df = 48)
F Statistic          29.962*** (df = 1; 48)
Note: *p<0.1; **p<0.05; ***p<0.01

This is a pretty straightforward model: income is higher in states with higher high school graduation rates, and the regression seems to support that relationship.
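
To put the slope in substantive terms, you can predict income at two illustrative graduation rates; the 40% and 60% values below are arbitrary examples chosen for comparison, not anything taken from the model itself:

# Predicted 1977 income at two hypothetical graduation rates
predict(lm.state, newdata=data.frame(HS.Grad=c(40,60)))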

However, if you look at the plot again, there appear to be some states that are outliers. This is easier to visualize if you look at the plot with names instead of points:

plot(state.df$HS.Grad,state.df$Income,pch="",
     xlab="High School Graduation Rate",
     ylab="Income")
text(state.df$HS.Grad,state.df$Income,labels=state.df$State)

It seems unsurprising that Alaska (especially in 1977) would have higher incomes than predicted (oil money!), and on reflection it does seem likely that Utah would have lower incomes (fewer working women, larger families, and probably a younger population, too). But are these observations influencing our estimates of \(\beta\)? We can use the residualPlots() function in car to see whether the residuals (the differences between the observed values of Y and \(\hat{Y}\)) look random, as they should, or whether they instead follow a pattern.
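
(As a quick reminder of what a residual is, you can compute them by hand and confirm they match what R reports; this is only a sanity check, not part of the diagnostic itself.)

# Residuals are just observed Y minus fitted Y-hat
res.by.hand <- state.df$Income - predict(lm.state, state.df)
all.equal(unname(res.by.hand), unname(resid(lm.state))) # should be TRUE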

library(car)

residualPlots(lm.state, id.n=2) # id.n identifies the n most influential obs w/r/t each variable

##            Test stat Pr(>|t|)
## HS.Grad       -1.288    0.204
## Tukey test    -1.288    0.198

This suggests (a) that there is some nonrandom pattern in the residuals (that's bad) and (b) that the two states driving it are Alaska and Utah (that's useful).

The next step is to use influencePlot(). This function plots the Studentized residuals against the hat values. That's a little technical, but the practical interpretation is easy enough: if the absolute value of an observation's Studentized residual is greater than about 2, treat that observation as a candidate for dropping.
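
Before turning to influencePlot(), you can apply that rule of thumb by hand with rstudent(), which returns the Studentized residual for each observation (a minimal sketch using the cutoff of 2 just described):

# Flag observations whose Studentized residual exceeds 2 in absolute value
stud.res <- rstudent(lm.state)
stud.res[abs(stud.res) > 2]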

influencePlot(lm.state,id.n=5)

##                    StudRes        Hat        CookD
## Alaska          2.83395729 0.07779240 0.2954584580
## Kentucky       -0.07408754 0.08675526 0.0002662329
## Maryland        1.92081996 0.02020423 0.0360224254
## Mississippi    -1.65793176 0.06586156 0.0934938917
## New Jersey      1.75802849 0.02011564 0.0303994214
## New Mexico     -1.99633299 0.02136908 0.0409636492
## North Carolina  0.27260181 0.08675526 0.0035990957
## South Carolina -0.16823481 0.09330623 0.0014863931
## Utah           -2.43858841 0.08300735 0.2440056718

Visually, there are two (maybe four) states that are outliers and that may be influencing our results: Alaska (definitely), Utah (definitely), and New Mexico and Maryland (just at the edge of being worrisome).

Does including those states matter?

lm.state.drop <- lm(Income~HS.Grad,
                data=subset(state.df,State!="Alaska" & State !="Utah"))
lm.state.drop.all <- lm(Income~HS.Grad,
                data=subset(state.df,State!="Alaska" & 
                              State !="Utah" &
                              State != "New Mexico" &
                              State != "Maryland"))
                     Dependent variable: Income
                     (1) Full                (2) Drop AK+UT          (3) Drop AK, UT, NM, MD
HS.Grad              47.162***               46.609***               47.573***
                     (8.616)                 (8.219)                 (7.555)
Constant             1,931.105***            1,956.959***            1,906.857***
                     (462.739)               (436.265)               (400.808)
Observations         50                      48                      46
R2                   0.384                   0.411                   0.474
Adjusted R2          0.371                   0.399                   0.462
Residual Std. Error  487.144 (df = 48)       434.472 (df = 46)       398.864 (df = 44)
F Statistic          29.962*** (df = 1; 48)  32.159*** (df = 1; 46)  39.649*** (df = 1; 44)
Note: *p<0.1; **p<0.05; ***p<0.01

There is a difference between the estimates; it's not huge, but it's real. Whether to drop just AK and UT or all four potentially problematic states will depend a little bit on substantive knowledge (I think, for instance, that AK and UT are more different from the other states than NM and MD are; those two seem to be within normal variation). Bailey (Chapter 3.8) works through more troubling examples. In any event, it's a good idea to look at these diagnostics.
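
A quick way to see that difference numerically is to pull the HS.Grad slope and its standard error out of each model (again, this only re-extracts what is already in the table):

# Compare the HS.Grad coefficient and standard error across the three models
models <- list(full=lm.state, drop2=lm.state.drop, drop4=lm.state.drop.all)
sapply(models, function(m) summary(m)$coefficients["HS.Grad", c("Estimate","Std. Error")])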

Variance Inflation Factors and Multicollinearity

By contrast, using VIFs to diagnose multicollinearity is pretty straightforward. Consider a typical multiple regression:

lm.state.murder <- lm(Murder~HS.Grad+Life.Exp+Income+Population+Frost+Area+Illiteracy,
                   data=state.df)
                     Dependent variable: Murder
HS.Grad              0.032
                     (0.057)
Life.Exp             -1.655***
                     (0.256)
Income               -0.0002
                     (0.001)
Population           0.0002***
                     (0.0001)
Frost                -0.013*
                     (0.007)
Area                 0.00001
                     (0.00000)
Illiteracy           1.373
                     (0.832)
Constant             122.180***
                     (17.886)
Observations         50
R2                   0.808
Adjusted R2          0.776
Residual Std. Error  1.746 (df = 42)
F Statistic          25.292*** (df = 7; 42)
Note: *p<0.1; **p<0.05; ***p<0.01

Assuming that this is a good model of state-level murder rates (hint: it’s probably not!), we want to test for the presence of multicollinearity. We do so using the vif() function in the car package:

vif(lm.state.murder)
##    HS.Grad   Life.Exp     Income Population      Frost       Area 
##   3.437276   1.901430   1.989395   1.342691   2.373463   1.690625 
## Illiteracy 
##   4.135956

As with the outlier diagnostics, there is a technical and a practical interpretation, and the practical interpretation is the more important one: if the VIF for a variable is higher than 5, or 10, drop one of the variables suffering from the affliction. (Bailey presents a more detailed description of this process on p. 226, but in practice you will generally have a good sense of which variable is more theoretically relevant.) And, yes, the threshold really is "5, or 10": the VIF is a squishy statistic, because multicollinearity inflates only the variance estimates, not the coefficient estimates, so some judicious use of theory can justify keeping a good measure with a VIF of 5 even when a bad measure with the same VIF would be a candidate for elimination.
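
For the technical side: the VIF for a given predictor is \(1/(1-R_j^2)\), where \(R_j^2\) is the R-squared from regressing that predictor on all of the other predictors. You can verify this by hand for, say, HS.Grad and compare it to the vif() output above:

# Hand-computed VIF for HS.Grad: regress it on the other predictors in the
# murder model and apply 1/(1 - R^2)
aux.hsgrad <- lm(HS.Grad~Life.Exp+Income+Population+Frost+Area+Illiteracy,
                 data=state.df)
1/(1 - summary(aux.hsgrad)$r.squared) # should match vif(lm.state.murder)["HS.Grad"]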

Let’s see an example where this does matter, using the mtcars dataset in R. We’ll estimate cars’ miles per gallon as a function of other characteristics:

mtcars.df <- data.frame(mtcars)
mtcars.df$Car <- rownames(mtcars)
lm.car <- lm(mpg~cyl+disp+hp+drat+wt+qsec,data=mtcars.df)
vif(lm.car)
##       cyl      disp        hp      drat        wt      qsec 
##  9.958978 10.550573  5.357783  2.966519  7.181690  4.039701

Obviously, there's a lot of multicollinearity going on here, which suggests that some variables (for instance, cylinder count and displacement) are close to linear functions of one another. So we drop some variables to form a reduced model:

lm.car.2 <- lm(mpg~wt+qsec+hp,data=mtcars.df)
vif(lm.car.2)
##       wt     qsec       hp 
## 2.530443 2.873804 4.921956

This is much better in terms of multicollinearity. Now let's check what the practical upshot is:

                     Dependent variable: mpg
                     (1) Full model          (2) Reduced model
cyl                  -0.819
                     (0.812)
disp                 0.013
                     (0.012)
hp                   -0.018                  -0.018
                     (0.016)                 (0.015)
drat                 1.320
                     (1.479)
wt                   -4.191***               -4.359***
                     (1.258)                 (0.753)
qsec                 0.401                   0.511
                     (0.517)                 (0.439)
Constant             26.307*                 27.611***
                     (14.630)                (8.420)
Observations         32                      32
R2                   0.855                   0.835
Adjusted R2          0.820                   0.817
Residual Std. Error  2.557 (df = 25)         2.578 (df = 28)
F Statistic          24.534*** (df = 6; 25)  47.153*** (df = 3; 28)
Note: *p<0.1; **p<0.05; ***p<0.01

Note, by the way, that just as we discussed above, the biggest change here isn't to the coefficient estimates (they are close to the same) but to the estimates of the variance (the standard errors shown in parentheses). Each additional 1,000 lbs of weight (wt is measured in units of 1,000 lbs, so wt = 1 means 1,000 lbs) reduces the car's mileage by about 4 miles per gallon, and by cleaning up the equation we become more confident about the range of that estimate.
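
One way to see that shift directly is to compare the wt row of each model's coefficient table (this simply pulls numbers already shown in the table above):

# Coefficient and standard error on wt in the full and reduced models
summary(lm.car)$coefficients["wt", c("Estimate","Std. Error")]
summary(lm.car.2)$coefficients["wt", c("Estimate","Std. Error")]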