This script is an example of using VIF to eliminate multi-collinearity from a set of covariates. You might have a set of covariates that you are going to use in a regression to explain some response variable Y. One assumption in most regressions is that there is no correlation among predictor variables. The machinery of regression does not work if correlation among the X variables is too high.
This script shows how to use Variance Inflation Factors (VIFs) to identify and remove highly correlated variables from a data set.
Source Zuur's helper functions (from http://highstat.com/book2.htm for the 'Mixed Effects' book).
getwd()
## [1] "E:/Data/Dropbox/FindTheBestZIP"
source("HighstatLibV6.R")
The original corvif function is in HighstatLibV6.R. I modified it as two slightly different versions in JacksHelperFunctionsV1.R below. The thinXwithVIF function is also in JacksHelperFunctionsV1.R.
The functions are as follows:
# JacksHelperFunctionsV1.R
# Slightly different versions of Zuur's corvif To use: corvif(YourDataFile)
corvif <- function(dataz) {
dataz <- as.data.frame(dataz)
# correlation part
cat("Correlations of the variables\n\n")
tmp_cor <- cor(dataz, use = "complete.obs")
print(tmp_cor)
# vif part
form <- formula(paste("fooy ~ ", paste(strsplit(names(dataz), " "), collapse = " + ")))
dataz <- data.frame(fooy = 1, dataz)
lm_mod <- lm(form, dataz)
cat("\n\nVariance inflation factors\n\n")
print(myvif(lm_mod))
}
# VIF FUNCTION. To use: corvif(YourDataFile)
corvif_noCorr <- function(dataz) {
dataz <- as.data.frame(dataz)
# correlation part cat('Correlations of the variables\n\n') tmp_cor <-
# cor(dataz,use='complete.obs') print(tmp_cor)
# vif part
form <- formula(paste("fooy ~ ", paste(strsplit(names(dataz), " "), collapse = " + ")))
dataz <- data.frame(fooy = 1, dataz)
lm_mod <- lm(form, dataz)
cat("\n\nVariance inflation factors\n\n")
print(myvif(lm_mod))
}
# To use: thinXwithVIF(X, Threshold) where X is a dataframe of predictor
# variables (no factors!), and Threshold is the maximum VIF you will
# tolerate Zuur et al. (2009) sugggest 3, others have used 6 or 10.
thinXwithVIF = function(X, Threshold = 3) {
VIFS = corvif(X)
XVars = names(X)
max(VIFS$GVIF)
while (max(VIFS$GVIF) >= Threshold) {
print(paste("Drop ", XVars[which.max(VIFS$GVIF)], ".", sep = ""), quote = FALSE)
XVars = XVars[-which.max(VIFS$GVIF)]
X = X[, -which.max(VIFS$GVIF)]
VIFS = corvif_noCorr(X)
print(max(VIFS$GVIF))
}
return(list(VIFS = VIFS, XVars = XVars, X = X))
}
Use mtcars as a test data set. (Type '?mtcars' to see an explanation of the variables.)
summary(mtcars)
## mpg cyl disp hp
## Min. :10.4 Min. :4.00 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.4 1st Qu.:4.00 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.2 Median :6.00 Median :196.3 Median :123.0
## Mean :20.1 Mean :6.19 Mean :230.7 Mean :146.7
## 3rd Qu.:22.8 3rd Qu.:8.00 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.9 Max. :8.00 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.76 Min. :1.51 Min. :14.5 Min. :0.000
## 1st Qu.:3.08 1st Qu.:2.58 1st Qu.:16.9 1st Qu.:0.000
## Median :3.69 Median :3.33 Median :17.7 Median :0.000
## Mean :3.60 Mean :3.22 Mean :17.8 Mean :0.438
## 3rd Qu.:3.92 3rd Qu.:3.61 3rd Qu.:18.9 3rd Qu.:1.000
## Max. :4.93 Max. :5.42 Max. :22.9 Max. :1.000
## am gear carb
## Min. :0.000 Min. :3.00 Min. :1.00
## 1st Qu.:0.000 1st Qu.:3.00 1st Qu.:2.00
## Median :0.000 Median :4.00 Median :2.00
## Mean :0.406 Mean :3.69 Mean :2.81
## 3rd Qu.:1.000 3rd Qu.:4.00 3rd Qu.:4.00
## Max. :1.000 Max. :5.00 Max. :8.00
The variables 'vs' and 'am' are dummy variables and should be factors:
Leave them in as they are true dummy variables. Select the Y variable (mpg). All the rest of the data frame variables are predictor (X) variables.
MyData = mtcars
summary(MyData)
## mpg cyl disp hp
## Min. :10.4 Min. :4.00 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.4 1st Qu.:4.00 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.2 Median :6.00 Median :196.3 Median :123.0
## Mean :20.1 Mean :6.19 Mean :230.7 Mean :146.7
## 3rd Qu.:22.8 3rd Qu.:8.00 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.9 Max. :8.00 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.76 Min. :1.51 Min. :14.5 Min. :0.000
## 1st Qu.:3.08 1st Qu.:2.58 1st Qu.:16.9 1st Qu.:0.000
## Median :3.69 Median :3.33 Median :17.7 Median :0.000
## Mean :3.60 Mean :3.22 Mean :17.8 Mean :0.438
## 3rd Qu.:3.92 3rd Qu.:3.61 3rd Qu.:18.9 3rd Qu.:1.000
## Max. :4.93 Max. :5.42 Max. :22.9 Max. :1.000
## am gear carb
## Min. :0.000 Min. :3.00 Min. :1.00
## 1st Qu.:0.000 1st Qu.:3.00 1st Qu.:2.00
## Median :0.000 Median :4.00 Median :2.00
## Mean :0.406 Mean :3.69 Mean :2.81
## 3rd Qu.:1.000 3rd Qu.:4.00 3rd Qu.:4.00
## Max. :1.000 Max. :5.00 Max. :8.00
YVar = names(MyData)[1] # Pick the first element of the list of variable names in MyData
YVar
## [1] "mpg"
XVars <- names(MyData[, -1]) # Delete the first column from MyData, and list the names
XVars
## [1] "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"
Z = MyData[, -1]
VIFS = corvif(Z)
## Correlations of the variables
##
## cyl disp hp drat wt qsec vs am
## cyl 1.0000 0.9020 0.8324 -0.69994 0.7825 -0.5912 -0.8108 -0.52261
## disp 0.9020 1.0000 0.7909 -0.71021 0.8880 -0.4337 -0.7104 -0.59123
## hp 0.8324 0.7909 1.0000 -0.44876 0.6587 -0.7082 -0.7231 -0.24320
## drat -0.6999 -0.7102 -0.4488 1.00000 -0.7124 0.0912 0.4403 0.71271
## wt 0.7825 0.8880 0.6587 -0.71244 1.0000 -0.1747 -0.5549 -0.69250
## qsec -0.5912 -0.4337 -0.7082 0.09120 -0.1747 1.0000 0.7445 -0.22986
## vs -0.8108 -0.7104 -0.7231 0.44028 -0.5549 0.7445 1.0000 0.16835
## am -0.5226 -0.5912 -0.2432 0.71271 -0.6925 -0.2299 0.1683 1.00000
## gear -0.4927 -0.5556 -0.1257 0.69961 -0.5833 -0.2127 0.2060 0.79406
## carb 0.5270 0.3950 0.7498 -0.09079 0.4276 -0.6562 -0.5696 0.05753
## gear carb
## cyl -0.4927 0.52699
## disp -0.5556 0.39498
## hp -0.1257 0.74981
## drat 0.6996 -0.09079
## wt -0.5833 0.42761
## qsec -0.2127 -0.65625
## vs 0.2060 -0.56961
## am 0.7941 0.05753
## gear 1.0000 0.27407
## carb 0.2741 1.00000
##
##
## Variance inflation factors
##
## GVIF
## cyl 15.374
## disp 21.620
## hp 9.832
## drat 3.375
## wt 15.165
## qsec 7.528
## vs 4.966
## am 4.648
## gear 5.357
## carb 7.909
str(VIFS)
## 'data.frame': 10 obs. of 1 variable:
## $ GVIF: num 15.37 21.62 9.83 3.37 15.16 ...
In the above output, look for correlations above 0.7 as the most problematic. Several cutoffs for VIFs are suggested: 10, 6 and 3 are thresholds I have seen. We'll use 6 for now.
Threshold = 6
The highest VIF is now 21.6202 for the variable 'disp'. Drop it, then recalculate and keep dropping until the maximum VIF is less than the threshold.
The ThinXwithVIF function should work!
thinXwithVIF(X, Threshold)
## Correlations of the variables
##
## cyl disp hp drat wt qsec vs am
## cyl 1.0000 0.9020 0.8324 -0.69994 0.7825 -0.5912 -0.8108 -0.52261
## disp 0.9020 1.0000 0.7909 -0.71021 0.8880 -0.4337 -0.7104 -0.59123
## hp 0.8324 0.7909 1.0000 -0.44876 0.6587 -0.7082 -0.7231 -0.24320
## drat -0.6999 -0.7102 -0.4488 1.00000 -0.7124 0.0912 0.4403 0.71271
## wt 0.7825 0.8880 0.6587 -0.71244 1.0000 -0.1747 -0.5549 -0.69250
## qsec -0.5912 -0.4337 -0.7082 0.09120 -0.1747 1.0000 0.7445 -0.22986
## vs -0.8108 -0.7104 -0.7231 0.44028 -0.5549 0.7445 1.0000 0.16835
## am -0.5226 -0.5912 -0.2432 0.71271 -0.6925 -0.2299 0.1683 1.00000
## gear -0.4927 -0.5556 -0.1257 0.69961 -0.5833 -0.2127 0.2060 0.79406
## carb 0.5270 0.3950 0.7498 -0.09079 0.4276 -0.6562 -0.5696 0.05753
## gear carb
## cyl -0.4927 0.52699
## disp -0.5556 0.39498
## hp -0.1257 0.74981
## drat 0.6996 -0.09079
## wt -0.5833 0.42761
## qsec -0.2127 -0.65625
## vs 0.2060 -0.56961
## am 0.7941 0.05753
## gear 1.0000 0.27407
## carb 0.2741 1.00000
##
##
## Variance inflation factors
##
## GVIF
## cyl 15.374
## disp 21.620
## hp 9.832
## drat 3.375
## wt 15.165
## qsec 7.528
## vs 4.966
## am 4.648
## gear 5.357
## carb 7.909
## [1] Drop disp.
##
##
## Variance inflation factors
##
## GVIF
## cyl 14.285
## hp 7.123
## drat 3.329
## wt 6.189
## qsec 6.914
## vs 4.916
## am 4.645
## gear 5.324
## carb 4.311
## [1] 14.28
## [1] Drop cyl.
##
##
## Variance inflation factors
##
## GVIF
## hp 6.016
## drat 3.112
## wt 6.051
## qsec 5.919
## vs 4.271
## am 4.286
## gear 4.690
## carb 4.290
## [1] 6.051
## [1] Drop wt.
##
##
## Variance inflation factors
##
## GVIF
## hp 5.075
## drat 3.021
## qsec 4.714
## vs 3.792
## am 4.052
## gear 4.373
## carb 3.642
## [1] 5.075
## $VIFS
## GVIF
## hp 5.075
## drat 3.021
## qsec 4.714
## vs 3.792
## am 4.052
## gear 4.373
## carb 3.642
##
## $XVars
## [1] "hp" "drat" "qsec" "vs" "am" "gear" "carb"
##
## $X
## hp drat qsec vs am gear carb
## Mazda RX4 110 3.90 16.46 0 1 4 4
## Mazda RX4 Wag 110 3.90 17.02 0 1 4 4
## Datsun 710 93 3.85 18.61 1 1 4 1
## Hornet 4 Drive 110 3.08 19.44 1 0 3 1
## Hornet Sportabout 175 3.15 17.02 0 0 3 2
## Valiant 105 2.76 20.22 1 0 3 1
## Duster 360 245 3.21 15.84 0 0 3 4
## Merc 240D 62 3.69 20.00 1 0 4 2
## Merc 230 95 3.92 22.90 1 0 4 2
## Merc 280 123 3.92 18.30 1 0 4 4
## Merc 280C 123 3.92 18.90 1 0 4 4
## Merc 450SE 180 3.07 17.40 0 0 3 3
## Merc 450SL 180 3.07 17.60 0 0 3 3
## Merc 450SLC 180 3.07 18.00 0 0 3 3
## Cadillac Fleetwood 205 2.93 17.98 0 0 3 4
## Lincoln Continental 215 3.00 17.82 0 0 3 4
## Chrysler Imperial 230 3.23 17.42 0 0 3 4
## Fiat 128 66 4.08 19.47 1 1 4 1
## Honda Civic 52 4.93 18.52 1 1 4 2
## Toyota Corolla 65 4.22 19.90 1 1 4 1
## Toyota Corona 97 3.70 20.01 1 0 3 1
## Dodge Challenger 150 2.76 16.87 0 0 3 2
## AMC Javelin 150 3.15 17.30 0 0 3 2
## Camaro Z28 245 3.73 15.41 0 0 3 4
## Pontiac Firebird 175 3.08 17.05 0 0 3 2
## Fiat X1-9 66 4.08 18.90 1 1 4 1
## Porsche 914-2 91 4.43 16.70 0 1 5 2
## Lotus Europa 113 3.77 16.90 1 1 5 2
## Ford Pantera L 264 4.22 14.50 0 1 5 4
## Ferrari Dino 175 3.62 15.50 0 1 5 6
## Maserati Bora 335 3.54 14.60 0 1 5 8
## Volvo 142E 109 4.11 18.60 1 1 4 2
OK, so now hp, drat, qsec, vs, am, gear, carb are the covariates that passed the VIF test, and they are contained in the dataframe Z. Here's the first six rows:
head(Z)
## hp drat qsec vs am gear carb
## Mazda RX4 110 3.90 16.46 0 1 4 4
## Mazda RX4 Wag 110 3.90 17.02 0 1 4 4
## Datsun 710 93 3.85 18.61 1 1 4 1
## Hornet 4 Drive 110 3.08 19.44 1 0 3 1
## Hornet Sportabout 175 3.15 17.02 0 0 3 2
## Valiant 105 2.76 20.22 1 0 3 1
You can now use XVars as the right side of a regression formula and feel happy about it.