Automatic Thinning of X Variables Using VIF

This script is an example of using VIF to eliminate multi-collinearity from a set of covariates. You might have a set of covariates that you are going to use in a regression to explain some response variable Y. One assumption in most regressions is that there is no correlation among predictor variables. The machinery of regression does not work if correlation among the X variables is too high.

This script shows how to use Variance Inflation Factors (VIFs) to identify and remove highly correlated variables from a data set.

Bring in packages and source helper files

Source Zuur's helper functions (from http://highstat.com/book2.htm for the 'Mixed Effects' book).

getwd()
## [1] "E:/Data/Dropbox/FindTheBestZIP"
source("HighstatLibV6.R")

The original corvif function is in HighstatLibV6.R. I modified it as two slightly different versions in JacksHelperFunctionsV1.R below. The thinXwithVIF function is also in JacksHelperFunctionsV1.R.

The functions are as follows:

# JacksHelperFunctionsV1.R

# Slightly different versions of Zuur's corvif To use: corvif(YourDataFile)
corvif <- function(dataz) {
    dataz <- as.data.frame(dataz)
    # correlation part
    cat("Correlations of the variables\n\n")
    tmp_cor <- cor(dataz, use = "complete.obs")
    print(tmp_cor)

    # vif part
    form <- formula(paste("fooy ~ ", paste(strsplit(names(dataz), " "), collapse = " + ")))
    dataz <- data.frame(fooy = 1, dataz)
    lm_mod <- lm(form, dataz)

    cat("\n\nVariance inflation factors\n\n")
    print(myvif(lm_mod))
}
# VIF FUNCTION. To use: corvif(YourDataFile)
corvif_noCorr <- function(dataz) {
    dataz <- as.data.frame(dataz)
    # correlation part cat('Correlations of the variables\n\n') tmp_cor <-
    # cor(dataz,use='complete.obs') print(tmp_cor)

    # vif part
    form <- formula(paste("fooy ~ ", paste(strsplit(names(dataz), " "), collapse = " + ")))
    dataz <- data.frame(fooy = 1, dataz)
    lm_mod <- lm(form, dataz)

    cat("\n\nVariance inflation factors\n\n")
    print(myvif(lm_mod))
}

# To use: thinXwithVIF(X, Threshold) where X is a dataframe of predictor
# variables (no factors!), and Threshold is the maximum VIF you will
# tolerate Zuur et al. (2009) sugggest 3, others have used 6 or 10.
thinXwithVIF = function(X, Threshold = 3) {
    VIFS = corvif(X)
    XVars = names(X)
    max(VIFS$GVIF)
    while (max(VIFS$GVIF) >= Threshold) {
        print(paste("Drop ", XVars[which.max(VIFS$GVIF)], ".", sep = ""), quote = FALSE)
        XVars = XVars[-which.max(VIFS$GVIF)]
        X = X[, -which.max(VIFS$GVIF)]
        VIFS = corvif_noCorr(X)
        print(max(VIFS$GVIF))
    }
    return(list(VIFS = VIFS, XVars = XVars, X = X))
}

Bring in The Data and Select the X Variables

Use mtcars as a test data set. (Type '?mtcars' to see an explanation of the variables.)

summary(mtcars)
##       mpg            cyl            disp             hp       
##  Min.   :10.4   Min.   :4.00   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.4   1st Qu.:4.00   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.2   Median :6.00   Median :196.3   Median :123.0  
##  Mean   :20.1   Mean   :6.19   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.8   3rd Qu.:8.00   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.9   Max.   :8.00   Max.   :472.0   Max.   :335.0  
##       drat            wt            qsec            vs       
##  Min.   :2.76   Min.   :1.51   Min.   :14.5   Min.   :0.000  
##  1st Qu.:3.08   1st Qu.:2.58   1st Qu.:16.9   1st Qu.:0.000  
##  Median :3.69   Median :3.33   Median :17.7   Median :0.000  
##  Mean   :3.60   Mean   :3.22   Mean   :17.8   Mean   :0.438  
##  3rd Qu.:3.92   3rd Qu.:3.61   3rd Qu.:18.9   3rd Qu.:1.000  
##  Max.   :4.93   Max.   :5.42   Max.   :22.9   Max.   :1.000  
##        am             gear           carb     
##  Min.   :0.000   Min.   :3.00   Min.   :1.00  
##  1st Qu.:0.000   1st Qu.:3.00   1st Qu.:2.00  
##  Median :0.000   Median :4.00   Median :2.00  
##  Mean   :0.406   Mean   :3.69   Mean   :2.81  
##  3rd Qu.:1.000   3rd Qu.:4.00   3rd Qu.:4.00  
##  Max.   :1.000   Max.   :5.00   Max.   :8.00

The variables 'vs' and 'am' are dummy variables and should be factors:

MyData = mtcars
summary(MyData)
##       mpg            cyl            disp             hp       
##  Min.   :10.4   Min.   :4.00   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.4   1st Qu.:4.00   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.2   Median :6.00   Median :196.3   Median :123.0  
##  Mean   :20.1   Mean   :6.19   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.8   3rd Qu.:8.00   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.9   Max.   :8.00   Max.   :472.0   Max.   :335.0  
##       drat            wt            qsec            vs       
##  Min.   :2.76   Min.   :1.51   Min.   :14.5   Min.   :0.000  
##  1st Qu.:3.08   1st Qu.:2.58   1st Qu.:16.9   1st Qu.:0.000  
##  Median :3.69   Median :3.33   Median :17.7   Median :0.000  
##  Mean   :3.60   Mean   :3.22   Mean   :17.8   Mean   :0.438  
##  3rd Qu.:3.92   3rd Qu.:3.61   3rd Qu.:18.9   3rd Qu.:1.000  
##  Max.   :4.93   Max.   :5.42   Max.   :22.9   Max.   :1.000  
##        am             gear           carb     
##  Min.   :0.000   Min.   :3.00   Min.   :1.00  
##  1st Qu.:0.000   1st Qu.:3.00   1st Qu.:2.00  
##  Median :0.000   Median :4.00   Median :2.00  
##  Mean   :0.406   Mean   :3.69   Mean   :2.81  
##  3rd Qu.:1.000   3rd Qu.:4.00   3rd Qu.:4.00  
##  Max.   :1.000   Max.   :5.00   Max.   :8.00
YVar = names(MyData)[1]  # Pick the first element of the list of variable names in MyData
YVar
## [1] "mpg"
XVars <- names(MyData[, -1])  # Delete the first column from MyData, and list the names
XVars
##  [1] "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"

Check the X Variables for High Correlation

Z = MyData[, -1]
VIFS = corvif(Z)
## Correlations of the variables
## 
##          cyl    disp      hp     drat      wt    qsec      vs       am
## cyl   1.0000  0.9020  0.8324 -0.69994  0.7825 -0.5912 -0.8108 -0.52261
## disp  0.9020  1.0000  0.7909 -0.71021  0.8880 -0.4337 -0.7104 -0.59123
## hp    0.8324  0.7909  1.0000 -0.44876  0.6587 -0.7082 -0.7231 -0.24320
## drat -0.6999 -0.7102 -0.4488  1.00000 -0.7124  0.0912  0.4403  0.71271
## wt    0.7825  0.8880  0.6587 -0.71244  1.0000 -0.1747 -0.5549 -0.69250
## qsec -0.5912 -0.4337 -0.7082  0.09120 -0.1747  1.0000  0.7445 -0.22986
## vs   -0.8108 -0.7104 -0.7231  0.44028 -0.5549  0.7445  1.0000  0.16835
## am   -0.5226 -0.5912 -0.2432  0.71271 -0.6925 -0.2299  0.1683  1.00000
## gear -0.4927 -0.5556 -0.1257  0.69961 -0.5833 -0.2127  0.2060  0.79406
## carb  0.5270  0.3950  0.7498 -0.09079  0.4276 -0.6562 -0.5696  0.05753
##         gear     carb
## cyl  -0.4927  0.52699
## disp -0.5556  0.39498
## hp   -0.1257  0.74981
## drat  0.6996 -0.09079
## wt   -0.5833  0.42761
## qsec -0.2127 -0.65625
## vs    0.2060 -0.56961
## am    0.7941  0.05753
## gear  1.0000  0.27407
## carb  0.2741  1.00000
## 
## 
## Variance inflation factors
## 
##        GVIF
## cyl  15.374
## disp 21.620
## hp    9.832
## drat  3.375
## wt   15.165
## qsec  7.528
## vs    4.966
## am    4.648
## gear  5.357
## carb  7.909
str(VIFS)
## 'data.frame':    10 obs. of  1 variable:
##  $ GVIF: num  15.37 21.62 9.83 3.37 15.16 ...

In the above output, look for correlations above 0.7 as the most problematic. Several cutoffs for VIFs are suggested: 10, 6 and 3 are thresholds I have seen. We'll use 6 for now.

Threshold = 6

Iterate the Dropping Process

The highest VIF is now 21.6202 for the variable 'disp'. Drop it, then recalculate and keep dropping until the maximum VIF is less than the threshold.

The ThinXwithVIF function should work!

thinXwithVIF(X, Threshold)
## Correlations of the variables
## 
##          cyl    disp      hp     drat      wt    qsec      vs       am
## cyl   1.0000  0.9020  0.8324 -0.69994  0.7825 -0.5912 -0.8108 -0.52261
## disp  0.9020  1.0000  0.7909 -0.71021  0.8880 -0.4337 -0.7104 -0.59123
## hp    0.8324  0.7909  1.0000 -0.44876  0.6587 -0.7082 -0.7231 -0.24320
## drat -0.6999 -0.7102 -0.4488  1.00000 -0.7124  0.0912  0.4403  0.71271
## wt    0.7825  0.8880  0.6587 -0.71244  1.0000 -0.1747 -0.5549 -0.69250
## qsec -0.5912 -0.4337 -0.7082  0.09120 -0.1747  1.0000  0.7445 -0.22986
## vs   -0.8108 -0.7104 -0.7231  0.44028 -0.5549  0.7445  1.0000  0.16835
## am   -0.5226 -0.5912 -0.2432  0.71271 -0.6925 -0.2299  0.1683  1.00000
## gear -0.4927 -0.5556 -0.1257  0.69961 -0.5833 -0.2127  0.2060  0.79406
## carb  0.5270  0.3950  0.7498 -0.09079  0.4276 -0.6562 -0.5696  0.05753
##         gear     carb
## cyl  -0.4927  0.52699
## disp -0.5556  0.39498
## hp   -0.1257  0.74981
## drat  0.6996 -0.09079
## wt   -0.5833  0.42761
## qsec -0.2127 -0.65625
## vs    0.2060 -0.56961
## am    0.7941  0.05753
## gear  1.0000  0.27407
## carb  0.2741  1.00000
## 
## 
## Variance inflation factors
## 
##        GVIF
## cyl  15.374
## disp 21.620
## hp    9.832
## drat  3.375
## wt   15.165
## qsec  7.528
## vs    4.966
## am    4.648
## gear  5.357
## carb  7.909
## [1] Drop disp.
## 
## 
## Variance inflation factors
## 
##        GVIF
## cyl  14.285
## hp    7.123
## drat  3.329
## wt    6.189
## qsec  6.914
## vs    4.916
## am    4.645
## gear  5.324
## carb  4.311
## [1] 14.28
## [1] Drop cyl.
## 
## 
## Variance inflation factors
## 
##       GVIF
## hp   6.016
## drat 3.112
## wt   6.051
## qsec 5.919
## vs   4.271
## am   4.286
## gear 4.690
## carb 4.290
## [1] 6.051
## [1] Drop wt.
## 
## 
## Variance inflation factors
## 
##       GVIF
## hp   5.075
## drat 3.021
## qsec 4.714
## vs   3.792
## am   4.052
## gear 4.373
## carb 3.642
## [1] 5.075
## $VIFS
##       GVIF
## hp   5.075
## drat 3.021
## qsec 4.714
## vs   3.792
## am   4.052
## gear 4.373
## carb 3.642
## 
## $XVars
## [1] "hp"   "drat" "qsec" "vs"   "am"   "gear" "carb"
## 
## $X
##                      hp drat  qsec vs am gear carb
## Mazda RX4           110 3.90 16.46  0  1    4    4
## Mazda RX4 Wag       110 3.90 17.02  0  1    4    4
## Datsun 710           93 3.85 18.61  1  1    4    1
## Hornet 4 Drive      110 3.08 19.44  1  0    3    1
## Hornet Sportabout   175 3.15 17.02  0  0    3    2
## Valiant             105 2.76 20.22  1  0    3    1
## Duster 360          245 3.21 15.84  0  0    3    4
## Merc 240D            62 3.69 20.00  1  0    4    2
## Merc 230             95 3.92 22.90  1  0    4    2
## Merc 280            123 3.92 18.30  1  0    4    4
## Merc 280C           123 3.92 18.90  1  0    4    4
## Merc 450SE          180 3.07 17.40  0  0    3    3
## Merc 450SL          180 3.07 17.60  0  0    3    3
## Merc 450SLC         180 3.07 18.00  0  0    3    3
## Cadillac Fleetwood  205 2.93 17.98  0  0    3    4
## Lincoln Continental 215 3.00 17.82  0  0    3    4
## Chrysler Imperial   230 3.23 17.42  0  0    3    4
## Fiat 128             66 4.08 19.47  1  1    4    1
## Honda Civic          52 4.93 18.52  1  1    4    2
## Toyota Corolla       65 4.22 19.90  1  1    4    1
## Toyota Corona        97 3.70 20.01  1  0    3    1
## Dodge Challenger    150 2.76 16.87  0  0    3    2
## AMC Javelin         150 3.15 17.30  0  0    3    2
## Camaro Z28          245 3.73 15.41  0  0    3    4
## Pontiac Firebird    175 3.08 17.05  0  0    3    2
## Fiat X1-9            66 4.08 18.90  1  1    4    1
## Porsche 914-2        91 4.43 16.70  0  1    5    2
## Lotus Europa        113 3.77 16.90  1  1    5    2
## Ford Pantera L      264 4.22 14.50  0  1    5    4
## Ferrari Dino        175 3.62 15.50  0  1    5    6
## Maserati Bora       335 3.54 14.60  0  1    5    8
## Volvo 142E          109 4.11 18.60  1  1    4    2

Results

OK, so now hp, drat, qsec, vs, am, gear, carb are the covariates that passed the VIF test, and they are contained in the dataframe Z. Here's the first six rows:

head(Z)
##                    hp drat  qsec vs am gear carb
## Mazda RX4         110 3.90 16.46  0  1    4    4
## Mazda RX4 Wag     110 3.90 17.02  0  1    4    4
## Datsun 710         93 3.85 18.61  1  1    4    1
## Hornet 4 Drive    110 3.08 19.44  1  0    3    1
## Hornet Sportabout 175 3.15 17.02  0  0    3    2
## Valiant           105 2.76 20.22  1  0    3    1

You can now use XVars as the right side of a regression formula and feel happy about it.