One of the main assumptions of multiple regression is that there is no perfect multicollinearity between the predictor variables. That is, no predictor \(x_1\) should be perfectly correlated with another predictor \(x_2\). In reality this is rarely the case, but it is quite possible for the predictor variables to be highly correlated. That makes it difficult to infer the effect of one predictor variable while holding all the others constant: how would you know whether an effect comes from one variable rather than from a combination of several?
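To see what perfect multicollinearity does in practice, here is a minimal sketch on made-up data (not part of the original example): when one predictor is an exact multiple of another, lm() cannot separate their effects and reports NA for the redundant coefficient.
set.seed(1)
fake <- data.frame(y = rnorm(20), x1 = rnorm(20)) # hypothetical data, for illustration only
fake$x2 <- 2 * fake$x1 # x2 is an exact linear function of x1
coef(lm(y ~ x1 + x2, data = fake)) # the coefficient for x2 comes back as NA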
One of the simplest ways to look for possible multicollinearity is to use a correlation matrix to see the pair-wise correlation between variables. Base R's cor() function does just that.
cor_matrix <- cor(mtcars) # this will be useful later
cor_matrix
## mpg cyl disp hp drat wt
## mpg 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.68117191 -0.8676594
## cyl -0.8521620 1.0000000 0.9020329 0.8324475 -0.69993811 0.7824958
## disp -0.8475514 0.9020329 1.0000000 0.7909486 -0.71021393 0.8879799
## hp -0.7761684 0.8324475 0.7909486 1.0000000 -0.44875912 0.6587479
## drat 0.6811719 -0.6999381 -0.7102139 -0.4487591 1.00000000 -0.7124406
## wt -0.8676594 0.7824958 0.8879799 0.6587479 -0.71244065 1.0000000
## qsec 0.4186840 -0.5912421 -0.4336979 -0.7082234 0.09120476 -0.1747159
## vs 0.6640389 -0.8108118 -0.7104159 -0.7230967 0.44027846 -0.5549157
## am 0.5998324 -0.5226070 -0.5912270 -0.2432043 0.71271113 -0.6924953
## gear 0.4802848 -0.4926866 -0.5555692 -0.1257043 0.69961013 -0.5832870
## carb -0.5509251 0.5269883 0.3949769 0.7498125 -0.09078980 0.4276059
## qsec vs am gear carb
## mpg 0.41868403 0.6640389 0.59983243 0.4802848 -0.55092507
## cyl -0.59124207 -0.8108118 -0.52260705 -0.4926866 0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692 0.39497686
## hp -0.70822339 -0.7230967 -0.24320426 -0.1257043 0.74981247
## drat 0.09120476 0.4402785 0.71271113 0.6996101 -0.09078980
## wt -0.17471588 -0.5549157 -0.69249526 -0.5832870 0.42760594
## qsec 1.00000000 0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs 0.74453544 1.0000000 0.16834512 0.2060233 -0.56960714
## am -0.22986086 0.1683451 1.00000000 0.7940588 0.05753435
## gear -0.21268223 0.2060233 0.79405876 1.0000000 0.27407284
## carb -0.65624923 -0.5696071 0.05753435 0.2740728 1.00000000
…I said simplest, not best. Sticking to base R though, pairs()
is significantly more useful here.
pairs(mtcars)
While pairs() is good for a general look at the relation between two variables, it doesn't give you a specific number to look at. Also, black and white is not a particularly appealing colour scheme. This is where corrplot could come in handy.
library(corrplot)
corrplot(cor_matrix)
Along with using circles to visualize the size of the pair-wise correlation, you can also use other methods to visualize the correlations.
corrplot.mixed(cor_matrix, upper = 'pie', lower = 'square')
You may notice that I changed the function to corrplot.mixed. This is because the correlation between two variables is the same regardless of their order (the matrix is symmetric), so might as well make use of the space, eh?
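If you want to confirm that symmetry for yourself, one line does it:
isSymmetric(cor_matrix) # TRUE: cor(x, y) is always equal to cor(y, x)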
corrplot.mixed(cor_matrix, upper = 'ellipse', lower = 'number')
By this point, if you see a correlation whose absolute value is (almost) exactly 1 and it isn't between a variable and itself, you should really start questioning what is happening in your data.
Another way to determine whether you have high multicollinearity is to calculate the VIF (variance inflation factor) of each of the predictor variables. For an (actual!) explanation of what it is, see here.
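In short, for predictor \(j\), \(VIF_j = \frac{1}{1 - R_j^2}\), where \(R_j^2\) is the \(R^2\) from regressing predictor \(j\) on all the other predictors; the better the other predictors explain \(x_j\), the larger its VIF.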
We’ll need to import library car
to easily calculate the VIF of each variable after the regression has been initially calculated.
Of course that requires to have a model first so lets make one!
model <- lm(mpg ~ drat + disp + hp + cyl, data = mtcars) # This is probably a terrible model, but ignore that for now
summary(model)
##
## Call:
## lm(formula = mpg ~ drat + disp + hp + cyl, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5660 -1.8161 -0.6469 1.4094 6.5749
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.98524 7.98905 3.002 0.00571 **
## drat 2.15405 1.59866 1.347 0.18905
## disp -0.01390 0.01089 -1.276 0.21287
## hp -0.02317 0.01576 -1.470 0.15299
## cyl -0.81402 0.84368 -0.965 0.34318
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.012 on 27 degrees of freedom
## Multiple R-squared: 0.7825, Adjusted R-squared: 0.7503
## F-statistic: 24.29 on 4 and 27 DF, p-value: 1.314e-08
Now that we have a regression to use, let's import car and see the VIF of each of the predictor variables.
library(car)
vif(model)
## drat disp hp cyl
## 2.497073 6.227658 3.989032 7.759102
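As a sanity check, the VIF of cyl can be reproduced by hand from the definition above: regress cyl on the other predictors and plug the resulting \(R^2\) into \(1/(1 - R^2)\).
r2_cyl <- summary(lm(cyl ~ drat + disp + hp, data = mtcars))$r.squared # how well the other predictors explain cyl
1 / (1 - r2_cyl) # should reproduce the value reported by vif() above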
For those who are more visually minded:
vif_values <- vif(model)
barplot(vif_values, main = "VIF Values", col = 'green', ylim = c(0.0, 8.0))
bad_vif <- 5.0
abline(h = bad_vif, lwd = 3, lty = 2, col = 'red')
As a rule of thumb, a VIF \(> 10\) indicates really high multicollinearity that you should probably deal with, but most people consider the cut-off point to be 5, as I've done here.
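If you want the offending predictors by name rather than reading them off the bar plot, a quick one-liner:
names(vif_values)[vif_values > bad_vif] # predictors above the chosen cut-off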
This ultimately depends on what your objective is. Multicollinearity is a problem because it makes it difficult to infer the effect of one variable while holding everything else constant. So it is not necessarily a problem if you only care about the model's predictions, or if the highly correlated variables are just controls whose coefficients you don't need to interpret.
In those cases, you can get away with ignoring it. But if you need to infer the effect of a variable that suffers from severe multicollinearity, the simpler solutions include dropping one of the correlated variables, combining them into a single variable, or collecting more data.
For simplicity's sake in this case, I'd actually drop cyl: partially because it's very heavily correlated with many of the variables here, but mainly because of the structure of cyl itself.
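For what it's worth, cyl only takes three distinct values in mtcars (4, 6 and 8), so it arguably behaves more like a categorical variable than a continuous one; a quick check:
table(mtcars$cyl) # only three distinct values: 4, 6 and 8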
model2 <- lm(mpg ~ drat + disp + hp, data = mtcars)
summary(model2)
##
## Call:
## lm(formula = mpg ~ drat + disp + hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1225 -1.8454 -0.4456 1.1342 6.4958
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.344293 6.370882 3.036 0.00513 **
## drat 2.714975 1.487366 1.825 0.07863 .
## disp -0.019232 0.009371 -2.052 0.04960 *
## hp -0.031229 0.013345 -2.340 0.02663 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.008 on 28 degrees of freedom
## Multiple R-squared: 0.775, Adjusted R-squared: 0.7509
## F-statistic: 32.15 on 3 and 28 DF, p-value: 3.28e-09
vif_values2 <- vif(model2)
barplot(vif_values2, main = "VIF Values", col = 'green', ylim = c(0.0, 8.0))
abline(h = bad_vif, lwd = 3, lty = 2, col = 'red')
Better in terms of multicollinearity! There are ways to improve it further, but that's starting to fall outside the scope of this vignette.
Use corrplot and other functions to look at the correlation matrix before starting a model.
Cran.r-project.org. n.d. An Introduction To Corrplot Package. [online] Available at: https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html [Accessed 28 March 2020].
Statology. 2019. How To Easily Calculate Variance Inflation Factor (VIF) In R - Statology. [online] Available at: https://www.statology.org/how-to-easily-calculate-variance-inflation-factor-vif-in-r/ [Accessed 30 March 2020].
Psychologicalstatistics.blogspot.com. 2013. Multicollinearity And Collinearity (In Multiple Regression) - A Tutorial. [online] Available at: http://psychologicalstatistics.blogspot.com/2013/11/multicollinearity-and-collinearity-in.html [Accessed 29 March 2020].
Statistics How To. 2015. Variance Inflation Factor - Statistics How To. [online] Available at: https://www.statisticshowto.datasciencecentral.com/variance-inflation-factor/ [Accessed 30 March 2020].