One of the main assumptions of multiple regression is that there is no perfect multicollinearity between the predictor variables. That is, no predictor \(x_1\) should be perfectly correlated with another predictor \(x_2\). In reality this is rarely the case, but it is quite possible for the predictor variables to be highly correlated. That makes it difficult to infer the effect of one predictor variable while holding all the others constant: how would you know whether an effect comes from one variable rather than from a combination of several?
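To see what perfect multicollinearity does in practice, here is a minimal sketch on made-up data (not part of the original example): when one predictor is an exact multiple of another, lm() cannot separate their effects and reports NA for the redundant coefficient.
set.seed(1)
fake <- data.frame(y = rnorm(20), x1 = rnorm(20)) # hypothetical data, for illustration only
fake$x2 <- 2 * fake$x1 # x2 is an exact linear function of x1
coef(lm(y ~ x1 + x2, data = fake)) # the coefficient for x2 comes back as NA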
One of the simplest ways to look for possible multicollinearity is to use a correlation matrix to see the pair-wise correlation between variables. Base R's cor() function does just that.
cor_matrix <- cor(mtcars) # this will be useful later
cor_matrix
## mpg cyl disp hp drat wt
## mpg 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.68117191 -0.8676594
## cyl -0.8521620 1.0000000 0.9020329 0.8324475 -0.69993811 0.7824958
## disp -0.8475514 0.9020329 1.0000000 0.7909486 -0.71021393 0.8879799
## hp -0.7761684 0.8324475 0.7909486 1.0000000 -0.44875912 0.6587479
## drat 0.6811719 -0.6999381 -0.7102139 -0.4487591 1.00000000 -0.7124406
## wt -0.8676594 0.7824958 0.8879799 0.6587479 -0.71244065 1.0000000
## qsec 0.4186840 -0.5912421 -0.4336979 -0.7082234 0.09120476 -0.1747159
## vs 0.6640389 -0.8108118 -0.7104159 -0.7230967 0.44027846 -0.5549157
## am 0.5998324 -0.5226070 -0.5912270 -0.2432043 0.71271113 -0.6924953
## gear 0.4802848 -0.4926866 -0.5555692 -0.1257043 0.69961013 -0.5832870
## carb -0.5509251 0.5269883 0.3949769 0.7498125 -0.09078980 0.4276059
## qsec vs am gear carb
## mpg 0.41868403 0.6640389 0.59983243 0.4802848 -0.55092507
## cyl -0.59124207 -0.8108118 -0.52260705 -0.4926866 0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692 0.39497686
## hp -0.70822339 -0.7230967 -0.24320426 -0.1257043 0.74981247
## drat 0.09120476 0.4402785 0.71271113 0.6996101 -0.09078980
## wt -0.17471588 -0.5549157 -0.69249526 -0.5832870 0.42760594
## qsec 1.00000000 0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs 0.74453544 1.0000000 0.16834512 0.2060233 -0.56960714
## am -0.22986086 0.1683451 1.00000000 0.7940588 0.05753435
## gear -0.21268223 0.2060233 0.79405876 1.0000000 0.27407284
## carb -0.65624923 -0.5696071 0.05753435 0.2740728 1.00000000
…I said simplest, not best. Sticking to base R though, pairs()
is significantly more useful here.
pairs(mtcars)
While pairs() is good for a general look at the relation between two variables, it doesn't give you a specific number to look at. Also, black and white is not a particularly appealing colour scheme. This is where corrplot could come in handy.
library(corrplot)
corrplot(cor_matrix)
Along with using circles to visualize the size of the pair-wise correlation, you can also use other methods to visualize the correlations.
corrplot.mixed(cor_matrix, upper = 'pie', lower = 'square')
You may notice that I changed the function to corrplot.mixed. This is because the correlation between two variables is the same regardless of their order (the matrix is symmetric), so might as well make use of the space, eh?
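If you want to confirm that symmetry for yourself, one line does it:
isSymmetric(cor_matrix) # TRUE: cor(x, y) is always equal to cor(y, x)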
corrplot.mixed(cor_matrix, upper = 'ellipse', lower = 'number')
By this point, if you see a correlation whose absolute value is (almost) exactly 1 and it isn't between a variable and itself, you should really start questioning what is happening in your data.
Another way to determine whether you have high multicollinearity is to calculate the VIF (variance inflation factor) of each of the predictor variables. For an (actual!) explanation of what it is, see here.
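In short, for predictor \(j\), \(VIF_j = \frac{1}{1 - R_j^2}\), where \(R_j^2\) is the \(R^2\) from regressing predictor \(j\) on all the other predictors; the better the other predictors explain \(x_j\), the larger its VIF.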
We’ll need to import library car
to easily calculate the VIF of each variable after the regression has been initially calculated.
Of course that requires to have a model first so lets make one!
model <- lm(mpg ~ drat + disp + hp + cyl, data = mtcars) # This is probably a terrible model, but ignore that for now
summary(model)
##
## Call:
## lm(formula = mpg ~ drat + disp + hp + cyl, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5660 -1.8161 -0.6469 1.4094 6.5749
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.98524 7.98905 3.002 0.00571 **
## drat 2.15405 1.59866 1.347 0.18905
## disp -0.01390 0.01089 -1.276 0.21287
## hp -0.02317 0.01576 -1.470 0.15299
## cyl -0.81402 0.84368 -0.965 0.34318
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.012 on 27 degrees of freedom
## Multiple R-squared: 0.7825, Adjusted R-squared: 0.7503
## F-statistic: 24.29 on 4 and 27 DF, p-value: 1.314e-08
Now that we have a regression to use, let's import car and see the VIF of each of the predictor variables.
library(car)
vif(model)
## drat disp hp cyl
## 2.497073 6.227658 3.989032 7.759102
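As a sanity check, the VIF of cyl can be reproduced by hand from the definition above: regress cyl on the other predictors and plug the resulting \(R^2\) into \(1/(1 - R^2)\).
r2_cyl <- summary(lm(cyl ~ drat + disp + hp, data = mtcars))$r.squared # how well the other predictors explain cyl
1 / (1 - r2_cyl) # should reproduce the value reported by vif() above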
For those who are more visually minded:
vif_values <- vif(model)
barplot(vif_values, main = "VIF Values", col = 'green', ylim = c(0.0, 8.0))
bad_vif <- 5.0
abline(h = bad_vif, lwd = 3, lty = 2, col = 'red')
As a rule of thumb, a VIF \(> 10\) indicates really high multicollinearity that you should probably deal with, but most people consider the cut-off point to be 5, as I've done here.
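If you want the offending predictors by name rather than reading them off the bar plot, a quick one-liner:
names(vif_values)[vif_values > bad_vif] # predictors above the chosen cut-off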
This ultimately depends on what your objective is. Multicollinearity is a problem because it makes it difficult to infer the effect of one variable while holding everything else constant. So it is not necessarily a problem if you only care about the model's predictions, or if the highly correlated variables are just controls whose coefficients you don't need to interpret.
In those cases, you can get away with ignoring it. But if you need to infer the effect of a variable that suffers from severe multicollinearity, the simpler solutions include dropping one of the correlated variables, combining them into a single variable, or collecting more data.
For simplicity's sake in this case, I'd actually drop cyl: partially because it's very heavily correlated with many of the variables here, but mainly because of the structure of cyl itself.
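For what it's worth, cyl only takes three distinct values in mtcars (4, 6 and 8), so it arguably behaves more like a categorical variable than a continuous one; a quick check:
table(mtcars$cyl) # only three distinct values: 4, 6 and 8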
model2 <- lm(mpg ~ drat + disp + hp, data = mtcars)
summary(model2)
##
## Call:
## lm(formula = mpg ~ drat + disp + hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1225 -1.8454 -0.4456 1.1342 6.4958
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.344293 6.370882 3.036 0.00513 **
## drat 2.714975 1.487366 1.825 0.07863 .
## disp -0.019232 0.009371 -2.052 0.04960 *
## hp -0.031229 0.013345 -2.340 0.02663 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.008 on 28 degrees of freedom
## Multiple R-squared: 0.775, Adjusted R-squared: 0.7509
## F-statistic: 32.15 on 3 and 28 DF, p-value: 3.28e-09
vif_values2 <- vif(model2)
barplot(vif_values2, main = "VIF Values", col = 'green', ylim = c(0.0, 8.0))
abline(h = bad_vif, lwd = 3, lty = 2, col = 'red')
Better in terms of multicollinearity! There are ways to improve it further, but that's starting to fall outside the scope of this vignette.
Use corrplot and other functions to look at the correlation matrix before starting a model.
Cran.r-project.org. n.d. An Introduction To Corrplot Package. [online] Available at: https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html [Accessed 28 March 2020].
Statology. 2019. How To Easily Calculate Variance Inflation Factor (VIF) In R - Statology. [online] Available at: https://www.statology.org/how-to-easily-calculate-variance-inflation-factor-vif-in-r/ [Accessed 30 March 2020].
Psychologicalstatistics.blogspot.com. 2013. Multicollinearity And Collinearity (In Multiple Regression) - A Tutorial. [online] Available at: http://psychologicalstatistics.blogspot.com/2013/11/multicollinearity-and-collinearity-in.html [Accessed 29 March 2020].
Statistics How To. 2015. Variance Inflation Factor - Statistics How To. [online] Available at: https://www.statisticshowto.datasciencecentral.com/variance-inflation-factor/ [Accessed 30 March 2020].