Multicollinearity - definition
Multicollinearity means the situation, where the explanatory
variables are highly pairwise correlated.
Consequences of Multicollinearity (according to Rutecki)
Multicollinearity could result in significant problems during model
fitting. It can reduce the overall performance of regression and
classification models:
- does not increase bias, but it can increase variance
(overfitting);
- make the estimates very sensitive to minor changes in the
model;
- doesn’t affect the predictive power, but individual predictor
variable’s impact on the response variable could be calculated
wrongly.
The result is that the coefficient estimates are unstable and
difficult to interpret. Multicollinearity saps the statistical power of
the analysis, can cause the coefficients to switch signs, and makes it
more difficult to specify the correct model.
Remedies to Multicollinearity problem (according to Rutecki)
Increase the sample size. Enlarging the sample will introduce
more variation in the data series, which reduces the effect of sampling
error and helps increase precision when estimating various properties of
the data. Increased sample sizes can reduce either the presence or the
impact of multicollinearity, or both.
Make some transformation of the data (for example, replace level
data by their differences)
Remove some of the highly correlated features.
- Manual Method - Variance Inflation Factor (VIF)
- Automatic Method - Recursive Feature Elimination (RFE) (I’ll skip it
in this study)
- Feature Elmination using PCA Decomposition (I’ll skip it in this
study)
- Replace highly correlated regressors with a linear combination of
them. (I’ll skip it in this study)
- Use regularization methods such as RIDGE and LASSO or Bayesian
regression. (I’ll skip it in this study)
Boston House Prices
Each record in the database describes a Boston suburb or town. The
data was drawn from the Boston Standard Metropolitan Statistical Area
(SMSA) in 1970. The attributes are defined as follows (taken from the UCI
Machine Learning Repository1): CRIM: per capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000
sq.ft.
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (= 1 if tract bounds river; 0
otherwise)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centers
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per $10,000
- PTRATIO: pupil-teacher ratio by town 12. B: 1000(Bk−0.63)2 where Bk
is the proportion of blacks by town 13. LSTAT: % lower status of the
population
- MEDV: Median value of owner-occupied homes in $1000s
We can see that the input attributes have a mixture of units.
Computations
rm(list=ls())
library(lmtest)
library(car)
library(glmnet)
column_names <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV")
df <- read.table("adot/housing.data", header = FALSE, sep = "", col.names = column_names)
### Plotting the correlation between various columns
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
pairs(df, lower.panel = panel.smooth, upper.panel = panel.cor,
gap=0, row1attop=FALSE)

Overparametrized model
model_all <- lm(MEDV ~ ., data=df) # with all the independent variables in the dataframe
summary(model_all)
Call:
lm(formula = MEDV ~ ., data = df)
Residuals:
Min 1Q Median 3Q Max
-15.595 -2.730 -0.518 1.777 26.199
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
CRIM -1.080e-01 3.286e-02 -3.287 0.001087 **
ZN 4.642e-02 1.373e-02 3.382 0.000778 ***
INDUS 2.056e-02 6.150e-02 0.334 0.738288
CHAS 2.687e+00 8.616e-01 3.118 0.001925 **
NOX -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
RM 3.810e+00 4.179e-01 9.116 < 2e-16 ***
AGE 6.922e-04 1.321e-02 0.052 0.958229
DIS -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
RAD 3.060e-01 6.635e-02 4.613 5.07e-06 ***
TAX -1.233e-02 3.760e-03 -3.280 0.001112 **
PTRATIO -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
B 9.312e-03 2.686e-03 3.467 0.000573 ***
LSTAT -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
Multicollinearity testing
VIF
\[VIF_i = \frac{1}{1-R^2_i}\]
niekedy sa hovorí, že tento ukazovateľ pre danú premennú nesmie byť
väčší, ako 5, niekedy sa hovorí o hranici 10.
vif(model_all)
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD
1.792192 2.298758 3.991596 1.073995 4.393720 1.933744 3.100826 3.955945 7.484496
TAX PTRATIO B LSTAT
9.008554 1.799084 1.348521 2.941491
Vizualizácia VIF
vif_values <- vif(model_all) #create vector of VIF values
barplot(vif_values, main = "VIF Values", horiz = TRUE, col = "steelblue") #create horizontal bar chart to display each VIF value
abline(v = 5, lwd = 3, lty = 2) #add vertical line at 5 as after 5 there is severe correlation

NA
NA
df_9_10 <- df[,-c(9,10)]
model_9_10 <- lm(MEDV ~., data=df_9_10)
summary(model_9_10)
Call:
lm(formula = MEDV ~ ., data = df_9_10)
Residuals:
Min 1Q Median 3Q Max
-15.7734 -2.7928 -0.6504 1.5477 27.9833
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.210710 4.893167 5.970 4.54e-09 ***
CRIM -0.061435 0.030434 -2.019 0.044066 *
ZN 0.041516 0.013551 3.064 0.002306 **
INDUS -0.045831 0.055961 -0.819 0.413190
CHAS 3.085016 0.871868 3.538 0.000441 ***
NOX -14.594752 3.670109 -3.977 8.03e-05 ***
RM 4.134272 0.419341 9.859 < 2e-16 ***
AGE -0.004295 0.013415 -0.320 0.748983
DIS -1.488851 0.203337 -7.322 9.98e-13 ***
PTRATIO -0.811272 0.121647 -6.669 6.91e-11 ***
B 0.008205 0.002706 3.032 0.002558 **
LSTAT -0.515812 0.051668 -9.983 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.839 on 494 degrees of freedom
Multiple R-squared: 0.7293, Adjusted R-squared: 0.7232
F-statistic: 121 on 11 and 494 DF, p-value: < 2.2e-16
vif(model_9_10)
CRIM ZN INDUS CHAS NOX RM AGE DIS PTRATIO
1.478206 2.154483 3.179166 1.057805 3.901348 1.872532 3.075755 3.954443 1.496077
B LSTAT
1.316559 2.936487
Is heteroskedasticity present in the new model?
df_9_10 <- df[,-c(9,10)]
model_9_10 <- lm(MEDV ~., data=df_9_10)
bptest(model_9_10)
studentized Breusch-Pagan test
data: model_9_10
BP = 50.806, df = 11, p-value = 4.481e-07
Yes, it is…
LS0tCnRpdGxlOiAiTXVsdGljb2xsaW5lYXJpdHkiCm91dHB1dDogaHRtbF9ub3RlYm9vawotLS0KIyMgSW50cm9kdWN0aW9uCgpUaGUgdGV4dCBpcyBtb3RpdmF0ZWQgYnkgW01hcmNpbiBSdXRlY2tpXShodHRwczovL3d3dy5rYWdnbGUuY29tL2NvZGUvbWFyY2lucnV0ZWNraS9tdWx0aWNvbGxpbmVhcml0eS1kZXRlY3Rpb24tYW5kLXJlbWVkaWVzKSBhbmQgW01hbmltYWxhXShodHRwczovL3d3dy5rYWdnbGUuY29tL2RhdGFzZXRzL3Zpa3Jpc2huYW4vYm9zdG9uLWhvdXNlLXByaWNlcyksIGJvdGggb2YgdGhlIFtLYWdnbGUgcHJvamVjdF0oaHR0cHM6Ly93d3cua2FnZ2xlLmNvbS8pLgoKIyMgTXVsdGljb2xsaW5lYXJpdHkgLSBkZWZpbml0aW9uCk11bHRpY29sbGluZWFyaXR5IG1lYW5zIHRoZSBzaXR1YXRpb24sIHdoZXJlIHRoZSBleHBsYW5hdG9yeSB2YXJpYWJsZXMgYXJlIGhpZ2hseSBwYWlyd2lzZSBjb3JyZWxhdGVkLgoKIyMgIENvbnNlcXVlbmNlcyBvZiBNdWx0aWNvbGxpbmVhcml0eSAoYWNjb3JkaW5nIHRvIFJ1dGVja2kpCgpNdWx0aWNvbGxpbmVhcml0eSBjb3VsZCByZXN1bHQgaW4gc2lnbmlmaWNhbnQgcHJvYmxlbXMgZHVyaW5nIG1vZGVsIGZpdHRpbmcuIEl0IGNhbiByZWR1Y2UgdGhlIG92ZXJhbGwgcGVyZm9ybWFuY2Ugb2YgcmVncmVzc2lvbiBhbmQgY2xhc3NpZmljYXRpb24gbW9kZWxzOgoKLSBkb2VzIG5vdCBpbmNyZWFzZSBiaWFzLCBidXQgaXQgY2FuIGluY3JlYXNlIHZhcmlhbmNlIChvdmVyZml0dGluZyk7Ci0gbWFrZSB0aGUgZXN0aW1hdGVzIHZlcnkgc2Vuc2l0aXZlIHRvIG1pbm9yIGNoYW5nZXMgaW4gdGhlIG1vZGVsOwotIGRvZXNuJ3QgYWZmZWN0IHRoZSBwcmVkaWN0aXZlIHBvd2VyLCBidXQgaW5kaXZpZHVhbCBwcmVkaWN0b3IgdmFyaWFibGUncyBpbXBhY3Qgb24gdGhlIHJlc3BvbnNlIHZhcmlhYmxlIGNvdWxkIGJlIGNhbGN1bGF0ZWQgd3JvbmdseS4KClRoZSByZXN1bHQgaXMgdGhhdCB0aGUgY29lZmZpY2llbnQgZXN0aW1hdGVzIGFyZSB1bnN0YWJsZSBhbmQgZGlmZmljdWx0IHRvIGludGVycHJldC4gTXVsdGljb2xsaW5lYXJpdHkgc2FwcyB0aGUgc3RhdGlzdGljYWwgcG93ZXIgb2YgdGhlIGFuYWx5c2lzLCBjYW4gY2F1c2UgdGhlIGNvZWZmaWNpZW50cyB0byBzd2l0Y2ggc2lnbnMsIGFuZCBtYWtlcyBpdCBtb3JlIGRpZmZpY3VsdCB0byBzcGVjaWZ5IHRoZSBjb3JyZWN0IG1vZGVsLgoKIyMgUmVtZWRpZXMgdG8gTXVsdGljb2xsaW5lYXJpdHkgcHJvYmxlbSAoYWNjb3JkaW5nIHRvIFJ1dGVja2kpCjEuIEluY3JlYXNlIHRoZSBzYW1wbGUgc2l6ZS4gRW5sYXJnaW5nIHRoZSBzYW1wbGUgd2lsbCBpbnRyb2R1Y2UgbW9yZSB2YXJpYXRpb24gaW4gdGhlIGRhdGEgc2VyaWVzLCB3aGljaCByZWR1Y2VzIHRoZSBlZmZlY3Qgb2Ygc2FtcGxpbmcgZXJyb3IgYW5kIGhlbHBzIGluY3JlYXNlIHByZWNpc2lvbiB3aGVuIGVzdGltYXRpbmcgdmFyaW91cyBwcm9wZXJ0aWVzIG9mIHRoZSBkYXRhLiBJbmNyZWFzZWQgc2FtcGxlIHNpemVzIGNhbiByZWR1Y2UgZWl0aGVyIHRoZSBwcmVzZW5jZSBvciB0aGUgaW1wYWN0IG9mIG11bHRpY29sbGluZWFyaXR5LCBvciBib3RoLgoKMi4gTWFrZSBzb21lIHRyYW5zZm9ybWF0aW9uIG9mIHRoZSBkYXRhIChmb3IgZXhhbXBsZSwgcmVwbGFjZSBsZXZlbCBkYXRhIGJ5IHRoZWlyIGRpZmZlcmVuY2VzKQoKMy4gUmVtb3ZlIHNvbWUgb2YgdGhlIGhpZ2hseSBjb3JyZWxhdGVkIGZlYXR1cmVzLgoKLSBNYW51YWwgTWV0aG9kIC0gVmFyaWFuY2UgSW5mbGF0aW9uIEZhY3RvciAoVklGKQotIEF1dG9tYXRpYyBNZXRob2QgLSBSZWN1cnNpdmUgRmVhdHVyZSBFbGltaW5hdGlvbiAoUkZFKSAoSSdsbCBza2lwIGl0IGluIHRoaXMgc3R1ZHkpCi0gRmVhdHVyZSBFbG1pbmF0aW9uIHVzaW5nIFBDQSBEZWNvbXBvc2l0aW9uIChJJ2xsIHNraXAgaXQgaW4gdGhpcyBzdHVkeSkKLSBSZXBsYWNlIGhpZ2hseSBjb3JyZWxhdGVkIHJlZ3Jlc3NvcnMgd2l0aCBhIGxpbmVhciBjb21iaW5hdGlvbiBvZiB0aGVtLiAoSSdsbCBza2lwIGl0IGluIHRoaXMgc3R1ZHkpCgozLiBVc2UgcmVndWxhcml6YXRpb24gbWV0aG9kcyBzdWNoIGFzIFJJREdFIGFuZCBMQVNTTyBvciBCYXllc2lhbiByZWdyZXNzaW9uLiAoSSdsbCBza2lwIGl0IGluIHRoaXMgc3R1ZHkpCgojIyBCb3N0b24gSG91c2UgUHJpY2VzCgpFYWNoIHJlY29yZCBpbiB0aGUgZGF0YWJhc2UgZGVzY3JpYmVzIGEgQm9zdG9uIHN1YnVyYiBvciB0b3duLiBUaGUgZGF0YSB3YXMgZHJhd24gZnJvbSB0aGUgQm9zdG9uIFN0YW5kYXJkIE1ldHJvcG9saXRhbiBTdGF0aXN0aWNhbCBBcmVhIChTTVNBKSBpbiAxOTcwLiBUaGUgYXR0cmlidXRlcyBhcmUgZGXvrIFuZWQgYXMgZm9sbG93cyAodGFrZW4gZnJvbSB0aGUgVUNJIE1hY2hpbmUgTGVhcm5pbmcgUmVwb3NpdG9yeTEpOiBDUklNOiBwZXIgY2FwaXRhIGNyaW1lIHJhdGUgYnkgdG93bgoKLSBaTjogcHJvcG9ydGlvbiBvZiByZXNpZGVudGlhbCBsYW5kIHpvbmVkIGZvciBsb3RzIG92ZXIgMjUsMDAwIHNxLmZ0LgotIElORFVTOiBwcm9wb3J0aW9uIG9mIG5vbi1yZXRhaWwgYnVzaW5lc3MgYWNyZXMgcGVyIHRvd24KLSBDSEFTOiBDaGFybGVzIFJpdmVyIGR1bW15IHZhcmlhYmxlICg9IDEgaWYgdHJhY3QgYm91bmRzIHJpdmVyOyAwIG90aGVyd2lzZSkKLSBOT1g6IG5pdHJpYyBveGlkZXMgY29uY2VudHJhdGlvbiAocGFydHMgcGVyIDEwIG1pbGxpb24pCi0gUk06IGF2ZXJhZ2UgbnVtYmVyIG9mIHJvb21zIHBlciBkd2VsbGluZwotIEFHRTogcHJvcG9ydGlvbiBvZiBvd25lci1vY2N1cGllZCB1bml0cyBidWlsdCBwcmlvciB0byAxOTQwCi0gRElTOiB3ZWlnaHRlZCBkaXN0YW5jZXMgdG8g76yBdmUgQm9zdG9uIGVtcGxveW1lbnQgY2VudGVycwotIFJBRDogaW5kZXggb2YgYWNjZXNzaWJpbGl0eSB0byByYWRpYWwgaGlnaHdheXMKLSBUQVg6IGZ1bGwtdmFsdWUgcHJvcGVydHktdGF4IHJhdGUgcGVyICQxMCwwMDAKLSBQVFJBVElPOiBwdXBpbC10ZWFjaGVyIHJhdGlvIGJ5IHRvd24gMTIuIEI6IDEwMDAoQmviiJIwLjYzKTIgd2hlcmUgQmsgaXMgdGhlIHByb3BvcnRpb24gb2YgYmxhY2tzIGJ5IHRvd24gMTMuIExTVEFUOiAlIGxvd2VyIHN0YXR1cyBvZiB0aGUgcG9wdWxhdGlvbgotIE1FRFY6IE1lZGlhbiB2YWx1ZSBvZiBvd25lci1vY2N1cGllZCBob21lcyBpbiAkMTAwMHMKCldlIGNhbiBzZWUgdGhhdCB0aGUgaW5wdXQgYXR0cmlidXRlcyBoYXZlIGEgbWl4dHVyZSBvZiB1bml0cy4KCiMjIENvbXB1dGF0aW9ucwoKYGBge3IgbGlicmFyaWVzfQpybShsaXN0PWxzKCkpCmxpYnJhcnkobG10ZXN0KQpsaWJyYXJ5KGNhcikKbGlicmFyeShnbG1uZXQpCmBgYAoKCgpgYGB7cn0KY29sdW1uX25hbWVzIDwtIGMoIkNSSU0iLCAiWk4iLCAiSU5EVVMiLCAiQ0hBUyIsICJOT1giLCAiUk0iLCAiQUdFIiwgIkRJUyIsICJSQUQiLCAiVEFYIiwgIlBUUkFUSU8iLCAiQiIsICJMU1RBVCIsICJNRURWIikKZGYgPC0gcmVhZC50YWJsZSgiYWRvdC9ob3VzaW5nLmRhdGEiLCBoZWFkZXIgPSBGQUxTRSwgc2VwID0gIiIsIGNvbC5uYW1lcyA9IGNvbHVtbl9uYW1lcykKCmBgYAoKCgpgYGB7cn0KIyMjIFBsb3R0aW5nIHRoZSBjb3JyZWxhdGlvbiBiZXR3ZWVuIHZhcmlvdXMgY29sdW1ucwpwYW5lbC5jb3IgPC0gZnVuY3Rpb24oeCwgeSwgZGlnaXRzID0gMiwgcHJlZml4ID0gIiIsIGNleC5jb3IsIC4uLikKewogICAgcGFyKHVzciA9IGMoMCwgMSwgMCwgMSkpCiAgICByIDwtIGFicyhjb3IoeCwgeSkpCiAgICB0eHQgPC0gZm9ybWF0KGMociwgMC4xMjM0NTY3ODkpLCBkaWdpdHMgPSBkaWdpdHMpWzFdCiAgICB0eHQgPC0gcGFzdGUwKHByZWZpeCwgdHh0KQogICAgaWYobWlzc2luZyhjZXguY29yKSkgY2V4LmNvciA8LSAwLjgvc3Ryd2lkdGgodHh0KQogICAgdGV4dCgwLjUsIDAuNSwgdHh0LCBjZXggPSBjZXguY29yICogcikKfQpwYWlycyhkZiwgbG93ZXIucGFuZWwgPSBwYW5lbC5zbW9vdGgsIHVwcGVyLnBhbmVsID0gcGFuZWwuY29yLAogICAgICBnYXA9MCwgcm93MWF0dG9wPUZBTFNFKQpgYGAKCgojIyBPdmVycGFyYW1ldHJpemVkIG1vZGVsCgpgYGB7cn0KbW9kZWxfYWxsIDwtIGxtKE1FRFYgfiAuLCBkYXRhPWRmKSAgIyB3aXRoIGFsbCB0aGUgaW5kZXBlbmRlbnQgdmFyaWFibGVzIGluIHRoZSBkYXRhZnJhbWUKCnN1bW1hcnkobW9kZWxfYWxsKQpgYGAKCiMjIE11bHRpY29sbGluZWFyaXR5IHRlc3RpbmcKCiMjIyBWSUYKCiQkVklGX2kgPSBcZnJhY3sxfXsxLVJeMl9pfSQkCm5pZWtlZHkgc2EgaG92b3LDrSwgxb5lIHRlbnRvIHVrYXpvdmF0ZcS+IHByZSBkYW7DuiBwcmVtZW5uw7ogbmVzbWllIGJ5xaUgdsOkxI3FocOtLCBha28gNSwgbmlla2VkeSBzYSBob3ZvcsOtIG8gaHJhbmljaSAxMC4KCmBgYHtyfQp2aWYobW9kZWxfYWxsKQpgYGAKClZpenVhbGl6w6FjaWEgVklGCgpgYGB7cn0KdmlmX3ZhbHVlcyA8LSB2aWYobW9kZWxfYWxsKSAgICAgICAgICAgI2NyZWF0ZSB2ZWN0b3Igb2YgVklGIHZhbHVlcwoKYmFycGxvdCh2aWZfdmFsdWVzLCBtYWluID0gIlZJRiBWYWx1ZXMiLCBob3JpeiA9IFRSVUUsIGNvbCA9ICJzdGVlbGJsdWUiKSAjY3JlYXRlIGhvcml6b250YWwgYmFyIGNoYXJ0IHRvIGRpc3BsYXkgZWFjaCBWSUYgdmFsdWUKCmFibGluZSh2ID0gNSwgbHdkID0gMywgbHR5ID0gMikgICAgI2FkZCB2ZXJ0aWNhbCBsaW5lIGF0IDUgYXMgYWZ0ZXIgNSB0aGVyZSBpcyBzZXZlcmUgY29ycmVsYXRpb24KCgpgYGAKCmBgYHtyfQpkZl85XzEwIDwtIGRmWywtYyg5LDEwKV0KbW9kZWxfOV8xMCA8LSBsbShNRURWIH4uLCBkYXRhPWRmXzlfMTApCnN1bW1hcnkobW9kZWxfOV8xMCkKYGBgCgpgYGB7cn0KdmlmKG1vZGVsXzlfMTApCmBgYAoKCiMjIElzIGhldGVyb3NrZWRhc3RpY2l0eSBwcmVzZW50IGluIHRoZSBuZXcgbW9kZWw/CgoKYGBge3J9CmRmXzlfMTAgPC0gZGZbLC1jKDksMTApXQptb2RlbF85XzEwIDwtIGxtKE1FRFYgfi4sIGRhdGE9ZGZfOV8xMCkKYnB0ZXN0KG1vZGVsXzlfMTApCmBgYAoKWWVzLCBpdCBpcy4uLgoK