Introduction

The text is motivated by Marcin Rutecki and Manimala, both of the Kaggle project.

Multicollinearity - definition

Multicollinearity means the situation, where the explanatory variables are highly pairwise correlated.

Consequences of Multicollinearity (according to Rutecki)

Multicollinearity could result in significant problems during model fitting. It can reduce the overall performance of regression and classification models:

The result is that the coefficient estimates are unstable and difficult to interpret. Multicollinearity saps the statistical power of the analysis, can cause the coefficients to switch signs, and makes it more difficult to specify the correct model.

Remedies to Multicollinearity problem (according to Rutecki)

  1. Increase the sample size. Enlarging the sample will introduce more variation in the data series, which reduces the effect of sampling error and helps increase precision when estimating various properties of the data. Increased sample sizes can reduce either the presence or the impact of multicollinearity, or both.

  2. Make some transformation of the data (for example, replace level data by their differences)

  3. Remove some of the highly correlated features.

  1. Use regularization methods such as RIDGE and LASSO or Bayesian regression. (I’ll skip it in this study)

Boston House Prices

Each record in the database describes a Boston suburb or town. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970. The attributes are defined as follows (taken from the UCI Machine Learning Repository1): CRIM: per capita crime rate by town

We can see that the input attributes have a mixture of units.

Computations

rm(list=ls())
library(lmtest)
library(car)
library(glmnet)
column_names <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV")
df <- read.table("adot/housing.data", header = FALSE, sep = "", col.names = column_names)
### Plotting the correlation between various columns
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
    par(usr = c(0, 1, 0, 1))
    r <- abs(cor(x, y))
    txt <- format(c(r, 0.123456789), digits = digits)[1]
    txt <- paste0(prefix, txt)
    if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt, cex = cex.cor * r)
}
pairs(df, lower.panel = panel.smooth, upper.panel = panel.cor,
      gap=0, row1attop=FALSE)

Overparametrized model

model_all <- lm(MEDV ~ ., data=df)  # with all the independent variables in the dataframe

summary(model_all)

Call:
lm(formula = MEDV ~ ., data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.595  -2.730  -0.518   1.777  26.199 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
CRIM        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
ZN           4.642e-02  1.373e-02   3.382 0.000778 ***
INDUS        2.056e-02  6.150e-02   0.334 0.738288    
CHAS         2.687e+00  8.616e-01   3.118 0.001925 ** 
NOX         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
RM           3.810e+00  4.179e-01   9.116  < 2e-16 ***
AGE          6.922e-04  1.321e-02   0.052 0.958229    
DIS         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
RAD          3.060e-01  6.635e-02   4.613 5.07e-06 ***
TAX         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
PTRATIO     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
B            9.312e-03  2.686e-03   3.467 0.000573 ***
LSTAT       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared:  0.7406,    Adjusted R-squared:  0.7338 
F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16

Multicollinearity testing

VIF

\[VIF_i = \frac{1}{1-R^2_i}\] niekedy sa hovorí, že tento ukazovateľ pre danú premennú nesmie byť väčší, ako 5, niekedy sa hovorí o hranici 10.

vif(model_all)
    CRIM       ZN    INDUS     CHAS      NOX       RM      AGE      DIS      RAD 
1.792192 2.298758 3.991596 1.073995 4.393720 1.933744 3.100826 3.955945 7.484496 
     TAX  PTRATIO        B    LSTAT 
9.008554 1.799084 1.348521 2.941491 

Vizualizácia VIF

vif_values <- vif(model_all)           #create vector of VIF values

barplot(vif_values, main = "VIF Values", horiz = TRUE, col = "steelblue") #create horizontal bar chart to display each VIF value

abline(v = 5, lwd = 3, lty = 2)    #add vertical line at 5 as after 5 there is severe correlation

NA
NA
df_9_10 <- df[,-c(9,10)]
model_9_10 <- lm(MEDV ~., data=df_9_10)
summary(model_9_10)

Call:
lm(formula = MEDV ~ ., data = df_9_10)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.7734  -2.7928  -0.6504   1.5477  27.9833 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  29.210710   4.893167   5.970 4.54e-09 ***
CRIM         -0.061435   0.030434  -2.019 0.044066 *  
ZN            0.041516   0.013551   3.064 0.002306 ** 
INDUS        -0.045831   0.055961  -0.819 0.413190    
CHAS          3.085016   0.871868   3.538 0.000441 ***
NOX         -14.594752   3.670109  -3.977 8.03e-05 ***
RM            4.134272   0.419341   9.859  < 2e-16 ***
AGE          -0.004295   0.013415  -0.320 0.748983    
DIS          -1.488851   0.203337  -7.322 9.98e-13 ***
PTRATIO      -0.811272   0.121647  -6.669 6.91e-11 ***
B             0.008205   0.002706   3.032 0.002558 ** 
LSTAT        -0.515812   0.051668  -9.983  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.839 on 494 degrees of freedom
Multiple R-squared:  0.7293,    Adjusted R-squared:  0.7232 
F-statistic:   121 on 11 and 494 DF,  p-value: < 2.2e-16
vif(model_9_10)
    CRIM       ZN    INDUS     CHAS      NOX       RM      AGE      DIS  PTRATIO 
1.478206 2.154483 3.179166 1.057805 3.901348 1.872532 3.075755 3.954443 1.496077 
       B    LSTAT 
1.316559 2.936487 

Is heteroskedasticity present in the new model?

df_9_10 <- df[,-c(9,10)]
model_9_10 <- lm(MEDV ~., data=df_9_10)
bptest(model_9_10)

    studentized Breusch-Pagan test

data:  model_9_10
BP = 50.806, df = 11, p-value = 4.481e-07

Yes, it is…

LS0tCnRpdGxlOiAiTXVsdGljb2xsaW5lYXJpdHkiCm91dHB1dDogaHRtbF9ub3RlYm9vawotLS0KIyMgSW50cm9kdWN0aW9uCgpUaGUgdGV4dCBpcyBtb3RpdmF0ZWQgYnkgW01hcmNpbiBSdXRlY2tpXShodHRwczovL3d3dy5rYWdnbGUuY29tL2NvZGUvbWFyY2lucnV0ZWNraS9tdWx0aWNvbGxpbmVhcml0eS1kZXRlY3Rpb24tYW5kLXJlbWVkaWVzKSBhbmQgW01hbmltYWxhXShodHRwczovL3d3dy5rYWdnbGUuY29tL2RhdGFzZXRzL3Zpa3Jpc2huYW4vYm9zdG9uLWhvdXNlLXByaWNlcyksIGJvdGggb2YgdGhlIFtLYWdnbGUgcHJvamVjdF0oaHR0cHM6Ly93d3cua2FnZ2xlLmNvbS8pLgoKIyMgTXVsdGljb2xsaW5lYXJpdHkgLSBkZWZpbml0aW9uCk11bHRpY29sbGluZWFyaXR5IG1lYW5zIHRoZSBzaXR1YXRpb24sIHdoZXJlIHRoZSBleHBsYW5hdG9yeSB2YXJpYWJsZXMgYXJlIGhpZ2hseSBwYWlyd2lzZSBjb3JyZWxhdGVkLgoKIyMgIENvbnNlcXVlbmNlcyBvZiBNdWx0aWNvbGxpbmVhcml0eSAoYWNjb3JkaW5nIHRvIFJ1dGVja2kpCgpNdWx0aWNvbGxpbmVhcml0eSBjb3VsZCByZXN1bHQgaW4gc2lnbmlmaWNhbnQgcHJvYmxlbXMgZHVyaW5nIG1vZGVsIGZpdHRpbmcuIEl0IGNhbiByZWR1Y2UgdGhlIG92ZXJhbGwgcGVyZm9ybWFuY2Ugb2YgcmVncmVzc2lvbiBhbmQgY2xhc3NpZmljYXRpb24gbW9kZWxzOgoKLSBkb2VzIG5vdCBpbmNyZWFzZSBiaWFzLCBidXQgaXQgY2FuIGluY3JlYXNlIHZhcmlhbmNlIChvdmVyZml0dGluZyk7Ci0gbWFrZSB0aGUgZXN0aW1hdGVzIHZlcnkgc2Vuc2l0aXZlIHRvIG1pbm9yIGNoYW5nZXMgaW4gdGhlIG1vZGVsOwotIGRvZXNuJ3QgYWZmZWN0IHRoZSBwcmVkaWN0aXZlIHBvd2VyLCBidXQgaW5kaXZpZHVhbCBwcmVkaWN0b3IgdmFyaWFibGUncyBpbXBhY3Qgb24gdGhlIHJlc3BvbnNlIHZhcmlhYmxlIGNvdWxkIGJlIGNhbGN1bGF0ZWQgd3JvbmdseS4KClRoZSByZXN1bHQgaXMgdGhhdCB0aGUgY29lZmZpY2llbnQgZXN0aW1hdGVzIGFyZSB1bnN0YWJsZSBhbmQgZGlmZmljdWx0IHRvIGludGVycHJldC4gTXVsdGljb2xsaW5lYXJpdHkgc2FwcyB0aGUgc3RhdGlzdGljYWwgcG93ZXIgb2YgdGhlIGFuYWx5c2lzLCBjYW4gY2F1c2UgdGhlIGNvZWZmaWNpZW50cyB0byBzd2l0Y2ggc2lnbnMsIGFuZCBtYWtlcyBpdCBtb3JlIGRpZmZpY3VsdCB0byBzcGVjaWZ5IHRoZSBjb3JyZWN0IG1vZGVsLgoKIyMgUmVtZWRpZXMgdG8gTXVsdGljb2xsaW5lYXJpdHkgcHJvYmxlbSAoYWNjb3JkaW5nIHRvIFJ1dGVja2kpCjEuIEluY3JlYXNlIHRoZSBzYW1wbGUgc2l6ZS4gRW5sYXJnaW5nIHRoZSBzYW1wbGUgd2lsbCBpbnRyb2R1Y2UgbW9yZSB2YXJpYXRpb24gaW4gdGhlIGRhdGEgc2VyaWVzLCB3aGljaCByZWR1Y2VzIHRoZSBlZmZlY3Qgb2Ygc2FtcGxpbmcgZXJyb3IgYW5kIGhlbHBzIGluY3JlYXNlIHByZWNpc2lvbiB3aGVuIGVzdGltYXRpbmcgdmFyaW91cyBwcm9wZXJ0aWVzIG9mIHRoZSBkYXRhLiBJbmNyZWFzZWQgc2FtcGxlIHNpemVzIGNhbiByZWR1Y2UgZWl0aGVyIHRoZSBwcmVzZW5jZSBvciB0aGUgaW1wYWN0IG9mIG11bHRpY29sbGluZWFyaXR5LCBvciBib3RoLgoKMi4gTWFrZSBzb21lIHRyYW5zZm9ybWF0aW9uIG9mIHRoZSBkYXRhIChmb3IgZXhhbXBsZSwgcmVwbGFjZSBsZXZlbCBkYXRhIGJ5IHRoZWlyIGRpZmZlcmVuY2VzKQoKMy4gUmVtb3ZlIHNvbWUgb2YgdGhlIGhpZ2hseSBjb3JyZWxhdGVkIGZlYXR1cmVzLgoKLSBNYW51YWwgTWV0aG9kIC0gVmFyaWFuY2UgSW5mbGF0aW9uIEZhY3RvciAoVklGKQotIEF1dG9tYXRpYyBNZXRob2QgLSBSZWN1cnNpdmUgRmVhdHVyZSBFbGltaW5hdGlvbiAoUkZFKSAoSSdsbCBza2lwIGl0IGluIHRoaXMgc3R1ZHkpCi0gRmVhdHVyZSBFbG1pbmF0aW9uIHVzaW5nIFBDQSBEZWNvbXBvc2l0aW9uIChJJ2xsIHNraXAgaXQgaW4gdGhpcyBzdHVkeSkKLSBSZXBsYWNlIGhpZ2hseSBjb3JyZWxhdGVkIHJlZ3Jlc3NvcnMgd2l0aCBhIGxpbmVhciBjb21iaW5hdGlvbiBvZiB0aGVtLiAoSSdsbCBza2lwIGl0IGluIHRoaXMgc3R1ZHkpCgozLiBVc2UgcmVndWxhcml6YXRpb24gbWV0aG9kcyBzdWNoIGFzIFJJREdFIGFuZCBMQVNTTyBvciBCYXllc2lhbiByZWdyZXNzaW9uLiAoSSdsbCBza2lwIGl0IGluIHRoaXMgc3R1ZHkpCgojIyBCb3N0b24gSG91c2UgUHJpY2VzCgpFYWNoIHJlY29yZCBpbiB0aGUgZGF0YWJhc2UgZGVzY3JpYmVzIGEgQm9zdG9uIHN1YnVyYiBvciB0b3duLiBUaGUgZGF0YSB3YXMgZHJhd24gZnJvbSB0aGUgQm9zdG9uIFN0YW5kYXJkIE1ldHJvcG9saXRhbiBTdGF0aXN0aWNhbCBBcmVhIChTTVNBKSBpbiAxOTcwLiBUaGUgYXR0cmlidXRlcyBhcmUgZGXvrIFuZWQgYXMgZm9sbG93cyAodGFrZW4gZnJvbSB0aGUgVUNJIE1hY2hpbmUgTGVhcm5pbmcgUmVwb3NpdG9yeTEpOiBDUklNOiBwZXIgY2FwaXRhIGNyaW1lIHJhdGUgYnkgdG93bgoKLSBaTjogcHJvcG9ydGlvbiBvZiByZXNpZGVudGlhbCBsYW5kIHpvbmVkIGZvciBsb3RzIG92ZXIgMjUsMDAwIHNxLmZ0LgotIElORFVTOiBwcm9wb3J0aW9uIG9mIG5vbi1yZXRhaWwgYnVzaW5lc3MgYWNyZXMgcGVyIHRvd24KLSBDSEFTOiBDaGFybGVzIFJpdmVyIGR1bW15IHZhcmlhYmxlICg9IDEgaWYgdHJhY3QgYm91bmRzIHJpdmVyOyAwIG90aGVyd2lzZSkKLSBOT1g6IG5pdHJpYyBveGlkZXMgY29uY2VudHJhdGlvbiAocGFydHMgcGVyIDEwIG1pbGxpb24pCi0gUk06IGF2ZXJhZ2UgbnVtYmVyIG9mIHJvb21zIHBlciBkd2VsbGluZwotIEFHRTogcHJvcG9ydGlvbiBvZiBvd25lci1vY2N1cGllZCB1bml0cyBidWlsdCBwcmlvciB0byAxOTQwCi0gRElTOiB3ZWlnaHRlZCBkaXN0YW5jZXMgdG8g76yBdmUgQm9zdG9uIGVtcGxveW1lbnQgY2VudGVycwotIFJBRDogaW5kZXggb2YgYWNjZXNzaWJpbGl0eSB0byByYWRpYWwgaGlnaHdheXMKLSBUQVg6IGZ1bGwtdmFsdWUgcHJvcGVydHktdGF4IHJhdGUgcGVyICQxMCwwMDAKLSBQVFJBVElPOiBwdXBpbC10ZWFjaGVyIHJhdGlvIGJ5IHRvd24gMTIuIEI6IDEwMDAoQmviiJIwLjYzKTIgd2hlcmUgQmsgaXMgdGhlIHByb3BvcnRpb24gb2YgYmxhY2tzIGJ5IHRvd24gMTMuIExTVEFUOiAlIGxvd2VyIHN0YXR1cyBvZiB0aGUgcG9wdWxhdGlvbgotIE1FRFY6IE1lZGlhbiB2YWx1ZSBvZiBvd25lci1vY2N1cGllZCBob21lcyBpbiAkMTAwMHMKCldlIGNhbiBzZWUgdGhhdCB0aGUgaW5wdXQgYXR0cmlidXRlcyBoYXZlIGEgbWl4dHVyZSBvZiB1bml0cy4KCiMjIENvbXB1dGF0aW9ucwoKYGBge3IgbGlicmFyaWVzfQpybShsaXN0PWxzKCkpCmxpYnJhcnkobG10ZXN0KQpsaWJyYXJ5KGNhcikKbGlicmFyeShnbG1uZXQpCmBgYAoKCgpgYGB7cn0KY29sdW1uX25hbWVzIDwtIGMoIkNSSU0iLCAiWk4iLCAiSU5EVVMiLCAiQ0hBUyIsICJOT1giLCAiUk0iLCAiQUdFIiwgIkRJUyIsICJSQUQiLCAiVEFYIiwgIlBUUkFUSU8iLCAiQiIsICJMU1RBVCIsICJNRURWIikKZGYgPC0gcmVhZC50YWJsZSgiYWRvdC9ob3VzaW5nLmRhdGEiLCBoZWFkZXIgPSBGQUxTRSwgc2VwID0gIiIsIGNvbC5uYW1lcyA9IGNvbHVtbl9uYW1lcykKCmBgYAoKCgpgYGB7cn0KIyMjIFBsb3R0aW5nIHRoZSBjb3JyZWxhdGlvbiBiZXR3ZWVuIHZhcmlvdXMgY29sdW1ucwpwYW5lbC5jb3IgPC0gZnVuY3Rpb24oeCwgeSwgZGlnaXRzID0gMiwgcHJlZml4ID0gIiIsIGNleC5jb3IsIC4uLikKewogICAgcGFyKHVzciA9IGMoMCwgMSwgMCwgMSkpCiAgICByIDwtIGFicyhjb3IoeCwgeSkpCiAgICB0eHQgPC0gZm9ybWF0KGMociwgMC4xMjM0NTY3ODkpLCBkaWdpdHMgPSBkaWdpdHMpWzFdCiAgICB0eHQgPC0gcGFzdGUwKHByZWZpeCwgdHh0KQogICAgaWYobWlzc2luZyhjZXguY29yKSkgY2V4LmNvciA8LSAwLjgvc3Ryd2lkdGgodHh0KQogICAgdGV4dCgwLjUsIDAuNSwgdHh0LCBjZXggPSBjZXguY29yICogcikKfQpwYWlycyhkZiwgbG93ZXIucGFuZWwgPSBwYW5lbC5zbW9vdGgsIHVwcGVyLnBhbmVsID0gcGFuZWwuY29yLAogICAgICBnYXA9MCwgcm93MWF0dG9wPUZBTFNFKQpgYGAKCgojIyBPdmVycGFyYW1ldHJpemVkIG1vZGVsCgpgYGB7cn0KbW9kZWxfYWxsIDwtIGxtKE1FRFYgfiAuLCBkYXRhPWRmKSAgIyB3aXRoIGFsbCB0aGUgaW5kZXBlbmRlbnQgdmFyaWFibGVzIGluIHRoZSBkYXRhZnJhbWUKCnN1bW1hcnkobW9kZWxfYWxsKQpgYGAKCiMjIE11bHRpY29sbGluZWFyaXR5IHRlc3RpbmcKCiMjIyBWSUYKCiQkVklGX2kgPSBcZnJhY3sxfXsxLVJeMl9pfSQkCm5pZWtlZHkgc2EgaG92b3LDrSwgxb5lIHRlbnRvIHVrYXpvdmF0ZcS+IHByZSBkYW7DuiBwcmVtZW5uw7ogbmVzbWllIGJ5xaUgdsOkxI3FocOtLCBha28gNSwgbmlla2VkeSBzYSBob3ZvcsOtIG8gaHJhbmljaSAxMC4KCmBgYHtyfQp2aWYobW9kZWxfYWxsKQpgYGAKClZpenVhbGl6w6FjaWEgVklGCgpgYGB7cn0KdmlmX3ZhbHVlcyA8LSB2aWYobW9kZWxfYWxsKSAgICAgICAgICAgI2NyZWF0ZSB2ZWN0b3Igb2YgVklGIHZhbHVlcwoKYmFycGxvdCh2aWZfdmFsdWVzLCBtYWluID0gIlZJRiBWYWx1ZXMiLCBob3JpeiA9IFRSVUUsIGNvbCA9ICJzdGVlbGJsdWUiKSAjY3JlYXRlIGhvcml6b250YWwgYmFyIGNoYXJ0IHRvIGRpc3BsYXkgZWFjaCBWSUYgdmFsdWUKCmFibGluZSh2ID0gNSwgbHdkID0gMywgbHR5ID0gMikgICAgI2FkZCB2ZXJ0aWNhbCBsaW5lIGF0IDUgYXMgYWZ0ZXIgNSB0aGVyZSBpcyBzZXZlcmUgY29ycmVsYXRpb24KCgpgYGAKCmBgYHtyfQpkZl85XzEwIDwtIGRmWywtYyg5LDEwKV0KbW9kZWxfOV8xMCA8LSBsbShNRURWIH4uLCBkYXRhPWRmXzlfMTApCnN1bW1hcnkobW9kZWxfOV8xMCkKYGBgCgpgYGB7cn0KdmlmKG1vZGVsXzlfMTApCmBgYAoKCiMjIElzIGhldGVyb3NrZWRhc3RpY2l0eSBwcmVzZW50IGluIHRoZSBuZXcgbW9kZWw/CgoKYGBge3J9CmRmXzlfMTAgPC0gZGZbLC1jKDksMTApXQptb2RlbF85XzEwIDwtIGxtKE1FRFYgfi4sIGRhdGE9ZGZfOV8xMCkKYnB0ZXN0KG1vZGVsXzlfMTApCmBgYAoKWWVzLCBpdCBpcy4uLgoK