The text is motivated by Marcin Rutecki and Manimala, both of the Kaggle project.
Multicollinearity means the situation, where the explanatory variables are highly pairwise correlated.
Multicollinearity could result in significant problems during model fitting. It can reduce the overall performance of regression and classification models:
The result is that the coefficient estimates are unstable and difficult to interpret. Multicollinearity saps the statistical power of the analysis, can cause the coefficients to switch signs, and makes it more difficult to specify the correct model.
Increase the sample size. Enlarging the sample will introduce more variation in the data series, which reduces the effect of sampling error and helps increase precision when estimating various properties of the data. Increased sample sizes can reduce either the presence or the impact of multicollinearity, or both.
Make some transformation of the data (for example, replace level data by their differences)
Remove some of the highly correlated features.
Each record in the database describes a Boston suburb or town. The data was drawn from the Boston Standard Metropolitan Statistical Area (SMSA) in 1970. The attributes are defined as follows (taken from the UCI Machine Learning Repository1): CRIM: per capita crime rate by town
We can see that the input attributes have a mixture of units.
rm(list=ls())
library(lmtest)
library(car)
library(glmnet)
column_names <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV")
df <- read.table("adot/housing.data", header = FALSE, sep = "", col.names = column_names)
### Plotting the correlation between various columns
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
{
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
pairs(df, lower.panel = panel.smooth, upper.panel = panel.cor,
gap=0, row1attop=FALSE)
model_all <- lm(MEDV ~ ., data=df) # with all the independent variables in the dataframe
summary(model_all)
Call:
lm(formula = MEDV ~ ., data = df)
Residuals:
Min 1Q Median 3Q Max
-15.595 -2.730 -0.518 1.777 26.199
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
CRIM -1.080e-01 3.286e-02 -3.287 0.001087 **
ZN 4.642e-02 1.373e-02 3.382 0.000778 ***
INDUS 2.056e-02 6.150e-02 0.334 0.738288
CHAS 2.687e+00 8.616e-01 3.118 0.001925 **
NOX -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
RM 3.810e+00 4.179e-01 9.116 < 2e-16 ***
AGE 6.922e-04 1.321e-02 0.052 0.958229
DIS -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
RAD 3.060e-01 6.635e-02 4.613 5.07e-06 ***
TAX -1.233e-02 3.760e-03 -3.280 0.001112 **
PTRATIO -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
B 9.312e-03 2.686e-03 3.467 0.000573 ***
LSTAT -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
print(paste("AIC = ", AIC(model_all)))
[1] "AIC = 3027.60859407555"
resettest(model_all)
RESET test
data: model_all
RESET = 97.853, df1 = 2, df2 = 490, p-value < 2.2e-16
model <<- lm(MEDV ~ +1 +CRIM + ZN + INDUS + CHAS + NOX + RM + AGE + DIS + RAD + TAX + PTRATIO + B + LSTAT )
summary(model)
Call:
lm(formula = MEDV ~ +1 + CRIM + ZN + INDUS + CHAS + NOX + RM +
AGE + DIS + RAD + TAX + PTRATIO + B + LSTAT)
Residuals:
Min 1Q Median 3Q Max
-15.595 -2.730 -0.518 1.777 26.199
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
CRIM -1.080e-01 3.286e-02 -3.287 0.001087 **
ZN 4.642e-02 1.373e-02 3.382 0.000778 ***
INDUS 2.056e-02 6.150e-02 0.334 0.738288
CHAS 2.687e+00 8.616e-01 3.118 0.001925 **
NOX -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
RM 3.810e+00 4.179e-01 9.116 < 2e-16 ***
AGE 6.922e-04 1.321e-02 0.052 0.958229
DIS -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
RAD 3.060e-01 6.635e-02 4.613 5.07e-06 ***
TAX -1.233e-02 3.760e-03 -3.280 0.001112 **
PTRATIO -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
B 9.312e-03 2.686e-03 3.467 0.000573 ***
LSTAT -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
print(paste("AIC = ", AIC(model_all)))
[1] "AIC = 3027.60859407555"
print(paste("Ramsey Reset Test = ",resettest(model)))
[1] "Ramsey Reset Test = c(RESET = 97.8531762010485)"
[2] "Ramsey Reset Test = c(df1 = 2, df2 = 490)"
[3] "Ramsey Reset Test = RESET test"
[4] "Ramsey Reset Test = 1.7546369513621e-36"
[5] "Ramsey Reset Test = model"
model <<- lm(MEDV ~ +1 +CRIM + ZN + INDUS + CHAS + NOX + RM + DIS + RAD + TAX + PTRATIO + B + LSTAT )
summary(model)
Call:
lm(formula = MEDV ~ +1 + CRIM + ZN + INDUS + CHAS + NOX + RM +
DIS + RAD + TAX + PTRATIO + B + LSTAT)
Residuals:
Min 1Q Median 3Q Max
-15.6054 -2.7313 -0.5188 1.7601 26.2243
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.436927 5.080119 7.172 2.72e-12 ***
CRIM -0.108006 0.032832 -3.290 0.001075 **
ZN 0.046334 0.013613 3.404 0.000719 ***
INDUS 0.020562 0.061433 0.335 0.737989
CHAS 2.689026 0.859598 3.128 0.001863 **
NOX -17.713540 3.679308 -4.814 1.97e-06 ***
RM 3.814394 0.408480 9.338 < 2e-16 ***
DIS -1.478612 0.190611 -7.757 5.03e-14 ***
RAD 0.305786 0.066089 4.627 4.75e-06 ***
TAX -0.012329 0.003755 -3.283 0.001099 **
PTRATIO -0.952211 0.130294 -7.308 1.10e-12 ***
B 0.009321 0.002678 3.481 0.000544 ***
LSTAT -0.523852 0.047625 -10.999 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.74 on 493 degrees of freedom
Multiple R-squared: 0.7406, Adjusted R-squared: 0.7343
F-statistic: 117.3 on 12 and 493 DF, p-value: < 2.2e-16
print(paste("AIC = ", AIC(model)))
[1] "AIC = 3025.61141822068"
print(paste("Ramsey Reset Test = ",resettest(model)))
[1] "Ramsey Reset Test = c(RESET = 97.7759346558843)"
[2] "Ramsey Reset Test = c(df1 = 2, df2 = 491)"
[3] "Ramsey Reset Test = RESET test"
[4] "Ramsey Reset Test = 1.80799788716649e-36"
[5] "Ramsey Reset Test = model"
model <<- lm(MEDV ~ +1 +CRIM + ZN + CHAS + NOX + RM + DIS + RAD + TAX + PTRATIO + B + LSTAT )
summary(model)
Call:
lm(formula = MEDV ~ +1 + CRIM + ZN + CHAS + NOX + RM + DIS +
RAD + TAX + PTRATIO + B + LSTAT)
Residuals:
Min 1Q Median 3Q Max
-15.5984 -2.7386 -0.5046 1.7273 26.2373
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.341145 5.067492 7.171 2.73e-12 ***
CRIM -0.108413 0.032779 -3.307 0.001010 **
ZN 0.045845 0.013523 3.390 0.000754 ***
CHAS 2.718716 0.854240 3.183 0.001551 **
NOX -17.376023 3.535243 -4.915 1.21e-06 ***
RM 3.801579 0.406316 9.356 < 2e-16 ***
DIS -1.492711 0.185731 -8.037 6.84e-15 ***
RAD 0.299608 0.063402 4.726 3.00e-06 ***
TAX -0.011778 0.003372 -3.493 0.000521 ***
PTRATIO -0.946525 0.129066 -7.334 9.24e-13 ***
B 0.009291 0.002674 3.475 0.000557 ***
LSTAT -0.522553 0.047424 -11.019 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.736 on 494 degrees of freedom
Multiple R-squared: 0.7406, Adjusted R-squared: 0.7348
F-statistic: 128.2 on 11 and 494 DF, p-value: < 2.2e-16
print(paste("AIC = ", AIC(model)))
[1] "AIC = 3023.72638782506"
print(paste("Ramsey Reset Test = ",resettest(model)))
[1] "Ramsey Reset Test = c(RESET = 95.8321519471369)"
[2] "Ramsey Reset Test = c(df1 = 2, df2 = 492)"
[3] "Ramsey Reset Test = RESET test"
[4] "Ramsey Reset Test = 7.11312960358601e-36"
[5] "Ramsey Reset Test = model"
model <<- lm(I(log(MEDV)) ~ +1 +CRIM + ZN + CHAS + NOX + RM + DIS + RAD + TAX + PTRATIO + B + LSTAT )
summary(model)
Call:
lm(formula = I(log(MEDV)) ~ +1 + CRIM + ZN + CHAS + NOX + RM +
DIS + RAD + TAX + PTRATIO + B + LSTAT)
Residuals:
Min 1Q Median 3Q Max
-0.73400 -0.09460 -0.01771 0.09782 0.86290
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.0836823 0.2030491 20.112 < 2e-16 ***
CRIM -0.0103187 0.0013134 -7.856 2.49e-14 ***
ZN 0.0010874 0.0005418 2.007 0.045308 *
CHAS 0.1051484 0.0342285 3.072 0.002244 **
NOX -0.7217440 0.1416535 -5.095 4.97e-07 ***
RM 0.0906728 0.0162807 5.569 4.20e-08 ***
DIS -0.0517059 0.0074420 -6.948 1.18e-11 ***
RAD 0.0134457 0.0025405 5.293 1.82e-07 ***
TAX -0.0005579 0.0001351 -4.129 4.28e-05 ***
PTRATIO -0.0374259 0.0051715 -7.237 1.77e-12 ***
B 0.0004127 0.0001071 3.852 0.000133 ***
LSTAT -0.0286039 0.0019002 -15.053 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1898 on 494 degrees of freedom
Multiple R-squared: 0.7891, Adjusted R-squared: 0.7844
F-statistic: 168.1 on 11 and 494 DF, p-value: < 2.2e-16
print(paste("AIC = ", AIC(model)))
[1] "AIC = -232.032859586876"
print(paste("Ramsey Reset Test = ",resettest(model)))
[1] "Ramsey Reset Test = c(RESET = 14.4764626320061)"
[2] "Ramsey Reset Test = c(df1 = 2, df2 = 492)"
[3] "Ramsey Reset Test = RESET test"
[4] "Ramsey Reset Test = 7.78016750024865e-07"
[5] "Ramsey Reset Test = model"
model <<- lm(log(MEDV) ~ +1 +CRIM + ZN + CHAS + I(NOX^2) + NOX + RM + DIS + RAD + TAX + PTRATIO + B + LSTAT )
summary(model)
Call:
lm(formula = log(MEDV) ~ +1 + CRIM + ZN + CHAS + I(NOX^2) + NOX +
RM + DIS + RAD + TAX + PTRATIO + B + LSTAT)
Residuals:
Min 1Q Median 3Q Max
-0.73285 -0.09675 -0.01589 0.09603 0.86221
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.9009807 0.3531997 11.045 < 2e-16 ***
CRIM -0.0103083 0.0013143 -7.843 2.75e-14 ***
ZN 0.0011882 0.0005651 2.103 0.036009 *
CHAS 0.1073962 0.0344333 3.119 0.001921 **
I(NOX^2) -0.4926811 0.7790968 -0.632 0.527435
NOX -0.0771932 1.0290625 -0.075 0.940235
RM 0.0893266 0.0164291 5.437 8.53e-08 ***
DIS -0.0492330 0.0084110 -5.853 8.79e-09 ***
RAD 0.0133104 0.0025510 5.218 2.67e-07 ***
TAX -0.0005623 0.0001354 -4.153 3.86e-05 ***
PTRATIO -0.0381504 0.0053000 -7.198 2.29e-12 ***
B 0.0004087 0.0001074 3.805 0.000159 ***
LSTAT -0.0286811 0.0019053 -15.053 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1899 on 493 degrees of freedom
Multiple R-squared: 0.7893, Adjusted R-squared: 0.7842
F-statistic: 153.9 on 12 and 493 DF, p-value: < 2.2e-16
print(paste("AIC = ", AIC(model)))
[1] "AIC = -230.443135964678"
print(paste("Ramsey Reset Test = ",resettest(model)))
[1] "Ramsey Reset Test = c(RESET = 14.6768475258417)"
[2] "Ramsey Reset Test = c(df1 = 2, df2 = 491)"
[3] "Ramsey Reset Test = RESET test"
[4] "Ramsey Reset Test = 6.44450398199586e-07"
[5] "Ramsey Reset Test = model"
\[VIF_i = \frac{1}{1-R^2_i}\] niekedy sa hovorí, že tento ukazovateľ pre danú premennú nesmie byť väčší, ako 5, niekedy sa hovorí o hranici 10.
vif(model_all)
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD
1.792192 2.298758 3.991596 1.073995 4.393720 1.933744 3.100826 3.955945 7.484496
TAX PTRATIO B LSTAT
9.008554 1.799084 1.348521 2.941491
Vizualizácia VIF
vif_values <- vif(model_all) #create vector of VIF values
barplot(vif_values, main = "VIF Values", horiz = TRUE, col = "steelblue") #create horizontal bar chart to display each VIF value
abline(v = 5, lwd = 3, lty = 2) #add vertical line at 5 as after 5 there is severe correlation
NA
NA
df_9_10 <- df[,-c(9,10)]
model_9_10 <- lm(MEDV ~., data=df_9_10)
summary(model_9_10)
Call:
lm(formula = MEDV ~ ., data = df_9_10)
Residuals:
Min 1Q Median 3Q Max
-15.7734 -2.7928 -0.6504 1.5477 27.9833
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 29.210710 4.893167 5.970 4.54e-09 ***
CRIM -0.061435 0.030434 -2.019 0.044066 *
ZN 0.041516 0.013551 3.064 0.002306 **
INDUS -0.045831 0.055961 -0.819 0.413190
CHAS 3.085016 0.871868 3.538 0.000441 ***
NOX -14.594752 3.670109 -3.977 8.03e-05 ***
RM 4.134272 0.419341 9.859 < 2e-16 ***
AGE -0.004295 0.013415 -0.320 0.748983
DIS -1.488851 0.203337 -7.322 9.98e-13 ***
PTRATIO -0.811272 0.121647 -6.669 6.91e-11 ***
B 0.008205 0.002706 3.032 0.002558 **
LSTAT -0.515812 0.051668 -9.983 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.839 on 494 degrees of freedom
Multiple R-squared: 0.7293, Adjusted R-squared: 0.7232
F-statistic: 121 on 11 and 494 DF, p-value: < 2.2e-16
vif(model_9_10)
CRIM ZN INDUS CHAS NOX RM AGE DIS PTRATIO
1.478206 2.154483 3.179166 1.057805 3.901348 1.872532 3.075755 3.954443 1.496077
B LSTAT
1.316559 2.936487
df_9_10 <- df[,-c(9,10)]
model_9_10 <- lm(MEDV ~., data=df_9_10)
bptest(model_9_10)
studentized Breusch-Pagan test
data: model_9_10
BP = 50.806, df = 11, p-value = 4.481e-07
Yes, it is…