Here, we check the class proportions of the target variable and whether multicollinearity exists among the predictors.
Multicollinearity
We would like the predictors to be independent of each other, meaning that none of them is highly correlated with the others. A high correlation between predictors therefore indicates multicollinearity. First, let's look at the information for each variable by calling the summary() function.
## energy1 energy2 energy3 energy4
## Min. :0.00150 Min. :0.00060 Min. :0.00150 Min. :0.00580
## 1st Qu.:0.01335 1st Qu.:0.01645 1st Qu.:0.01895 1st Qu.:0.02438
## Median :0.02280 Median :0.03080 Median :0.03430 Median :0.04405
## Mean :0.02916 Mean :0.03844 Mean :0.04383 Mean :0.05389
## 3rd Qu.:0.03555 3rd Qu.:0.04795 3rd Qu.:0.05795 3rd Qu.:0.06450
## Max. :0.13710 Max. :0.23390 Max. :0.30590 Max. :0.42640
## energy5 energy6 energy7 energy8
## Min. :0.00670 Min. :0.01020 Min. :0.0033 Min. :0.00550
## 1st Qu.:0.03805 1st Qu.:0.06703 1st Qu.:0.0809 1st Qu.:0.08042
## Median :0.06250 Median :0.09215 Median :0.1070 Median :0.11210
## Mean :0.07520 Mean :0.10457 Mean :0.1217 Mean :0.13480
## 3rd Qu.:0.10028 3rd Qu.:0.13412 3rd Qu.:0.1540 3rd Qu.:0.16960
## Max. :0.40100 Max. :0.38230 Max. :0.3729 Max. :0.45900
## energy9 energy10 energy11 energy12
## Min. :0.00750 Min. :0.0113 Min. :0.0289 Min. :0.0236
## 1st Qu.:0.09703 1st Qu.:0.1113 1st Qu.:0.1293 1st Qu.:0.1335
## Median :0.15225 Median :0.1824 Median :0.2248 Median :0.2490
## Mean :0.17800 Mean :0.2083 Mean :0.2360 Mean :0.2502
## 3rd Qu.:0.23342 3rd Qu.:0.2687 3rd Qu.:0.3016 3rd Qu.:0.3312
## Max. :0.68280 Max. :0.7106 Max. :0.7342 Max. :0.7060
## energy13 energy14 energy15 energy16
## Min. :0.0184 Min. :0.0273 Min. :0.0031 Min. :0.0162
## 1st Qu.:0.1661 1st Qu.:0.1752 1st Qu.:0.1646 1st Qu.:0.1963
## Median :0.2640 Median :0.2811 Median :0.2817 Median :0.3047
## Mean :0.2733 Mean :0.2966 Mean :0.3202 Mean :0.3785
## 3rd Qu.:0.3513 3rd Qu.:0.3862 3rd Qu.:0.4529 3rd Qu.:0.5357
## Max. :0.7131 Max. :0.9970 Max. :1.0000 Max. :0.9988
## energy17 energy18 energy19 energy20
## Min. :0.0349 Min. :0.0375 Min. :0.0494 Min. :0.0656
## 1st Qu.:0.2059 1st Qu.:0.2421 1st Qu.:0.2991 1st Qu.:0.3506
## Median :0.3084 Median :0.3683 Median :0.4350 Median :0.5425
## Mean :0.4160 Mean :0.4523 Mean :0.5048 Mean :0.5630
## 3rd Qu.:0.6594 3rd Qu.:0.6791 3rd Qu.:0.7314 3rd Qu.:0.8093
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## energy21 energy22 energy23 energy24
## Min. :0.0512 Min. :0.0219 Min. :0.0563 Min. :0.0239
## 1st Qu.:0.3997 1st Qu.:0.4069 1st Qu.:0.4502 1st Qu.:0.5407
## Median :0.6177 Median :0.6649 Median :0.6997 Median :0.6985
## Mean :0.6091 Mean :0.6243 Mean :0.6470 Mean :0.6727
## 3rd Qu.:0.8170 3rd Qu.:0.8320 3rd Qu.:0.8486 3rd Qu.:0.8722
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## energy25 energy26 energy27 energy28
## Min. :0.0240 Min. :0.0921 Min. :0.0481 Min. :0.0284
## 1st Qu.:0.5258 1st Qu.:0.5442 1st Qu.:0.5319 1st Qu.:0.5348
## Median :0.7211 Median :0.7545 Median :0.7456 Median :0.7319
## Mean :0.6754 Mean :0.6999 Mean :0.7022 Mean :0.6940
## 3rd Qu.:0.8737 3rd Qu.:0.8938 3rd Qu.:0.9171 3rd Qu.:0.9003
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## energy29 energy30 energy31 energy32
## Min. :0.0144 Min. :0.0613 Min. :0.0482 Min. :0.0404
## 1st Qu.:0.4637 1st Qu.:0.4114 1st Qu.:0.3456 1st Qu.:0.2814
## Median :0.6808 Median :0.6071 Median :0.4904 Median :0.4296
## Mean :0.6421 Mean :0.5809 Mean :0.5045 Mean :0.4390
## 3rd Qu.:0.8521 3rd Qu.:0.7352 3rd Qu.:0.6420 3rd Qu.:0.5803
## Max. :1.0000 Max. :1.0000 Max. :0.9657 Max. :0.9306
## energy33 energy34 energy35 energy36
## Min. :0.0477 Min. :0.0212 Min. :0.0223 Min. :0.0080
## 1st Qu.:0.2579 1st Qu.:0.2176 1st Qu.:0.1794 1st Qu.:0.1543
## Median :0.3912 Median :0.3510 Median :0.3127 Median :0.3211
## Mean :0.4172 Mean :0.4032 Mean :0.3926 Mean :0.3848
## 3rd Qu.:0.5561 3rd Qu.:0.5961 3rd Qu.:0.5934 3rd Qu.:0.5565
## Max. :1.0000 Max. :0.9647 Max. :1.0000 Max. :1.0000
## energy37 energy38 energy39 energy40
## Min. :0.0351 Min. :0.0383 Min. :0.0371 Min. :0.0117
## 1st Qu.:0.1601 1st Qu.:0.1743 1st Qu.:0.1740 1st Qu.:0.1865
## Median :0.3063 Median :0.3127 Median :0.2835 Median :0.2781
## Mean :0.3638 Mean :0.3397 Mean :0.3258 Mean :0.3112
## 3rd Qu.:0.5189 3rd Qu.:0.4405 3rd Qu.:0.4349 3rd Qu.:0.4244
## Max. :0.9497 Max. :1.0000 Max. :0.9857 Max. :0.9297
## energy41 energy42 energy43 energy44
## Min. :0.0360 Min. :0.0056 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.1631 1st Qu.:0.1589 1st Qu.:0.1552 1st Qu.:0.1269
## Median :0.2595 Median :0.2451 Median :0.2225 Median :0.1777
## Mean :0.2893 Mean :0.2783 Mean :0.2465 Mean :0.2141
## 3rd Qu.:0.3875 3rd Qu.:0.3842 3rd Qu.:0.3245 3rd Qu.:0.2717
## Max. :0.8995 Max. :0.8246 Max. :0.7733 Max. :0.7762
## energy45 energy46 energy47 energy48
## Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
## 1st Qu.:0.09448 1st Qu.:0.06855 1st Qu.:0.06425 1st Qu.:0.04512
## Median :0.14800 Median :0.12135 Median :0.10165 Median :0.07810
## Mean :0.19723 Mean :0.16063 Mean :0.12245 Mean :0.09142
## 3rd Qu.:0.23155 3rd Qu.:0.20037 3rd Qu.:0.15443 3rd Qu.:0.12010
## Max. :0.70340 Max. :0.72920 Max. :0.55220 Max. :0.33390
## energy49 energy50 energy51 energy52
## Min. :0.00000 Min. :0.00000 Min. :0.000000 Min. :0.000800
## 1st Qu.:0.02635 1st Qu.:0.01155 1st Qu.:0.008425 1st Qu.:0.007275
## Median :0.04470 Median :0.01790 Median :0.013900 Median :0.011400
## Mean :0.05193 Mean :0.02042 Mean :0.016069 Mean :0.013420
## 3rd Qu.:0.06853 3rd Qu.:0.02527 3rd Qu.:0.020825 3rd Qu.:0.016725
## Max. :0.19810 Max. :0.08250 Max. :0.100400 Max. :0.070900
## energy53 energy54 energy55 energy56
## Min. :0.000500 Min. :0.001000 Min. :0.00060 Min. :0.000400
## 1st Qu.:0.005075 1st Qu.:0.005375 1st Qu.:0.00415 1st Qu.:0.004400
## Median :0.009550 Median :0.009300 Median :0.00750 Median :0.006850
## Mean :0.010709 Mean :0.010941 Mean :0.00929 Mean :0.008222
## 3rd Qu.:0.014900 3rd Qu.:0.014500 3rd Qu.:0.01210 3rd Qu.:0.010575
## Max. :0.039000 Max. :0.035200 Max. :0.04470 Max. :0.039400
## energy57 energy58 energy59 energy60
## Min. :0.00030 Min. :0.000300 Min. :0.000100 Min. :0.000600
## 1st Qu.:0.00370 1st Qu.:0.003600 1st Qu.:0.003675 1st Qu.:0.003100
## Median :0.00595 Median :0.005800 Median :0.006400 Median :0.005300
## Mean :0.00782 Mean :0.007949 Mean :0.007941 Mean :0.006507
## 3rd Qu.:0.01043 3rd Qu.:0.010350 3rd Qu.:0.010325 3rd Qu.:0.008525
## Max. :0.03550 Max. :0.044000 Max. :0.036400 Max. :0.043900
## type
## M:111
## R: 97
##
##
##
##
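The `type` counts at the end of the summary (111 M, 97 R) can be turned into class proportions with `table()` and `prop.table()`. The sketch below is only illustrative: it rebuilds a stand-in `type` factor from those reported counts instead of loading the real data, and the object names are assumptions.

```r
# Stand-in for the target column, rebuilt from the counts in the summary
# above (111 "M" and 97 "R"); the real data frame is not loaded here.
type <- factor(rep(c("M", "R"), times = c(111, 97)))

classCounts <- table(type)             # absolute counts per class
classProp   <- prop.table(classCounts) # share of each class

print(classCounts)
print(round(classProp, 3))
```

With roughly 53% M and 47% R, the two classes are close to balanced.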
As explained in the Introduction, the values of the predictor variables range between 0 and 1, and the summary above shows that most variables share nearly the same range. Nevertheless, to make sure, let's look at the predictors in boxplots.
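A minimal base-R sketch of such a boxplot check is shown here. It runs on simulated values, not on the sonar data: `simX`, its dimensions (208 rows, 60 columns named `energy1` to `energy60`), and the uniform draws are all assumptions made only so the snippet is self-contained.

```r
set.seed(1)

# Simulated stand-in for the 60 predictors (NOT the real sonar values)
nObs  <- 208
nPred <- 60
simX <- matrix(runif(nObs * nPred), nrow = nObs,
               dimnames = list(NULL, paste0("energy", seq_len(nPred))))

# One box per predictor; rotate and shrink the axis labels so 60 names fit
boxplot(simX, las = 2, cex.axis = 0.6,
        ylab = "value", main = "Predictor distributions (simulated)")

# Numeric companion to the plot: overall range and any column leaving [0, 1]
valueRange <- range(simX)
outOfRange <- colnames(simX)[apply(simX, 2, function(x) any(x < 0 | x > 1))]
print(valueRange)
print(outOfRange)
```

Replacing `simX` with the numeric columns of the actual data frame would give the real picture.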
Even though the metadata says that all predictors range from 0 to 1, several of them take much smaller values than the majority. We also found no value above 1. Now, let's check the correlations between the predictors.
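One way to inspect the correlations is to compute the full correlation matrix with `cor()` and draw it as a heatmap. The sketch below uses base graphics and simulated data in which neighboring columns are deliberately made to overlap. The moving-average construction, `simX`, and `width` are assumptions for illustration, not the plotting method or data used for the original figure.

```r
set.seed(42)

nObs  <- 208
nPred <- 60
width <- 5  # each simulated column averages `width` adjacent noise columns

# Simulated stand-in: overlapping moving averages make neighbors correlated
noise <- matrix(rnorm(nObs * (nPred + width - 1)), nrow = nObs)
simX <- sapply(seq_len(nPred),
               function(j) rowMeans(noise[, j:(j + width - 1)]))
colnames(simX) <- paste0("energy", seq_len(nPred))

# Pearson correlation between every pair of columns
corMat <- cor(simX)

# Heatmap of the matrix; a bright band along the diagonal means that
# adjacent variables are strongly correlated
image(seq_len(nPred), seq_len(nPred), corMat,
      xlab = "variable index", ylab = "variable index",
      main = "Correlation matrix (simulated)")

# Correlation of each variable with its immediate right-hand neighbor
adjacentCor <- corMat[cbind(1:(nPred - 1), 2:nPred)]
print(summary(adjacentCor))
```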

The figure above reveals a distinctive pattern: each variable has a large positive correlation with its nearest neighbors. Since high correlation indicates multicollinearity, we need to break this pattern. We can drop the adjacent neighbors of each variable by keeping only the columns whose index is a multiple of 3, 4, 5, or 6. For example, with multiples of 3 we would take energy1, energy3, energy6, and so on up to energy60. As a start, I pick multiples of 4 to reduce the number of predictor variables.
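The selection can be sketched as follows. This assumes that the first column is kept together with every fourth column, which is an inference from the multiples-of-3 example (energy1, energy3, energy6, ...) and from the 120 pairs tested later in this section. The names `predictorNames`, `step`, and `keepIdx` are hypothetical.

```r
# Names of the 60 predictors, as shown in the summary output
predictorNames <- paste0("energy", 1:60)

step <- 4  # keep every 4th column

# Assumed rule: first column plus every index that is a multiple of `step`
keepIdx   <- c(1, seq(step, length(predictorNames), by = step))
keepNames <- predictorNames[keepIdx]

print(keepNames)
length(keepNames)             # 16 predictors remain
choose(length(keepNames), 2)  # 120 distinct pairs among them
```

Under this assumption 16 predictors remain, which gives 120 pairwise combinations.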

Nice. We have removed the highly correlated neighbors, so only variables with lower correlation remain. We should also check the p-value of a correlation test for each remaining pair.
# All pairwise combinations of predictor names (the last column, type, is excluded)
sonarComb <- combn(colnames(DF)[-ncol(DF)], 2)
nComb <- ncol(sonarComb)
Alpha <- 0.05 # significance level
# Pre-allocate the results; note that `cor` stores the p-value of the test
multicolRes <- data.frame(vs = character(nComb),
                          cor = numeric(nComb),
                          res = character(nComb),
                          stringsAsFactors = FALSE)
for (i in seq_len(nComb)) {
  multicolRes$vs[i] <- paste0(sonarComb[1, i], " & ", sonarComb[2, i])
  # Pearson correlation test between the two variables of this pair
  corTest <- cor.test(DF[[sonarComb[1, i]]], DF[[sonarComb[2, i]]])
  multicolRes$cor[i] <- corTest$p.value
  # "Yes" when the correlation is significant at level Alpha
  multicolRes$res[i] <- ifelse(corTest$p.value < Alpha, "Yes", "No")
}
head(multicolRes)
Too bad. The first few combinations show a significant correlation, which we flag as multicollinearity. Let's check the overall composition.
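The composition can be obtained by tabulating the `res` column. In the sketch below, `res` is a stand-in vector rebuilt from the counts reported in the output that follows (44 "No", 76 "Yes"), not the actual `multicolRes$res`.

```r
# Stand-in for multicolRes$res, rebuilt from the reported counts
res <- rep(c("No", "Yes"), times = c(44, 76))

resCounts <- table(res)             # number of pairs per outcome
resProp   <- prop.table(resCounts)  # share of pairs per outcome

print(resCounts)
print(resProp)
```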
##
## No Yes
## 44 76
##
## No Yes
## 0.3666667 0.6333333
Only 44 of the 120 combinations (about 37%) show no sign of multicollinearity. Nevertheless, we move forward with the variables selected so far.