1 Introduction

This article aims to accomplish Classification in Machine Learning 1 course at Algoritma. The dataset used is obtained from the University of California at Irvine Machine Learning Repository, “Connectionist Bench (Sonar, Mines vs. Rocks)”. You could see the source code fully in my GitHub account here.

1.1 Aim

The goal is to model a prediction to decide between mines and rocks based on sonar signals data.

1.2 Objectives

To compare between Logistic Regression and K-nearest neigbors performance
To assess the models by a binary classification metric, i.e. confusion matrix.
To interpretate the model (only for Logistic Regression)
To test the model using dataset test and discuss the results

1.3 Structure

This article is arranged as follows.

Introduction
Metadata
Preparation
Exploratory Data Analysis
Modelling and Predictions
Evaluation
Model Tuning
Discussions
Conclusions

2 Metadata

2.1 Content

The task is to train a network to discriminate between sonar signals bounced off a metal cylinder and those bounced off a roughly cylindrical rock. The file “sonar.mines” contains 111 patterns obtained by bouncing sonar signals off a metal cylinder at various angles and under various conditions. The file “sonar.rocks” contains 97 patterns obtained from rocks under similar conditions. The transmitted sonar signal is a frequency-modulated chirp, rising in frequency. The data set contains signals obtained from a variety of different aspect angles, spanning 90 degrees for the cylinder and 180 degrees for the rock.

Each pattern is a set of 60 numbers in the range 0.0 to 1.0. Each number represents the energy within a particular frequency band, integrated over a certain period of time. The integration aperture for higher frequencies occur later in time, since these frequencies are transmitted later during the chirp.

The label associated with each record contains the letter “R” if the object is a rock and “M” if it is a mine (metal cylinder). The numbers in the labels are in increasing order of aspect angle, but they do not encode the angle directly.

2.2 Relevant Paper

Gorman, R. P., and Sejnowski, T. J. (1988). “Analysis of Hidden Units in a Layered Network Trained to Classify Sonar Targets” in Neural Networks, Vol. 1, pp. 75-89. Web Link

3 Preparation

Load all necessary packages.

library(tidyverse) # for data wrangling
library(plotly) # for plotting using interactive plotly style
library(ggcorrplot) # for plotting correlation
library(GGally) # for plotting correlation
library(caret) # for confusion matrix
library(gtools) # for converting log of odds to probs
library(PerformanceAnalytics) # for pair plotting
library(car) # for executing VIF test
library(rsample) # for splitting dataset into train and test with controlled proportion
library(class) # for KNN
library(ROCR) # for calculating ROC
library(MLmetrics) # for calculating accuracy

Import the dataset.

sonar <- read.csv("sonar.all-data", header = F)
head(sonar)

Since the dataset originally have no any column name, we should set it. As described in Introduction, the dataset consists of 60 numbers of energy and 1 target variable. Therefore, for such 60 numbers, we could name them as energy1, energy, so on and so forth, and for the target, we could simply name it as type.

colnames(sonar) <- c(paste0("energy",c(1:60)), "type")
head(sonar)

4 Data Wrangling

Firstly, let’s check if NA exists.

anyNA(sonar)

## [1] FALSE

Unsurprisingly, as stated in the UCI Repository that the dataset does not contain missing values, we clearly did not find any NA value. Next, let’s see the dataset structure.

str(sonar)

## 'data.frame':    208 obs. of  61 variables:
##  $ energy1 : num  0.02 0.0453 0.0262 0.01 0.0762 0.0286 0.0317 0.0519 0.0223 0.0164 ...
##  $ energy2 : num  0.0371 0.0523 0.0582 0.0171 0.0666 0.0453 0.0956 0.0548 0.0375 0.0173 ...
##  $ energy3 : num  0.0428 0.0843 0.1099 0.0623 0.0481 ...
##  $ energy4 : num  0.0207 0.0689 0.1083 0.0205 0.0394 ...
##  $ energy5 : num  0.0954 0.1183 0.0974 0.0205 0.059 ...
##  $ energy6 : num  0.0986 0.2583 0.228 0.0368 0.0649 ...
##  $ energy7 : num  0.154 0.216 0.243 0.11 0.121 ...
##  $ energy8 : num  0.16 0.348 0.377 0.128 0.247 ...
##  $ energy9 : num  0.3109 0.3337 0.5598 0.0598 0.3564 ...
##  $ energy10: num  0.211 0.287 0.619 0.126 0.446 ...
##  $ energy11: num  0.1609 0.4918 0.6333 0.0881 0.4152 ...
##  $ energy12: num  0.158 0.655 0.706 0.199 0.395 ...
##  $ energy13: num  0.2238 0.6919 0.5544 0.0184 0.4256 ...
##  $ energy14: num  0.0645 0.7797 0.532 0.2261 0.4135 ...
##  $ energy15: num  0.066 0.746 0.648 0.173 0.453 ...
##  $ energy16: num  0.227 0.944 0.693 0.213 0.533 ...
##  $ energy17: num  0.31 1 0.6759 0.0693 0.7306 ...
##  $ energy18: num  0.3 0.887 0.755 0.228 0.619 ...
##  $ energy19: num  0.508 0.802 0.893 0.406 0.203 ...
##  $ energy20: num  0.48 0.782 0.862 0.397 0.464 ...
##  $ energy21: num  0.578 0.521 0.797 0.274 0.415 ...
##  $ energy22: num  0.507 0.405 0.674 0.369 0.429 ...
##  $ energy23: num  0.433 0.396 0.429 0.556 0.573 ...
##  $ energy24: num  0.555 0.391 0.365 0.485 0.54 ...
##  $ energy25: num  0.671 0.325 0.533 0.314 0.316 ...
##  $ energy26: num  0.641 0.32 0.241 0.533 0.229 ...
##  $ energy27: num  0.71 0.327 0.507 0.526 0.7 ...
##  $ energy28: num  0.808 0.277 0.853 0.252 1 ...
##  $ energy29: num  0.679 0.442 0.604 0.209 0.726 ...
##  $ energy30: num  0.386 0.203 0.851 0.356 0.472 ...
##  $ energy31: num  0.131 0.379 0.851 0.626 0.51 ...
##  $ energy32: num  0.26 0.295 0.504 0.734 0.546 ...
##  $ energy33: num  0.512 0.198 0.186 0.612 0.288 ...
##  $ energy34: num  0.7547 0.2341 0.2709 0.3497 0.0981 ...
##  $ energy35: num  0.854 0.131 0.423 0.395 0.195 ...
##  $ energy36: num  0.851 0.418 0.304 0.301 0.418 ...
##  $ energy37: num  0.669 0.384 0.612 0.541 0.46 ...
##  $ energy38: num  0.61 0.106 0.676 0.881 0.322 ...
##  $ energy39: num  0.494 0.184 0.537 0.986 0.283 ...
##  $ energy40: num  0.274 0.197 0.472 0.917 0.243 ...
##  $ energy41: num  0.051 0.167 0.465 0.612 0.198 ...
##  $ energy42: num  0.2834 0.0583 0.2587 0.5006 0.2444 ...
##  $ energy43: num  0.282 0.14 0.213 0.321 0.185 ...
##  $ energy44: num  0.4256 0.1628 0.2222 0.3202 0.0841 ...
##  $ energy45: num  0.2641 0.0621 0.2111 0.4295 0.0692 ...
##  $ energy46: num  0.1386 0.0203 0.0176 0.3654 0.0528 ...
##  $ energy47: num  0.1051 0.053 0.1348 0.2655 0.0357 ...
##  $ energy48: num  0.1343 0.0742 0.0744 0.1576 0.0085 ...
##  $ energy49: num  0.0383 0.0409 0.013 0.0681 0.023 0.0264 0.0507 0.0285 0.0777 0.0092 ...
##  $ energy50: num  0.0324 0.0061 0.0106 0.0294 0.0046 0.0081 0.0159 0.0178 0.0439 0.0198 ...
##  $ energy51: num  0.0232 0.0125 0.0033 0.0241 0.0156 0.0104 0.0195 0.0052 0.0061 0.0118 ...
##  $ energy52: num  0.0027 0.0084 0.0232 0.0121 0.0031 0.0045 0.0201 0.0081 0.0145 0.009 ...
##  $ energy53: num  0.0065 0.0089 0.0166 0.0036 0.0054 0.0014 0.0248 0.012 0.0128 0.0223 ...
##  $ energy54: num  0.0159 0.0048 0.0095 0.015 0.0105 0.0038 0.0131 0.0045 0.0145 0.0179 ...
##  $ energy55: num  0.0072 0.0094 0.018 0.0085 0.011 0.0013 0.007 0.0121 0.0058 0.0084 ...
##  $ energy56: num  0.0167 0.0191 0.0244 0.0073 0.0015 0.0089 0.0138 0.0097 0.0049 0.0068 ...
##  $ energy57: num  0.018 0.014 0.0316 0.005 0.0072 0.0057 0.0092 0.0085 0.0065 0.0032 ...
##  $ energy58: num  0.0084 0.0049 0.0164 0.0044 0.0048 0.0027 0.0143 0.0047 0.0093 0.0035 ...
##  $ energy59: num  0.009 0.0052 0.0095 0.004 0.0107 0.0051 0.0036 0.0048 0.0059 0.0056 ...
##  $ energy60: num  0.0032 0.0044 0.0078 0.0117 0.0094 0.0062 0.0103 0.0053 0.0022 0.004 ...
##  $ type    : Factor w/ 2 levels "M","R": 2 2 2 2 2 2 2 2 2 2 ...

All variables have their matched data type. Good. Now, move on to explore the dataset.

5 Exploratory Data Analysis

Here, we’re going to check the proportion of target variable and whether multicollinearity exists.

5.1 Target Variable Proportion

We’d like to have adequately balanced proportion in each class of the target variable. Our target variable is type. Let’s see its proportion.

table(sonar$type)

## 
##   M   R 
## 111  97

prop.table(table(sonar$type))

## 
##         M         R 
## 0.5336538 0.4663462

Nice. Both sides have sufficiently balanced proportion. So, we can move on to check the multicollinearity.

5.2 Multicollinearity

We’d like to have each variable independent to each other, meaning that each of which does not have high correlation with others. Therefore, high correlation found, theoretically this indicates multicollinearity. First thing first, we should see the information of each variable by calling summary() function.

summary(sonar)

##     energy1           energy2           energy3           energy4       
##  Min.   :0.00150   Min.   :0.00060   Min.   :0.00150   Min.   :0.00580  
##  1st Qu.:0.01335   1st Qu.:0.01645   1st Qu.:0.01895   1st Qu.:0.02438  
##  Median :0.02280   Median :0.03080   Median :0.03430   Median :0.04405  
##  Mean   :0.02916   Mean   :0.03844   Mean   :0.04383   Mean   :0.05389  
##  3rd Qu.:0.03555   3rd Qu.:0.04795   3rd Qu.:0.05795   3rd Qu.:0.06450  
##  Max.   :0.13710   Max.   :0.23390   Max.   :0.30590   Max.   :0.42640  
##     energy5           energy6           energy7          energy8       
##  Min.   :0.00670   Min.   :0.01020   Min.   :0.0033   Min.   :0.00550  
##  1st Qu.:0.03805   1st Qu.:0.06703   1st Qu.:0.0809   1st Qu.:0.08042  
##  Median :0.06250   Median :0.09215   Median :0.1070   Median :0.11210  
##  Mean   :0.07520   Mean   :0.10457   Mean   :0.1217   Mean   :0.13480  
##  3rd Qu.:0.10028   3rd Qu.:0.13412   3rd Qu.:0.1540   3rd Qu.:0.16960  
##  Max.   :0.40100   Max.   :0.38230   Max.   :0.3729   Max.   :0.45900  
##     energy9           energy10         energy11         energy12     
##  Min.   :0.00750   Min.   :0.0113   Min.   :0.0289   Min.   :0.0236  
##  1st Qu.:0.09703   1st Qu.:0.1113   1st Qu.:0.1293   1st Qu.:0.1335  
##  Median :0.15225   Median :0.1824   Median :0.2248   Median :0.2490  
##  Mean   :0.17800   Mean   :0.2083   Mean   :0.2360   Mean   :0.2502  
##  3rd Qu.:0.23342   3rd Qu.:0.2687   3rd Qu.:0.3016   3rd Qu.:0.3312  
##  Max.   :0.68280   Max.   :0.7106   Max.   :0.7342   Max.   :0.7060  
##     energy13         energy14         energy15         energy16     
##  Min.   :0.0184   Min.   :0.0273   Min.   :0.0031   Min.   :0.0162  
##  1st Qu.:0.1661   1st Qu.:0.1752   1st Qu.:0.1646   1st Qu.:0.1963  
##  Median :0.2640   Median :0.2811   Median :0.2817   Median :0.3047  
##  Mean   :0.2733   Mean   :0.2966   Mean   :0.3202   Mean   :0.3785  
##  3rd Qu.:0.3513   3rd Qu.:0.3862   3rd Qu.:0.4529   3rd Qu.:0.5357  
##  Max.   :0.7131   Max.   :0.9970   Max.   :1.0000   Max.   :0.9988  
##     energy17         energy18         energy19         energy20     
##  Min.   :0.0349   Min.   :0.0375   Min.   :0.0494   Min.   :0.0656  
##  1st Qu.:0.2059   1st Qu.:0.2421   1st Qu.:0.2991   1st Qu.:0.3506  
##  Median :0.3084   Median :0.3683   Median :0.4350   Median :0.5425  
##  Mean   :0.4160   Mean   :0.4523   Mean   :0.5048   Mean   :0.5630  
##  3rd Qu.:0.6594   3rd Qu.:0.6791   3rd Qu.:0.7314   3rd Qu.:0.8093  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##     energy21         energy22         energy23         energy24     
##  Min.   :0.0512   Min.   :0.0219   Min.   :0.0563   Min.   :0.0239  
##  1st Qu.:0.3997   1st Qu.:0.4069   1st Qu.:0.4502   1st Qu.:0.5407  
##  Median :0.6177   Median :0.6649   Median :0.6997   Median :0.6985  
##  Mean   :0.6091   Mean   :0.6243   Mean   :0.6470   Mean   :0.6727  
##  3rd Qu.:0.8170   3rd Qu.:0.8320   3rd Qu.:0.8486   3rd Qu.:0.8722  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##     energy25         energy26         energy27         energy28     
##  Min.   :0.0240   Min.   :0.0921   Min.   :0.0481   Min.   :0.0284  
##  1st Qu.:0.5258   1st Qu.:0.5442   1st Qu.:0.5319   1st Qu.:0.5348  
##  Median :0.7211   Median :0.7545   Median :0.7456   Median :0.7319  
##  Mean   :0.6754   Mean   :0.6999   Mean   :0.7022   Mean   :0.6940  
##  3rd Qu.:0.8737   3rd Qu.:0.8938   3rd Qu.:0.9171   3rd Qu.:0.9003  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##     energy29         energy30         energy31         energy32     
##  Min.   :0.0144   Min.   :0.0613   Min.   :0.0482   Min.   :0.0404  
##  1st Qu.:0.4637   1st Qu.:0.4114   1st Qu.:0.3456   1st Qu.:0.2814  
##  Median :0.6808   Median :0.6071   Median :0.4904   Median :0.4296  
##  Mean   :0.6421   Mean   :0.5809   Mean   :0.5045   Mean   :0.4390  
##  3rd Qu.:0.8521   3rd Qu.:0.7352   3rd Qu.:0.6420   3rd Qu.:0.5803  
##  Max.   :1.0000   Max.   :1.0000   Max.   :0.9657   Max.   :0.9306  
##     energy33         energy34         energy35         energy36     
##  Min.   :0.0477   Min.   :0.0212   Min.   :0.0223   Min.   :0.0080  
##  1st Qu.:0.2579   1st Qu.:0.2176   1st Qu.:0.1794   1st Qu.:0.1543  
##  Median :0.3912   Median :0.3510   Median :0.3127   Median :0.3211  
##  Mean   :0.4172   Mean   :0.4032   Mean   :0.3926   Mean   :0.3848  
##  3rd Qu.:0.5561   3rd Qu.:0.5961   3rd Qu.:0.5934   3rd Qu.:0.5565  
##  Max.   :1.0000   Max.   :0.9647   Max.   :1.0000   Max.   :1.0000  
##     energy37         energy38         energy39         energy40     
##  Min.   :0.0351   Min.   :0.0383   Min.   :0.0371   Min.   :0.0117  
##  1st Qu.:0.1601   1st Qu.:0.1743   1st Qu.:0.1740   1st Qu.:0.1865  
##  Median :0.3063   Median :0.3127   Median :0.2835   Median :0.2781  
##  Mean   :0.3638   Mean   :0.3397   Mean   :0.3258   Mean   :0.3112  
##  3rd Qu.:0.5189   3rd Qu.:0.4405   3rd Qu.:0.4349   3rd Qu.:0.4244  
##  Max.   :0.9497   Max.   :1.0000   Max.   :0.9857   Max.   :0.9297  
##     energy41         energy42         energy43         energy44     
##  Min.   :0.0360   Min.   :0.0056   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.1631   1st Qu.:0.1589   1st Qu.:0.1552   1st Qu.:0.1269  
##  Median :0.2595   Median :0.2451   Median :0.2225   Median :0.1777  
##  Mean   :0.2893   Mean   :0.2783   Mean   :0.2465   Mean   :0.2141  
##  3rd Qu.:0.3875   3rd Qu.:0.3842   3rd Qu.:0.3245   3rd Qu.:0.2717  
##  Max.   :0.8995   Max.   :0.8246   Max.   :0.7733   Max.   :0.7762  
##     energy45          energy46          energy47          energy48      
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.09448   1st Qu.:0.06855   1st Qu.:0.06425   1st Qu.:0.04512  
##  Median :0.14800   Median :0.12135   Median :0.10165   Median :0.07810  
##  Mean   :0.19723   Mean   :0.16063   Mean   :0.12245   Mean   :0.09142  
##  3rd Qu.:0.23155   3rd Qu.:0.20037   3rd Qu.:0.15443   3rd Qu.:0.12010  
##  Max.   :0.70340   Max.   :0.72920   Max.   :0.55220   Max.   :0.33390  
##     energy49          energy50          energy51           energy52       
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.000000   Min.   :0.000800  
##  1st Qu.:0.02635   1st Qu.:0.01155   1st Qu.:0.008425   1st Qu.:0.007275  
##  Median :0.04470   Median :0.01790   Median :0.013900   Median :0.011400  
##  Mean   :0.05193   Mean   :0.02042   Mean   :0.016069   Mean   :0.013420  
##  3rd Qu.:0.06853   3rd Qu.:0.02527   3rd Qu.:0.020825   3rd Qu.:0.016725  
##  Max.   :0.19810   Max.   :0.08250   Max.   :0.100400   Max.   :0.070900  
##     energy53           energy54           energy55          energy56       
##  Min.   :0.000500   Min.   :0.001000   Min.   :0.00060   Min.   :0.000400  
##  1st Qu.:0.005075   1st Qu.:0.005375   1st Qu.:0.00415   1st Qu.:0.004400  
##  Median :0.009550   Median :0.009300   Median :0.00750   Median :0.006850  
##  Mean   :0.010709   Mean   :0.010941   Mean   :0.00929   Mean   :0.008222  
##  3rd Qu.:0.014900   3rd Qu.:0.014500   3rd Qu.:0.01210   3rd Qu.:0.010575  
##  Max.   :0.039000   Max.   :0.035200   Max.   :0.04470   Max.   :0.039400  
##     energy57          energy58           energy59           energy60       
##  Min.   :0.00030   Min.   :0.000300   Min.   :0.000100   Min.   :0.000600  
##  1st Qu.:0.00370   1st Qu.:0.003600   1st Qu.:0.003675   1st Qu.:0.003100  
##  Median :0.00595   Median :0.005800   Median :0.006400   Median :0.005300  
##  Mean   :0.00782   Mean   :0.007949   Mean   :0.007941   Mean   :0.006507  
##  3rd Qu.:0.01043   3rd Qu.:0.010350   3rd Qu.:0.010325   3rd Qu.:0.008525  
##  Max.   :0.03550   Max.   :0.044000   Max.   :0.036400   Max.   :0.043900  
##  type   
##  M:111  
##  R: 97  
##         
##         
##         
##

As explained in Introduction that the values of predictor variables range between 0 and 1, from above summary, we can see most variables share nearly the same range. Nevertheless, to make sure, let’s see the predictors in boxplots.

p <- sonar %>% select(colnames(.)[-61]) %>%
  pivot_longer(cols = colnames(.), names_to = "Energy", values_to = "Value") %>%
  arrange(Energy) %>% 
  mutate(Energy = as.factor(Energy)) %>%
  ggplot(aes(x = Energy,
             y = Value)) +
  coord_flip() +
  geom_boxplot(aes(fill = Energy), show.legend = F) +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        legend.position = "none")
ggplotly(p)

{#boxplot}Eventhough the metadata said that all predictors range from 0 from 1, we still found there are several of them having relatively minor values compared to the majority. We also found that there is no any high value beyond value of 1. Now, let’s check their correlations.

temp <- sonar %>% select(-type)
colnames(temp) <- c(1:60)
ggcorrplot(temp %>% cor())

There is a unique pattern discovered from above figure. Each variable has large positive correlation with their next neighbors. Therefore, we have to remove this pattern since high correlation indicates multicollinearity. We can drop the adjacent neighbors of each variable by only taking the columns with multiplication of 3, 4, 5, or 6. For example, if multiplication of 3 selected, we could take the variables of energy1, energy3, energy6, and all the way to energy60. As a beginning, I pick the multiplication of 4 to divide the number of predictor variables.

idx <- c(1, seq(from = 4, to = ncol(temp), by = 4))
ggcorr(temp[,idx], label = T)

DF <- sonar[,c(idx,61)] # subset only necessary columns, i.e. the multipl of 4

Nice. We have removed all with high correlation. Now, all remaining are only variables with lower correlation. We also should check the p-value of correlation test.

sonarComb <- combn(colnames(DF[,-ncol(DF)]), 2) # create combinations among each variable
Alpha <- 0.05 # set significance value
multicolRes <- data.frame(vs = 1:dim(sonarComb)[2], # create a blank data frame
                          cor = 1:dim(sonarComb)[2],
                          res = 1:dim(sonarComb)[2])
for (i in 1:dim(sonarComb)[2]) {
  multicolRes$vs[i] <- paste0(sonarComb[1,i], " & ", sonarComb[2,i])
  corTest <- cor.test(DF[,sonarComb[1,i]], DF[,sonarComb[2,i]])
  multicolRes$cor[i] <- corTest$p.value
  multicolRes$res[i] <- ifelse(corTest$p.value < Alpha, 
                               "Yes", 
                               "No")
  # multiResults <- c(multiResults, res)
}
head(multicolRes)

Too bad. The first five combinations show multicollinearity exists. We should check the composition.

table(multicolRes$res)

## 
##  No Yes 
##  44  76

prop.table(table(multicolRes$res))

## 
##        No       Yes 
## 0.3666667 0.6333333

We only have 44 combinations without multicollinearity detected. However, we just keep forward with the variables used so far.

6 Modelling

In this chapter, we’re going to split the dataset into train dataset and test dataset. Next, we will create two models of logistic regression and KNN (actually we won’t create KNN model as it basically doesn’t generate one). After the models retrieved, we will predict the test dataset using both models, and evalute the results.

6.1 Splitting

In order to secure the proportion of two datasets after splitting, we’re going to use rsample library here. The initial dataset used to split is DF, i.e. the one with removed variables of multiplication of 4.

set.seed(1)
idx <- initial_split(DF, prop = 0.8, strata = type)
sonar_train <- training(idx)
sonar_test <- testing(idx)

prop.table(table(sonar_train$type)) # Check train dataset proportion after split

## 
##         M         R 
## 0.5329341 0.4670659

prop.table(table(sonar_test$type)) # Check test dataset proportion after split

## 
##         M         R 
## 0.5365854 0.4634146

# Split the predictors and the target of train dataset for KNN model usage
X_train <- sonar_train[,-ncol(sonar_train)]
y_train <- sonar_train[,ncol(sonar_train)]

# Split the predictors and the target of test dataset for KNN model usage
X_test <- sonar_test[,-ncol(sonar_test)]
y_test <- sonar_test[,ncol(sonar_test)]

Nice. The proportion of two datasets approximately is kept. Let’s move on to create the models.

6.2 Create the Model

6.2.1 Logistic Regression Model

We’re going to use all predictor variables.

model_log <- glm(formula = type ~ ., data = sonar_train, family = "binomial", maxit = 30)
summary(model_log)

## 
## Call:
## glm(formula = type ~ ., family = "binomial", data = sonar_train, 
##     maxit = 30)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.27127  -0.49501  -0.02504   0.48444   2.32497  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   7.9719     2.4584   3.243 0.001184 ** 
## energy1     -40.5505    15.3281  -2.645 0.008157 ** 
## energy4     -20.9559    10.4479  -2.006 0.044882 *  
## energy8       0.6211     3.7137   0.167 0.867175    
## energy12     -6.6115     2.3115  -2.860 0.004234 ** 
## energy16      3.5734     1.5524   2.302 0.021343 *  
## energy20     -4.3278     1.6016  -2.702 0.006888 ** 
## energy24     -2.1332     1.1945  -1.786 0.074121 .  
## energy28     -1.2483     1.2631  -0.988 0.323008    
## energy32      1.6404     1.5629   1.050 0.293912    
## energy36      5.2483     1.5286   3.433 0.000596 ***
## energy40      3.2440     2.1501   1.509 0.131356    
## energy44    -11.6871     3.3622  -3.476 0.000509 ***
## energy48    -22.6860     6.7376  -3.367 0.000760 ***
## energy52    -64.6841    34.1600  -1.894 0.058283 .  
## energy56     95.8364    55.4891   1.727 0.084146 .  
## energy60    -32.2393    66.8132  -0.483 0.629431    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 230.79  on 166  degrees of freedom
## Residual deviance: 111.58  on 150  degrees of freedom
## AIC: 145.58
## 
## Number of Fisher Scoring iterations: 7

We can see above summary that there are five predictors without any significant code, namely energy8, energy28, energy32, energy40, and energy60. We will handle this issue later in Model Tuning. For now, let’s predict the test dataset using the just-generated model.

6.2.2 KNN Data Pre-Processing - Scalling

Specially for KNN, we need to scale all predictor variables in advance. Though all of them are said to be in range between 0 and 1, we found earlier that some spread not in range exactly 0 and 1. Thus, in order to ensure, it’s preferable to still scale all of them.

X_train.scaled <- scale(x = X_train)
X_test.scaled <- scale(x = X_test, 
                       center = attr(X_train.scaled, "scaled:center"),
                       scale = attr(X_train.scaled, "scaled:scale"))

6.3 Predictions

After the models have been prepared. Let’s predict the test dataset.

6.3.1 Logistic Regression Prediction

Since the direct output of logistic regression model is in log of odds form, we need to convert it to a more interpretable form, i.e. probability. We can use the argument type = "response" to arrange this. Afterwards, all probabilities have to be manually assigned to whether positive (Rocks) or negative (Mines) based on the set threshold. As a start, we could use threshold of 0.5.

predict_log <- predict(object = model_log, newdata = sonar_test, type = "response")
sonar_test$ypred_prob <- predict_log
sonar_test$ypred_label <- ifelse(sonar_test$ypred_prob > 0.5, "R", "M")
# negative class is M
# positive class is R
# since at the beginning, R automatically sets "M" as the first as class, meaning the negative class

6.3.2 KNN Prediction

To predict the test dataset using KNN, firstly, we have to determine initial optimum K value. In this case, we’re going to obtain K by calculating the square root of all observations of the train dataset.

Find optimum K

K <- sqrt(nrow(X_train))
K

## [1] 12.92285

As the target variable has two classes only (R or M) meaning that it is an even number, we need an odd number of K. Therefore, with K = 13, it should be enough for the prediction.

predict_knn <- knn(train = X_train.scaled, test = X_test.scaled, cl = y_train, k = round(K))

6.4 Evaluation

Both models have been used to predict the test dataset. Now, let’s evaluate them. Here, to assess both, we’re going to utilize confusion matrix. For confusion matrix metric, I’m only interested to highlight the Accuracy, Sensitivity (Recall), and Pos Pred Value (Precision). The figure below briefly shows the confusion matrix and its attributes.

Let’s compare both models.

print("Confusion Matrix of Log Regression Model")

## [1] "Confusion Matrix of Log Regression Model"

confusionMatrix(data = as.factor(sonar_test$ypred_label), reference = y_test, positive = "R")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  M  R
##          M 15  6
##          R  7 13
##                                           
##                Accuracy : 0.6829          
##                  95% CI : (0.5191, 0.8192)
##     No Information Rate : 0.5366          
##     P-Value [Acc > NIR] : 0.04115         
##                                           
##                   Kappa : 0.3647          
##                                           
##  Mcnemar's Test P-Value : 1.00000         
##                                           
##             Sensitivity : 0.6842          
##             Specificity : 0.6818          
##          Pos Pred Value : 0.6500          
##          Neg Pred Value : 0.7143          
##              Prevalence : 0.4634          
##          Detection Rate : 0.3171          
##    Detection Prevalence : 0.4878          
##       Balanced Accuracy : 0.6830          
##                                           
##        'Positive' Class : R               
##

print("Confusion Matrix of KNN")

## [1] "Confusion Matrix of KNN"

confusionMatrix(data = predict_knn, reference = y_test, positive = "R")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  M  R
##          M 16 10
##          R  6  9
##                                         
##                Accuracy : 0.6098        
##                  95% CI : (0.445, 0.758)
##     No Information Rate : 0.5366        
##     P-Value [Acc > NIR] : 0.2175        
##                                         
##                   Kappa : 0.2039        
##                                         
##  Mcnemar's Test P-Value : 0.4533        
##                                         
##             Sensitivity : 0.4737        
##             Specificity : 0.7273        
##          Pos Pred Value : 0.6000        
##          Neg Pred Value : 0.6154        
##              Prevalence : 0.4634        
##          Detection Rate : 0.2195        
##    Detection Prevalence : 0.3659        
##       Balanced Accuracy : 0.6005        
##                                         
##        'Positive' Class : R             
##

Unfortunately, both models perform poorly. All points we’re interested in (i.e. Accuracy, Sensitivity, and Pos Pred Value) are more or less below 7. Though, the logistic regression model slightly produces superior results. Based on this insufficient performance, the models are required to tune that both can perform more satisfactorily.

7 Model Tuning

7.1 Tuning Logistic Regression

As mentioned previously, there are several predictor variables (i.e. energy8, energy28, energy32, energy40, and energy60) which are insignificant in the logistic regression model. Now, we’re going to remove them, and it is expected that this could enhance the model’s performance.

energyInsignificant <- c("energy8", "energy28", "energy32", "energy40", "energy60")
sonar_train2 <- sonar_train %>% select(-all_of(energyInsignificant))

New dataset has been created already. Now, let’s use it to train the model and see the results.

model_log2 <- glm(formula = type ~ ., data = sonar_train2, family = "binomial", maxit = 30)
summary(model_log2)

## 
## Call:
## glm(formula = type ~ ., family = "binomial", data = sonar_train2, 
##     maxit = 30)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.40305  -0.58818  -0.03312   0.48955   2.29436  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    6.999      1.690   4.143 3.43e-05 ***
## energy1      -37.235     12.848  -2.898  0.00375 ** 
## energy4      -19.630      9.579  -2.049  0.04044 *  
## energy12      -5.660      2.083  -2.717  0.00659 ** 
## energy16       3.178      1.405   2.261  0.02374 *  
## energy20      -3.625      1.330  -2.725  0.00642 ** 
## energy24      -2.308      1.115  -2.070  0.03845 *  
## energy36       6.085      1.456   4.178 2.94e-05 ***
## energy44      -9.091      2.791  -3.257  0.00113 ** 
## energy48     -18.894      5.885  -3.210  0.00133 ** 
## energy52     -56.024     32.051  -1.748  0.08047 .  
## energy56      89.250     51.889   1.720  0.08543 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 230.79  on 166  degrees of freedom
## Residual deviance: 118.93  on 155  degrees of freedom
## AIC: 142.93
## 
## Number of Fisher Scoring iterations: 6

Nice. Each variable at least has . sign indicating that its p-value is less than 0.1. Now, use the new model to predict the test dataset.

predict_log2 <- predict(object = model_log2, newdata = sonar_test, type = "response")
sonar_test$ypred_prob2 <- predict_log2
sonar_test$ypred_label2 <- ifelse(sonar_test$ypred_prob2 > 0.5, "R", "M")

Subsequently, let’s see its performance by using confusion matrix.

confusionMatrix(as.factor(sonar_test$ypred_label2), sonar_test$type, positive = "R")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  M  R
##          M 15  5
##          R  7 14
##                                           
##                Accuracy : 0.7073          
##                  95% CI : (0.5446, 0.8387)
##     No Information Rate : 0.5366          
##     P-Value [Acc > NIR] : 0.0196          
##                                           
##                   Kappa : 0.4157          
##                                           
##  Mcnemar's Test P-Value : 0.7728          
##                                           
##             Sensitivity : 0.7368          
##             Specificity : 0.6818          
##          Pos Pred Value : 0.6667          
##          Neg Pred Value : 0.7500          
##              Prevalence : 0.4634          
##          Detection Rate : 0.3415          
##    Detection Prevalence : 0.5122          
##       Balanced Accuracy : 0.7093          
##                                           
##        'Positive' Class : R               
##

Good. By eliminating all insignificant predictors, we could increase the performance by approxitamely 2%. To see the enhancement, let’s compare all the initial model and the tuned model in the points we’re interested in.

LogResTable1 <- table(sonar_test$ypred_label, sonar_test$type, dnn = c("Prediction","Actual"))
LogResTable2 <- table(sonar_test$ypred_label2, sonar_test$type, dnn = c("Prediction","Actual"))

Accuracy1 <- (LogResTable1[1,1] + LogResTable1[2,2])/nrow(sonar_test)
Accuracy2 <- (LogResTable2[1,1] + LogResTable2[2,2])/nrow(sonar_test)
Recall1 <- LogResTable1[2,2]/(LogResTable1[1,2] + LogResTable1[2,2])
Recall2 <- LogResTable2[2,2]/(LogResTable2[1,2] + LogResTable2[2,2])
Precision1 <- LogResTable1[2,2]/(LogResTable1[2,1] + LogResTable1[2,2])
Precision2 <- LogResTable2[2,2]/(LogResTable2[2,1] + LogResTable2[2,2])

finalResults.LogRes <- data.frame("Model" = c("Initial Log Res","Tuned Log Res"),
                                  "Accuracy" = c(Accuracy1, Accuracy2),
                                  "Recall" = c(Recall1, Recall2),
                                  "Precision" = c(Precision1, Precision2))
cat("\nConfusion matrix of the initial model\n")

## 
## Confusion matrix of the initial model

LogResTable1

##           Actual
## Prediction  M  R
##          M 15  6
##          R  7 13

cat("\nConfusion matrix of the tuned model\n")

## 
## Confusion matrix of the tuned model

LogResTable2

##           Actual
## Prediction  M  R
##          M 15  5
##          R  7 14

finalResults.LogRes

As seen above, the variable elimination is able to move exactly 1 False Negative (FN) from the results of the initial model to True Positive (TP) as exhibited by the tuned model. By this 1 FN-to-TP movement, the performance increases by more or less 2%, 5%, and 1%, as shown by the data frame above, for accuracy, recall, and precision, respectively. From now on, we use the tuned logistic regression model as our final model to compare to that of KNN.

7.2 Tuning KNN

For tuning KNN, we could find the best K value. We could carry out this by using every possible odd K value to predict the test dataset and select the one with best accuracy. This technique is kind of brute force so it is not recommended for the dataset with long rows unless you have a computer with high specifications.

# get all odd K values
oddK_idx <- sapply(1:(nrow(X_train)-1), function(x) {ifelse(x%%2 == 1, T, F)})
optK <- data.frame(1:(nrow(X_train)-1))[oddK_idx,]
optKNN <- data.frame(K = 1:length(optK), acc = 1:length(optK)) # create blank dataframe

# brute force to calculate every accuracy of all odd Ks
for (i in 1:length(optK)) {
  predknn <- knn(train = X_train.scaled, test = X_test.scaled, cl = y_train, k = optK[i])
  optKNN$K[i] <- optK[i]
  optKNN$acc[i] <- Accuracy(predknn, y_test)
}
maxAcc <- optKNN[optKNN$acc == max(optKNN$acc),] # find the one with max accuracy
labelMax <- paste0(round(maxAcc$acc, 3), " when K = ", maxAcc$K) # and label it

K13 <- optKNN[optKNN$K == 13,]
labelK13 <- "K = 13"

# plot the results
optKNN %>% ggplot(aes(x = K,
                      y = acc)) +
  geom_line(color = "blue") +
  geom_point(data = maxAcc, aes(x = K,
                                y = acc)) +
  geom_point(data = K13, aes(x = K,
                             y = acc)) +
  geom_text(data = maxAcc, 
             aes(label = labelMax), 
             hjust = -0.1, 
             size = 3.5) +
  geom_text(data = K13, aes(label = labelK13), size = 2.5, hjust = 1.4) +
  labs(title = "No of Ks vs Accuracy in KNN prediction",
       x = "Number of Ks",
       y = "Accuracy")

Suprisingly, the best K value obtained is just 1. Even, this K = 1 transcends all possible K values. As shown above, our previous K, i.e. K = 13, is far below K = 1. Based on this finding, we’re going to reset K value of our KNN model and see its results.

bestK <- maxAcc$K
predict_knn2 <- knn(train = X_train.scaled, test = X_test.scaled, cl = y_train, k = bestK)
confusionMatrix(predict_knn2, reference = y_test, positive = "R")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  M  R
##          M 19  5
##          R  3 14
##                                           
##                Accuracy : 0.8049          
##                  95% CI : (0.6513, 0.9118)
##     No Information Rate : 0.5366          
##     P-Value [Acc > NIR] : 0.0003284       
##                                           
##                   Kappa : 0.6048          
##                                           
##  Mcnemar's Test P-Value : 0.7236736       
##                                           
##             Sensitivity : 0.7368          
##             Specificity : 0.8636          
##          Pos Pred Value : 0.8235          
##          Neg Pred Value : 0.7917          
##              Prevalence : 0.4634          
##          Detection Rate : 0.3415          
##    Detection Prevalence : 0.4146          
##       Balanced Accuracy : 0.8002          
##                                           
##        'Positive' Class : R               
##

Great. Now, our KNN model reaches the accuracy of 80%. Let’s compare this tuned model to the initial one.

KNNTable1 <- table(predict_knn, y_test, dnn = c("Prediction","Actual"))
KNNTable2 <- table(predict_knn2, y_test, dnn = c("Prediction","Actual"))

Accuracy1 <- (KNNTable1[1,1] + KNNTable1[2,2])/nrow(sonar_test)
Accuracy2 <- (KNNTable2[1,1] + KNNTable2[2,2])/nrow(sonar_test)
Recall1 <- KNNTable1[2,2]/(KNNTable1[1,2] + KNNTable1[2,2])
Recall2 <- KNNTable2[2,2]/(KNNTable2[1,2] + KNNTable2[2,2])
Precision1 <- KNNTable1[2,2]/(KNNTable1[2,1] + KNNTable1[2,2])
Precision2 <- KNNTable2[2,2]/(KNNTable2[2,1] + KNNTable2[2,2])

finalResults.KNN <- data.frame("Model" = c("Initial KNN","Tuned KNN"),
                                  "Accuracy" = c(Accuracy1, Accuracy2),
                                  "Recall" = c(Recall1, Recall2),
                                  "Precision" = c(Precision1, Precision2))
cat("Confusion matrix of the initial model\n")

## Confusion matrix of the initial model

KNNTable1

##           Actual
## Prediction  M  R
##          M 16 10
##          R  6  9

cat("\nConfusion matrix of the tuned model\n")

## 
## Confusion matrix of the tuned model

KNNTable2

##           Actual
## Prediction  M  R
##          M 19  5
##          R  3 14

finalResults.KNN

Amazing. Such K = 1 significantly elevates the performance of KNN model by approximately 20%, 26%, and 22%, as seen in above data frame, for accuracy, recall, and precision, respectively. This is proven by, if you see the confusion matrices above, three FP-to-TN and five FN-to-TP movements. From now on, the KNN model used is the tuned one with K = 1.

7.3 Comparison of the Tuned Models

Both logistic regression and KNN models have been tuned. Both produce improved results. Now, let’s compare both tuned models and see the results.

finalResults <- rbind(finalResults.LogRes[2,], finalResults.KNN[2,])
rownames(finalResults) <- NULL
finalResults

As seen above, in case of accuracy and precision, the KNN performs much better than the logistic regression. However, uniquely, in case of recall, both models exhibit exactly the same values. This could occur because both have exactly the same proportion of true positive and false negative as shown by the table below.

cat("Confusion Matrix of the tuned log res model\n")

## Confusion Matrix of the tuned log res model

LogResTable2

##           Actual
## Prediction  M  R
##          M 15  5
##          R  7 14

cat("\nConfusion matrix of the tuned KNN model\n")

## 
## Confusion matrix of the tuned KNN model

KNNTable2

##           Actual
## Prediction  M  R
##          M 19  5
##          R  3 14

Based on the results explained, we can conclude that, specially for this dataset, the KNN model performs more efficient than the logistic regression model. Subsequently, since both models are unable to perform the accuracy of 90%, we could determine the most highlighted metric (whether recall or precision) based on the business case of the dataset.

If we prefer high recall, this means that we increase the probability to have more false positives than false negatives. This could cause we might to have more observations predicted as rocks whereas originally they are mines.
If we prefer high precision, this means that we increase the probability to have more false negatives than false positives. This could cause we might to have more observations predicted as mines whereas originally they are rocks.

8 Conclusions

Finally, you have read a lot. Thank you for reaching this much. Now, let’s conclude this article. Firstly, we should conclude from what has been defined in Introduction.

We have satisfied the aim. Two methods, i.e. Logistic Regression and K-Nearest Neighbors, have been generated to decide whether the observation is mines or rocks based on sonar signals data.
We have created comparisons between Logistic Regression and K-nearest neigbors performance.
We have assessed both models based on confusion matrix.
We have examined the model using test dataset and discussed the results.
By applying model tuning technique, we have upgraded the performance of both models, namely:
- by 2%, 5%, and 1% for accuracy, recall, and precision, respectively, for logistic regression
- by 20%, 26%, and 22% for accuracy, recall, and precision, respectively, for KNN.
After both final models compared each other, the final KNN model comes out as a champion. Though, it might be overfitted since the K value used is 1.
Since both final models do not reach the accuracy of almost 100%, it is wiser if we highlight one of other metrics, either recall or precision, based on the business case of the dataset.

Mine or Rock?

Reza Dwi Utomo

24/02/2020