We are going to introduce the K-nearest neighbors (KNN) algorithm and show some practical ways of using it in R with the knn function from the class library.
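
To fix ideas, here is a minimal from-scratch sketch of what KNN does for a single query point (the helper name knn_predict_one is ours, not part of the class library): compute the distance from the query to every training point, take the k closest, and return the majority class among them.

# illustrative only: classify one new point by majority vote among
# its k nearest training points (Euclidean distance)
knn_predict_one <- function(X_train, y_train, x_new, k = 3) {
  x <- unlist(x_new)                                   # one data frame row -> numeric vector
  dists <- sqrt(rowSums(sweep(as.matrix(X_train), 2, x)^2))
  nearest <- order(dists)[1:k]                         # indices of the k closest points
  votes <- table(y_train[nearest])                     # tally their class labels
  names(votes)[which.max(votes)]                       # majority class wins
}
# e.g., with the objects built in Step 2 below:
# knn_predict_one(X_train, y_train, X_test[1, ], k = 3)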

Import libraries

library(class)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
require(mlbench)
## Loading required package: mlbench
library(e1071)

Step 1: Data collection

For this lesson, we will be using the Sonar data set (sonar signals) from the mlbench library. Sonar is a system for detecting objects under water and measuring the water’s depth by emitting sound pulses and detecting their echoes. The complete description can be found in the mlbench documentation. For our purposes, this is a two-class (class R and class M) classification task with numeric data.

Let’s look at the first six rows of Sonar:

data(Sonar)
head(Sonar)
##       V1     V2     V3     V4     V5     V6     V7     V8     V9    V10    V11
## 1 0.0200 0.0371 0.0428 0.0207 0.0954 0.0986 0.1539 0.1601 0.3109 0.2111 0.1609
## 2 0.0453 0.0523 0.0843 0.0689 0.1183 0.2583 0.2156 0.3481 0.3337 0.2872 0.4918
## 3 0.0262 0.0582 0.1099 0.1083 0.0974 0.2280 0.2431 0.3771 0.5598 0.6194 0.6333
## 4 0.0100 0.0171 0.0623 0.0205 0.0205 0.0368 0.1098 0.1276 0.0598 0.1264 0.0881
## 5 0.0762 0.0666 0.0481 0.0394 0.0590 0.0649 0.1209 0.2467 0.3564 0.4459 0.4152
## 6 0.0286 0.0453 0.0277 0.0174 0.0384 0.0990 0.1201 0.1833 0.2105 0.3039 0.2988
##      V12    V13    V14    V15    V16    V17    V18    V19    V20    V21    V22
## 1 0.1582 0.2238 0.0645 0.0660 0.2273 0.3100 0.2999 0.5078 0.4797 0.5783 0.5071
## 2 0.6552 0.6919 0.7797 0.7464 0.9444 1.0000 0.8874 0.8024 0.7818 0.5212 0.4052
## 3 0.7060 0.5544 0.5320 0.6479 0.6931 0.6759 0.7551 0.8929 0.8619 0.7974 0.6737
## 4 0.1992 0.0184 0.2261 0.1729 0.2131 0.0693 0.2281 0.4060 0.3973 0.2741 0.3690
## 5 0.3952 0.4256 0.4135 0.4528 0.5326 0.7306 0.6193 0.2032 0.4636 0.4148 0.4292
## 6 0.4250 0.6343 0.8198 1.0000 0.9988 0.9508 0.9025 0.7234 0.5122 0.2074 0.3985
##      V23    V24    V25    V26    V27    V28    V29    V30    V31    V32    V33
## 1 0.4328 0.5550 0.6711 0.6415 0.7104 0.8080 0.6791 0.3857 0.1307 0.2604 0.5121
## 2 0.3957 0.3914 0.3250 0.3200 0.3271 0.2767 0.4423 0.2028 0.3788 0.2947 0.1984
## 3 0.4293 0.3648 0.5331 0.2413 0.5070 0.8533 0.6036 0.8514 0.8512 0.5045 0.1862
## 4 0.5556 0.4846 0.3140 0.5334 0.5256 0.2520 0.2090 0.3559 0.6260 0.7340 0.6120
## 5 0.5730 0.5399 0.3161 0.2285 0.6995 1.0000 0.7262 0.4724 0.5103 0.5459 0.2881
## 6 0.5890 0.2872 0.2043 0.5782 0.5389 0.3750 0.3411 0.5067 0.5580 0.4778 0.3299
##      V34    V35    V36    V37    V38    V39    V40    V41    V42    V43    V44
## 1 0.7547 0.8537 0.8507 0.6692 0.6097 0.4943 0.2744 0.0510 0.2834 0.2825 0.4256
## 2 0.2341 0.1306 0.4182 0.3835 0.1057 0.1840 0.1970 0.1674 0.0583 0.1401 0.1628
## 3 0.2709 0.4232 0.3043 0.6116 0.6756 0.5375 0.4719 0.4647 0.2587 0.2129 0.2222
## 4 0.3497 0.3953 0.3012 0.5408 0.8814 0.9857 0.9167 0.6121 0.5006 0.3210 0.3202
## 5 0.0981 0.1951 0.4181 0.4604 0.3217 0.2828 0.2430 0.1979 0.2444 0.1847 0.0841
## 6 0.2198 0.1407 0.2856 0.3807 0.4158 0.4054 0.3296 0.2707 0.2650 0.0723 0.1238
##      V45    V46    V47    V48    V49    V50    V51    V52    V53    V54    V55
## 1 0.2641 0.1386 0.1051 0.1343 0.0383 0.0324 0.0232 0.0027 0.0065 0.0159 0.0072
## 2 0.0621 0.0203 0.0530 0.0742 0.0409 0.0061 0.0125 0.0084 0.0089 0.0048 0.0094
## 3 0.2111 0.0176 0.1348 0.0744 0.0130 0.0106 0.0033 0.0232 0.0166 0.0095 0.0180
## 4 0.4295 0.3654 0.2655 0.1576 0.0681 0.0294 0.0241 0.0121 0.0036 0.0150 0.0085
## 5 0.0692 0.0528 0.0357 0.0085 0.0230 0.0046 0.0156 0.0031 0.0054 0.0105 0.0110
## 6 0.1192 0.1089 0.0623 0.0494 0.0264 0.0081 0.0104 0.0045 0.0014 0.0038 0.0013
##      V56    V57    V58    V59    V60 Class
## 1 0.0167 0.0180 0.0084 0.0090 0.0032     R
## 2 0.0191 0.0140 0.0049 0.0052 0.0044     R
## 3 0.0244 0.0316 0.0164 0.0095 0.0078     R
## 4 0.0073 0.0050 0.0044 0.0040 0.0117     R
## 5 0.0015 0.0072 0.0048 0.0107 0.0094     R
## 6 0.0089 0.0057 0.0027 0.0051 0.0062     R

Step 2: Prepare and explore data

Sonar is a data frame with 208 observations on 61 variables: 60 numeric predictors and one nominal variable (Class).

cat("number of rows and columns are:", nrow(Sonar), ncol(Sonar))
## number of rows and columns are: 208 61

Let’s check how many M and R examples the Sonar data contains, and whether any of its columns contain NA values.

table(Sonar$Class) 
## 
##   M   R 
## 111  97
apply(Sonar, 2, function(x) sum(is.na(x))) 
##    V1    V2    V3    V4    V5    V6    V7    V8    V9   V10   V11   V12   V13 
##     0     0     0     0     0     0     0     0     0     0     0     0     0 
##   V14   V15   V16   V17   V18   V19   V20   V21   V22   V23   V24   V25   V26 
##     0     0     0     0     0     0     0     0     0     0     0     0     0 
##   V27   V28   V29   V30   V31   V32   V33   V34   V35   V36   V37   V38   V39 
##     0     0     0     0     0     0     0     0     0     0     0     0     0 
##   V40   V41   V42   V43   V44   V45   V46   V47   V48   V49   V50   V51   V52 
##     0     0     0     0     0     0     0     0     0     0     0     0     0 
##   V53   V54   V55   V56   V57   V58   V59   V60 Class 
##     0     0     0     0     0     0     0     0     0

Here, we manually split Sonar into training and test sets by shuffling the data and slicing it at the 70% mark.

SEED <- 123
set.seed(SEED)
data <- Sonar[sample(nrow(Sonar)), ] # shuffle data first
bound <- floor(0.7 * nrow(data))
df_train <- data[1:bound, ] 
df_test <- data[(bound + 1):nrow(data), ]
cat("number of training and test samples are ", nrow(df_train), nrow(df_test))
## number of training and test samples are  145 63

Let’s check whether the train and test samples were split properly, with roughly the same proportions of Class labels:

cat("number of training classes: \n", table(df_train$Class)/nrow(df_train))
## number of training classes: 
##  0.5172414 0.4827586
cat("\n")
cat("number of test classes: \n", table(df_test$Class)/nrow(df_test))
## number of test classes: 
##  0.5714286 0.4285714

Let’s separate the predictors and class labels of the train and test sets to simplify our task:

X_train <- subset(df_train, select=-Class)
y_train <- df_train$Class
X_test <- subset(df_test, select=-Class) # exclude Class for prediction
y_test <- df_test$Class

Step 3: Training a model on the data

model_knn <- knn(train=X_train,
                 test=X_test,
                 cl=y_train,  # class labels
                 k=3)
model_knn
##  [1] M R M M M M M R M R M R M M M M M R R M M R M R M R M R M M R M R M R M M R
## [39] M M R M M R M M M M M R M M R R M R R R M R R M R
## Levels: M R

Step 4: Evaluate the model performance

As you can see, model_knn with k=3 provides the above predictions for the test set X_test. We can then see how many observations have been correctly or incorrectly classified by comparing the predictions to the true labels:

conf_mat <- table(y_test, model_knn)
conf_mat
##       model_knn
## y_test  M  R
##      M 31  5
##      R  7 20

To compute the accuracy, we sum up all the correctly classified observations (located on the diagonal) and divide by the total number of test observations:

cat("Test accuracy: ", sum(diag(conf_mat))/sum(conf_mat))
## Test accuracy:  0.8095238

To assess whether k=3 is a good choice, and to see whether k=3 leads to overfitting or underfitting the data, we can use knn.cv, which performs leave-one-out cross-validation (LOOCV) on the training set (i.e., it holds out one training sample at a time, treats it as a new example, and predicts its class label).

Below are the predicted classes for the training set using leave-one-out cross-validation. Now, let’s examine its accuracy.

knn_loocv <- knn.cv(train=X_train, cl=y_train, k=3)
knn_loocv
##   [1] R M M R M M R M R M M R R R M M M M R R R R M M M R M M M M R M R M M M M
##  [38] R R M R M M M M R M R R M R M M R M R M R R M M M R R M M M M R M R M R R
##  [75] R R M M R M M M R M M M R R R M M R R R M R M M M M R M M R R M M R M R M
## [112] M M R R R M M M M M M R R R R M M R R R R M M M M M R M R M R M R R
## Levels: M R

Let’s create a confusion matrix from the training labels y_train and the cross-validated predictions knn_loocv, and compute the accuracy in the same way as above. What do you find when comparing the LOOCV accuracy with the test accuracy above?

conf_mat_cv <- table(y_train, knn_loocv)
conf_mat_cv
##        knn_loocv
## y_train  M  R
##       M 64 11
##       R 18 52
cat("LOOCV accuracy: ", sum(diag(conf_mat_cv)) / sum(conf_mat_cv))
## LOOCV accuracy:  0.8

The difference between the cross-validated accuracy and the test accuracy suggests that k=3 may be overfitting. Perhaps we should change k to lessen the overfitting, as sketched below.
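
As a quick sketch of that idea (reusing the X_train and y_train objects from above), we can compare LOOCV accuracy across several odd values of k and look for the one that generalizes best:

# LOOCV accuracy on the training set for several odd values of k
set.seed(123)
for (k in c(1, 3, 5, 7, 9, 11)) {
  pred_cv <- knn.cv(train = X_train, cl = y_train, k = k)
  cat("k =", k, " LOOCV accuracy =", round(mean(pred_cv == y_train), 4), "\n")
}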

Step 5: Improve the performance of the model

As noted earlier, we have not yet standardized our training and test sets (as part of preprocessing); a manual sketch of that step follows below. In the rest of the tutorial, we will see the effect of choosing a suitable k through repeated cross-validation using the caret library.
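
Standardization statistics should come from the training set only, so that no information about the test set leaks into preprocessing. A minimal sketch, assuming the X_train and X_test objects from Step 2 (caret’s preProcess, used below, handles this automatically):

# standardize using training-set means and standard deviations only
train_means <- colMeans(X_train)
train_sds   <- apply(X_train, 2, sd)
X_train_std <- scale(X_train, center = train_means, scale = train_sds)
X_test_std  <- scale(X_test,  center = train_means, scale = train_sds)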

In a cross-validation procedure:

- The data is divided into a finite number of mutually exclusive subsets (folds).
- In each iteration, one subset is set aside and the remaining subsets are used as the training set.
- The subset that was set aside is then used as the test set (for prediction).

This is a method of cross-referencing the model against its own data; a manual sketch follows below.
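
Here is a manual sketch of one such procedure, using the training split from Step 2 and a fixed k = 3 (caret automates this below, adding repeats and a grid search over k):

# manual 5-fold cross-validation for k = 3
set.seed(123)
folds <- sample(rep(1:5, length.out = nrow(X_train)))   # assign each row to a fold
cv_acc <- sapply(1:5, function(i) {
  hold  <- folds == i                                   # hold out fold i
  preds <- knn(train = X_train[!hold, ], test = X_train[hold, ],
               cl = y_train[!hold], k = 3)
  mean(preds == y_train[hold])                          # accuracy on the held-out fold
})
cat("mean 5-fold CV accuracy:", mean(cv_acc))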

SEED <- 2016
set.seed(SEED)

# create the training set using 70% of the overall Sonar data

in_train <- createDataPartition(Sonar$Class, p=0.7, list=FALSE) # create training indices

ndf_train <- Sonar[in_train, ]
ndf_test <- Sonar[-in_train, ]

Here, we specify the cross-validation method we want to use to find the best k via grid search. Later, we use the built-in plot function to assess the changes in accuracy for different choices of k.

# set up 5-fold cross-validation with 2 repeats
ctrl <- trainControl(method="repeatedcv", number=5, repeats=2)

nn_grid <- expand.grid(k=c(1,3,5,7))
nn_grid
##   k
## 1 1
## 2 3
## 3 5
## 4 7
set.seed(SEED)

best_knn <- train(Class~., data=ndf_train,
                  method="knn",
                  trControl=ctrl, 
                  preProcess = c("center", "scale"),  # standardize
                  tuneGrid=nn_grid)
best_knn
## k-Nearest Neighbors 
## 
## 146 samples
##  60 predictor
##   2 classes: 'M', 'R' 
## 
## Pre-processing: centered (60), scaled (60) 
## Resampling: Cross-Validated (5 fold, repeated 2 times) 
## Summary of sample sizes: 117, 116, 116, 117, 118, 117, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   1  0.8568637  0.7098282
##   3  0.8147537  0.6214765
##   5  0.7978489  0.5866507
##   7  0.7326437  0.4527131
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 1.

So, seemingly, k=1 gives the highest accuracy under repeated cross-validation.
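
As promised, the built-in plot method shows how the cross-validated accuracy changes with k; consistent with the table above, accuracy falls as k grows.

plot(best_knn)  # cross-validated accuracy (y-axis) versus number of neighbors k (x-axis)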

Let’s try dimensionality reduction as part of preprocessing to achieve a higher test accuracy than above. This may not have a definite solution, and it depends on how hard you try!

Use the above best_knn to make predictions on the test set (remember to remove Class before predicting). Then create a much better version of the confusion matrix with the confusionMatrix function from caret, and examine the accuracy and its 95% confidence interval; a sketch follows below.
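
A sketch of that exercise (results not shown; the name X_test_best is ours):

# evaluate best_knn on the held-out test set
X_test_best <- subset(ndf_test, select = -Class)   # drop Class before predicting
pred_best   <- predict(best_knn, newdata = X_test_best)
confusionMatrix(ndf_test$Class, pred_best)         # reports accuracy and its 95% CI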

In fact, the above result indicates that k=1 (as could be guessed) is also overfitting, though it might be a better option than k=3. Since the dimension of our data is high (60 predictors is considered high!), you might have suspected that a better approach is to perform dimensionality reduction as part of preprocessing.
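
For instance, here is a sketch of PCA-based reduction inside caret’s preprocessing; by default the pca step keeps enough components to explain 95% of the variance. (Note that the run below applies a Yeo-Johnson power transformation instead, which transforms the predictors but does not reduce their number.)

# sketch: project the standardized predictors onto principal components before KNN
ctrl_pca <- trainControl(method = "repeatedcv", number = 5, repeats = 5)
knn_pca  <- train(Class ~ ., data = ndf_train, method = "knn",
                  trControl = ctrl_pca,
                  preProcess = c("center", "scale", "pca"),
                  tuneGrid = expand.grid(k = c(1, 3, 5, 7)))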

SEED <- 123 
set.seed(SEED) 

ctrl <- trainControl(method="repeatedcv", number=5, repeats=5)
best_knn_reduced <- train(Class~., data=ndf_train, method="knn",
                          trControl=ctrl,
                          preProcess=c("center", "scale", "YeoJohnson"))

X_test <- subset(ndf_test, select=-Class)  # exclude Class for prediction

pred_reduced <- predict(best_knn_reduced, newdata=X_test)
conf_mat_best_reduced <- confusionMatrix(ndf_test$Class, pred_reduced) 

conf_mat_best_reduced 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  M  R
##          M 29  4
##          R  8 21
##                                           
##                Accuracy : 0.8065          
##                  95% CI : (0.6863, 0.8958)
##     No Information Rate : 0.5968          
##     P-Value [Acc > NIR] : 0.0003688       
##                                           
##                   Kappa : 0.608           
##                                           
##  Mcnemar's Test P-Value : 0.3864762       
##                                           
##             Sensitivity : 0.7838          
##             Specificity : 0.8400          
##          Pos Pred Value : 0.8788          
##          Neg Pred Value : 0.7241          
##              Prevalence : 0.5968          
##          Detection Rate : 0.4677          
##    Detection Prevalence : 0.5323          
##       Balanced Accuracy : 0.8119          
##                                           
##        'Positive' Class : M               
## 
