Introduction

KNN (k-nearest neighbors) is a non-parametric, supervised machine learning algorithm.

It can be used for both classification and regression: it predicts a target variable from one or more independent variables. The algorithm stores all the available data and classifies a new data point based on its similarity to the stored points, so new data can easily be assigned to the most suitable category.

Formula for KNN

Several distance measures can be used to quantify similarity in KNN-based problems; the most common is the Euclidean distance between two points:

\[ D(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \]
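
As a quick sanity check of this formula, the distance between the first two iris observations can be computed directly in base R (a minimal sketch using only the four numeric columns):

x <- unlist(iris[1, -5])   # measurements of the first observation
y <- unlist(iris[2, -5])   # measurements of the second observation
sqrt(sum((x - y)^2))       # same value as dist(iris[1:2, -5])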

How KNN works

To decide which category a new data point belongs to (a minimal base-R sketch follows this list):

  • Select the value of K, the number of nearest neighbors.

  • Calculate the Euclidean distance from the new data point to every stored data point.

  • Take the K nearest neighbors according to the calculated distances.

  • Among these K neighbors, count the number of data points in each category.

  • Assign the new data point to the category with the most neighbors (majority vote).
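
These steps can be sketched directly in base R. The snippet below is only an illustration: new_point is a hypothetical observation made up for the example, and the raw (unscaled) iris measurements are used; the caret and class implementations used later do the real work.

k <- 5
new_point <- c(5.0, 3.4, 1.5, 0.2)           # hypothetical new observation (4 measurements)
X <- as.matrix(iris[, -5])                    # all stored data points (numeric features)
d <- sqrt(colSums((t(X) - new_point)^2))      # Euclidean distance to every stored point
votes <- table(iris$Species[order(d)[1:k]])   # categories of the K nearest neighbors
votes                                         # count of neighbors per category
names(which.max(votes))                       # predicted category: the majority vote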

Implementation with R

Data set

We use the iris dataset for this exercise.

data("iris")
dataset <- na.omit(iris)
# k-nearest neighbors with an Euclidean distance measure is sensitive to magnitudes and hence should be scaled for all features to weigh in equally.
dataset[,-5] <- scale(dataset[,-5])
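
After scaling, each feature should have mean 0 and standard deviation 1; a quick optional check:

round(colMeans(dataset[,-5]), 3)   # means should be ~0
apply(dataset[,-5], 2, sd)         # standard deviations should be 1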

Splitting the dataset into training and test sets

validationIndex <- createDataPartition(dataset$Species, p=0.70, list=FALSE)

train <- dataset[validationIndex,] # 70% of data to training
test <- dataset[-validationIndex,] # remaining 30% for test
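
createDataPartition() samples within each level of Species, so the split should stay balanced across the three classes; a quick optional check:

table(train$Species)   # class counts in the training set
table(test$Species)    # class counts in the test set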

Choosing the K value

Choosing the K value is very important for the model's accuracy.

One approach is the elbow method: search over a range of K values, check the model's accuracy at each one, and look for the point where accuracy stops improving.

Using caret

# Run the algorithm using 10-fold cross-validation, repeated 3 times
trainControl <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "Accuracy"

Initial k

set.seed(7)
fit.knn <- train(Species~., data=train, method="knn",
                 metric=metric ,trControl=trainControl)
knn.k1 <- fit.knn$bestTune # keep this Initial k for testing with knn() function in next section
print(fit.knn)
## k-Nearest Neighbors 
## 
## 105 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 93, 94, 94, 95, 94, 95, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.9518855  0.9272567
##   7  0.9576936  0.9360633
##   9  0.9607239  0.9405901
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
plot(fit.knn)

The plot shows the elbow at k = 9, with 96.07% accuracy on the training dataset.

Run the prediction on the test dataset and print the confusion matrix:

set.seed(7)
prediction <- predict(fit.knn, newdata = test)
cf <- confusionMatrix(prediction, test$Species)
print(cf)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         15         1
##   virginica       0          0        14
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9778          
##                  95% CI : (0.8823, 0.9994)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9667          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           0.9333
## Specificity                 1.0000            0.9667           1.0000
## Pos Pred Value              1.0000            0.9375           1.0000
## Neg Pred Value              1.0000            1.0000           0.9677
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.3111
## Detection Prevalence        0.3333            0.3556           0.3111
## Balanced Accuracy           1.0000            0.9833           0.9667

With the initial k = 9, the model correctly predicts 97.78% of the test dataset.

Grid search for k

Let’s try searching k from 1 to 20:

set.seed(7)
grid <- expand.grid(.k=seq(1,20,by=1))
fit.knn <- train(Species~., data=train, method="knn", 
                 metric=metric, tuneGrid=grid, trControl=trainControl)
knn.k2 <- fit.knn$bestTune # keep this optimal k for testing with stand alone knn() function in next section
print(fit.knn)
## k-Nearest Neighbors 
## 
## 105 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 93, 94, 94, 95, 94, 95, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    1  0.9515825  0.9270319
##    2  0.9481818  0.9216997
##    3  0.9509596  0.9257518
##    4  0.9610943  0.9412851
##    5  0.9518855  0.9272567
##    6  0.9626263  0.9436396
##    7  0.9576936  0.9360633
##    8  0.9543603  0.9309562
##    9  0.9607239  0.9405901
##   10  0.9641246  0.9457365
##   11  0.9711616  0.9563425
##   12  0.9543603  0.9309562
##   13  0.9610943  0.9411705
##   14  0.9570875  0.9351304
##   15  0.9580640  0.9366437
##   16  0.9544276  0.9308756
##   17  0.9547306  0.9315179
##   18  0.9361953  0.9032091
##   19  0.9270539  0.8898587
##   20  0.9240236  0.8851200
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 11.
plot(fit.knn)

We found the optimal k = 11: the number of closest instances collected in order to make a prediction.

Use the fitted model to predict the classes of the test set, and print the confusion matrix:

set.seed(7)
prediction <- predict(fit.knn, newdata = test)
cf <- confusionMatrix(prediction, test$Species)
print(cf)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         14         1
##   virginica       0          1        14
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9556          
##                  95% CI : (0.8485, 0.9946)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9333          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9333           0.9333
## Specificity                 1.0000            0.9667           0.9667
## Pos Pred Value              1.0000            0.9333           0.9333
## Neg Pred Value              1.0000            0.9667           0.9667
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3111           0.3111
## Detection Prevalence        0.3333            0.3333           0.3333
## Balanced Accuracy           1.0000            0.9500           0.9500

With k = 11, the accuracy of the model is 95.56%.

Using the knn() function

Initial K

A common rule of thumb is to choose the initial value of k as the square root of the number of observations.

initial_k <- sqrt(NROW(dataset))
initial_k
## [1] 12.24745

Since the square root is not an integer, we run knn() with both k = 12 and k = 13 and check their accuracy:

# run knn() with k = 12 (floor) and k = 13 (ceiling) of initial_k
knn.12 <- knn(train=train[,-5], test=test[,-5], cl=train$Species, k=floor(initial_k))
knn.13 <- knn(train=train[,-5], test=test[,-5], cl=train$Species, k=ceiling(initial_k))

# use confusion matrix to calculate accuracy
cf.12 <- confusionMatrix(test$Species,knn.12) # k = 12
cf.12
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         14         1
##   virginica       0          1        14
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9556          
##                  95% CI : (0.8485, 0.9946)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9333          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9333           0.9333
## Specificity                 1.0000            0.9667           0.9667
## Pos Pred Value              1.0000            0.9333           0.9333
## Neg Pred Value              1.0000            0.9667           0.9667
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3111           0.3111
## Detection Prevalence        0.3333            0.3333           0.3333
## Balanced Accuracy           1.0000            0.9500           0.9500
cf.13 <- confusionMatrix(test$Species,knn.13) # k = 13
cf.13
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         14         1
##   virginica       0          1        14
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9556          
##                  95% CI : (0.8485, 0.9946)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9333          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9333           0.9333
## Specificity                 1.0000            0.9667           0.9667
## Pos Pred Value              1.0000            0.9333           0.9333
## Neg Pred Value              1.0000            0.9667           0.9667
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3111           0.3111
## Detection Prevalence        0.3333            0.3333           0.3333
## Balanced Accuracy           1.0000            0.9500           0.9500

Searching k

# find an optimal k: test-set accuracy for k = 1..15
k.optm <- numeric(15)
for (i in 1:15) {
  knn.mod <- knn(train=train[,-5], test=test[,-5], cl=train[,5], k=i)
  k.optm[i] <- 100 * sum(test[,5] == knn.mod) / NROW(test[,5])
  cat(i, '=', k.optm[i], '\n')
}
## 1 = 91.11111 
## 2 = 91.11111 
## 3 = 88.88889 
## 4 = 93.33333 
## 5 = 91.11111 
## 6 = 97.77778 
## 7 = 97.77778 
## 8 = 97.77778 
## 9 = 97.77778 
## 10 = 95.55556 
## 11 = 95.55556 
## 12 = 93.33333 
## 13 = 95.55556 
## 14 = 95.55556 
## 15 = 95.55556
# Accuracy plot
plot(k.optm, type="b", xlab = "k Value", ylab="Accuracy")

With k = 9

Let's run the stand-alone knn() with the initial k = 9 found by caret:

fit.knn.k1 <- knn(train=train[,-5], test=test[,-5], cl=train$Species, k=knn.k1)

cf <- confusionMatrix(test$Species,fit.knn.k1)
cf
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         15         0
##   virginica       0          1        14
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9778          
##                  95% CI : (0.8823, 0.9994)
##     No Information Rate : 0.3556          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9667          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9375           1.0000
## Specificity                 1.0000            1.0000           0.9677
## Pos Pred Value              1.0000            1.0000           0.9333
## Neg Pred Value              1.0000            0.9667           1.0000
## Prevalence                  0.3333            0.3556           0.3111
## Detection Rate              0.3333            0.3333           0.3111
## Detection Prevalence        0.3333            0.3333           0.3333
## Balanced Accuracy           1.0000            0.9688           0.9839

With k = 9, knn() correctly predicts 97.78% of the test dataset.

With k = 11

Now run knn() with the optimal k = 11 found by the grid search:

fit.knn.k2 <- knn(train=train[,-5], test=test[,-5], cl=train$Species, k=knn.k2)

cf <- confusionMatrix(test$Species,fit.knn.k2)
cf
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         14         1
##   virginica       0          1        14
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9556          
##                  95% CI : (0.8485, 0.9946)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9333          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9333           0.9333
## Specificity                 1.0000            0.9667           0.9667
## Pos Pred Value              1.0000            0.9333           0.9333
## Neg Pred Value              1.0000            0.9667           0.9667
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3111           0.3111
## Detection Prevalence        0.3333            0.3333           0.3333
## Balanced Accuracy           1.0000            0.9500           0.9500

With k = 11, knn() correctly predicts 95.56% of the test dataset.