Introduction

The K-Nearest Neighbors (KNN) algorithm is a non-parametric, supervised machine learning method.

It is used for both classification and regression: it predicts a target variable from one or more independent variables. KNN stores all the available training data and classifies a new data point according to its similarity to the stored points. When a new observation appears, it is assigned to the category that is most common among its nearest stored neighbors.

Formula for KNN

Several distance measures can be used to quantify similarity in KNN problems. The most common choice is the Euclidean distance between two points \(x\) and \(y\) with \(n\) features:

\[ D(x,y) = \sqrt{\sum_{i=1}^{n}(x_i-y_i)^2} \]
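
As a quick illustration, the Euclidean distance can be computed directly in base R. The sketch below uses a hypothetical helper, `euclidean_distance` (not part of any package), and takes the Income and LotSize values of the first two households in the mower dataset used later in this article.

```r
# Hypothetical helper: Euclidean distance between two numeric vectors
euclidean_distance <- function(x, y) {
  stopifnot(length(x) == length(y))
  sqrt(sum((x - y)^2))
}

# Income and LotSize of the first two rows of the mower data
a <- c(60.0, 18.4)
b <- c(85.5, 16.8)

d <- euclidean_distance(a, b)
print(d)  # about 25.55

# Base R's dist() uses the Euclidean distance by default and gives the same value
print(as.numeric(dist(rbind(a, b))))
```

Because Income spans a much wider range than LotSize, it contributes almost all of this distance; standardizing the predictors before computing distances is a common way to balance them.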

How KNN works

To decide which category a new data point belongs to:

  • Select the value of K, the number of nearest neighbors to consider.
  • Calculate the Euclidean distance from the new data point to every point in the training data.
  • Take the K nearest neighbors according to the calculated distances.
  • Among these K neighbors, count the number of data points in each category.
  • Assign the new data point to the category with the largest number of neighbors.
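
The steps above can be sketched in a few lines of base R. This is a minimal illustration rather than a production implementation: `knn_classify` is a hypothetical helper written for this example, the toy data are made up, and ties in the vote are simply resolved in favor of the first factor level.

```r
# Hypothetical helper: classify one new point by majority vote
# among its k nearest training points
knn_classify <- function(train_x, train_y, new_x, k) {
  train_x <- as.matrix(train_x)
  # Euclidean distance from the new point to every training point
  distances <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Indices of the k nearest neighbors
  nearest <- order(distances)[seq_len(k)]
  # Count the neighbors in each category
  votes <- table(train_y[nearest])
  # Category with the most votes (ties go to the first level)
  names(votes)[which.max(votes)]
}

# Toy data made up for illustration (not the mower dataset)
train_x <- data.frame(x1 = c(1, 2, 1.5, 8, 9, 8.5),
                      x2 = c(1, 1.5, 2, 8, 9, 8.5))
train_y <- factor(c("A", "A", "A", "B", "B", "B"))

k <- 3  # chosen number of neighbors
print(knn_classify(train_x, train_y, new_x = c(2, 2), k = k))    # "A"
print(knn_classify(train_x, train_y, new_x = c(8, 7.5), k = k))  # "B"
```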

Implementation in R

Data set

library(knitr)
library(kableExtra)

kab <- knitr::kable(mower.df, caption = "Mower dataset for KNN", booktabs = FALSE, label = "dataset table")

kable_classic_2(kab, full_width = TRUE)

Mower dataset for KNN
Income LotSize Ownership
60.0 18.4 Owner
85.5 16.8 Owner
64.8 21.6 Owner
61.5 20.8 Owner
87.0 23.6 Owner
110.1 19.2 Owner
108.0 17.6 Owner
82.8 22.4 Owner
69.0 20.0 Owner
93.0 20.8 Owner
51.0 22.0 Owner
81.0 20.0 Owner
75.0 19.6 Nonowner
52.8 20.8 Nonowner
64.8 17.2 Nonowner
43.2 20.4 Nonowner
84.0 17.6 Nonowner
49.2 17.6 Nonowner
59.4 16.0 Nonowner
66.0 18.4 Nonowner
47.4 16.4 Nonowner
33.0 18.8 Nonowner
51.0 14.0 Nonowner
63.0 14.8 Nonowner
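
The code in this article assumes that `mower.df` already exists in the R session. If it does not, one way to rebuild it from the table above is sketched below. Storing `Ownership` as a factor is an assumption, chosen because classification functions such as caret's `train` expect a factor outcome.

```r
# Rebuild the mower data from the table above (values copied row by row)
mower.df <- data.frame(
  Income = c(60.0, 85.5, 64.8, 61.5, 87.0, 110.1, 108.0, 82.8, 69.0, 93.0, 51.0, 81.0,
             75.0, 52.8, 64.8, 43.2, 84.0, 49.2, 59.4, 66.0, 47.4, 33.0, 51.0, 63.0),
  LotSize = c(18.4, 16.8, 21.6, 20.8, 23.6, 19.2, 17.6, 22.4, 20.0, 20.8, 22.0, 20.0,
              19.6, 20.8, 17.2, 20.4, 17.6, 17.6, 16.0, 18.4, 16.4, 18.8, 14.0, 14.8),
  # First 12 rows are owners, last 12 are non-owners
  Ownership = factor(rep(c("Owner", "Nonowner"), each = 12))
)

str(mower.df)
table(mower.df$Ownership)
```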

Choosing K value

The choice of K has a strong effect on model accuracy.

One approach is the elbow method: search over a range of K values, evaluate model accuracy for each, and pick the value beyond which accuracy stops improving.

Using Caret

library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
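set.seed(7)  # make the resampling reproducible
# 10-fold cross-validation, repeated 3 times
train.control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
metric <- "Accuracy"

# Fit KNN; caret tunes k over its default candidate values
fit.knn <- train(Ownership ~ ., data = mower.df, method = "knn", metric = metric, trControl = train.control)
knn.k1 <- fit.knn$bestTune  # best value of k found by the search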
print(fit.knn)
## k-Nearest Neighbors 
## 
## 24 samples
##  2 predictor
##  2 classes: 'Nonowner', 'Owner' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 22, 22, 21, 22, 21, 22, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.6805556  0.3600000
##   7  0.6416667  0.2733333
##   9  0.7277778  0.4366667
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.

The plot of cross-validated accuracy against k shows the highest accuracy at k = 9, the largest of the three values tried.

plot(fit.knn)
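
By default caret tried only three values of k (5, 7 and 9). As a sketch of a wider search, the block below evaluates k = 1 to 15 with leave-one-out cross-validation using `knn.cv` from the `class` package. This is a different resampling scheme from the repeated 10-fold cross-validation above, so its accuracies are not expected to match the caret output. The data are re-entered from the table earlier in this article so that the block runs on its own; the names `predictors`, `k_values` and `loo_accuracy` are chosen for this example.

```r
library(class)  # recommended package distributed with R

# Mower data re-entered from the table earlier in this article
income <- c(60.0, 85.5, 64.8, 61.5, 87.0, 110.1, 108.0, 82.8, 69.0, 93.0, 51.0, 81.0,
            75.0, 52.8, 64.8, 43.2, 84.0, 49.2, 59.4, 66.0, 47.4, 33.0, 51.0, 63.0)
lot_size <- c(18.4, 16.8, 21.6, 20.8, 23.6, 19.2, 17.6, 22.4, 20.0, 20.8, 22.0, 20.0,
              19.6, 20.8, 17.2, 20.4, 17.6, 17.6, 16.0, 18.4, 16.4, 18.8, 14.0, 14.8)
ownership <- factor(rep(c("Owner", "Nonowner"), each = 12))
predictors <- cbind(Income = income, LotSize = lot_size)

set.seed(7)  # knn.cv breaks voting ties at random
k_values <- 1:15

# Leave-one-out accuracy for each candidate k
loo_accuracy <- sapply(k_values, function(k) {
  predicted <- knn.cv(train = predictors, cl = ownership, k = k)
  mean(predicted == ownership)
})

results <- data.frame(k = k_values, accuracy = loo_accuracy)
print(results)
```

Within caret itself, the set of candidate values can be widened through the `tuneGrid` or `tuneLength` arguments of `train`.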

prediction <- predict(fit.knn, newdata = mower.df,type="raw")
result.df <- cbind(mower.df, prediction)

kab <- knitr::kable(result.df, caption = "KNN classifier result", booktabs = FALSE, label = "Result table")

kable_classic_2(kab, full_width = TRUE)

KNN classifier result
Income LotSize Ownership prediction
60.0 18.4 Owner Nonowner
85.5 16.8 Owner Owner
64.8 21.6 Owner Nonowner
61.5 20.8 Owner Nonowner
87.0 23.6 Owner Owner
110.1 19.2 Owner Owner
108.0 17.6 Owner Owner
82.8 22.4 Owner Owner
69.0 20.0 Owner Nonowner
93.0 20.8 Owner Owner
51.0 22.0 Owner Nonowner
81.0 20.0 Owner Owner
75.0 19.6 Nonowner Owner
52.8 20.8 Nonowner Nonowner
64.8 17.2 Nonowner Nonowner
43.2 20.4 Nonowner Nonowner
84.0 17.6 Nonowner Owner
49.2 17.6 Nonowner Nonowner
59.4 16.0 Nonowner Nonowner
66.0 18.4 Nonowner Nonowner
47.4 16.4 Nonowner Nonowner
33.0 18.8 Nonowner Nonowner
51.0 14.0 Nonowner Nonowner
63.0 14.8 Nonowner Nonowner

cf <- confusionMatrix(prediction, mower.df$Ownership, positive = "Owner")
print(cf)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Nonowner Owner
##   Nonowner       10     5
##   Owner           2     7
##                                           
##                Accuracy : 0.7083          
##                  95% CI : (0.4891, 0.8738)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.03196         
##                                           
##                   Kappa : 0.4167          
##                                           
##  Mcnemar's Test P-Value : 0.44969         
##                                           
##             Sensitivity : 0.5833          
##             Specificity : 0.8333          
##          Pos Pred Value : 0.7778          
##          Neg Pred Value : 0.6667          
##              Prevalence : 0.5000          
##          Detection Rate : 0.2917          
##    Detection Prevalence : 0.3750          
##       Balanced Accuracy : 0.7083          
##                                           
##        'Positive' Class : Owner           
##
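
To make the statistics above easier to read, the short sketch below recomputes the main ones by hand from the four cells of the confusion matrix, with "Owner" as the positive class. The variable names are chosen for this example.

```r
# Cell counts taken from the confusion matrix above
tp <- 7   # Owner predicted as Owner
fn <- 5   # Owner predicted as Nonowner
fp <- 2   # Nonowner predicted as Owner
tn <- 10  # Nonowner predicted as Nonowner

accuracy          <- (tp + tn) / (tp + fn + fp + tn)
sensitivity       <- tp / (tp + fn)  # share of true owners found
specificity       <- tn / (tn + fp)  # share of true non-owners found
pos_pred_value    <- tp / (tp + fp)
neg_pred_value    <- tn / (tn + fn)
balanced_accuracy <- (sensitivity + specificity) / 2

metrics <- round(c(Accuracy = accuracy,
                   Sensitivity = sensitivity,
                   Specificity = specificity,
                   PosPredValue = pos_pred_value,
                   NegPredValue = neg_pred_value,
                   BalancedAccuracy = balanced_accuracy), 4)
print(metrics)
# Accuracy 0.7083, Sensitivity 0.5833, Specificity 0.8333,
# PosPredValue 0.7778, NegPredValue 0.6667, BalancedAccuracy 0.7083
```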