K-Nearest Neighbors (KNN) Using R

Introduction

The K-Nearest Neighbors (KNN) algorithm is a non-parametric supervised machine learning method.

It is used for both classification and regression, and predicts a target variable from one or more independent variables. KNN stores all the available training data and classifies a new data point according to its similarity to the stored points. When a new observation arrives, it is assigned to the category of the stored points it most resembles.

Formula for KNN

Several distance measures can be used to quantify similarity in KNN problems. The most common choice is the Euclidean distance between two points: \[ {Euclidean \space Distance: \space D(x,y)} = \sqrt{\sum_{i}(x_i-y_i)^2}\]
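
As a quick check of the distance calculation, the following minimal sketch computes the Euclidean distance (the square root of the summed squared coordinate differences) between two points in base R. The vectors `x` and `y` hold made-up values for illustration and are not taken from the data set.

```r
# Two illustrative points (hypothetical values)
x <- c(1, 2)
y <- c(4, 6)

# Square root of the sum of squared coordinate differences
d <- sqrt(sum((x - y)^2))
print(d)  # 5, since sqrt(3^2 + 4^2) = sqrt(25)

# The same distance from the built-in dist() function
d.builtin <- as.numeric(dist(rbind(x, y), method = "euclidean"))
```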

How KNN works

To determine which category a new data point belongs to:

Select the value of K, the number of nearest neighbors.
Calculate the Euclidean distance from the new data point to each point in the training data.
Take the K nearest neighbors according to the calculated distances.
Among these K neighbors, count the number of data points in each category.
Assign the new data point to the category with the most neighbors.
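
The steps above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: `simple.knn` is a hypothetical helper written for this example (not a library function), the training data are made-up values, and ties between categories are resolved by taking the first category, which is a simplification.

```r
# Classify one new point by majority vote among its k nearest neighbors
simple.knn <- function(train.x, train.y, new.x, k) {
  # Euclidean distance from the new point to every training point
  diffs <- sweep(as.matrix(train.x), 2, new.x)  # subtract new.x from each row
  dists <- sqrt(rowSums(diffs^2))
  # Indices of the k training points with the smallest distances
  nearest <- order(dists)[seq_len(k)]
  # Count how many of those neighbors fall in each category
  votes <- table(train.y[nearest])
  # Return the category with the most votes (first one wins on a tie)
  names(votes)[which.max(votes)]
}

# Toy training data (hypothetical values): two well-separated groups
train.x <- data.frame(a = c(1, 2, 3, 8, 9, 10), b = c(1, 2, 1, 8, 9, 8))
train.y <- factor(c("A", "A", "A", "B", "B", "B"))

simple.knn(train.x, train.y, new.x = c(2, 1), k = 3)  # "A"
simple.knn(train.x, train.y, new.x = c(9, 8), k = 3)  # "B"
```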

Implementation in R

Data set
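
The code in this section uses a data frame named `mower.df`, but the source does not show how it is loaded. One possible way to make the example reproducible, assuming the table shown in this section is the complete data set, is to enter its 24 rows by hand:

```r
# Build mower.df from the 24 rows of the mower table
mower.df <- data.frame(
  Income = c(60.0, 85.5, 64.8, 61.5, 87.0, 110.1, 108.0, 82.8, 69.0, 93.0, 51.0, 81.0,
             75.0, 52.8, 64.8, 43.2, 84.0, 49.2, 59.4, 66.0, 47.4, 33.0, 51.0, 63.0),
  LotSize = c(18.4, 16.8, 21.6, 20.8, 23.6, 19.2, 17.6, 22.4, 20.0, 20.8, 22.0, 20.0,
              19.6, 20.8, 17.2, 20.4, 17.6, 17.6, 16.0, 18.4, 16.4, 18.8, 14.0, 14.8),
  # The first 12 rows are owners, the last 12 are non-owners
  Ownership = factor(rep(c("Owner", "Nonowner"), each = 12))
)

str(mower.df)
```

Storing `Ownership` as a factor matters here because caret treats a factor outcome as a classification problem.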

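library(knitr)
library(kableExtra)

# mower.df must already exist in the session as a data frame
# with the columns Income, LotSize and Ownership
kab <- knitr::kable(mower.df, caption = "Mower dataset for KNN", booktabs = FALSE, label = "dataset table")

# Render the table in the "classic 2" style at full page width
kable_classic_2(kab, full_width = TRUE)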

Mower dataset for KNN
Income	LotSize	Ownership
60.0	18.4	Owner
85.5	16.8	Owner
64.8	21.6	Owner
61.5	20.8	Owner
87.0	23.6	Owner
110.1	19.2	Owner
108.0	17.6	Owner
82.8	22.4	Owner
69.0	20.0	Owner
93.0	20.8	Owner
51.0	22.0	Owner
81.0	20.0	Owner
75.0	19.6	Nonowner
52.8	20.8	Nonowner
64.8	17.2	Nonowner
43.2	20.4	Nonowner
84.0	17.6	Nonowner
49.2	17.6	Nonowner
59.4	16.0	Nonowner
66.0	18.4	Nonowner
47.4	16.4	Nonowner
33.0	18.8	Nonowner
51.0	14.0	Nonowner
63.0	14.8	Nonowner

Choosing K value

The choice of K has a large effect on model accuracy.

One approach is the elbow method: search over a range of K values and check the model accuracy for each one.

Using Caret

library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

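set.seed(7)  # make the cross-validation folds reproducible
# 10-fold cross-validation, repeated 3 times
train.control <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
metric <- "Accuracy"

# Fit KNN; with its default settings caret tries three values of k
fit.knn <- train(Ownership ~ ., data = mower.df, method = "knn", metric = metric, trControl = train.control)
knn.k1 <- fit.knn$bestTune  # best k found by cross-validation
print(fit.knn)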

## k-Nearest Neighbors 
## 
## 24 samples
##  2 predictor
##  2 classes: 'Nonowner', 'Owner' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 22, 22, 21, 22, 21, 22, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.6805556  0.3600000
##   7  0.6416667  0.2733333
##   9  0.7277778  0.4366667
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
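
Only three values of k were tried above. A wider search over a range of k can be sketched as follows. Note the assumptions: this swaps in leave-one-out cross-validation via `knn.cv()` from the `class` package instead of the repeated 10-fold scheme used with caret, so the accuracies will not match the output above. The data are re-entered from the mower table so that the block runs on its own, and names such as `k.values` and `loo.accuracy` are illustrative.

```r
library(class)  # provides knn.cv() for leave-one-out cross-validation

# Mower data re-entered from the table (first 12 rows owners, last 12 non-owners)
income <- c(60.0, 85.5, 64.8, 61.5, 87.0, 110.1, 108.0, 82.8, 69.0, 93.0, 51.0, 81.0,
            75.0, 52.8, 64.8, 43.2, 84.0, 49.2, 59.4, 66.0, 47.4, 33.0, 51.0, 63.0)
lot.size <- c(18.4, 16.8, 21.6, 20.8, 23.6, 19.2, 17.6, 22.4, 20.0, 20.8, 22.0, 20.0,
              19.6, 20.8, 17.2, 20.4, 17.6, 17.6, 16.0, 18.4, 16.4, 18.8, 14.0, 14.8)
ownership <- factor(rep(c("Owner", "Nonowner"), each = 12))
predictors <- cbind(income, lot.size)  # no pre-processing, as in the caret fit

set.seed(7)                      # knn.cv() breaks voting ties at random
k.values <- seq(1, 15, by = 2)   # odd values of k from 1 to 15

# For each k, predict every point from all the others and record the accuracy
loo.accuracy <- sapply(k.values, function(k) {
  pred <- knn.cv(train = predictors, cl = ownership, k = k)
  mean(pred == ownership)
})

print(data.frame(k = k.values, accuracy = loo.accuracy))
```

Plotting `loo.accuracy` against `k.values` gives the kind of curve the elbow method inspects.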

Among the three values of k tried, cross-validated accuracy is highest at k = 9, which the plot of accuracy against k also shows.

plot(fit.knn)

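# Predict classes for the training data itself (no separate test set is used)
prediction <- predict(fit.knn, newdata = mower.df, type = "raw")
# Put the predicted class next to the observed one
result.df <- cbind(mower.df, prediction)

kab <- knitr::kable(result.df, caption = "KNN classifier result", booktabs = FALSE, label = "Result table")

kable_classic_2(kab, full_width = TRUE)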

KNN classifier result
Income	LotSize	Ownership	prediction
60.0	18.4	Owner	Nonowner
85.5	16.8	Owner	Owner
64.8	21.6	Owner	Nonowner
61.5	20.8	Owner	Nonowner
87.0	23.6	Owner	Owner
110.1	19.2	Owner	Owner
108.0	17.6	Owner	Owner
82.8	22.4	Owner	Owner
69.0	20.0	Owner	Nonowner
93.0	20.8	Owner	Owner
51.0	22.0	Owner	Nonowner
81.0	20.0	Owner	Owner
75.0	19.6	Nonowner	Owner
52.8	20.8	Nonowner	Nonowner
64.8	17.2	Nonowner	Nonowner
43.2	20.4	Nonowner	Nonowner
84.0	17.6	Nonowner	Owner
49.2	17.6	Nonowner	Nonowner
59.4	16.0	Nonowner	Nonowner
66.0	18.4	Nonowner	Nonowner
47.4	16.4	Nonowner	Nonowner
33.0	18.8	Nonowner	Nonowner
51.0	14.0	Nonowner	Nonowner
63.0	14.8	Nonowner	Nonowner

cf <- confusionMatrix(prediction, mower.df$Ownership, positive = "Owner")
print(cf)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Nonowner Owner
##   Nonowner       10     5
##   Owner           2     7
##                                           
##                Accuracy : 0.7083          
##                  95% CI : (0.4891, 0.8738)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.03196         
##                                           
##                   Kappa : 0.4167          
##                                           
##  Mcnemar's Test P-Value : 0.44969         
##                                           
##             Sensitivity : 0.5833          
##             Specificity : 0.8333          
##          Pos Pred Value : 0.7778          
##          Neg Pred Value : 0.6667          
##              Prevalence : 0.5000          
##          Detection Rate : 0.2917          
##    Detection Prevalence : 0.3750          
##       Balanced Accuracy : 0.7083          
##                                           
##        'Positive' Class : Owner           
##
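
The main statistics in this output can be reproduced by hand from the four counts in the confusion matrix. The sketch below uses the counts printed above with "Owner" as the positive class; the variable names are illustrative.

```r
# Counts from the confusion matrix, with "Owner" as the positive class
tp <- 7    # predicted Owner, actually Owner
fn <- 5    # predicted Nonowner, actually Owner
fp <- 2    # predicted Owner, actually Nonowner
tn <- 10   # predicted Nonowner, actually Nonowner
n <- tp + fn + fp + tn

accuracy    <- (tp + tn) / n     # 17 / 24
sensitivity <- tp / (tp + fn)    # 7 / 12
specificity <- tn / (tn + fp)    # 10 / 12
ppv         <- tp / (tp + fp)    # 7 / 9
npv         <- tn / (tn + fn)    # 10 / 15

# Cohen's kappa: observed agreement corrected for agreement expected by chance
expected <- ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n^2
cohen.kappa <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity,
        ppv = ppv, npv = npv, kappa = cohen.kappa), 4)
```

The rounded values agree with the Accuracy, Sensitivity, Specificity, Pos Pred Value, Neg Pred Value and Kappa lines reported by `confusionMatrix()`.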