The k-Nearest Neighbors (KNN) algorithm is a non-parametric supervised machine learning model.
It can be used for both classification and regression: it predicts a target variable from one or more independent variables. KNN stores all of the available training data and classifies a new data point based on its similarity to that data, so a new observation can be assigned to the most suitable category.
We can use many distance measures for KNN-based problems; most commonly, the Euclidean distance between two points is used. \[ \text{Euclidean Distance: } D(x,y) = \sqrt{\sum_{i}(x_i - y_i)^2} \]
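As a quick illustration, here is a minimal sketch in R that computes this distance for the first two records of the dataset below, both directly from the formula and with base R's dist():

x <- c(60.0, 18.4)     # Income and LotSize of the first record
y <- c(85.5, 16.8)     # Income and LotSize of the second record
sqrt(sum((x - y)^2))   # Euclidean distance via the formula
dist(rbind(x, y))      # same result using base R's dist()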
To decide which category a new data point belongs to, we look at the classes of its k nearest neighbors in the training data. The mower dataset below serves as the running example; a classification sketch for a single new point follows the table.
library(knitr)
library(kableExtra)
# mower.df: the mower dataset with Income, LotSize and the Ownership class label
kab <- knitr::kable(mower.df, caption = "Mower dataset for KNN", booktabs = F, label = "dataset table")
kable_classic_2(kab, full_width = T)
| Income | LotSize | Ownership |
|---|---|---|
| 60.0 | 18.4 | Owner |
| 85.5 | 16.8 | Owner |
| 64.8 | 21.6 | Owner |
| 61.5 | 20.8 | Owner |
| 87.0 | 23.6 | Owner |
| 110.1 | 19.2 | Owner |
| 108.0 | 17.6 | Owner |
| 82.8 | 22.4 | Owner |
| 69.0 | 20.0 | Owner |
| 93.0 | 20.8 | Owner |
| 51.0 | 22.0 | Owner |
| 81.0 | 20.0 | Owner |
| 75.0 | 19.6 | Nonowner |
| 52.8 | 20.8 | Nonowner |
| 64.8 | 17.2 | Nonowner |
| 43.2 | 20.4 | Nonowner |
| 84.0 | 17.6 | Nonowner |
| 49.2 | 17.6 | Nonowner |
| 59.4 | 16.0 | Nonowner |
| 66.0 | 18.4 | Nonowner |
| 47.4 | 16.4 | Nonowner |
| 33.0 | 18.8 | Nonowner |
| 51.0 | 14.0 | Nonowner |
| 63.0 | 14.8 | Nonowner |
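As referenced above, the following is a minimal sketch of classifying a single new household with the class package's knn() function. The new point's Income and LotSize values are hypothetical, chosen only for illustration, and mower.df$Ownership is assumed to be a factor:

library(class)
# Hypothetical new household (illustrative values only)
new.point <- data.frame(Income = 60, LotSize = 20)
# knn() takes the training predictors, the new point, the class labels, and k
knn(train = mower.df[, c("Income", "LotSize")],
    test  = new.point,
    cl    = mower.df$Ownership,   # assumed to be a factor
    k     = 3)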
Choosing the value of k is very important for model accuracy.
For this, we can use the elbow method: search over a range of k values and check the model's accuracy at each candidate; a sketch of such a grid search follows the caret results below.

#### Using Caret
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
set.seed(7)
# 10-fold cross-validation, repeated 3 times
trainControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
metric <- "Accuracy"
# Train a KNN classifier, letting caret tune k via resampled accuracy
fit.knn <- train(Ownership ~ ., data = mower.df, method = "knn", metric = metric, trControl = trainControl)
# Best value of k selected during tuning
knn.k1 <- fit.knn$bestTune
print(fit.knn)
## k-Nearest Neighbors
##
## 24 samples
## 2 predictor
## 2 classes: 'Nonowner', 'Owner'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 22, 22, 21, 22, 21, 22, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.6805556 0.3600000
## 7 0.6416667 0.2733333
## 9 0.7277778 0.4366667
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
plot(fit.knn)
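By default caret tried only a few candidate values of k (5, 7 and 9 in the output above). Here is a sketch of searching a wider, explicit grid of k values, assuming the same mower.df, metric and trainControl objects:

grid <- expand.grid(k = seq(1, 15, by = 2))   # candidate k values: 1, 3, ..., 15
fit.knn.grid <- train(Ownership ~ ., data = mower.df, method = "knn",
                      metric = metric, trControl = trainControl,
                      tuneGrid = grid)
fit.knn.grid$bestTune   # k with the highest resampled accuracy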
# Predict class labels for the full dataset with the tuned model
prediction <- predict(fit.knn, newdata = mower.df, type = "raw")
result.df <- cbind(mower.df, prediction)
kab <- knitr::kable(result.df, caption = "KNN classifier result", booktabs = F, label = "Result table")
kable_classic_2(kab, full_width = T)
| Income | LotSize | Ownership | prediction |
|---|---|---|---|
| 60.0 | 18.4 | Owner | Nonowner |
| 85.5 | 16.8 | Owner | Owner |
| 64.8 | 21.6 | Owner | Nonowner |
| 61.5 | 20.8 | Owner | Nonowner |
| 87.0 | 23.6 | Owner | Owner |
| 110.1 | 19.2 | Owner | Owner |
| 108.0 | 17.6 | Owner | Owner |
| 82.8 | 22.4 | Owner | Owner |
| 69.0 | 20.0 | Owner | Nonowner |
| 93.0 | 20.8 | Owner | Owner |
| 51.0 | 22.0 | Owner | Nonowner |
| 81.0 | 20.0 | Owner | Owner |
| 75.0 | 19.6 | Nonowner | Owner |
| 52.8 | 20.8 | Nonowner | Nonowner |
| 64.8 | 17.2 | Nonowner | Nonowner |
| 43.2 | 20.4 | Nonowner | Nonowner |
| 84.0 | 17.6 | Nonowner | Owner |
| 49.2 | 17.6 | Nonowner | Nonowner |
| 59.4 | 16.0 | Nonowner | Nonowner |
| 66.0 | 18.4 | Nonowner | Nonowner |
| 47.4 | 16.4 | Nonowner | Nonowner |
| 33.0 | 18.8 | Nonowner | Nonowner |
| 51.0 | 14.0 | Nonowner | Nonowner |
| 63.0 | 14.8 | Nonowner | Nonowner |
# Compare predicted labels against the actual Ownership labels
cf <- confusionMatrix(prediction, mower.df$Ownership, positive = "Owner")
print(cf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Nonowner Owner
## Nonowner 10 5
## Owner 2 7
##
## Accuracy : 0.7083
## 95% CI : (0.4891, 0.8738)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 0.03196
##
## Kappa : 0.4167
##
## Mcnemar's Test P-Value : 0.44969
##
## Sensitivity : 0.5833
## Specificity : 0.8333
## Pos Pred Value : 0.7778
## Neg Pred Value : 0.6667
## Prevalence : 0.5000
## Detection Rate : 0.2917
## Detection Prevalence : 0.3750
## Balanced Accuracy : 0.7083
##
## 'Positive' Class : Owner
##
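The headline metrics can be recomputed by hand from the confusion matrix counts above, taking "Owner" as the positive class:

TP <- 7; TN <- 10; FP <- 2; FN <- 5   # counts read off the confusion matrix
(TP + TN) / (TP + TN + FP + FN)       # Accuracy    = 0.7083
TP / (TP + FN)                        # Sensitivity = 0.5833
TN / (TN + FP)                        # Specificity = 0.8333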