KNN (K-Nearest Neighbors) is a non-parametric supervised machine learning algorithm.
It can be used for both classification and regression, predicting a target variable from one or more independent variables. The algorithm stores all of the available data and classifies a new data point based on its similarity to the stored points, so new observations can be assigned to a suitable category as soon as they appear.
Many distance measures can be used to quantify similarity in KNN-based problems; the most common is the Euclidean distance between two points:
\[ D_{\text{Euclidean}}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \]
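As a quick illustration, the formula is a one-liner in R (euclid is an illustrative helper name for this example only, not used later):
euclid <- function(x, y) sqrt(sum((x - y)^2)) # Euclidean distance between two numeric vectors
euclid(c(5.1, 3.5, 1.4, 0.2), c(4.9, 3.0, 1.4, 0.2)) # e.g. between two iris-like rows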
To determine which category a new data point belongs to:
1. Select the value of K, the number of nearest neighbors to consider.
2. Calculate the Euclidean distance from the new data point to every point in the training data.
3. Take the K nearest neighbors according to those distances.
4. Among these K neighbors, count the number of data points in each category.
5. Assign the new data point to the category with the largest number of neighbors.
A minimal sketch of these steps follows.
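Here is a from-scratch sketch of those steps in R, assuming the training features are in a numeric matrix; knn_classify is illustrative only, and the rest of this post uses caret and class::knn() instead.
knn_classify <- function(train_x, train_y, query, k) {
  d <- sqrt(rowSums(sweep(train_x, 2, query)^2)) # step 2: distance to every training point
  nn <- order(d)[1:k]                            # step 3: indices of the K nearest neighbors
  votes <- table(train_y[nn])                    # step 4: count neighbors per category
  names(which.max(votes))                        # step 5: category with the most neighbors
}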
We will use the Iris dataset for this practice. The code below also loads the packages used throughout: caret for splitting, tuning, and evaluation, and class for the stand-alone knn() function.
data("iris")
dataset <- na.omit(iris)
# KNN with a Euclidean distance measure is sensitive to the magnitudes of the features, so all features should be scaled to weigh in equally.
dataset[,-5] <- scale(dataset[,-5])
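As a quick sanity check, each scaled feature should now have mean (approximately) 0 and standard deviation 1:
round(colMeans(dataset[,-5]), 10) # means ~0
apply(dataset[,-5], 2, sd)        # standard deviations of 1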
Split the dataset into training (70%) and test (30%) sets:
validationIndex <- createDataPartition(dataset$Species, p=0.70, list=FALSE)
train <- dataset[validationIndex,] # 70% of data to training
test <- dataset[-validationIndex,] # remaining 30% for test
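createDataPartition() samples within each class, so the split preserves the class proportions, which we can verify:
table(train$Species) # 35 of each class (70% of 50)
table(test$Species)  # 15 of each class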
Choosing the K value is very important for model accuracy. One approach is the elbow method: search over a range of K values and check the model's cross-validated accuracy at each one.
# Run algorithms using 10-fold cross validation
trainControl <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "Accuracy"
set.seed(7)
fit.knn <- train(Species~., data=train, method="knn",
metric=metric ,trControl=trainControl)
knn.k1 <- fit.knn$bestTune # keep this initial k for testing with the knn() function in the next section
print(fit.knn)
## k-Nearest Neighbors
##
## 105 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 93, 94, 94, 95, 94, 95, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.9518855 0.9272567
## 7 0.9576936 0.9360633
## 9 0.9607239 0.9405901
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
plot(fit.knn)
The plot shows the elbow at k = 9, with an accuracy of 96.07% on the training dataset.
Run the prediction on the test dataset and print the confusion matrix:
set.seed(7)
prediction <- predict(fit.knn, newdata = test)
cf <- confusionMatrix(prediction, test$Species)
print(cf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 15 0 0
## versicolor 0 15 1
## virginica 0 0 14
##
## Overall Statistics
##
## Accuracy : 0.9778
## 95% CI : (0.8823, 0.9994)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9667
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 1.0000 0.9333
## Specificity 1.0000 0.9667 1.0000
## Pos Pred Value 1.0000 0.9375 1.0000
## Neg Pred Value 1.0000 1.0000 0.9677
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3333 0.3111
## Detection Prevalence 0.3333 0.3556 0.3111
## Balanced Accuracy 1.0000 0.9833 0.9667
With the initial k = 9, the model correctly predicts the target variable for 97.78% of the test dataset.
Let’s try searching k from 1 to 20:
set.seed(7)
grid <- expand.grid(.k=seq(1,20,by=1))
fit.knn <- train(Species~., data=train, method="knn",
metric=metric, tuneGrid=grid, trControl=trainControl)
knn.k2 <- fit.knn$bestTune # keep this optimal k for testing with the stand-alone knn() function in the next section
print(fit.knn)
## k-Nearest Neighbors
##
## 105 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 93, 94, 94, 95, 94, 95, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 1 0.9515825 0.9270319
## 2 0.9481818 0.9216997
## 3 0.9509596 0.9257518
## 4 0.9610943 0.9412851
## 5 0.9518855 0.9272567
## 6 0.9626263 0.9436396
## 7 0.9576936 0.9360633
## 8 0.9543603 0.9309562
## 9 0.9607239 0.9405901
## 10 0.9641246 0.9457365
## 11 0.9711616 0.9563425
## 12 0.9543603 0.9309562
## 13 0.9610943 0.9411705
## 14 0.9570875 0.9351304
## 15 0.9580640 0.9366437
## 16 0.9544276 0.9308756
## 17 0.9547306 0.9315179
## 18 0.9361953 0.9032091
## 19 0.9270539 0.8898587
## 20 0.9240236 0.8851200
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 11.
plot(fit.knn)
We found the optimal k = 11: the number of closest instances to collect in order to make a prediction.
Use the fitted model to predict the class for our test set, and print the confusion matrix:
set.seed(7)
prediction <- predict(fit.knn, newdata = test)
cf <- confusionMatrix(prediction, test$Species)
print(cf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 15 0 0
## versicolor 0 14 1
## virginica 0 1 14
##
## Overall Statistics
##
## Accuracy : 0.9556
## 95% CI : (0.8485, 0.9946)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9333
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.9333 0.9333
## Specificity 1.0000 0.9667 0.9667
## Pos Pred Value 1.0000 0.9333 0.9333
## Neg Pred Value 1.0000 0.9667 0.9667
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3111 0.3111
## Detection Prevalence 0.3333 0.3333 0.3333
## Balanced Accuracy 1.0000 0.9500 0.9500
With k = 11, the accuracy of the model on the test set is 95.56%.
A common rule of thumb is to choose the initial value of k near the square root of the number of observations.
initial_k <- sqrt(NROW(dataset))
initial_k
## [1] 12.24745
OK, then we run knn() with k = 12 and k = 13 to check their accuracy:
# run knn() with the floor and ceiling of initial_k
knn.12 <- knn(train=train[,-5], test=test[,-5], cl=train$Species, k=floor(initial_k))
knn.13 <- knn(train=train[,-5], test=test[,-5], cl=train$Species, k=ceiling(initial_k))
# use confusion matrix to calculate accuracy
cf.12 <- confusionMatrix(test$Species,knn.12) # k = 12
cf.12
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 15 0 0
## versicolor 0 14 1
## virginica 0 1 14
##
## Overall Statistics
##
## Accuracy : 0.9556
## 95% CI : (0.8485, 0.9946)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9333
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.9333 0.9333
## Specificity 1.0000 0.9667 0.9667
## Pos Pred Value 1.0000 0.9333 0.9333
## Neg Pred Value 1.0000 0.9667 0.9667
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3111 0.3111
## Detection Prevalence 0.3333 0.3333 0.3333
## Balanced Accuracy 1.0000 0.9500 0.9500
cf.13 <- confusionMatrix(test$Species,knn.13) # k = 13
cf.13
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 15 0 0
## versicolor 0 14 1
## virginica 0 1 14
##
## Overall Statistics
##
## Accuracy : 0.9556
## 95% CI : (0.8485, 0.9946)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9333
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.9333 0.9333
## Specificity 1.0000 0.9667 0.9667
## Pos Pred Value 1.0000 0.9333 0.9333
## Neg Pred Value 1.0000 0.9667 0.9667
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3111 0.3111
## Detection Prevalence 0.3333 0.3333 0.3333
## Balanced Accuracy 1.0000 0.9500 0.9500
# find an optimal k by checking test accuracy for k = 1 to 15
k.optm <- numeric(15)
for (i in 1:15){
knn.mod <- knn(train=train[,-5], test=test[,-5], cl=train[,5], k=i)
k.optm[i] <- 100 * sum(test[,5] == knn.mod)/NROW(test[,5])
cat(i,'=',k.optm[i],'\n')
}
## 1 = 91.11111
## 2 = 91.11111
## 3 = 88.88889
## 4 = 93.33333
## 5 = 91.11111
## 6 = 97.77778
## 7 = 97.77778
## 8 = 97.77778
## 9 = 97.77778
## 10 = 95.55556
## 11 = 95.55556
## 12 = 93.33333
## 13 = 95.55556
## 14 = 95.55556
## 15 = 95.55556
# Accuracy plot
plot(k.optm, type="b", xlab = "k Value", ylab="Accuracy")
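The best k from this manual search can also be reported programmatically (which.max() returns the first k in case of ties):
best.k <- which.max(k.optm) # index equals k here, since we searched k = 1:15
cat("best k =", best.k, "accuracy =", k.optm[best.k], "%\n")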
Let’s run the stand-alone knn() with k = 9:
fit.knn.k1 <- knn(train=train[,-5], test=test[,-5], cl=train$Species, k=knn.k1)
cf <- confusionMatrix(test$Species,fit.knn.k1)
cf
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 15 0 0
## versicolor 0 15 0
## virginica 0 1 14
##
## Overall Statistics
##
## Accuracy : 0.9778
## 95% CI : (0.8823, 0.9994)
## No Information Rate : 0.3556
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9667
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.9375 1.0000
## Specificity 1.0000 1.0000 0.9677
## Pos Pred Value 1.0000 1.0000 0.9333
## Neg Pred Value 1.0000 0.9667 1.0000
## Prevalence 0.3333 0.3556 0.3111
## Detection Rate 0.3333 0.3333 0.3111
## Detection Prevalence 0.3333 0.3333 0.3333
## Balanced Accuracy 1.0000 0.9688 0.9839
With k = 9, knn() correctly predicts 97.78% of the test set.
And run knn() with k = 11:
fit.knn.k2 <- knn(train=train[,-5], test=test[,-5], cl=train$Species, k=knn.k2)
cf <- confusionMatrix(test$Species,fit.knn.k2)
cf
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 15 0 0
## versicolor 0 14 1
## virginica 0 1 14
##
## Overall Statistics
##
## Accuracy : 0.9556
## 95% CI : (0.8485, 0.9946)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9333
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.9333 0.9333
## Specificity 1.0000 0.9667 0.9667
## Pos Pred Value 1.0000 0.9333 0.9333
## Neg Pred Value 1.0000 0.9667 0.9667
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3111 0.3111
## Detection Prevalence 0.3333 0.3333 0.3333
## Balanced Accuracy 1.0000 0.9500 0.9500
With k = 11, knn() correctly predicts 95.56% of the test set.