Introduction

In this small project I use the KNN (k-nearest neighbors) method to identify iris species. The iris dataset is built into R and contains measurements of 4 attributes (in centimeters) for 50 flowers from each of 3 species (“setosa”, “versicolor” and “virginica”). The 150 rows are split into a training dataset and a testing dataset. The training dataset is used to train the model, while the testing dataset is used to check its accuracy. The goal of this project is to find out which k value (number of nearest neighbors) gives the best species identification.

Load Libraries

library(tidyverse)
library(ggplot2)
library(class)
library(caret)

Load data-set

data(iris)
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

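The variables have different ranges, and KNN relies on distances between points, so a variable with a larger range would dominate the distance. It is therefore better to normalize the data columns first.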

Normalize the data-set

# define a simple normalization function
normalize=function(x){
  return( (x-min(x))/(max(x)-min(x)) )
}

# Loop over the first 4 (numeric) columns
for (i in 1:4){
  iris[[i]]=normalize(iris[[i]])
}
summary(iris)
##   Sepal.Length     Sepal.Width      Petal.Length     Petal.Width     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.2222   1st Qu.:0.3333   1st Qu.:0.1017   1st Qu.:0.08333  
##  Median :0.4167   Median :0.4167   Median :0.5678   Median :0.50000  
##  Mean   :0.4287   Mean   :0.4406   Mean   :0.4675   Mean   :0.45806  
##  3rd Qu.:0.5833   3rd Qu.:0.5417   3rd Qu.:0.6949   3rd Qu.:0.70833  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

Now all four variables are on the same scale (0-1) for further work.
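
As a quick check of what min-max normalization does, here is a minimal sketch on a made-up vector (`toy` is a hypothetical name, not part of the iris data). The smallest value maps to 0, the largest to 1, and everything else lands proportionally in between.

```r
# Same min-max formula as above: (x - min) / (max - min)
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

toy <- c(2, 4, 6, 10)   # made-up values for illustration
normalize(toy)          # values: 0, 0.25, 0.5, 1
```

Note that a constant column (max equal to min) would divide by zero and return NaN. That does not happen with the four iris measurements.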

Split into train/test datasets

set.seed(111)
iris_train_no=sample(1:nrow(iris),nrow(iris)*0.7,replace=FALSE)
iris_train_data=iris[iris_train_no,]
iris_test_data=iris[-iris_train_no,]
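
The split is a simple random sample, so the three species are not guaranteed to appear in equal numbers in each part. A minimal sketch for checking the sizes and the class counts, using the same seed and the hypothetical names `train_idx`, `train` and `test`:

```r
data(iris)

# Same 70/30 random split as above (seed 111)
set.seed(111)
train_idx <- sample(1:nrow(iris), nrow(iris) * 0.7, replace = FALSE)
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

nrow(train)   # 105 rows for training
nrow(test)    # 45 rows for testing

# Species counts in each part; these depend on the random draw
table(train$Species)
table(test$Species)
```

If the counts turn out to be very uneven, a stratified split would be one alternative to consider.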

Make a KNN model

# Here I arbitrarily pick k = 3 nearest neighbors as a first test.
set.seed(111)
knn3=knn(iris_train_data[1:4],iris_test_data[1:4],cl=as.factor(iris_train_data$Species),k=3)

# Then make a confusion matrix
knn3_cm=confusionMatrix(knn3,iris_test_data$Species)
knn3_cm$overall[1]
##  Accuracy 
## 0.9333333

With the number of nearest neighbors set to k = 3, the model has an accuracy of 93.3% on the test data. What if we set k to other values (e.g. 1-50)? Will the result improve or change if we use a higher k value?
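
To make the role of k concrete, here is a minimal base R sketch of the idea behind KNN for a single query point: measure the Euclidean distance to every training point, keep the k closest, and take a majority vote. The helper `knn_one` and the toy data are hypothetical and only for illustration. This is not how `class::knn` is implemented, and unlike `knn` (which breaks ties at random) it simply takes the first class in case of a tie.

```r
# Hypothetical helper: classify one query point by majority vote
# among its k nearest training points (Euclidean distance)
knn_one <- function(train_x, train_y, query, k) {
  # Subtract the query from every training row, column by column
  diffs <- sweep(as.matrix(train_x), 2, query)
  dists <- sqrt(rowSums(diffs^2))
  # Indices of the k closest training points
  nearest <- order(dists)[seq_len(k)]
  # Count the class labels among them and return the most frequent one
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Made-up toy data: three points near (0, 0) and three near (5, 5)
train_x <- matrix(c(0, 0,
                    0, 1,
                    1, 0,
                    5, 5,
                    5, 6,
                    6, 5), ncol = 2, byrow = TRUE)
train_y <- factor(c("a", "a", "a", "b", "b", "b"))

knn_one(train_x, train_y, query = c(0.5, 0.5), k = 3)   # "a"
knn_one(train_x, train_y, query = c(5.5, 5.5), k = 3)   # "b"
```

With a larger k, points from farther away join the vote, which is why the choice of k can change the prediction.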

Make KNN model for k=1 to k=50

# Make a loop for k=1 to k=50, then add accuracy rate to a vector
knn_loop_accuracy=c()
for (i in 1:50){
  set.seed(111)
  knn_loop=knn(iris_train_data[1:4],iris_test_data[1:4],cl=as.factor(iris_train_data$Species),k=i)
  knn_loop_cm=confusionMatrix(knn_loop,iris_test_data$Species)
  knn_loop_accuracy[i]=knn_loop_cm$overall[1]
}

# Make a data frame and plot the results
knn_number_accuracy=data.frame(knn_number=c(1:50),accuracy=knn_loop_accuracy)
ggplot(knn_number_accuracy,aes(x=knn_number,y=accuracy))+
  geom_line(color="red")+
  geom_point(color="skyblue")+
  scale_y_continuous(limits = c(0.8,1))+
  scale_x_continuous(n.breaks=10)+
  labs(title="Accuracy Rate of using different K-values")+
  theme_classic()

When k is between 1 and 25, the accuracy rate stays above 90%. However, as k keeps increasing, the accuracy rate keeps dropping. This is because a larger neighborhood is more likely to include flowers from other species, which add noise to the majority vote.
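
The best k can also be read off the results directly instead of from the plot. The sketch below repeats the steps above in a self-contained form and assumes the `class` package is installed. Accuracy is computed as the share of correct predictions, which is the same quantity `confusionMatrix` reports as Accuracy. The names `train`, `test`, `accuracy` and `best_k` are hypothetical.

```r
library(class)   # provides knn()

data(iris)

# Min-max normalize the four numeric columns
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
iris[1:4] <- lapply(iris[1:4], normalize)

# Same 70/30 split as above
set.seed(111)
train_idx <- sample(1:nrow(iris), nrow(iris) * 0.7, replace = FALSE)
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Test accuracy for k = 1 to 50
accuracy <- sapply(1:50, function(k) {
  set.seed(111)   # knn() breaks voting ties at random
  pred <- knn(train[1:4], test[1:4], cl = train$Species, k = k)
  mean(pred == test$Species)
})

# All k values that reach the highest accuracy (there may be several)
best_k <- which(accuracy == max(accuracy))
best_k
max(accuracy)
```

Using `which(accuracy == max(accuracy))` rather than `which.max` keeps every tied k value visible, so the choice among them is explicit.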

Conclusion

Therefore, I select k=6, which has the highest accuracy rate of 95.6% on this test set. This should be the best choice for predicting iris species.