This is an introduction to the k-nearest neighbours (KNN) classifier.

We will use the iris dataset for this example.

We randomly pick 20 observations from the iris dataset. We will call this subset “known_data”, as its responses are known. It keeps three columns: Sepal.Length, Petal.Length and the class label, Species.

We will then select one observation from the original iris dataset (observation number 25) and call it “unknown_data”, treating its response as not known, and predict the class of the flower from its sepal and petal lengths.

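# known data points: 20 random rows, dropping columns 2 (Sepal.Width) and 4 (Petal.Width)
known_data <- iris[sample(nrow(iris), 20), -c(2, 4)]

# unknown data point: row 25, feature columns only
unknown_data <- iris[25, c("Sepal.Length", "Petal.Length")]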
head(known_data)
##    Sepal.Length Petal.Length    Species
## 62          5.9          4.2 versicolor
## 55          6.5          4.6 versicolor
## 8           5.0          1.5     setosa
## 23          4.6          1.0     setosa
## 78          6.7          5.0 versicolor
## 15          5.8          1.2     setosa
unknown_data
##    Sepal.Length Petal.Length
## 25          4.8          1.9
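
Because sample() draws a new subset on every call, the rows printed above will differ from run to run, and row 25 itself can land among the known points. A minimal sketch of one way to make the split repeatable, assuming an arbitrary seed of 42 and a hypothetical helper vector `candidate_rows` (neither appears in the original):

```r
# Fix the random number generator so the same 20 rows are drawn on each run.
# The seed value 42 is an arbitrary assumption, not taken from the original.
set.seed(42)

# Exclude row 25 so the "unknown" flower cannot also appear among the known points.
candidate_rows <- setdiff(seq_len(nrow(iris)), 25)

known_data <- iris[sample(candidate_rows, 20), -c(2, 4)]
unknown_data <- iris[25, c("Sepal.Length", "Petal.Length")]
```

Leaving the test observation out of the known set keeps the later check honest, since a point is always at distance zero from itself.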

Let's build a KNN classifier step by step.

Step 1: Calculate the Euclidean distance between each known data point and the unknown data point.

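\[\text{Euclidean distance} = \sqrt{(x_{1} - x_{2})^2 + (y_{1} - y_{2})^2}\]

Here \((x_{1}, y_{1})\) are the sepal and petal lengths of a known point and \((x_{2}, y_{2})\) are those of the unknown point.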
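
As a worked instance using values printed above: known point 62 has sepal length 5.9 and petal length 4.2, while the unknown point has 4.8 and 1.9, so the squared differences are added under the root:

\[\sqrt{(5.9 - 4.8)^2 + (4.2 - 1.9)^2} = \sqrt{1.21 + 5.29} = \sqrt{6.50} \approx 2.55\]

This agrees with the 2.549510 reported for that point in the distance table.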

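# function to calculate the Euclidean distance from each known point to the unknown point
# k:   data frame whose first two columns are the numeric features
# unk: one-row data frame with the same two feature columns
eculidean_dist <- function(k, unk) {

  distance <- rep(0, nrow(k))

  for (i in seq_len(nrow(k))) {
    # Index both columns by i so each known row uses its own petal length.
    # A bare k[, 2] would pass the whole column, and only the value computed
    # from the first row's petal length would be stored (with a warning).
    distance[i] <- sqrt((k[i, 1] - unk[, 1])^2 + (k[i, 2] - unk[, 2])^2)
  }

  return(distance)
}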
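
R arithmetic is vectorised, so the same distances can also be computed without an explicit loop. A minimal sketch, assuming the same layout as above (two numeric feature columns first); the name `euclidean_dist_vec` and the toy data are hypothetical and not part of the original:

```r
euclidean_dist_vec <- function(k, unk) {
  # k[[1]] and k[[2]] are whole columns; unk[[1]] and unk[[2]] are single values,
  # so each subtraction is applied to every known row at once.
  sqrt((k[[1]] - unk[[1]])^2 + (k[[2]] - unk[[2]])^2)
}

# Tiny check on made-up points: (3, 4) is 5 units from the origin.
toy_known <- data.frame(x = c(0, 3), y = c(0, 4))
toy_unknown <- data.frame(x = 0, y = 0)
euclidean_dist_vec(toy_known, toy_unknown)
## [1] 0 5
```

Operating on whole columns avoids per-row indexing altogether, which removes the risk of forgetting an index on one of the terms.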


# Euclidean distance
edist <- data.frame(dist = eculidean_dist(known_data, unknown_data), species = known_data[,3])

edist
##        dist    species
## 1  2.549510 versicolor
## 2  2.860070 versicolor
## 3  2.308679     setosa
## 4  2.308679     setosa
## 5  2.983287 versicolor
## 6  2.507987     setosa
## 7  2.801785  virginica
## 8  2.469818 versicolor
## 9  2.801785  virginica
## 10 3.324154  virginica
## 11 2.319483 versicolor
## 12 2.404163     setosa
## 13 2.469818     setosa
## 14 2.469818     setosa
## 15 2.435159 versicolor
## 16 2.319483     setosa
## 17 2.302173     setosa
## 18 3.623534  virginica
## 19 2.300000     setosa
## 20 2.594224 versicolor

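Step 2: Choose the number of neighbours, K. A common rule of thumb is the square root of the number of known observations (nrow, the number of rows in known_data), rounded down:

\[K = \lfloor \sqrt{\text{nrow}} \rfloor\]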

k <- floor(sqrt(nrow(known_data)))

print(k)
## [1] 4

Step 3: Rank the distances in ascending order

rank_edist <- edist[order(edist$dist),]

print(rank_edist)
##        dist    species
## 19 2.300000     setosa
## 17 2.302173     setosa
## 3  2.308679     setosa
## 4  2.308679     setosa
## 11 2.319483 versicolor
## 16 2.319483     setosa
## 12 2.404163     setosa
## 15 2.435159 versicolor
## 8  2.469818 versicolor
## 13 2.469818     setosa
## 14 2.469818     setosa
## 6  2.507987     setosa
## 1  2.549510 versicolor
## 20 2.594224 versicolor
## 7  2.801785  virginica
## 9  2.801785  virginica
## 2  2.860070 versicolor
## 5  2.983287 versicolor
## 10 3.324154  virginica
## 18 3.623534  virginica

The predicted class is “setosa”: the 4 nearest known points (the first K = 4 rows of the ranked table) are all setosa, so the majority vote is unanimous.
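
The vote among the nearest neighbours is read off the table by eye above. A minimal sketch of doing it in code, using the five closest rows of the ranked table as a small stand-in data frame; the names `ranked`, `votes` and `predicted` are assumptions for illustration:

```r
# The five closest rows of the ranked table printed above.
ranked <- data.frame(
  dist = c(2.300000, 2.302173, 2.308679, 2.308679, 2.319483),
  species = c("setosa", "setosa", "setosa", "setosa", "versicolor")
)
k <- 4

# Count the class labels among the k nearest neighbours.
votes <- table(head(ranked$species, k))

# The label with the most votes is the prediction.
predicted <- names(votes)[which.max(votes)]
predicted
## [1] "setosa"
```

With the objects built earlier, `head(rank_edist$species, k)` would take the place of the stand-in column. Note that `which.max()` returns the first maximum, so a tied vote goes to whichever tied label comes first in the table.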

Let's check whether this is true.

iris[25,]
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 25          4.8         3.4          1.9         0.2  setosa

Yes, it is true: observation 25 is recorded as setosa.
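
As a cross-check, the `knn()` function from the `class` package performs the distance ranking and vote in a single call. The sketch below assumes `class` is installed (it is one of the recommended packages bundled with standard R distributions) and uses an arbitrary seed that is not in the original; the names `idx`, `train`, `test`, `labels` and `pred` are illustrative:

```r
library(class)  # provides knn(); assumed to be installed

set.seed(42)  # arbitrary seed, an assumption for reproducibility
idx <- sample(nrow(iris), 20)

train <- iris[idx, c("Sepal.Length", "Petal.Length")]   # known features
test <- iris[25, c("Sepal.Length", "Petal.Length")]     # unknown point
labels <- iris$Species[idx]                             # known classes

# Same rule of thumb for k as in Step 2.
pred <- knn(train = train, test = test, cl = labels, k = floor(sqrt(nrow(train))))
print(pred)
```

`knn()` returns a factor with one predicted label per test row, and it breaks voting ties at random, so results for a tied vote can vary between runs.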