This is an introduction to the k-nearest neighbours (KNN) classifier.
We will use the Iris dataset for this example.
We randomly pick 20 observations from the iris data set and call this subset “known_data”, because the responses (species) are known. The attributes kept are Sepal.Length, Petal.Length and the class (Species).
We then select one observation from the original iris dataset (observation number 25) and call it “unknown_data”, treating its response as not known, and predict the class of the flower from its sepal and petal lengths.
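# known data points: 20 random rows, dropping Sepal.Width (column 2) and Petal.Width (column 4)
known_data <- iris[sample(nrow(iris), 20), -c(2, 4)]
# unknown data point: observation 25, without its species label
unknown_data <- iris[25, c("Sepal.Length", "Petal.Length")]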
head(known_data)
## Sepal.Length Petal.Length Species
## 62 5.9 4.2 versicolor
## 55 6.5 4.6 versicolor
## 8 5.0 1.5 setosa
## 23 4.6 1.0 setosa
## 78 6.7 5.0 versicolor
## 15 5.8 1.2 setosa
unknown_data
## Sepal.Length Petal.Length
## 25 4.8 1.9
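Because `sample()` draws different rows on every run, the listings in this post will not reproduce exactly, and the draw can happen to include observation 25 itself. A minimal sketch of a reproducible variant follows; the seed value and the names `pool`, `rows` and `known_repro` are illustrative assumptions, not part of the original code.

```r
# Assumption: the seed 42 is arbitrary; any fixed value makes the draw repeatable
set.seed(42)

# Every row index except 25, so the flower we want to predict cannot leak in
pool <- setdiff(seq_len(nrow(iris)), 25)

# Draw 20 distinct rows and keep Sepal.Length, Petal.Length and Species
rows <- sample(pool, 20)
known_repro <- iris[rows, -c(2, 4)]

head(known_repro)
```

The rows drawn this way will differ from the ones printed above, so the distances further down would change too.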
Let's build a KNN classifier step by step. First we need the Euclidean distance between two points \((x_{1}, y_{1})\) and \((x_{2}, y_{2})\):
\[\text{Euclidean distance} = \sqrt{(x_{1} - x_{2})^2 + (y_{1} - y_{2})^2}\]
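As a worked instance, take the first known observation listed above (row 62: Sepal.Length 5.9, Petal.Length 4.2) and the unknown flower (Sepal.Length 4.8, Petal.Length 1.9). The squared differences are added under the root:

\[d = \sqrt{(5.9 - 4.8)^2 + (4.2 - 1.9)^2} = \sqrt{1.21 + 5.29} = \sqrt{6.5} \approx 2.55\]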
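# function to calculate the Euclidean distance from each row of k to the single row unk
euclidean_dist <- function(k, unk) {
  distance <- rep(0, nrow(k))
  for (i in seq_len(nrow(k))) {
    # both columns must be indexed by i; with an unindexed k[,2], R warns and
    # reuses the first row's petal length, which is what the distances printed below reflect
    distance[i] <- sqrt((k[i, 1] - unk[, 1])^2 + (k[i, 2] - unk[, 2])^2)
  }
  return(distance)
}
# Euclidean distance from every known flower to the unknown one
edist <- data.frame(dist = euclidean_dist(known_data, unknown_data), species = known_data[, 3])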
edist
## dist species
## 1 2.549510 versicolor
## 2 2.860070 versicolor
## 3 2.308679 setosa
## 4 2.308679 setosa
## 5 2.983287 versicolor
## 6 2.507987 setosa
## 7 2.801785 virginica
## 8 2.469818 versicolor
## 9 2.801785 virginica
## 10 3.324154 virginica
## 11 2.319483 versicolor
## 12 2.404163 setosa
## 13 2.469818 setosa
## 14 2.469818 setosa
## 15 2.435159 versicolor
## 16 2.319483 setosa
## 17 2.302173 setosa
## 18 3.623534 virginica
## 19 2.300000 setosa
## 20 2.594224 versicolor
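The loop can also be written without an explicit index, since R subtracts a single value from a whole column in one step. The sketch below is one possible vectorised form; the name `euclid_vec` is hypothetical, and the example uses only the six rows shown by `head(known_data)` because the remaining sampled rows are not listed in this post.

```r
# Hypothetical vectorised helper (not the original function)
# k:   data frame whose first two columns are numeric
# unk: one-row data frame with the same two columns
euclid_vec <- function(k, unk) {
  sqrt((k[[1]] - unk[[1]])^2 + (k[[2]] - unk[[2]])^2)
}

# The six rows printed by head(known_data), taken straight from iris
six <- iris[c(62, 55, 8, 23, 78, 15), c("Sepal.Length", "Petal.Length", "Species")]
unk <- iris[25, c("Sepal.Length", "Petal.Length")]

d <- euclid_vec(six, unk)
data.frame(dist = round(d, 3), species = six$Species)
```

Each row is paired with its own petal length here, so the setosa rows come out much closer than in the listing above, which appears to have used the first row's petal length throughout.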
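Next we choose K, the number of neighbours to consult. A common rule of thumb is the square root of the number of known observations, rounded down:
\[K = \lfloor \sqrt{nrow} \rfloor\]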
k <- floor(sqrt(nrow(known_data)))
print(k)
## [1] 4
# rank the known observations from nearest to farthest
rank_edist <- edist[order(edist$dist), ]
print(rank_edist)
## dist species
## 19 2.300000 setosa
## 17 2.302173 setosa
## 3 2.308679 setosa
## 4 2.308679 setosa
## 11 2.319483 versicolor
## 16 2.319483 setosa
## 12 2.404163 setosa
## 15 2.435159 versicolor
## 8 2.469818 versicolor
## 13 2.469818 setosa
## 14 2.469818 setosa
## 6 2.507987 setosa
## 1 2.549510 versicolor
## 20 2.594224 versicolor
## 7 2.801785 virginica
## 9 2.801785 virginica
## 2 2.860070 versicolor
## 5 2.983287 versicolor
## 10 3.324154 virginica
## 18 3.623534 virginica
The predicted class is “setosa”: with K = 4, the four smallest distances in the ranked table all belong to setosa observations, so the vote is unanimous.
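The vote itself is read off the table by eye above. A minimal sketch of doing it in code follows; the helper name `knn_vote` is hypothetical, and the small input is just the first five rows of the ranked listing typed in by hand.

```r
# Hypothetical helper: majority vote among the k smallest distances
knn_vote <- function(dist, labels, k) {
  nearest <- order(dist)[seq_len(k)]            # positions of the k nearest rows
  votes <- table(as.character(labels[nearest])) # count labels among those rows
  names(votes)[which.max(votes)]                # on a tie, which.max keeps the first label
}

# First five rows of the ranked listing above, entered manually
d   <- c(2.300000, 2.302173, 2.308679, 2.308679, 2.319483)
lab <- c("setosa", "setosa", "setosa", "setosa", "versicolor")

knn_vote(d, lab, k = 4)
```

With the full table one would call it as `knn_vote(edist$dist, edist$species, k)`.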
Let's check whether the prediction is true.
iris[25,]
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 25 4.8 3.4 1.9 0.2 setosa
Yes, it is true: observation 25 is recorded as setosa.
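As an optional cross-check, the same question can be put to an existing implementation. The sketch below assumes the `class` package (one of R's recommended packages) is installed and uses its `knn()` function. Unlike the walkthrough, it trains on all 149 other flowers rather than on a random sample of 20, so it is a sanity check and not a reproduction of the steps above.

```r
library(class)  # assumption: the recommended 'class' package is available

# Training set: every flower except observation 25, same two attributes
train  <- iris[-25, c("Sepal.Length", "Petal.Length")]
labels <- iris$Species[-25]

# The flower to classify
test <- iris[25, c("Sepal.Length", "Petal.Length")]

# k = 4 mirrors the value used above
pred <- knn(train = train, test = test, cl = labels, k = 4)
as.character(pred)
```

If the hand-built classifier is working, this should agree with its prediction.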