To begin, we load the libraries we are going to use:
library(knitr)
library(class)
We will use an example dataset of 100 points to be classified with the k-nearest-neighbours (kNN) classification algorithm:
example <- read.csv('http://jakeporway.com/teaching/data/example_data.csv', h=T, as.is=T)
Now let's have a look at the dataset:
head(example, n=5)
## X Y Label
## 1 2.373546 5.398106 0
## 2 3.183643 4.387974 0
## 3 2.164371 5.341120 0
## 4 4.595281 3.870637 0
## 5 3.329508 6.433024 0
plot(example$X, example$Y, col=example$Label+1, pch=19, xlab="X", ylab="Y")
As we can see, there are two classes of points: black (label 0) and red (label 1). We want to use a kNN algorithm to learn how these points are classified, so that we can also classify a possible new point added to the dataset.
First, let's compute the Euclidean distances between all pairs of points. Note that we use only the X and Y coordinates: the Label column should not enter into the distance.
d <- dist(example[, 1:2], method="euclidean")
d <- as.matrix(d)
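As a quick sanity check (a sketch, not part of the original analysis), we can verify one entry of the matrix against the Euclidean distance formula sqrt((x1-x2)^2 + (y1-y2)^2) computed by hand:
p1 <- as.numeric(example[1, 1:2])  # coordinates of point 1
p2 <- as.numeric(example[2, 1:2])  # coordinates of point 2
sqrt(sum((p1 - p2)^2))             # should match d[1, 2]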
Now, we can find the k nearest neighbours of a chosen point:
k.nearest.neighbors <- function(i, distance.matrix, k = 5)
{
ordered.neighbors <- order(distance.matrix[i, ])
# This gives us the indices of the points closest to point i,
# sorted by increasing distance.
# The first entry is always point i itself (its distance is 0), so
# let's skip it and return entries 2:(k+1) instead of 1:k.
return(ordered.neighbors[2:(k + 1)])
}
Now we can see which points are the k nearest neighbours of a given point:
k.nearest.neighbors(25, d, k=7)
## [1] 8 47 22 30 48 15 34
These are the 7 nearest neighbours of point 25 in our dataset.
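The labels of these neighbours are exactly what a kNN classifier votes on: a point is assigned the majority label among its k nearest neighbours. As an illustrative sketch (not part of the pipeline below), we can classify point 25 by hand:
neighbor.labels <- example$Label[k.nearest.neighbors(25, d, k = 7)]  # labels of the 7 neighbours
table(neighbor.labels)                                               # count the votes per label
as.numeric(names(which.max(table(neighbor.labels))))                 # the majority label wins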
Now, let's build the prediction model. We can either write the function ourselves, or use the knn function that already exists in the 'class' package. Let's use the latter. First of all, we store the number of points in the dataset in a variable, which we'll call n. We also seed the random number generator for the next step (sampling a random set of rows for training and testing):
n <- nrow(example)
set.seed(1) # Seed the random number generator so the split is reproducible
Now we have to create training and test sets (which R's knn function requires). To do this, let's sample a random set of rows from our data for training, leaving the rest for testing:
indices <- sort(sample(1:n, n / 2))
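As a quick check (optional), with n = 100 the split gives us 50 training rows, and therefore 50 test rows:
length(indices)
## [1] 50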
Now we create two different data frames, one for training and one for testing:
training.set <- example[indices, 1:2]
test.set <- example[-indices, 1:2]
We also keep their original labels, to compare afterwards with our predictions:
training.original.labels <- example[indices, 3]
test.original.labels <- example[-indices, 3]
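Before predicting, it is worth checking (an optional step, not in the original walkthrough) that both classes are represented among the training labels:
table(training.original.labels)  # counts of labels 0 and 1 in the training set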
Finally, we predict the labels of the test points:
predicted.set <- knn(training.set, test.set, training.original.labels, k = 5)
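Note that knn returns the predicted labels as a factor; we can peek at the first few values:
head(predicted.set)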
Now let's see how well the algorithm predicted our points. Notice that knn uses training.set and training.original.labels to predict the labels of test.set. Since we already know the true labels of the test set, we can count how many entries of predicted.set disagree with test.original.labels:
sum(predicted.set != test.original.labels)
## [1] 7
So we have correctly classified 43 of the 50 test points, an error rate of 7/50 = 14%.
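Beyond the raw error count, a confusion matrix shows which class the mistakes fall into, and the same call lets us classify a brand-new point, as promised at the start. This is a sketch: the point (X = 3, Y = 5) is made up purely for illustration, and the exact confusion counts depend on the random split.
table(predicted = predicted.set, actual = test.original.labels)  # confusion matrix

new.point <- data.frame(X = 3, Y = 5)  # a hypothetical new observation
knn(training.set, new.point, training.original.labels, k = 5)    # its predicted label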