The k-nearest neighbors algorithm, also known as KNN, is a non-parametric, supervised learning classifier, which uses proximity to make classifications or predictions about the grouping of an individual data point. While it can be used for either regression or classification problems, it is typically used as a classification algorithm, working off the assumption that similar points can be found near one another.
For classification problems, a class label is assigned on the basis of a majority vote—i.e. the label that is most frequently represented around a given data point is used.
In a nutshell:
- K-nearest neighbors is one of the simplest machine learning algorithms based on the supervised learning technique.
- The KNN algorithm assumes similarity between the new case and the available cases, and puts the new case into the category most similar to the available categories.
- The KNN algorithm can be used for regression as well as for classification, but it is mostly used for classification problems.
- KNN is a non-parametric technique, which means it makes no assumptions about the underlying data (distributions, equations, etc.).
- At the training phase, the KNN algorithm just stores the dataset; when it gets new data, it classifies it into the category most similar to the stored examples.
Example: We have some data with a target variable that takes values \(\{0,1\}\) and two continuous predictors, \(x_1,x_2\).
This raises the following question:
- When a new observation arrives, how do I predict its class?
Suppose there are two categories, Class 0 and Class 1, and a new data point A arrives: which of these categories does it belong to? To solve this type of problem, we use the KNN algorithm, which makes it easy to identify the class of a new observation. Consider the diagram below:
The KNN working can be explained on the basis of the below algorithm:
- Step-1: Select the number K of neighbors.
- Step-2: Calculate the Euclidean distance from the new point to every point in the training data.
- Step-3: Take the K nearest neighbors according to the calculated distances.
- Step-4: Among these K neighbors, count the number of data points in each category.
- Step-5: Assign the new data point to the category with the maximum number of neighbors.
- Step-6: Our model is ready (as you can see in the following plot).
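To make the steps concrete, here is a minimal from-scratch sketch in R; the toy data frame toy, the query point new_pt, and the choice K = 3 are invented for illustration:

# Toy training set: two continuous predictors and a 0/1 label (made up)
toy <- data.frame(x1 = c(1, 2, 3, 6, 7, 8),
                  x2 = c(2, 1, 3, 7, 8, 6),
                  y  = c(0, 0, 0, 1, 1, 1))
new_pt <- c(4, 4)                 # the new observation to classify
K <- 3                            # Step 1: choose K
d <- sqrt((toy$x1 - new_pt[1])^2 + (toy$x2 - new_pt[2])^2)  # Step 2: distances
nn <- order(d)[1:K]               # Step 3: the K nearest neighbors
votes <- table(toy$y[nn])         # Step 4: count labels among them
names(votes)[which.max(votes)]    # Step 5: majority vote, here "0"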
Exercise: How do we compute the distance between two points? Hint: one common way is the Euclidean distance.
For instance, if we have two points \(A(5,3)\) and \(B(1,4)\), the distance between both points (Euclidean Distance) can be computed as:
\(d(A,B)=\sqrt{(5-1)^2+(3-4)^2}=\sqrt{17}\approx 4.12\)
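The same arithmetic can be checked in R:

A <- c(5, 3)
B <- c(1, 4)
sqrt(sum((A - B)^2))   # 4.123106, i.e. sqrt(17)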
Exercise: How do we choose \(K\)?
Let’s see a “manual” example.
Example using Smarket data
Our target is to predict the direction of the stock market (Up, Down) based on past movements.
We will now perform KNN using the knn() function, which is part of the class library. This function works rather differently from the other model-fitting functions we have encountered so far. Rather than a two-step approach in which we first fit the model and then use it to make predictions, knn() forms predictions with a single command. The function requires four inputs.
- A matrix containing the predictors associated with the training data, labeled train.X below.
- A matrix containing the predictors associated with the data for which we wish to make predictions, labeled test.X below.
- A vector containing the class labels for the training observations, labeled train.Direction below.
- A value for K, the number of nearest neighbors to be used by the classifier.
Example with [R]
install.packages("ISLR")
library(ISLR)
attach(Smarket)
install.packages("class")
library(class)
train<-(Year <2005)
train.X<-cbind(Lag1,Lag2)[train ,]
test.X<-cbind(Lag1,Lag2)[!train ,]
train.Direction<-Direction[train]
Direction.2005<-Direction[!train]
knn.pred<-knn(train.X,test.X,train.Direction ,k=5)
table(knn.pred ,Direction.2005)
The accuracy in this case is
\(\frac{40+82}{40+59+82+71}=0.48\)
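Rather than reading the confusion matrix by hand, the same accuracy can be computed directly in R:

mean(knn.pred == Direction.2005)   # proportion of test days classified correctly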
You can try with different values for \(K\). What happens if \(K=10\)?
The objects created above are still in the workspace, so only the call to knn() changes:

knn.pred <- knn(train.X, test.X, train.Direction, k = 10)
table(knn.pred, Direction.2005)
The accuracy is now \(\frac{42+77}{42+77+64+69}=0.47\)
In this case, the accuracy remains almost the same. It would be good, however, to choose a criterion (accuracy, false positives, false negatives, etc.) and to find the best \(K\) value by cross-validation.
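As a sketch of that idea: the class library also provides knn.cv(), which performs leave-one-out cross-validation on the training set. Using accuracy as the criterion and an illustrative grid of candidates \(1,\dots,20\) (both choices are assumptions made here, not part of the analysis above):

ks <- 1:20                 # candidate values of K (illustrative grid)
set.seed(1)                # knn.cv() breaks distance ties at random
cv.acc <- sapply(ks, function(k)
  mean(knn.cv(train.X, train.Direction, k = k) == train.Direction))
ks[which.max(cv.acc)]      # the K with the best leave-one-out accuracy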