This post provides an overview of the kNN (k-nearest neighbors) algorithm.
This post captures the gist of the introduction to the kNN algorithm covered in the following sources; for more information on kNN, please refer to them:
The training phase of the kNN algorithm consists only of storing the data.
Most of the computation is done during the prediction phase, which is unusual among machine learning algorithms.
During the prediction phase, the kNN algorithm calculates the distance (nearness) between each new, unlabeled case and all of the labeled cases. Because nearly all of its computation is done during the prediction phase, kNN is said to be a lazy learner.
The distance metric most often used is Euclidean distance, which in two or even three dimensions is easy to visualize as the straight-line distance between two points on a plot. It is calculated in as many dimensions as are present in the data.
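As a concrete illustration, here is a minimal sketch (the array names and helper function are my own, not from the sources) of how "training" can amount to nothing more than storing the labeled data, with the Euclidean distance computed only at prediction time:

```python
import numpy as np

# "Training" a kNN model: simply keep the labeled cases in memory.
train_X = np.array([[2.0, 5.0], [3.0, 7.0], [8.0, 1.0]])  # e.g. body length, aggression
train_y = np.array(["adder", "adder", "grass snake"])

def euclidean_distance(a, b):
    """Straight-line distance between two points, in any number of dimensions."""
    return np.sqrt(np.sum((a - b) ** 2))

# Distance from one new, unlabeled case to every stored, labeled case.
new_case = np.array([4.0, 6.0])
distances = np.array([euclidean_distance(new_case, x) for x in train_X])
print(distances)
```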
Next, for each unlabeled case, the algorithm ranks the neighbors from the nearest (most similar) to the furthest (the least similar).
The kNN algorithm can be used for both classification and regression problems. The only difference is that for regression, instead of taking the majority class vote, the algorithm finds the mean (or median) of the nearest neighbors' values.
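Continuing the sketch above, ranking the neighbors is just a sort on the distances, and a regression prediction replaces the vote with a mean of the neighbors' numeric values. The function and data below are my own illustration, not from the sources:

```python
import numpy as np

def knn_regress(train_X, train_y, new_case, k=3):
    """Predict a numeric value as the mean of the k nearest neighbors' values."""
    distances = np.sqrt(np.sum((train_X - new_case) ** 2, axis=1))
    nearest = np.argsort(distances)[:k]      # indices ranked nearest -> furthest
    return train_y[nearest].mean()           # the median would be another common choice

# Toy example: predict aggression from body length.
train_X = np.array([[1.0], [2.0], [3.0], [8.0]])
train_y = np.array([0.2, 0.4, 0.5, 3.0])
print(knn_regress(train_X, train_y, np.array([2.5]), k=3))
```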
Figure 1
Body length and aggression of reptiles. Labeled cases for adders, grass snakes, and slow worms are indicated by their shape. New, unlabeled data are shown by black crosses. The first step of the kNN algorithm: calculating distance. The lines represent the distance between one of the unlabeled cases (the cross) and each of the labeled cases.
Figure 2
The second step of the kNN algorithm: ranking the neighbors. The lines represent the distance between one of the unlabeled cases (the cross) and each of the labeled cases. The numbers represent the ranked distance between the unlabeled case (the cross) and each labeled case (1 = closest).
The algorithm then identifies the k labeled cases (neighbors) nearest to each unlabeled case, where k is an integer specified by us. In other words, it finds the k labeled cases that are most similar, in terms of their variables, to the unlabeled case. Finally, each of the k nearest neighbors "votes" on which class the unlabeled case belongs to, based on its own class. In other words, whatever class most of the k nearest neighbors belong to is the class assigned to the unlabeled case.
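Putting the steps together, a bare-bones classifier might look like the following sketch (function and variable names are illustrative, not from the sources):

```python
import numpy as np
from collections import Counter

def knn_classify(train_X, train_y, new_case, k=3):
    """Classify one unlabeled case by majority vote among its k nearest neighbors."""
    distances = np.sqrt(np.sum((train_X - new_case) ** 2, axis=1))  # step 1: distances
    nearest = np.argsort(distances)[:k]                             # step 2: rank, keep k
    votes = Counter(train_y[nearest])                               # step 3: majority vote
    return votes.most_common(1)[0][0]

# Labeled reptiles: [body length, aggression] -> class
train_X = np.array([[2, 8], [3, 7], [6, 2], [7, 3], [9, 1]])
train_y = np.array(["adder", "adder", "grass snake", "grass snake", "slow worm"])
print(knn_classify(train_X, train_y, np.array([5, 4]), k=3))
```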
Figure 3
The final step of the kNN algorithm: identifying the k-nearest neighbors and taking the majority vote. Lines connect the unlabeled data with their one, three, and five nearest neighbors. The majority vote in each scenario is indicated by the shape drawn under each cross.
Let’s work through figure 3 and see this in practice. When we set k to 1, the algorithm finds the single labeled case that is most similar to each of the unlabeled data items. Each of the unlabeled reptiles is closest to a member of the grass snake class, so they are all assigned to this class.
When we set k to 3, the algorithm finds the three labeled cases that are most similar to each of the unlabeled data items. As you can see in the figure, two of the unlabeled cases have nearest neighbors belonging to more than one class. In this situation, each nearest neighbor “votes” for its own class, and the majority vote wins. This is very intuitive because if a single unusually aggressive grass snake happens to be the nearest neighbor to an as-yet-unlabeled adder, it will be outvoted by the neighboring adders in the data.
Hopefully now you can see how this extends to other values of k. When we set k to 5, for example, the algorithm simply finds the five nearest cases to the unlabeled data and takes the majority vote as the class of the unlabeled case. Notice that in all three scenarios, the value of k directly impacts how each unlabeled case is classified.
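Using the knn_classify sketch from above, you could see the effect of k directly by predicting the same unlabeled case with k = 1, 3, and 5 (again, illustrative code rather than anything from the sources):

```python
# Assumes knn_classify, train_X, and train_y from the earlier sketch.
import numpy as np

new_case = np.array([5, 4])
for k in (1, 3, 5):
    print(k, knn_classify(train_X, train_y, new_case, k=k))
```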
It may happen that all of the k-nearest neighbors belong to different classes and that the vote results in a tie. What happens in this situation? Well, one way we can avoid this in a two-class classification problem (when the data can only belong to one of two, mutually exclusive groups) is to ensure that we pick an odd number for k. This way, there will always be a deciding vote. But what about in situations like our reptile classification problem, where we have more than two groups?
One way of dealing with this situation is to decrease k until a majority vote can be won. But this doesn’t help if an unlabeled case is equidistant between its two nearest neighbors.
Instead, a more common (and pragmatic) approach is to randomly assign cases with no majority vote to one of the tied classes. In practice, the proportion of cases that have ties among their nearest neighbors is very small, so this has a limited impact on the classification accuracy of the model. However, if you have many ties in your data, you may need to revisit your choice of k or consider a different algorithm altogether.
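One way to express this pragmatic tie-breaking rule in code, as a sketch under the assumption that ties are simply broken at random (the helper below is my own, not from the sources), is the following:

```python
import random
from collections import Counter

def majority_vote(neighbor_classes):
    """Return the majority class; break ties among the top classes at random."""
    votes = Counter(neighbor_classes)
    top_count = max(votes.values())
    winners = [cls for cls, count in votes.items() if count == top_count]
    return winners[0] if len(winners) == 1 else random.choice(winners)

# Three nearest neighbors, all from different classes: a three-way tie.
print(majority_vote(["adder", "grass snake", "slow worm"]))
```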
While it often isn’t easy to tell which algorithms will perform well for a given task, here are some strengths and weaknesses that will help you decide whether kNN is a good fit for yours.
The strengths of the kNN algorithm are as follows:
It is simple to understand and to implement.
There is no computational cost during training; the data are simply stored, and all the work happens at prediction time.
It makes no assumptions about how the data are distributed.
It can be used for both classification and regression problems.
The weaknesses of the kNN algorithm are these:
It cannot natively handle categorical variables (they must be recoded first, or a different distance metric must be used).
When the training set is large, it can be computationally expensive to compute the distance between new data and all the cases in the training set.
The model can’t be interpreted in terms of real-world relationships in the data.
Prediction accuracy can be strongly impacted by noisy data and outliers.
In high-dimensional datasets, kNN tends to perform poorly. This is due to a phenomenon called the curse of dimensionality. In brief, in high dimensions the distances between the cases start to look the same, so finding the nearest neighbors becomes difficult.
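The “distances start to look the same” effect can be checked with a small experiment (my own illustration, using random data rather than anything from the sources): as the number of dimensions grows, the nearest and furthest neighbors of a point end up at almost the same distance.

```python
import numpy as np

rng = np.random.default_rng(0)

for dims in (2, 10, 100, 1000):
    points = rng.random((1000, dims))   # 1,000 random labeled cases
    query = rng.random(dims)            # one new, unlabeled case
    distances = np.sqrt(np.sum((points - query) ** 2, axis=1))
    # A ratio close to 1 means the nearest and furthest neighbors are nearly equidistant.
    print(dims, distances.min() / distances.max())
```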