k-Nearest Neighbors (kNN)

Sameer Mathur

Machine Learning in Marketing using kNN

---

kNN

Classification?

Let's assume we have several groups of labeled samples. The items present in the groups are homogeneous in nature.

Now, suppose we have an unlabeled example which needs to be classified into one of the several labeled groups.

kNN Algorithm

k nearest neighbors is a simple algorithm that stores all available / known cases (training data) and classifies new cases by a majority vote of its k neighbors.

Selecting the value of k

A large k value reduces the variance due to the noisy data;

But it can create a bias due to which the learner tends to ignore the smaller patterns which may have useful insights.

Example of kNN Algorithm

Let's consider 10 'drinking items' which are rated on two parameters on a scale of 1 to 10.

The two parameters are “sweetness” and “fizziness”.

This is a perception based rating and so may vary between individuals.

Sample Ratings:

Group Chart

We have bucketed the 10 items into 4 groups namely, 'COLD DRINKS', 'ENERGY DRINKS', 'HEALTH DRINKS' and 'HARD DRINKS'.

How should Mazaa be classified?

Euclidean Distance

Calculate the distance between 'Maaza' and its nearest neighbors ('ACTIV', 'Vodka', 'Pepsi' and 'Monster') using the popular metric: Euclidean distance.

Calculating Distance

Using the co-ordinates of Maaza (8,2) and Vodka (2,1), the distance between 'Maaza' and 'Vodka' can be calculated as: dist(Maaza,Vodka) = 6.08

Classification using kNN

Using Euclidean distance, we can calculate the distance of Maaza from each of its nearest neighbors. Distance between Maaza and ACTIV being the least, it may be inferred that Maaza is same as ACTIV in nature which in turn belongs to the group of drinks (Health Drinks).

If k=1, the algorithm considers the nearest neighbor to Maaza i.e, ACTIV;

if k=3, the algorithm considers '3' nearest neighbors to Maaza by comparing the Euclidean distances (ACTIV, Real, Monster)

For getting the predicted class, iterate from 1 to total number of training data points.

1- Calculate the distance between test data and each row of training data. Here we will use Euclidean distance as our distance metric since it's the most popular method. The other metrics that can be used are Chebyshev, cosine, etc.

2- Sort the calculated distances in ascending order based on distance values

3- Get top k rows from the sorted array.

4- Get the most frequent class of these rows.

5- Return the predicted class.