K-Nearest Neighbors (K-NN)

Sameer Mathur

kNN Algorithm using R

---

kNN Algorithm

Let's assume we have several groups of labeled samples. The items present in the groups are homogeneous in nature. Now, suppose we have an unlabeled example which needs to be classified into one of the several labeled groups. How do you do that? Unhesitatingly, using kNN Algorithm.

k nearest neighbors is a simple algorithm that stores all available cases and classifies new cases by a majority vote of its k neighbors. This algorithms segregates unlabeled data points into well defined groups.

How does the KNN algorithm work

How to select appropriate k value?

Choosing the number of nearest neighbors i.e. determining the value of k plays a significant role in determining the efficacy of the model. Thus, selection of k will determine how well the data can be utilized to generalize the results of the kNN algorithm. A large k value has benefits which include reducing the variance due to the noisy data; the side effect being developing a bias due to which the learner tends to ignore the smaller patterns which may have useful insights.

Example of kNN Algorithm

Let's consider 10 'drinking items' which are rated on two parameters on a scale of 1 to 10. The two parameters are “sweetness” and “fizziness”. This is more of a perception based rating and so may vary between individuals. I would be considering my ratings (which might differ) to take this illustration ahead. The ratings of few items look somewhat as:

Group Chart

From the previous figure, it is clear we have bucketed the 10 items into 4 groups namely, 'COLD DRINKS', 'ENERGY DRINKS', 'HEALTH DRINKS' and 'HARD DRINKS'.The question here is, to which group would 'Maaza' fall into?This will be determined by calculating distance.

Calculating Distance

Now, calculating distance between 'Maaza' and its nearest neighbors ('ACTIV', 'Vodka', 'Pepsi' and 'Monster') requires the usage of a distance formula, the most popular being Euclidean distance formula i.e. the shortest distance between the 2 points which may be obtained using a ruler.

Calculating Distance

Using the co-ordinates of Maaza (8,2) and Vodka (2,1), the distance between 'Maaza' and 'Vodka' can be calculated as: dist(Maaza,Vodka) = 6.08

Conclusion

Using Euclidean distance, we can calculate the distance of Maaza from each of its nearest neighbors. Distance between Maaza and ACTIV being the least, it may be inferred that Maaza is same as ACTIV in nature which in turn belongs to the group of drinks (Health Drinks).

If k=1, the algorithm considers the nearest neighbor to Maaza i.e, ACTIV; if k=3, the algorithm considers '3' nearest neighbors to Maaza to compare the distances (ACTIV, Vodka, Monster) - ACTIV stands the nearest to Maaza.

Case Study

SocialNetworkAds

Importing the dataset

dataset <- read.csv('Social_Network_Ads.csv')
dataset <- dataset[3:5]
head(dataset)

  Age EstimatedSalary Purchased
1  19           19000         0
2  35           20000         0
3  26           43000         0
4  27           57000         0
5  19           76000         0
6  27           58000         0

# Encoding the target feature as factor
dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))
str(dataset$Purchased)

 Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...

Splitting the dataset into the Training set and Test set

library(caTools)
set.seed(123)
split <- sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set <- subset(dataset, split == TRUE)
test_set <- subset(dataset, split == FALSE)

# some rows of training set
head(training_set)

   Age EstimatedSalary Purchased
1   19           19000         0
3   26           43000         0
6   27           58000         0
7   27           84000         0
8   32          150000         1
10  35           65000         0

# some rows of test set
head(test_set)

   Age EstimatedSalary Purchased
2   35           20000         0
4   27           57000         0
5   19           76000         0
9   25           33000         0
12  26           52000         0
18  45           26000         1

Feature Scaling

# feature scaling
training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])
# varifying feature scaling in training set
head(training_set)

          Age EstimatedSalary Purchased
1  -1.7655475      -1.4733414         0
3  -1.0962966      -0.7883761         0
6  -1.0006894      -0.3602727         0
7  -1.0006894       0.3817730         0
8  -0.5226531       2.2654277         1
10 -0.2358313      -0.1604912         0

# varifying feature scaling in test set
head(test_set)

          Age EstimatedSalary Purchased
2  -0.3041906      -1.5135434         0
4  -1.0599437      -0.3245603         0
5  -1.8156969       0.2859986         0
9  -1.2488820      -1.0957926         0
12 -1.1544129      -0.4852337         0
18  0.6405008      -1.3207353         1

Fitting K-NN to the Training set and Predicting the Test set results

library(class)
y_pred <- knn(train = training_set[, -3],
             test = test_set[, -3],
             cl = training_set[, 3],
             k = 5,
             prob = TRUE)

# creating new dataframe with predicted values
pred_DF <- data.frame(dataset$Age,dataset$EstimatedSalary,y_pred)

   Age EstimatedSalary y_pred
1   19           19000      0
2   35           20000      0
3   26           43000      0
4   27           57000      0
5   19           76000      0
6   27           58000      1
7   27           84000      1
8   32          150000      1
9   25           33000      0
10  35           65000      0
11  26           80000      1
12  26           52000      0

Making the Confusion Matrix

# Confusion Matrix
cm <- table(test_set[, 3], y_pred)
cm

Visualising the Training set results

library(ElemStatLearn)
set = training_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = knn(train = training_set[, -3], test = grid_set, cl = training_set[, 3], k = 5)
plot(set[, -3],
     main = 'K-NN (Training set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))

plot of chunk unnamed-chunk-10

Visualising the Test set results

library(ElemStatLearn)
set = test_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = knn(train = training_set[, -3], test = grid_set, cl = training_set[, 3], k = 5)
plot(set[, -3],
     main = 'K-NN (Test set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))

plot of chunk unnamed-chunk-12