Sameer Mathur
kNN Algorithm using R
---
Let's assume we have several groups of labeled samples. The items present in the groups are homogeneous in nature. Now, suppose we have an unlabeled example which needs to be classified into one of the several labeled groups. How do you do that? Unhesitatingly, using kNN Algorithm.
k nearest neighbors is a simple algorithm that stores all available cases and classifies new cases by a majority vote of its k neighbors. This algorithms segregates unlabeled data points into well defined groups.
Choosing the number of nearest neighbors i.e. determining the value of k plays a significant role in determining the efficacy of the model. Thus, selection of k will determine how well the data can be utilized to generalize the results of the kNN algorithm. A large k value has benefits which include reducing the variance due to the noisy data; the side effect being developing a bias due to which the learner tends to ignore the smaller patterns which may have useful insights.
Let's consider 10 'drinking items' which are rated on two parameters on a scale of 1 to 10. The two parameters are “sweetness” and “fizziness”. This is more of a perception based rating and so may vary between individuals. I would be considering my ratings (which might differ) to take this illustration ahead. The ratings of few items look somewhat as:
From the previous figure, it is clear we have bucketed the 10 items into 4 groups namely, 'COLD DRINKS', 'ENERGY DRINKS', 'HEALTH DRINKS' and 'HARD DRINKS'.The question here is, to which group would 'Maaza' fall into?This will be determined by calculating distance.
Now, calculating distance between 'Maaza' and its nearest neighbors ('ACTIV', 'Vodka', 'Pepsi' and 'Monster') requires the usage of a distance formula, the most popular being Euclidean distance formula i.e. the shortest distance between the 2 points which may be obtained using a ruler.
Using the co-ordinates of Maaza (8,2) and Vodka (2,1), the distance between 'Maaza' and 'Vodka' can be calculated as: dist(Maaza,Vodka) = 6.08
Using Euclidean distance, we can calculate the distance of Maaza from each of its nearest neighbors. Distance between Maaza and ACTIV being the least, it may be inferred that Maaza is same as ACTIV in nature which in turn belongs to the group of drinks (Health Drinks).
If k=1, the algorithm considers the nearest neighbor to Maaza i.e, ACTIV; if k=3, the algorithm considers '3' nearest neighbors to Maaza to compare the distances (ACTIV, Vodka, Monster) - ACTIV stands the nearest to Maaza.
SocialNetworkAds
dataset <- read.csv('Social_Network_Ads.csv')
dataset <- dataset[3:5]
head(dataset)
Age EstimatedSalary Purchased
1 19 19000 0
2 35 20000 0
3 26 43000 0
4 27 57000 0
5 19 76000 0
6 27 58000 0
# Encoding the target feature as factor
dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))
str(dataset$Purchased)
Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
library(caTools)
set.seed(123)
split <- sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set <- subset(dataset, split == TRUE)
test_set <- subset(dataset, split == FALSE)
# some rows of training set
head(training_set)
Age EstimatedSalary Purchased
1 19 19000 0
3 26 43000 0
6 27 58000 0
7 27 84000 0
8 32 150000 1
10 35 65000 0
# some rows of test set
head(test_set)
Age EstimatedSalary Purchased
2 35 20000 0
4 27 57000 0
5 19 76000 0
9 25 33000 0
12 26 52000 0
18 45 26000 1
# feature scaling
training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])
# varifying feature scaling in training set
head(training_set)
Age EstimatedSalary Purchased
1 -1.7655475 -1.4733414 0
3 -1.0962966 -0.7883761 0
6 -1.0006894 -0.3602727 0
7 -1.0006894 0.3817730 0
8 -0.5226531 2.2654277 1
10 -0.2358313 -0.1604912 0
# varifying feature scaling in test set
head(test_set)
Age EstimatedSalary Purchased
2 -0.3041906 -1.5135434 0
4 -1.0599437 -0.3245603 0
5 -1.8156969 0.2859986 0
9 -1.2488820 -1.0957926 0
12 -1.1544129 -0.4852337 0
18 0.6405008 -1.3207353 1
library(class)
y_pred <- knn(train = training_set[, -3],
test = test_set[, -3],
cl = training_set[, 3],
k = 5,
prob = TRUE)
# creating new dataframe with predicted values
pred_DF <- data.frame(dataset$Age,dataset$EstimatedSalary,y_pred)
Age EstimatedSalary y_pred
1 19 19000 0
2 35 20000 0
3 26 43000 0
4 27 57000 0
5 19 76000 0
6 27 58000 1
7 27 84000 1
8 32 150000 1
9 25 33000 0
10 35 65000 0
11 26 80000 1
12 26 52000 0
# Confusion Matrix
cm <- table(test_set[, 3], y_pred)
cm
y_pred
0 1
0 59 5
1 6 30
library(ElemStatLearn)
set = training_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = knn(train = training_set[, -3], test = grid_set, cl = training_set[, 3], k = 5)
plot(set[, -3],
main = 'K-NN (Training set)',
xlab = 'Age', ylab = 'Estimated Salary',
xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
Visualising the Test set results
library(ElemStatLearn)
set = test_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = knn(train = training_set[, -3], test = grid_set, cl = training_set[, 3], k = 5)
plot(set[, -3],
main = 'K-NN (Test set)',
xlab = 'Age', ylab = 'Estimated Salary',
xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))