KNN (k-nearest neighbors) is a supervised learning algorithm. It is most often used for classification problems, although it can also be applied to regression.
Example of the KNN algorithm: suppose we have 4 cities in a two-dimensional space: A(2,3), B(3,4), C(1,3) and D(4,6). These are our training data. Let the test city be T(2,2). A and B are in SF, and C and D are in San Jose.
Distance of A from T=sqrt((2-2)^2 + (3-2)^2)=1
Distance of B from T=sqrt((3-2)^2 + (4-2)^2)=2.2360
Distance of C from T=sqrt((1-2)^2 + (3-2)^2)=1.414
Distance of D from T=sqrt((4-2)^2 + (6-2)^2)=4.4721
If we take k=1, A is the nearest to T, so T is assigned to SF. If we take k=2, the nearest cities are A (SF) and C (San Jose), which gives a tie; this is why an even k is a poor choice for a two-class problem. If we take k=3, the nearest cities are A, C and B; the majority is in SF, so T is assigned to SF.
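The worked example above can be reproduced in a few lines of base R. This is a minimal sketch: the object names (cities, test_x, test_y, knn_vote) are illustrative and not part of any package.

```r
# Training cities and their labels from the example above
cities = data.frame(
  name  = c("A", "B", "C", "D"),
  x     = c(2, 3, 1, 4),
  y     = c(3, 4, 3, 6),
  label = c("SF", "SF", "San Jose", "San Jose"),
  stringsAsFactors = FALSE
)

# Test city T(2,2)
test_x = 2
test_y = 2

# Euclidean distance from every training city to T
cities$dist = sqrt((cities$x - test_x)^2 + (cities$y - test_y)^2)

# Hypothetical helper: count the labels of the k nearest cities
knn_vote = function(k) {
  nearest = cities[order(cities$dist), ][seq_len(k), ]
  table(nearest$label)
}

round(cities$dist, 4)   # 1.0000 2.2361 1.4142 4.4721
knn_vote(1)             # one vote for SF
knn_vote(2)             # tie: one vote each for SF and San Jose
knn_vote(3)             # two votes for SF, one for San Jose
```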
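A very small k is a bad option for noisy data, because a single mislabeled neighbor can decide the class; a very large k smooths out the noise but can blur the boundaries between classes. A common rule of thumb is k = sqrt(n), where n is the number of observations in the training dataset.
Because we are dealing with distances, all variables should be on the same scale; otherwise variables with larger ranges will dominate the distance and mislead the algorithm.
KNN typically uses Euclidean distance for the distance calculation (this is what knn() in the class package uses).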
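As a rough illustration of the sqrt(n) rule of thumb, the sketch below computes a starting value of k for a training set of n observations and moves it to the next odd number to reduce the chance of tied votes. The helper name suggest_k is hypothetical, and the result is only a starting point; in practice k is usually tuned on held-out data.

```r
# Hypothetical helper: starting k from the sqrt(n) rule of thumb
suggest_k = function(n) {
  k = round(sqrt(n))
  if (k %% 2 == 0) k = k + 1   # prefer an odd k to reduce tied votes
  k
}

suggest_k(110)   # sqrt(110) is about 10.49, so this returns 11
suggest_k(150)   # sqrt(150) is about 12.25, so this returns 13
```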
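To see why scale matters, consider a small made-up example (the numbers and the names train, query, euclid and rescale are illustrative, not from any dataset or package): one feature is measured in tens of thousands and the other lies between 0 and 1. Before rescaling, the large-valued feature decides the nearest neighbor almost by itself.

```r
# Made-up points: income (large scale) and a ratio (small scale)
train = data.frame(income = c(50000, 50100), ratio = c(0.1, 0.9))
query = data.frame(income = 50060, ratio = 0.1)

# Euclidean distance from the query to each training row
euclid = function(tr, q) sqrt(rowSums(sweep(as.matrix(tr), 2, unlist(q))^2))

d_raw = euclid(train, query)
which.min(d_raw)      # 2: income alone decides the nearest row

# Min-max rescale every column using the training minimum and range
mins = sapply(train, min)
maxs = sapply(train, max)
rescale = function(d) as.data.frame(scale(d, center = mins, scale = maxs - mins))

d_scaled = euclid(rescale(train), rescale(query))
which.min(d_scaled)   # 1: both features now contribute
```

Note that the test point is rescaled with the minimum and range of the training data, so both sets end up in the same units.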
#Distance calculation
sqrt((4-2)^2+(6-2)^2)
[1] 4.472136
#Reading data
data1=iris
str(data1)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
The iris data has 4 numeric variables and 1 categorical variable (Species).
#Number of species
table(data1$Species)
setosa versicolor virginica
50 50 50
#Summary of numerical variables
summary(data1)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
As we can see, the scale is not the same for all four numeric variables, so our first task is to put them on the same scale.
#Lets normalize the data
n=function(y) {return((y-min(y))/(max(y)-min(y))) }
data1_n=as.data.frame(lapply(data1[-5],n))
#Summary of normalized data
summary(data1_n)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
1st Qu.:0.2222 1st Qu.:0.3333 1st Qu.:0.1017 1st Qu.:0.08333
Median :0.4167 Median :0.4167 Median :0.5678 Median :0.50000
Mean :0.4287 Mean :0.4406 Mean :0.4675 Mean :0.45806
3rd Qu.:0.5833 3rd Qu.:0.5417 3rd Qu.:0.6949 3rd Qu.:0.70833
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
#Lets try Z score
data1_z=as.data.frame(scale(data1[-5]))
After either transformation, the values are on a comparable scale.
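A quick check of the Z-score transformation, as a small sketch using the built-in iris data: every column should end up with mean 0 and standard deviation 1. Unlike min-max normalization, Z-scores are not confined to the range 0 to 1.

```r
# Z-score the four numeric iris columns, as in the text
data1_z = as.data.frame(scale(iris[-5]))

round(colMeans(data1_z), 10)   # all 0, up to floating-point error
sapply(data1_z, sd)            # all 1
range(data1_z$Sepal.Width)     # values fall outside [0, 1]
```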
#Lets divide data into training and test
set.seed(12)
sam=sample(150,110,replace = FALSE)
sam
[1] 11 122 140 40 25 5 26 92 4 2 55 114 52 53 37 60 62 72 88 15 29 102
[23] 13 91 28 34 63 24 54 82 130 107 105 96 74 109 80 117 44 112 65 58 106 22
[45] 90 10 147 145 16 79 104 36 42 32 149 93 137 113 135 75 68 39 87 49 144 14
[67] 12 124 30 134 86 57 48 121 123 133 59 138 84 115 38 61 9 17 78 18 142 56
[89] 76 7 41 118 116 21 33 126 98 20 67 110 146 95 119 143 97 83 111 108 69 101
train1=data1_n[sam,]
test1=data1_n[-sam,]
train2=data1_z[sam,]
test2=data1_z[-sam,]
#Creating labels
train_label=data1[sam,5]
test_label=data1[-sam,5]
#KNN model with k=1
library(class)
pred1=knn(train = train1,test=test1, cl=train_label , k=1)
#Model performance
library(gmodels)
CrossTable(x=test_label,y=pred1,prop.chisq = FALSE)
Cell Contents
|-------------------------|
| N |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 40
| pred1
test_label | setosa | versicolor | virginica | Row Total |
-------------|------------|------------|------------|------------|
setosa | 14 | 0 | 0 | 14 |
| 1.000 | 0.000 | 0.000 | 0.350 |
| 1.000 | 0.000 | 0.000 | |
| 0.350 | 0.000 | 0.000 | |
-------------|------------|------------|------------|------------|
versicolor | 0 | 12 | 1 | 13 |
| 0.000 | 0.923 | 0.077 | 0.325 |
| 0.000 | 0.857 | 0.083 | |
| 0.000 | 0.300 | 0.025 | |
-------------|------------|------------|------------|------------|
virginica | 0 | 2 | 11 | 13 |
| 0.000 | 0.154 | 0.846 | 0.325 |
| 0.000 | 0.143 | 0.917 | |
| 0.000 | 0.050 | 0.275 | |
-------------|------------|------------|------------|------------|
Column Total | 14 | 14 | 12 | 40 |
| 0.350 | 0.350 | 0.300 | |
-------------|------------|------------|------------|------------|
#Function for accuracy
accuracy=function(actual,prediction){mean(actual==prediction)}
a1=accuracy(test_label,pred1)
a1
[1] 0.925
#KNN model with z score transformation
pred2=knn(train = train2,test=test2 , cl=train_label ,k=1)
CrossTable(x=pred2,y=test_label,prop.chisq = FALSE)
Cell Contents
|-------------------------|
| N |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 40
| test_label
pred2 | setosa | versicolor | virginica | Row Total |
-------------|------------|------------|------------|------------|
setosa | 14 | 0 | 0 | 14 |
| 1.000 | 0.000 | 0.000 | 0.350 |
| 1.000 | 0.000 | 0.000 | |
| 0.350 | 0.000 | 0.000 | |
-------------|------------|------------|------------|------------|
versicolor | 0 | 12 | 3 | 15 |
| 0.000 | 0.800 | 0.200 | 0.375 |
| 0.000 | 0.923 | 0.231 | |
| 0.000 | 0.300 | 0.075 | |
-------------|------------|------------|------------|------------|
virginica | 0 | 1 | 10 | 11 |
| 0.000 | 0.091 | 0.909 | 0.275 |
| 0.000 | 0.077 | 0.769 | |
| 0.000 | 0.025 | 0.250 | |
-------------|------------|------------|------------|------------|
Column Total | 14 | 13 | 13 | 40 |
| 0.350 | 0.325 | 0.325 | |
-------------|------------|------------|------------|------------|
#Checking for accuracy
a2=accuracy(test_label,pred2)
a2
[1] 0.9
#KNN model with K=3
pred3=knn(train = train1,test=test1, cl=train_label , k=3)
CrossTable(x=test_label,y=pred3,prop.chisq = FALSE)
Cell Contents
|-------------------------|
| N |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 40
| pred3
test_label | setosa | versicolor | virginica | Row Total |
-------------|------------|------------|------------|------------|
setosa | 14 | 0 | 0 | 14 |
| 1.000 | 0.000 | 0.000 | 0.350 |
| 1.000 | 0.000 | 0.000 | |
| 0.350 | 0.000 | 0.000 | |
-------------|------------|------------|------------|------------|
versicolor | 0 | 12 | 1 | 13 |
| 0.000 | 0.923 | 0.077 | 0.325 |
| 0.000 | 0.923 | 0.077 | |
| 0.000 | 0.300 | 0.025 | |
-------------|------------|------------|------------|------------|
virginica | 0 | 1 | 12 | 13 |
| 0.000 | 0.077 | 0.923 | 0.325 |
| 0.000 | 0.077 | 0.923 | |
| 0.000 | 0.025 | 0.300 | |
-------------|------------|------------|------------|------------|
Column Total | 14 | 13 | 13 | 40 |
| 0.350 | 0.325 | 0.325 | |
-------------|------------|------------|------------|------------|
#Checking for accuracy
a3=accuracy(test_label,pred3)
a3
[1] 0.95
#KNN model with k=5
pred4=knn(train = train1,test=test1, cl=train_label , k=5)
CrossTable(x=test_label,y=pred4,prop.chisq = FALSE)
Cell Contents
|-------------------------|
| N |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 40
| pred4
test_label | setosa | versicolor | virginica | Row Total |
-------------|------------|------------|------------|------------|
setosa | 14 | 0 | 0 | 14 |
| 1.000 | 0.000 | 0.000 | 0.350 |
| 1.000 | 0.000 | 0.000 | |
| 0.350 | 0.000 | 0.000 | |
-------------|------------|------------|------------|------------|
versicolor | 0 | 13 | 0 | 13 |
| 0.000 | 1.000 | 0.000 | 0.325 |
| 0.000 | 0.929 | 0.000 | |
| 0.000 | 0.325 | 0.000 | |
-------------|------------|------------|------------|------------|
virginica | 0 | 1 | 12 | 13 |
| 0.000 | 0.077 | 0.923 | 0.325 |
| 0.000 | 0.071 | 1.000 | |
| 0.000 | 0.025 | 0.300 | |
-------------|------------|------------|------------|------------|
Column Total | 14 | 14 | 12 | 40 |
| 0.350 | 0.350 | 0.300 | |
-------------|------------|------------|------------|------------|
#Checking for accuracy
a4=accuracy(test_label,pred4)
a4
[1] 0.975
#KNN model with k=7
pred5=knn(train = train1,test=test1 , cl=train_label ,k=7)
CrossTable(x=test_label , y=pred5, prop.chisq = FALSE)
Cell Contents
|-------------------------|
| N |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 40
| pred5
test_label | setosa | versicolor | virginica | Row Total |
-------------|------------|------------|------------|------------|
setosa | 14 | 0 | 0 | 14 |
| 1.000 | 0.000 | 0.000 | 0.350 |
| 1.000 | 0.000 | 0.000 | |
| 0.350 | 0.000 | 0.000 | |
-------------|------------|------------|------------|------------|
versicolor | 0 | 13 | 0 | 13 |
| 0.000 | 1.000 | 0.000 | 0.325 |
| 0.000 | 0.929 | 0.000 | |
| 0.000 | 0.325 | 0.000 | |
-------------|------------|------------|------------|------------|
virginica | 0 | 1 | 12 | 13 |
| 0.000 | 0.077 | 0.923 | 0.325 |
| 0.000 | 0.071 | 1.000 | |
| 0.000 | 0.025 | 0.300 | |
-------------|------------|------------|------------|------------|
Column Total | 14 | 14 | 12 | 40 |
| 0.350 | 0.350 | 0.300 | |
-------------|------------|------------|------------|------------|
#Checking for accuracy
a5=accuracy(test_label,pred5)
a5
[1] 0.975
#KNN model with k=9
pred6=knn(train = train1,test=test1 , cl=train_label ,k=9)
CrossTable(x=test_label , y=pred6, prop.chisq = FALSE)
Cell Contents
|-------------------------|
| N |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 40
| pred6
test_label | setosa | versicolor | virginica | Row Total |
-------------|------------|------------|------------|------------|
setosa | 14 | 0 | 0 | 14 |
| 1.000 | 0.000 | 0.000 | 0.350 |
| 1.000 | 0.000 | 0.000 | |
| 0.350 | 0.000 | 0.000 | |
-------------|------------|------------|------------|------------|
versicolor | 0 | 13 | 0 | 13 |
| 0.000 | 1.000 | 0.000 | 0.325 |
| 0.000 | 1.000 | 0.000 | |
| 0.000 | 0.325 | 0.000 | |
-------------|------------|------------|------------|------------|
virginica | 0 | 0 | 13 | 13 |
| 0.000 | 0.000 | 1.000 | 0.325 |
| 0.000 | 0.000 | 1.000 | |
| 0.000 | 0.000 | 0.325 | |
-------------|------------|------------|------------|------------|
Column Total | 14 | 13 | 13 | 40 |
| 0.350 | 0.325 | 0.325 | |
-------------|------------|------------|------------|------------|
#Checking for accuracy
a6=accuracy(test_label,pred6)
a6
[1] 1
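#Accuracy against k (min-max normalized data)
acc=c(a1,a3,a4,a5,a6) #separate name so the accuracy() function is not overwritten
k_values=c(1,3,5,7,9)
library(ggplot2)
data2=data.frame(accuracy=acc,k_values=k_values)
ggplot(data2,aes(y=accuracy,x=k_values))+geom_smooth(col="red")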
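The repeated model fits above can be condensed into a loop. The following is a sketch under the same setup (min-max normalized iris, 110 training rows, seed 12); the names normalize, iris_n, idx and acc_by_k are illustrative. The exact accuracies depend on the random number generator of the R version in use, so no output is shown.

```r
library(class)

# Rebuild the normalized data and the train/test split used above
normalize = function(y) (y - min(y)) / (max(y) - min(y))
iris_n = as.data.frame(lapply(iris[-5], normalize))
set.seed(12)
idx = sample(150, 110, replace = FALSE)

k_values = c(1, 3, 5, 7, 9)

# Fit one model per k and record the test-set accuracy
acc_by_k = sapply(k_values, function(k) {
  pred = knn(train = iris_n[idx, ], test = iris_n[-idx, ],
             cl = iris[idx, 5], k = k)
  mean(pred == iris[-idx, 5])
})

data.frame(k = k_values, accuracy = acc_by_k)
```

With only 40 test rows, a single misclassification changes the accuracy by 0.025, so differences between neighboring values of k should be read with caution.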
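KNN is a non-parametric algorithm: it makes no assumption about the distribution of the data. With a reasonably large k it is also fairly robust to individual outliers, because a single unusual point is outvoted by its neighbors.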
library(alr4)
data3=alr4::UN11
str(data3)
'data.frame': 199 obs. of 6 variables:
$ region : Factor w/ 8 levels "Africa","Asia",..: 2 4 1 1 3 5 2 3 8 4 ...
$ group : Factor w/ 3 levels "oecd","other",..: 2 2 3 3 2 2 2 2 1 1 ...
$ fertility: num 5.97 1.52 2.14 5.13 2 ...
$ ppgdp : num 499 3677 4473 4322 13750 ...
$ lifeExpF : num 49.5 80.4 75 53.2 81.1 ...
$ pctUrban : num 23 53 67 59 100 93 64 47 89 68 ...
- attr(*, "na.action")=Class 'omit' Named int [1:34] 4 5 8 28 41 67 68 72 79 83 ...
.. ..- attr(*, "names")= chr [1:34] "Am Samoa" "Andorra" "Antigua and Barbuda" "Br Virigin Is" ...
We have 2 factor variables and 4 numeric variables. Our target (response) variable is female life expectancy (lifeExpF).
#Converting factors to numeric variables
data3$region=as.integer(as.factor(data3$region))
data3$group=as.integer(as.factor(data3$group))
str(data3)
'data.frame': 199 obs. of 6 variables:
$ region : int 2 4 1 1 3 5 2 3 8 4 ...
$ group : int 2 2 3 3 2 2 2 2 1 1 ...
$ fertility: num 5.97 1.52 2.14 5.13 2 ...
$ ppgdp : num 499 3677 4473 4322 13750 ...
$ lifeExpF : num 49.5 80.4 75 53.2 81.1 ...
$ pctUrban : num 23 53 67 59 100 93 64 47 89 68 ...
- attr(*, "na.action")=Class 'omit' Named int [1:34] 4 5 8 28 41 67 68 72 79 83 ...
.. ..- attr(*, "names")= chr [1:34] "Am Samoa" "Andorra" "Antigua and Barbuda" "Br Virigin Is" ...
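One caveat with the integer coding above: it puts the factor levels on a numeric line, so a distance-based method treats level 3 as farther from level 1 than level 2 is, even when the levels have no natural order. A possible alternative, sketched here on a made-up factor rather than on UN11, is one indicator column per level using model.matrix() from base R.

```r
# Made-up factor standing in for a nominal variable such as region
region = factor(c("Africa", "Asia", "Europe", "Asia"))

# Integer coding: Europe (3) looks twice as far from Africa (1) as Asia (2) does
as.integer(region)   # 1 2 3 2

# One indicator column per level (no intercept): all levels are equally far apart
dummies = model.matrix(~ region - 1)
dummies
```

Whether this helps depends on the data; it also adds one column per level, which increases the dimension of the distance calculation.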
dim(data3_test_labels)
[1] 79 1
#Creating Dataframe with test and prediction
data_pred=cbind.data.frame(pred_55,data3_test_labels)
library(dplyr)
#rename() expects new_name = old_name; assign the result to keep the new names
data_pred=rename(data_pred,predicted=t.pred_44.,actual=`data3_n[-sample1, 5]`)
#Actual and predicted
ggplot(data=data_pred,aes(x=predicted,y=actual))+geom_smooth(col="red")+xlab("predicted")+ylab("Actual")