kNN(k-Nearest Neighbour) Algorithm in R

References:Various study material available online.

What is kNN ?

Let’s assume we have several groups of labeled samples. The items present in the groups are homogeneous in nature. Now, suppose we have an unlabeled example which needs to be classified into one of the several labeled groups. How do you do that? Unhesitatingly, using kNN Algorithm.

k nearest neighbors is a simple algorithm that stores all available cases and classifies new cases by a majority vote of its k neighbors. This algorithms segregates unlabeled data points into well defined groups.

How does the KNN algorithm work?

Let’s take a simple case to understand this algorithm. Following is a spread of blue circles (BC) and red rectangle (RR) :

You intend to find out the class of the green star (GS) . GS can either be BC or RR and nothing else. The “K” is KNN algorithm is the nearest neighbors we wish to take vote from. Let’s say K = 4. Hence, we will now make a circle with GS as center just as big as to enclose only four datapoints on the plane. Refer to following diagram for more details:

The three closest points to GS is all BC. Hence, with good confidence level we can say that the GS should belong to the class BC. Here, the choice became very obvious as all four votes from the closest neighbor went to BC. The choice of the parameter K is very crucial in this algorithm. Next we will understand what are the factors to be considered to conclude the best K.

kNN Algorithm features:

A very simple classification and regression algorithm.
- In case of classification, new data points get classified in a particular class on the basis of voting from nearest neighbors.
- In case of regression, new data get labeled based on the averages of nearest value.
It is a lazy learner because it doesn’t learn much from the training data.
It is supervised learning algorithm.
Default method is Euclidean distance (shortest distance between 2 points, using formula = \(\sqrt((X_1-X_2)^2+(Y_1-Y_2)^2)\) ) used for continuous variables, whereas for discrete variables, such as for text classification the overlap metric(Hamming distance) would be employed.

Requirements for kNN

Generally k gets decided on the square root of number of data points.But a large k value has benefits which include reducing the variance due to the noisy data; the side effect being developing a bias due to which the learner tends to ignore the smaller patterns which may have useful insights
Data Normalization - It is to transform all the feature data in the same scale (for eg: 0 to 1) else it will give more weightage to the data which is higher in value irrespective of scale/unit.
Installation of “Class” library to implement in R.

kNN Algorithm - Pros and Cons

Pros: The algorithm is highly unbiased in nature and makes no prior assumption of the underlying data. Being simple and effective in nature, it is easy to implement and has gained good popularity.

Cons: Indeed it is simple but kNN algorithm has drawn a lot of flake for being extremely simple! If we take a deeper look, this doesn’t create a model since there’s no abstraction process involved. Yes, the training process is really fast as the data is stored verbatim (hence lazy learner) but the prediction time is pretty high with useful insights missing at times. Therefore, building this algorithm requires time to be invested in data preparation (especially treating the missing data and categorical features) to obtain a robust model. This model is sensitive to outliers too.

Steps Involved in performing kNN algorithm:

Data Collection.
Preparing and exploring the data.
- Understnding data structure.
- Feature selection (if required)
- Data normalization.
- Creating Training and Test data set.
Training a model on data.
Evaluate the model performance.
Improve the performance of model.

Case Study on kNN algorithm

Case Study -1

Objective of Analysis: Minimization of risk and maximization of profit on behalf of the bank.

Brief Description: To minimize loss from the bank’s perspective, the bank needs a decision rule regarding who to give approval of the loan and who not to. An applicant’s demographic and socio-economic profiles are considered by loan managers before a decision is taken regarding his/her loan application. The German Credit Data contains data on 20 variables and the classification whether an applicant is considered a Good or a Bad credit risk for 1000 loan applicants. Here is a link (https://onlinecourses.science.psu.edu/stat857/sites/onlinecourses.science.psu.edu.stat857/files/german_credit.csv) to the German Credit data. A predictive model developed on this data is expected to provide bank manager guidance for making a decision whether to approve a loan to a prospective applicant based on his/her profiles.

Methodology: kNN classification

Step-1 Data Collection.

gc <- read.csv("german_credit.csv") # reading csv data files from defined directory as file has already downloaded and stored in the directory

## Taking back-up of the input file, in case the original data is required later
gc.bkup <- gc
head (gc) # To check top 6 values of all the variables in data set.

##   Creditability Account.Balance Duration.of.Credit..month.
## 1             1               1                         18
## 2             1               1                          9
## 3             1               2                         12
## 4             1               1                         12
## 5             1               1                         12
## 6             1               1                         10
##   Payment.Status.of.Previous.Credit Purpose Credit.Amount
## 1                                 4       2          1049
## 2                                 4       0          2799
## 3                                 2       9           841
## 4                                 4       0          2122
## 5                                 4       0          2171
## 6                                 4       0          2241
##   Value.Savings.Stocks Length.of.current.employment Instalment.per.cent
## 1                    1                            2                   4
## 2                    1                            3                   2
## 3                    2                            4                   2
## 4                    1                            3                   3
## 5                    1                            3                   4
## 6                    1                            2                   1
##   Sex...Marital.Status Guarantors Duration.in.Current.address
## 1                    2          1                           4
## 2                    3          1                           2
## 3                    2          1                           4
## 4                    3          1                           2
## 5                    3          1                           4
## 6                    3          1                           3
##   Most.valuable.available.asset Age..years. Concurrent.Credits
## 1                             2          21                  3
## 2                             1          36                  3
## 3                             1          23                  3
## 4                             1          39                  3
## 5                             2          38                  1
## 6                             1          48                  3
##   Type.of.apartment No.of.Credits.at.this.Bank Occupation No.of.dependents
## 1                 1                          1          3                1
## 2                 1                          2          3                2
## 3                 1                          1          2                1
## 4                 1                          2          2                2
## 5                 2                          2          2                1
## 6                 1                          2          2                2
##   Telephone Foreign.Worker
## 1         1              1
## 2         1              1
## 3         1              1
## 4         1              2
## 5         1              2
## 6         1              2

Step-2 Preparing and exploring the data. Dataset variable details are available on this link (https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29 )

There are 20 attributes/features, so for the simplicity we will select relevant attributes which are as follows :

Age (numeric)
Sex (text: male, female)
Job (numeric: 0 - unskilled and non-resident, 1 - unskilled and resident, 2 - skilled, 3 - highly skilled)
Saving accounts (text - little, moderate, quite rich, rich)
Credit amount (numeric, in DM)
Duration (numeric, in month)
Purpose (text: car, furniture/equipment, radio/TV, domestic appliances, repairs, education, business, vacation/others)

Note: All the attributes’ value are already converted to numeric and same data is available on the link as mentioned above.

str(gc) # understanding data structure, can see all the varaibles are integers including 'Creditability' which is our response variable.

## 'data.frame':    1000 obs. of  21 variables:
##  $ Creditability                    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Account.Balance                  : int  1 1 2 1 1 1 1 1 4 2 ...
##  $ Duration.of.Credit..month.       : int  18 9 12 12 12 10 8 6 18 24 ...
##  $ Payment.Status.of.Previous.Credit: int  4 4 2 4 4 4 4 4 4 2 ...
##  $ Purpose                          : int  2 0 9 0 0 0 0 0 3 3 ...
##  $ Credit.Amount                    : int  1049 2799 841 2122 2171 2241 3398 1361 1098 3758 ...
##  $ Value.Savings.Stocks             : int  1 1 2 1 1 1 1 1 1 3 ...
##  $ Length.of.current.employment     : int  2 3 4 3 3 2 4 2 1 1 ...
##  $ Instalment.per.cent              : int  4 2 2 3 4 1 1 2 4 1 ...
##  $ Sex...Marital.Status             : int  2 3 2 3 3 3 3 3 2 2 ...
##  $ Guarantors                       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Duration.in.Current.address      : int  4 2 4 2 4 3 4 4 4 4 ...
##  $ Most.valuable.available.asset    : int  2 1 1 1 2 1 1 1 3 4 ...
##  $ Age..years.                      : int  21 36 23 39 38 48 39 40 65 23 ...
##  $ Concurrent.Credits               : int  3 3 3 3 1 3 3 3 3 3 ...
##  $ Type.of.apartment                : int  1 1 1 1 2 1 2 2 2 1 ...
##  $ No.of.Credits.at.this.Bank       : int  1 2 1 2 2 2 2 1 2 1 ...
##  $ Occupation                       : int  3 3 2 2 2 2 2 2 1 1 ...
##  $ No.of.dependents                 : int  1 2 1 2 1 2 1 2 1 1 ...
##  $ Telephone                        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Foreign.Worker                   : int  1 1 1 2 2 2 2 2 1 1 ...

#Feature/Attribute selection

#The variable 'Creditability' is our target variable i.e. this variable will determine whether bank manager will approve a loan based on the 7 Attributes.

gc.subset <- gc[c('Creditability','Age..years.','Sex...Marital.Status','Occupation','Account.Balance','Credit.Amount','Length.of.current.employment','Purpose')]
head(gc.subset)

##   Creditability Age..years. Sex...Marital.Status Occupation
## 1             1          21                    2          3
## 2             1          36                    3          3
## 3             1          23                    2          2
## 4             1          39                    3          2
## 5             1          38                    3          2
## 6             1          48                    3          2
##   Account.Balance Credit.Amount Length.of.current.employment Purpose
## 1               1          1049                            2       2
## 2               1          2799                            3       0
## 3               2           841                            4       9
## 4               1          2122                            3       0
## 5               1          2171                            3       0
## 6               1          2241                            2       0

#Data normalistion to avoid biasness as the value sclae of 'Credit.Amount'is in thousand whereas other attribute's value are in 2 digits or 1 digit.

normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x))) } # creating a normalize function for easy convertion.

gc.subset.n<- as.data.frame(lapply(gc.subset[,2:8], normalize)) # lapply creates list that is why it is converted to dataframe and it applies defined fundtion (which is 'normalize') to all the list values which is here column 2 to 8 as first column is target/response.
head(gc.subset.n)

##   Age..years. Sex...Marital.Status Occupation Account.Balance
## 1  0.03571429            0.3333333  0.6666667       0.0000000
## 2  0.30357143            0.6666667  0.6666667       0.0000000
## 3  0.07142857            0.3333333  0.3333333       0.3333333
## 4  0.35714286            0.6666667  0.3333333       0.0000000
## 5  0.33928571            0.6666667  0.3333333       0.0000000
## 6  0.51785714            0.6666667  0.3333333       0.0000000
##   Credit.Amount Length.of.current.employment Purpose
## 1    0.04396390                         0.25     0.2
## 2    0.14025531                         0.50     0.0
## 3    0.03251898                         0.75     0.9
## 4    0.10300429                         0.50     0.0
## 5    0.10570045                         0.50     0.0
## 6    0.10955211                         0.25     0.0

#Now all attributes having value in the range 0 to 1 which is normalised data and 'Creditability' column has been removed as sample value starts form column 2.

#Creating Training and Test data set. Training data will be used to build model whereas test data will be used for validation and optimisation of model by tuning k value.

set.seed(123)  # To get the same random sample
dat.d <- sample(1:nrow(gc.subset.n),size=nrow(gc.subset.n)*0.7,replace = FALSE) #random selection of 70% data.

train.gc <- gc.subset[dat.d,] # 70% training data
test.gc <- gc.subset[-dat.d,] # remaining 30% test data

#Now creating seperate dataframe for 'Creditability' feature which is our target.
train.gc_labels <- gc.subset[dat.d,1]
test.gc_labels  <- gc.subset[-dat.d,1]

Step-3 Training a model on data.

#install.packages(class) # to install class packages as it carries kNN function
library(class)          # to call class package

NROW(train.gc_labels)   # to find the number of observation

## [1] 700

#To identify optimum value of k, generally square root of total no of observations (700) which is 26.45 is taken, so will try with 26, 27 then will check for optimal value of k.

knn.26 <-  knn(train=train.gc, test=test.gc, cl=train.gc_labels, k=26)
knn.27 <-  knn(train=train.gc, test=test.gc, cl=train.gc_labels, k=27)

Step-4 Evaluate the model performance.

## Let's calculate the proportion of correct classification for k = 26, 27 

ACC.26 <- 100 * sum(test.gc_labels == knn.26)/NROW(test.gc_labels)  # For knn = 26
ACC.27 <- 100 * sum(test.gc_labels == knn.27)/NROW(test.gc_labels)  # For knn = 27
ACC.26    #Accuracy is 67.67%

## [1] 67.66667

ACC.27    #Accuracy is 67.33%, which has reduced compare to k=26

## [1] 67.33333

table(knn.26 ,test.gc_labels)  # to check prediction against actual value in tabular form

##       test.gc_labels
## knn.26   0   1
##      0  11   7
##      1  90 192

# 11 & 192 are the correct prediction against actual wheras 90 & 7 are wrong prediction against actual.
table(knn.27 ,test.gc_labels)  # to check prediction against actual value in tabular form

##       test.gc_labels
## knn.27   0   1
##      0  11   8
##      1  90 191

# 11 & 191 are the correct prediction against actual wheras 90 & 8 are wrong prediction against actual.

OR Accuracy can also be calculated using ‘caret’ package and ‘confusion matrix’ function.

Install.packages(caret), To install ‘caret’ packages as it carries ‘confusion matrix’ function which helps in the calculation of accuracy of model.

library(caret)

confusionMatrix(knn.26 ,test.gc_labels)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  11   7
##          1  90 192
##                                           
##                Accuracy : 0.6767          
##                  95% CI : (0.6205, 0.7293)
##     No Information Rate : 0.6633          
##     P-Value [Acc > NIR] : 0.3365          
##                                           
##                   Kappa : 0.0924          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.10891         
##             Specificity : 0.96482         
##          Pos Pred Value : 0.61111         
##          Neg Pred Value : 0.68085         
##              Prevalence : 0.33667         
##          Detection Rate : 0.03667         
##    Detection Prevalence : 0.06000         
##       Balanced Accuracy : 0.53687         
##                                           
##        'Positive' Class : 0               
##

confusionMatrix(knn.27 ,test.gc_labels)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0  11   8
##          1  90 191
##                                           
##                Accuracy : 0.6733          
##                  95% CI : (0.6171, 0.7261)
##     No Information Rate : 0.6633          
##     P-Value [Acc > NIR] : 0.3824          
##                                           
##                   Kappa : 0.0859          
##  Mcnemar's Test P-Value : 2.786e-16       
##                                           
##             Sensitivity : 0.10891         
##             Specificity : 0.95980         
##          Pos Pred Value : 0.57895         
##          Neg Pred Value : 0.67972         
##              Prevalence : 0.33667         
##          Detection Rate : 0.03667         
##    Detection Prevalence : 0.06333         
##       Balanced Accuracy : 0.53435         
##                                           
##        'Positive' Class : 0               
##

Step-5 Improve the performance of model.

For kNN algorithm, the tuning parameters are ‘k’ value and number of ’features/attributes selection.
Optimum ‘k’ value can be found using ‘elbow’ or ‘maximum % accuracy’ graph but ‘feature selection’ can be done only through understanding of features in kNN algorithm.

i=1                          # declaration to initiate for loop
k.optm=1                     # declaration to initiate for loop
for (i in 1:28){ 
    knn.mod <-  knn(train=train.gc, test=test.gc, cl=train.gc_labels, k=i)
    k.optm[i] <- 100 * sum(test.gc_labels == knn.mod)/NROW(test.gc_labels)
    k=i  
    cat(k,'=',k.optm[i],'\n')       # to print % accuracy 
}

## 1 = 60.33333 
## 2 = 59.66667 
## 3 = 60.33333 
## 4 = 64.33333 
## 5 = 62.33333 
## 6 = 64 
## 7 = 63.33333 
## 8 = 64.33333 
## 9 = 63.33333 
## 10 = 64.66667 
## 11 = 64.66667 
## 12 = 65 
## 13 = 66 
## 14 = 65.33333 
## 15 = 66.66667 
## 16 = 67 
## 17 = 67.66667 
## 18 = 67.33333 
## 19 = 67.66667 
## 20 = 67.33333 
## 21 = 66.33333 
## 22 = 66.66667 
## 23 = 67.66667 
## 24 = 67.33333 
## 25 = 68 
## 26 = 67.66667 
## 27 = 67.33333 
## 28 = 67.33333

# Maximum accuracy at k=25   

plot(k.optm, type="b", xlab="K- Value",ylab="Accuracy level")  # to plot % accuracy wrt to k-value

At k=25, maximum accuracy achieved which is 68%, after that, it seems increasing K increases the classification but reduces success rate. It is worse to class a customer as good when it is bad, than it is to class a customer as bad when it is good.