Let’s assume we have several groups of labeled samples. The items present in the groups are homogeneous in nature. Now, suppose we have an unlabeled example which needs to be classified into one of the several labeled groups. How do you do that? Unhesitatingly, using kNN Algorithm.
k nearest neighbors is a simple algorithm that stores all available cases and classifies new cases by a majority vote of its k neighbors. This algorithms segregates unlabeled data points into well defined groups.
Let’s take a simple case to understand this algorithm. Following is a spread of blue circles (BC) and red rectangle (RR) :
You intend to find out the class of the green star (GS) . GS can either be BC or RR and nothing else. The “K” is KNN algorithm is the nearest neighbors we wish to take vote from. Let’s say K = 4. Hence, we will now make a circle with GS as center just as big as to enclose only four datapoints on the plane. Refer to following diagram for more details:
The three closest points to GS is all BC. Hence, with good confidence level we can say that the GS should belong to the class BC. Here, the choice became very obvious as all four votes from the closest neighbor went to BC. The choice of the parameter K is very crucial in this algorithm. Next we will understand what are the factors to be considered to conclude the best K.
Pros: The algorithm is highly unbiased in nature and makes no prior assumption of the underlying data. Being simple and effective in nature, it is easy to implement and has gained good popularity.
Cons: Indeed it is simple but kNN algorithm has drawn a lot of flake for being extremely simple! If we take a deeper look, this doesn’t create a model since there’s no abstraction process involved. Yes, the training process is really fast as the data is stored verbatim (hence lazy learner) but the prediction time is pretty high with useful insights missing at times. Therefore, building this algorithm requires time to be invested in data preparation (especially treating the missing data and categorical features) to obtain a robust model. This model is sensitive to outliers too.
Brief Description: To minimize loss from the bank’s perspective, the bank needs a decision rule regarding who to give approval of the loan and who not to. An applicant’s demographic and socio-economic profiles are considered by loan managers before a decision is taken regarding his/her loan application. The German Credit Data contains data on 20 variables and the classification whether an applicant is considered a Good or a Bad credit risk for 1000 loan applicants. Here is a link (https://onlinecourses.science.psu.edu/stat857/sites/onlinecourses.science.psu.edu.stat857/files/german_credit.csv) to the German Credit data. A predictive model developed on this data is expected to provide bank manager guidance for making a decision whether to approve a loan to a prospective applicant based on his/her profiles.
Methodology: kNN classification
Step-1 Data Collection.
gc <- read.csv("german_credit.csv") # reading csv data files from defined directory as file has already downloaded and stored in the directory
## Taking back-up of the input file, in case the original data is required later
gc.bkup <- gc
head (gc) # To check top 6 values of all the variables in data set.
## Creditability Account.Balance Duration.of.Credit..month.
## 1 1 1 18
## 2 1 1 9
## 3 1 2 12
## 4 1 1 12
## 5 1 1 12
## 6 1 1 10
## Payment.Status.of.Previous.Credit Purpose Credit.Amount
## 1 4 2 1049
## 2 4 0 2799
## 3 2 9 841
## 4 4 0 2122
## 5 4 0 2171
## 6 4 0 2241
## Value.Savings.Stocks Length.of.current.employment Instalment.per.cent
## 1 1 2 4
## 2 1 3 2
## 3 2 4 2
## 4 1 3 3
## 5 1 3 4
## 6 1 2 1
## Sex...Marital.Status Guarantors Duration.in.Current.address
## 1 2 1 4
## 2 3 1 2
## 3 2 1 4
## 4 3 1 2
## 5 3 1 4
## 6 3 1 3
## Most.valuable.available.asset Age..years. Concurrent.Credits
## 1 2 21 3
## 2 1 36 3
## 3 1 23 3
## 4 1 39 3
## 5 2 38 1
## 6 1 48 3
## Type.of.apartment No.of.Credits.at.this.Bank Occupation No.of.dependents
## 1 1 1 3 1
## 2 1 2 3 2
## 3 1 1 2 1
## 4 1 2 2 2
## 5 2 2 2 1
## 6 1 2 2 2
## Telephone Foreign.Worker
## 1 1 1
## 2 1 1
## 3 1 1
## 4 1 2
## 5 1 2
## 6 1 2
Step-2 Preparing and exploring the data. Dataset variable details are available on this link (https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29 )
There are 20 attributes/features, so for the simplicity we will select relevant attributes which are as follows :
Note: All the attributes’ value are already converted to numeric and same data is available on the link as mentioned above.
str(gc) # understanding data structure, can see all the varaibles are integers including 'Creditability' which is our response variable.
## 'data.frame': 1000 obs. of 21 variables:
## $ Creditability : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Account.Balance : int 1 1 2 1 1 1 1 1 4 2 ...
## $ Duration.of.Credit..month. : int 18 9 12 12 12 10 8 6 18 24 ...
## $ Payment.Status.of.Previous.Credit: int 4 4 2 4 4 4 4 4 4 2 ...
## $ Purpose : int 2 0 9 0 0 0 0 0 3 3 ...
## $ Credit.Amount : int 1049 2799 841 2122 2171 2241 3398 1361 1098 3758 ...
## $ Value.Savings.Stocks : int 1 1 2 1 1 1 1 1 1 3 ...
## $ Length.of.current.employment : int 2 3 4 3 3 2 4 2 1 1 ...
## $ Instalment.per.cent : int 4 2 2 3 4 1 1 2 4 1 ...
## $ Sex...Marital.Status : int 2 3 2 3 3 3 3 3 2 2 ...
## $ Guarantors : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Duration.in.Current.address : int 4 2 4 2 4 3 4 4 4 4 ...
## $ Most.valuable.available.asset : int 2 1 1 1 2 1 1 1 3 4 ...
## $ Age..years. : int 21 36 23 39 38 48 39 40 65 23 ...
## $ Concurrent.Credits : int 3 3 3 3 1 3 3 3 3 3 ...
## $ Type.of.apartment : int 1 1 1 1 2 1 2 2 2 1 ...
## $ No.of.Credits.at.this.Bank : int 1 2 1 2 2 2 2 1 2 1 ...
## $ Occupation : int 3 3 2 2 2 2 2 2 1 1 ...
## $ No.of.dependents : int 1 2 1 2 1 2 1 2 1 1 ...
## $ Telephone : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Foreign.Worker : int 1 1 1 2 2 2 2 2 1 1 ...
#Feature/Attribute selection
#The variable 'Creditability' is our target variable i.e. this variable will determine whether bank manager will approve a loan based on the 7 Attributes.
gc.subset <- gc[c('Creditability','Age..years.','Sex...Marital.Status','Occupation','Account.Balance','Credit.Amount','Length.of.current.employment','Purpose')]
head(gc.subset)
## Creditability Age..years. Sex...Marital.Status Occupation
## 1 1 21 2 3
## 2 1 36 3 3
## 3 1 23 2 2
## 4 1 39 3 2
## 5 1 38 3 2
## 6 1 48 3 2
## Account.Balance Credit.Amount Length.of.current.employment Purpose
## 1 1 1049 2 2
## 2 1 2799 3 0
## 3 2 841 4 9
## 4 1 2122 3 0
## 5 1 2171 3 0
## 6 1 2241 2 0
#Data normalistion to avoid biasness as the value sclae of 'Credit.Amount'is in thousand whereas other attribute's value are in 2 digits or 1 digit.
normalize <- function(x) {
return ((x - min(x)) / (max(x) - min(x))) } # creating a normalize function for easy convertion.
gc.subset.n<- as.data.frame(lapply(gc.subset[,2:8], normalize)) # lapply creates list that is why it is converted to dataframe and it applies defined fundtion (which is 'normalize') to all the list values which is here column 2 to 8 as first column is target/response.
head(gc.subset.n)
## Age..years. Sex...Marital.Status Occupation Account.Balance
## 1 0.03571429 0.3333333 0.6666667 0.0000000
## 2 0.30357143 0.6666667 0.6666667 0.0000000
## 3 0.07142857 0.3333333 0.3333333 0.3333333
## 4 0.35714286 0.6666667 0.3333333 0.0000000
## 5 0.33928571 0.6666667 0.3333333 0.0000000
## 6 0.51785714 0.6666667 0.3333333 0.0000000
## Credit.Amount Length.of.current.employment Purpose
## 1 0.04396390 0.25 0.2
## 2 0.14025531 0.50 0.0
## 3 0.03251898 0.75 0.9
## 4 0.10300429 0.50 0.0
## 5 0.10570045 0.50 0.0
## 6 0.10955211 0.25 0.0
#Now all attributes having value in the range 0 to 1 which is normalised data and 'Creditability' column has been removed as sample value starts form column 2.
#Creating Training and Test data set. Training data will be used to build model whereas test data will be used for validation and optimisation of model by tuning k value.
set.seed(123) # To get the same random sample
dat.d <- sample(1:nrow(gc.subset.n),size=nrow(gc.subset.n)*0.7,replace = FALSE) #random selection of 70% data.
train.gc <- gc.subset[dat.d,] # 70% training data
test.gc <- gc.subset[-dat.d,] # remaining 30% test data
#Now creating seperate dataframe for 'Creditability' feature which is our target.
train.gc_labels <- gc.subset[dat.d,1]
test.gc_labels <- gc.subset[-dat.d,1]
Step-3 Training a model on data.
#install.packages(class) # to install class packages as it carries kNN function
library(class) # to call class package
NROW(train.gc_labels) # to find the number of observation
## [1] 700
#To identify optimum value of k, generally square root of total no of observations (700) which is 26.45 is taken, so will try with 26, 27 then will check for optimal value of k.
knn.26 <- knn(train=train.gc, test=test.gc, cl=train.gc_labels, k=26)
knn.27 <- knn(train=train.gc, test=test.gc, cl=train.gc_labels, k=27)
Step-4 Evaluate the model performance.
## Let's calculate the proportion of correct classification for k = 26, 27
ACC.26 <- 100 * sum(test.gc_labels == knn.26)/NROW(test.gc_labels) # For knn = 26
ACC.27 <- 100 * sum(test.gc_labels == knn.27)/NROW(test.gc_labels) # For knn = 27
ACC.26 #Accuracy is 67.67%
## [1] 67.66667
ACC.27 #Accuracy is 67.33%, which has reduced compare to k=26
## [1] 67.33333
table(knn.26 ,test.gc_labels) # to check prediction against actual value in tabular form
## test.gc_labels
## knn.26 0 1
## 0 11 7
## 1 90 192
# 11 & 192 are the correct prediction against actual wheras 90 & 7 are wrong prediction against actual.
table(knn.27 ,test.gc_labels) # to check prediction against actual value in tabular form
## test.gc_labels
## knn.27 0 1
## 0 11 8
## 1 90 191
# 11 & 191 are the correct prediction against actual wheras 90 & 8 are wrong prediction against actual.
OR Accuracy can also be calculated using ‘caret’ package and ‘confusion matrix’ function.
Install.packages(caret), To install ‘caret’ packages as it carries ‘confusion matrix’ function which helps in the calculation of accuracy of model.
library(caret)
confusionMatrix(knn.26 ,test.gc_labels)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 11 7
## 1 90 192
##
## Accuracy : 0.6767
## 95% CI : (0.6205, 0.7293)
## No Information Rate : 0.6633
## P-Value [Acc > NIR] : 0.3365
##
## Kappa : 0.0924
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.10891
## Specificity : 0.96482
## Pos Pred Value : 0.61111
## Neg Pred Value : 0.68085
## Prevalence : 0.33667
## Detection Rate : 0.03667
## Detection Prevalence : 0.06000
## Balanced Accuracy : 0.53687
##
## 'Positive' Class : 0
##
confusionMatrix(knn.27 ,test.gc_labels)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 11 8
## 1 90 191
##
## Accuracy : 0.6733
## 95% CI : (0.6171, 0.7261)
## No Information Rate : 0.6633
## P-Value [Acc > NIR] : 0.3824
##
## Kappa : 0.0859
## Mcnemar's Test P-Value : 2.786e-16
##
## Sensitivity : 0.10891
## Specificity : 0.95980
## Pos Pred Value : 0.57895
## Neg Pred Value : 0.67972
## Prevalence : 0.33667
## Detection Rate : 0.03667
## Detection Prevalence : 0.06333
## Balanced Accuracy : 0.53435
##
## 'Positive' Class : 0
##
Step-5 Improve the performance of model.
i=1 # declaration to initiate for loop
k.optm=1 # declaration to initiate for loop
for (i in 1:28){
knn.mod <- knn(train=train.gc, test=test.gc, cl=train.gc_labels, k=i)
k.optm[i] <- 100 * sum(test.gc_labels == knn.mod)/NROW(test.gc_labels)
k=i
cat(k,'=',k.optm[i],'\n') # to print % accuracy
}
## 1 = 60.33333
## 2 = 59.66667
## 3 = 60.33333
## 4 = 64.33333
## 5 = 62.33333
## 6 = 64
## 7 = 63.33333
## 8 = 64.33333
## 9 = 63.33333
## 10 = 64.66667
## 11 = 64.66667
## 12 = 65
## 13 = 66
## 14 = 65.33333
## 15 = 66.66667
## 16 = 67
## 17 = 67.66667
## 18 = 67.33333
## 19 = 67.66667
## 20 = 67.33333
## 21 = 66.33333
## 22 = 66.66667
## 23 = 67.66667
## 24 = 67.33333
## 25 = 68
## 26 = 67.66667
## 27 = 67.33333
## 28 = 67.33333
# Maximum accuracy at k=25
plot(k.optm, type="b", xlab="K- Value",ylab="Accuracy level") # to plot % accuracy wrt to k-value