k-NN Classification

Nearest neighbours approach to classification

We will use x to denote a feature (aka. predictor, attribute) and y to denote the target (aka. label, class) we are trying to predict.

k-NN falls in the supervised learning family of algorithms. Informally, this means that given a labelled dataset consisting of training observations \((x, y)\), we would like to capture the relationship between \(x\) and \(y\). More formally, our goal is to learn a function \(h : X \rightarrow Y\) so that, given an unseen observation \(x\), \(h(x)\) can confidently predict the corresponding output \(y\).

The k-NN classifier is also a non-parametric, instance-based learning algorithm.

Non-parametric means it makes no explicit assumptions about the functional form of \(h\), avoiding the dangers of mismodeling the underlying distribution of the data. For example, suppose our data is highly non-Gaussian but the learning model we choose assumes a Gaussian form. In that case, our algorithm would make extremely poor predictions.

Instance-based learning means that our algorithm doesn’t explicitly learn a model. Instead, it chooses to memorize the training instances which are subsequently used as ‘knowledge’ for the prediction phase. Concretely, this means that only when a query to our database is made (i.e. when we ask it to predict a label given an input), will the algorithm use the training instances to spit out an answer.

Model Evaluation - Classification

Confusion Matrix

A confusion matrix shows the number of correct and incorrect predictions made by the classification model compared to the actual outcomes (target values) in the data. The matrix is NxN, where N is the number of target values (classes). The performance of such models is commonly evaluated using the counts in the matrix. The following table shows a 2x2 confusion matrix for two classes (Positive and Negative).

                             Predicted
                      Positive         Negative
Actual   Positive     True Positive    False Negative
         Negative     False Positive   True Negative

  • Positive Cases: samples with the trait that is being examined / identified
  • Accuracy: the proportion of the total number of predictions that were correct
  • Sensitivity (also called Recall or the True Positive Rate): the proportion of actual positive cases which are correctly identified
  • Specificity: the proportion of actual negative cases which are correctly identified
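
In terms of the counts in such a matrix (TP = true positives, TN = true negatives, FP = false positives, FN = false negatives), these measures are:

\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad
\text{Specificity} = \frac{TN}{TN + FP}
\]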

The k-NN algorithm

  • k-NN algorithm gets its name from the fact that it uses information about the k-nearest neighbours to classify unlabelled examples

  • k is a parameter that can be varied, meaning that any number of nearest neighbours could be used

  • For each unlabelled record in the test dataset, k-NN identifies the k records in the training dataset that are "nearest" in similarity

  • The unlabelled test instance is assigned the class of the majority of its k nearest neighbours (a minimal sketch of this procedure follows)
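
To make the idea concrete, here is a minimal sketch of the procedure in R. The function name knnPredict is ours, purely for illustration; the worked example below uses knn() from the class package instead.

# A minimal k-NN classifier sketch: Euclidean distance + majority vote
knnPredict <- function(train, trainLabels, query, k = 3) {
  # distance from the query point to every training record
  d <- sqrt(rowSums(sweep(as.matrix(train), 2, as.numeric(query))^2))
  nn <- order(d)[1:k]                         # indices of the k nearest neighbours
  names(which.max(table(trainLabels[nn])))    # majority class among those neighbours
}
# e.g. knnPredict(iris[, 1:4], iris$Species, c(5.1, 3.5, 1.4, 0.2), k = 5)  # "setosa"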


An example of k-NN Classification

Read in the file and display the summary

library(class)   # provides the knn() function used below
file_url <- "https://raw.githubusercontent.com/amarnathbose/Datafiles/master/german_credit.csv"
h <- read.csv(file_url)
head(h,4)
##   Creditability Account.Balance Duration.of.Credit..month.
## 1             1               1                         18
## 2             1               1                          9
## 3             1               2                         12
## 4             1               1                         12
##   Payment.Status.of.Previous.Credit Purpose Credit.Amount
## 1                                 4       2          1049
## 2                                 4       0          2799
## 3                                 2       9           841
## 4                                 4       0          2122
##   Value.Savings.Stocks Length.of.current.employment Instalment.per.cent
## 1                    1                            2                   4
## 2                    1                            3                   2
## 3                    2                            4                   2
## 4                    1                            3                   3
##   Sex...Marital.Status Guarantors Duration.in.Current.address
## 1                    2          1                           4
## 2                    3          1                           2
## 3                    2          1                           4
## 4                    3          1                           2
##   Most.valuable.available.asset Age..years. Concurrent.Credits
## 1                             2          21                  3
## 2                             1          36                  3
## 3                             1          23                  3
## 4                             1          39                  3
##   Type.of.apartment No.of.Credits.at.this.Bank Occupation No.of.dependents
## 1                 1                          1          3                1
## 2                 1                          2          3                2
## 3                 1                          1          2                1
## 4                 1                          2          2                2
##   Telephone Foreign.Worker
## 1         1              1
## 2         1              1
## 3         1              1
## 4         1              2
summary(h)
##  Creditability Account.Balance Duration.of.Credit..month.
##  Min.   :0.0   Min.   :1.000   Min.   : 4.0              
##  1st Qu.:0.0   1st Qu.:1.000   1st Qu.:12.0              
##  Median :1.0   Median :2.000   Median :18.0              
##  Mean   :0.7   Mean   :2.577   Mean   :20.9              
##  3rd Qu.:1.0   3rd Qu.:4.000   3rd Qu.:24.0              
##  Max.   :1.0   Max.   :4.000   Max.   :72.0              
##  Payment.Status.of.Previous.Credit    Purpose       Credit.Amount  
##  Min.   :0.000                     Min.   : 0.000   Min.   :  250  
##  1st Qu.:2.000                     1st Qu.: 1.000   1st Qu.: 1366  
##  Median :2.000                     Median : 2.000   Median : 2320  
##  Mean   :2.545                     Mean   : 2.828   Mean   : 3271  
##  3rd Qu.:4.000                     3rd Qu.: 3.000   3rd Qu.: 3972  
##  Max.   :4.000                     Max.   :10.000   Max.   :18424  
##  Value.Savings.Stocks Length.of.current.employment Instalment.per.cent
##  Min.   :1.000        Min.   :1.000                Min.   :1.000      
##  1st Qu.:1.000        1st Qu.:3.000                1st Qu.:2.000      
##  Median :1.000        Median :3.000                Median :3.000      
##  Mean   :2.105        Mean   :3.384                Mean   :2.973      
##  3rd Qu.:3.000        3rd Qu.:5.000                3rd Qu.:4.000      
##  Max.   :5.000        Max.   :5.000                Max.   :4.000      
##  Sex...Marital.Status   Guarantors    Duration.in.Current.address
##  Min.   :1.000        Min.   :1.000   Min.   :1.000              
##  1st Qu.:2.000        1st Qu.:1.000   1st Qu.:2.000              
##  Median :3.000        Median :1.000   Median :3.000              
##  Mean   :2.682        Mean   :1.145   Mean   :2.845              
##  3rd Qu.:3.000        3rd Qu.:1.000   3rd Qu.:4.000              
##  Max.   :4.000        Max.   :3.000   Max.   :4.000              
##  Most.valuable.available.asset  Age..years.    Concurrent.Credits
##  Min.   :1.000                 Min.   :19.00   Min.   :1.000     
##  1st Qu.:1.000                 1st Qu.:27.00   1st Qu.:3.000     
##  Median :2.000                 Median :33.00   Median :3.000     
##  Mean   :2.358                 Mean   :35.54   Mean   :2.675     
##  3rd Qu.:3.000                 3rd Qu.:42.00   3rd Qu.:3.000     
##  Max.   :4.000                 Max.   :75.00   Max.   :3.000     
##  Type.of.apartment No.of.Credits.at.this.Bank   Occupation   
##  Min.   :1.000     Min.   :1.000              Min.   :1.000  
##  1st Qu.:2.000     1st Qu.:1.000              1st Qu.:3.000  
##  Median :2.000     Median :1.000              Median :3.000  
##  Mean   :1.928     Mean   :1.407              Mean   :2.904  
##  3rd Qu.:2.000     3rd Qu.:2.000              3rd Qu.:3.000  
##  Max.   :3.000     Max.   :4.000              Max.   :4.000  
##  No.of.dependents   Telephone     Foreign.Worker 
##  Min.   :1.000    Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000    1st Qu.:1.000   1st Qu.:1.000  
##  Median :1.000    Median :1.000   Median :1.000  
##  Mean   :1.155    Mean   :1.404   Mean   :1.037  
##  3rd Qu.:1.000    3rd Qu.:2.000   3rd Qu.:1.000  
##  Max.   :2.000    Max.   :2.000   Max.   :2.000

The first column, Creditability, contains the known credit-worthiness of the 1000 people in this data set.

The split between the two classes can be seen as follows:

table(h$Creditability)
## 
##   0   1 
## 300 700

In this table, 0 represents the bad creditability instances (instances that have defaulted) and 1 represents the good creditability instances (those that have not defaulted).

The purpose of this classification exercise is to build a supervised model that can predict low-creditability instances from an applicant's profile. The company providing the loans needs reasonable assurance that a loan will be paid back and that the chances of default are low.


Splitting data into Training and Test sets in 2:1 ratio

sg0 <- which(h$Creditability==0)       # row indices of bad-credit (positive) cases
sg1 <- which(h$Creditability==1)       # equivalently: setdiff(1:nrow(h), sg0)
sg0tr <- sample(sg0,length(sg0)*2/3)   # a random 2/3 of each class for training
sg1tr <- sample(sg1,length(sg1)*2/3)
sg0ts <- sg0[!sg0 %in% sg0tr]
sg1ts <- sg1[!sg1 %in% sg1tr]
cat("Training+",length(sg0tr),'\n',"Training-",length(sg1tr),"\n","Testing+",length(sg0ts),"\n","Testing-",length(sg1ts))
## Training+ 200 
##  Training- 466 
##  Testing+ 100 
##  Testing- 234

Combine the selected records for the Training and the Test datasets

htr <- rbind(h[sg0tr,],h[sg1tr,])
hts <- rbind(h[sg0ts,],h[sg1ts,])
table(htr$Creditability)
## 
##   0   1 
## 200 466
table(hts$Creditability)
## 
##   0   1 
## 100 234

Remove labels

The htr and hts datasets have been created. We now store the Training and Test data labels in two vectors, trLabels and tsLabels, and thereafter remove the Creditability column from the htr and hts datasets.

trLabels <- htr$Creditability
tsLabels <- hts$Creditability
htr <- htr[,-1]
hts <- hts[,-1]

Normalizing the predictor variables

  • Create a function to normalize

We now create a function to normalize any column (min-max scaling to the range 0 to 1).

normalize <- function(x) return( (x-min(x))/(max(x)-min(x)))

  • An aside: lapply

Let us create a data frame

d <- data.frame("col1"=c(1,3,7), "col2"=c(2,6,4))
print(d,row.names=F)
##  col1 col2
##     1    2
##     3    6
##     7    4

In order to normalize the two columns we can use lapply (list apply), as follows.

dn <- lapply(d,normalize)
dn <- as.data.frame(dn)
print(dn,row.names=F)
##       col1 col2
##  0.0000000  0.0
##  0.3333333  1.0
##  1.0000000  0.5

We can also use lapply to round the column values to 2 decimals

dn <- as.data.frame(lapply(dn,round,2))
print(dn,row.names=F)
##  col1 col2
##  0.00  0.0
##  0.33  1.0
##  1.00  0.5
  • Now normalize the columns of htr and hts
htr <- as.data.frame(lapply(htr,normalize))
hts <- as.data.frame(lapply(hts,normalize))
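
Note that htr and hts are each normalized with their own minima and maxima here. A common alternative (just a sketch, not what this example does, and it would have to be applied before the lapply calls above overwrite the raw values) is to rescale the test set using the training set's minima and maxima so that both sets share the same scale:

# Hypothetical alternative: rescale each test column using the TRAINING min/max
normalizeWith <- function(x, lo, hi) (x - lo) / (hi - lo)
# hts <- as.data.frame(Map(normalizeWith, hts, lapply(htr, min), lapply(htr, max)))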

Now display the summary statistics for the columns of htr and hts that originally had large values, viz.

  • column 2: Duration.of.Credit..month.
  • column 5: Credit.Amount
  • column 13: Age..years.

These should all lie between 0 and 1 now, since they have been normalized.

summary(htr[,c(2,5,13)])
##  Duration.of.Credit..month. Credit.Amount      Age..years.    
##  Min.   :0.0000             Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.1176             1st Qu.:0.07132   1st Qu.:0.1429  
##  Median :0.2059             Median :0.13314   Median :0.2500  
##  Mean   :0.2521             Mean   :0.19091   Mean   :0.2896  
##  3rd Qu.:0.2941             3rd Qu.:0.24007   3rd Qu.:0.3929  
##  Max.   :1.0000             Max.   :1.00000   Max.   :1.0000
summary(hts[,c(2,5,13)])
##  Duration.of.Credit..month. Credit.Amount      Age..years.    
##  Min.   :0.0000             Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.1111             1st Qu.:0.05966   1st Qu.:0.1250  
##  Median :0.2222             Median :0.10897   Median :0.2679  
##  Mean   :0.2671             Mean   :0.16852   Mean   :0.3069  
##  3rd Qu.:0.3333             3rd Qu.:0.19784   3rd Qu.:0.4286  
##  Max.   :1.0000             Max.   :1.00000   Max.   :1.0000

Functions to evaluate k-NN efficiency

  • splitFile: to split data into training and test sets in a given proportion

Splitting data file into Training & Test data sets

splitFile <- function(dataSet, trProp, classCol) {
  v <- dataSet[,classCol]
  sg0 <- which(v==0)                          # indices of class 0 (bad credit)
  sg1 <- which(v==1)                          # indices of class 1 (good credit)
  sg0tr <- sample(sg0,length(sg0)*trProp)     # stratified sample for training
  sg1tr <- sample(sg1,length(sg1)*trProp)
  sg0ts <- sg0[!sg0 %in% sg0tr]               # remaining records go to the test set
  sg1ts <- sg1[!sg1 %in% sg1tr]
  htr <- rbind(dataSet[sg0tr,],dataSet[sg1tr,])
  hts <- rbind(dataSet[sg0ts,],dataSet[sg1ts,])
  trLabels <- htr[,classCol]
  tsLabels <- hts[,classCol]
  htr <- htr[,-which(names(htr) == classCol)]  # drop the label column
  hts <- hts[,-which(names(hts) == classCol)]
  return(list(tr=htr,ts=hts,trL=trLabels,tsL=tsLabels))
}

Split the data set using splitFile

a <- splitFile(h,.6,'Creditability')
trData <- a[[1]]
tsData <- a[[2]]
trL <- a[[3]]
tsL <- a[[4]]
table(trL)
## trL
##   0   1 
## 180 420
table(tsL)
## tsL
##   0   1 
## 120 280

Do a k-NN Classification

a <- splitFile(h,.6,'Creditability')
trData <- a[[1]]
tsData <- a[[2]]
trLabels <- a[[3]]
tsLabels <- a[[4]]
tsPred <- knn(trData, tsData, trLabels, k=3)
#CrossTable(tsLabels, tsPred)
table(tsLabels,tsPred)
##         tsPred
## tsLabels   0   1
##        0  37  83
##        1  63 217
accu0 <- mean(tsLabels == tsPred)                                       # overall accuracy
sens0 <- sum(tsLabels == tsPred & tsLabels == 0) / sum(tsLabels == 0)   # class 0 (bad) correctly identified
spec0 <- sum(tsLabels == tsPred & tsLabels == 1) / sum(tsLabels == 1)   # class 1 (good) correctly identified
cat("Accuracy=",round(accu0,2),'\n',"Sensitivity=",round(sens0,2),'\n',"Specificity=",round(spec0,2))
## Accuracy= 0.64 
##  Sensitivity= 0.31 
##  Specificity= 0.78

The accuracy of prediction is \(\frac{\text{True Negatives + True Positives}}{\text{Test Sample Size}} = \frac{217 + 37}{400}\) = 0.635

Redo k-NN Classification: Improve the efficiency

Choosing the best value of k

Function to generate Training & Test Error rates for various k

bestK <- function(trData, trLabels, tsData, tsLabels) {
  ctr <- c(); cts <- c()
  for (k in 1:20) {
    knnTr <- knn(trData, trData, trLabels, k)   # predictions on the training set itself
    knnTs <- knn(trData, tsData, trLabels, k)   # predictions on the test set
    trTable <- prop.table(table(knnTr, trLabels))
    tsTable <- prop.table(table(knnTs, tsLabels))
    erTr <- trTable[1,2] + trTable[2,1]         # off-diagonal proportions = error rate
    erTs <- tsTable[1,2] + tsTable[2,1]
    ctr <- c(ctr,erTr)
    cts <- c(cts,erTs)
  }
  err <- data.frame(k=1:20, trER=ctr, tsER=cts)
  return(err)
}

Invoke the function bestK to build the error-rate dataset, and plot the training and test error rates for various values of k.

err <- bestK(trData, trLabels, tsData, tsLabels)
plot(err$k,err$trER,type='o',ylim=c(0,.5),xlab="k",ylab="Error rate",col="blue")  # blue: training error
lines(err$k,err$tsER,type='o',col="red")                                          # red: test error

What value of k minimizes the difference between the training and test error rates? k = 3 seems too small a value, because the test error rate keeps falling sharply until \(k \approx 7\).

Rerun k-NN with k=7

tsPred <- knn(trData, tsData, trLabels, k=7)
table(tsLabels,tsPred)
##         tsPred
## tsLabels   0   1
##        0  26  94
##        1  42 238
#paste("The accuracy of prediction is", length(which(tsLabels==tsPred)==TRUE)/length(tsLabels))
accu1 <- mean(tsLabels == tsPred)
sens1 <- sum(tsLabels == tsPred & tsLabels == 0) / sum(tsLabels == 0)
spec1 <- sum(tsLabels == tsPred & tsLabels == 1) / sum(tsLabels == 1)
cat("Accuracy=",round(accu1,2),'\n',"Sensitivity=",round(sens1,2),'\n',"Specificity=",round(spec1,2))
## Accuracy= 0.66 
##  Sensitivity= 0.22 
##  Specificity= 0.85

Cost of Misclassification

The cost of misclassification varies from problem to problem. While correct classification does not involve any cost, there are two ways of making a wrong classification, viz.

  • False Positives: identifying an instance as positive when it is not
  • False Negatives: failing to identify an instance as positive when it actually is

In the case of this example, it would definitely be more costly to have false negatives than it would be to have false positives. Let the cost matrix be as follows.

Cost Matrix

                         Predicted
                      Bad+      Good-
Actual   Bad+           0          3
         Good-          1          0

With the sensitivity of 0.22 and specificity of 0.85 obtained with k = 7, the expected cost of misclassification is \(3 \times (1 - 0.2167) + 1 \times (1 - 0.85) = 2.5\).

This is an increase in the expected cost of misclassification over the earlier figures (k = 3) of sensitivity = 0.31 and specificity = 0.78, where the expected cost of misclassification was \(3 \times (1 - 0.3083) + 1 \times (1 - 0.775) = 2.3\).
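
For reference, a small helper function (the name expCost is ours, not part of the original script) that computes this expected cost from the sensitivity, the specificity and the cost matrix above:

# Expected misclassification cost: cFN*(1 - sensitivity) + cFP*(1 - specificity)
expCost <- function(sens, spec, cFN = 3, cFP = 1) cFN*(1 - sens) + cFP*(1 - spec)
expCost(sens1, spec1)   # k = 7, roughly 2.5
expCost(sens0, spec0)   # k = 3, roughly 2.3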

library(knitr)   # for kable()
acc <- c()
sen <- c()
spc <- c()
for (i in 1:50) {
  tsPred <- knn(trData, tsData, trLabels, k=i)
  acc <- c(acc, mean(tsLabels == tsPred))
  sen <- c(sen, sum(tsLabels == tsPred & tsLabels == 0) / sum(tsLabels == 0))
  spc <- c(spc, sum(tsLabels == tsPred & tsLabels == 1) / sum(tsLabels == 1))
}
costdf <- data.frame(k=1:50,Accuracy=acc,Sensitivity=sen,Specificity=spc)
cost <- 3*(1-costdf$Sensitivity) + (1-costdf$Specificity)   # FN cost 3, FP cost 1 (see cost matrix)
costdf <- cbind(costdf,"Cost"=cost)
kable(costdf[c(1:15,seq(20,50,by=5)),],row.names=FALSE)
  k   Accuracy   Sensitivity   Specificity       Cost
---  ---------  ------------  ------------  ---------
  1     0.5950     0.3083333     0.7178571   2.357143
  2     0.6100     0.3666667     0.7142857   2.185714
  3     0.6350     0.3083333     0.7750000   2.300000
  4     0.6250     0.2916667     0.7678571   2.357143
  5     0.6450     0.2583333     0.8107143   2.414286
  6     0.6400     0.2250000     0.8178571   2.507143
  7     0.6600     0.2166667     0.8500000   2.500000
  8     0.6300     0.1333333     0.8428571   2.757143
  9     0.6650     0.1916667     0.8678571   2.557143
 10     0.6550     0.1583333     0.8678571   2.657143
 11     0.6700     0.1666667     0.8857143   2.614286
 12     0.6625     0.1250000     0.8928571   2.732143
 13     0.6675     0.1416667     0.8928571   2.682143
 14     0.6675     0.1333333     0.8964286   2.703571
 15     0.6800     0.1333333     0.9142857   2.685714
 20     0.6875     0.0750000     0.9500000   2.825000
 25     0.6925     0.0666667     0.9607143   2.839286
 30     0.6975     0.0666667     0.9678571   2.832143
 35     0.7000     0.0666667     0.9714286   2.828571
 40     0.7025     0.0583333     0.9785714   2.846429
 45     0.7025     0.0583333     0.9785714   2.846429
 50     0.7025     0.0583333     0.9785714   2.846429

As k increases:

  • the accuracy increases till about k = 18 and plateaus thereafter
  • the sensitivity falls with increasing values of k
  • the specificity rises with increasing values of k
  • but, because of the large cost of misclassifying a positive (i.e. the cost of a false negative), the expected cost increases beyond k ≈ 3
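
A quick plot of the expected cost against k (a sketch using the costdf data frame built above) makes this trade-off visible:

# Expected cost as a function of k
plot(costdf$k, costdf$Cost, type='o', xlab="k", ylab="Expected cost", col="darkgreen")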

The Cost Matrix demands high sensitivity, i.e. it is imperative to correctly classify the positive (Bad Creditability) cases. Although the accuracy seems to plateau around k = 20, the large cost associated with a false negative requires that we keep k low; k = 3 would be a good choice. One should not go strictly by the lowest cost value in the illustration.

Also, the data set has 300 positive (bad credit) cases against 700 negative (good credit) cases, which makes positives relatively rare. A smarter strategy would be to select equal numbers of the two classes (bad and good) for the training set, or to use a sampling scheme that picks proportionally more bad cases than good, as sketched below.
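
A sketch of the first strategy, assuming we simply down-sample the majority (good) class to match the 300 bad cases before splitting:

# Balanced sample: keep all 300 bad cases, draw 300 good cases at random
set.seed(1)                                      # for reproducibility
bad  <- which(h$Creditability == 0)
good <- sample(which(h$Creditability == 1), length(bad))
hBal <- h[c(bad, good), ]
table(hBal$Creditability)                        # 300 of each class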