We will use \(x\) to denote a feature (a.k.a. predictor, attribute) and \(y\) to denote the target (a.k.a. label, class) we are trying to predict.
KNN falls in the supervised learning family of algorithms. Informally, this means that given a labelled dataset consisting of training observations \((x, y)\), we would like to capture the relationship between \(x\) and \(y\). More formally, our goal is to learn a function \(h : X \rightarrow Y\) so that, given an unseen observation \(x\), \(h(x)\) can confidently predict the corresponding output \(y\).
The KNN classifier is also a non-parametric and instance-based learning algorithm.
Non-parametric means it makes no explicit assumptions about the functional form of \(h\), avoiding the dangers of mismodeling the underlying distribution of the data. For example, suppose our data is highly non-Gaussian but the learning model we choose assumes a Gaussian form. In that case, our algorithm would make extremely poor predictions.
Instance-based learning means that our algorithm doesn’t explicitly learn a model. Instead, it chooses to memorize the training instances which are subsequently used as ‘knowledge’ for the prediction phase. Concretely, this means that only when a query to our database is made (i.e. when we ask it to predict a label given an input), will the algorithm use the training instances to spit out an answer.
Confusion Matrix
A confusion matrix shows the number of correct and incorrect predictions made by the classification model compared to the actual outcomes (target value) in the data. The matrix is NxN, where N is the number of target values (classes). Performance of such models is commonly evaluated using the data in the matrix. The following table displays a 2x2 confusion matrix for two classes (Positive and Negative).
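| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |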
- The k-NN algorithm gets its name from the fact that it uses information about the k nearest neighbours to classify unlabelled examples.
- k is a variable term, implying that any number of nearest neighbours could be used.
- For each unlabelled record in the test dataset, k-NN identifies the k records in the training dataset that are "nearest" in similarity.
- The unlabelled test instance is assigned the class of the majority of the k nearest neighbours.
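To make the procedure concrete, here is a minimal from-scratch sketch (the name simpleKNN and the choice of Euclidean distance are illustrative assumptions; the rest of this exercise uses knn() from the class package instead):

simpleKNN <- function(train, test, labels, k = 3) {
  train <- as.matrix(train)
  test <- as.matrix(test)
  apply(test, 1, function(q) {
    d <- sqrt(colSums((t(train) - q)^2))   # Euclidean distance to every training row
    nn <- order(d)[1:k]                    # indices of the k nearest training rows
    votes <- table(labels[nn])             # class frequencies among the neighbours
    names(votes)[which.max(votes)]         # majority vote; ties go to the first maximum
  })
}

A call such as simpleKNN(trainSet, testSet, trainLabels, 3) would then behave broadly like knn(trainSet, testSet, trainLabels, k=3).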
library(class)   # provides the knn() classifier
file_url <- "https://raw.githubusercontent.com/amarnathbose/Datafiles/master/german_credit.csv"
h <- read.csv(file_url)
head(h,4)
## Creditability Account.Balance Duration.of.Credit..month.
## 1 1 1 18
## 2 1 1 9
## 3 1 2 12
## 4 1 1 12
## Payment.Status.of.Previous.Credit Purpose Credit.Amount
## 1 4 2 1049
## 2 4 0 2799
## 3 2 9 841
## 4 4 0 2122
## Value.Savings.Stocks Length.of.current.employment Instalment.per.cent
## 1 1 2 4
## 2 1 3 2
## 3 2 4 2
## 4 1 3 3
## Sex...Marital.Status Guarantors Duration.in.Current.address
## 1 2 1 4
## 2 3 1 2
## 3 2 1 4
## 4 3 1 2
## Most.valuable.available.asset Age..years. Concurrent.Credits
## 1 2 21 3
## 2 1 36 3
## 3 1 23 3
## 4 1 39 3
## Type.of.apartment No.of.Credits.at.this.Bank Occupation No.of.dependents
## 1 1 1 3 1
## 2 1 2 3 2
## 3 1 1 2 1
## 4 1 2 2 2
## Telephone Foreign.Worker
## 1 1 1
## 2 1 1
## 3 1 1
## 4 1 2
summary(h)
## Creditability Account.Balance Duration.of.Credit..month.
## Min. :0.0 Min. :1.000 Min. : 4.0
## 1st Qu.:0.0 1st Qu.:1.000 1st Qu.:12.0
## Median :1.0 Median :2.000 Median :18.0
## Mean :0.7 Mean :2.577 Mean :20.9
## 3rd Qu.:1.0 3rd Qu.:4.000 3rd Qu.:24.0
## Max. :1.0 Max. :4.000 Max. :72.0
## Payment.Status.of.Previous.Credit Purpose Credit.Amount
## Min. :0.000 Min. : 0.000 Min. : 250
## 1st Qu.:2.000 1st Qu.: 1.000 1st Qu.: 1366
## Median :2.000 Median : 2.000 Median : 2320
## Mean :2.545 Mean : 2.828 Mean : 3271
## 3rd Qu.:4.000 3rd Qu.: 3.000 3rd Qu.: 3972
## Max. :4.000 Max. :10.000 Max. :18424
## Value.Savings.Stocks Length.of.current.employment Instalment.per.cent
## Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:3.000 1st Qu.:2.000
## Median :1.000 Median :3.000 Median :3.000
## Mean :2.105 Mean :3.384 Mean :2.973
## 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.:4.000
## Max. :5.000 Max. :5.000 Max. :4.000
## Sex...Marital.Status Guarantors Duration.in.Current.address
## Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:2.000
## Median :3.000 Median :1.000 Median :3.000
## Mean :2.682 Mean :1.145 Mean :2.845
## 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.:4.000
## Max. :4.000 Max. :3.000 Max. :4.000
## Most.valuable.available.asset Age..years. Concurrent.Credits
## Min. :1.000 Min. :19.00 Min. :1.000
## 1st Qu.:1.000 1st Qu.:27.00 1st Qu.:3.000
## Median :2.000 Median :33.00 Median :3.000
## Mean :2.358 Mean :35.54 Mean :2.675
## 3rd Qu.:3.000 3rd Qu.:42.00 3rd Qu.:3.000
## Max. :4.000 Max. :75.00 Max. :3.000
## Type.of.apartment No.of.Credits.at.this.Bank Occupation
## Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:3.000
## Median :2.000 Median :1.000 Median :3.000
## Mean :1.928 Mean :1.407 Mean :2.904
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:3.000
## Max. :3.000 Max. :4.000 Max. :4.000
## No.of.dependents Telephone Foreign.Worker
## Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :1.000 Median :1.000 Median :1.000
## Mean :1.155 Mean :1.404 Mean :1.037
## 3rd Qu.:1.000 3rd Qu.:2.000 3rd Qu.:1.000
## Max. :2.000 Max. :2.000 Max. :2.000
The first column, Creditability, contains the known credit-worthiness of the 1000 people in this data set.
The class split can be found as follows:
table(h$Creditability)
##
## 0 1
## 300 700
In this table, 0 represents the bad creditability instances (instances that have defaulted) and 1 represents the good creditability instances (those that have not defaulted).
The purpose of this classification exercise is to create a supervised learning model that is able to predict low-creditability instances based on an applicant's profile. The company that is providing loans needs reasonable assurance that the loan will be paid back, and that the chances of defaulting are low.
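The train/test split below relies on random sampling, so the exact rows selected (and hence later results) will vary between runs. For reproducibility one could fix the random seed first; the value 42 is an arbitrary choice:

set.seed(42)   # arbitrary seed; makes the random sampling below reproducible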
sg0 <- which(h$Creditability==0)   # row indices of bad-credit cases
sg1 <- which(h$Creditability==1)   # OR sg1 <- setdiff(1:nrow(h), sg0)
sg0tr <- sample(sg0,length(sg0)*2/3)   # 2/3 of each class for training
sg1tr <- sample(sg1,length(sg1)*2/3)
sg0ts <- sg0[!sg0 %in% sg0tr]          # the remaining 1/3 for testing
sg1ts <- sg1[!sg1 %in% sg1tr]
cat("Training+",length(sg0tr),'\n',"Training-",length(sg1tr),"\n","Testing+",length(sg0ts),"\n","Testing-",length(sg1ts))
## Training+ 200
## Training- 466
## Testing+ 100
## Testing- 234
htr <- rbind(h[sg0tr,],h[sg1tr,])
hts <- rbind(h[sg0ts,],h[sg1ts,])
table(htr$Creditability)
##
## 0 1
## 200 466
table(hts$Creditability)
##
## 0 1
## 100 234
The htr and hts datasets have been created. We now store the training and test data labels in two vectors, trLabels and tsLabels, and thereafter remove the Creditability column from the htr and hts datasets.
trLabels <- htr$Creditability
tsLabels <- hts$Creditability
htr <- htr[,-1]
hts <- hts[,-1]
normalize <- function(x) return( (x-min(x))/(max(x)-min(x)) )   # min-max scaling to [0, 1]
Let us create a small data frame to see it in action:
d <- data.frame("col1"=c(1,3,7), "col2"=c(2,6,4))
print(d,row.names=F)
## col1 col2
## 1 2
## 3 6
## 7 4
In order to normalize the two columns we need to use lapply (list apply), as follows.
dn <- lapply(d,normalize)
dn <- as.data.frame(dn)
print(dn,row.names=F)
## col1 col2
## 0.0000000 0.0
## 0.3333333 1.0
## 1.0000000 0.5
We can also use lapply to round the column values to 2 decimals:
dn <- as.data.frame(lapply(dn,round,2))
print(dn,row.names=F)
## col1 col2
## 0.00 0.0
## 0.33 1.0
## 1.00 0.5
htr <- as.data.frame(lapply(htr,normalize))   # normalize every column of the training set
hts <- as.data.frame(lapply(hts,normalize))   # normalize every column of the test set
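Note that htr and hts are each scaled by their own column minima and maxima here. A common alternative (sketched below; normalizeBy is a hypothetical helper, not used elsewhere in this exercise) is to reuse the training-set ranges when scaling the test set, so that both sets sit on exactly the same scale:

normalizeBy <- function(x, ref) (x - min(ref)) / (max(ref) - min(ref))   # scale x by ref's range
# Applied column-wise, before htr itself is normalized:
# hts <- as.data.frame(Map(normalizeBy, hts, htr))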
Now display the summary statistics, for each of htr and hts, of the columns that have large values, viz.

- column 2: Duration.of.Credit..month.
- column 5: Credit.Amount
- column 13: Age..years.

These would all be between 0 and 1, since they have been normalized.
summary(htr[,c(2,5,13)])
## Duration.of.Credit..month. Credit.Amount Age..years.
## Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.1176 1st Qu.:0.07132 1st Qu.:0.1429
## Median :0.2059 Median :0.13314 Median :0.2500
## Mean :0.2521 Mean :0.19091 Mean :0.2896
## 3rd Qu.:0.2941 3rd Qu.:0.24007 3rd Qu.:0.3929
## Max. :1.0000 Max. :1.00000 Max. :1.0000
summary(hts[,c(2,5,13)])
## Duration.of.Credit..month. Credit.Amount Age..years.
## Min. :0.0000 Min. :0.00000 Min. :0.0000
## 1st Qu.:0.1111 1st Qu.:0.05966 1st Qu.:0.1250
## Median :0.2222 Median :0.10897 Median :0.2679
## Mean :0.2671 Mean :0.16852 Mean :0.3069
## 3rd Qu.:0.3333 3rd Qu.:0.19784 3rd Qu.:0.4286
## Max. :1.0000 Max. :1.00000 Max. :1.0000
splitFile <- function(dataSet, trProp, classCol) {
  v <- dataSet[,classCol]
  sg0 <- which(v==0)                        # indices of class 0
  sg1 <- which(v==1)                        # indices of class 1
  sg0tr <- sample(sg0,length(sg0)*trProp)   # stratified sample for training
  sg1tr <- sample(sg1,length(sg1)*trProp)
  sg0ts <- sg0[!sg0 %in% sg0tr]             # the rest goes to testing
  sg1ts <- sg1[!sg1 %in% sg1tr]
  htr <- rbind(dataSet[sg0tr,],dataSet[sg1tr,])
  hts <- rbind(dataSet[sg0ts,],dataSet[sg1ts,])
  trLabels <- htr[,classCol]
  tsLabels <- hts[,classCol]
  htr <- htr[,-which(names(htr) == classCol)]   # drop the class column
  hts <- hts[,-which(names(hts) == classCol)]
  return(list(tr=htr,ts=hts,trL=trLabels,tsL=tsLabels))
}
Split the data set using splitFile:
a <- splitFile(h,.6,'Creditability')
trData <- a[[1]]
tsData <- a[[2]]
trL <- a[[3]]
tsL <- a[[4]]
table(trL)
## trL
## 0 1
## 180 420
table(tsL)
## tsL
## 0 1
## 120 280
a <- splitFile(h,.6,'Creditability')
trData <- a[[1]]
tsData <- a[[2]]
trLabels <- a[[3]]
tsLabels <- a[[4]]
tsPred <- knn(trData, tsData, trLabels, k=3)
# CrossTable(tsLabels, tsPred)   # richer cross-tabulation, from the gmodels package
table(tsLabels,tsPred)
## tsPred
## tsLabels 0 1
## 0 37 83
## 1 63 217
accu0 <- sum(tsLabels == tsPred) / length(tsLabels)                    # overall accuracy
sens0 <- sum(tsLabels == tsPred & tsLabels == 0) / sum(tsLabels == 0)  # correct among actual 0 (bad)
spec0 <- sum(tsLabels == tsPred & tsLabels == 1) / sum(tsLabels == 1)  # correct among actual 1 (good)
cat("Accuracy=",round(accu0,2),'\n',"Sensitivity=",round(sens0,2),'\n',"Specificity=",round(spec0,2))
## Accuracy= 0.64
## Sensitivity= 0.31
## Specificity= 0.78
The accuracy of prediction is \(\frac{\text{True Negatives + True Positives}}{\text{Test Sample Size}} = \frac{217 + 37}{400} = 0.635\).
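For reference, sensitivity and specificity here treat the bad-credit class (0) as positive: \(\text{Sensitivity} = \frac{\text{True Positives}}{\text{Actual Positives}} = \frac{37}{120} \approx 0.31\) and \(\text{Specificity} = \frac{\text{True Negatives}}{\text{Actual Negatives}} = \frac{217}{280} \approx 0.78\), matching the figures printed above.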
Function to generate Training & Test Error rates for various k
bestK <- function(trData, trLabels, tsData, tsLabels) {
  ctr <- c(); cts <- c()
  for (k in 1:20) {
    knnTr <- knn(trData, trData, trLabels, k)   # predictions on the training set itself
    knnTs <- knn(trData, tsData, trLabels, k)   # predictions on the test set
    trTable <- prop.table(table(knnTr, trLabels))
    tsTable <- prop.table(table(knnTs, tsLabels))
    erTr <- trTable[1,2] + trTable[2,1]         # off-diagonal proportions = error rate
    erTs <- tsTable[1,2] + tsTable[2,1]
    ctr <- c(ctr,erTr)
    cts <- c(cts,erTs)
  }
  err <- data.frame(k=1:20, trER=ctr, tsER=cts)
  return(err)
}
Invoke the function bestK to create the error-rate data set, and plot the training (blue) and test (red) error rates for various values of k:
err <- bestK(trData, trLabels, tsData, tsLabels)
plot(err$k,err$trER,type='o',ylim=c(0,.5),xlab="k",ylab="Error rate",col="blue")
lines(err$k,err$tsER,type='o',col="red")
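The plot call above does not draw a legend; one could optionally add one, e.g.

legend("topright", legend=c("Training error","Test error"), col=c("blue","red"), lty=1, pch=1)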
What value of k minimizes the difference between the training and test error rates? k = 3 seems to be too small a value, because the test error rate falls drastically till \(k \approx 7\).
tsPred <- knn(trData, tsData, trLabels, k=7)
table(tsLabels,tsPred)
## tsPred
## tsLabels 0 1
## 0 26 94
## 1 42 238
#paste("The accuracy of prediction is", length(which(tsLabels==tsPred)==TRUE)/length(tsLabels))
accu1 <- sum(tsLabels == tsPred) / length(tsLabels)
sens1 <- sum(tsLabels == tsPred & tsLabels == 0) / sum(tsLabels == 0)
spec1 <- sum(tsLabels == tsPred & tsLabels == 1) / sum(tsLabels == 1)
cat("Accuracy=",round(accu1,2),'\n',"Sensitivity=",round(sens1,2),'\n',"Specificity=",round(spec1,2))
## Accuracy= 0.66
## Sensitivity= 0.22
## Specificity= 0.85
The cost of misclassification varies from problem to problem. While correct classification does not involve any cost, there are two ways of making a wrong classification, viz. a false positive (a good-credit case wrongly classified as bad) and a false negative (a bad-credit case wrongly classified as good).
In the case of this example, it would definitely be more costly to have false negatives than it would be to have false positives. Let the cost matrix be as follows.
| Actual \ Predicted | Bad+ | Good- |
|---|---|---|
| Bad+ | 0 | 4 |
| Good- | 1 | 0 |
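Combining this matrix with the error rates gives the expected cost of misclassification used below: \(\text{Expected cost} = 4\,(1-\text{Sensitivity}) + 1\,(1-\text{Specificity})\). For example, \(4(1-0.2167) + (1-0.85) \approx 3.28\).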
With the sensitivity value of 0.22 and specificity of 0.85, the expected cost of misclassification is 3.28.
This is an increase in the expected cost of misclassification over the earlier figures of sensitivity = 0.31 and specificity = 0.78, where the expected cost of misclassification was 2.99.
acc <- c()
sen <- c()
spc <- c()
for (i in 1:50) {
  tsPred <- knn(trData, tsData, trLabels, k=i)
  acc <- c(acc, sum(tsLabels == tsPred) / length(tsLabels))
  sen <- c(sen, sum(tsLabels == tsPred & tsLabels == 0) / sum(tsLabels == 0))
  spc <- c(spc, sum(tsLabels == tsPred & tsLabels == 1) / sum(tsLabels == 1))
}
costdf <- data.frame(k=1:50,Accuracy=acc,Sensitivity=sen,Specificity=spc)
cost <- 4*(1-costdf$Sensitivity) + (1-costdf$Specificity)   # weights 4 and 1 from the cost matrix above
costdf <- cbind(costdf,"Cost"=cost)
library(knitr)   # provides kable()
kable(costdf[c(1:15,seq(20,50,by=5)),],row.names=FALSE)
| k | Accuracy | Sensitivity | Specificity | Cost |
|---|---|---|---|---|
| 1 | 0.5950 | 0.3083333 | 0.7178571 | 3.048810 |
| 2 | 0.6100 | 0.3666667 | 0.7142857 | 2.819048 |
| 3 | 0.6350 | 0.3083333 | 0.7750000 | 2.991667 |
| 4 | 0.6250 | 0.2916667 | 0.7678571 | 3.065476 |
| 5 | 0.6450 | 0.2583333 | 0.8107143 | 3.155952 |
| 6 | 0.6400 | 0.2250000 | 0.8178571 | 3.282143 |
| 7 | 0.6600 | 0.2166667 | 0.8500000 | 3.283333 |
| 8 | 0.6300 | 0.1333333 | 0.8428571 | 3.623810 |
| 9 | 0.6650 | 0.1916667 | 0.8678571 | 3.365476 |
| 10 | 0.6550 | 0.1583333 | 0.8678571 | 3.498810 |
| 11 | 0.6700 | 0.1666667 | 0.8857143 | 3.447619 |
| 12 | 0.6625 | 0.1250000 | 0.8928571 | 3.607143 |
| 13 | 0.6675 | 0.1416667 | 0.8928571 | 3.540476 |
| 14 | 0.6675 | 0.1333333 | 0.8964286 | 3.570238 |
| 15 | 0.6800 | 0.1333333 | 0.9142857 | 3.552381 |
| 20 | 0.6875 | 0.0750000 | 0.9500000 | 3.750000 |
| 25 | 0.6925 | 0.0666667 | 0.9607143 | 3.772619 |
| 30 | 0.6975 | 0.0666667 | 0.9678571 | 3.765476 |
| 35 | 0.7000 | 0.0666667 | 0.9714286 | 3.761905 |
| 40 | 0.7025 | 0.0583333 | 0.9785714 | 3.788095 |
| 45 | 0.7025 | 0.0583333 | 0.9785714 | 3.788095 |
| 50 | 0.7025 | 0.0583333 | 0.9785714 | 3.788095 |
As k increases,

- the accuracy increases till about k = 18, and thereafter it plateaus;
- the sensitivity reduces with increasing values of k;
- the specificity increases with increasing values of k;
- but because of the large cost of misclassifying a positive (i.e. the cost of a false negative), the overall cost increases from k = 3.
The cost matrix demands high sensitivity, i.e. it is imperative to correctly classify the positive, or bad-creditability, cases. Though the accuracy seems to be plateauing around k = 20, the huge cost associated with a false negative requires that we keep k low; 3 would be a good choice. One should not go strictly by the lowest cost value in the illustration.
Also, the data set has 300 positive, or bad-credit, cases against 700 negative, or good-credit, cases. This makes positives rarer. A smarter strategy would be to select an equal number of cases from each class (bad and good) for the training set, or to use a sampling scheme that favours bad cases over good; a sketch of the balanced option follows.
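As a minimal sketch of the balanced alternative (reusing the sg0 and sg1 index vectors from earlier; htrBal is a hypothetical name), one might undersample the good cases so the training set holds equal numbers of each class:

nPerClass <- floor(length(sg0) * 2/3)   # 200 bad-credit cases for training
sg0tr <- sample(sg0, nPerClass)         # bad-credit training rows
sg1tr <- sample(sg1, nPerClass)         # equally many good-credit rows
htrBal <- rbind(h[sg0tr,], h[sg1tr,])   # balanced 200 + 200 training set
table(htrBal$Creditability)             # should show 200 of each class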