Simple Bank-Loan Model Using K-Nearest Neighbors

                              "Before You Approve That Loan"

Applying the historical customer information that banks accumulate over time to predict whether a loan applicant will default is the key to maintaining a “clean” book. The same tools can also help maintain favorable PAR (portfolio at risk) levels. Relying on managers’ instincts, experience and guesswork is now considered dated; statistical and computational algorithms have taken charge. However, mining the information in this data requires statistical knowledge, both theoretical and applied. In this paper, K-Nearest Neighbors will be used to estimate the likelihood that a customer will default. We will use historical bank data to construct a predictive loan model.

The K-Nearest Neighbors algorithm is a non-parametric approach that classifies new cases based on a similarity measure (a distance function). A case is classified by a majority vote of its k nearest neighbors: it is assigned to the class most common among them, with nearness measured by the distance function. In particular, for k=1 the case is assigned to the class of its single nearest neighbor. You can download the historical bank data Here
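To make the voting rule concrete, here is a minimal base R sketch of it. The toy points, the labels, and the helper name knn_vote are made up for illustration and are not part of the bank data or of the kknn package, which additionally weights neighbors by distance.

```r
# Toy training data (hypothetical values, not taken from the bank dataset)
train_x <- matrix(c(1, 1,
                    1, 2,
                    2, 1,
                    6, 6,
                    7, 6,
                    6, 7), ncol = 2, byrow = TRUE)
train_y <- c("N", "N", "N", "Y", "Y", "Y")

# Classify one new case by majority vote among its k nearest neighbors
# (knn_vote is an illustrative helper, not a library function)
knn_vote <- function(new_x, train_x, train_y, k = 3) {
  # Euclidean distance from the new case to every training case
  d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Labels of the k closest training cases
  nearest <- train_y[order(d)[seq_len(k)]]
  # The most common label wins
  names(which.max(table(nearest)))
}

knn_vote(c(2, 2), train_x, train_y, k = 3)  # near the "N" cluster
knn_vote(c(6, 5), train_x, train_y, k = 1)  # k = 1: class of the single nearest neighbor
```

With k=3 the first case is surrounded by "N" points, so the vote is unanimous; with k=1 the second case simply takes the label of its closest point.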

Enough said. Let’s apply K-Nearest Neighbors on our bank data and see what the computer tells us.

The first step is to read in the datasets

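# stringsAsFactors = TRUE keeps BUSTYPE and DEFAULT as factors (no longer the default in R >= 4.0)
train <- read.csv("~/Credit_train.csv", stringsAsFactors = TRUE)
test <- read.csv("~/Credit_test.csv", stringsAsFactors = TRUE)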

Both files are loaded with the read.csv function

Column names

colnames(train)

## [1] "BUSAGE"      "BUSTYPE"     "MAXLINEUTIL" "DAYSDELQ"    "TOTACBAL"   
## [6] "DEFAULT"

colnames(test)

## [1] "BUSAGE"      "BUSTYPE"     "MAXLINEUTIL" "DAYSDELQ"    "TOTACBAL"   
## [6] "DEFAULT"

These are the names of the variables/features/columns in the training and test data respectively.

Here is what each variable means:

BUSAGE - Business age (age of the business measured in months)

BUSTYPE - Business type (Type of business classified from A-F)

MAXLINEUTIL - Maximum line utilization (the maximum line of credit utilized by the customer)

DAYSDELQ - Days delinquent (the number of days the customer has been delinquent)

TOTACBAL - Total Account Balance (Total account balance for the customer)

DEFAULT - Default (Y or N: whether the client defaulted on a loan)

Show us the data!

head(test)

##   BUSAGE BUSTYPE MAXLINEUTIL DAYSDELQ TOTACBAL DEFAULT
## 1    354       A      3.0425        0 152125.6       N
## 2     99       A      0.0000        0 151060.9       N
## 3    100       A      2.4507        0 122538.6       N
## 4     85       C      1.1397        0 113975.4       N
## 5     82       A      1.1241        0 112415.7       N
## 6     62       A      0.0000        0 106760.2       Y

This is what the first six rows of the test data look like

Load in the required libraries

library(caret)         # confusionMatrix() and lift()
library(kknn)          # kknn()
library(randomForest)  # na.roughfix()
library(AUC)           # roc() and auc()

It’s time to train our model using KNN

options(max.print = 100)  # global print limit; not a kknn() argument
model.KNN <- kknn(DEFAULT~., na.roughfix(train), na.roughfix(test), k=5, distance = 2, scale=FALSE)
summary(model.KNN)

## 
## Call:
## kknn(formula = DEFAULT ~ ., train = na.roughfix(train), test = na.roughfix(test),     k = 5, distance = 2, scale = FALSE)
## 
## Response: "nominal"
##      fit     prob.N     prob.Y
## 1      N 1.00000000 0.00000000
## 2      N 0.97077979 0.02922021
## 3      N 0.73607536 0.26392464
## 4      Y 0.46908332 0.53091668
## 5      N 0.73049206 0.26950794
## 6      N 1.00000000 0.00000000
## 7      N 0.70127186 0.29872814
## 8      Y 0.26392464 0.73607536
## 9      N 0.87721034 0.12278966
## 10     N 0.56265278 0.43734722
## 11     N 0.90643054 0.09356946
## 12     N 0.90643054 0.09356946
## 13     N 0.82964482 0.17035518
## 14     Y 0.17035518 0.82964482
## 15     N 0.63692261 0.36307739
## 16     N 0.63692261 0.36307739
## 17     N 0.56265278 0.43734722
## 18     N 1.00000000 0.00000000
## 19     N 1.00000000 0.00000000
## 20     N 1.00000000 0.00000000
## 21     N 1.00000000 0.00000000
## 22     N 1.00000000 0.00000000
## 23     N 1.00000000 0.00000000
## 24     N 0.97077979 0.02922021
## 25     Y 0.46656743 0.53343257
## 26     N 0.56265278 0.43734722
## 27     N 0.97077979 0.02922021
## 28     N 0.97077979 0.02922021
## 29     N 0.73049206 0.26950794
## 30     N 1.00000000 0.00000000
## 31     N 0.87721034 0.12278966
## 32     Y 0.26392464 0.73607536
## 33     Y 0.43986312 0.56013688
##  [ reached getOption("max.print") -- omitted 7039 rows ]

K-Nearest Neighbors

K-Nearest Neighbors estimates, for each data point, the probability of defaulting or not defaulting on a loan. If the probability of defaulting is above 0.5, the customer is classified as a likely defaulter (Y); otherwise the customer is classified as a non-defaulter (N). It should be noted that the data used here is anonymized, meaning the names of the customers have been removed for privacy reasons. For customer 1 the probability of not defaulting is 100%. For customer 2 it is 97%, and for customer 3 it is 74%. For customer 4 the probability of defaulting (Y) is 53%, which means that this customer, or people like that customer, are likely to default.
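The classification rule can be sketched in a few lines of base R. The probabilities below are copied from the first four rows of the summary output; the variable names prob_Y and fit are ours, and the sketch assumes the fitted class is simply the more probable of the two.

```r
# prob.Y for the first four test customers, copied from the summary output
prob_Y <- c(0.00000000, 0.02922021, 0.26392464, 0.53091668)

# With two classes, picking the more probable class is the same as
# labelling a customer "Y" when prob.Y exceeds 0.5
fit <- ifelse(prob_Y > 0.5, "Y", "N")

data.frame(customer = 1:4, prob.N = 1 - prob_Y, prob.Y = prob_Y, fit = fit)
```

Only customer 4 crosses the 0.5 mark, which matches the fit column in the output.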

Confusion matrix

# Predicted class for every test row, cross-tabulated against the actual outcome
pc <- predict(model.KNN, na.roughfix(test), type="raw")
xtab <- table(pc, test$DEFAULT)
caret::confusionMatrix(xtab, positive="Y")

## Confusion Matrix and Statistics
## 
##    
## pc     N    Y
##   N 6454  469
##   Y  129   20
##                                           
##                Accuracy : 0.9154          
##                  95% CI : (0.9087, 0.9218)
##     No Information Rate : 0.9309          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0314          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.040900        
##             Specificity : 0.980404        
##          Pos Pred Value : 0.134228        
##          Neg Pred Value : 0.932255        
##              Prevalence : 0.069146        
##          Detection Rate : 0.002828        
##    Detection Prevalence : 0.021069        
##       Balanced Accuracy : 0.510652        
##                                           
##        'Positive' Class : Y               
##

Confusion Matrix

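An additional measure of predictive power is the so-called Confusion Matrix. It has the form of the table above. Here our model correctly predicted 6,454 cases as non-default and 20 cases as default. The model's accuracy is 91.5% (95% confidence interval: 90.9% to 92.2%) and the specificity is 98%. These figures need care, however. Because 93.1% of the test customers did not default, always predicting "N" would score higher than our model (this is the no-information rate), and the sensitivity is only 4.1%: the model catches 20 of the 489 actual defaulters. In other words, the model recognizes non-defaulters well but must improve at flagging defaulters before banks and financial institutions can rely on it.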
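The headline statistics can be recomputed by hand from the four cell counts, which is a useful check on how they are defined. This is a small base R sketch; the variable names are ours, and the counts are copied from the table above.

```r
# Counts copied from the confusion matrix (rows = predicted, columns = actual)
tn <- 6454  # predicted N, actually N
fn <- 469   # predicted N, actually Y
fp <- 129   # predicted Y, actually N
tp <- 20    # predicted Y, actually Y
n  <- tn + fn + fp + tp

accuracy    <- (tp + tn) / n              # share of all cases predicted correctly
sensitivity <- tp / (tp + fn)             # share of actual defaulters caught
specificity <- tn / (tn + fp)             # share of actual non-defaulters recognized
nir         <- max(tn + fp, tp + fn) / n  # accuracy of always guessing the majority class

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, no_information_rate = nir), 4)
```

The results agree with the Accuracy, Sensitivity, Specificity and No Information Rate lines reported by confusionMatrix.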

Lift chart

# Predicted class probabilities for every test row
pb <- predict(model.KNN, na.roughfix(test), type="prob")
pb <- as.data.frame(pb)
pred.KNN <- data.frame(test$DEFAULT, pb$Y)
colnames(pred.KNN) <- c("target","score")
lift.KNN <- lift(target ~ score, data = pred.KNN, cuts=10, class="Y")
xyplot(lift.KNN, main="KNN - Lift Chart", type=c("l","g"), lwd=2
       , scales=list(x=list(alternating=FALSE,tick.number = 10)
                     ,y=list(alternating=FALSE,tick.number = 10)))

Lift chart

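The lift chart shows how quickly the model finds the target outcomes: customers are ranked from highest to lowest predicted score, and the chart plots the share of actual defaulters found against the share of customers examined. We read our lift chart as showing that the model needs to evaluate only 25% of the data to find all the target outcomes. That reading deserves a second look, though, since a model that ranked every defaulter in the top 25% of scores would have an AUC above 0.8, well above the 0.56 computed in the ROC section.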
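The quantities behind such a chart can be computed directly. The sketch below uses ten made-up scores and outcomes (not the bank data) and our own variable names; it illustrates the idea and does not reproduce caret's lift function.

```r
# Toy scores and outcomes (hypothetical, not from the bank dataset)
score  <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0)
target <- c("Y", "Y", "N", "Y", "N", "N", "N", "Y", "N", "N")

# Rank customers from highest to lowest predicted score
ranked <- target[order(score, decreasing = TRUE)]

# Share of customers examined vs. share of all defaulters found so far
pct_tested <- seq_along(ranked) / length(ranked)
pct_found  <- cumsum(ranked == "Y") / sum(ranked == "Y")

data.frame(pct_tested, pct_found)
```

In this toy example, three of the four defaulters are found after examining 40% of the customers, but the last one only turns up at 80%.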

ROC Chart

labels <- as.factor(ifelse(pred.KNN$target=="Y", 1, 0))
predictions <- pred.KNN$score
auc(roc(predictions, labels), min = 0, max = 1)

## [1] 0.5561509

plot(roc(predictions, labels), min=0, max=1, type="l", main="KNN - ROC Chart")

ROC chart

The ROC chart is similar to the lift chart in that both provide a means of comparison between classification models. The area under the ROC curve (AUC) is often used as a measure of the quality of a classification model. A random classifier has an AUC of 0.5, while the AUC of a perfect classifier is equal to 1. Here the AUC is 0.56, only slightly better than a random classifier.
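One way to build intuition for the AUC is that it equals the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter. The base R sketch below computes it from ranks on made-up data; auc_by_rank is an illustrative helper of ours, not a function from the AUC package.

```r
# Toy scores and outcomes (hypothetical, not from the bank dataset)
score  <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0)
target <- c("Y", "Y", "N", "Y", "N", "N", "N", "Y", "N", "N")

# AUC as the probability that a randomly chosen defaulter scores higher
# than a randomly chosen non-defaulter (ties count as half)
auc_by_rank <- function(score, is_pos) {
  r     <- rank(score)  # average ranks handle tied scores
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc_by_rank(score, target == "Y")
```

Of the 24 defaulter/non-defaulter pairs in the toy data, the defaulter has the higher score in 19, so the function returns 19/24.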

Summary

It is simply inefficient and costly for banks to keep paying executives high salaries just to assess whether a customer is creditworthy, especially if the bank has a large pool of people that it wants to do business with in the future. By applying the model, banks can save costs and increase customer satisfaction. Of course, more work has to be done first: accuracy needs to rise from the current 92% toward 99%, and the model must identify far more than the 4% of actual defaulters it catches today.

For more use cases on Machine learning, visit us at Cartwheel Technologies