Banks will always have to deal with business clients that default on a loan. But how do you mitigate risk by predicting beforehand which customers will default? I would suggest using machine learning for the prediction. In fact, I would say use LDA.
Linear Discriminant Analysis (LDA) is a classification method originally developed in 1936 by R. A. Fisher. It’s simple, statistically robust and often produces models whose accuracy is as good as more complex methods.
About the algorithm: LDA searches for a linear combination of variables (predictors) that best separates two classes (targets).
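To make the idea of a "linear combination that best separates two classes" concrete, here is a minimal base R sketch of Fisher's two-class discriminant on synthetic data. The data, the variable names and the midpoint threshold (which assumes equal priors) are all illustrative assumptions, not part of the bank analysis.

```r
# Synthetic two-class data (hypothetical, not the bank dataset)
set.seed(42)
n  <- 200
x0 <- cbind(rnorm(n, mean = 0), rnorm(n, mean = 0))  # class 0
x1 <- cbind(rnorm(n, mean = 2), rnorm(n, mean = 1))  # class 1

# Class means
mu0 <- colMeans(x0)
mu1 <- colMeans(x1)

# Pooled within-class scatter matrix
Sw <- crossprod(sweep(x0, 2, mu0)) + crossprod(sweep(x1, 2, mu1))

# Fisher's direction: w is proportional to Sw^{-1} (mu1 - mu0)
w <- solve(Sw, mu1 - mu0)

# Project every observation onto w (one score per observation)
z0 <- as.vector(x0 %*% w)
z1 <- as.vector(x1 %*% w)

# Classify with the midpoint between the projected class means
# (assumption: equal priors; lda() in MASS adjusts for priors)
threshold <- (mean(z0) + mean(z1)) / 2
fisher_accuracy <- mean(c(z0 < threshold, z1 >= threshold))
fisher_accuracy
```

The vector w plays the same role as the LD1 coefficients that lda reports, up to scaling.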
In this project we will use a bank's dataset on its small business clients, both those who defaulted and those who did not, described by variables such as days delinquent (DAYSDELQ) and number of months in business (BUSAGE). We will use LDA to find the linear model that best separates the two classes (default and non-default).
I will walk you through 11 steps of R code that use historical bank data to predict which clients will default on a loan.
train <- read.csv("~/Credit_train.csv")
test <- read.csv("~/Credit_test.csv")
The dataset comes in two files: Credit_train.csv and Credit_test.csv.
colnames(train)
## [1] "BUSAGE" "BUSTYPE" "MAXLINEUTIL" "DAYSDELQ" "TOTACBAL"
## [6] "DEFAULT"
colnames(test)
## [1] "BUSAGE" "BUSTYPE" "MAXLINEUTIL" "DAYSDELQ" "TOTACBAL"
## [6] "DEFAULT"
These are the names of the columns (variables):
BUSAGE - Business age (age of the business measured in months)
BUSTYPE - Business type (Type of business classified from A-F)
MAXLINEUTIL - Maximum line of credit utilization by the customer
DAYSDELQ - Days delinquent (number of days the customer has been past due on a payment)
TOTACBAL - Total Account Balance (Total account balance for the customer)
DEFAULT - Default (Y or N: whether the client defaulted on a loan)
head(train)
## BUSAGE BUSTYPE MAXLINEUTIL DAYSDELQ TOTACBAL DEFAULT
## 1 183 B 0 0 0.24 N
## 2 271 E 0 0 1.37 N
## 3 51 A 0 0 1.52 N
## 4 208 A 0 0 1.64 N
## 5 148 A 0 0 1.78 N
## 6 82 D 0 0 1.88 N
head(test)
## BUSAGE BUSTYPE MAXLINEUTIL DAYSDELQ TOTACBAL DEFAULT
## 1 354 A 3.0425 0 152125.6 N
## 2 99 A 0.0000 0 151060.9 N
## 3 100 A 2.4507 0 122538.6 N
## 4 85 C 1.1397 0 113975.4 N
## 5 82 A 1.1241 0 112415.7 N
## 6 62 A 0.0000 0 106760.2 Y
We now have a sense of what our data looks like from viewing the first six rows of each file.
summary(train)
## BUSAGE BUSTYPE MAXLINEUTIL DAYSDELQ
## Min. : 1.0 A:17696 Min. : 0.000 Min. : 0.0000
## 1st Qu.: 41.0 B: 6152 1st Qu.: 0.000 1st Qu.: 0.0000
## Median : 75.0 C: 2020 Median : 0.269 Median : 0.0000
## Mean : 111.3 D: 2276 Mean : 0.387 Mean : 0.4472
## 3rd Qu.: 149.0 E: 69 3rd Qu.: 0.774 3rd Qu.: 0.0000
## Max. :1393.0 F: 214 Max. :14.095 Max. :548.0000
## NA's :192 NA's :4143 NA's :466
## TOTACBAL DEFAULT
## Min. : 0.2 N:26423
## 1st Qu.: 5444.7 Y: 2004
## Median : 16354.6
## Mean : 23211.7
## 3rd Qu.: 34609.7
## Max. :429915.7
## NA's :9203
The training data shows a median business age of 75 months (6 years, 3 months) and a maximum of 1,393 months (116 years, 1 month), with 192 missing values for BUSAGE. The majority of the bank's business clients, 17,696, fall under type A. The maximum line of credit utilization is 14.095. The maximum number of days a customer was delinquent is 548. The mean total account balance is 23,211.70 dollars and the maximum is 429,915.70 dollars. Finally, there were 26,423 non-defaulters and 2,004 defaulters.
summary(test)
## BUSAGE BUSTYPE MAXLINEUTIL DAYSDELQ
## Min. : 3.0 A:4413 Min. : 0.0000 Min. : 0.0000
## 1st Qu.: 42.0 B:1569 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median : 74.0 C: 496 Median : 0.2540 Median : 0.0000
## Mean : 109.1 D: 532 Mean : 0.3888 Mean : 0.3589
## 3rd Qu.: 145.0 E: 14 3rd Qu.: 0.7611 3rd Qu.: 0.0000
## Max. :1381.0 F: 48 Max. :26.5000 Max. :364.0000
## NA's :47 NA's :1036 NA's :125
## TOTACBAL DEFAULT
## Min. : 0.02 N:6583
## 1st Qu.: 5119.59 Y: 489
## Median : 16270.55
## Mean : 22673.69
## 3rd Qu.: 33523.48
## Max. :152125.57
## NA's :2302
In summary, the test data shows a median business age of 74 months (6 years, 2 months) and a maximum of 1,381 months (115 years, 1 month). The majority of the bank's business clients, 4,413, fall under type A. The maximum line of credit utilization is 26.5. The maximum number of days a customer was delinquent is 364. The mean total account balance is 22,673.69 dollars and the maximum is 152,125.57 dollars. Finally, there were 6,583 non-defaulters and 489 defaulters.
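The month-to-year conversions of the BUSAGE figures are easy to get wrong by hand, so here is a small helper for checking them. The function name months_to_years is hypothetical; the four inputs are the median and maximum BUSAGE values from the two summary() outputs.

```r
# Convert a count of months into "<years>y <months>m" using integer
# division (%/%) and remainder (%%)
months_to_years <- function(months) {
  months <- as.integer(months)
  sprintf("%dy %dm", months %/% 12L, months %% 12L)
}

# Median and maximum BUSAGE from the train and test summaries
busage <- c(train_median = 75, train_max = 1393,
            test_median  = 74, test_max  = 1381)
converted <- months_to_years(busage)
names(converted) <- names(busage)
converted
```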
library(caret)
library(randomForest)
library(AUC)
library(MASS)
Here we load the packages we need for the analysis: caret, randomForest, AUC, and MASS.
Next we apply the lda function to model the DEFAULT variable against all other variables in the training data, omitting records with missing values.
model.LDA <- lda(DEFAULT~., data=train, na.action="na.omit")
model.LDA
## Call:
## lda(DEFAULT ~ ., data = train, na.action = "na.omit")
##
## Prior probabilities of groups:
## N Y
## 0.9139355 0.0860645
##
## Group means:
## BUSAGE BUSTYPEB BUSTYPEC BUSTYPED BUSTYPEE BUSTYPEF
## N 116.88465 0.2442296 0.06228055 0.08282968 0.0009785299 0.002820469
## Y 85.82885 0.2652812 0.04706601 0.07579462 0.0006112469 0.003667482
## MAXLINEUTIL DAYSDELQ TOTACBAL
## N 0.4498486 0.08311748 22791.31
## Y 0.7775695 3.76711491 27707.46
##
## Coefficients of linear discriminants:
## LD1
## BUSAGE -2.522905e-03
## BUSTYPEB -2.382606e-02
## BUSTYPEC -1.910992e-01
## BUSTYPED -1.071924e-01
## BUSTYPEE -2.338590e-01
## BUSTYPEF 3.398732e-01
## MAXLINEUTIL 2.059445e+00
## DAYSDELQ 6.988151e-02
## TOTACBAL -8.153696e-06
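The next step is to predict on the test data using the LDA model. Missing values in the test set are filled in with na.roughfix from the randomForest package, which uses the column median for numeric variables and the most frequent level for factors.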
pc <- predict(model.LDA, na.roughfix(test))
summary(pc$class)
## N Y
## 7036 36
The model predicts 7,036 non-defaulters and 36 defaulters in the test data.
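Because the test set contains missing values, it was passed through na.roughfix before prediction. As a rough sketch of what that kind of imputation does, here is a base R version on a toy data frame. The function name impute_roughly and the toy data are hypothetical, and this is only an approximation of the idea, not the package's own code.

```r
# Fill NAs with the column median (numeric) or the most frequent
# level (factor). Illustrative stand-in for median/mode imputation.
impute_roughly <- function(df) {
  for (col in names(df)) {
    missing <- is.na(df[[col]])
    if (!any(missing)) next
    if (is.numeric(df[[col]])) {
      df[[col]][missing] <- median(df[[col]], na.rm = TRUE)
    } else if (is.factor(df[[col]])) {
      most_common <- names(which.max(table(df[[col]])))
      df[[col]][missing] <- most_common
    }
  }
  df
}

# Toy data (hypothetical values, same column names as the bank data)
toy <- data.frame(
  BUSAGE  = c(10, NA, 30, 50),
  BUSTYPE = factor(c("A", "A", NA, "B"))
)
filled <- impute_roughly(toy)
filled
```

Note that the model itself was fitted with missing records dropped, while the test records are imputed, so the two sets are not treated identically.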
xtab <- table(pc$class, test$DEFAULT)
caret::confusionMatrix(xtab, positive = "Y")
## Confusion Matrix and Statistics
##
##
## N Y
## N 6573 463
## Y 10 26
##
## Accuracy : 0.9331
## 95% CI : (0.927, 0.9388)
## No Information Rate : 0.9309
## P-Value [Acc > NIR] : 0.2347
##
## Kappa : 0.0904
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.053170
## Specificity : 0.998481
## Pos Pred Value : 0.722222
## Neg Pred Value : 0.934196
## Prevalence : 0.069146
## Detection Rate : 0.003676
## Detection Prevalence : 0.005090
## Balanced Accuracy : 0.525825
##
## 'Positive' Class : Y
##
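A confusion matrix shows the number of correct and incorrect predictions made by the classification model compared to the actual outcomes (target values) in the data. Here our model correctly identifies 6,573 non-defaulters and 26 defaulters. Overall accuracy is 93.3% (95% confidence interval 92.7% to 93.9%), but that is barely above the no-information rate of 93.1% that you would get by predicting "N" for everyone. Sensitivity is only 5.3%: the model catches 26 of the 489 actual defaulters and misses 463. When it does flag a client as a defaulter, it is right 72% of the time (26 of 36).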
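The headline statistics can be reproduced by hand from the four cells of the confusion matrix. The sketch below uses the counts printed above; the variable names are my own.

```r
# Cells of the confusion matrix above (rows = predicted, columns = actual)
tn <- 6573  # predicted N, actual N
fn <- 463   # predicted N, actual Y
fp <- 10    # predicted Y, actual N
tp <- 26    # predicted Y, actual Y
total <- tn + fn + fp + tp

accuracy     <- (tp + tn) / total
sensitivity  <- tp / (tp + fn)            # share of actual defaulters caught
specificity  <- tn / (tn + fp)            # share of non-defaulters cleared
ppv          <- tp / (tp + fp)            # Pos Pred Value
no_info_rate <- max(tn + fp, tp + fn) / total  # always predict majority class
balanced_acc <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, ppv = ppv,
        no_info_rate = no_info_rate, balanced_acc = balanced_acc), 4)
```

Comparing accuracy with no_info_rate is a quick sanity check on imbalanced data like this, where about 93% of clients do not default.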
# Posterior class probabilities from the LDA prediction
pb <- as.data.frame(pc$posterior)
# Pair the actual outcome with the predicted probability of default ("Y")
pred.LDA <- data.frame(target = test$DEFAULT, score = pb$Y)
# Build and plot the lift chart for the "Y" class
lift.LDA <- lift(target ~ score, data = pred.LDA, cuts = 10, class = "Y")
xyplot(lift.LDA, main = "LDA - Lift Chart", type = c("l", "g"), lwd = 2,
       scales = list(x = list(alternating = FALSE, tick.number = 10),
                     y = list(alternating = FALSE, tick.number = 10)))
Lift measures the effectiveness of a classification model as the ratio between the results obtained with and without the model. In contrast to the confusion matrix, which evaluates the model on the whole population, a lift chart evaluates model performance on a portion of the population.
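To show what "with and without the model" means numerically, here is a small sketch that computes lift for the top-scoring fraction of a toy sample. The ten scores and outcomes are made up for illustration, and lift_at is a hypothetical helper, not a caret function.

```r
# Toy scores (predicted probability of default) and actual outcomes
score  <- c(0.90, 0.80, 0.75, 0.60, 0.55, 0.40, 0.35, 0.30, 0.20, 0.10)
target <- c("Y",  "Y",  "N",  "Y",  "N",  "N",  "N",  "N",  "N",  "N")

# Lift in the top `frac` of clients ranked by score:
# (default rate among those clients) / (default rate overall)
lift_at <- function(frac, score, target, positive = "Y") {
  n   <- length(score)
  k   <- max(1, round(frac * n))
  top <- order(score, decreasing = TRUE)[seq_len(k)]
  mean(target[top] == positive) / mean(target == positive)
}

lift_top20 <- lift_at(0.2, score, target)  # both top-2 clients defaulted
lift_top40 <- lift_at(0.4, score, target)  # 3 of the top 4 defaulted
lift_all   <- lift_at(1.0, score, target)  # whole population: lift is 1
c(lift_top20, lift_top40, lift_all)
```

In this toy sample the overall default rate is 30%, so targeting the top 20% by score finds defaulters at 10/3 times the base rate.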
labels <- as.factor(ifelse(pred.LDA$target=="Y", 1, 0))
predictions <- pred.LDA$score
auc(roc(predictions, labels), min = 0, max = 1)
## [1] 0.755156
plot(roc(predictions, labels), min=0, max=1, type="l", main="LDA - ROC Chart")
The ROC chart is similar to the lift chart in that it provides a means of comparison between classification models. The area under the ROC curve (AUC) is often used as a measure of the quality of a classification model. A random classifier has an AUC of 0.5, while a perfect classifier has an AUC of 1. Here the AUC is 0.76.
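One way to build intuition for AUC is that it equals the probability that a randomly chosen defaulter receives a higher score than a randomly chosen non-defaulter. The sketch below computes it from ranks in base R on made-up data; auc_rank is a hypothetical helper, not the function from the AUC package.

```r
# Toy scores and outcomes (illustrative only)
score       <- c(0.90, 0.80, 0.75, 0.60, 0.55, 0.40, 0.35, 0.30, 0.20, 0.10)
is_positive <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)

# AUC via the rank-sum (Mann-Whitney) identity:
# (sum of positive ranks - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
auc_rank <- function(score, is_positive) {
  r     <- rank(score)  # average ranks handle tied scores
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

toy_auc <- auc_rank(score, is_positive)
toy_auc
```

Here 20 of the 21 defaulter/non-defaulter pairs are ordered correctly, so the toy AUC is 20/21.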
This wraps up our 11 steps of code for applying predictive analytics to financial data. For more use cases on machine learning, visit us at Cartwheel Technologies.