Banks will always have to deal with business clients that default on loans. But how do you mitigate that risk by predicting beforehand which customers will default? I would suggest using machine learning for the prediction; in fact, I would suggest LDA.

Linear Discriminant Analysis (LDA) is a classification method originally developed in 1936 by R. A. Fisher. It’s simple, statistically robust, and often produces models whose accuracy is as good as that of more complex methods.

About the algorithm: LDA searches for the linear combination of variables (predictors) that best separates two classes (targets).
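To make the idea of a "best separating linear combination" concrete, here is a minimal sketch of Fisher's two-class direction in base R. The toy data and all variable names are my own assumptions for illustration; this is not the bank dataset, and it is not how MASS::lda is implemented internally.

```r
# Toy two-class data (hypothetical values, not the bank dataset)
set.seed(1)
x0 <- cbind(rnorm(50, mean = 0), rnorm(50, mean = 0))  # class "N"
x1 <- cbind(rnorm(50, mean = 2), rnorm(50, mean = 1))  # class "Y"

# Class means
mu0 <- colMeans(x0)
mu1 <- colMeans(x1)

# Pooled within-class scatter matrix
Sw <- crossprod(sweep(x0, 2, mu0)) + crossprod(sweep(x1, 2, mu1))

# Fisher's direction: w is proportional to Sw^{-1} (mu1 - mu0)
w <- solve(Sw, mu1 - mu0)

# Project both classes onto w and compare the projected class means
z0 <- x0 %*% w
z1 <- x1 %*% w
c(mean_N = mean(z0), mean_Y = mean(z1))
```

Each case is reduced to a single score (its projection on w), and the two classes end up with different mean scores. That single score is what the LD1 coefficients produce later in this walkthrough.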

In this project we will use a bank's dataset on its small business clients, some of whom defaulted and some of whom did not, described by variables such as days delinquent (DAYSDELQ) and number of months in business (BUSAGE). We will use LDA to find the linear model that best separates the two classes (default and non-default).

I will walk you through the 11 steps, with code, for predicting which clients will default on a loan based on historical bank data.

1. Data Preparation - You can download the data from Dr. Saed Sayad’s website: http://www.saedsayad.com/datasets/CreditData.zip

2. Read in the datasets using the read.csv function

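# stringsAsFactors = TRUE keeps BUSTYPE and DEFAULT as factors (R >= 4.0 reads text columns as character by default)
train <- read.csv("~/Credit_train.csv", stringsAsFactors = TRUE)
test <- read.csv("~/Credit_test.csv", stringsAsFactors = TRUE)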

The dataset comes in two files: Credit_train.csv and Credit_test.csv.

3. View the column names using the colnames function

colnames(train)
## [1] "BUSAGE"      "BUSTYPE"     "MAXLINEUTIL" "DAYSDELQ"    "TOTACBAL"   
## [6] "DEFAULT"
colnames(test)
## [1] "BUSAGE"      "BUSTYPE"     "MAXLINEUTIL" "DAYSDELQ"    "TOTACBAL"   
## [6] "DEFAULT"

These are the names of the columns/variables:

BUSAGE - Business age (age of the business measured in months)

BUSTYPE - Business type (Type of business classified from A-F)

MAXLINEUTIL - Maximum line-of-credit utilization by the customer

DAYSDELQ - Days delinquent (number of days the customer has been past due on payments)

TOTACBAL - Total Account Balance (Total account balance for the customer)

DEFAULT - Default (Y or N: whether the client defaulted on a loan)

4. View the top 6 rows of our data using the head function

head(train)
##   BUSAGE BUSTYPE MAXLINEUTIL DAYSDELQ TOTACBAL DEFAULT
## 1    183       B           0        0     0.24       N
## 2    271       E           0        0     1.37       N
## 3     51       A           0        0     1.52       N
## 4    208       A           0        0     1.64       N
## 5    148       A           0        0     1.78       N
## 6     82       D           0        0     1.88       N
head(test)
##   BUSAGE BUSTYPE MAXLINEUTIL DAYSDELQ TOTACBAL DEFAULT
## 1    354       A      3.0425        0 152125.6       N
## 2     99       A      0.0000        0 151060.9       N
## 3    100       A      2.4507        0 122538.6       N
## 4     85       C      1.1397        0 113975.4       N
## 5     82       A      1.1241        0 112415.7       N
## 6     62       A      0.0000        0 106760.2       Y

We now have a sense of what our data looks like from viewing the first 6 rows of each file.

5. Summarize the training data using the summary function

summary(train)
##      BUSAGE       BUSTYPE    MAXLINEUTIL        DAYSDELQ       
##  Min.   :   1.0   A:17696   Min.   : 0.000   Min.   :  0.0000  
##  1st Qu.:  41.0   B: 6152   1st Qu.: 0.000   1st Qu.:  0.0000  
##  Median :  75.0   C: 2020   Median : 0.269   Median :  0.0000  
##  Mean   : 111.3   D: 2276   Mean   : 0.387   Mean   :  0.4472  
##  3rd Qu.: 149.0   E:   69   3rd Qu.: 0.774   3rd Qu.:  0.0000  
##  Max.   :1393.0   F:  214   Max.   :14.095   Max.   :548.0000  
##  NA's   :192                NA's   :4143     NA's   :466       
##     TOTACBAL        DEFAULT  
##  Min.   :     0.2   N:26423  
##  1st Qu.:  5444.7   Y: 2004  
##  Median : 16354.6            
##  Mean   : 23211.7            
##  3rd Qu.: 34609.7            
##  Max.   :429915.7            
##  NA's   :9203

The training summary tells us that the median business age is 75 months (6 years and 3 months) and the maximum is 1,393 months (116 years and 1 month); BUSAGE has 192 missing values. Most of the bank’s business clients, 17,696 of them, are type A. The maximum line-of-credit utilization is 14.095. The longest delinquency is 548 days. The mean total account balance is 23,211.7 dollars and the maximum is 429,915.7 dollars. Finally, there are 26,423 non-defaulters and 2,004 defaulters.
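The month-to-year conversions are easy to get wrong by hand, so here is a small sketch for checking them. The helper name months_to_years is hypothetical, introduced only for this illustration.

```r
# Convert a duration in months to whole years and leftover months
months_to_years <- function(m) {
  c(years = m %/% 12, months = m %% 12)
}

months_to_years(75)    # median BUSAGE in the training summary: 6 years, 3 months
months_to_years(1393)  # maximum BUSAGE in the training summary: 116 years, 1 month
```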

6. Summarize the test data using the summary function

summary(test)
##      BUSAGE       BUSTYPE   MAXLINEUTIL         DAYSDELQ       
##  Min.   :   3.0   A:4413   Min.   : 0.0000   Min.   :  0.0000  
##  1st Qu.:  42.0   B:1569   1st Qu.: 0.0000   1st Qu.:  0.0000  
##  Median :  74.0   C: 496   Median : 0.2540   Median :  0.0000  
##  Mean   : 109.1   D: 532   Mean   : 0.3888   Mean   :  0.3589  
##  3rd Qu.: 145.0   E:  14   3rd Qu.: 0.7611   3rd Qu.:  0.0000  
##  Max.   :1381.0   F:  48   Max.   :26.5000   Max.   :364.0000  
##  NA's   :47                NA's   :1036      NA's   :125       
##     TOTACBAL         DEFAULT 
##  Min.   :     0.02   N:6583  
##  1st Qu.:  5119.59   Y: 489  
##  Median : 16270.55           
##  Mean   : 22673.69           
##  3rd Qu.: 33523.48           
##  Max.   :152125.57           
##  NA's   :2302

In the test data, the median business age is 74 months (6 years and 2 months) and the maximum is 1,381 months (115 years and 1 month). Most clients, 4,413 of them, are type A. The maximum line-of-credit utilization is 26.5. The longest delinquency is 364 days; DAYSDELQ has 125 missing values. The mean total account balance is 22,673.69 dollars and the maximum is 152,125.57 dollars. Finally, there are 6,583 non-defaulters and 489 defaulters.

7. Load in the packages for our analysis using the library function

library(caret)
library(randomForest)
library(AUC)
library(MASS)

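Here we load the packages we need for the analysis: caret (confusion matrix and lift chart), randomForest (na.roughfix for filling in missing values), AUC (ROC curve and area under it), and MASS (the lda function).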

8. Apply LDA algorithm on the training data

Next we apply the lda function to model DEFAULT against all other variables in the training data, omitting records with missing values.

model.LDA <- lda(DEFAULT~., data=train, na.action="na.omit")
model.LDA
## Call:
## lda(DEFAULT ~ ., data = train, na.action = "na.omit")
## 
## Prior probabilities of groups:
##         N         Y 
## 0.9139355 0.0860645 
## 
## Group means:
##      BUSAGE  BUSTYPEB   BUSTYPEC   BUSTYPED     BUSTYPEE    BUSTYPEF
## N 116.88465 0.2442296 0.06228055 0.08282968 0.0009785299 0.002820469
## Y  85.82885 0.2652812 0.04706601 0.07579462 0.0006112469 0.003667482
##   MAXLINEUTIL   DAYSDELQ TOTACBAL
## N   0.4498486 0.08311748 22791.31
## Y   0.7775695 3.76711491 27707.46
## 
## Coefficients of linear discriminants:
##                       LD1
## BUSAGE      -2.522905e-03
## BUSTYPEB    -2.382606e-02
## BUSTYPEC    -1.910992e-01
## BUSTYPED    -1.071924e-01
## BUSTYPEE    -2.338590e-01
## BUSTYPEF     3.398732e-01
## MAXLINEUTIL  2.059445e+00
## DAYSDELQ     6.988151e-02
## TOTACBAL    -8.153696e-06

9. Do the prediction on the test data

The next step is to score the test data with the LDA model, using na.roughfix (from randomForest) to fill in missing values first.

pc <- predict(model.LDA, na.roughfix(test))
summary(pc$class)
##    N    Y 
## 7036   36

The model predicts 7,036 non-defaulters and 36 defaulters, against 489 actual defaulters in the test data.
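The na.roughfix call deserves a word, since it quietly changes the test data before prediction: it replaces missing numeric values with the column median and missing factor values with the most frequent level. The sketch below imitates that behaviour in base R on a hypothetical toy data frame. The function rough_fix is my own illustration and is not the randomForest implementation.

```r
# Rough-fix style imputation in base R (an illustrative sketch,
# not the randomForest implementation)
rough_fix <- function(df) {
  for (col in names(df)) {
    x <- df[[col]]
    if (!any(is.na(x))) next
    if (is.numeric(x)) {
      # numeric columns: replace NA with the column median
      x[is.na(x)] <- median(x, na.rm = TRUE)
    } else if (is.factor(x)) {
      # factor columns: replace NA with the most frequent level
      x[is.na(x)] <- names(which.max(table(x)))
    }
    df[[col]] <- x
  }
  df
}

# Hypothetical toy data with the same kinds of columns as the bank data
toy <- data.frame(
  BUSAGE  = c(10, NA, 30, 50),
  BUSTYPE = factor(c("A", "A", NA, "B"))
)
fixed <- rough_fix(toy)
fixed
```

Imputing this way keeps every test row in the evaluation, whereas the training step dropped incomplete rows with na.omit. The two sets are therefore treated differently, which is worth keeping in mind when reading the results.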

xtab <- table(pc$class, test$DEFAULT)
caret::confusionMatrix(xtab, positive = "Y")
## Confusion Matrix and Statistics
## 
##    
##        N    Y
##   N 6573  463
##   Y   10   26
##                                          
##                Accuracy : 0.9331         
##                  95% CI : (0.927, 0.9388)
##     No Information Rate : 0.9309         
##     P-Value [Acc > NIR] : 0.2347         
##                                          
##                   Kappa : 0.0904         
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.053170       
##             Specificity : 0.998481       
##          Pos Pred Value : 0.722222       
##          Neg Pred Value : 0.934196       
##              Prevalence : 0.069146       
##          Detection Rate : 0.003676       
##    Detection Prevalence : 0.005090       
##       Balanced Accuracy : 0.525825       
##                                          
##        'Positive' Class : Y              
## 

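A confusion matrix shows the number of correct and incorrect predictions made by the classification model compared with the actual outcomes (target values) in the data. Here the model correctly identifies 6,573 non-defaulters and 26 defaulters. Overall accuracy is 93.3% (95% confidence interval 92.7% to 93.9%), but that is not significantly better than the 93.1% no-information rate obtained by always predicting "no default" (p = 0.23). Sensitivity is only 5.3%: the model catches 26 of the 489 actual defaulters and misses 463, so at the default cutoff it is of limited use for flagging risky clients.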
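The headline statistics can be reproduced by hand from the four cell counts, which is a useful check on how to read them. This is a sketch using the counts printed above; the variable names are my own.

```r
# Counts from the confusion matrix above (rows = predicted, columns = actual)
tn <- 6573  # predicted N, actual N
fn <- 463   # predicted N, actual Y
fp <- 10    # predicted Y, actual N
tp <- 26    # predicted Y, actual Y
n  <- tn + fn + fp + tp

accuracy    <- (tp + tn) / n              # share of all predictions that are correct
sensitivity <- tp / (tp + fn)             # share of actual defaulters that were caught
specificity <- tn / (tn + fp)             # share of actual non-defaulters predicted as N
ppv         <- tp / (tp + fp)             # share of predicted defaulters that defaulted
nir         <- max(tn + fp, tp + fn) / n  # accuracy of always guessing the majority class

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, ppv = ppv, nir = nir), 4)
```

Because only about 7% of clients default, accuracy is dominated by the non-defaulters. Sensitivity and the no-information rate say much more about whether the model helps.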

10. Plot the Lift chart

# Posterior class probabilities from the LDA prediction
pb <- as.data.frame(pc$posterior)
# Pair each actual outcome with its predicted probability of default
pred.LDA <- data.frame(target = test$DEFAULT, score = pb$Y)
# lift() is from caret; xyplot() is from lattice, which caret attaches
lift.LDA <- lift(target ~ score, data = pred.LDA, cuts = 10, class = "Y")
xyplot(lift.LDA, main = "LDA - Lift Chart", type = c("l", "g"), lwd = 2,
       scales = list(x = list(alternating = FALSE, tick.number = 10),
                     y = list(alternating = FALSE, tick.number = 10)))

Lift is a measure of the effectiveness of a classification model, calculated as the ratio between the results obtained with and without the model. In contrast to the confusion matrix, which evaluates a model on the whole population, a lift chart evaluates model performance in a portion of the population.
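As a rough illustration of that ratio, the sketch below computes cumulative lift by decile in base R: the default rate among the top-scored 10%, 20%, and so on, divided by the overall default rate. The scores and outcomes are simulated (hypothetical), and this is one common definition of lift rather than a description of what caret's lift function plots.

```r
# Hypothetical scores and outcomes (not the bank data)
set.seed(42)
n      <- 1000
score  <- runif(n)
target <- rbinom(n, size = 1, prob = score * 0.2)  # higher score, more defaults

# Rank cases from highest to lowest score
ord    <- order(score, decreasing = TRUE)
sorted <- target[ord]

# Default rate in the top 10%, 20%, ..., 100% divided by the overall rate
cuts     <- seq(0.1, 1, by = 0.1)
overall  <- mean(target)
cum_lift <- sapply(cuts, function(p) mean(sorted[seq_len(round(p * n))]) / overall)

data.frame(top_share = cuts, lift = round(cum_lift, 2))
```

A lift above 1 in the first rows means that contacting only the highest-scored clients finds defaulters at a higher rate than picking clients at random. By construction the lift is exactly 1 once the whole population is included.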

11. Plot the Receiver Operating Characteristic ROC chart

labels <- as.factor(ifelse(pred.LDA$target=="Y", 1, 0))
predictions <- pred.LDA$score
auc(roc(predictions, labels), min = 0, max = 1)
## [1] 0.755156
plot(roc(predictions, labels), min=0, max=1, type="l", main="LDA - ROC Chart")

The ROC chart is similar to the lift chart in that it provides a means of comparison between classification models. The area under the ROC curve (AUC) is often used as a measure of the quality of a classification model. A random classifier has an AUC of 0.5, while a perfect classifier has an AUC of 1. Here the AUC is 0.76.
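One way to build intuition for the AUC: it equals the probability that a randomly chosen defaulter gets a higher score than a randomly chosen non-defaulter. The sketch below checks that on a tiny hypothetical example in base R. It does not use the AUC package, and the labels and scores are made up.

```r
# Hypothetical labels (1 = default, 0 = no default) and scores
labels <- c(0, 0, 1, 0, 1, 1, 0, 1)
scores <- c(0.10, 0.35, 0.40, 0.45, 0.60, 0.70, 0.20, 0.30)

# Compare every positive with every negative; ties count one half
pos <- scores[labels == 1]
neg <- scores[labels == 0]
cmp <- outer(pos, neg, FUN = function(p, q) (p > q) + 0.5 * (p == q))

# Share of positive/negative pairs ranked correctly
auc_manual <- mean(cmp)
auc_manual  # 0.8125 (13 of 16 pairs ranked correctly)
```

Read this way, an AUC of 0.76 means the model ranks a defaulter above a non-defaulter about three times out of four, even though the default cutoff flags very few of them.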

This wraps up our 11 steps for applying predictive analytics to financial data. For more machine learning use cases, visit us at Cartwheel Technologies.