For complete html functions, please visit www.rpubs.com/jasonchanhku/loans

Libraries Used

library(fBasics) #summary statistics
library(plotly) #data visulization
library(dplyr) #count function
library(C50) #C50 Decision Tree algorithm
library(gmodels) #CrossTable()
library(caret) #Confusion Matrix
library(rmarkdown)


Objective

The global financial crisis of 2007-2008 highlighted the importance of transparency and rigor in banking practices. As the availability of credit was limited, banks tightened their lending systems and turned to machine learning to more accurately identify risky loans.

Therefore, objective of this project is to develop a simple credit approval model using the C5.0 decision tree algorithm.

The target variable of interest is default status.


Step 1: Data Exploration

The data is obtained from donated to the UCI Machine Learning Data Repository (http://archive.ics.uci.edu/ml) by Hans Hofmann of the University of Hamburg. The dataset contains information on loans obtained from a credit agency in Germany.

The credit dataset includes 1,000 examples on loans, plus a set of numeric and nominal features indicating the characteristics of the loan and the loan applicant. A class variable indicates whether the loan went into default.

credit <- read.csv(file = "Machine-Learning-with-R-datasets-master/credit.csv")
credit$default[credit$default == 1] <- "no"
credit$default[credit$default == 2] <- "yes"
credit$default <- as.factor(credit$default)

Data Preview

Below is a partial table preview of the dataset:

knitr::kable(head(credit), caption = "Credit Information Dataset")
Credit Information Dataset
checking_balance months_loan_duration credit_history purpose amount savings_balance employment_length installment_rate personal_status other_debtors residence_history property age installment_plan housing existing_credits job dependents telephone foreign_worker default
< 0 DM 6 critical radio/tv 1169 unknown > 7 yrs 4 single male none 4 real estate 67 none own 2 skilled employee 1 yes yes no
1 - 200 DM 48 repaid radio/tv 5951 < 100 DM 1 - 4 yrs 2 female none 2 real estate 22 none own 1 skilled employee 1 none yes yes
unknown 12 critical education 2096 < 100 DM 4 - 7 yrs 2 single male none 3 real estate 49 none own 1 unskilled resident 2 none yes no
< 0 DM 42 repaid furniture 7882 < 100 DM 4 - 7 yrs 2 single male guarantor 4 building society savings 45 none for free 1 skilled employee 2 none yes no
< 0 DM 24 delayed car (new) 4870 < 100 DM 1 - 4 yrs 3 single male none 4 unknown/none 53 none for free 2 skilled employee 2 none yes yes
unknown 36 repaid education 9055 unknown 1 - 4 yrs 2 single male none 4 unknown/none 35 none for free 1 unskilled resident 2 yes yes no

The general structure of the dataset:

str(credit)
## 'data.frame':    1000 obs. of  21 variables:
##  $ checking_balance    : Factor w/ 4 levels "1 - 200 DM","< 0 DM",..: 2 1 4 2 2 4 4 1 4 1 ...
##  $ months_loan_duration: int  6 48 12 42 24 36 24 36 12 30 ...
##  $ credit_history      : Factor w/ 5 levels "critical","delayed",..: 1 5 1 5 2 5 5 5 5 1 ...
##  $ purpose             : Factor w/ 10 levels "business","car (new)",..: 8 8 5 6 2 5 6 3 8 2 ...
##  $ amount              : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ savings_balance     : Factor w/ 5 levels "101 - 500 DM",..: 5 3 3 3 3 5 2 3 4 3 ...
##  $ employment_length   : Factor w/ 5 levels "0 - 1 yrs","1 - 4 yrs",..: 4 2 3 3 2 2 4 2 3 5 ...
##  $ installment_rate    : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ personal_status     : Factor w/ 4 levels "divorced male",..: 4 2 4 4 4 4 4 4 1 3 ...
##  $ other_debtors       : Factor w/ 3 levels "co-applicant",..: 3 3 3 2 3 3 3 3 3 3 ...
##  $ residence_history   : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ property            : Factor w/ 4 levels "building society savings",..: 3 3 3 1 4 4 1 2 3 2 ...
##  $ age                 : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ installment_plan    : Factor w/ 3 levels "bank","none",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ housing             : Factor w/ 3 levels "for free","own",..: 2 2 2 1 1 1 2 3 2 2 ...
##  $ existing_credits    : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ job                 : Factor w/ 4 levels "mangement self-employed",..: 2 2 4 2 2 4 2 1 4 1 ...
##  $ dependents          : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ telephone           : Factor w/ 2 levels "none","yes": 2 1 1 1 1 2 1 2 1 1 ...
##  $ foreign_worker      : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ default             : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 1 1 1 2 ...

Useful Tables

Below are some useful tables that offers summary statistics from the loan features such as checking, savings, duration, amount, and default.

Checking Balance

knitr::kable(data.frame(table(credit$checking_balance)), Caption = "Checking Balance", col.names = c("Checking Balance", "Frequency") )
Checking Balance Frequency
1 - 200 DM 269
< 0 DM 274
> 200 DM 63
unknown 394

Savings Balance

knitr::kable(data.frame(table(credit$savings_balance)), Caption = "Savings Balance", col.names = c("Savings Balance", "Frequency"))
Savings Balance Frequency
101 - 500 DM 103
501 - 1000 DM 63
< 100 DM 603
> 1000 DM 48
unknown 183

Loan Duration

knitr::kable(basicStats(credit$months_loan_duration), Caption = "Loan Duration", digits = 2, col.names = c("Value"))
Value
nobs 1000.00
NAs 0.00
Minimum 4.00
Maximum 72.00
1. Quartile 12.00
3. Quartile 24.00
Mean 20.90
Median 18.00
Sum 20903.00
SE Mean 0.38
LCL Mean 20.15
UCL Mean 21.65
Variance 145.42
Stdev 12.06
Skewness 1.09
Kurtosis 0.90

Loan Amount

knitr::kable(basicStats(credit$amount), Caption = "Loan Amount", digits = 2, col.names = c("Amount"))
Amount
nobs 1000.00
NAs 0.00
Minimum 250.00
Maximum 18424.00
1. Quartile 1365.50
3. Quartile 3972.25
Mean 3271.26
Median 2319.50
Sum 3271258.00
SE Mean 89.26
LCL Mean 3096.09
UCL Mean 3446.42
Variance 7967843.47
Stdev 2822.74
Skewness 1.94
Kurtosis 4.25

Default Applicants

knitr::kable(data.frame(table(credit$default)), col.names = c("Default", "Amount"))
Default Amount
no 700
yes 300

Data Visualization

The data visualization will focus on identifying patterns of key features which clearly distinguishes an applicant’s default status.

Age

plot_ly(credit, y = ~age, color = ~default, type = "box")

As visualized, default applicants are of younger age. Perhaps the younger age have less awareness about financial planning.

Loan Amount

plot_ly(credit, y = ~amount, color = ~default, type = "box")

As visualized, default applicants tend to have higher loan amounts. Risky applicants tend to loan huge amounts perhaps due to underestimation.

Loan Duration

plot_ly(credit, y=~months_loan_duration, color = ~default, type = "box")

As visualized, default applicants tend to have longer loan durations perhaps due to their cash flows.

Job

credit %>% count(default, job) %>% plot_ly(x = ~default, y = ~n, color = ~job, type = "bar")

As visualized, defaulted applicants are less skilled. Perhaps this reflects their financial planning awareness.

Checking Balance

credit %>% count(default, checking_balance) %>% plot_ly(x = ~default, y = ~n, color = ~checking_balance, type = "bar")

As visualized, defaulted applicants have lower checking balance.

Savings Balance

credit %>% count(default, savings_balance) %>% plot_ly(x = ~default, y = ~n, color = ~savings_balance, type = "bar")

As visualized, defaulted applicants have lower savings balance.

Purpose

credit %>% count(default, purpose) %>% plot_ly(x = ~default, y = ~n, color = ~purpose, type = "bar")

As visualized, of the defaulted applicants, most of them take loans to purchase new cars.

Preliminary Insights

From the data preview and data visualization, we identify the following characteristics that distinguishes an applicant who is likely to default:

  • Low checking & savings balance
  • Less skilled
  • Younger age
  • Longer loan duration
  • Higher loan amounts
  • Apply loans to purchase new cars


Step 2: Data Preparation

The training and test datasets will be split into a portion of 90% to 10%, leaving 900 values for training and 100 for test. However, as the dataset is not sorted in random order, this could cause bias if for example the data is sorted by loan amounts ascending. The model will train on small loans and test on big loans. Hence, random sampling is required.

Random Sample

Random sampling is applied as following (please click show code):

# To ensure results can be reproduced
set.seed(123)

# Sample 900 integers randomly from 1000
train_sample <- sample(1000,900)

#Subset randomly into training and test
credit_train <- credit[train_sample ,]
credit_test <- credit[-train_sample , ]

Let’s check if both the train and test set have rather even split of the class levels. This is to prevent training bias.

knitr::kable(data.frame(prop.table(table(credit_train$default))), caption = "Training Set Proportion", col.names = c("Default Status", "Proportion"), digits = 3)
Training Set Proportion
Default Status Proportion
no 0.703
yes 0.297
knitr::kable(data.frame(prop.table(table(credit_test$default))), caption = "Test Set Proportion", col.names = c("Default Status", "Proportion"), digits = 4)
Test Set Proportion
Default Status Proportion
no 0.67
yes 0.33

Step 3: Model Training

Since we have all the data we need, the C5.0 model can now be trained on the training set. To use the training set, the class labels must be removed from the training set and plugged into separately as labels.

The summary of the model is as follows:

#building the classifier
credit_model <- C5.0(credit_train[-21], credit_train$default)

credit_model
## 
## Call:
## C5.0.default(x = credit_train[-21], y = credit_train$default)
## 
## Classification Tree
## Number of samples: 900 
## Number of predictors: 20 
## 
## Tree size: 54 
## 
## Non-standard options: attempt to group attributes
summary(credit_model)
## 
## Call:
## C5.0.default(x = credit_train[-21], y = credit_train$default)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Wed Nov 16 23:21:34 2016
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 900 cases (21 attributes) from undefined.data
## 
## Decision tree:
## 
## checking_balance in {> 200 DM,unknown}: no (412/50)
## checking_balance in {1 - 200 DM,< 0 DM}:
## :...other_debtors = guarantor:
##     :...months_loan_duration > 36: yes (4/1)
##     :   months_loan_duration <= 36:
##     :   :...installment_plan in {none,stores}: no (24)
##     :       installment_plan = bank:
##     :       :...purpose = car (new): yes (3)
##     :           purpose in {business,car (used),domestic appliances,education,
##     :                       furniture,others,radio/tv,repairs,
##     :                       retraining}: no (7/1)
##     other_debtors in {co-applicant,none}:
##     :...credit_history = critical: no (102/30)
##         credit_history = fully repaid: yes (27/6)
##         credit_history = fully repaid this bank:
##         :...other_debtors = co-applicant: no (2)
##         :   other_debtors = none: yes (26/8)
##         credit_history in {delayed,repaid}:
##         :...savings_balance in {501 - 1000 DM,> 1000 DM}: no (19/3)
##             savings_balance = 101 - 500 DM:
##             :...other_debtors = co-applicant: yes (3)
##             :   other_debtors = none:
##             :   :...personal_status in {divorced male,
##             :       :                   married male}: yes (6/1)
##             :       personal_status = female:
##             :       :...installment_rate <= 3: no (4/1)
##             :       :   installment_rate > 3: yes (4)
##             :       personal_status = single male:
##             :       :...age <= 41: no (15/2)
##             :           age > 41: yes (2)
##             savings_balance = unknown:
##             :...credit_history = delayed: no (8)
##             :   credit_history = repaid:
##             :   :...foreign_worker = no: no (2)
##             :       foreign_worker = yes:
##             :       :...checking_balance = < 0 DM:
##             :           :...telephone = none: yes (11/2)
##             :           :   telephone = yes:
##             :           :   :...amount <= 5045: no (5/1)
##             :           :       amount > 5045: yes (2)
##             :           checking_balance = 1 - 200 DM:
##             :           :...residence_history > 3: no (9)
##             :               residence_history <= 3: [S1]
##             savings_balance = < 100 DM:
##             :...months_loan_duration > 39:
##                 :...residence_history <= 1: no (2)
##                 :   residence_history > 1: yes (19/1)
##                 months_loan_duration <= 39:
##                 :...purpose in {car (new),retraining}: yes (47/16)
##                     purpose in {domestic appliances,others}: no (3)
##                     purpose = car (used):
##                     :...amount <= 8086: no (9/1)
##                     :   amount > 8086: yes (5)
##                     purpose = education:
##                     :...checking_balance = 1 - 200 DM: no (2)
##                     :   checking_balance = < 0 DM: yes (5)
##                     purpose = repairs:
##                     :...residence_history <= 3: yes (4/1)
##                     :   residence_history > 3: no (3)
##                     purpose = business:
##                     :...credit_history = delayed: yes (2)
##                     :   credit_history = repaid:
##                     :   :...age <= 34: no (5)
##                     :       age > 34: yes (2)
##                     purpose = radio/tv:
##                     :...employment_length in {0 - 1 yrs,
##                     :   :                     unemployed}: yes (14/5)
##                     :   employment_length = 4 - 7 yrs: no (3)
##                     :   employment_length = > 7 yrs:
##                     :   :...amount <= 932: yes (2)
##                     :   :   amount > 932: no (7)
##                     :   employment_length = 1 - 4 yrs:
##                     :   :...months_loan_duration <= 15: no (6)
##                     :       months_loan_duration > 15:
##                     :       :...amount <= 3275: yes (7)
##                     :           amount > 3275: no (2)
##                     purpose = furniture:
##                     :...residence_history <= 1: no (8/1)
##                         residence_history > 1:
##                         :...installment_plan in {bank,stores}: no (3/1)
##                             installment_plan = none:
##                             :...telephone = yes: yes (7/1)
##                                 telephone = none:
##                                 :...months_loan_duration > 27: yes (3)
##                                     months_loan_duration <= 27: [S2]
## 
## SubTree [S1]
## 
## property in {building society savings,unknown/none}: yes (4)
## property = other: no (6)
## property = real estate:
## :...job = skilled employee: yes (2)
##     job in {mangement self-employed,unemployed non-resident,
##             unskilled resident}: no (2)
## 
## SubTree [S2]
## 
## checking_balance = 1 - 200 DM: yes (5/2)
## checking_balance = < 0 DM:
## :...property in {building society savings,real estate,unknown/none}: no (8)
##     property = other:
##     :...installment_rate <= 1: no (2)
##         installment_rate > 1: yes (4)
## 
## 
## Evaluation on training data (900 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##      54  135(15.0%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     589    44    (a): class no
##      91   176    (b): class yes
## 
## 
##  Attribute usage:
## 
##  100.00% checking_balance
##   54.22% other_debtors
##   50.00% credit_history
##   32.56% savings_balance
##   25.22% months_loan_duration
##   19.78% purpose
##   10.11% residence_history
##    7.33% installment_plan
##    5.22% telephone
##    4.78% foreign_worker
##    4.56% employment_length
##    4.33% amount
##    3.44% personal_status
##    3.11% property
##    2.67% age
##    1.56% installment_rate
##    0.44% job
## 
## 
## Time: 0.0 secs

Explanation

First few lines of the summary reads the following:

  • if checking balance is more than 200 or unknown, less likely to default
  • 412/50 indicates 412 classified correctly and 50 classified incorrectly
  • 15% error rate is the sum of false positive and negative


Step 4: Evaluating Model Performance

Now it is time to put our test data to the test after training our model. After that, an evaluation is done to check its accuracy rate.

#making prediction
credit_pred <- predict(credit_model, credit_test)

#Evaluation
confusionMatrix(credit_pred, credit_test$default, dnn = c("Predicted", "Actual"))
## Confusion Matrix and Statistics
## 
##          Actual
## Predicted no yes
##       no  60  19
##       yes  7  14
##                                           
##                Accuracy : 0.74            
##                  95% CI : (0.6427, 0.8226)
##     No Information Rate : 0.67            
##     P-Value [Acc > NIR] : 0.08146         
##                                           
##                   Kappa : 0.3523          
##  Mcnemar's Test P-Value : 0.03098         
##                                           
##             Sensitivity : 0.8955          
##             Specificity : 0.4242          
##          Pos Pred Value : 0.7595          
##          Neg Pred Value : 0.6667          
##              Prevalence : 0.6700          
##          Detection Rate : 0.6000          
##    Detection Prevalence : 0.7900          
##       Balanced Accuracy : 0.6599          
##                                           
##        'Positive' Class : no              
## 

Turns out that the accuracy is only 74% and there is definitely room for improvement.


Step 5: Improving the Model

There are a couple of simple ways to adjust the C5.0 algorithm that may help to improve the performance of the model, both overall and for the more costly type of mistakes.

Adaptive Boosting

Boosting is rooted in the notion that by combining a number of weak performing learners, you can create a team that is much stronger than any of the learners alone.

To add bossting to the model, an addtional parameter trials is added. It acts as an upper limit and stop adding trees if accuracy does not improve. The de facto standard is 10.

The evaluation is as follows:

#making the classifier
credit_boost10 <- C5.0(credit_train[-21], credit_train$default, trials = 10)

#making the prediction
credit_boost_pred10 <- predict(credit_boost10, credit_test)

#Evaluation
confusionMatrix(credit_boost_pred10, credit_test$default)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  60  17
##        yes  7  16
##                                           
##                Accuracy : 0.76            
##                  95% CI : (0.6643, 0.8398)
##     No Information Rate : 0.67            
##     P-Value [Acc > NIR] : 0.03281         
##                                           
##                   Kappa : 0.4121          
##  Mcnemar's Test P-Value : 0.06619         
##                                           
##             Sensitivity : 0.8955          
##             Specificity : 0.4848          
##          Pos Pred Value : 0.7792          
##          Neg Pred Value : 0.6957          
##              Prevalence : 0.6700          
##          Detection Rate : 0.6000          
##    Detection Prevalence : 0.7700          
##       Balanced Accuracy : 0.6902          
##                                           
##        'Positive' Class : no              
## 

After adaptive boosting, the accuracy of the model improved to 76%.