For the complete HTML version of this analysis, including the interactive plots, please visit www.rpubs.com/jasonchanhku/loans

Libraries Used

library(fBasics) #summary statistics
library(plotly) #data visualization
library(dplyr) #count function
library(C50) #C50 Decision Tree algorithm
library(gmodels) #CrossTable()
library(caret) #Confusion Matrix
library(rmarkdown)

Objective

The global financial crisis of 2007-2008 highlighted the importance of transparency and rigor in banking practices. As the availability of credit was limited, banks tightened their lending systems and turned to machine learning to more accurately identify risky loans.

Therefore, the objective of this project is to develop a simple credit approval model using the C5.0 decision tree algorithm.

The target variable of interest is default status.

Step 1: Data Exploration

The data was donated to the UCI Machine Learning Data Repository (http://archive.ics.uci.edu/ml) by Hans Hofmann of the University of Hamburg. The dataset contains information on loans obtained from a credit agency in Germany.

The credit dataset includes 1,000 examples on loans, plus a set of numeric and nominal features indicating the characteristics of the loan and the loan applicant. A class variable indicates whether the loan went into default.

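# stringsAsFactors = TRUE keeps text columns as factors (the default before R 4.0), matching the structure shown below
credit <- read.csv(file = "Machine-Learning-with-R-datasets-master/credit.csv", stringsAsFactors = TRUE)
# Recode the class variable: 1 = did not default ("no"), 2 = defaulted ("yes")
credit$default <- factor(credit$default, levels = c(1, 2), labels = c("no", "yes"))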

Data Preview

Below is a partial table preview of the dataset:

knitr::kable(head(credit), caption = "Credit Information Dataset")

Credit Information Dataset
checking_balance	months_loan_duration	credit_history	purpose	amount	savings_balance	employment_length	installment_rate	personal_status	other_debtors	residence_history	property	age	installment_plan	housing	existing_credits	job	dependents	telephone	foreign_worker	default
< 0 DM	6	critical	radio/tv	1169	unknown	> 7 yrs	4	single male	none	4	real estate	67	none	own	2	skilled employee	1	yes	yes	no
1 - 200 DM	48	repaid	radio/tv	5951	< 100 DM	1 - 4 yrs	2	female	none	2	real estate	22	none	own	1	skilled employee	1	none	yes	yes
unknown	12	critical	education	2096	< 100 DM	4 - 7 yrs	2	single male	none	3	real estate	49	none	own	1	unskilled resident	2	none	yes	no
< 0 DM	42	repaid	furniture	7882	< 100 DM	4 - 7 yrs	2	single male	guarantor	4	building society savings	45	none	for free	1	skilled employee	2	none	yes	no
< 0 DM	24	delayed	car (new)	4870	< 100 DM	1 - 4 yrs	3	single male	none	4	unknown/none	53	none	for free	2	skilled employee	2	none	yes	yes
unknown	36	repaid	education	9055	unknown	1 - 4 yrs	2	single male	none	4	unknown/none	35	none	for free	1	unskilled resident	2	yes	yes	no

The general structure of the dataset:

str(credit)

## 'data.frame':    1000 obs. of  21 variables:
##  $ checking_balance    : Factor w/ 4 levels "1 - 200 DM","< 0 DM",..: 2 1 4 2 2 4 4 1 4 1 ...
##  $ months_loan_duration: int  6 48 12 42 24 36 24 36 12 30 ...
##  $ credit_history      : Factor w/ 5 levels "critical","delayed",..: 1 5 1 5 2 5 5 5 5 1 ...
##  $ purpose             : Factor w/ 10 levels "business","car (new)",..: 8 8 5 6 2 5 6 3 8 2 ...
##  $ amount              : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ savings_balance     : Factor w/ 5 levels "101 - 500 DM",..: 5 3 3 3 3 5 2 3 4 3 ...
##  $ employment_length   : Factor w/ 5 levels "0 - 1 yrs","1 - 4 yrs",..: 4 2 3 3 2 2 4 2 3 5 ...
##  $ installment_rate    : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ personal_status     : Factor w/ 4 levels "divorced male",..: 4 2 4 4 4 4 4 4 1 3 ...
##  $ other_debtors       : Factor w/ 3 levels "co-applicant",..: 3 3 3 2 3 3 3 3 3 3 ...
##  $ residence_history   : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ property            : Factor w/ 4 levels "building society savings",..: 3 3 3 1 4 4 1 2 3 2 ...
##  $ age                 : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ installment_plan    : Factor w/ 3 levels "bank","none",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ housing             : Factor w/ 3 levels "for free","own",..: 2 2 2 1 1 1 2 3 2 2 ...
##  $ existing_credits    : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ job                 : Factor w/ 4 levels "mangement self-employed",..: 2 2 4 2 2 4 2 1 4 1 ...
##  $ dependents          : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ telephone           : Factor w/ 2 levels "none","yes": 2 1 1 1 1 2 1 2 1 1 ...
##  $ foreign_worker      : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ default             : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 1 1 1 2 ...

Useful Tables

Below are some useful tables that offer summary statistics for loan features such as checking balance, savings balance, duration, amount, and default status.

Checking Balance

knitr::kable(data.frame(table(credit$checking_balance)), caption = "Checking Balance", col.names = c("Checking Balance", "Frequency"))

Checking Balance	Frequency
1 - 200 DM	269
< 0 DM	274
> 200 DM	63
unknown	394

Savings Balance

knitr::kable(data.frame(table(credit$savings_balance)), caption = "Savings Balance", col.names = c("Savings Balance", "Frequency"))

Savings Balance	Frequency
101 - 500 DM	103
501 - 1000 DM	63
< 100 DM	603
> 1000 DM	48
unknown	183

Loan Duration

knitr::kable(basicStats(credit$months_loan_duration), caption = "Loan Duration", digits = 2, col.names = c("Value"))

	Value
nobs	1000.00
NAs	0.00
Minimum	4.00
Maximum	72.00
1. Quartile	12.00
3. Quartile	24.00
Mean	20.90
Median	18.00
Sum	20903.00
SE Mean	0.38
LCL Mean	20.15
UCL Mean	21.65
Variance	145.42
Stdev	12.06
Skewness	1.09
Kurtosis	0.90
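
The SE Mean, LCL Mean, and UCL Mean rows can be reproduced by hand from the other rows. The sketch below is illustrative only: it assumes the limits are a 95% t-based confidence interval for the mean, the variable names are made up for this example, and the inputs are copied from the table above.

```r
# Values copied from the Loan Duration table above
n        <- 1000     # nobs
total    <- 20903    # Sum
variance <- 145.42   # Variance

mean_duration <- total / n                 # Mean
sd_duration   <- sqrt(variance)            # Stdev
se_mean       <- sd_duration / sqrt(n)     # SE Mean

# Assumption: LCL/UCL are a 95% t-based confidence interval for the mean
t_crit <- qt(0.975, df = n - 1)
lcl <- mean_duration - t_crit * se_mean
ucl <- mean_duration + t_crit * se_mean

round(c(Mean = mean_duration, Stdev = sd_duration,
        SE = se_mean, LCL = lcl, UCL = ucl), 2)
```

Under that assumption the rounded results agree with the Stdev, SE Mean, LCL Mean, and UCL Mean rows of the table.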

Loan Amount

knitr::kable(basicStats(credit$amount), caption = "Loan Amount", digits = 2, col.names = c("Amount"))

	Amount
nobs	1000.00
NAs	0.00
Minimum	250.00
Maximum	18424.00
1. Quartile	1365.50
3. Quartile	3972.25
Mean	3271.26
Median	2319.50
Sum	3271258.00
SE Mean	89.26
LCL Mean	3096.09
UCL Mean	3446.42
Variance	7967843.47
Stdev	2822.74
Skewness	1.94
Kurtosis	4.25

Default Applicants

knitr::kable(data.frame(table(credit$default)), col.names = c("Default", "Frequency"))

Default	Frequency
no	700
yes	300

Data Visualization

The data visualization focuses on identifying patterns in key features that clearly distinguish an applicant's default status.

Age

plot_ly(credit, y = ~age, color = ~default, type = "box")

As visualized, applicants who default tend to be younger. Perhaps younger applicants are less aware of financial planning.

Loan Amount

plot_ly(credit, y = ~amount, color = ~default, type = "box")

As visualized, applicants who default tend to have higher loan amounts. Risky applicants tend to borrow large amounts, perhaps because they underestimate what repayment will require.

Loan Duration

plot_ly(credit, y=~months_loan_duration, color = ~default, type = "box")

As visualized, applicants who default tend to have longer loan durations, perhaps because of their cash flows.

Job

credit %>% count(default, job) %>% plot_ly(x = ~default, y = ~n, color = ~job, type = "bar")

As visualized, applicants who defaulted tend to hold less skilled jobs. Perhaps this reflects their awareness of financial planning.

Checking Balance

credit %>% count(default, checking_balance) %>% plot_ly(x = ~default, y = ~n, color = ~checking_balance, type = "bar")

As visualized, applicants who defaulted tend to have lower checking balances.

Savings Balance

credit %>% count(default, savings_balance) %>% plot_ly(x = ~default, y = ~n, color = ~savings_balance, type = "bar")

As visualized, applicants who defaulted tend to have lower savings balances.

Purpose

credit %>% count(default, purpose) %>% plot_ly(x = ~default, y = ~n, color = ~purpose, type = "bar")

As visualized, most of the applicants who defaulted took loans to purchase new cars.

Preliminary Insights

From the data preview and data visualization, we identify the following characteristics that distinguish an applicant who is likely to default:

Low checking & savings balance
Less skilled
Younger age
Longer loan duration
Higher loan amounts
Loans taken to purchase new cars

Step 2: Data Preparation

The data will be split 90% to 10%, leaving 900 records for training and 100 for testing. However, the dataset is not sorted in random order, so simply taking the first 900 rows could introduce bias. If, for example, the data were sorted by loan amount in ascending order, the model would train on small loans and be tested on large ones. Hence, random sampling is required.

Random Sample

Random sampling is applied as follows:

# To ensure results can be reproduced
set.seed(123)

# Sample 900 integers randomly from 1000
train_sample <- sample(1000,900)

#Subset randomly into training and test
credit_train <- credit[train_sample ,]
credit_test <- credit[-train_sample , ]
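
To make the ordering concern concrete, here is a small illustrative sketch on toy data (not the credit dataset). It assumes 1,000 made-up loan amounts sorted in ascending order and compares a naive "first 900 rows" split with a random one; all names in it are hypothetical.

```r
# Toy data: 1000 hypothetical loan amounts, sorted ascending (not the credit data)
set.seed(123)
toy_amount <- sort(round(runif(1000, min = 250, max = 18424)))

# Naive split: first 900 rows for training, last 100 for testing
naive_train <- toy_amount[1:900]
naive_test  <- toy_amount[901:1000]

# Random split, in the same style as the sample() call above
idx <- sample(1000, 900)
random_train <- toy_amount[idx]
random_test  <- toy_amount[-idx]

# Naive split: every test loan is at least as large as every training loan
c(naive_train_max = max(naive_train), naive_test_min = min(naive_test))

# Random split: both sets cover a similar range of amounts
rbind(train = range(random_train), test = range(random_test))
```

With the naive split the model would never see a loan as large as those it is tested on, which is exactly the bias that random sampling avoids.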

Let’s check that the training and test sets have a similar split of the class levels. This helps guard against training bias.

knitr::kable(data.frame(prop.table(table(credit_train$default))), caption = "Training Set Proportion", col.names = c("Default Status", "Proportion"), digits = 3)

Training Set Proportion
Default Status	Proportion
no	0.703
yes	0.297

knitr::kable(data.frame(prop.table(table(credit_test$default))), caption = "Test Set Proportion", col.names = c("Default Status", "Proportion"), digits = 4)

Test Set Proportion
Default Status	Proportion
no	0.67
yes	0.33
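
The two proportions are close but not identical (29.7% versus 33% defaults). If a closer match is wanted, one option is a stratified split that samples 90% within each class. This is an alternative technique, not what the analysis above does; the sketch uses synthetic labels with the same 700/300 class counts, and `stratified_split` is a hypothetical helper written for this example.

```r
# Synthetic labels mirroring the 700 "no" / 300 "yes" class counts (not the real data)
labels <- factor(rep(c("no", "yes"), times = c(700, 300)))

# Hypothetical helper: sample a fixed fraction of row indices within each class
stratified_split <- function(y, frac = 0.9) {
  by_class <- split(seq_along(y), y)
  picked <- lapply(by_class, function(rows) {
    rows[sample.int(length(rows), size = round(frac * length(rows)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(123)
train_idx <- stratified_split(labels)

prop.table(table(labels[train_idx]))    # training proportions
prop.table(table(labels[-train_idx]))   # test proportions
```

Because the sampling happens separately within each class, both sets keep the 70/30 class ratio regardless of the seed.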

Step 3: Model Training

With the data prepared, the C5.0 model can now be trained on the training set. The class column (default, column 21) is excluded from the predictors and supplied separately as the vector of labels.

The summary of the model is as follows:

#building the classifier
credit_model <- C5.0(credit_train[-21], credit_train$default)

credit_model

## 
## Call:
## C5.0.default(x = credit_train[-21], y = credit_train$default)
## 
## Classification Tree
## Number of samples: 900 
## Number of predictors: 20 
## 
## Tree size: 54 
## 
## Non-standard options: attempt to group attributes

summary(credit_model)

## 
## Call:
## C5.0.default(x = credit_train[-21], y = credit_train$default)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Wed Nov 16 23:21:34 2016
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 900 cases (21 attributes) from undefined.data
## 
## Decision tree:
## 
## checking_balance in {> 200 DM,unknown}: no (412/50)
## checking_balance in {1 - 200 DM,< 0 DM}:
## :...other_debtors = guarantor:
##     :...months_loan_duration > 36: yes (4/1)
##     :   months_loan_duration <= 36:
##     :   :...installment_plan in {none,stores}: no (24)
##     :       installment_plan = bank:
##     :       :...purpose = car (new): yes (3)
##     :           purpose in {business,car (used),domestic appliances,education,
##     :                       furniture,others,radio/tv,repairs,
##     :                       retraining}: no (7/1)
##     other_debtors in {co-applicant,none}:
##     :...credit_history = critical: no (102/30)
##         credit_history = fully repaid: yes (27/6)
##         credit_history = fully repaid this bank:
##         :...other_debtors = co-applicant: no (2)
##         :   other_debtors = none: yes (26/8)
##         credit_history in {delayed,repaid}:
##         :...savings_balance in {501 - 1000 DM,> 1000 DM}: no (19/3)
##             savings_balance = 101 - 500 DM:
##             :...other_debtors = co-applicant: yes (3)
##             :   other_debtors = none:
##             :   :...personal_status in {divorced male,
##             :       :                   married male}: yes (6/1)
##             :       personal_status = female:
##             :       :...installment_rate <= 3: no (4/1)
##             :       :   installment_rate > 3: yes (4)
##             :       personal_status = single male:
##             :       :...age <= 41: no (15/2)
##             :           age > 41: yes (2)
##             savings_balance = unknown:
##             :...credit_history = delayed: no (8)
##             :   credit_history = repaid:
##             :   :...foreign_worker = no: no (2)
##             :       foreign_worker = yes:
##             :       :...checking_balance = < 0 DM:
##             :           :...telephone = none: yes (11/2)
##             :           :   telephone = yes:
##             :           :   :...amount <= 5045: no (5/1)
##             :           :       amount > 5045: yes (2)
##             :           checking_balance = 1 - 200 DM:
##             :           :...residence_history > 3: no (9)
##             :               residence_history <= 3: [S1]
##             savings_balance = < 100 DM:
##             :...months_loan_duration > 39:
##                 :...residence_history <= 1: no (2)
##                 :   residence_history > 1: yes (19/1)
##                 months_loan_duration <= 39:
##                 :...purpose in {car (new),retraining}: yes (47/16)
##                     purpose in {domestic appliances,others}: no (3)
##                     purpose = car (used):
##                     :...amount <= 8086: no (9/1)
##                     :   amount > 8086: yes (5)
##                     purpose = education:
##                     :...checking_balance = 1 - 200 DM: no (2)
##                     :   checking_balance = < 0 DM: yes (5)
##                     purpose = repairs:
##                     :...residence_history <= 3: yes (4/1)
##                     :   residence_history > 3: no (3)
##                     purpose = business:
##                     :...credit_history = delayed: yes (2)
##                     :   credit_history = repaid:
##                     :   :...age <= 34: no (5)
##                     :       age > 34: yes (2)
##                     purpose = radio/tv:
##                     :...employment_length in {0 - 1 yrs,
##                     :   :                     unemployed}: yes (14/5)
##                     :   employment_length = 4 - 7 yrs: no (3)
##                     :   employment_length = > 7 yrs:
##                     :   :...amount <= 932: yes (2)
##                     :   :   amount > 932: no (7)
##                     :   employment_length = 1 - 4 yrs:
##                     :   :...months_loan_duration <= 15: no (6)
##                     :       months_loan_duration > 15:
##                     :       :...amount <= 3275: yes (7)
##                     :           amount > 3275: no (2)
##                     purpose = furniture:
##                     :...residence_history <= 1: no (8/1)
##                         residence_history > 1:
##                         :...installment_plan in {bank,stores}: no (3/1)
##                             installment_plan = none:
##                             :...telephone = yes: yes (7/1)
##                                 telephone = none:
##                                 :...months_loan_duration > 27: yes (3)
##                                     months_loan_duration <= 27: [S2]
## 
## SubTree [S1]
## 
## property in {building society savings,unknown/none}: yes (4)
## property = other: no (6)
## property = real estate:
## :...job = skilled employee: yes (2)
##     job in {mangement self-employed,unemployed non-resident,
##             unskilled resident}: no (2)
## 
## SubTree [S2]
## 
## checking_balance = 1 - 200 DM: yes (5/2)
## checking_balance = < 0 DM:
## :...property in {building society savings,real estate,unknown/none}: no (8)
##     property = other:
##     :...installment_rate <= 1: no (2)
##         installment_rate > 1: yes (4)
## 
## 
## Evaluation on training data (900 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##      54  135(15.0%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     589    44    (a): class no
##      91   176    (b): class yes
## 
## 
##  Attribute usage:
## 
##  100.00% checking_balance
##   54.22% other_debtors
##   50.00% credit_history
##   32.56% savings_balance
##   25.22% months_loan_duration
##   19.78% purpose
##   10.11% residence_history
##    7.33% installment_plan
##    5.22% telephone
##    4.78% foreign_worker
##    4.56% employment_length
##    4.33% amount
##    3.44% personal_status
##    3.11% property
##    2.67% age
##    1.56% installment_rate
##    0.44% job
## 
## 
## Time: 0.0 secs

Explanation

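The first few lines of the summary read as follows:

If the checking balance is more than 200 DM or unknown, the applicant is classified as unlikely to default
(412/50) indicates that 412 cases reached this leaf, of which 50 were misclassified (they actually defaulted)
The 15% training error rate is the 135 misclassified cases out of 900: 44 non-defaulters classified as "yes" plus 91 defaulters classified as "no"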
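
The arithmetic behind these readings can be checked directly. In C5.0 output, the pair in parentheses after a leaf gives the number of training cases that reached the leaf, followed by the number of those that were misclassified. The sketch below re-enters the figures from the summary by hand; the variable names are made up for this example.

```r
# First leaf of the tree: "no (412/50)"
reached       <- 412   # training cases that reached this leaf
misclassified <- 50    # of those, cases whose true class was "yes"
correct       <- reached - misclassified

# Training confusion matrix from the summary (rows = actual, columns = predicted)
train_cm <- matrix(c(589,  44,
                      91, 176),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(actual = c("no", "yes"),
                                   predicted = c("no", "yes")))

# Off-diagonal cells are the two kinds of error
errors     <- train_cm["no", "yes"] + train_cm["yes", "no"]
error_rate <- errors / sum(train_cm)

c(correct_in_leaf = correct, errors = errors, error_rate = error_rate)
```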

Step 4: Evaluating Model Performance

With the model trained, it is time to apply it to the test data and evaluate its accuracy.

#making prediction
credit_pred <- predict(credit_model, credit_test)

#Evaluation
confusionMatrix(credit_pred, credit_test$default, dnn = c("Predicted", "Actual"))

## Confusion Matrix and Statistics
## 
##          Actual
## Predicted no yes
##       no  60  19
##       yes  7  14
##                                           
##                Accuracy : 0.74            
##                  95% CI : (0.6427, 0.8226)
##     No Information Rate : 0.67            
##     P-Value [Acc > NIR] : 0.08146         
##                                           
##                   Kappa : 0.3523          
##  Mcnemar's Test P-Value : 0.03098         
##                                           
##             Sensitivity : 0.8955          
##             Specificity : 0.4242          
##          Pos Pred Value : 0.7595          
##          Neg Pred Value : 0.6667          
##              Prevalence : 0.6700          
##          Detection Rate : 0.6000          
##    Detection Prevalence : 0.7900          
##       Balanced Accuracy : 0.6599          
##                                           
##        'Positive' Class : no              
##

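The accuracy is only 74%, not far above the 67% no-information rate, and the model correctly identifies only 14 of the 33 actual defaults (specificity of 42%). There is definitely room for improvement.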
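
To show where the headline statistics come from, the following sketch recomputes a few of them by hand from the four cell counts of the confusion matrix above. It uses base R only; the variable names are made up for this example, and "no" is treated as the positive class, as in the caret output.

```r
# Test-set confusion matrix from the output above (rows = predicted, columns = actual)
cm <- matrix(c(60, 19,
                7, 14),
             nrow = 2, byrow = TRUE,
             dimnames = list(predicted = c("no", "yes"),
                             actual = c("no", "yes")))

n <- sum(cm)
accuracy    <- sum(diag(cm)) / n
sensitivity <- cm["no", "no"] / sum(cm[, "no"])      # positive class is "no"
specificity <- cm["yes", "yes"] / sum(cm[, "yes"])   # share of actual defaults caught

# Cohen's kappa: agreement beyond what the row and column totals give by chance
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, kappa = cohen_kappa), 4)
```

The low specificity is the important figure here: it is the fraction of applicants who actually defaulted that the model flagged.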

Step 5: Improving the Model

There are a couple of simple ways to adjust the C5.0 algorithm that may help to improve the performance of the model, both overall and for the more costly type of mistakes.

Adaptive Boosting

Boosting is rooted in the notion that by combining a number of weak performing learners, you can create a team that is much stronger than any of the learners alone.

To add boosting to the model, an additional parameter, trials, is supplied. It sets an upper limit on the number of boosting iterations, and the algorithm stops adding trees if accuracy does not improve. The de facto standard is 10.

The evaluation is as follows:

#making the classifier
credit_boost10 <- C5.0(credit_train[-21], credit_train$default, trials = 10)

#making the prediction
credit_boost_pred10 <- predict(credit_boost10, credit_test)

#Evaluation
confusionMatrix(credit_boost_pred10, credit_test$default)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction no yes
##        no  60  17
##        yes  7  16
##                                           
##                Accuracy : 0.76            
##                  95% CI : (0.6643, 0.8398)
##     No Information Rate : 0.67            
##     P-Value [Acc > NIR] : 0.03281         
##                                           
##                   Kappa : 0.4121          
##  Mcnemar's Test P-Value : 0.06619         
##                                           
##             Sensitivity : 0.8955          
##             Specificity : 0.4848          
##          Pos Pred Value : 0.7792          
##          Neg Pred Value : 0.6957          
##              Prevalence : 0.6700          
##          Detection Rate : 0.6000          
##    Detection Prevalence : 0.7700          
##       Balanced Accuracy : 0.6902          
##                                           
##        'Positive' Class : no              
##

After adaptive boosting, the accuracy of the model improved from 74% to 76%, and the share of actual defaults correctly identified (specificity) rose from 42% to 48%.
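
Accuracy treats both kinds of error alike, whereas the start of this step notes that some mistakes are more costly than others. As a purely illustrative sketch, suppose a missed default (predicted "no", actually "yes") costs four times as much as wrongly flagging a good applicant. The 4:1 ratio and the names `fn_cost`, `fp_cost`, and `total_cost` are assumptions made for this example, not figures from the analysis; the error counts come from the two test-set confusion matrices above.

```r
# Error counts taken from the two test-set confusion matrices above
single_tree <- c(missed_default = 19, false_alarm = 7)
boosted     <- c(missed_default = 17, false_alarm = 7)

# Assumed (hypothetical) relative costs per error type
fn_cost <- 4   # predicted "no", applicant actually defaulted
fp_cost <- 1   # predicted "yes", applicant actually repaid

# Total cost of a model's errors under the assumed costs
total_cost <- function(errors) {
  unname(errors["missed_default"] * fn_cost + errors["false_alarm"] * fp_cost)
}

c(single_tree = total_cost(single_tree), boosted = total_cost(boosted))
```

Under these assumed costs the boosted model scores 75 against 83 for the single tree, so the gain comes entirely from catching two more defaults.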

Identifying Risky Bank Loans with C5.0

Jason Chan

November 14, 2016
