Introduction

This project is from Book: Machine learning with R by Brett Lantz, chapter 4.

A link to the book https://bit.ly/3gsf2e0

This project is for educational purpose only.

The aim is to develop credit approval model using C5.0 decision trees

Required packages

C50 package is required for training decision tree model. also gmodel package for evaluation

library(C50)
library(gmodels)

Step 1 - collecting data

the data is donated to the UCI Machine learning Repository. the dataset is modified slightly.

A class variable indicates weather the loan went into default.

Step 2 - exploring and preparing the data

#Read the csv file, I set stringAsFactor = TRUE as the data read as character
credit <- read.csv("credit.csv", stringsAsFactors = TRUE)

#Explore structure of the dataset
str(credit)
## 'data.frame':    1000 obs. of  17 variables:
##  $ checking_balance    : Factor w/ 4 levels "< 0 DM","> 200 DM",..: 1 3 4 1 1 4 4 3 4 3 ...
##  $ months_loan_duration: int  6 48 12 42 24 36 24 36 12 30 ...
##  $ credit_history      : Factor w/ 5 levels "critical","good",..: 1 2 1 2 4 2 2 2 2 1 ...
##  $ purpose             : Factor w/ 6 levels "business","car",..: 5 5 4 5 2 4 5 2 5 2 ...
##  $ amount              : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ savings_balance     : Factor w/ 5 levels "< 100 DM","> 1000 DM",..: 5 1 1 1 1 5 4 1 2 1 ...
##  $ employment_duration : Factor w/ 5 levels "< 1 year","> 7 years",..: 2 3 4 4 3 3 2 3 4 5 ...
##  $ percent_of_income   : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ years_at_residence  : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ age                 : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ other_credit        : Factor w/ 3 levels "bank","none",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ housing             : Factor w/ 3 levels "other","own",..: 2 2 2 1 1 1 2 3 2 2 ...
##  $ existing_loans_count: int  2 1 1 1 2 1 1 1 1 2 ...
##  $ job                 : Factor w/ 4 levels "management","skilled",..: 2 2 4 2 2 4 2 1 4 1 ...
##  $ dependents          : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ phone               : Factor w/ 2 levels "no","yes": 2 1 1 1 1 2 1 2 1 1 ...
##  $ default             : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 1 1 1 2 ...
#Inspect checking_balance and savings_balance columns
table(credit$checking_balance)
## 
##     < 0 DM   > 200 DM 1 - 200 DM    unknown 
##        274         63        269        394
table(credit$savings_balance)
## 
##      < 100 DM     > 1000 DM  100 - 500 DM 500 - 1000 DM       unknown 
##           603            48           103            63           183

Numerical features

#Inspect Numerical features
summary(credit$months_loan_duration)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     4.0    12.0    18.0    20.9    24.0    72.0
summary(credit$amount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     250    1366    2320    3271    3972   18424

Default feature(response)

table(credit$default)
## 
##  no yes 
## 700 300

Data preparation - creating random training and test datasets

We will generate random sample using sample() function. We will use set.seed(123) function to make sure redundant results will be available in the future.

As I am using R version 4, I need to use R random number generator from version 3.5.2 to make sure my results matching the script.

RNGversion("3.5.2"); set.seed(123)
## Warning in RNGkind("Mersenne-Twister", "Inversion", "Rounding"): non-uniform
## 'Rounding' sampler used
train_sample <- sample(1000, 900)

#Check structure of the the random sample vector
str(train_sample)
##  int [1:900] 288 788 409 881 937 46 525 887 548 453 ...
#We will use the vector to randomly select train and test datasets 

credit_train <- credit[train_sample, ]
credit_test <- credit[-train_sample, ]
#Inspect proportion of default values in both training and test

prop.table(table(credit_train$default))
## 
##        no       yes 
## 0.7033333 0.2966667
prop.table(table(credit_test$default))
## 
##   no  yes 
## 0.67 0.33

Step 3 - training a model on the data

#Create decision tree model using C5.0() , we will remove column default from the training data set, default is column no. 17

credit_model <- C5.0(credit_train[-17], credit_train$default)

#Inspect the decision tree

credit_model
## 
## Call:
## C5.0.default(x = credit_train[-17], y = credit_train$default)
## 
## Classification Tree
## Number of samples: 900 
## Number of predictors: 16 
## 
## Tree size: 57 
## 
## Non-standard options: attempt to group attributes

We can see there are 57 decision trees.

summary(credit_model)
## 
## Call:
## C5.0.default(x = credit_train[-17], y = credit_train$default)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Tue Aug 04 21:40:41 2020
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 900 cases (17 attributes) from undefined.data
## 
## Decision tree:
## 
## checking_balance in {> 200 DM,unknown}: no (412/50)
## checking_balance in {< 0 DM,1 - 200 DM}:
## :...credit_history in {perfect,very good}: yes (59/18)
##     credit_history in {critical,good,poor}:
##     :...months_loan_duration <= 22:
##         :...credit_history = critical: no (72/14)
##         :   credit_history = poor:
##         :   :...dependents > 1: no (5)
##         :   :   dependents <= 1:
##         :   :   :...years_at_residence <= 3: yes (4/1)
##         :   :       years_at_residence > 3: no (5/1)
##         :   credit_history = good:
##         :   :...savings_balance in {> 1000 DM,500 - 1000 DM}: no (15/1)
##         :       savings_balance = 100 - 500 DM:
##         :       :...other_credit = bank: yes (3)
##         :       :   other_credit in {none,store}: no (9/2)
##         :       savings_balance = unknown:
##         :       :...other_credit = bank: yes (1)
##         :       :   other_credit in {none,store}: no (21/8)
##         :       savings_balance = < 100 DM:
##         :       :...purpose in {business,car0,renovations}: no (8/2)
##         :           purpose = education:
##         :           :...checking_balance = < 0 DM: yes (4)
##         :           :   checking_balance = 1 - 200 DM: no (1)
##         :           purpose = car:
##         :           :...employment_duration = > 7 years: yes (5)
##         :           :   employment_duration = unemployed: no (4/1)
##         :           :   employment_duration = < 1 year:
##         :           :   :...years_at_residence <= 2: yes (5)
##         :           :   :   years_at_residence > 2: no (3/1)
##         :           :   employment_duration = 1 - 4 years:
##         :           :   :...years_at_residence <= 2: yes (2)
##         :           :   :   years_at_residence > 2: no (6/1)
##         :           :   employment_duration = 4 - 7 years:
##         :           :   :...amount <= 1680: yes (2)
##         :           :       amount > 1680: no (3)
##         :           purpose = furniture/appliances:
##         :           :...job in {management,unskilled}: no (23/3)
##         :               job = unemployed: yes (1)
##         :               job = skilled:
##         :               :...months_loan_duration > 13: [S1]
##         :                   months_loan_duration <= 13:
##         :                   :...housing in {other,own}: no (23/4)
##         :                       housing = rent:
##         :                       :...percent_of_income <= 3: yes (3)
##         :                           percent_of_income > 3: no (2)
##         months_loan_duration > 22:
##         :...savings_balance = > 1000 DM: no (2)
##             savings_balance = 500 - 1000 DM: yes (4/1)
##             savings_balance = 100 - 500 DM:
##             :...credit_history in {critical,poor}: no (14/3)
##             :   credit_history = good:
##             :   :...other_credit = bank: no (1)
##             :       other_credit in {none,store}: yes (12/2)
##             savings_balance = unknown:
##             :...checking_balance = 1 - 200 DM: no (17)
##             :   checking_balance = < 0 DM:
##             :   :...credit_history = critical: no (1)
##             :       credit_history in {good,poor}: yes (12/3)
##             savings_balance = < 100 DM:
##             :...months_loan_duration > 47: yes (21/2)
##                 months_loan_duration <= 47:
##                 :...housing = other:
##                     :...percent_of_income <= 2: no (6)
##                     :   percent_of_income > 2: yes (9/3)
##                     housing = rent:
##                     :...other_credit = bank: no (1)
##                     :   other_credit in {none,store}: yes (16/3)
##                     housing = own:
##                     :...employment_duration = > 7 years: no (13/4)
##                         employment_duration = 4 - 7 years:
##                         :...job in {management,skilled,
##                         :   :       unemployed}: yes (9/1)
##                         :   job = unskilled: no (1)
##                         employment_duration = unemployed:
##                         :...years_at_residence <= 2: yes (4)
##                         :   years_at_residence > 2: no (3)
##                         employment_duration = 1 - 4 years:
##                         :...purpose in {business,car0,education}: yes (7/1)
##                         :   purpose in {furniture/appliances,
##                         :   :           renovations}: no (7)
##                         :   purpose = car:
##                         :   :...years_at_residence <= 3: yes (3)
##                         :       years_at_residence > 3: no (3)
##                         employment_duration = < 1 year:
##                         :...years_at_residence > 3: yes (5)
##                             years_at_residence <= 3:
##                             :...other_credit = bank: no (0)
##                                 other_credit = store: yes (1)
##                                 other_credit = none:
##                                 :...checking_balance = 1 - 200 DM: no (8/2)
##                                     checking_balance = < 0 DM:
##                                     :...job in {management,skilled,
##                                         :       unemployed}: yes (2)
##                                         job = unskilled: no (3/1)
## 
## SubTree [S1]
## 
## employment_duration in {< 1 year,4 - 7 years}: no (4)
## employment_duration in {> 7 years,1 - 4 years,unemployed}: yes (10)
## 
## 
## Evaluation on training data (900 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##      56  133(14.8%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     598    35    (a): class no
##      98   169    (b): class yes
## 
## 
##  Attribute usage:
## 
##  100.00% checking_balance
##   54.22% credit_history
##   47.67% months_loan_duration
##   38.11% savings_balance
##   14.33% purpose
##   14.33% housing
##   12.56% employment_duration
##    9.00% job
##    8.67% other_credit
##    6.33% years_at_residence
##    2.22% percent_of_income
##    1.56% dependents
##    0.56% amount
## 
## 
## Time: 0.0 secs

We can see the error is 14% in the above model.

Step 4 - evaluating model performance

Apply the model on the decision tree using predict function

credit_pred <- predict(credit_model, credit_test)

#Evaluate the
CrossTable(credit_test$default, credit_pred, 
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE, 
           dnn = c("actual default", "predicted deafult"))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                | predicted deafult 
## actual default |        no |       yes | Row Total | 
## ---------------|-----------|-----------|-----------|
##             no |        59 |         8 |        67 | 
##                |     0.590 |     0.080 |           | 
## ---------------|-----------|-----------|-----------|
##            yes |        19 |        14 |        33 | 
##                |     0.190 |     0.140 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |        78 |        22 |       100 | 
## ---------------|-----------|-----------|-----------|
## 
## 

We can calculate total accuracy to be (59 + 14) / 100 = 73%

Step 5 - improving model performance

Boosting the accuracy of decision trees by adding additional trials parameter indicating the number of separate decision trees to use in the boosted team. de facto standard trials = 10

credit_boost10 <- C5.0(credit_train[-17], credit_train$default, trials = 10)

credit_boost10
## 
## Call:
## C5.0.default(x = credit_train[-17], y = credit_train$default, trials = 10)
## 
## Classification Tree
## Number of samples: 900 
## Number of predictors: 16 
## 
## Number of boosting iterations: 10 
## Average tree size: 47.5 
## 
## Non-standard options: attempt to group attributes
#As the output will show the data for the 10 trials, we will not run this code in this chunk
#summary(credit_boost10)

The results show there are 34 mistakes on 900 training examples, which represents error = 13.5 %

credit_boost_pred10 <- predict(credit_boost10, credit_test)

#Evaluate the
CrossTable(credit_test$default, credit_boost_pred10, 
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE, 
           dnn = c("actual default", "predicted deafult"))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                | predicted deafult 
## actual default |        no |       yes | Row Total | 
## ---------------|-----------|-----------|-----------|
##             no |        62 |         5 |        67 | 
##                |     0.620 |     0.050 |           | 
## ---------------|-----------|-----------|-----------|
##            yes |        13 |        20 |        33 | 
##                |     0.130 |     0.200 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |        75 |        25 |       100 | 
## ---------------|-----------|-----------|-----------|
## 
## 

we can see the error in the test came down to 18 out of 100, better than first model 27 out of 100.

as 13 false negative is very costly to the bank in case they default their loan, we can add penalties or cost matrix to show how many times more costly each error is relative to any other.

matrix_dimensions <- list(c("no", "yes"), c("no", "yes"))
names(matrix_dimensions) <- c("predicted", "actual")

matrix_dimensions
## $predicted
## [1] "no"  "yes"
## 
## $actual
## [1] "no"  "yes"

We need to set weights for the matrix as following, assuming that loan default costs the bank four times as much as a missed opportunity

  1. predicted no, actual no : 0
  2. predicted yes, actual no : 1
  3. predicted no, actual yes: 4
  4. predicted yes, actual yes : 0
error_cost <- matrix(c(0,1,4,0), nrow = 2,
                     dimnames = matrix_dimensions)
error_cost
##          actual
## predicted no yes
##       no   0   4
##       yes  1   0
credit_cost <- C5.0(credit_train[-17], credit_train$default, costs = error_cost)

credit_cost
## 
## Call:
## C5.0.default(x = credit_train[-17], y = credit_train$default, costs
##  = error_cost)
## 
## Classification Tree
## Number of samples: 900 
## Number of predictors: 16 
## 
## Tree size: 28 
## 
## Non-standard options: attempt to group attributes
## 
## Cost Matrix:
##          actual
## predicted no yes
##       no   0   4
##       yes  1   0
credit_cost_pred <- predict(credit_cost, credit_test)

#Evaluate the
CrossTable(credit_test$default, credit_cost_pred, 
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE, 
           dnn = c("actual default", "predicted deafult"))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                | predicted deafult 
## actual default |        no |       yes | Row Total | 
## ---------------|-----------|-----------|-----------|
##             no |        37 |        30 |        67 | 
##                |     0.370 |     0.300 |           | 
## ---------------|-----------|-----------|-----------|
##            yes |         7 |        26 |        33 | 
##                |     0.070 |     0.260 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |        44 |        56 |       100 | 
## ---------------|-----------|-----------|-----------|
## 
## 

compared to previous two models, this model makes more mistakes, error is ( 7 + 30) 37 out of 100.but taking a deeper look at the loan default now the percentage is 26/33 = 79% identified loan default. but earlier one identified only 20 out of 33 which is accurate only 60%.