This project is based on chapter 4 of the book Machine Learning with R by Brett Lantz.
A link to the book: https://bit.ly/3gsf2e0
This project is for educational purposes only.
The aim is to develop a credit approval model using C5.0 decision trees.
The C50 package is required for training the decision tree model; the gmodels package is used for evaluation.
library(C50)
library(gmodels)
The data was donated to the UCI Machine Learning Repository; the dataset used here has been modified slightly.
A class variable indicates whether the loan went into default.
#Read the CSV file; stringsAsFactors = TRUE converts the character columns to factors
credit <- read.csv("credit.csv", stringsAsFactors = TRUE)
#Explore structure of the dataset
str(credit)
## 'data.frame': 1000 obs. of 17 variables:
## $ checking_balance : Factor w/ 4 levels "< 0 DM","> 200 DM",..: 1 3 4 1 1 4 4 3 4 3 ...
## $ months_loan_duration: int 6 48 12 42 24 36 24 36 12 30 ...
## $ credit_history : Factor w/ 5 levels "critical","good",..: 1 2 1 2 4 2 2 2 2 1 ...
## $ purpose : Factor w/ 6 levels "business","car",..: 5 5 4 5 2 4 5 2 5 2 ...
## $ amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ savings_balance : Factor w/ 5 levels "< 100 DM","> 1000 DM",..: 5 1 1 1 1 5 4 1 2 1 ...
## $ employment_duration : Factor w/ 5 levels "< 1 year","> 7 years",..: 2 3 4 4 3 3 2 3 4 5 ...
## $ percent_of_income : int 4 2 2 2 3 2 3 2 2 4 ...
## $ years_at_residence : int 4 2 3 4 4 4 4 2 4 2 ...
## $ age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ other_credit : Factor w/ 3 levels "bank","none",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ housing : Factor w/ 3 levels "other","own",..: 2 2 2 1 1 1 2 3 2 2 ...
## $ existing_loans_count: int 2 1 1 1 2 1 1 1 1 2 ...
## $ job : Factor w/ 4 levels "management","skilled",..: 2 2 4 2 2 4 2 1 4 1 ...
## $ dependents : int 1 1 2 2 2 2 1 1 1 1 ...
## $ phone : Factor w/ 2 levels "no","yes": 2 1 1 1 1 2 1 2 1 1 ...
## $ default : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 1 1 1 2 ...
#Inspect checking_balance and savings_balance columns
table(credit$checking_balance)
##
## < 0 DM > 200 DM 1 - 200 DM unknown
## 274 63 269 394
table(credit$savings_balance)
##
## < 100 DM > 1000 DM 100 - 500 DM 500 - 1000 DM unknown
## 603 48 103 63 183
Numerical features
#Inspect Numerical features
summary(credit$months_loan_duration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 12.0 18.0 20.9 24.0 72.0
summary(credit$amount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 250 1366 2320 3271 3972 18424
Default feature (response)
table(credit$default)
##
## no yes
## 700 300
We will generate a random sample using the sample() function, and call set.seed(123) so that the same results can be reproduced in the future.
As I am using R version 4, I need to use the random number generator from R 3.5.2 to make sure my results match the book.
RNGversion("3.5.2"); set.seed(123)
## Warning in RNGkind("Mersenne-Twister", "Inversion", "Rounding"): non-uniform
## 'Rounding' sampler used
train_sample <- sample(1000, 900)
#Check the structure of the random sample vector
str(train_sample)
## int [1:900] 288 788 409 881 937 46 525 887 548 453 ...
#We will use the vector to randomly split the data into training and test datasets
credit_train <- credit[train_sample, ]
credit_test <- credit[-train_sample, ]
#Inspect proportion of default values in both training and test
prop.table(table(credit_train$default))
##
## no yes
## 0.7033333 0.2966667
prop.table(table(credit_test$default))
##
## no yes
## 0.67 0.33
#Create the decision tree model using C5.0(); exclude the default column (column no. 17) from the training predictors
credit_model <- C5.0(credit_train[-17], credit_train$default)
#Inspect the decision tree
credit_model
##
## Call:
## C5.0.default(x = credit_train[-17], y = credit_train$default)
##
## Classification Tree
## Number of samples: 900
## Number of predictors: 16
##
## Tree size: 57
##
## Non-standard options: attempt to group attributes
A tree size of 57 means the tree is 57 decisions deep, not that there are 57 separate trees.
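As a quick sanity check, the fitted object itself stores this value. A minimal sketch, assuming the C50 fit object exposes the tree size through a size element (which appears to be what the print output above reports):
#Inspect the stored tree size directly (assumed element of the C5.0 fit object)
credit_model$size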
summary(credit_model)
##
## Call:
## C5.0.default(x = credit_train[-17], y = credit_train$default)
##
##
## C5.0 [Release 2.07 GPL Edition] Tue Aug 04 21:40:41 2020
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 900 cases (17 attributes) from undefined.data
##
## Decision tree:
##
## checking_balance in {> 200 DM,unknown}: no (412/50)
## checking_balance in {< 0 DM,1 - 200 DM}:
## :...credit_history in {perfect,very good}: yes (59/18)
## credit_history in {critical,good,poor}:
## :...months_loan_duration <= 22:
## :...credit_history = critical: no (72/14)
## : credit_history = poor:
## : :...dependents > 1: no (5)
## : : dependents <= 1:
## : : :...years_at_residence <= 3: yes (4/1)
## : : years_at_residence > 3: no (5/1)
## : credit_history = good:
## : :...savings_balance in {> 1000 DM,500 - 1000 DM}: no (15/1)
## : savings_balance = 100 - 500 DM:
## : :...other_credit = bank: yes (3)
## : : other_credit in {none,store}: no (9/2)
## : savings_balance = unknown:
## : :...other_credit = bank: yes (1)
## : : other_credit in {none,store}: no (21/8)
## : savings_balance = < 100 DM:
## : :...purpose in {business,car0,renovations}: no (8/2)
## : purpose = education:
## : :...checking_balance = < 0 DM: yes (4)
## : : checking_balance = 1 - 200 DM: no (1)
## : purpose = car:
## : :...employment_duration = > 7 years: yes (5)
## : : employment_duration = unemployed: no (4/1)
## : : employment_duration = < 1 year:
## : : :...years_at_residence <= 2: yes (5)
## : : : years_at_residence > 2: no (3/1)
## : : employment_duration = 1 - 4 years:
## : : :...years_at_residence <= 2: yes (2)
## : : : years_at_residence > 2: no (6/1)
## : : employment_duration = 4 - 7 years:
## : : :...amount <= 1680: yes (2)
## : : amount > 1680: no (3)
## : purpose = furniture/appliances:
## : :...job in {management,unskilled}: no (23/3)
## : job = unemployed: yes (1)
## : job = skilled:
## : :...months_loan_duration > 13: [S1]
## : months_loan_duration <= 13:
## : :...housing in {other,own}: no (23/4)
## : housing = rent:
## : :...percent_of_income <= 3: yes (3)
## : percent_of_income > 3: no (2)
## months_loan_duration > 22:
## :...savings_balance = > 1000 DM: no (2)
## savings_balance = 500 - 1000 DM: yes (4/1)
## savings_balance = 100 - 500 DM:
## :...credit_history in {critical,poor}: no (14/3)
## : credit_history = good:
## : :...other_credit = bank: no (1)
## : other_credit in {none,store}: yes (12/2)
## savings_balance = unknown:
## :...checking_balance = 1 - 200 DM: no (17)
## : checking_balance = < 0 DM:
## : :...credit_history = critical: no (1)
## : credit_history in {good,poor}: yes (12/3)
## savings_balance = < 100 DM:
## :...months_loan_duration > 47: yes (21/2)
## months_loan_duration <= 47:
## :...housing = other:
## :...percent_of_income <= 2: no (6)
## : percent_of_income > 2: yes (9/3)
## housing = rent:
## :...other_credit = bank: no (1)
## : other_credit in {none,store}: yes (16/3)
## housing = own:
## :...employment_duration = > 7 years: no (13/4)
## employment_duration = 4 - 7 years:
## :...job in {management,skilled,
## : : unemployed}: yes (9/1)
## : job = unskilled: no (1)
## employment_duration = unemployed:
## :...years_at_residence <= 2: yes (4)
## : years_at_residence > 2: no (3)
## employment_duration = 1 - 4 years:
## :...purpose in {business,car0,education}: yes (7/1)
## : purpose in {furniture/appliances,
## : : renovations}: no (7)
## : purpose = car:
## : :...years_at_residence <= 3: yes (3)
## : years_at_residence > 3: no (3)
## employment_duration = < 1 year:
## :...years_at_residence > 3: yes (5)
## years_at_residence <= 3:
## :...other_credit = bank: no (0)
## other_credit = store: yes (1)
## other_credit = none:
## :...checking_balance = 1 - 200 DM: no (8/2)
## checking_balance = < 0 DM:
## :...job in {management,skilled,
## : unemployed}: yes (2)
## job = unskilled: no (3/1)
##
## SubTree [S1]
##
## employment_duration in {< 1 year,4 - 7 years}: no (4)
## employment_duration in {> 7 years,1 - 4 years,unemployed}: yes (10)
##
##
## Evaluation on training data (900 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 56 133(14.8%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 598 35 (a): class no
## 98 169 (b): class yes
##
##
## Attribute usage:
##
## 100.00% checking_balance
## 54.22% credit_history
## 47.67% months_loan_duration
## 38.11% savings_balance
## 14.33% purpose
## 14.33% housing
## 12.56% employment_duration
## 9.00% job
## 8.67% other_credit
## 6.33% years_at_residence
## 2.22% percent_of_income
## 1.56% dependents
## 0.56% amount
##
##
## Time: 0.0 secs
We can see the model misclassified 133 of the 900 training cases, an error rate of 14.8%.
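We can also reproduce this figure by predicting on the training data itself. A minimal sketch; with the same seed as above, the result should match the 14.8% reported by summary():
#Resubstitution (training) error rate of the single decision tree
train_pred <- predict(credit_model, credit_train)
mean(train_pred != credit_train$default)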
Apply the model to the test dataset using the predict() function.
credit_pred <- predict(credit_model, credit_test)
#Evaluate the model's predictions against the actual default values
CrossTable(credit_test$default, credit_pred,
prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
dnn = c("actual default", "predicted default"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | predicted default
## actual default | no | yes | Row Total |
## ---------------|-----------|-----------|-----------|
## no | 59 | 8 | 67 |
## | 0.590 | 0.080 | |
## ---------------|-----------|-----------|-----------|
## yes | 19 | 14 | 33 |
## | 0.190 | 0.140 | |
## ---------------|-----------|-----------|-----------|
## Column Total | 78 | 22 | 100 |
## ---------------|-----------|-----------|-----------|
##
##
We can calculate the total accuracy as (59 + 14) / 100 = 73%, i.e. an error of 27 out of 100.
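The same accuracy can be computed directly from the prediction vector instead of reading it off the table; a small sketch:
#Overall test accuracy: proportion of predictions matching the actual values
mean(credit_pred == credit_test$default)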
We can boost the accuracy of the decision tree by adding the trials parameter, which indicates the number of separate decision trees to use in the boosted ensemble. The de facto standard is trials = 10.
credit_boost10 <- C5.0(credit_train[-17], credit_train$default, trials = 10)
credit_boost10
##
## Call:
## C5.0.default(x = credit_train[-17], y = credit_train$default, trials = 10)
##
## Classification Tree
## Number of samples: 900
## Number of predictors: 16
##
## Number of boosting iterations: 10
## Average tree size: 47.5
##
## Non-standard options: attempt to group attributes
#As the output would show the details for all 10 trials, we do not run summary() in this chunk
#summary(credit_boost10)
The summary (not shown) reports 34 mistakes on the 900 training examples, an error rate of about 3.8%.
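Rather than printing the full summary, the boosted model's training error can be checked directly. A minimal sketch; with the same seed it should reproduce the 34 mistakes (roughly 3.8%):
#Resubstitution error of the boosted model on the training data
boost_train_pred <- predict(credit_boost10, credit_train)
sum(boost_train_pred != credit_train$default)   #number of training mistakes
mean(boost_train_pred != credit_train$default)  #training error rate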
credit_boost_pred10 <- predict(credit_boost10, credit_test)
#Evaluate the boosted model's predictions against the actual default values
CrossTable(credit_test$default, credit_boost_pred10,
prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
dnn = c("actual default", "predicted default"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | predicted default
## actual default | no | yes | Row Total |
## ---------------|-----------|-----------|-----------|
## no | 62 | 5 | 67 |
## | 0.620 | 0.050 | |
## ---------------|-----------|-----------|-----------|
## yes | 13 | 20 | 33 |
## | 0.130 | 0.200 | |
## ---------------|-----------|-----------|-----------|
## Column Total | 75 | 25 | 100 |
## ---------------|-----------|-----------|-----------|
##
##
We can see the test error came down to 18 out of 100 (5 + 13), better than the first model's 27 out of 100.
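As before, the test error rates can also be computed directly from the prediction vectors; a short sketch comparing the two models:
#Test error rates: single tree vs. boosted model (27/100 and 18/100 in the tables above)
mean(credit_pred != credit_test$default)
mean(credit_boost_pred10 != credit_test$default)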
However, the 13 false negatives are very costly to the bank if those customers default on their loans, so we can add a penalty in the form of a cost matrix specifying how many times more costly each type of error is relative to the others.
matrix_dimensions <- list(c("no", "yes"), c("no", "yes"))
names(matrix_dimensions) <- c("predicted", "actual")
matrix_dimensions
## $predicted
## [1] "no" "yes"
##
## $actual
## [1] "no" "yes"
We set the weights of the matrix as follows, assuming that a loan default costs the bank four times as much as a missed opportunity.
error_cost <- matrix(c(0,1,4,0), nrow = 2,
dimnames = matrix_dimensions)
error_cost
## actual
## predicted no yes
## no 0 4
## yes 1 0
credit_cost <- C5.0(credit_train[-17], credit_train$default, costs = error_cost)
credit_cost
##
## Call:
## C5.0.default(x = credit_train[-17], y = credit_train$default, costs
## = error_cost)
##
## Classification Tree
## Number of samples: 900
## Number of predictors: 16
##
## Tree size: 28
##
## Non-standard options: attempt to group attributes
##
## Cost Matrix:
## actual
## predicted no yes
## no 0 4
## yes 1 0
credit_cost_pred <- predict(credit_cost, credit_test)
#Evaluate the cost-sensitive model's predictions against the actual default values
CrossTable(credit_test$default, credit_cost_pred,
prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
dnn = c("actual default", "predicted default"))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | predicted default
## actual default | no | yes | Row Total |
## ---------------|-----------|-----------|-----------|
## no | 37 | 30 | 67 |
## | 0.370 | 0.300 | |
## ---------------|-----------|-----------|-----------|
## yes | 7 | 26 | 33 |
## | 0.070 | 0.260 | |
## ---------------|-----------|-----------|-----------|
## Column Total | 44 | 56 | 100 |
## ---------------|-----------|-----------|-----------|
##
##
Compared to the previous two models, this model makes more mistakes overall: the error is (7 + 30) = 37 out of 100. However, looking specifically at actual defaults, it correctly identifies 26 of the 33 defaulters (79%), whereas the boosted model identified only 20 of 33 (about 61%).
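To make this comparison explicit, we can compute the share of actual defaulters each model catches (the sensitivity for the "yes" class). A minimal sketch using a small helper function defined here for illustration:
#Proportion of actual defaulters ("yes") that a model predicts as "yes"
sensitivity_yes <- function(pred, actual) {
  mean(pred[actual == "yes"] == "yes")
}
sensitivity_yes(credit_boost_pred10, credit_test$default)  #boosted model: 20/33, about 0.61
sensitivity_yes(credit_cost_pred, credit_test$default)     #cost-sensitive model: 26/33, about 0.79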