For complete html functions, please visit www.rpubs.com/jasonchanhku/loans
library(fBasics) #summary statistics
library(plotly) #data visulization
library(dplyr) #count function
library(C50) #C50 Decision Tree algorithm
library(gmodels) #CrossTable()
library(caret) #Confusion Matrix
library(rmarkdown)
The global financial crisis of 2007-2008 highlighted the importance of transparency and rigor in banking practices. As the availability of credit was limited, banks tightened their lending systems and turned to machine learning to more accurately identify risky loans.
Therefore, objective of this project is to develop a simple credit approval model using the C5.0 decision tree algorithm.
The target variable of interest is default status.
The data is obtained from donated to the UCI Machine Learning Data Repository (http://archive.ics.uci.edu/ml) by Hans Hofmann of the University of Hamburg. The dataset contains information on loans obtained from a credit agency in Germany.
The credit dataset includes 1,000 examples on loans, plus a set of numeric and nominal features indicating the characteristics of the loan and the loan applicant. A class variable indicates whether the loan went into default.
credit <- read.csv(file = "Machine-Learning-with-R-datasets-master/credit.csv")
credit$default[credit$default == 1] <- "no"
credit$default[credit$default == 2] <- "yes"
credit$default <- as.factor(credit$default)
Below is a partial table preview of the dataset:
knitr::kable(head(credit), caption = "Credit Information Dataset")
checking_balance | months_loan_duration | credit_history | purpose | amount | savings_balance | employment_length | installment_rate | personal_status | other_debtors | residence_history | property | age | installment_plan | housing | existing_credits | job | dependents | telephone | foreign_worker | default |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
< 0 DM | 6 | critical | radio/tv | 1169 | unknown | > 7 yrs | 4 | single male | none | 4 | real estate | 67 | none | own | 2 | skilled employee | 1 | yes | yes | no |
1 - 200 DM | 48 | repaid | radio/tv | 5951 | < 100 DM | 1 - 4 yrs | 2 | female | none | 2 | real estate | 22 | none | own | 1 | skilled employee | 1 | none | yes | yes |
unknown | 12 | critical | education | 2096 | < 100 DM | 4 - 7 yrs | 2 | single male | none | 3 | real estate | 49 | none | own | 1 | unskilled resident | 2 | none | yes | no |
< 0 DM | 42 | repaid | furniture | 7882 | < 100 DM | 4 - 7 yrs | 2 | single male | guarantor | 4 | building society savings | 45 | none | for free | 1 | skilled employee | 2 | none | yes | no |
< 0 DM | 24 | delayed | car (new) | 4870 | < 100 DM | 1 - 4 yrs | 3 | single male | none | 4 | unknown/none | 53 | none | for free | 2 | skilled employee | 2 | none | yes | yes |
unknown | 36 | repaid | education | 9055 | unknown | 1 - 4 yrs | 2 | single male | none | 4 | unknown/none | 35 | none | for free | 1 | unskilled resident | 2 | yes | yes | no |
The general structure of the dataset:
str(credit)
## 'data.frame': 1000 obs. of 21 variables:
## $ checking_balance : Factor w/ 4 levels "1 - 200 DM","< 0 DM",..: 2 1 4 2 2 4 4 1 4 1 ...
## $ months_loan_duration: int 6 48 12 42 24 36 24 36 12 30 ...
## $ credit_history : Factor w/ 5 levels "critical","delayed",..: 1 5 1 5 2 5 5 5 5 1 ...
## $ purpose : Factor w/ 10 levels "business","car (new)",..: 8 8 5 6 2 5 6 3 8 2 ...
## $ amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ savings_balance : Factor w/ 5 levels "101 - 500 DM",..: 5 3 3 3 3 5 2 3 4 3 ...
## $ employment_length : Factor w/ 5 levels "0 - 1 yrs","1 - 4 yrs",..: 4 2 3 3 2 2 4 2 3 5 ...
## $ installment_rate : int 4 2 2 2 3 2 3 2 2 4 ...
## $ personal_status : Factor w/ 4 levels "divorced male",..: 4 2 4 4 4 4 4 4 1 3 ...
## $ other_debtors : Factor w/ 3 levels "co-applicant",..: 3 3 3 2 3 3 3 3 3 3 ...
## $ residence_history : int 4 2 3 4 4 4 4 2 4 2 ...
## $ property : Factor w/ 4 levels "building society savings",..: 3 3 3 1 4 4 1 2 3 2 ...
## $ age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ installment_plan : Factor w/ 3 levels "bank","none",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ housing : Factor w/ 3 levels "for free","own",..: 2 2 2 1 1 1 2 3 2 2 ...
## $ existing_credits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ job : Factor w/ 4 levels "mangement self-employed",..: 2 2 4 2 2 4 2 1 4 1 ...
## $ dependents : int 1 1 2 2 2 2 1 1 1 1 ...
## $ telephone : Factor w/ 2 levels "none","yes": 2 1 1 1 1 2 1 2 1 1 ...
## $ foreign_worker : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ default : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 1 1 1 2 ...
Below are some useful tables that offers summary statistics from the loan features such as checking, savings, duration, amount, and default.
knitr::kable(data.frame(table(credit$checking_balance)), Caption = "Checking Balance", col.names = c("Checking Balance", "Frequency") )
Checking Balance | Frequency |
---|---|
1 - 200 DM | 269 |
< 0 DM | 274 |
> 200 DM | 63 |
unknown | 394 |
knitr::kable(data.frame(table(credit$savings_balance)), Caption = "Savings Balance", col.names = c("Savings Balance", "Frequency"))
Savings Balance | Frequency |
---|---|
101 - 500 DM | 103 |
501 - 1000 DM | 63 |
< 100 DM | 603 |
> 1000 DM | 48 |
unknown | 183 |
knitr::kable(basicStats(credit$months_loan_duration), Caption = "Loan Duration", digits = 2, col.names = c("Value"))
Value | |
---|---|
nobs | 1000.00 |
NAs | 0.00 |
Minimum | 4.00 |
Maximum | 72.00 |
1. Quartile | 12.00 |
3. Quartile | 24.00 |
Mean | 20.90 |
Median | 18.00 |
Sum | 20903.00 |
SE Mean | 0.38 |
LCL Mean | 20.15 |
UCL Mean | 21.65 |
Variance | 145.42 |
Stdev | 12.06 |
Skewness | 1.09 |
Kurtosis | 0.90 |
knitr::kable(basicStats(credit$amount), Caption = "Loan Amount", digits = 2, col.names = c("Amount"))
Amount | |
---|---|
nobs | 1000.00 |
NAs | 0.00 |
Minimum | 250.00 |
Maximum | 18424.00 |
1. Quartile | 1365.50 |
3. Quartile | 3972.25 |
Mean | 3271.26 |
Median | 2319.50 |
Sum | 3271258.00 |
SE Mean | 89.26 |
LCL Mean | 3096.09 |
UCL Mean | 3446.42 |
Variance | 7967843.47 |
Stdev | 2822.74 |
Skewness | 1.94 |
Kurtosis | 4.25 |
knitr::kable(data.frame(table(credit$default)), col.names = c("Default", "Amount"))
Default | Amount |
---|---|
no | 700 |
yes | 300 |
The data visualization will focus on identifying patterns of key features which clearly distinguishes an applicant’s default status.
plot_ly(credit, y = ~age, color = ~default, type = "box")
As visualized, default applicants are of younger age. Perhaps the younger age have less awareness about financial planning.
plot_ly(credit, y = ~amount, color = ~default, type = "box")
As visualized, default applicants tend to have higher loan amounts. Risky applicants tend to loan huge amounts perhaps due to underestimation.
plot_ly(credit, y=~months_loan_duration, color = ~default, type = "box")
As visualized, default applicants tend to have longer loan durations perhaps due to their cash flows.
credit %>% count(default, job) %>% plot_ly(x = ~default, y = ~n, color = ~job, type = "bar")
As visualized, defaulted applicants are less skilled. Perhaps this reflects their financial planning awareness.
credit %>% count(default, checking_balance) %>% plot_ly(x = ~default, y = ~n, color = ~checking_balance, type = "bar")
As visualized, defaulted applicants have lower checking balance.
credit %>% count(default, savings_balance) %>% plot_ly(x = ~default, y = ~n, color = ~savings_balance, type = "bar")
As visualized, defaulted applicants have lower savings balance.
credit %>% count(default, purpose) %>% plot_ly(x = ~default, y = ~n, color = ~purpose, type = "bar")
As visualized, of the defaulted applicants, most of them take loans to purchase new cars.
From the data preview and data visualization, we identify the following characteristics that distinguishes an applicant who is likely to default:
The training and test datasets will be split into a portion of 90% to 10%, leaving 900 values for training and 100 for test. However, as the dataset is not sorted in random order, this could cause bias if for example the data is sorted by loan amounts ascending. The model will train on small loans and test on big loans. Hence, random sampling is required.
Random sampling is applied as following (please click show code):
# To ensure results can be reproduced
set.seed(123)
# Sample 900 integers randomly from 1000
train_sample <- sample(1000,900)
#Subset randomly into training and test
credit_train <- credit[train_sample ,]
credit_test <- credit[-train_sample , ]
Let’s check if both the train and test set have rather even split of the class levels. This is to prevent training bias.
knitr::kable(data.frame(prop.table(table(credit_train$default))), caption = "Training Set Proportion", col.names = c("Default Status", "Proportion"), digits = 3)
Default Status | Proportion |
---|---|
no | 0.703 |
yes | 0.297 |
knitr::kable(data.frame(prop.table(table(credit_test$default))), caption = "Test Set Proportion", col.names = c("Default Status", "Proportion"), digits = 4)
Default Status | Proportion |
---|---|
no | 0.67 |
yes | 0.33 |
Since we have all the data we need, the C5.0 model can now be trained on the training set. To use the training set, the class labels must be removed from the training set and plugged into separately as labels.
The summary of the model is as follows:
#building the classifier
credit_model <- C5.0(credit_train[-21], credit_train$default)
credit_model
##
## Call:
## C5.0.default(x = credit_train[-21], y = credit_train$default)
##
## Classification Tree
## Number of samples: 900
## Number of predictors: 20
##
## Tree size: 54
##
## Non-standard options: attempt to group attributes
summary(credit_model)
##
## Call:
## C5.0.default(x = credit_train[-21], y = credit_train$default)
##
##
## C5.0 [Release 2.07 GPL Edition] Wed Nov 16 23:21:34 2016
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 900 cases (21 attributes) from undefined.data
##
## Decision tree:
##
## checking_balance in {> 200 DM,unknown}: no (412/50)
## checking_balance in {1 - 200 DM,< 0 DM}:
## :...other_debtors = guarantor:
## :...months_loan_duration > 36: yes (4/1)
## : months_loan_duration <= 36:
## : :...installment_plan in {none,stores}: no (24)
## : installment_plan = bank:
## : :...purpose = car (new): yes (3)
## : purpose in {business,car (used),domestic appliances,education,
## : furniture,others,radio/tv,repairs,
## : retraining}: no (7/1)
## other_debtors in {co-applicant,none}:
## :...credit_history = critical: no (102/30)
## credit_history = fully repaid: yes (27/6)
## credit_history = fully repaid this bank:
## :...other_debtors = co-applicant: no (2)
## : other_debtors = none: yes (26/8)
## credit_history in {delayed,repaid}:
## :...savings_balance in {501 - 1000 DM,> 1000 DM}: no (19/3)
## savings_balance = 101 - 500 DM:
## :...other_debtors = co-applicant: yes (3)
## : other_debtors = none:
## : :...personal_status in {divorced male,
## : : married male}: yes (6/1)
## : personal_status = female:
## : :...installment_rate <= 3: no (4/1)
## : : installment_rate > 3: yes (4)
## : personal_status = single male:
## : :...age <= 41: no (15/2)
## : age > 41: yes (2)
## savings_balance = unknown:
## :...credit_history = delayed: no (8)
## : credit_history = repaid:
## : :...foreign_worker = no: no (2)
## : foreign_worker = yes:
## : :...checking_balance = < 0 DM:
## : :...telephone = none: yes (11/2)
## : : telephone = yes:
## : : :...amount <= 5045: no (5/1)
## : : amount > 5045: yes (2)
## : checking_balance = 1 - 200 DM:
## : :...residence_history > 3: no (9)
## : residence_history <= 3: [S1]
## savings_balance = < 100 DM:
## :...months_loan_duration > 39:
## :...residence_history <= 1: no (2)
## : residence_history > 1: yes (19/1)
## months_loan_duration <= 39:
## :...purpose in {car (new),retraining}: yes (47/16)
## purpose in {domestic appliances,others}: no (3)
## purpose = car (used):
## :...amount <= 8086: no (9/1)
## : amount > 8086: yes (5)
## purpose = education:
## :...checking_balance = 1 - 200 DM: no (2)
## : checking_balance = < 0 DM: yes (5)
## purpose = repairs:
## :...residence_history <= 3: yes (4/1)
## : residence_history > 3: no (3)
## purpose = business:
## :...credit_history = delayed: yes (2)
## : credit_history = repaid:
## : :...age <= 34: no (5)
## : age > 34: yes (2)
## purpose = radio/tv:
## :...employment_length in {0 - 1 yrs,
## : : unemployed}: yes (14/5)
## : employment_length = 4 - 7 yrs: no (3)
## : employment_length = > 7 yrs:
## : :...amount <= 932: yes (2)
## : : amount > 932: no (7)
## : employment_length = 1 - 4 yrs:
## : :...months_loan_duration <= 15: no (6)
## : months_loan_duration > 15:
## : :...amount <= 3275: yes (7)
## : amount > 3275: no (2)
## purpose = furniture:
## :...residence_history <= 1: no (8/1)
## residence_history > 1:
## :...installment_plan in {bank,stores}: no (3/1)
## installment_plan = none:
## :...telephone = yes: yes (7/1)
## telephone = none:
## :...months_loan_duration > 27: yes (3)
## months_loan_duration <= 27: [S2]
##
## SubTree [S1]
##
## property in {building society savings,unknown/none}: yes (4)
## property = other: no (6)
## property = real estate:
## :...job = skilled employee: yes (2)
## job in {mangement self-employed,unemployed non-resident,
## unskilled resident}: no (2)
##
## SubTree [S2]
##
## checking_balance = 1 - 200 DM: yes (5/2)
## checking_balance = < 0 DM:
## :...property in {building society savings,real estate,unknown/none}: no (8)
## property = other:
## :...installment_rate <= 1: no (2)
## installment_rate > 1: yes (4)
##
##
## Evaluation on training data (900 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 54 135(15.0%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 589 44 (a): class no
## 91 176 (b): class yes
##
##
## Attribute usage:
##
## 100.00% checking_balance
## 54.22% other_debtors
## 50.00% credit_history
## 32.56% savings_balance
## 25.22% months_loan_duration
## 19.78% purpose
## 10.11% residence_history
## 7.33% installment_plan
## 5.22% telephone
## 4.78% foreign_worker
## 4.56% employment_length
## 4.33% amount
## 3.44% personal_status
## 3.11% property
## 2.67% age
## 1.56% installment_rate
## 0.44% job
##
##
## Time: 0.0 secs
First few lines of the summary reads the following:
Now it is time to put our test data to the test after training our model. After that, an evaluation is done to check its accuracy rate.
#making prediction
credit_pred <- predict(credit_model, credit_test)
#Evaluation
confusionMatrix(credit_pred, credit_test$default, dnn = c("Predicted", "Actual"))
## Confusion Matrix and Statistics
##
## Actual
## Predicted no yes
## no 60 19
## yes 7 14
##
## Accuracy : 0.74
## 95% CI : (0.6427, 0.8226)
## No Information Rate : 0.67
## P-Value [Acc > NIR] : 0.08146
##
## Kappa : 0.3523
## Mcnemar's Test P-Value : 0.03098
##
## Sensitivity : 0.8955
## Specificity : 0.4242
## Pos Pred Value : 0.7595
## Neg Pred Value : 0.6667
## Prevalence : 0.6700
## Detection Rate : 0.6000
## Detection Prevalence : 0.7900
## Balanced Accuracy : 0.6599
##
## 'Positive' Class : no
##
Turns out that the accuracy is only 74% and there is definitely room for improvement.
There are a couple of simple ways to adjust the C5.0 algorithm that may help to improve the performance of the model, both overall and for the more costly type of mistakes.
Boosting is rooted in the notion that by combining a number of weak performing learners, you can create a team that is much stronger than any of the learners alone.
To add bossting to the model, an addtional parameter trials is added. It acts as an upper limit and stop adding trees if accuracy does not improve. The de facto standard is 10.
The evaluation is as follows:
#making the classifier
credit_boost10 <- C5.0(credit_train[-21], credit_train$default, trials = 10)
#making the prediction
credit_boost_pred10 <- predict(credit_boost10, credit_test)
#Evaluation
confusionMatrix(credit_boost_pred10, credit_test$default)
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 60 17
## yes 7 16
##
## Accuracy : 0.76
## 95% CI : (0.6643, 0.8398)
## No Information Rate : 0.67
## P-Value [Acc > NIR] : 0.03281
##
## Kappa : 0.4121
## Mcnemar's Test P-Value : 0.06619
##
## Sensitivity : 0.8955
## Specificity : 0.4848
## Pos Pred Value : 0.7792
## Neg Pred Value : 0.6957
## Prevalence : 0.6700
## Detection Rate : 0.6000
## Detection Prevalence : 0.7700
## Balanced Accuracy : 0.6902
##
## 'Positive' Class : no
##
After adaptive boosting, the accuracy of the model improved to 76%.