Introduction

The global financial crisis of 2007-2008 highlighted the importance of transparency and rigor in banking practices. As the availability of credit has been limited, banks are increasingly tightening their lending systems and turning to machine learning to more accurately identify risky loans. In this project we will develop a simple credit approval model using C5.0 decision trees. We will also see how the results of the model can be tuned to minimize errors that result in a financial loss for the institution.

Decision Trees

As you might intuit from the name, decision tree learners build a model in the form of a tree structure. The model itself comprises a series of logical decisions, similar to a flowchart, with decision nodes that indicate a decision to be made on an attribute. These split into branches that indicate the decision’s choices. The tree is terminated by leaf nodes (also known as terminal nodes) that denote the result of following a combination of decisions.

Data that are to be classified begin at the root node, where they are passed through the various decisions in the tree according to the values of their features. The path that the data take funnels each record into a leaf node, which assigns it a predicted class.

Decision trees are built using a heuristic called recursive partitioning. This approach is generally known as divide and conquer because it uses the feature values to split the data into smaller and smaller subsets of similar classes. The algorithm continues to divide and conquer the nodes, choosing the best candidate feature each time, until a stopping criterion is reached.

The C5.0 Decision Tree Algorithm

There are numerous implementations of decision trees, but one of the most well-known is the C5.0 algorithm. This algorithm was developed by computer scientist J. Ross Quinlan as an improved version of his prior algorithm, C4.5, which itself is an improvement over his ID3 (Iterative Dichotomiser 3) algorithm.

Choosing The Best Split

The first challenge that a decision tree will face is to identify which feature to split upon. If the segments of data contain only a single class, they are considered pure. There are many different measurements of purity for identifying splitting criteria. C5.0 uses entropy for measuring purity. The entropy of a sample of data indicates how mixed the class values are; the minimum value of 0 indicates that the sample is completely homogeneous, while for two classes a value of 1 indicates the maximum amount of disorder (with c classes the maximum is log2(c)). The definition of entropy is specified by:

$$\mathrm{Entropy}(S) = \sum_{i=1}^{c} -p_i \log_2(p_i)$$

In the entropy formula, for a given segment of data (S), the term c refers to the number of different class levels, and p_i refers to the proportion of values falling into class level i.
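As a worked illustration of the entropy definition, the following sketch computes entropy from a vector of class labels. The helper name entropy() and the toy label vectors are assumptions made for this example; they are not part of the C50 package.

```r
# Entropy of a vector of class labels, following the definition above.
# The function name `entropy` is hypothetical, chosen for this sketch.
entropy <- function(labels) {
  p <- prop.table(table(labels))  # proportion of values in each class level
  p <- p[p > 0]                   # drop empty levels so that log2(0) never occurs
  -sum(p * log2(p))
}

# A segment with a 70/30 class mix, similar to the default rate in this dataset
mixed <- c(rep("No", 700), rep("Yes", 300))
entropy(mixed)             # about 0.881

# A pure segment has zero entropy; an even two-class split has entropy 1
entropy(rep("No", 10))
entropy(c("No", "Yes"))
```

A split is attractive when the segments it produces have lower entropy than the segment it started from.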

Identifying Risky Bank Loans Using C5.0 Decision Trees

Decision trees are widely used in the banking industry due to their high accuracy and ability to formulate a statistical model in plain language. Since government organizations in many countries carefully monitor lending practices, executives must be able to explain why one applicant was rejected for a loan while others were approved. This information is also useful for customers hoping to determine why their credit rating is unsatisfactory.

Step 1 Collecting Data

The idea behind our credit model is to identify factors that make an applicant at higher risk of default. Therefore, we need to obtain data on a large number of past bank loans and whether the loan went into default, as well as information about the applicant. Data with these characteristics are available in a dataset donated to the UCI Machine Learning Data Repository (http://archive.ics.uci.edu/ml) by Hans Hofmann of the University of Hamburg. They represent loans obtained from a credit agency in Germany.

The credit dataset includes 1,000 examples of loans, plus a combination of numeric and nominal features indicating characteristics of the loan and the loan applicant. A class variable indicates whether the loan went into default. Let’s see if we can determine any patterns that predict this outcome.

Step 2 Exploring and Preparing The Data

We will import the data using the read.csv() function. We leave the stringsAsFactors option at its default value. In R 4.0.0 and later that default is FALSE, so the nominal features are imported as character vectors, as the str() output shows; we convert the ones we need to factors later. We’ll also look at the structure of the credit data frame we created.

credit <- read.csv("D:/Projects/Bank_loans/credit.csv")   ##Importing data
str(credit)                                               ##Having a glance of all the variables and their types

## 'data.frame':    1000 obs. of  21 variables:
##  $ checking_balance    : chr  "< 0 DM" "1 - 200 DM" "unknown" "< 0 DM" ...
##  $ months_loan_duration: int  6 48 12 42 24 36 24 36 12 30 ...
##  $ credit_history      : chr  "critical" "repaid" "critical" "repaid" ...
##  $ purpose             : chr  "radio/tv" "radio/tv" "education" "furniture" ...
##  $ amount              : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ savings_balance     : chr  "unknown" "< 100 DM" "< 100 DM" "< 100 DM" ...
##  $ employment_length   : chr  "> 7 yrs" "1 - 4 yrs" "4 - 7 yrs" "4 - 7 yrs" ...
##  $ installment_rate    : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ personal_status     : chr  "single male" "female" "single male" "single male" ...
##  $ other_debtors       : chr  "none" "none" "none" "guarantor" ...
##  $ residence_history   : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ property            : chr  "real estate" "real estate" "real estate" "building society savings" ...
##  $ age                 : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ installment_plan    : chr  "none" "none" "none" "none" ...
##  $ housing             : chr  "own" "own" "own" "for free" ...
##  $ existing_credits    : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ default             : int  1 2 1 1 2 1 1 1 1 2 ...
##  $ dependents          : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ telephone           : chr  "yes" "none" "none" "none" ...
##  $ foreign_worker      : chr  "yes" "yes" "yes" "yes" ...
##  $ job                 : chr  "skilled employee" "skilled employee" "unskilled resident" "skilled employee" ...

We see the expected 1,000 observations and 21 variables (20 features plus the default class variable), which are a combination of character and integer data types.
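Because the str() output shows the nominal features as chr rather than Factor, one option is to convert every character column to a factor in a single step. The sketch below does this on a small made-up data frame named toy (an assumption for illustration, with values borrowed from the output above); the same two lines could be applied to credit.

```r
# Toy stand-in for the credit data frame (hypothetical, for illustration only)
toy <- data.frame(
  checking_balance = c("< 0 DM", "1 - 200 DM", "unknown", "< 0 DM"),
  amount           = c(1169L, 5951L, 2096L, 7882L),
  stringsAsFactors = FALSE
)

# Find the character columns and convert each of them to a factor
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)

str(toy)  # checking_balance is now a factor; amount is still an integer
```

Alternatively, passing stringsAsFactors = TRUE to read.csv() performs the conversion at import time.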

Let’s take a look at the table() output for a couple of loan features that seem likely to predict a default. The checking_balance and savings_balance features indicate the applicant’s checking and savings account balance, and are recorded as categorical variables. We also need to convert the default variable from numeric to a factor, with 1 representing “No” (no default) and 2 representing “Yes” (default), and we check its structure to confirm the conversion. Finally, we convert the checking_balance column from character to a factor.

table(credit$checking_balance)  ## Frequency table of checking account balance

## 
##     < 0 DM   > 200 DM 1 - 200 DM    unknown 
##        274         63        269        394

table(credit$savings_balance)   ## Frequency table of savings account balance

## 
##      < 100 DM     > 1000 DM  101 - 500 DM 501 - 1000 DM       unknown 
##           603            48           103            63           183

credit$default <- as.factor(credit$default)  ## Converting the numeric variable default to a categorical variable
levels(credit$default)[levels(credit$default) == "1"] <- "No"
levels(credit$default)[levels(credit$default) == "2"] <- "Yes"
str(credit$default)

##  Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 1 1 1 2 ...
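The same recoding can be written in one step with the levels and labels arguments of factor(). The sketch below uses a short made-up vector, default_codes, holding the first ten values shown in the str() output; the variable names are assumptions for this example.

```r
# First ten default codes from the str() output above (1 = no default, 2 = default)
default_codes <- c(1, 2, 1, 1, 2, 1, 1, 1, 1, 2)

# Map code 1 to "No" and code 2 to "Yes" in a single call
default_factor <- factor(default_codes, levels = c(1, 2), labels = c("No", "Yes"))

str(default_factor)
table(default_factor)
```

Stating the levels explicitly also keeps the "No"/"Yes" order fixed even if one of the codes happens to be absent from the data.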

credit$checking_balance <- as.factor(credit$checking_balance)  ## The factor levels are already the original labels, so no relabeling is needed
str(credit$checking_balance)

##  Factor w/ 4 levels "< 0 DM","> 200 DM",..: 1 3 4 1 1 4 4 3 4 3 ...

Since the loan data was obtained from Germany, the currency is recorded in Deutsche Marks (DM). It seems like a safe assumption that larger checking and savings account balances should be related to a reduced chance of loan default.

Some of the loan’s features are numeric, such as its term (months_loan_duration), and the amount of credit requested (amount).

summary(credit$months_loan_duration) ##Summary of the loan duration

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     4.0    12.0    18.0    20.9    24.0    72.0

summary(credit$amount)               ##Summary of the amount of loan

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     250    1366    2320    3271    3972   18424

The loan amounts ranged from 250 DM to 18,424 DM across terms of 4 to 72 months, with a median duration of 18 months and a median amount of 2,320 DM. The default variable indicates whether the loan applicant was unable to meet the agreed payment terms and went into default. A total of 30 percent of the loans went into default.

table(credit$default)

## 
##  No Yes 
## 700 300

table(credit$checking_balance)

## 
##     < 0 DM   > 200 DM 1 - 200 DM    unknown 
##        274         63        269        394

A high rate of default is undesirable for a bank because it means that the bank is unlikely to fully recover its investment. If we are successful, our model will identify applicants that are likely to default, so that this number can be reduced.

Data Preparation Creating random training and test datasets

Now we will split our data into two portions: a training dataset to build the decision tree and a test dataset to evaluate the performance of the model on new data. We will use 90 percent of the data for training and 10 percent for testing, which will provide us with 100 records to simulate new applicants.

Suppose that the bank had sorted the data by the loan amount, with the largest loans at the end of the file. If we use the first 90 percent for training and the remaining 10 percent for testing, we would be building a model on only the small loans and testing the model on the big loans. Obviously, this could be problematic.

We’ll solve this problem by randomly ordering our credit data frame prior to splitting. The order() function is used to rearrange a list of items in ascending or descending order. If we combine this with a function to generate a list of random numbers, we can generate a randomly-ordered list. For random number generation, we’ll use the runif() function, which by default generates a sequence of random numbers between 0 and 1.

The following command creates a randomly-ordered credit data frame. The set.seed() function is used to generate random numbers in a predefined sequence, starting from a position known as a seed (set here to the arbitrary value 12345). It may seem that this defeats the purpose of generating random numbers, but there is a good reason for doing it this way. The set.seed() function ensures that if the analysis is repeated, an identical result is obtained.

set.seed(12345)
credit_rand <- credit[order(runif(1000)), ]

The runif(1000) command generates a list of 1,000 random numbers. We need exactly 1,000 random numbers because there are 1,000 records in the credit data frame. The order() function then returns a vector of numbers indicating the sorted position of the 1,000 random numbers. We then use these positions to select rows in the credit data frame and store in a new data frame named credit_rand.
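To make the shuffling step concrete, the sketch below repeats it on a ten-row made-up data frame named toy (an assumption for illustration). It checks that order(runif(n)) returns every row position exactly once, and shows sample() as an equivalent base R idiom.

```r
set.seed(12345)

# Toy data frame standing in for credit (hypothetical values)
n <- 10
toy <- data.frame(id = 1:n, amount = seq(100, 1000, by = 100))

# order() applied to n random numbers yields a random permutation of 1..n
idx <- order(runif(n))
toy_rand <- toy[idx, ]

# Every row appears exactly once, so only the order has changed
all(sort(idx) == 1:n)
identical(sort(toy_rand$amount), sort(toy$amount))

# sample() with only a size argument produces a permutation in one call
toy_rand2 <- toy[sample(nrow(toy)), ]
```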

To confirm that we have the same data frame sorted differently, we’ll compare values on the amount feature across the two data frames. The following code shows the summary statistics.

summary(credit$amount)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     250    1366    2320    3271    3972   18424

summary(credit_rand$amount)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     250    1366    2320    3271    3972   18424

## Comparing the summaries of the variable amount before and after randomization.

We can use the head() function to examine the first few values in each data frame.

head(credit$amount)

## [1] 1169 5951 2096 7882 4870 9055

head(credit_rand$amount)

## [1] 1199 2576 1103 4020 1501 1568

## We compare the first few values in each dataframe.

Since the summary statistics are identical while the first few values are different, this suggests that our random shuffle worked correctly. Now, we can split into training (90 percent or 900 records), and test data (10 percent or 100 records).

credit_train <- credit_rand[1:900, ]
credit_test <- credit_rand[901:1000, ]

If all went well, we should have about 30 percent of defaulted loans in each of the datasets.

prop.table(table(credit_train$default))

## 
##        No       Yes 
## 0.7022222 0.2977778

prop.table(table(credit_test$default))

## 
##   No  Yes 
## 0.68 0.32

## We compare the default percentage in each set to ensure that the proportions are roughly equal and similar to the original dataset.

Step 3 Training a Model on The Data

We will use the C5.0 algorithm in the C50 package for training our decision tree model.

The following syntax box lists some of the most commonly used commands for building decision trees. The ?C5.0Control command displays the help page with more details on how to fine-tune the algorithm.

Given below is the syntax of the C5.0 algorithm. (Here we use the formula syntax instead of passing the training data and class vector directly, since we are dealing with factor-level data.)

Syntax of C5.0 Algorithm

library(C50)
credit_model <- C5.0(default ~ ., credit_train)

The credit_model object now contains a C5.0 decision tree object. We can see some basic data about the tree by typing its name.

credit_model

## 
## Call:
## C5.0.formula(formula = default ~ ., data = credit_train)
## 
## Classification Tree
## Number of samples: 900 
## Number of predictors: 20 
## 
## Tree size: 57 
## 
## Non-standard options: attempt to group attributes

The preceding text shows some simple facts about the tree, including the function call that generated it, the number of features (that is, predictors), and examples (that is, samples) used to grow the tree. Also listed is the tree size of 57, which is the number of leaf nodes in the tree (counting the leaves in the summary output confirms this). To see the decisions, we can call the summary() function on the model.

summary(credit_model)

## 
## Call:
## C5.0.formula(formula = default ~ ., data = credit_train)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Thu Oct 22 15:33:23 2020
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 900 cases (21 attributes) from undefined.data
## 
## Decision tree:
## 
## checking_balance = unknown: No (358/44)
## checking_balance in {< 0 DM,> 200 DM,1 - 200 DM}:
## :...foreign_worker = no:
##     :...installment_plan in {none,stores}: No (17/1)
##     :   installment_plan = bank:
##     :   :...residence_history <= 3: Yes (2)
##     :       residence_history > 3: No (2)
##     foreign_worker = yes:
##     :...credit_history in {fully repaid,
##         :                  fully repaid this bank}: Yes (61/20)
##         credit_history in {critical,repaid,delayed}:
##         :...months_loan_duration <= 11: No (76/13)
##             months_loan_duration > 11:
##             :...savings_balance = > 1000 DM: No (13)
##                 savings_balance in {< 100 DM,101 - 500 DM,501 - 1000 DM,
##                 :                   unknown}:
##                 :...checking_balance = > 200 DM:
##                     :...dependents > 1: Yes (3)
##                     :   dependents <= 1:
##                     :   :...credit_history in {repaid,delayed}: No (23/3)
##                     :       credit_history = critical:
##                     :       :...amount <= 2337: Yes (3)
##                     :           amount > 2337: No (6)
##                     checking_balance = < 0 DM:
##                     :...other_debtors = guarantor:
##                     :   :...credit_history = critical: Yes (1)
##                     :   :   credit_history in {repaid,delayed}: No (11/1)
##                     :   other_debtors in {none,co-applicant}:
##                     :   :...job = mangement self-employed: No (26/6)
##                     :       job in {unskilled resident,skilled employee,
##                     :       :       unemployed non-resident}:
##                     :       :...purpose in {radio/tv,others,repairs,
##                     :           :           domestic appliances,
##                     :           :           retraining}: Yes (33/10)
##                     :           purpose = education: [S1]
##                     :           purpose = business:
##                     :           :...job in {unskilled resident,
##                     :           :   :       unemployed non-resident}: No (3)
##                     :           :   job = skilled employee: Yes (3)
##                     :           purpose = car (new): [S2]
##                     :           purpose = car (used):
##                     :           :...amount > 6229: Yes (5)
##                     :           :   amount <= 6229: [S3]
##                     :           purpose = furniture:
##                     :           :...months_loan_duration > 27: Yes (9/1)
##                     :               months_loan_duration <= 27: [S4]
##                     checking_balance = 1 - 200 DM:
##                     :...savings_balance = unknown: No (34/6)
##                         savings_balance in {< 100 DM,101 - 500 DM,
##                         :                   501 - 1000 DM}:
##                         :...months_loan_duration > 45: Yes (11/1)
##                             months_loan_duration <= 45:
##                             :...installment_plan = stores:
##                                 :...age <= 35: Yes (4)
##                                 :   age > 35: No (2)
##                                 installment_plan = bank:
##                                 :...residence_history <= 1: No (3)
##                                 :   residence_history > 1:
##                                 :   :...existing_credits <= 1: Yes (5)
##                                 :       existing_credits > 1:
##                                 :       :...installment_rate > 2: Yes (3)
##                                 :           installment_rate <= 2: [S5]
##                                 installment_plan = none:
##                                 :...other_debtors = guarantor: No (7/1)
##                                     other_debtors = co-applicant: Yes (3/1)
##                                     other_debtors = none:
##                                     :...employment_length = 4 - 7 yrs:
##                                         :...age <= 41: No (16)
##                                         :   age > 41: Yes (3/1)
##                                         employment_length in {> 7 yrs,
##                                         :                     1 - 4 yrs,
##                                         :                     0 - 1 yrs,
##                                         :                     unemployed}:
##                                         :...amount > 7980: Yes (7)
##                                             amount <= 7980:
##                                             :...amount > 4746: No (10)
##                                                 amount <= 4746: [S6]
## 
## SubTree [S1]
## 
## savings_balance in {< 100 DM,101 - 500 DM,501 - 1000 DM}: Yes (6)
## savings_balance = unknown: No (2)
## 
## SubTree [S2]
## 
## savings_balance = 101 - 500 DM: No (1)
## savings_balance in {501 - 1000 DM,unknown}: Yes (4)
## savings_balance = < 100 DM:
## :...personal_status in {single male,female,divorced male}: Yes (29/6)
##     personal_status = married male: No (2)
## 
## SubTree [S3]
## 
## job = unskilled resident: Yes (1)
## job in {skilled employee,unemployed non-resident}: No (8/1)
## 
## SubTree [S4]
## 
## employment_length in {> 7 yrs,4 - 7 yrs}: No (7/1)
## employment_length = unemployed: Yes (2)
## employment_length = 0 - 1 yrs:
## :...job = unskilled resident: Yes (1)
## :   job in {skilled employee,unemployed non-resident}: No (4)
## employment_length = 1 - 4 yrs:
## :...property in {building society savings,unknown/none}: No (5)
##     property in {other,real estate}:
##     :...residence_history <= 2: No (4/1)
##         residence_history > 2: Yes (5)
## 
## SubTree [S5]
## 
## other_debtors in {none,guarantor}: No (3)
## other_debtors = co-applicant: Yes (1)
## 
## SubTree [S6]
## 
## housing = for free: No (2)
## housing = rent:
## :...credit_history = critical: No (1)
## :   credit_history in {repaid,delayed}: Yes (10/2)
## housing = own:
## :...savings_balance = 101 - 500 DM: No (6)
##     savings_balance in {< 100 DM,501 - 1000 DM}:
##     :...residence_history <= 1: No (8/1)
##         residence_history > 1:
##         :...installment_rate <= 1: No (2)
##             installment_rate > 1:
##             :...employment_length in {> 7 yrs,unemployed}: No (13/6)
##                 employment_length in {1 - 4 yrs,0 - 1 yrs}: Yes (10)
## 
## 
## Evaluation on training data (900 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##      57  127(14.1%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     590    42    (a): class No
##      85   183    (b): class Yes
## 
## 
##  Attribute usage:
## 
##  100.00% checking_balance
##   60.22% foreign_worker
##   57.89% credit_history
##   51.11% months_loan_duration
##   42.67% savings_balance
##   30.44% other_debtors
##   17.78% job
##   15.56% installment_plan
##   14.89% purpose
##   12.89% employment_length
##   10.22% amount
##    6.78% residence_history
##    5.78% housing
##    3.89% dependents
##    3.56% installment_rate
##    3.44% personal_status
##    2.78% age
##    1.56% property
##    1.33% existing_credits
## 
## 
## Time: 0.0 secs

The preceding output shows the branches of the decision tree. The first few lines could be represented in plain language as:

1. If the checking account balance is unknown, then classify as not likely to default.
2. Otherwise, if the checking account balance is less than zero DM, between one and 200 DM, or greater than 200 DM and ...

The numbers in parentheses indicate the number of examples meeting the criteria for that decision, and the number incorrectly classified by the decision. For instance, on the first line, (358/44) indicates that of the 358 examples reaching the decision, 44 were incorrectly classified as no, that is, not likely to default. In other words, 44 applicants actually defaulted in spite of the model’s prediction to the contrary.

After the tree output, summary(credit_model) displays a confusion matrix, which is a cross-tabulation that indicates the model’s incorrectly classified records in the training data. The Errors field notes that the model correctly classified all but 127 of the 900 training instances, for an error rate of 14.1 percent. A total of 42 actual no values were incorrectly classified as yes (false positives), while 85 yes values were misclassified as no (false negatives).

Decision trees are known for having a tendency to overfit the model to the training data. For this reason, the error rate reported on training data may be overly optimistic, and it is especially important to evaluate decision trees on a test dataset.
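The training error can be recomputed by hand from the confusion matrix printed by summary(credit_model). The sketch below re-enters the four counts from that output; the object name train_cm and the dimension labels are assumptions made for this example.

```r
# Counts copied from the training confusion matrix in the summary output:
# rows are the actual class, columns are the class assigned by the tree
train_cm <- matrix(c(590, 85, 42, 183), nrow = 2,
                   dimnames = list(actual = c("No", "Yes"),
                                   classified = c("No", "Yes")))

# Off-diagonal cells are the misclassified cases
errors <- sum(train_cm) - sum(diag(train_cm))
errors                  # 127
errors / sum(train_cm)  # about 0.141

train_cm["No", "Yes"]   # false positives: 42
train_cm["Yes", "No"]   # false negatives: 85
```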

Step 4 Evaluating Model Performance

To apply our decision tree to the test dataset, we use the predict() function as shown in the following line of code:

credit_pred <- predict(credit_model, credit_test)

This creates a vector of predicted class values, which we can compare to the actual class values using the CrossTable() function in the gmodels package. Setting the prop.c and prop.r parameters to FALSE removes the column and row percentages from the table. The remaining percentage (prop.t) indicates the proportion of records in the cell out of the total number of records.

library(gmodels)
CrossTable(credit_test$default, credit_pred,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                | predicted default 
## actual default |        No |       Yes | Row Total | 
## ---------------|-----------|-----------|-----------|
##             No |        54 |        14 |        68 | 
##                |     0.540 |     0.140 |           | 
## ---------------|-----------|-----------|-----------|
##            Yes |        11 |        21 |        32 | 
##                |     0.110 |     0.210 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |        65 |        35 |       100 | 
## ---------------|-----------|-----------|-----------|
## 
##

Out of the 100 test loan application records, our model correctly predicted that 54 did not default and 21 did default, resulting in an accuracy of 75 percent and an error rate of 25 percent. This is somewhat worse than its performance on the training data, but not unexpected, given that a model’s performance is often worse on unseen data. Also note that the model correctly predicted only 21 of the 32 loan defaults in the test data (about 66 percent), so 11 defaults were missed. Unfortunately, this type of error is a potentially very costly mistake. Let’s see if we can improve the result with a bit more effort.
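The figures quoted above follow directly from the four cells of the CrossTable() output. The sketch below wraps the arithmetic in a small helper; the function name confusion_metrics() and its argument names are assumptions for this example, with "positive" meaning an actual default.

```r
# Accuracy, error rate, and share of actual defaults caught, from four counts.
# tn/fp/fn/tp = true negatives, false positives, false negatives, true positives.
confusion_metrics <- function(tn, fp, fn, tp) {
  total <- tn + fp + fn + tp
  c(accuracy    = (tn + tp) / total,
    error_rate  = (fp + fn) / total,
    sensitivity = tp / (tp + fn))   # share of actual defaults predicted as default
}

# Counts from the single-tree CrossTable() output above
single_tree <- confusion_metrics(tn = 54, fp = 14, fn = 11, tp = 21)
round(single_tree, 3)
```

Looking at sensitivity alongside accuracy is useful here because a missed default is the costly error for the bank.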

Step 5 Improving Model Performance

Our model’s error rate is likely to be too high to deploy it in a real-time credit scoring application. In fact, if the model had predicted “no default” for every test case, it would have been correct 68 percent of the time, a result not much worse than our model’s but requiring much less effort! Predicting loan defaults from 900 examples seems to be a challenging problem. Making matters even worse, our model performed especially poorly at identifying applicants who default. Luckily, there are a couple of simple ways to adjust the C5.0 algorithm that may help to improve the performance of the model, both overall and for the more costly mistakes.

Boosting the accuracy of decision trees

One way the C5.0 algorithm improved upon the C4.5 algorithm was by adding adaptive boosting. This is a process in which many decision trees are built, and the trees vote on the best class for each example. Boosting is rooted in the notion that by combining a number of weak performing learners, you can create a team that is much stronger than any one of the learners alone. Each of the models has a unique set of strengths and weaknesses, and may be better or worse at certain problems. Using a combination of several learners with complementary strengths and weaknesses can therefore dramatically improve the accuracy of a classifier.

The C5.0() function makes it easy to add boosting to our C5.0 decision tree. We simply need to add an additional trials parameter indicating the number of separate decision trees to use in the boosted team. The trials parameter sets an upper limit; the algorithm will stop adding trees if it recognizes that additional trials do not seem to be improving the accuracy. We’ll start with 10 trials, a number that has become the de facto standard, as research suggests that this reduces error rates on test data by about 25 percent. (As before, we use the formula syntax instead of passing the training data and class vector directly, since we are dealing with factor-level data.)

credit_boost10 <- C5.0(default ~ ., credit_train,
                       trials = 10)

While examining the resulting model, we can see that some additional lines have been added indicating the changes.

credit_boost10

## 
## Call:
## C5.0.formula(formula = default ~ ., data = credit_train, trials = 10)
## 
## Classification Tree
## Number of samples: 900 
## Number of predictors: 20 
## 
## Number of boosting iterations: 10 
## Average tree size: 47.3 
## 
## Non-standard options: attempt to group attributes

Across the 10 iterations, our average tree size shrank from 57 to 47.3. If you would like, you can see all 10 trees by typing summary(credit_boost10) at the command prompt. Let’s take a look at the performance on our training data:

summary(credit_boost10)
## Here I have not shown the entire decision tree summary but rather just the output table

The table in the result after boosting

The classifier made 30 mistakes on 900 training examples for an error rate of 3.33 percent. This is quite an improvement over the 14.11 percent training error rate we noted before adding boosting! However, it remains to be seen whether we see a similar improvement on the test data. Let’s take a look.

credit_boost_pred10 <- predict(credit_boost10, credit_test)
CrossTable(credit_test$default, credit_boost_pred10,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                | predicted default 
## actual default |        No |       Yes | Row Total | 
## ---------------|-----------|-----------|-----------|
##             No |        63 |         5 |        68 | 
##                |     0.630 |     0.050 |           | 
## ---------------|-----------|-----------|-----------|
##            Yes |        16 |        16 |        32 | 
##                |     0.160 |     0.160 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |        79 |        21 |       100 | 
## ---------------|-----------|-----------|-----------|
## 
##

Here, we reduced the total error rate from 25 percent prior to boosting to 21 percent in the boosted model. This is a relative reduction of 16 percent, a modest gain that falls short of the 25 percent reduction we hoped for. On the other hand, the model is still not doing well at predicting defaults: it misclassified 16 of the 32 actual defaults (50 percent), which is worse than the 11 of 32 missed by the single tree. The lack of an even greater improvement may be a function of our relatively small training dataset, or it may just be a very difficult problem to solve.

Making some mistakes more costly than others

Giving a loan out to an applicant who is likely to default can be an expensive mistake. One solution to reduce the number of false negatives may be to reject a larger number of borderline applicants. The few years’ worth of interest that the bank would earn from a risky loan is far outweighed by the massive loss it would take if the money was never paid back at all. The C5.0 algorithm allows us to assign a penalty to different types of errors in order to discourage a tree from making more costly mistakes. The penalties are designated in a cost matrix, which specifies how many times more costly each error is, relative to any other. Suppose we believe that a loan default costs the bank four times as much as a missed opportunity. Our cost matrix then could be defined as:

error_cost <- matrix(c(0, 1, 4, 0), nrow = 2)

This creates a matrix with two rows and two columns, arranged somewhat differently than the confusion matrices we have been working with. The value 1 indicates no and the value 2 indicates yes. Rows are for predicted values and columns are for actual values:

error_cost

##      [,1] [,2]
## [1,]    0    4
## [2,]    1    0

As defined by this matrix, there is no cost assigned when the algorithm classifies a no or yes correctly, but a false negative has a cost of 4 versus a false positive’s cost of 1. To see how this impacts classification, let’s apply it to our decision tree using the costs parameter of the C5.0() function. We’ll otherwise use the same steps as before. (As before, we use the formula syntax instead of passing the training data and class vector directly, since we are dealing with factor-level data.)

credit_cost <- C5.0(default ~ ., credit_train,
                    costs = error_cost)

## Warning: no dimnames were given for the cost matrix; the factor levels will be
## used
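The warning above appears because the cost matrix has no row or column names, so C5.0() falls back to the factor levels. A minimal sketch, assuming the levels of default are "No" and "Yes" in that order (as the earlier str() output shows), is to name both dimensions explicitly; this also makes each cell easier to read.

```r
# Same costs as before, with rows (predicted) and columns (actual) named
# after the levels of the default factor
error_cost <- matrix(c(0, 1, 4, 0), nrow = 2,
                     dimnames = list(c("No", "Yes"),    # rows: predicted class
                                     c("No", "Yes")))   # columns: actual class
error_cost

# Predicting "No" for an applicant who actually defaults (false negative)
error_cost["No", "Yes"]

# Predicting "Yes" for an applicant who does not default (false positive)
error_cost["Yes", "No"]
```

Passing this named matrix through the costs parameter should specify the same penalties as the unnamed version; it has not been re-run against the credit data here.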

credit_cost_pred <- predict(credit_cost, credit_test)
CrossTable(credit_test$default, credit_cost_pred,
           prop.chisq = FALSE, prop.c = FALSE, prop.r = FALSE,
           dnn = c('actual default', 'predicted default'))

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                | predicted default 
## actual default |        No |       Yes | Row Total | 
## ---------------|-----------|-----------|-----------|
##             No |        38 |        30 |        68 | 
##                |     0.380 |     0.300 |           | 
## ---------------|-----------|-----------|-----------|
##            Yes |         5 |        27 |        32 | 
##                |     0.050 |     0.270 |           | 
## ---------------|-----------|-----------|-----------|
##   Column Total |        43 |        57 |       100 | 
## ---------------|-----------|-----------|-----------|
## 
##

## The table above is the confusion matrix produced by the cost-sensitive model.

Compared to our boosted model, this version makes more mistakes overall: 35 percent here versus 21 percent in the boosted case. However, the types of mistakes vary dramatically. Where the boosted model misclassified half of the defaults, in this model only 15.6 percent of the defaults were predicted to be non-defaults. This trade, a reduction of false negatives at the expense of more false positives, may be acceptable if our cost estimates are accurate.
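To put the three models side by side, the sketch below re-enters the false positive and false negative counts from the three CrossTable() outputs and weights them with the 4-to-1 cost assumption used in the cost matrix. The data frame name models and its column names are assumptions for this example.

```r
# Test-set error counts copied from the three CrossTable() outputs
models <- data.frame(
  model     = c("single tree", "boosted (10 trials)", "cost-sensitive"),
  false_pos = c(14, 5, 30),   # actual No predicted as Yes
  false_neg = c(11, 16, 5)    # actual Yes predicted as No
)

n_test    <- 100   # test records
n_default <- 32    # actual defaults in the test set

models$error_rate      <- (models$false_pos + models$false_neg) / n_test
models$missed_defaults <- models$false_neg / n_default

# Weighted cost: a false negative counts four times as much as a false positive
models$weighted_cost <- 1 * models$false_pos + 4 * models$false_neg

models
```

Under this cost assumption the cost-sensitive tree has the lowest weighted cost even though its raw error rate is the highest of the three.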

Conclusion

Decision tree algorithms use a divide-and-conquer strategy to create flowchart-like models. One popular and highly configurable decision tree algorithm is C5.0.

We used the C5.0 algorithm to create a tree to predict whether a loan applicant will default. Using options for boosting and cost-sensitive errors, we were able to improve our accuracy and avoid risky loans that cost the bank more money. Thus the decision trees helped us to build a model that can identify risky loans and so help the banking sector to reduce its losses due to borrower defaults.

Identifying Risky Bank Loans using C5.0 Decision Trees in R

Virag Lakhani

22/10/2020