Introduction to Decision Trees

In this tutorial we will cover two machine learning methods that apply a similar strategy of dividing data into smaller and smaller portions to identify patterns that can be ultimately used for prediction. The knowledge gained is then presented in the form of logical structures that can be understood without any statistical knowledge. There are numerous decision trees by one can choose to follow, such as the C5.0, 1R, and RIPPER agorithms.

In this brief example, we will try to determine how likely a consumer is goinng to default based on numerous factors.

1. Underansing Decision Trees

Decusuib tree learners construct a model in the form of a tree structure. The model itself comprises a serious of logical decisions, similar to a flowerchart, with decision nodes tgat ubducate a decusuibnt i be nade ib ab attribute. These split into branches that indicate the decision’s choices. The tree is terminated by leaf nodes (aka terminal nodes) that denote the result of following a combination of decisions.

Data that is long to be classified being at the root node where it is passed through the various decisions in the tree according to the values of its features. The path the data takes funnels each record into a leaf node, which assigns it a predicted class.

Data that is to be classified begin in the root node where it is passed throug hthe various decisions in the tree according to the values of its features. The path that the various decisions in the tree according to the value of its features.

Have you ever stopped to ownder why pollution (save for other countries)

2. Divide and Conquer

Decision trees are built using a heuristic called recursive partitioning. This approach is more commonly known as divide and conquer because it uses the feature values to split the data into smaller and smaller subsets of similar classes.

To elaborate on this concept, let us suppose we begin at the root node, which represents our entire dataset. The algorithm will chose a feature that is the mnost predictive of the target class. The examples are then partitioned into groups of distinct values of this feature; this decision forms the first set of tree branches. The agorithm continues to divide-and-conquer the nodes, choosing the best candidate eature each time until a stopping criterion is reached. This could occur if

  • All, or nearly all, of the ecamples at the node have the same class.
  • There are no more remaining features to distinguish among examples.
  • The tree has grown to a predefined size limit.

Since real-world data contains more than two features, decision trees quickly become ar more complex than textbook examples, with many more nodes, branches, and leaves.

3. The C5.0 Decision Tree Algorithm

While there are numerous implementations of decision trees, one of the most well-known is the C5.0 algorithm. The C5.0 algorithm has become the industry standatd for producing decision trees, because it does well fo rmost types of problems directly out of the box. Compared to more advanced and sophisticated machine learning mnodels (e.g. Neural Networks and Support Vector Machines), the decision trees under the C5.0 algorithm generally perform nearly as well but are much easier to understan and deploy.

4. Choosing the Best Split

The first main challenge that a decision tree will face is to identify which feature to split upon. If the segements of the data contain only a single class, tehy are considered pure.

C5.0 uses the concept of entropy for measuring purity. The entropy of a sample of data indicates how mixed the class values are; the minimum value of \(0\) indicates that the sample is completely homogenous, while \(1\) indicates the maximum amount of disorder. The defintion of entropy can be specified as

\[ \begin{aligned} \text{Entropy}(S) = \sum_{i=1}^{c} -p_{i}\log_{2}(p_{i}) &&&& (1) \end{aligned} \] In equation \(1\), for a given segment of data \((S)\), the term \(c\) refers to the number of differenct class levels, and \(p_{i}\) refers to the proportion of values falling into the cass level \(i\). For example, if we suppose we have a partiition of data with two classes: red (60 percent) and white (40 percent), we can calculate its entropy as

\[ \begin{aligned} -0.6 \times \log_{2}0.6 - 0.4 \times \log_{2}0.4 = 0.9709506 \end{aligned} \]

We can also examine the entropy for all possible two-class arrangements. If we know the proportion of examples in one class is \(x\), then the proportion in the other class is \(1 - x\). Below is an illustration of this arrangement

As illustrated above, the peak entropy is at \(x = 0.5\), a 50-50 split results in the maximum entropy.

Given the measure of purity, the algorithm must still decide which feature to split upon. In order to do so, the algorithm uses entropy to calculate the change in honmogeneity resulting from a split on each possible feature. This calculation is referred to as information gain. The information gain for a feature \(F\) is calculated as *the differences between the entropy in the segment before the split \((S_{1})\), and the partitions resulting from the split \((S_{2})\). That is,

\[ \begin{aligned} \text{InfoGain}(F) = \text{Entropy}(S_{1}) - \text{Entropy}(S_{2}) &&&& (2) \end{aligned} \]

One complication is that after a split, the data is divided into more than one partition. Therefore, the function to calculate \(\text{Entropy}(S_{2})\) needs to consider the total entropy across all of the partitions. In accomplishes this by weighing each partition’s entropy by the proportion of records falling into that partition, which can be expressed by the following formula

\[ \begin{aligned} \text{Entropy}(S) = \sum_{i=1}^{n}w_{i}\text{Entropy}(P_{i}) &&&& (3) \end{aligned} \]

The higher the information gain, the better a feature is at creating homogenous groups after a split on that feature.

5. Pruning the Decision Tree

Decision trees can continue to grow indefinitely by choosing splitting features and dividing them into smaller and smaller partitions until each example is perfectly classified, or the algorithm runs out of features to split on. However, if the tree grows overly large, many of the decisions it makes will be overly specific and the model will overfit the training data. To overcome this, we can prune a decision tree which involves a process of reducing its size such that it generalizes better to unobserved data.

One solution to this problem is to stop the tree from growing once it reaches a certain number of decisions, or if the decision node contains only a small number of examples. This method is known as pre-pruning, or early stopping the decision tree. As the tree avoids doing unnecessary work, this can quite often be an appealing strategy.

The C5.0 Algorithm

One of the benefits of the C5.0 algorithm is that it is opinionated about pruning; it takes care of many of the decisions automatically using fairly reasonable defaults. Its overall strategy is to postprune the tree. It does this by first growing a large tree that overfits the training data. Aftewrwards, nodes and branches that have little effect on the classification errors are removed.

6. Exampple: Identifying Risky Loans Using the C5.0 Decision Tree

In this example, we will be using the C5.0 algorithm to determine the creditworthiness of a loan applicant.

Loading and Inspecting the Data

library(kableExtra)
library(knitr)
credit.df <- read.csv('/Users/czar.yobero/Documents/credit.csv')
credit.df$default <- factor(credit.df$default, levels = c(1, 2), labels = c('Yes', 'No'))
str(credit.df)
## 'data.frame':    1000 obs. of  21 variables:
##  $ checking_balance    : Factor w/ 4 levels "< 0 DM","> 200 DM",..: 1 3 4 1 1 4 4 3 4 3 ...
##  $ months_loan_duration: int  6 48 12 42 24 36 24 36 12 30 ...
##  $ credit_history      : Factor w/ 5 levels "critical","delayed",..: 1 5 1 5 2 5 5 5 5 1 ...
##  $ purpose             : Factor w/ 10 levels "business","car (new)",..: 8 8 5 6 2 5 6 3 8 2 ...
##  $ amount              : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ savings_balance     : Factor w/ 5 levels "< 100 DM","> 1000 DM",..: 5 1 1 1 1 5 4 1 2 1 ...
##  $ employment_length   : Factor w/ 5 levels "> 7 yrs","0 - 1 yrs",..: 1 3 4 4 3 3 1 3 4 5 ...
##  $ installment_rate    : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ personal_status     : Factor w/ 4 levels "divorced male",..: 4 2 4 4 4 4 4 4 1 3 ...
##  $ other_debtors       : Factor w/ 3 levels "co-applicant",..: 3 3 3 2 3 3 3 3 3 3 ...
##  $ residence_history   : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ property            : Factor w/ 4 levels "building society savings",..: 3 3 3 1 4 4 1 2 3 2 ...
##  $ age                 : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ installment_plan    : Factor w/ 3 levels "bank","none",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ housing             : Factor w/ 3 levels "for free","own",..: 2 2 2 1 1 1 2 3 2 2 ...
##  $ existing_credits    : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ default             : Factor w/ 2 levels "Yes","No": 1 2 1 1 2 1 1 1 1 2 ...
##  $ dependents          : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ telephone           : Factor w/ 2 levels "none","yes": 2 1 1 1 1 2 1 2 1 1 ...
##  $ foreign_worker      : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ job                 : Factor w/ 4 levels "mangement self-employed",..: 2 2 4 2 2 4 2 1 4 1 ...
summary(credit.df)
##    checking_balance months_loan_duration                credit_history
##  < 0 DM    :274     Min.   : 4.0         critical              :293   
##  > 200 DM  : 63     1st Qu.:12.0         delayed               : 88   
##  1 - 200 DM:269     Median :18.0         fully repaid          : 40   
##  unknown   :394     Mean   :20.9         fully repaid this bank: 49   
##                     3rd Qu.:24.0         repaid                :530   
##                     Max.   :72.0                                      
##                                                                       
##        purpose        amount           savings_balance  employment_length
##  radio/tv  :280   Min.   :  250   < 100 DM     :603    > 7 yrs   :253    
##  car (new) :234   1st Qu.: 1366   > 1000 DM    : 48    0 - 1 yrs :172    
##  furniture :181   Median : 2320   101 - 500 DM :103    1 - 4 yrs :339    
##  car (used):103   Mean   : 3271   501 - 1000 DM: 63    4 - 7 yrs :174    
##  business  : 97   3rd Qu.: 3972   unknown      :183    unemployed: 62    
##  education : 50   Max.   :18424                                          
##  (Other)   : 55                                                          
##  installment_rate      personal_status      other_debtors
##  Min.   :1.000    divorced male: 50    co-applicant: 41  
##  1st Qu.:2.000    female       :310    guarantor   : 52  
##  Median :3.000    married male : 92    none        :907  
##  Mean   :2.973    single male  :548                      
##  3rd Qu.:4.000                                           
##  Max.   :4.000                                           
##                                                          
##  residence_history                     property        age       
##  Min.   :1.000     building society savings:232   Min.   :19.00  
##  1st Qu.:2.000     other                   :332   1st Qu.:27.00  
##  Median :3.000     real estate             :282   Median :33.00  
##  Mean   :2.845     unknown/none            :154   Mean   :35.55  
##  3rd Qu.:4.000                                    3rd Qu.:42.00  
##  Max.   :4.000                                    Max.   :75.00  
##                                                                  
##  installment_plan     housing    existing_credits default  
##  bank  :139       for free:108   Min.   :1.000    Yes:700  
##  none  :814       own     :713   1st Qu.:1.000    No :300  
##  stores: 47       rent    :179   Median :1.000             
##                                  Mean   :1.407             
##                                  3rd Qu.:2.000             
##                                  Max.   :4.000             
##                                                            
##    dependents    telephone  foreign_worker                      job     
##  Min.   :1.000   none:596   no : 37        mangement self-employed:148  
##  1st Qu.:1.000   yes :404   yes:963        skilled employee       :630  
##  Median :1.000                             unemployed non-resident: 22  
##  Mean   :1.155                             unskilled resident     :200  
##  3rd Qu.:1.000                                                          
##  Max.   :2.000                                                          
## 
kable(head(credit.df, n=15), format="html") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
checking_balance months_loan_duration credit_history purpose amount savings_balance employment_length installment_rate personal_status other_debtors residence_history property age installment_plan housing existing_credits default dependents telephone foreign_worker job
< 0 DM 6 critical radio/tv 1169 unknown > 7 yrs 4 single male none 4 real estate 67 none own 2 Yes 1 yes yes skilled employee
1 - 200 DM 48 repaid radio/tv 5951 < 100 DM 1 - 4 yrs 2 female none 2 real estate 22 none own 1 No 1 none yes skilled employee
unknown 12 critical education 2096 < 100 DM 4 - 7 yrs 2 single male none 3 real estate 49 none own 1 Yes 2 none yes unskilled resident
< 0 DM 42 repaid furniture 7882 < 100 DM 4 - 7 yrs 2 single male guarantor 4 building society savings 45 none for free 1 Yes 2 none yes skilled employee
< 0 DM 24 delayed car (new) 4870 < 100 DM 1 - 4 yrs 3 single male none 4 unknown/none 53 none for free 2 No 2 none yes skilled employee
unknown 36 repaid education 9055 unknown 1 - 4 yrs 2 single male none 4 unknown/none 35 none for free 1 Yes 2 yes yes unskilled resident
unknown 24 repaid furniture 2835 501 - 1000 DM > 7 yrs 3 single male none 4 building society savings 53 none own 1 Yes 1 none yes skilled employee
1 - 200 DM 36 repaid car (used) 6948 < 100 DM 1 - 4 yrs 2 single male none 2 other 35 none rent 1 Yes 1 yes yes mangement self-employed
unknown 12 repaid radio/tv 3059 > 1000 DM 4 - 7 yrs 2 divorced male none 4 real estate 61 none own 1 Yes 1 none yes unskilled resident
1 - 200 DM 30 critical car (new) 5234 < 100 DM unemployed 4 married male none 2 other 28 none own 2 No 1 none yes mangement self-employed
1 - 200 DM 12 repaid car (new) 1295 < 100 DM 0 - 1 yrs 3 female none 1 other 25 none rent 1 No 1 none yes skilled employee
< 0 DM 48 repaid business 4308 < 100 DM 0 - 1 yrs 3 female none 4 building society savings 24 none rent 1 No 1 none yes skilled employee
1 - 200 DM 12 repaid radio/tv 1567 < 100 DM 1 - 4 yrs 1 female none 1 other 22 none own 1 Yes 1 yes yes skilled employee
< 0 DM 24 critical car (new) 1199 < 100 DM > 7 yrs 4 single male none 4 other 60 none own 2 No 1 none yes unskilled resident
< 0 DM 15 repaid car (new) 1403 < 100 DM 1 - 4 yrs 2 female none 4 other 28 none rent 1 Yes 1 none yes skilled employee

Data Preparation: Dividing Data into Training and Test Sets and Randomizing

As is common when using modeling techniques to predict the values given by the model, we will divide our data into training and test sets, using 90% of the data as our training set and the rest of the 10% as our test. But before we divide the data set, we must first randomize our data set since this particular data set isn’t random and could lead to misleading results.

Let us first randomize our data set.

set.seed(1978)
credit.random.df <- credit.df[order(runif(1000)), ]

Now that we have randomized our data set, we still must confirm that we have the same data frame sorted differently. We shall compare values of the amount feature across the two data frames.

summary(credit.df$amount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     250    1366    2320    3271    3972   18424
summary(credit.random.df$amount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     250    1366    2320    3271    3972   18424

Since the summary statistics are identical, this suggests that our random shuffle worked.

Splitting Data Set Into Training and Test Sets

credit.train <- credit.random.df[1:900, ]
credit.test <- credit.random.df[901:1000, ]

If all went well, we should have about \(30\%\) of the defaulted loans in each data set.

prop.table(table(credit.train$default))
## 
##       Yes        No 
## 0.6888889 0.3111111
prop.table(table(credit.test$default))
## 
## Yes  No 
## 0.8 0.2

This seems to be equally split.

7. Modeling Our Decision Tree

library(C50)
credit.fit <- C5.0(credit.train[-17], credit.train$default)
credit.fit
## 
## Call:
## C5.0.default(x = credit.train[-17], y = credit.train$default)
## 
## Classification Tree
## Number of samples: 900 
## Number of predictors: 20 
## 
## Tree size: 74 
## 
## Non-standard options: attempt to group attributes
summary(credit.fit)
## 
## Call:
## C5.0.default(x = credit.train[-17], y = credit.train$default)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Thu Jul 12 11:56:05 2018
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 900 cases (21 attributes) from undefined.data
## 
## Decision tree:
## 
## checking_balance = unknown: Yes (352/40)
## checking_balance in {< 0 DM,> 200 DM,1 - 200 DM}:
## :...credit_history in {fully repaid,fully repaid this bank}:
##     :...housing in {for free,rent}: No (32/4)
##     :   housing = own:
##     :   :...purpose in {car (used),domestic appliances,radio/tv,
##     :       :           retraining}: Yes (8/2)
##     :       purpose in {education,others,repairs}: No (3)
##     :       purpose = car (new):
##     :       :...months_loan_duration <= 22: No (7)
##     :       :   months_loan_duration > 22: Yes (2)
##     :       purpose = business:
##     :       :...installment_plan = bank: Yes (3)
##     :       :   installment_plan = stores: No (1)
##     :       :   installment_plan = none:
##     :       :   :...age <= 27: No (2)
##     :       :       age > 27: Yes (3)
##     :       purpose = furniture:
##     :       :...installment_plan in {none,stores}: Yes (3)
##     :           installment_plan = bank:
##     :           :...checking_balance in {< 0 DM,> 200 DM}: No (4)
##     :               checking_balance = 1 - 200 DM: Yes (1)
##     credit_history in {critical,delayed,repaid}:
##     :...months_loan_duration > 22:
##         :...checking_balance = > 200 DM:
##         :   :...dependents > 1: No (2)
##         :   :   dependents <= 1:
##         :   :   :...job = mangement self-employed: No (1)
##         :   :       job in {skilled employee,unemployed non-resident,
##         :   :               unskilled resident}: Yes (15/1)
##         :   checking_balance in {< 0 DM,1 - 200 DM}:
##         :   :...savings_balance = > 1000 DM: Yes (3)
##         :       savings_balance = 501 - 1000 DM: No (4/1)
##         :       savings_balance = 101 - 500 DM:
##         :       :...employment_length in {0 - 1 yrs,1 - 4 yrs}: No (13/2)
##         :       :   employment_length in {4 - 7 yrs,unemployed}: Yes (9/1)
##         :       :   employment_length = > 7 yrs:
##         :       :   :...job = mangement self-employed: No (1)
##         :       :       job in {skilled employee,unemployed non-resident,
##         :       :               unskilled resident}: Yes (3)
##         :       savings_balance = unknown:
##         :       :...checking_balance = 1 - 200 DM: Yes (17/1)
##         :       :   checking_balance = < 0 DM:
##         :       :   :...credit_history = critical: Yes (1)
##         :       :       credit_history in {delayed,repaid}: No (12/3)
##         :       savings_balance = < 100 DM:
##         :       :...dependents > 1:
##         :           :...other_debtors in {co-applicant,
##         :           :   :                 guarantor}: Yes (3)
##         :           :   other_debtors = none:
##         :           :   :...credit_history = critical: Yes (3)
##         :           :       credit_history in {delayed,repaid}: No (15/5)
##         :           dependents <= 1:
##         :           :...purpose in {business,car (new),domestic appliances,
##         :               :           repairs,retraining}: No (32/4)
##         :               purpose = others: Yes (3/1)
##         :               purpose = education:
##         :               :...employment_length in {> 7 yrs,1 - 4 yrs,
##         :               :   :                     unemployed}: No (5)
##         :               :   employment_length in {0 - 1 yrs,
##         :               :                         4 - 7 yrs}: Yes (2)
##         :               purpose = radio/tv:
##         :               :...job = mangement self-employed: Yes (2)
##         :               :   job in {skilled employee,unemployed non-resident,
##         :               :           unskilled resident}: No (19/5)
##         :               purpose = car (used):
##         :               :...residence_history <= 3:
##         :               :   :...personal_status = female: No (1)
##         :               :   :   personal_status in {divorced male,married male,
##         :               :   :                       single male}: Yes (4)
##         :               :   residence_history > 3:
##         :               :   :...amount <= 5965: Yes (2)
##         :               :       amount > 5965: No (5)
##         :               purpose = furniture:
##         :               :...checking_balance = 1 - 200 DM: No (5)
##         :                   checking_balance = < 0 DM:
##         :                   :...residence_history > 3: No (4)
##         :                       residence_history <= 3: [S1]
##         months_loan_duration <= 22:
##         :...months_loan_duration <= 8:
##             :...age <= 25: No (4/1)
##             :   age > 25:
##             :   :...amount <= 3001: Yes (37)
##             :       amount > 3001:
##             :       :...residence_history <= 3: Yes (2)
##             :           residence_history > 3: No (2)
##             months_loan_duration > 8:
##             :...other_debtors = co-applicant:
##                 :...installment_rate <= 2: Yes (4)
##                 :   installment_rate > 2:
##                 :   :...amount <= 1569: Yes (2)
##                 :       amount > 1569: No (3)
##                 other_debtors = guarantor:
##                 :...installment_plan in {none,stores}: Yes (15)
##                 :   installment_plan = bank:
##                 :   :...purpose = car (new): No (3)
##                 :       purpose in {business,car (used),domestic appliances,
##                 :                   education,furniture,others,radio/tv,
##                 :                   repairs,retraining}: Yes (3)
##                 other_debtors = none:
##                 :...credit_history in {critical,delayed}: Yes (66/14)
##                     credit_history = repaid:
##                     :...savings_balance = > 1000 DM: Yes (8)
##                         savings_balance in {< 100 DM,101 - 500 DM,
##                         :                   501 - 1000 DM,unknown}:
##                         :...checking_balance = < 0 DM:
##                             :...residence_history <= 2:
##                             :   :...installment_plan = bank: Yes (1)
##                             :   :   installment_plan in {none,
##                             :   :                        stores}: No (20/4)
##                             :   residence_history > 2:
##                             :   :...amount <= 1386: No (18/5)
##                             :       amount > 1386: Yes (19/6)
##                             checking_balance = > 200 DM: [S2]
##                             checking_balance = 1 - 200 DM:
##                             :...installment_plan = stores: No (1)
##                                 installment_plan = bank:
##                                 :...months_loan_duration <= 11: Yes (2)
##                                 :   months_loan_duration > 11: No (6)
##                                 installment_plan = none: [S3]
## 
## SubTree [S1]
## 
## employment_length in {> 7 yrs,4 - 7 yrs}: No (2)
## employment_length in {0 - 1 yrs,1 - 4 yrs,unemployed}: Yes (7)
## 
## SubTree [S2]
## 
## property in {building society savings,other}: Yes (11)
## property = unknown/none: No (1)
## property = real estate:
## :...job in {mangement self-employed,skilled employee,
##     :       unemployed non-resident}: Yes (3)
##     job = unskilled resident: No (3)
## 
## SubTree [S3]
## 
## employment_length = 4 - 7 yrs: Yes (9)
## employment_length in {> 7 yrs,0 - 1 yrs,1 - 4 yrs,unemployed}:
## :...job in {unemployed non-resident,unskilled resident}: Yes (12/2)
##     job = mangement self-employed:
##     :...property in {building society savings,other,real estate}: Yes (2)
##     :   property = unknown/none: No (2)
##     job = skilled employee:
##     :...savings_balance in {101 - 500 DM,501 - 1000 DM}: Yes (6/1)
##         savings_balance = < 100 DM:
##         :...telephone = none: No (9/1)
##         :   telephone = yes: Yes (3/1)
##         savings_balance = unknown:
##         :...personal_status in {female,married male}: No (3)
##             personal_status in {divorced male,single male}: Yes (4)
## 
## 
## Evaluation on training data (900 cases):
## 
##      Decision Tree   
##    ----------------  
##    Size      Errors  
## 
##      74  105(11.7%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##     585    35    (a): class Yes
##      70   210    (b): class No
## 
## 
##  Attribute usage:
## 
##  100.00% checking_balance
##   60.89% credit_history
##   54.22% months_loan_duration
##   35.56% savings_balance
##   28.89% other_debtors
##   15.11% purpose
##   14.67% dependents
##   13.11% installment_plan
##   10.22% employment_length
##   10.00% amount
##    9.78% job
##    9.67% residence_history
##    7.67% housing
##    5.56% age
##    2.44% property
##    1.33% personal_status
##    1.33% telephone
##    1.00% installment_rate
## 
## 
## Time: 0.0 secs

The above output is the decision tree to determine whether or not a loan applicant is worthy of receiving a loan. It also shows which features are most important in determining which factors are most important when considering loan applications.