In this tutorial we will cover two machine learning methods that apply a similar strategy of dividing data into smaller and smaller portions to identify patterns that can be ultimately used for prediction. The knowledge gained is then presented in the form of logical structures that can be understood without any statistical knowledge. There are numerous decision trees by one can choose to follow, such as the C5.0, 1R, and RIPPER agorithms.
In this brief example, we will try to determine how likely a consumer is goinng to default based on numerous factors.
Decusuib tree learners construct a model in the form of a tree structure. The model itself comprises a serious of logical decisions, similar to a flowerchart, with decision nodes tgat ubducate a decusuibnt i be nade ib ab attribute. These split into branches that indicate the decision’s choices. The tree is terminated by leaf nodes (aka terminal nodes) that denote the result of following a combination of decisions.
Data that is long to be classified being at the root node where it is passed through the various decisions in the tree according to the values of its features. The path the data takes funnels each record into a leaf node, which assigns it a predicted class.
Data that is to be classified begin in the root node where it is passed throug hthe various decisions in the tree according to the values of its features. The path that the various decisions in the tree according to the value of its features.
Have you ever stopped to ownder why pollution (save for other countries)
Decision trees are built using a heuristic called recursive partitioning. This approach is more commonly known as divide and conquer because it uses the feature values to split the data into smaller and smaller subsets of similar classes.
To elaborate on this concept, let us suppose we begin at the root node, which represents our entire dataset. The algorithm will chose a feature that is the mnost predictive of the target class. The examples are then partitioned into groups of distinct values of this feature; this decision forms the first set of tree branches. The agorithm continues to divide-and-conquer the nodes, choosing the best candidate eature each time until a stopping criterion is reached. This could occur if
Since real-world data contains more than two features, decision trees quickly become ar more complex than textbook examples, with many more nodes, branches, and leaves.
While there are numerous implementations of decision trees, one of the most well-known is the C5.0 algorithm. The C5.0 algorithm has become the industry standatd for producing decision trees, because it does well fo rmost types of problems directly out of the box. Compared to more advanced and sophisticated machine learning mnodels (e.g. Neural Networks and Support Vector Machines), the decision trees under the C5.0 algorithm generally perform nearly as well but are much easier to understan and deploy.
The first main challenge that a decision tree will face is to identify which feature to split upon. If the segements of the data contain only a single class, tehy are considered pure.
C5.0 uses the concept of entropy for measuring purity. The entropy of a sample of data indicates how mixed the class values are; the minimum value of \(0\) indicates that the sample is completely homogenous, while \(1\) indicates the maximum amount of disorder. The defintion of entropy can be specified as
\[ \begin{aligned} \text{Entropy}(S) = \sum_{i=1}^{c} -p_{i}\log_{2}(p_{i}) &&&& (1) \end{aligned} \] In equation \(1\), for a given segment of data \((S)\), the term \(c\) refers to the number of differenct class levels, and \(p_{i}\) refers to the proportion of values falling into the cass level \(i\). For example, if we suppose we have a partiition of data with two classes: red (60 percent) and white (40 percent), we can calculate its entropy as
\[ \begin{aligned} -0.6 \times \log_{2}0.6 - 0.4 \times \log_{2}0.4 = 0.9709506 \end{aligned} \]
We can also examine the entropy for all possible two-class arrangements. If we know the proportion of examples in one class is \(x\), then the proportion in the other class is \(1 - x\). Below is an illustration of this arrangement
As illustrated above, the peak entropy is at \(x = 0.5\), a 50-50 split results in the maximum entropy.
Given the measure of purity, the algorithm must still decide which feature to split upon. In order to do so, the algorithm uses entropy to calculate the change in honmogeneity resulting from a split on each possible feature. This calculation is referred to as information gain. The information gain for a feature \(F\) is calculated as *the differences between the entropy in the segment before the split \((S_{1})\), and the partitions resulting from the split \((S_{2})\). That is,
\[ \begin{aligned} \text{InfoGain}(F) = \text{Entropy}(S_{1}) - \text{Entropy}(S_{2}) &&&& (2) \end{aligned} \]
One complication is that after a split, the data is divided into more than one partition. Therefore, the function to calculate \(\text{Entropy}(S_{2})\) needs to consider the total entropy across all of the partitions. In accomplishes this by weighing each partition’s entropy by the proportion of records falling into that partition, which can be expressed by the following formula
\[ \begin{aligned} \text{Entropy}(S) = \sum_{i=1}^{n}w_{i}\text{Entropy}(P_{i}) &&&& (3) \end{aligned} \]
The higher the information gain, the better a feature is at creating homogenous groups after a split on that feature.
Decision trees can continue to grow indefinitely by choosing splitting features and dividing them into smaller and smaller partitions until each example is perfectly classified, or the algorithm runs out of features to split on. However, if the tree grows overly large, many of the decisions it makes will be overly specific and the model will overfit the training data. To overcome this, we can prune a decision tree which involves a process of reducing its size such that it generalizes better to unobserved data.
One solution to this problem is to stop the tree from growing once it reaches a certain number of decisions, or if the decision node contains only a small number of examples. This method is known as pre-pruning, or early stopping the decision tree. As the tree avoids doing unnecessary work, this can quite often be an appealing strategy.
One of the benefits of the C5.0 algorithm is that it is opinionated about pruning; it takes care of many of the decisions automatically using fairly reasonable defaults. Its overall strategy is to postprune the tree. It does this by first growing a large tree that overfits the training data. Aftewrwards, nodes and branches that have little effect on the classification errors are removed.
In this example, we will be using the C5.0 algorithm to determine the creditworthiness of a loan applicant.
library(kableExtra)
library(knitr)
credit.df <- read.csv('/Users/czar.yobero/Documents/credit.csv')
credit.df$default <- factor(credit.df$default, levels = c(1, 2), labels = c('Yes', 'No'))
str(credit.df)
## 'data.frame': 1000 obs. of 21 variables:
## $ checking_balance : Factor w/ 4 levels "< 0 DM","> 200 DM",..: 1 3 4 1 1 4 4 3 4 3 ...
## $ months_loan_duration: int 6 48 12 42 24 36 24 36 12 30 ...
## $ credit_history : Factor w/ 5 levels "critical","delayed",..: 1 5 1 5 2 5 5 5 5 1 ...
## $ purpose : Factor w/ 10 levels "business","car (new)",..: 8 8 5 6 2 5 6 3 8 2 ...
## $ amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ savings_balance : Factor w/ 5 levels "< 100 DM","> 1000 DM",..: 5 1 1 1 1 5 4 1 2 1 ...
## $ employment_length : Factor w/ 5 levels "> 7 yrs","0 - 1 yrs",..: 1 3 4 4 3 3 1 3 4 5 ...
## $ installment_rate : int 4 2 2 2 3 2 3 2 2 4 ...
## $ personal_status : Factor w/ 4 levels "divorced male",..: 4 2 4 4 4 4 4 4 1 3 ...
## $ other_debtors : Factor w/ 3 levels "co-applicant",..: 3 3 3 2 3 3 3 3 3 3 ...
## $ residence_history : int 4 2 3 4 4 4 4 2 4 2 ...
## $ property : Factor w/ 4 levels "building society savings",..: 3 3 3 1 4 4 1 2 3 2 ...
## $ age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ installment_plan : Factor w/ 3 levels "bank","none",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ housing : Factor w/ 3 levels "for free","own",..: 2 2 2 1 1 1 2 3 2 2 ...
## $ existing_credits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ default : Factor w/ 2 levels "Yes","No": 1 2 1 1 2 1 1 1 1 2 ...
## $ dependents : int 1 1 2 2 2 2 1 1 1 1 ...
## $ telephone : Factor w/ 2 levels "none","yes": 2 1 1 1 1 2 1 2 1 1 ...
## $ foreign_worker : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ job : Factor w/ 4 levels "mangement self-employed",..: 2 2 4 2 2 4 2 1 4 1 ...
summary(credit.df)
## checking_balance months_loan_duration credit_history
## < 0 DM :274 Min. : 4.0 critical :293
## > 200 DM : 63 1st Qu.:12.0 delayed : 88
## 1 - 200 DM:269 Median :18.0 fully repaid : 40
## unknown :394 Mean :20.9 fully repaid this bank: 49
## 3rd Qu.:24.0 repaid :530
## Max. :72.0
##
## purpose amount savings_balance employment_length
## radio/tv :280 Min. : 250 < 100 DM :603 > 7 yrs :253
## car (new) :234 1st Qu.: 1366 > 1000 DM : 48 0 - 1 yrs :172
## furniture :181 Median : 2320 101 - 500 DM :103 1 - 4 yrs :339
## car (used):103 Mean : 3271 501 - 1000 DM: 63 4 - 7 yrs :174
## business : 97 3rd Qu.: 3972 unknown :183 unemployed: 62
## education : 50 Max. :18424
## (Other) : 55
## installment_rate personal_status other_debtors
## Min. :1.000 divorced male: 50 co-applicant: 41
## 1st Qu.:2.000 female :310 guarantor : 52
## Median :3.000 married male : 92 none :907
## Mean :2.973 single male :548
## 3rd Qu.:4.000
## Max. :4.000
##
## residence_history property age
## Min. :1.000 building society savings:232 Min. :19.00
## 1st Qu.:2.000 other :332 1st Qu.:27.00
## Median :3.000 real estate :282 Median :33.00
## Mean :2.845 unknown/none :154 Mean :35.55
## 3rd Qu.:4.000 3rd Qu.:42.00
## Max. :4.000 Max. :75.00
##
## installment_plan housing existing_credits default
## bank :139 for free:108 Min. :1.000 Yes:700
## none :814 own :713 1st Qu.:1.000 No :300
## stores: 47 rent :179 Median :1.000
## Mean :1.407
## 3rd Qu.:2.000
## Max. :4.000
##
## dependents telephone foreign_worker job
## Min. :1.000 none:596 no : 37 mangement self-employed:148
## 1st Qu.:1.000 yes :404 yes:963 skilled employee :630
## Median :1.000 unemployed non-resident: 22
## Mean :1.155 unskilled resident :200
## 3rd Qu.:1.000
## Max. :2.000
##
kable(head(credit.df, n=15), format="html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
| checking_balance | months_loan_duration | credit_history | purpose | amount | savings_balance | employment_length | installment_rate | personal_status | other_debtors | residence_history | property | age | installment_plan | housing | existing_credits | default | dependents | telephone | foreign_worker | job |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| < 0 DM | 6 | critical | radio/tv | 1169 | unknown | > 7 yrs | 4 | single male | none | 4 | real estate | 67 | none | own | 2 | Yes | 1 | yes | yes | skilled employee |
| 1 - 200 DM | 48 | repaid | radio/tv | 5951 | < 100 DM | 1 - 4 yrs | 2 | female | none | 2 | real estate | 22 | none | own | 1 | No | 1 | none | yes | skilled employee |
| unknown | 12 | critical | education | 2096 | < 100 DM | 4 - 7 yrs | 2 | single male | none | 3 | real estate | 49 | none | own | 1 | Yes | 2 | none | yes | unskilled resident |
| < 0 DM | 42 | repaid | furniture | 7882 | < 100 DM | 4 - 7 yrs | 2 | single male | guarantor | 4 | building society savings | 45 | none | for free | 1 | Yes | 2 | none | yes | skilled employee |
| < 0 DM | 24 | delayed | car (new) | 4870 | < 100 DM | 1 - 4 yrs | 3 | single male | none | 4 | unknown/none | 53 | none | for free | 2 | No | 2 | none | yes | skilled employee |
| unknown | 36 | repaid | education | 9055 | unknown | 1 - 4 yrs | 2 | single male | none | 4 | unknown/none | 35 | none | for free | 1 | Yes | 2 | yes | yes | unskilled resident |
| unknown | 24 | repaid | furniture | 2835 | 501 - 1000 DM | > 7 yrs | 3 | single male | none | 4 | building society savings | 53 | none | own | 1 | Yes | 1 | none | yes | skilled employee |
| 1 - 200 DM | 36 | repaid | car (used) | 6948 | < 100 DM | 1 - 4 yrs | 2 | single male | none | 2 | other | 35 | none | rent | 1 | Yes | 1 | yes | yes | mangement self-employed |
| unknown | 12 | repaid | radio/tv | 3059 | > 1000 DM | 4 - 7 yrs | 2 | divorced male | none | 4 | real estate | 61 | none | own | 1 | Yes | 1 | none | yes | unskilled resident |
| 1 - 200 DM | 30 | critical | car (new) | 5234 | < 100 DM | unemployed | 4 | married male | none | 2 | other | 28 | none | own | 2 | No | 1 | none | yes | mangement self-employed |
| 1 - 200 DM | 12 | repaid | car (new) | 1295 | < 100 DM | 0 - 1 yrs | 3 | female | none | 1 | other | 25 | none | rent | 1 | No | 1 | none | yes | skilled employee |
| < 0 DM | 48 | repaid | business | 4308 | < 100 DM | 0 - 1 yrs | 3 | female | none | 4 | building society savings | 24 | none | rent | 1 | No | 1 | none | yes | skilled employee |
| 1 - 200 DM | 12 | repaid | radio/tv | 1567 | < 100 DM | 1 - 4 yrs | 1 | female | none | 1 | other | 22 | none | own | 1 | Yes | 1 | yes | yes | skilled employee |
| < 0 DM | 24 | critical | car (new) | 1199 | < 100 DM | > 7 yrs | 4 | single male | none | 4 | other | 60 | none | own | 2 | No | 1 | none | yes | unskilled resident |
| < 0 DM | 15 | repaid | car (new) | 1403 | < 100 DM | 1 - 4 yrs | 2 | female | none | 4 | other | 28 | none | rent | 1 | Yes | 1 | none | yes | skilled employee |
As is common when using modeling techniques to predict the values given by the model, we will divide our data into training and test sets, using 90% of the data as our training set and the rest of the 10% as our test. But before we divide the data set, we must first randomize our data set since this particular data set isn’t random and could lead to misleading results.
Let us first randomize our data set.
set.seed(1978)
credit.random.df <- credit.df[order(runif(1000)), ]
Now that we have randomized our data set, we still must confirm that we have the same data frame sorted differently. We shall compare values of the amount feature across the two data frames.
summary(credit.df$amount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 250 1366 2320 3271 3972 18424
summary(credit.random.df$amount)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 250 1366 2320 3271 3972 18424
Since the summary statistics are identical, this suggests that our random shuffle worked.
credit.train <- credit.random.df[1:900, ]
credit.test <- credit.random.df[901:1000, ]
If all went well, we should have about \(30\%\) of the defaulted loans in each data set.
prop.table(table(credit.train$default))
##
## Yes No
## 0.6888889 0.3111111
prop.table(table(credit.test$default))
##
## Yes No
## 0.8 0.2
This seems to be equally split.
library(C50)
credit.fit <- C5.0(credit.train[-17], credit.train$default)
credit.fit
##
## Call:
## C5.0.default(x = credit.train[-17], y = credit.train$default)
##
## Classification Tree
## Number of samples: 900
## Number of predictors: 20
##
## Tree size: 74
##
## Non-standard options: attempt to group attributes
summary(credit.fit)
##
## Call:
## C5.0.default(x = credit.train[-17], y = credit.train$default)
##
##
## C5.0 [Release 2.07 GPL Edition] Thu Jul 12 11:56:05 2018
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 900 cases (21 attributes) from undefined.data
##
## Decision tree:
##
## checking_balance = unknown: Yes (352/40)
## checking_balance in {< 0 DM,> 200 DM,1 - 200 DM}:
## :...credit_history in {fully repaid,fully repaid this bank}:
## :...housing in {for free,rent}: No (32/4)
## : housing = own:
## : :...purpose in {car (used),domestic appliances,radio/tv,
## : : retraining}: Yes (8/2)
## : purpose in {education,others,repairs}: No (3)
## : purpose = car (new):
## : :...months_loan_duration <= 22: No (7)
## : : months_loan_duration > 22: Yes (2)
## : purpose = business:
## : :...installment_plan = bank: Yes (3)
## : : installment_plan = stores: No (1)
## : : installment_plan = none:
## : : :...age <= 27: No (2)
## : : age > 27: Yes (3)
## : purpose = furniture:
## : :...installment_plan in {none,stores}: Yes (3)
## : installment_plan = bank:
## : :...checking_balance in {< 0 DM,> 200 DM}: No (4)
## : checking_balance = 1 - 200 DM: Yes (1)
## credit_history in {critical,delayed,repaid}:
## :...months_loan_duration > 22:
## :...checking_balance = > 200 DM:
## : :...dependents > 1: No (2)
## : : dependents <= 1:
## : : :...job = mangement self-employed: No (1)
## : : job in {skilled employee,unemployed non-resident,
## : : unskilled resident}: Yes (15/1)
## : checking_balance in {< 0 DM,1 - 200 DM}:
## : :...savings_balance = > 1000 DM: Yes (3)
## : savings_balance = 501 - 1000 DM: No (4/1)
## : savings_balance = 101 - 500 DM:
## : :...employment_length in {0 - 1 yrs,1 - 4 yrs}: No (13/2)
## : : employment_length in {4 - 7 yrs,unemployed}: Yes (9/1)
## : : employment_length = > 7 yrs:
## : : :...job = mangement self-employed: No (1)
## : : job in {skilled employee,unemployed non-resident,
## : : unskilled resident}: Yes (3)
## : savings_balance = unknown:
## : :...checking_balance = 1 - 200 DM: Yes (17/1)
## : : checking_balance = < 0 DM:
## : : :...credit_history = critical: Yes (1)
## : : credit_history in {delayed,repaid}: No (12/3)
## : savings_balance = < 100 DM:
## : :...dependents > 1:
## : :...other_debtors in {co-applicant,
## : : : guarantor}: Yes (3)
## : : other_debtors = none:
## : : :...credit_history = critical: Yes (3)
## : : credit_history in {delayed,repaid}: No (15/5)
## : dependents <= 1:
## : :...purpose in {business,car (new),domestic appliances,
## : : repairs,retraining}: No (32/4)
## : purpose = others: Yes (3/1)
## : purpose = education:
## : :...employment_length in {> 7 yrs,1 - 4 yrs,
## : : : unemployed}: No (5)
## : : employment_length in {0 - 1 yrs,
## : : 4 - 7 yrs}: Yes (2)
## : purpose = radio/tv:
## : :...job = mangement self-employed: Yes (2)
## : : job in {skilled employee,unemployed non-resident,
## : : unskilled resident}: No (19/5)
## : purpose = car (used):
## : :...residence_history <= 3:
## : : :...personal_status = female: No (1)
## : : : personal_status in {divorced male,married male,
## : : : single male}: Yes (4)
## : : residence_history > 3:
## : : :...amount <= 5965: Yes (2)
## : : amount > 5965: No (5)
## : purpose = furniture:
## : :...checking_balance = 1 - 200 DM: No (5)
## : checking_balance = < 0 DM:
## : :...residence_history > 3: No (4)
## : residence_history <= 3: [S1]
## months_loan_duration <= 22:
## :...months_loan_duration <= 8:
## :...age <= 25: No (4/1)
## : age > 25:
## : :...amount <= 3001: Yes (37)
## : amount > 3001:
## : :...residence_history <= 3: Yes (2)
## : residence_history > 3: No (2)
## months_loan_duration > 8:
## :...other_debtors = co-applicant:
## :...installment_rate <= 2: Yes (4)
## : installment_rate > 2:
## : :...amount <= 1569: Yes (2)
## : amount > 1569: No (3)
## other_debtors = guarantor:
## :...installment_plan in {none,stores}: Yes (15)
## : installment_plan = bank:
## : :...purpose = car (new): No (3)
## : purpose in {business,car (used),domestic appliances,
## : education,furniture,others,radio/tv,
## : repairs,retraining}: Yes (3)
## other_debtors = none:
## :...credit_history in {critical,delayed}: Yes (66/14)
## credit_history = repaid:
## :...savings_balance = > 1000 DM: Yes (8)
## savings_balance in {< 100 DM,101 - 500 DM,
## : 501 - 1000 DM,unknown}:
## :...checking_balance = < 0 DM:
## :...residence_history <= 2:
## : :...installment_plan = bank: Yes (1)
## : : installment_plan in {none,
## : : stores}: No (20/4)
## : residence_history > 2:
## : :...amount <= 1386: No (18/5)
## : amount > 1386: Yes (19/6)
## checking_balance = > 200 DM: [S2]
## checking_balance = 1 - 200 DM:
## :...installment_plan = stores: No (1)
## installment_plan = bank:
## :...months_loan_duration <= 11: Yes (2)
## : months_loan_duration > 11: No (6)
## installment_plan = none: [S3]
##
## SubTree [S1]
##
## employment_length in {> 7 yrs,4 - 7 yrs}: No (2)
## employment_length in {0 - 1 yrs,1 - 4 yrs,unemployed}: Yes (7)
##
## SubTree [S2]
##
## property in {building society savings,other}: Yes (11)
## property = unknown/none: No (1)
## property = real estate:
## :...job in {mangement self-employed,skilled employee,
## : unemployed non-resident}: Yes (3)
## job = unskilled resident: No (3)
##
## SubTree [S3]
##
## employment_length = 4 - 7 yrs: Yes (9)
## employment_length in {> 7 yrs,0 - 1 yrs,1 - 4 yrs,unemployed}:
## :...job in {unemployed non-resident,unskilled resident}: Yes (12/2)
## job = mangement self-employed:
## :...property in {building society savings,other,real estate}: Yes (2)
## : property = unknown/none: No (2)
## job = skilled employee:
## :...savings_balance in {101 - 500 DM,501 - 1000 DM}: Yes (6/1)
## savings_balance = < 100 DM:
## :...telephone = none: No (9/1)
## : telephone = yes: Yes (3/1)
## savings_balance = unknown:
## :...personal_status in {female,married male}: No (3)
## personal_status in {divorced male,single male}: Yes (4)
##
##
## Evaluation on training data (900 cases):
##
## Decision Tree
## ----------------
## Size Errors
##
## 74 105(11.7%) <<
##
##
## (a) (b) <-classified as
## ---- ----
## 585 35 (a): class Yes
## 70 210 (b): class No
##
##
## Attribute usage:
##
## 100.00% checking_balance
## 60.89% credit_history
## 54.22% months_loan_duration
## 35.56% savings_balance
## 28.89% other_debtors
## 15.11% purpose
## 14.67% dependents
## 13.11% installment_plan
## 10.22% employment_length
## 10.00% amount
## 9.78% job
## 9.67% residence_history
## 7.67% housing
## 5.56% age
## 2.44% property
## 1.33% personal_status
## 1.33% telephone
## 1.00% installment_rate
##
##
## Time: 0.0 secs
The above output is the decision tree to determine whether or not a loan applicant is worthy of receiving a loan. It also shows which features are most important in determining which factors are most important when considering loan applications.