Decision Trees

Decision Tree models are based on how we take decisions in our daily lives. It uses sample data feature that creates decision rules to form a tree-like structure. Decision trees can be used for both regression and classification problems. They reduce a data sample to a manageable set of rules which can be used to make a classification decision or generate a prediction. Decision trees are built by adding features and an associated rule to nodes incrementally. They are typically built from the root downwards using a recursive divide-and-conquer algorithm.The Divide and Conquer Algorithm grows a tree by recursively splitting the data into meaningful subsets. Each new rule progressively refines the classification in a hierarchical manner.

The banknote data-frame in the mclust package contains measurements made on genuine and counterfeit Swiss 1000 franc bank notes.

data("banknote", package = "mclust")
head(banknote)
##    Status Length  Left Right Bottom  Top Diagonal
## 1 genuine  214.8 131.0 131.1    9.0  9.7    141.0
## 2 genuine  214.6 129.7 129.7    8.1  9.5    141.7
## 3 genuine  214.8 129.7 129.7    8.7  9.6    142.2
## 4 genuine  214.8 129.7 129.6    7.5 10.4    142.0
## 5 genuine  215.0 129.6 129.7   10.4  7.7    141.8
## 6 genuine  215.7 130.8 130.5    9.0 10.1    141.4
tail(banknote)
##          Status Length  Left Right Bottom  Top Diagonal
## 195 counterfeit  214.9 130.3 130.5   11.6 10.6    139.8
## 196 counterfeit  215.0 130.4 130.3    9.9 12.1    139.6
## 197 counterfeit  215.1 130.3 129.9   10.3 11.5    139.7
## 198 counterfeit  214.8 130.3 130.4   10.6 11.1    140.0
## 199 counterfeit  214.7 130.7 130.8   11.2 11.2    139.4
## 200 counterfeit  214.3 129.9 129.9   10.2 11.5    139.6

The first six banknotes are all genuine. The first example has a length of 214.8, and diagonal of 141.0.

The last six banknotes are all counterfeit and the last example has a length of 214.3 and a diagonal of 139.6.

Let’s check out the distribution of the class variable.

table(banknote$Status)
## 
## counterfeit     genuine 
##         100         100

The class is well balanced.

Looking at the correlation

## corrplot 0.84 loaded

The sample consists of measurements made on 100 genuine and 100 counterfeit notes. We select 150 at random without replacement for the training set via the sample function. The remaining observations are used for the test set:

set.seed(2020)
N <- nrow(banknote)
ind_train <- sample(1:N, size = 150, replace = F)
train <- banknote[ind_train,]

test <- banknote[-ind_train,]

We observe that Left and Right have a relatively high correlation of 0.74; and Diagonal is negatively correlated with all the features except Length.

Train Model using Train Set We will use the C5 algorithm to build our decision tree. It uses entropy for splits.

What is Entropy?

It is a measure of randomness.It is important to decision tree splits because it can be used to calculate the homogeneity of a sample.

In the case of binary classification, the sample is completely homogeneous when entropy is zero. If there is equal uncertainty about which class an observation belongs to, it has an entropy of one. Therefore, decision trees choose a split for a given partition that minimizes entropy in the partition’s distribution of labels.

Decision tree algorithms use a number of impurity measures. Other popular metrics include the gini- index and distance-based metrics.

The package C50 contains the decision tree fitting function C5.0.

library(C50)
fitc <- C5.0(Status~., data = train)
plot(fitc)

The root node contains the feature Diagonal, with the first split occurring at 140.6. The second node is the feature Bottom, and the final node is Length.

We can view the actual rules of the decision tree as a series of if-then statements.With a large tree, this can improve readability.

fitc_rules <- C5.0(Status~., data = train, rules= T)

summary(fitc_rules)
## 
## Call:
## C5.0.formula(formula = Status ~ ., data = train, rules = T)
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Fri Jun 19 11:12:39 2020
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 150 cases (7 attributes) from undefined.data
## 
## Rules:
## 
## Rule 1: (70, lift 2.0)
##  Bottom > 8.6
##  Diagonal <= 140.6
##  ->  class counterfeit  [0.986]
## 
## Rule 2: (30, lift 2.0)
##  Top > 11.2
##  Diagonal <= 140.6
##  ->  class counterfeit  [0.969]
## 
## Rule 3: (74, lift 1.9)
##  Diagonal > 140.6
##  ->  class genuine  [0.987]
## 
## Rule 4: (47, lift 1.9)
##  Bottom <= 8.6
##  Top <= 11.2
##  ->  class genuine  [0.980]
## 
## Default class: genuine
## 
## 
## Evaluation on training data (150 cases):
## 
##          Rules     
##    ----------------
##      No      Errors
## 
##       4    0( 0.0%)   <<
## 
## 
##     (a)   (b)    <-classified as
##    ----  ----
##      74          (a): class counterfeit
##            76    (b): class genuine
## 
## 
##  Attribute usage:
## 
##   98.67% Diagonal
##   78.00% Bottom
##   51.33% Top
## 
## 
## Time: 0.0 secs

Evaluating the model performance on Training set

predc_train <- predict(fitc, train, type= "class")
head(predc_train)
## [1] counterfeit genuine     genuine     genuine     genuine     counterfeit
## Levels: counterfeit genuine
table(predc_train, train$Status,dnn=c("predicted class","observed class"))
##                observed class
## predicted class counterfeit genuine
##     counterfeit          74       0
##     genuine               0      76

The results are perfect but let’s not forget we are predicting on the train data. There is a huge probability that the model is an over fit when applied on training data. We must test this model on previously unseen data.

predc_test <- predict(fitc, test, type = "class")

table(predc_test, test$Status, dnn = c("predicted class","observed class"))
##                observed class
## predicted class counterfeit genuine
##     counterfeit          25       0
##     genuine               1      24

The test sample contains 50 observations. The decision tree classifies all but 1 of these correctly. a counterfeit note was predicted as genuine.

Improving Model Performance

We will select an alternative splitting criterion. Th impurity measures or splitting criteria that are commonly used in binary decision trees are Gini impurity, Entropy and Deviance. The tree package lets us use the Deviance or Gini metric. Let’s give it a go:

library(tree)
fit <- tree(Status~., data = train, split = "deviance")

The tree function is similar to C5.0, however it allows you to set the splitting criterion via the split argument. The options are “deviance” and “gini”.

plot(fit)
text(fit)

By using deviance we end up with a simpler tree structure. As with the previous model, Diagonal is selected as the root node. However, only Bottom makes the cut as an additional feature in the tree. However, since both branches of Bottom lead to the counterfeit classification, the decision tree breaks down into the simple rule of thumb - If Diagonal <140.65, the banknote is fake.

summary(fit)
## 
## Classification tree:
## tree(formula = Status ~ ., data = train, split = "deviance")
## Variables actually used in tree construction:
## [1] "Diagonal" "Bottom"  
## Number of terminal nodes:  3 
## Residual mean deviance:  0.05196 = 7.638 / 147 
## Misclassification error rate: 0.01333 = 2 / 150

The first few lines remind us of the formula used to fit the tree. This is followed by the details of the variables used in tree construction, and the number of leaf (terminal) nodes.

The model has a residual deviance of 0.051, with a miss classification error rate of just over 1.33%. In other word 2 of the 150 training examples were miss classified.

Test set performance

Let’s test the model on the testing dataset

pred_fit<- predict(fit, test)
pred_class <- colnames(pred_fit)[max.col(pred_fit, ties.method = c("random"))]
table(test$Status, pred_class, dnn = c("Observed Class", "Predicted Class" ))
##               Predicted Class
## Observed Class counterfeit genuine
##    counterfeit          26       0
##    genuine               0      24

The model correctly classifies all of the test set examples. The decision tree has achieved a classification accuracy of 100%.

The use of an alternative split criteria improved the test set performance of the decision tree.