Classification Trees

Introduction

Classification Trees are a type of decision tree algorithm specifically tailored for solving classification problems. Their primary attraction lies in their interpretability, ease of implementation, and ability to handle a variety of data types.

General Goal

The general goal of Classification Trees is to create a model that predicts the value of a binary target variable by learning simple decision rules inferred from the data features.

The decision rules are easy to understand and interpret, making Classification Trees ideal for applications where interpretability is crucial.

Typical Use Cases

A typical classification tree

Terminology for Classification Trees

Root Node: The topmost node that includes the entire dataset. It represents the starting point of the decision-making process.

Decision Node: An internal node that performs a test based on feature values and makes a decision to split the data. Each decision node results in two child nodes.

Leaf Node: A terminal node where no further splitting occurs. It contains the class label to be assigned to new data points.

Branch: A section of the tree that connects nodes, representing the decision path leading to a leaf node.

Depth: The length of the longest path from the root node to a leaf.

Not easy to understand, right? Let’s use a sample data set to visualize it!

Demo: Building a Classification Tree

We will be working on the credit data again. As before, our goal is to predict default!

file_path = "https://xiaoruizhu.github.io/Data-Mining-R/lecture/data/credit_default.csv"
credit.data <- read.csv(file = file_path, header=T)[1:2000,]
credit.data$SEX<- as.factor(credit.data$SEX)
credit.data$EDUCATION<- as.factor(credit.data$EDUCATION)
credit.data$MARRIAGE<- as.factor(credit.data$MARRIAGE)
colnames(credit.data)
##  [1] "LIMIT_BAL"                  "SEX"                       
##  [3] "EDUCATION"                  "MARRIAGE"                  
##  [5] "AGE"                        "PAY_0"                     
##  [7] "PAY_2"                      "PAY_3"                     
##  [9] "PAY_4"                      "PAY_5"                     
## [11] "PAY_6"                      "BILL_AMT1"                 
## [13] "BILL_AMT2"                  "BILL_AMT3"                 
## [15] "BILL_AMT4"                  "BILL_AMT5"                 
## [17] "BILL_AMT6"                  "PAY_AMT1"                  
## [19] "PAY_AMT2"                   "PAY_AMT3"                  
## [21] "PAY_AMT4"                   "PAY_AMT5"                  
## [23] "PAY_AMT6"                   "default.payment.next.month"

Let’s look at a very simple classification tree. It is so simple that it is just a single rule: whether PAY_0 < 2.
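
Here is a minimal sketch of how such a one-rule tree could be reproduced with rpart. This is an assumption: the figure on the slide may have been generated with different settings, and capping maxdepth at 1 is simply one way to force a single split.

library(rpart)
library(rpart.plot)
# Cap the depth at 1 to force a single split (an assumption to mimic the one-rule tree)
simple_tree <- rpart(as.factor(default.payment.next.month) ~ ., data=credit.data,
                     control = rpart.control(maxdepth = 1))
# extra=4 prints the class proportions in each node; yesno=2 labels every branch
rpart.plot(simple_tree, extra=4, yesno=2)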

That means:

There is one Root Node (which is naturally a Decision Node) and two Leaf Nodes. Each node is labeled with three numbers; for example:

The Root Node is labeled with 0 and .78 .22:

That means: the predicted class for this node is 0 (the majority class, i.e., no default), and .78 and .22 are the proportions of class 0 and class 1 among the observations in the node.

Split the tree again!

In this case, we have 1 Root Node, 2 Decision Nodes (again, the Root Node is naturally a Decision Node), and 3 Leaf Nodes.
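
A similar hedged sketch: assuming that simply allowing one more level of splitting (maxdepth = 2) reproduces the deeper tree shown here, though the actual figure may have used different controls:

two_level_tree <- rpart(as.factor(default.payment.next.month) ~ ., data=credit.data,
                        control = rpart.control(maxdepth = 2))
rpart.plot(two_level_tree, extra=4, yesno=2)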

Important Questions about Classification Trees

In this section, we will address four key questions concerning the construction and application of Classification Trees.

  1. How to choose the split variable and its value?

  2. When should we stop splitting?

  3. For the leaf nodes, how do we decide the predicted class?

  4. How to classify a new data point?

1. How to Choose the Split Variable and Its Value?

For each split, here are the general steps involved:

  1. Enumerate Possible Splits: For each variable, list all possible splits.
    • Continuous variables: Usually, sorted unique values are used to find potential splits.
    • Categorical variables: Either a “one-vs-all” approach or a full subset search can be employed.
  2. Calculate Split Quality: Use a criterion to evaluate how “good” each split is.
    • Gini Index: The default criterion for the rpart function in R. It measures node impurity; a Gini index of 0 indicates perfect purity (all elements are of the same class). The Gini index is computationally less intensive than information gain.

\[ \text{Gini} = 1 - \sum_{i=1}^{c} p(i)^{2} \] where c is the number of classes (c = 2 for a binary classification problem) and \(p(i)\) is the proportion of data points belonging to class i.

  2. Calculate Split Quality (continued):

    • Information Gain: This criterion is based on entropy and measures the quality of a split by calculating how much uncertainty in the class labels is reduced after the split.

\[ \text{Information Gain} = \text{Entropy}(parent) - \sum \left( \frac{|child|}{|parent|} \times \text{Entropy}(child) \right) \] The entropy for a node is given by:

\[ \text{Entropy}(t) = - \sum_{i=1}^{c} p(i) \log_{2} p(i) \]

  3. Select Best Split: Choose the variable and its split value that result in the best split according to the chosen criterion.

  4. Implement the Split: Divide the data into subsets based on the chosen variable and value, resulting in two child nodes.

From these two child nodes, we can do another round of split on each node, and go on…
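
To make the two criteria above concrete, here is a small sketch (not from the original slides; the function names and the example proportions are made up for illustration) that computes the Gini index, the entropy, and the information gain of a candidate split:

# Impurity measures for a vector of class proportions p
gini    <- function(p) 1 - sum(p^2)
entropy <- function(p) -sum(ifelse(p > 0, p * log2(p), 0))  # treat 0*log2(0) as 0

# Example: a node with 78% class 0 and 22% class 1
p <- c(0.78, 0.22)
gini(p)     # 1 - (0.78^2 + 0.22^2) = 0.3432
entropy(p)  # about 0.76 bits

# Information gain of a hypothetical split sending n_left / n_right points to
# children with class proportions p_left / p_right
info_gain <- function(p_parent, p_left, p_right, n_left, n_right) {
  n <- n_left + n_right
  entropy(p_parent) - (n_left / n) * entropy(p_left) - (n_right / n) * entropy(p_right)
}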

2. When Should We Stop Splitting?

  • Splitting too much can lead to overfitting

  • Not splitting enough can result in underfitting.

Ideally, we want to stop splitting at the point where the error on new, unseen data is lowest.

But this is not realistic, as we don’t know in advance how the tree will perform on unseen data. There are, however, a few practical strategies:

2. When Should We Stop Splitting? A Realistic Way!

  1. Pruning: First build a big tree, then use pruning methods like reduced error pruning or cost complexity pruning to remove unnecessary branches.

  2. Complexity Parameter (cp): Setting a threshold for the complexity parameter can stop the tree from growing when the improvement is marginal.

    • Lower cp values result in larger trees, while higher values result in smaller trees.
  3. Minimum Samples: Limit the number of samples required for a split (minsplit) or a leaf node (minbucket).

    • For example, minsplit=20 means that a node must have at least 20 samples to be considered for a further split.
  4. Maximum Depth: As demonstrated earlier, the maximum depth (maxdepth) of the tree can be explicitly set.
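
In the rpart package, these stopping rules map onto arguments of rpart.control(). A hedged sketch follows; the particular values are chosen purely for illustration, not as recommendations:

library(rpart)
ctrl <- rpart.control(minsplit  = 20,    # a node needs at least 20 samples to be considered for a split
                      minbucket = 7,     # every leaf must contain at least 7 samples
                      maxdepth  = 4,     # cap the depth of the tree
                      cp        = 0.01)  # a split must improve the fit by at least cp
# fit_small <- rpart(as.factor(default.payment.next.month) ~ ., data=credit.data, control=ctrl)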

3. How Do We Decide the Predicted Class in the Leaf Node?

Once a record reaches a leaf node, it’s time to make a classification decision. Here are the key approaches:

  1. Majority Voting: The most straightforward method; just assign the class that holds the majority in the leaf node.

  2. Class Probabilities: Instead of making a hard assignment, you can calculate the class probabilities based on the proportion of each class in the leaf node.

    • For example, if a leaf node contains 3 ‘Yes’ and 2 ‘No’, then the probabilities would be \(\frac{3}{5}\) for ‘Yes’ and \(\frac{2}{5}\) for ‘No’.
    • The prediction is not a fixed class label when using this method; a cutoff can later be applied to the probabilities if a hard label is needed.
  3. Weighted Voting: In some cases, the classes in the leaf node can be weighted based on some external metric or business rule.
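
A tiny sketch of the first two rules, using the hypothetical 3 ‘Yes’ / 2 ‘No’ leaf from the example above:

# Hypothetical class counts in a single leaf node
leaf_counts <- c(Yes = 3, No = 2)
names(which.max(leaf_counts))   # majority voting: "Yes"
leaf_counts / sum(leaf_counts)  # class probabilities: Yes = 0.6, No = 0.4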

4. How to Classify a New Data Point?

  1. Start at the Root: The classification process begins at the root node of the tree.

  2. Follow the Path: Based on the decision rules at each internal node, navigate through the tree.

    • If the decision is binary, go left or right.
    • For multi-class splits, take the appropriate branch.
  3. Reach a Leaf Node: Once you reach a leaf node, use the decision criterion at the leaf to classify the data point.

    • Majority Voting, Class Probabilities, etc., as discussed in the previous slide.
  4. Assign Class Label: Finally, assign the class label or class probabilities present at the leaf node to the new data point.
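
In rpart, this whole walk down the tree is handled by predict(). A minimal sketch, assuming a fitted model such as the fit_tree and the credit.test set built later in these notes, with one held-out row playing the role of the “new” data point:

# new_point stands in for a brand-new record (here, the first row of the test set)
new_point <- credit.test[1, ]
predict(fit_tree, newdata = new_point, type = "class")  # hard class label (majority voting)
predict(fit_tree, newdata = new_point, type = "prob")   # class probabilities at the leaf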

Evaluation of a Tree Model

Once a tree model is built, it’s essential to evaluate its performance to ensure it is generalizing well to new data. We can also use it to compare different models.

  1. Accuracy: Measures the proportion of correctly classified instances out of the total instances.
    • \(\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}\)
  2. Confusion Matrix, Precision, Recall, and F1 Score: These are more detailed metrics that consider the performance per class.
    • \(\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}\)
    • \(\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}\)
    • \(\text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}\)
  3. AUC-ROC Curve: The area under the Receiver Operating Characteristic curve is a performance measure for classification problems across various threshold settings.
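
For reference, a small sketch showing how these metrics can be computed in R from a 2x2 confusion matrix; the counts below are made up, rows are the true classes, columns the predictions, and “1” is treated as the positive class:

# Hypothetical confusion matrix (true class in rows, predicted class in columns)
cm <- matrix(c(50, 10,
                5, 35),
             byrow = TRUE, nrow = 2,
             dimnames = list(true = c("0", "1"), pred = c("0", "1")))
TP <- cm["1", "1"]; FP <- cm["0", "1"]; FN <- cm["1", "0"]
accuracy  <- sum(diag(cm)) / sum(cm)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)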

Pros and Cons of Classification Trees

Understanding both the strengths and weaknesses of Classification Trees will help you make informed decisions when selecting algorithms for different tasks.

Pros

  1. Interpretable Model: One of the most interpretable machine learning algorithms, making it easy to explain to non-experts.

  2. Handles both Categorical and Numerical Features: The algorithm can handle data of various types.

  3. Minimal Data Preprocessing: No need for scaling, and it can handle missing values.

  4. Quick to Build: Computational complexity is generally linear in terms of the number of features and data points.

  5. Non-Parametric: Does not make strong assumptions about the underlying data distribution.

Cons

  1. Overfitting: Trees can easily capture noise in data and overfit, although techniques like pruning can help.

  2. Unstable: Slight changes in data can result in a significantly different tree.

  3. Biased to Dominant Class: In imbalanced datasets, the algorithm is biased towards the dominant class.

  4. Limited Expressiveness: Each decision boundary is axis-aligned; in other words, each split considers only one variable at a time. (This will become clearer when we talk about Regression Trees.)

Extension to Multi-class Problem

Classification Trees are not limited to binary classification. They can be readily extended to handle multi-class problems.

Basic Idea

  1. Same Algorithm: The same basic algorithm used for binary classification can be extended to multi-class classification.

  2. Multiple Classes in Leaf Nodes: Each leaf node in the tree will now contain samples from multiple classes, rather than just two.

  3. Majority Voting: Classification is done by assigning the most frequent class in the leaf node to the new data point.
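
A quick sketch (not part of the credit demo) using R’s built-in iris data, whose Species target has three classes, to show that the same rpart() call works unchanged for a multi-class problem:

library(rpart)
library(rpart.plot)
# Species has three classes: setosa, versicolor, virginica
iris_tree <- rpart(Species ~ ., data = iris, method = "class")
rpart.plot(iris_tree, extra=4, yesno=2)
# Each prediction is the majority class of the leaf the observation falls into
predict(iris_tree, head(iris, 3), type = "class")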

Package rpart and rpart.plot

rpart Package

  1. Overview: An important package that implements the classification and regression tree (CART) algorithm.

  2. Key Functions:

    • rpart(): Builds the tree model.
    • predict(): Makes predictions based on the model.
    • summary(): Provides a detailed summary of the tree model.
  3. Installation:

#If you haven't, run the following to install rpart package
install.packages('rpart')

rpart.plot Package

  1. Overview: An enhanced plotting package for rpart.
    • Offers advanced visualization features.
    • Highly customizable.
  2. Key Functions:
    • rpart.plot(): Creates an advanced plot of the tree.
  3. Installation:
#If you haven't, run the following to install rpart.plot package
install.packages("rpart.plot")

Let’s code!

We will be working on the credit data again.

file_path = "https://xiaoruizhu.github.io/Data-Mining-R/lecture/data/credit_default.csv"
credit.data <- read.csv(file = file_path, header=T)

And let’s take a look of the column names:

colnames(credit.data)
##  [1] "LIMIT_BAL"                  "SEX"                       
##  [3] "EDUCATION"                  "MARRIAGE"                  
##  [5] "AGE"                        "PAY_0"                     
##  [7] "PAY_2"                      "PAY_3"                     
##  [9] "PAY_4"                      "PAY_5"                     
## [11] "PAY_6"                      "BILL_AMT1"                 
## [13] "BILL_AMT2"                  "BILL_AMT3"                 
## [15] "BILL_AMT4"                  "BILL_AMT5"                 
## [17] "BILL_AMT6"                  "PAY_AMT1"                  
## [19] "PAY_AMT2"                   "PAY_AMT3"                  
## [21] "PAY_AMT4"                   "PAY_AMT5"                  
## [23] "PAY_AMT6"                   "default.payment.next.month"

Pre-processing

Rename the default.payment.next.month column to default:

library(dplyr)
credit.data<- rename(credit.data, default=default.payment.next.month)

We know that SEX, EDUCATION, and MARRIAGE are categorical, so we convert them to factors.

credit.data$SEX<- as.factor(credit.data$SEX)
credit.data$EDUCATION<- as.factor(credit.data$EDUCATION)
credit.data$MARRIAGE<- as.factor(credit.data$MARRIAGE)

Data Splitting

set.seed(2023)
index <- sample(1:nrow(credit.data),nrow(credit.data)*0.60)
credit.train = credit.data[index,]
credit.test = credit.data[-index,]

Fit Classification Tree

We will use the function rpart from package rpart.

Important!! We need to convert default to a factor when fitting a Classification Tree.

library(rpart)
library(rpart.plot)
# fit the model
fit_tree <- rpart(as.factor(default) ~ ., data=credit.train)
rpart.plot(fit_tree,extra=4, yesno=2)

Prediction and Confusion Matrix (Train)

Obtain the predicted values for training data and confusion matrix.

# Make predictions on the train data
pred_credit_train <- predict(fit_tree, credit.train, type="class")

# Confusion matrix to evaluate the model on train data
Cmatrix_train = table(true = credit.train$default,
                      pred = pred_credit_train)
Cmatrix_train
##     pred
## true    0    1
##    0 5316  310
##    1  956  618

Mis-classification Rate (MR)

1 - sum(diag(Cmatrix_train))/sum(Cmatrix_train)
## [1] 0.1758333

Prediction and Confusion Matrix (Test)

Obtain the predicted values for testing data and confusion matrix.

# Make predictions on the test data
pred_credit_test <- predict(fit_tree, credit.test, type="class")

# Confusion matrix to evaluate the model on test data
Cmatrix_test = table(true = credit.test$default,
                     pred = pred_credit_test)
Cmatrix_test
##     pred
## true    0    1
##    0 3542  200
##    1  644  414

Mis-classification Rate (MR)

1 - sum(diag(Cmatrix_test))/sum(Cmatrix_test)
## [1] 0.1758333

Asymmetric Cost!

Using a classification tree with an asymmetric cost is, once again, a little different!

# We need to define a loss (cost) matrix first; keep the 0s on the diagonal
# Rows correspond to the true class (0, 1) and columns to the predicted class (0, 1)
cost_matrix <- matrix(c(0, 1,  # cost of 1 for a false positive (true 0 predicted as 1)
                        5, 0),  # cost of 5 for a false negative (true 1 predicted as 0)
                      byrow = TRUE, nrow = 2)
fit_tree_asym <- rpart(as.factor(default) ~ ., data=credit.train, 
                       parms = list(loss = cost_matrix))
rpart.plot(fit_tree_asym,extra=4, yesno=2)

#get predictions for training
pred_credit_train <- predict(fit_tree_asym, credit.train,
                             type = "class")
#C matrix for training
table( true = credit.train$default, pred = pred_credit_train)
##     pred
## true    0    1
##    0 3550 2076
##    1  364 1210

Comparing testing confusion matrices for different costs

#get predictions for testing
pred_credit_test <- predict(fit_tree_asym, credit.test, type = "class")
#C matrix for testing with more weights on "1"
Cmatrix_test_weight = table( true = credit.test$default, pred = pred_credit_test)
Cmatrix_test_weight
##     pred
## true    0    1
##    0 2359 1383
##    1  289  769
#Recall we had the testing confusion matrix without any asymmetric cost.
Cmatrix_test # no weights
##     pred
## true    0    1
##    0 3542  200
##    1  644  414

Exercise

1. Use a higher cost of 2 for FN and 1 for FP, and fit the tree model again.

2. Report Confusion matrix for training and testing data.

3. Report MR for training and testing data.

# work on it here!

Obtain AUC for training data

While tree-based models like those generated with rpart are not probabilistic classifiers in the same sense as logistic regression or some other algorithms, you can still compute an AUC score based on the leaf node probabilities of belonging to the positive class.

We will obtain the predicted probabilities first with:

# obtain predicted probability
pred_prob_train = predict(fit_tree, credit.train, type = "prob")

# This is necessary again, as predict() for a tree model returns two columns, one for class 0 and one for class 1.
# Replace "1" with the actual category if the response variable is a factor
pred_prob_train = pred_prob_train[,"1"] 

Obtain AUC for training data

# Looks familiar, right?
library(ROCR)
pred <- prediction(pred_prob_train, credit.train$default)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

Obtain AUC for training data

#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.6965917

Obtain AUC for testing data

# obtain predicted probability
pred_prob_test = predict(fit_tree, credit.test, type = "prob")
# This is necessary again, as predict() for a tree model returns two columns, one for class 0 and one for class 1.
pred_prob_test = pred_prob_test[,"1"] # replace "1" with the actual category if the response variable is a factor
#ROC
pred <- prediction(pred_prob_test, credit.test$default)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

Obtain AUC for testing data

#Get the AUC
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.6950851

Exercise

1. Revisit the model you just fitted in the last exercise.

2. Plot ROC curve for training and testing data.

3. Report AUC for training and testing data.

# work on it here!

Control the depth of a tree

Use maxdepth

# Fit the tree with a maximum depth of 3
depth_control_tree <- rpart(default ~ ., data = credit.train, method = "class",
                            control = rpart.control(maxdepth = 3),
                            parms = list(loss = cost_matrix)) 
                            # I am using cost_matrix just for demonstration, remove it if needed
rpart.plot(depth_control_tree,extra=4, yesno=2)

Using CP

# Fit the tree with a complexity parameter of cp = 0.003
depth_control_tree <- rpart(default ~ ., data = credit.train,
                            method = "class",cp = 0.003,
                            parms = list(loss = cost_matrix)) 
                            # I am using cost_matrix just for demonstration,
                            # remove it if needed
rpart.plot(depth_control_tree,extra=4, yesno=2)

Pruning with cp

# Fit a full tree with very low cp value
full_tree <- rpart(default ~ ., data = credit.train,
                   method = "class", cp = 0.001)

# Display CP table to identify the optimal cp
printcp(full_tree)
## 
## Classification tree:
## rpart(formula = default ~ ., data = credit.train, method = "class", 
##     cp = 0.001)
## 
## Variables actually used in tree construction:
##  [1] AGE       BILL_AMT1 BILL_AMT2 BILL_AMT4 BILL_AMT5 BILL_AMT6 EDUCATION
##  [8] LIMIT_BAL PAY_0     PAY_2     PAY_5     PAY_6     PAY_AMT1  PAY_AMT2 
## [15] PAY_AMT3  PAY_AMT5  PAY_AMT6  SEX      
## 
## Root node error: 1574/7200 = 0.21861
## 
## n= 7200 
## 
##           CP nsplit rel error  xerror     xstd
## 1  0.1747141      0   1.00000 1.00000 0.022281
## 2  0.0104828      1   0.82529 0.82529 0.020730
## 3  0.0082592      3   0.80432 0.80940 0.020573
## 4  0.0041296      4   0.79606 0.79860 0.020464
## 5  0.0034943      8   0.77891 0.80686 0.020547
## 6  0.0031766     10   0.77192 0.80559 0.020535
## 7  0.0021177     11   0.76874 0.81067 0.020586
## 8  0.0019060     19   0.75095 0.83037 0.020779
## 9  0.0015883     32   0.72490 0.84180 0.020890
## 10 0.0012706     36   0.71792 0.85896 0.021053
## 11 0.0010589     46   0.70457 0.88564 0.021301
## 12 0.0010000     53   0.69314 0.91169 0.021535

From the cp table, choose the cp value that gives you the lowest xerror, and use that value to prune the tree as follows:

# Identify the optimal cp manually or use the following code
optimal_cp <- full_tree$cptable[which.min(
  full_tree$cptable[,"xerror"]),"CP"] 

# Prune the tree
pruned_tree <- prune(full_tree, cp = optimal_cp)
rpart.plot(pruned_tree,extra=4, yesno=2)