Decision Trees (Week 2)

Modelling Data - MA7442

Dr Rob Free

rob.free@leicester.ac.uk

Dr Hammad Afzal

ha427@leicester.ac.uk

22/01/2025

Overview

This lecture focuses on Supervised Learning, a method that partitions data into subsets based on similarity without predefined labels.
We shall cover the the basic concepts of Supervised Learning and Classification Models
We shall explore the Decision Tree Model

Decision Trees

Introduction

Decision Trees (DTs) are a non-parametric supervised learning method used for Classification and Regression.
The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
Example: Can we predict infection from temperature?

Lets assume a simple situation: if Temperature < 96.8 °F or Temperature > 100.4 °F, then it is Infection

Definitions

Decision Tree has a hierarchical, tree structure, which consists of a root node, branches, internal nodes (decision nodes) and leaf nodes (terminal nodes).
It starts with a root node that has no incoming branches.
From the root node, branches lead to decision nodes that evaluate the data based on specific features. These evaluations split the data into smaller, more similar groups.
The final groups are represented by leaf nodes (terminal nodes), which show the possible outcomes in the dataset.

Predictive Modelling using Decision Tree

Predicting infection: Considering more attributes:

Temperature < 96.8 °F or > 100.4 °F
Heart rate > 90 beats/min
Respiratory rate > 20 breaths/min

More Complexity? For example, Heart rate depends on age so our prediction can be further improved by employing more related attributes as well!

Predictive Modelling using Decision Tree (contd..)

Classification - A two-step process

Model construction:

Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
The set of tuples used for model construction is training set
The model is represented as classification rules, decision trees, or mathematical formulae

Classification - A two-step process (Contd..)

Model Model Usage:

For classifying future or unknown objects
Estimate accuracy of the model
- The known label of test sample is compared with the classified result from the model
- Accuracy rate is the percentage of test set samples that are correctly classified by the model
- Test set is independent of training set (otherwise over-fitting)
If the accuracy is acceptable, use the model to classify new data

Create Model from Training Data

Apply Model to Test Data

Apply Model to Test Data

Build the Decision Tree (Algorithm)

Step 1: At Start, all training examples are at root node.

Build the Decision Tree (Contd..)

Step 2: Attributes are tested and selected on the basis of a statistical measure (e.g., information gain). Attribute with highest Information Gain is selected as the Decision Node. Examples are partitioned based on selected attributes

Build the Decision Tree (Contd..)

Step 3: Examples are partitioned recursively based on selected attributes

Build the Decision Tree (Contd..)

Repeat ‘Step-2’ and ‘Step-3’ until you get leaf nodes (or) pure nodes, this means all the training examples at each node, belong to the same class.

Choosing the Best Attribute

Consider a Dataset with 20 records
10 records belong to Class 0: C0
10 records belong to Class 1: C1

Choosing the Best Attribute

If we choose Gender as Splitting attribute: The class distribution for each value (M and F) is 6:4 and 4:6

Choosing the Best Attribute

Similarly we can split based on Car Type and Shirt Size as

General rule is that attributes with “purer” class distribution are preferred. We need a measure of degree of purity

Entropy and Information

Entropy is the measure of uncertainty associated with a random variable.
For a discrete randome variable \(Y\) taking m distinct values \({y_1, y_2, .... y_m}\)

\[ Entropy(Y) = -\sum_{i=1}^m P(y_i) \log_2(y_i) \]

Interpretation:
- High Impurity = Higher Uncertainity = Higher Entropy
- Lower Impurity = Lower Uncertainity= Lower Entropy

Computing the Entropy

\[ Entropy(Y) = -\sum_{i=1}^m P(y_i) \log_2(y_i) \]

Computing the Entropy (Contd..)

Entropy is maximum if all the data points are uniformly distributed to both the classes, then the value of entropy = 1.
Entropy is minimum, if all the data points belong to the same class, then the value of entropy=0.
For example: If we have a dataset of 100 data points/observations belonging to two classes either positive(+ve) or negative(-ve), then entropy is maximum (entropy = 1), when 50 are (+ve) and 50 are (-ve).
Entropy is minimum (entropy = 0) when all data points belong to either the positive (+ve) class or the negative class (-ve).

Selecting Best Attribute using Information Gain (IG)

Information Gain is a measure used to decide which feature (attribute) is the best to split the data when building a decision tree. It quantifies how much uncertainty (measured as entropy) is reduced in the dataset after splitting it based on a specific feature. Here’s how it works:

Calculate Entropy of Dataset (Before Split):
Entropy measures the uncertainty or randomness in the dataset before any split. A higher entropy indicates more disorder, while a lower entropy suggests a clearer separation of classes.
Split the Dataset into Subsets (Based on Feature Values):
The dataset is divided into smaller subsets based on the unique values of a chosen feature. Each subset should ideally represent a more homogeneous group with reduced uncertainty.

Selecting best attribute using Information Gain (IG) (Contd..)

Calculate Entropy of Subsets (After Split): For each subset, calculate its entropy to measure how uncertain or disordered it is. Combine these values into a weighted sum based on the size of each subset relative to the entire dataset. This gives the total entropy after the split.
Calculate Information Gain: Information Gain is the difference between the entropy of the dataset before the split and the total entropy after the split. It indicates how much uncertainty has been reduced by dividing the data using the feature. A higher Information Gain suggests the feature is more effective at creating subsets with distinct classes. |

How it Works

Calculate Entropy of Dataset D Before Split: \[ Entropy (D) \].
Split the Dataset into subsets based on values of feature (F): \[ {D_1, D_2, ..., Dm} \]
Calculate Entropy of each subset after split and combine to find total entropy : \[ Entropy_F = \sum_{i=1}^m \frac{|D_i|}{|D|} Entropy(D_i) = \frac{|D_1|}{|D|}Entropy (D_1) + \frac{|D_2|}{|D|}Entropy (D_2) + ... \frac{|D_m|}{|D|}Entropy (D_m) \]
Information Gain: Subtract the total entropy after the split from the entropy before the split: \[ \text{IG} = Entropy(D) - Entropy(D_F) \]

Example: Calculating the IG

Consider a hypothetical dataset D as follows:

Step 1: Compute the Entropy of Dataset (D)

Class Variable: Play Tennis

Value of Play Tennis: 2 (Yes and No)

Let

\[ D_1 = \text{Records in dataset with Play Tennis = Yes} \]

\[ |D_1| = 9 \]

\[ D_2 = \text{Records in dataset with Play Tennis = No} \]

\[ |D_2| = 5 \]

\[ Entropy (D) = -\frac{|9|}{|14|} log_2 (\frac{|9|}{|14|}) - \frac{|5|}{|14|} log_2 (\frac{|5|}{|14|}) = 0.94 \]

Step 2: Compute IG for each feature

Lets start with Outlook

Outlook has 3 possible values: Rainy, Overcast, Sunny

\[ D_1 = \text{Subset of dataset with Outlook = Rainy} \]

\[ |D_1| = 5 \]

\[ D_2 = \text{Subset of dataset with Outlook = Sunny} \]

\[ |D_2| = 5 \]

\[ D_3 = \text{Subset of dataset with Outlook = Overcast} \]

\[ |D_3| = 4 \]

Step 2: Compute the IG for each feature (Contd..)

\[ IG(Outlook) = Entropy(D) - Entropy (D_{Outlook}) \] \[ IG(Outlook) = Entropy(D) - (\frac{|D_1|}{|D|}Entropy (D_1) + \frac{|D_2|}{|D|}Entropy (D_2) \frac{|D_3|}{|D|}Entropy (D_3)) \]

\[ IG(Outlook) = 0.94 - (\frac{|5|}{|14|}Entropy (D_1) + \frac{|5|}{|14|}Entropy (D_2) \frac{|4|}{|14|}Entropy (D_3)) \] \[ IG(Outlook) = 0.94 - (\frac{|5|}{|14|}(0.972) + \frac{|5|}{|14|}(0.972) \frac{|4|}{|14|}(0)) = 0.24 \]

Step 2: Compute the IG for each feature (Contd..)

\[ D_1 \text{ has class distribution of Play Tennis as 2:3} \]

\[ Entropy(D_1) = -\frac{|2|}{|5|} log_2 (\frac{|2|}{|5|}) - \frac{|3|}{|5|} log_2 (\frac{|3|}{|5|}) = 0.97 \]

\[ D_2 \text{ has class distribution of Play Tennis as 3:2} \]

\[ Entropy(D_2) = -\frac{|2|}{|5|} log_2 (\frac{|2|}{|5|}) - \frac{|3|}{|5|} log_2 (\frac{|3|}{|5|}) = 0.97 \]

\[ D_3 \text{ has class distribution of Play Tennis as 4:0} \]

\[ Entropy(D_3) = -\frac{|4|}{|4|} log_2 (\frac{|4|}{|4|}) - \frac{|9|}{|4|} log_2 (\frac{|0|}{|4|}) = 0 \]

Step 2: Compute the IG for each feature (Contd..)

Repeat for “Temp”: Temp has 3 possible values: Hot, Mild, Cold

\[ D_1 = \text{Subset of dataset with Temp = Hot} \]

\[ |D_1| = 4 \]

\[ D_2 = \text{Subset of dataset with Temp = Mild} \]

\[ |D_2| = 6 \]

\[ D_3 = \text{Subset of dataset with Temp = Cold} \]

\[ |D_3| = 4 \]

Step 2: Compute the IG for each feature (Contd..)

\[ IG(Temp) = Entropy(D) - Entropy (D_{Temp}) \] \[ IG(Temp) = Entropy(D) - (\frac{|D_1|}{|D|}Entropy (D_1) + \frac{|D_2|}{|D|}Entropy (D_2) \frac{|D_3|}{|D|}Entropy (D_3)) \]

\[ IG(Temp) = 0.94 - (\frac{|4|}{|14|}Entropy (D_1) + \frac{|6|}{|14|}Entropy (D_2) \frac{|4|}{|14|}Entropy (D_3)) \] \[ IG(Temp) = 0.94 - (\frac{|4|}{|14|}(1) + \frac{|6|}{|14|}(0.92) \frac{|4|}{|14|}(0.82)) = 0.03 \]

Step 2: Compute the IG for each feature (Contd..)

\[ D_1 \text{ has class distribution of Play Tennis as 2:2} \]

\[ Entropy(D_1) = -\frac{|2|}{|4|} log_2 (\frac{|2|}{|4|}) - \frac{|2|}{|4|} log_2 (\frac{|2|}{|4|}) = 1 \]

\[ D_2 \text{ has class distribution of Play Tennis as 4:2} \]

\[ Entropy(D_2) = -\frac{|4|}{|6|} log_2 (\frac{|4|}{|6|}) - \frac{|2|}{|6|} log_2 (\frac{|2|}{|6|}) = 0.92 \]

\[ D_3 \text{ has class distribution of Play Tennis as 3:1} \]

\[ Entropy(D_3) = -\frac{|3|}{|4|} log_2 (\frac{|3|}{|4|}) - \frac{|1|}{|4|} log_2 (\frac{|1|}{|4|}) = 0.82 \]

Step 2: Compute IG for each feature (Contd..)

Same Procedure to be repeated for Humidity and Windy.

\[ IG(D_{Windy}) = 0.043 \]

\[ IG(D_{Humidity}) = 0.145 \]
Comparing the IG for all attributes reveal that IG (Outlook) has the highest value.

Therefore, Outlook is selected as the Root Node

Step 3: Based on Root Node, Divide the dataset

Step 3: Based on Root Node, Divide the dataset (Contd..)

Here we can see that, all records for Outlook = Overcast has same value of Play Golf.

Therefore, we do not further need to go deeper in this node and assign the Target value of Yes

Step 4: Repeating Previous Steps again

Pick remaining branches of our tree one by one (which are not pure) and repeat the procedure of selecting the best attribute.
The procedure will continue till the stopping condition.
Stopping Condition:
- The subset of dataset newly created is pure (i.e. all records belong to same class)
- There are no more attributes on the basis of which dataset can be further split

Final Decision Tree

Stopping Conditions for Decision Tree

The stopping criteria for building decision trees are the conditions under which the tree’s growth is terminated. Here are the common stopping criteria used to prevent overfitting and ensure efficiency:

1. Pure Nodes

A node is not split further if all samples in the node belong to the same class (i.e., it is “pure”).

2. No Remaining Features

If no features remain to split the data, the tree cannot grow further.

3. Maximum Tree Depth Reached

The tree stops growing when it reaches a predefined maximum depth.
Example: Limit the tree to a maximum of 3 or 5 levels.

Stopping Conditions for Decision Tree (Contd..)

4. Minimum Number of Samples per Node

The tree stops splitting a node if the number of samples in that node falls below a specified threshold.
Example: Do not split if a node contains fewer than 10 samples.

5. Minimum Information Gain

Stop splitting if the improvement in Information Gain (or another splitting metric like Gini Impurity) is below a defined threshold.
Example: Stop when the Information Gain from a split is less than 0.01.

6. Early Stopping via Validation Data

Use a validation set to monitor the tree’s performance. Stop growing the tree when adding splits no longer improves performance on the validation set.

Stopping Conditions for Decision Tree (Contd..)

7. Maximum Number of Nodes or Splits

Limit the total number of nodes or splits in the tree to control complexity.

8. Predefined Time or Resource Limits

Stop training if the process exceeds a given time or computational resource limit.

These criteria are often used individually or in combination to create a balanced and efficient decision tree. They prevent overfitting and ensure the tree generalizes well to unseen data.

Computing Entropy using Gini Index

The Gini Index (or Gini Impurity) is another metric used to evaluate splits in a decision tree, often as an alternative to Information Gain.

It measures how “pure” or homogeneous a node is.

The goal is to minimize the Gini Index after a split, leading to nodes that contain predominantly one class.

What is the Gini Index?

The Gini Index calculates impurity as follows:

\(Gini= 1 - \sum_{i=1}^{m} p_i^2\)

\(m\): Number of classes in the dataset.
\(p_i\): Proportion of samples belonging to class \(i\) in the node.
\(Gini = 0\): The node is pure (all samples belong to one class).
Higher Gini: The node is more impure (samples are evenly distributed among classes).

Steps to use Gini Index

Example

Problems with Decision Trees - Overfitting

Overfitting occurs when a decision tree model is too complex and learns too much from training data, making it less accurate when predicting new data

A decision tree can become overfit by creating too many branches or leaf nodes, which can lead to rules that only apply to the training set. This can cause the model to memorize noise in the training data and fail to capture important patterns.

Consequences

Overfitting can lead to poor performance, inaccurate predictions, and reduced reliability.

Prevention of Over-fitting

To limit overfitting, you can try these techniques:

Pre-pruning: Generate a tree with fewer branches than it would otherwise have
Post-pruning: Generate a full tree and then remove parts of it
Cross-validation: Split the data into training and validation sets multiple times to evaluate the model’s performance
Apply multiple criteria: Apply a minimum number of examples per leaf or a maximum depth to limit the tree’s growth

Handling Continuous Attributes

Imagine we have a dataset with features like age, glucose levels, blood pressure, etc., and a target variable indicating whether a person has diabetes.
Splitting Criterion: The DT algorithm will use a criterion like Information Gain to determine the best split. For instance, it might find that splitting the age at 45 years minimizes impurity the most.
Challenge: Unlike categorical attributes (e.g., Yes/No), continuous attributes can take infinite possible values.

Handling Continuous Attributes (Contd..)

How Thresholds Are Chosen:

Sort values of the continuous attribute (e.g., Age: [22, 25, 30, 35, 40]).
For each possible split (midpoints like 27.5, 32.5), compute:
- Entropy or Gini Impurity.
- Information Gain from the split.
Select the threshold that minimizes uncertainty.
Splitting point is considered at each data point in the dataset. If the data points are unique, there will be n-1 split points for n data points.
So depending on which splitting variable and splitting point, we get high information gain, that split is selected.
If it is a large dataset, it is common to consider only split points at certain percentiles like (10%,20%,30%)of the distribution of values

Handling Continuous Attributes (Contd..)

Select the attribute Age as the splitting variable and and after considering various split points, ≤45 is best splitting point.

Total Value of Entropy is 0.92

Selecting attribute BMI as the splitting variable and after considering various split points, ≤30 is best splitting point

Total Value of Entropy is 0.811

Classification and Regression Tree: CART

Let’s do a simple example for regression.

Predicting a Continuous Variable

We aim to predict house prices based on two continuous features:

Size (in square feet)
Number of Bedrooms

Step-1: Calculate Variance before splitting

Step 2: Evaluate Splits

Evaluate Splits (Contd..)

Step 3 & Step 4

Key Takeaways:

CART for regression minimizes the variance within nodes to create splits.
Splitting stops when variance cannot be significantly reduced or a predefined criterion (e.g., max depth) is met.

Random Forest

Majority Voting

A randomly chosen hyperplane has an expected error of 0.5.
Many random hyperplanes combined by majority vote will still be random.
Suppose we have 𝑚 classifiers, performing slightly better than random, that is 𝑒𝑟𝑟𝑜𝑟=0.5−∈
Combine: make a decision on majority vote?
- What if we combined these 𝑚 slightly-better-than-random classifiers? Would majority vote be a good choice?

Majority Voting (Contd..)

Condorcet’s Jury Theorem

Analysis to the Probability of Majority Decisions – 1785

Assumptions:

Each individual makes the right choice with a probability 𝑝
The votes are independent

If 𝑝 > 0.5, then adding more voters increases the probability that the majority decision is correct
if 𝑝<0.5, then adding more voters makes things worse.

Ensemble Methods

An Ensemble Method combines the predictions of many individual classifiers by majority voting.

Such individual classifiers, called weak learners, are required to perform slightly better than random.

How do we produce independent weak learners using the same training data?

Use a strategy to obtain relatively independent weak learners!

Different methods:

Boosting: Reweight training data
Bagging: Resample training data
Random Forests: Bagging + feature randomness

What is a Random Forest

Random Forest is an ensemble learning method that builds multiple decision trees and combines their results.
Works for both classification and regression tasks.
Reduces overfitting and improves generalization compared to single decision trees.

How does a Random Forest Work?

Bootstrap Sampling: Randomly selects subsets of data for each tree.
Random Feature Selection: Each tree splits based on a random subset of features.
Aggregation:
- Classification: Majority vote across trees.
- Regression: Average of predictions.

Why Use Random Forests?

Advantages:
1. Reduces overfitting by averaging multiple trees.
2. Handles missing data well.
3. Can model non-linear relationships effectively.
Limitations:
1. Computationally intensive for large datasets.
2. May lose interpretability compared to single decision trees.

Hyper-parameters in Random Forests

Key Parameters to Tune:
1. Number of Trees (n_estimators): More trees improve performance but increase computation.
2. Max Features (max_features): Controls how many features each tree uses.
3. Max Depth (max_depth): Prevents trees from becoming too deep.
4. Minimum Samples Split (min_samples_split): Controls minimum samples required to split a node.

Code Snippets (Trial)

# Load the necessary packages
install.packages("rpart")
install.packages("rpart.plot")
library(rpart)
library(rpart.plot)

# Load the mtcars dataset
data(mtcars)

# Convert the 'am' variable to a factor (0 = automatic, 1 = manual)
mtcars$am <- factor(mtcars$am, levels = c(0, 1), labels = c("Automatic", "Manual"))

# Create a decision tree model with deeper levels
tree_model <- rpart(am ~ ., data = mtcars, method = "class",
                    control = rpart.control(cp = 0.001, maxdepth = 10))

# Plot the decision tree with rpart.plot
rpart.plot(tree_model, type = 3, extra = 101, fallen.leaves = TRUE)

# Install required packages if not already installed
if (!require(rpart)) install.packages("rpart")
if (!require(rpart.plot)) install.packages("rpart.plot")

# Load necessary libraries
library(rpart)
library(rpart.plot)

# Use the iris dataset
data(iris)

# Build the decision tree model
decision_tree <- rpart(Species ~ ., data = iris, method = "class")

# Visualize the decision tree
rpart.plot(
  decision_tree,
  type = 2,         # Display splits and labels
  extra = 104,      # Add class probabilities and percentages at each node
  under = TRUE,     # Show node numbers below
  main = "Decision Tree for Iris Dataset",  # Title of the plot
  box.palette = "RdYlGn",  # Use a red-yellow-green color palette
  shadow.col = "gray"      # Add shadows for a 3D effect
)

Code Snippets (Trial-2)

# Install required packages if not already installed
if (!require(titanic)) install.packages("titanic")
if (!require(rpart)) install.packages("rpart")
if (!require(rpart.plot)) install.packages("rpart.plot")

# Load necessary libraries
library(titanic)
library(rpart)
library(rpart.plot)

# Prepare the Titanic dataset
data <- titanic::titanic_train  # Load Titanic training dataset
data <- data[!is.na(data$Age), ]  # Remove rows with missing Age values
data$Survived <- factor(data$Survived, levels = c(0, 1), labels = c("No", "Yes"))
data$Pclass <- factor(data$Pclass, levels = c(1, 2, 3), labels = c("First", "Second", "Third"))

# Build a decision tree model
decision_tree <- rpart(
  Survived ~ Pclass + Sex + Age + Embarked,
  data = data,
  method = "class"
)

# Visualize the decision tree
rpart.plot(
  decision_tree,
  type = 2,         # Display splits with labels
  extra = 104,      # Add class probabilities and percentages at each node
  under = TRUE,     # Show node numbers below
  main = "Titanic Survival Prediction",  # Title of the plot
  box.palette = "RdYlGn",  # Use a red-yellow-green color palette
  shadow.col = "gray"      # Add shadows for a 3D effect
)