Statistical Techniques - An Introduction to Decision Trees

October 26, 2025

What is a Decision Tree?

A decision tree is a type of machine learning model, and one of the most intuitive models in statistics.

You can think of a decision tree as a flowchart-style diagram used to visualize decisions and the potential outcomes, based on a set of “if-then” questions.

The model starts with a root node, which represents the entire dataset being used
At each step, the data is split into two branches, based on a yes/no question about one feature
These splits create internal nodes (the questions) and branches (the answers to those questions)
The process ends at leaf nodes (the final predictions)
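To see these pieces in a fitted tree, here is a minimal sketch using the 'rpart' package, which this deck relies on later. The object names `fit` and `n_leaves` are illustrative only.

```r
library(rpart)  # a recommended package included with standard R installations

# Fit a classification tree predicting Species from the four measurements
fit <- rpart(Species ~ ., data = iris, method = "class")

# Each printed row is a node: node 1 is the root (all 150 rows),
# and rows ending in "*" are leaf nodes
print(fit)

# fit$frame holds one row per node; var == "<leaf>" marks the leaves
n_leaves <- sum(fit$frame$var == "<leaf>")
n_leaves
```

Reading the printed output from top to bottom follows the same path a new flower would take through the flowchart.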

Sample Dataset

To better explain the concept of decision trees, we will use the ‘iris’ dataset already built into R.

This dataset contains 5 variables (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species) and 150 observations
Goal: Predict the ‘Species’ of an Iris flower, given the other features listed
Outcome: The relevant ‘Species’ (setosa, versicolor, or virginica)
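A quick way to confirm the shape of the data before modeling is sketched below, using only base R functions.

```r
# iris is provided by the 'datasets' package that loads with base R
dim(iris)             # 150 rows, 5 columns
str(iris)             # four numeric measurements plus the Species factor
table(iris$Species)   # 50 observations for each of the three species
```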

Exploring the Data with ggplot (1)

How do Decision Trees “Decide”?

At each node, the model searches for the best possible split

For each feature (4, in this case), it evaluates every candidate split point
The “best” split is the one that results in the “purest” child nodes, meaning each contains mostly one class (ideally 100% one Species, such as ‘setosa’)
Purity is quantified with an impurity measure, most commonly Gini Impurity or Entropy
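The search described above can be sketched as a brute-force loop over features and thresholds. This is a simplified illustration, not how 'rpart' is implemented internally. The helpers `gini` and `best_split_for` are hypothetical names, and using midpoints between distinct values as candidate thresholds is an assumption made for this sketch.

```r
# Gini impurity of a vector of class labels
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Hypothetical helper: best threshold for one numeric feature
best_split_for <- function(x, y) {
  vals <- sort(unique(x))
  # Candidate thresholds: midpoints between consecutive distinct values
  cuts <- (head(vals, -1) + tail(vals, -1)) / 2
  scores <- sapply(cuts, function(cut) {
    left  <- y[x < cut]
    right <- y[x >= cut]
    # Weighted average impurity of the two child nodes
    (length(left) * gini(left) + length(right) * gini(right)) / length(y)
  })
  data.frame(threshold = cuts[which.min(scores)], impurity = min(scores))
}

features <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
results <- do.call(rbind, lapply(features, function(f) {
  data.frame(feature = f, best_split_for(iris[[f]], iris$Species))
}))

# Lowest weighted impurity first
results[order(results$impurity), ]
```

On the full dataset, both petal features reach the same lowest impurity, because either one can separate 'setosa' from the other two species perfectly.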

Math: Gini Impurity with Latex (1)

Gini Impurity measures the probability of incorrectly classifying a randomly chosen element in a node, if it were randomly labeled (according to the class distribution in that node)

A Gini score of 0 is perfect purity (all one class).

The formula for a given node is:

\[ Gini = 1 - \sum_{i=1}^{C} (p_i)^2 \]

Where:
\(C\) is the total number of classes
\(p_i\) is the proportion of items in the node that belong to class \(i\)
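As a small worked example, the formula can be applied directly to the iris labels. The function name `gini` and the threshold 2.45 on Petal.Length are illustrative choices for this sketch.

```r
# Gini impurity: 1 minus the sum of squared class proportions
gini <- function(labels) {
  p <- table(labels) / length(labels)  # class proportions p_i
  1 - sum(p^2)
}

# Root node: three classes with 50 flowers each, so 1 - 3 * (1/3)^2 = 2/3
gini(iris$Species)

# Node holding only the short-petaled flowers: all setosa, so impurity is 0
gini(iris$Species[iris$Petal.Length < 2.45])
```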

Math: Entropy with Latex (2)

Entropy is another measure of impurity, and measures the amount of uncertainty or “disorder” in a node

An Entropy of 0 is perfect purity (all one class/Species). The model splits the data to achieve the largest Information Gain, which is the biggest drop in entropy from the parent node to the weighted average of its child nodes.

The formula for a given node is:

\[ Entropy (H) = - \sum_{i=1}^{C} p_i \log_{2}(p_i) \]

Where:
\(C\) is the total number of classes
\(p_i\) is the proportion of items in the node that belong to class \(i\)
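Entropy and Information Gain can be computed by hand for a single candidate split. In this sketch the function name `entropy` and the split at Petal.Length = 2.45 are illustrative assumptions.

```r
# Entropy in bits: -sum(p_i * log2(p_i)) over the classes present in the node
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]  # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

parent <- iris$Species
left   <- parent[iris$Petal.Length < 2.45]
right  <- parent[iris$Petal.Length >= 2.45]

# Information Gain = parent entropy minus the weighted entropy of the children
weighted_children <- (length(left) * entropy(left) +
                      length(right) * entropy(right)) / length(parent)
info_gain <- entropy(parent) - weighted_children

entropy(parent)  # log2(3), about 1.585 bits for three equally sized classes
info_gain        # log2(3) - 2/3, about 0.918 bits
```

The left child is pure (entropy 0) and the right child is an even mix of two species (entropy 1), which is where the 2/3 in the weighted average comes from.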

Visualizing a Decision Tree with ggplot (2)

Here, we overlay the decision boundaries from the ‘rpart’ model onto the same scatter plot from earlier. Note how the decision tree draws boxes to categorize the data.

Sample Code in R

The code below is what we used to visualize the decision tree shown on the previous slide. It uses the ‘ggplot2’, ‘rpart’, and ‘parttree’ packages.

# 1. Load the required libraries
library(ggplot2)
library(rpart)
library(parttree)

# 2. Build the decision tree model on the two petal features
tree_model <- rpart(Species ~ Petal.Length + Petal.Width, data = iris)

# 3. Plot the data and the tree boundaries
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point(alpha = 0.7, size = 3) +

  # This function plots the partitions from 'tree_model'
  geom_parttree(data = tree_model, aes(fill = Species), alpha = 0.1) +

  theme_minimal(base_size = 14) +
  labs(title = "Decision Tree Boundaries")
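The boxes in the plot correspond to split thresholds stored in the fitted model. A minimal sketch for inspecting them and classifying a new observation follows. The flower in `new_flower` is a made-up example, and `pred` and `train_acc` are illustrative names.

```r
library(rpart)

# Same two-feature model as in the plotting code
tree_model <- rpart(Species ~ Petal.Length + Petal.Width, data = iris)

# Text view of the tree: each split threshold is an edge of a box in the plot
print(tree_model)

# Classify a hypothetical new flower (values chosen for illustration)
new_flower <- data.frame(Petal.Length = 1.4, Petal.Width = 0.2)
pred <- predict(tree_model, newdata = new_flower, type = "class")
pred

# Share of the training rows the tree labels correctly
train_acc <- mean(predict(tree_model, type = "class") == iris$Species)
train_acc
```

Accuracy measured on the training data tends to be optimistic, so a held-out test set would give a fairer estimate.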

Exploring in 3D with plotly

2D plots work well, but an interactive 3D plot shows three features at once, which can make the separation between classes easier to see. This is useful for datasets with more than two features, such as ours.
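A plot of this kind could be built roughly as follows with the 'plotly' package. The choice of the three features on the axes and the marker size are assumptions made for this sketch, not necessarily the settings used for the original figure.

```r
library(plotly)

# Interactive 3D scatter plot: one point per flower, colored by species
fig <- plot_ly(
  iris,
  x = ~Sepal.Length, y = ~Petal.Length, z = ~Petal.Width,
  color = ~Species,
  type = "scatter3d", mode = "markers",
  marker = list(size = 4)
)

fig  # renders the widget in RStudio, a notebook, or an HTML document
```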

Try it! Click and drag the plot.