October 26, 2025
A decision tree is a type of machine learning model, and one of the most intuitive models in statistics.
You can think of a decision tree as a flowchart-style diagram used to visualize decisions and the potential outcomes, based on a set of “if-then” questions.
To better explain the concept of decision trees, we will use the ‘iris’ dataset already built into R.
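Before building any tree, it can help to look at what the data contains. The snippet below is a minimal sketch that uses only base R; the variable name `class_counts` is introduced here for illustration.

```r
# iris ships with base R (the 'datasets' package), so nothing needs installing
data(iris)

# Structure of the data: four numeric measurements plus the Species label
str(iris)

# Number of flowers in each class
class_counts <- table(iris$Species)
print(class_counts)
```

The `Species` column is the class label the tree will try to predict from the measurements.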
At each node, the model searches for the best possible split: the feature and threshold that leave the resulting child nodes as pure as possible.
Gini impurity measures the probability of misclassifying a randomly chosen element in a node if it were labeled at random according to the class distribution in that node.
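The search for a split can be sketched as a brute-force loop over candidate thresholds for one feature. This is an illustrative simplification, not how any particular package implements it: the helper names `gini` and `split_impurity` are hypothetical, only `Petal.Length` is searched, and Gini impurity is used as the purity score.

```r
# Gini impurity of a vector of class labels
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Size-weighted impurity of the two children produced by `feature < threshold`
split_impurity <- function(feature, labels, threshold) {
  left  <- labels[feature <  threshold]
  right <- labels[feature >= threshold]
  (length(left) * gini(left) + length(right) * gini(right)) / length(labels)
}

# Candidate thresholds: midpoints between consecutive distinct values
values     <- sort(unique(iris$Petal.Length))
thresholds <- (head(values, -1) + tail(values, -1)) / 2

# Score every candidate threshold on the full data (the root node)
scores <- sapply(thresholds, split_impurity,
                 feature = iris$Petal.Length, labels = iris$Species)

# The best split is the one with the lowest weighted impurity
best_threshold <- thresholds[which.min(scores)]
best_threshold
```

On the iris data this picks a threshold of 2.45 for `Petal.Length`, which separates all setosa flowers from the other two species. A full tree builder would repeat the same search over every feature and then recurse into each child node.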
A Gini score of 0 is perfect purity (all one class).
The formula for a given node is:
\[ Gini = 1 - \sum_{i=1}^{C} (p_i)^2 \]
Where:
* \(C\) is the total number of classes
* \(p_i\) is the proportion of items in the node that belong to class \(i\)
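As a quick worked example, take the root node of the iris data, assuming all 150 flowers are present with 50 of each of the 3 species, so every \(p_i = 1/3\):

\[ Gini_{root} = 1 - 3 \left(\tfrac{1}{3}\right)^2 = \tfrac{2}{3} \approx 0.667 \]

For comparison, a node holding 50 flowers of one species and 50 of another gives

\[ Gini = 1 - \left(0.5^2 + 0.5^2\right) = 0.5 \]

and a node containing a single species gives \(1 - 1^2 = 0\).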
Entropy is another measure of impurity: it quantifies the amount of uncertainty or “disorder” in a node.
An entropy of 0 is perfect purity (all one class/Species). The model splits the data to achieve the largest information gain (the biggest drop in entropy from the parent node to its children).
The formula for a given node is:
\[ Entropy (H) = - \sum_{i=1}^{C} p_i \log_{2}(p_i) \]
Where:
* \(C\) is the total number of classes
* \(p_i\) is the proportion of items in the node that belong to class \(i\)
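The entropy formula and the idea of information gain can be checked by hand in a few lines of base R. This is a rough sketch: the helper name `entropy` is hypothetical, and the split `Petal.Length < 2.45` is assumed here purely as an example of a candidate split.

```r
# Shannon entropy (in bits) of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]  # drop empty classes: 0 * log2(0) is treated as 0
  -sum(p * log2(p))
}

# Entropy of the root node: three equally common species, so log2(3)
root_entropy <- entropy(iris$Species)

# Children of the example split Petal.Length < 2.45
left  <- iris$Species[iris$Petal.Length <  2.45]
right <- iris$Species[iris$Petal.Length >= 2.45]

# Size-weighted average entropy of the two children
child_entropy <- (length(left)  * entropy(left) +
                  length(right) * entropy(right)) / nrow(iris)

# Information gain = parent entropy minus weighted child entropy
info_gain <- root_entropy - child_entropy
info_gain
```

Here the left child contains only setosa (entropy 0) and the right child is an even mix of the other two species (entropy 1 bit), so the gain is \(\log_2 3 - 2/3\), roughly 0.918 bits.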
Here, we overlay the decision boundaries from the ‘rpart’ model on the same scatter plot from earlier. Note how the decision tree draws boxes to categorize the data.
The code below is what we used to visualize the decision tree shown on the previous slide. It uses the ‘ggplot2’ and ‘parttree’ packages.
    # 1. Load the required libraries (done in setup)
    # library(ggplot2)
    # library(rpart)
    # library(parttree)

    # 2. Build the decision tree model (done in setup)
    # tree_model <- rpart(Species ~ Petal.Length + Petal.Width, data = iris)

    # 3. Plot the data and the tree boundaries
    ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
      geom_point(alpha = 0.7, size = 3) +
      # This function plots the partitions from the 'tree_model'
      geom_parttree(data = tree_model, aes(fill = Species), alpha = 0.1) +
      theme_minimal(base_size = 14) +
      labs(title = "Decision Tree Boundaries")
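To see the numbers behind the boxes, the fitted model can be printed and checked against the true labels. The sketch below refits the same two-feature model with default `rpart` settings; the names `predicted` and `confusion` are introduced here for illustration, and the accuracy it reports is on the training data only, so it is an optimistic estimate.

```r
library(rpart)

# Same model as in the plotting code: two petal measurements predict Species
tree_model <- rpart(Species ~ Petal.Length + Petal.Width, data = iris)

# Each printed split corresponds to one straight boundary line in the plot
print(tree_model)

# Compare predicted classes with the true species on the training data
predicted <- predict(tree_model, newdata = iris, type = "class")
confusion <- table(actual = iris$Species, predicted = predicted)
print(confusion)

# Share of flowers classified correctly (diagonal of the confusion table)
accuracy <- sum(diag(confusion)) / sum(confusion)
accuracy
```

Because every split tests a single feature against a threshold, the boundaries are always parallel to the axes, which is why the regions appear as rectangles.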
2D plots work well, but a 3D interactive plot can help us better visualize the separation between classes. This is useful for datasets with more than two features, such as ours.
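One way such a plot could be produced is sketched below. Both the ‘plotly’ package and the choice of `Sepal.Length` as the third axis are assumptions made for illustration; the slides do not state which package or features were used.

```r
# Assumption: the interactive plot is drawn with the 'plotly' package.
# The guard lets the script run even where plotly is not installed.
if (requireNamespace("plotly", quietly = TRUE)) {
  fig <- plotly::plot_ly(
    data  = iris,
    x     = ~Petal.Length,
    y     = ~Petal.Width,
    z     = ~Sepal.Length,  # hypothetical choice of third feature
    color = ~Species,
    type  = "scatter3d",
    mode  = "markers"
  )

  # Only open the viewer in an interactive session
  if (interactive()) print(fig)
}
```

Rotating the resulting figure shows whether classes that overlap in two dimensions separate more cleanly once a third feature is added.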
**Try it! Click and drag the plot.**