Gini index: \(G = 1 - p^2 - (1 - p)^2 = 2p(1 - p)\)
Classification error: \(E = 1 - \max(p,\ 1 - p)\)
Entropy: \(D = -p \log_2(p) - (1 - p) \log_2(1 - p)\)
These three quantities measure node impurity in classification trees, where \(p\) is the proportion of observations in the node that belong to class 1. All of them equal 0 when the node is pure (\(p = 0\) or \(p = 1\)) and reach their maximum at \(p = 0.5\).
# Generate values of p (probability of class 1) from 0 to 1
p <- seq(0, 1, length.out = 100)
# Compute Gini Index
gini <- 2 * p * (1 - p)
# Compute Classification Error
class_error <- 1 - pmax(p, 1 - p)
# Compute Entropy (with fix for log(0))
entropy <- -p * log2(p) - (1 - p) * log2(1 - p)
entropy[is.nan(entropy)] <- 0 # Replace NaNs with 0 (at p = 0 or 1)
plot(p, gini, type = "l", col = "blue", lwd = 2,
ylim = c(0, 1), xlab = expression(hat(p)[m1]),
ylab = "Impurity Measure", main = "Impurity Measures vs Class Probability")
lines(p, class_error, col = "red", lwd = 2, lty = 2)
lines(p, entropy, col = "darkgreen", lwd = 2, lty = 3)
legend("topright", legend = c("Gini Index", "Classification Error", "Entropy"),
col = c("blue", "red", "darkgreen"), lty = c(1, 2, 3), lwd = 2)
- All three measures peak at p = 0.5, the point of maximum uncertainty.
- Entropy is the most sensitive to changes in node purity, the Gini index is intermediate, and classification error (which is piecewise linear) is the least sensitive.
- The Gini index is the default criterion in CART, entropy (information gain) drives ID3/C4.5, and classification error is often preferred when pruning.
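As a quick numerical check of the sensitivity ordering, the three measures can be evaluated at a single illustrative value; the choice of p = 0.25 below is arbitrary.
# Compare the three measures at an illustrative value p = 0.25
p0 <- 0.25
c(gini = 2 * p0 * (1 - p0),                              # 0.375
  class_error = 1 - max(p0, 1 - p0),                     # 0.25
  entropy = -p0 * log2(p0) - (1 - p0) * log2(1 - p0))    # ~0.811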
The left panel of Figure 8.14 shows how the 2D space of predictors X1 and X2 is split into five rectangular regions. Each box displays the average value of the response variable Y in that region.
We reverse-engineer this to construct the following tree:
                X1 < 1
               /      \
         X2 < 0        Region (Y = 5)
        /      \
Region (Y = 15) X1 < 0
               /      \
     Region (Y = 3)    Region (Y = 10)
- If X1 ≥ 1: predict 5
- If X1 < 1 and X2 < 0: predict 15
- If X1 < 1, X2 ≥ 0, and X1 < 0: predict 3
- If X1 < 1, X2 ≥ 0, and X1 ≥ 0: predict 10
This is a regression tree where splits are made recursively on X1 and X2 and terminal nodes represent the mean Y values.
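As a sanity check, these rules can be encoded directly as a small R function; the name predict_left_tree is just an illustrative choice, not something from the text.
# Hypothetical helper that encodes the reverse-engineered tree above
predict_left_tree <- function(X1, X2) {
  if (X1 >= 1) return(5)   # right branch of the root split
  if (X2 < 0)  return(15)  # X1 < 1 and X2 < 0
  if (X1 < 0)  return(3)   # X1 < 1, X2 >= 0, X1 < 0
  10                       # X1 < 1, X2 >= 0, X1 >= 0
}
predict_left_tree(-0.5, 0.5)  # returns 3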
The right panel of Figure 8.14 displays a regression tree. To recreate the corresponding partitioning of the predictor space, we follow the tree’s logic:
            X2 < 1
           /      \
      X1 < 1       X2 < 2
     /      \     /      \
 -1.80     0.63  -1.06    X1 < 0
                         /      \
                      0.21      2.49
Each terminal node represents the average value of Y in a subregion of the predictor space.
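For reference, the tree's decision rules can also be written out directly as an R function before any model fitting; predict_right_tree is an illustrative name, not from the text.
# Hypothetical helper encoding the right-panel tree's rules
predict_right_tree <- function(X1, X2) {
  if (X2 < 1) {
    if (X1 < 1) return(-1.80) else return(0.63)
  }
  if (X2 < 2) return(-1.06)
  if (X1 < 0) return(0.21) else return(2.49)
}
predict_right_tree(2, 2.5)  # returns 2.49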
# Load required packages
library(rpart)
library(rpart.plot)
# Simulated data reflecting the right panel's splits and mean values;
# each point lies in one of the five leaf regions of the tree above
set.seed(42)
data <- data.frame(
  X1 = c(-2, -1, 0, 1, 2, 1.5, -0.5, -1.5, 2.5),
  X2 = c(0.5, 0.3, 1.5, 0.7, 2.2, 2.5, 2.3, 2.1, 2.8),
  Y  = c(-1.8, -1.8, -1.06, 0.63, 2.49, 2.49, 0.21, 0.21, 2.49)
)
# Fit regression tree (relax rpart's defaults so it will split this tiny data set)
tree_model <- rpart(Y ~ X1 + X2, data = data, method = "anova",
                    control = rpart.control(minsplit = 2, minbucket = 1, cp = 0))
# Plot the tree
rpart.plot(tree_model, type = 4, fallen.leaves = TRUE,
main = "Regression Tree (Right Panel of Fig 8.14)")
To visualize how the tree divides the predictor space (like the left panel):
# Draw a custom partition plot
plot(NA, xlim = c(-2, 3), ylim = c(-1, 3), xlab = "X1", ylab = "X2",
main = "Partition Based on Tree Structure", asp = 1)
# Add horizontal splits (X2 = 1 and X2 = 2)
abline(h = 1, col = "blue", lty = 2)
abline(h = 2, col = "blue", lty = 2)
# Add vertical splits: X1 = 1 applies only where X2 < 1, and X1 = 0 only where X2 >= 2
segments(1, -1, 1, 1, col = "red", lty = 2)
segments(0, 2, 0, 3, col = "red", lty = 2)
# Add region labels (mean of Y in each terminal node)
text(-1.5, 0.5, "-1.80")
text(1.5, 0.5, "0.63")
text(-1.5, 1.5, "-1.06")
text(-1, 2.5, "0.21")
text(2, 2.5, "2.49")
This diagram mimics the style of the left panel of Figure 8.14, but it is based on the splits of the right panel's tree.
Suppose we produce ten bootstrapped samples from a data set containing red and green classes. We then apply a classification tree to each bootstrapped sample and, for a specific value of \(X\), produce 10 estimates of \(P(\textrm{Class is Red} \mid X)\): \[0.1,\ 0.15,\ 0.2,\ 0.2,\ 0.55,\ 0.6,\ 0.6,\ 0.65,\ 0.7, \textrm{ and } 0.75.\]
There are two common ways to combine these results into a single class prediction.
One is the majority vote approach discussed in this chapter.
The second approach is to classify based on the average probability. In this example, what is the final classification under each of these two approaches?
# Probability estimates from the 10 trees
probs <- c(0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75)
# 1. Majority vote classification
votes_red <- sum(probs > 0.5)
votes_green <- sum(probs <= 0.5)
majority_class <- ifelse(votes_red > votes_green, "Red", "Green")
# 2. Average probability classification
avg_prob <- mean(probs)
avg_class <- ifelse(avg_prob > 0.5, "Red", "Green")
# Print results
cat("Majority Vote Classification:", majority_class, "\n")
## Majority Vote Classification: Red
cat("Average Probability Classification:", avg_class, "\n")
## Average Probability Classification: Green
cat("Average Probability Value:", avg_prob, "\n")
## Average Probability Value: 0.45
The two approaches disagree here: six of the ten trees assign a probability above 0.5, so the majority vote classifies the observation as Red, while the four small estimates pull the average probability down to 0.45, so the average-probability approach classifies it as Green.
A regression tree recursively partitions the predictor space to minimize the residual sum of squares (RSS). Here’s how the algorithm works:
1. Start with all training data in the root node.
2. For each predictor \(X_j\) and each potential split point \(s\), define the regions \(R_1(j, s) = \{X \mid X_j < s\}\) and \(R_2(j, s) = \{X \mid X_j \geq s\}\).
3. Choose the split that minimizes the RSS (a minimal split-search sketch appears after this list): \[ RSS = \sum_{i \in R_1}(y_i - \bar{y}_{R_1})^2 + \sum_{i \in R_2}(y_i - \bar{y}_{R_2})^2 \]
4. Split at the best \((j, s)\) and repeat the search recursively within each resulting region.
5. Stop when a stopping rule is reached (e.g., a minimum node size).
6. The prediction for any input is the mean \(\bar{y}\) of the training responses in its terminal node.
7. Prune the full tree using cost-complexity pruning: \[ C_\alpha(T) = \text{Total RSS} + \alpha \cdot |T|, \] where \(|T|\) is the number of terminal nodes and \(\alpha\) controls the trade-off between fit and tree size.
This method builds an interpretable model while minimizing variance within each region.
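Below is a minimal sketch of the exhaustive split search in steps 2-3, assuming numeric predictors stored in a data frame; best_split is an illustrative name and is not part of rpart.
# Greedy search for the single (predictor, cutpoint) pair that minimizes RSS
best_split <- function(X, y) {
  best <- list(rss = Inf, var = NA, s = NA)
  for (j in seq_along(X)) {
    xj <- X[[j]]
    for (s in sort(unique(xj))) {
      left  <- y[xj < s]
      right <- y[xj >= s]
      if (length(left) == 0 || length(right) == 0) next  # skip degenerate splits
      rss <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
      if (rss < best$rss) best <- list(rss = rss, var = names(X)[j], s = s)
    }
  }
  best
}
# Applied to the small simulated data set from the previous exercise
best_split(data[, c("X1", "X2")], data$Y)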