In a binary classification setting, let \(\hat{p}_{m1}\) denote the proportion of training observations in region \(m\) that belong to class 1, and let \(\hat{p}_{m2} = 1 - \hat{p}_{m1}\) be the proportion belonging to class 2.
The three impurity measures are the Gini index \(G\), the classification error \(E\), and the entropy (cross-entropy) \(D\).
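Written out for the two-class case, and matching the R code below (which uses a base-2 logarithm for the entropy), the three measures are
\[
G = 2\,\hat{p}_{m1}\left(1 - \hat{p}_{m1}\right), \qquad
E = 1 - \max\!\left(\hat{p}_{m1},\, 1 - \hat{p}_{m1}\right), \qquad
D = -\hat{p}_{m1}\log_2 \hat{p}_{m1} - \left(1 - \hat{p}_{m1}\right)\log_2\!\left(1 - \hat{p}_{m1}\right).
\]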
# Sequence of probabilities for class 1
p <- seq(0, 1, length.out = 200)
p2 <- 1 - p
# Gini index
gini <- 2 * p * (1 - p)
# Classification error
class_error <- 1 - pmax(p, p2)
# Entropy (handle log(0) by replacing NaNs with 0)
entropy <- -(p * log2(p) + p2 * log2(p2))
entropy[is.nan(entropy)] <- 0 # define 0*log(0) as 0
# Plot
plot(p, gini, type = "l", col = "blue", ylim = c(0, 1), ylab = "Measure", xlab = expression(hat(p)[m1]), lwd = 2,
main = "Gini Index, Classification Error, and Entropy")
lines(p, class_error, col = "red", lwd = 2)
lines(p, entropy, col = "darkgreen", lwd = 2)
legend("top", legend = c("Gini Index", "Classification Error", "Entropy"),
col = c("blue", "red", "darkgreen"), lwd = 2)
The plot illustrates how the three impurity measures behave in a binary classification setting as \(\hat{p}_{m1}\) ranges from 0 to 1:
Entropy (green curve) reaches its maximum when \(\hat{p}_{m1} = 0.5\), meaning uncertainty is highest when the two classes are equally probable. It decreases symmetrically as one class becomes more dominant (\(\hat{p}_{m1}\) closer to 0 or 1), and it is the most sensitive of the three to changes in the class distribution near \(\hat{p}_{m1} = 0.5\).
The Gini index (blue curve) also peaks at \(\hat{p}_{m1} = 0.5\), reflecting maximum impurity when both classes are equally likely. Its shape is similar to that of entropy but slightly lower in magnitude and curvature, and it is computationally simpler, which is why it is the default in decision tree algorithms such as CART.
Classification error (red curve) is the least sensitive of the three. It is defined as \(1 - \max(\hat{p}_{m1}, \hat{p}_{m2})\), so it changes only linearly as one class becomes dominant, peaking at \(\hat{p}_{m1} = 0.5\) where the class probabilities are equal. It captures only the misclassification rate and does not reflect finer changes in the class distribution the way entropy and the Gini index do; the numerical example below illustrates this.
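As a concrete (made-up) illustration of this difference in sensitivity, consider a node containing 400 observations of each class and two candidate splits: split A produces children with counts (300, 100) and (100, 300), while split B produces (200, 400) and (200, 0). Both splits have the same weighted classification error, but the Gini index and entropy prefer split B because it creates a pure child node. A short sketch, using hypothetical helper functions not in the original code:
# Hypothetical example (not from the text): compare two candidate splits of a
# node containing 400 observations of class Red and 400 of class Green.
gini_fn    <- function(p) 2 * p * (1 - p)
entropy_fn <- function(p) ifelse(p %in% c(0, 1), 0, -(p * log2(p) + (1 - p) * log2(1 - p)))
error_fn   <- function(p) 1 - pmax(p, 1 - p)
# Weighted impurity of a split; n1 and n2 are (Red, Green) counts in the two children
weighted_impurity <- function(n1, n2, fn) {
  p1 <- n1[1] / sum(n1); p2 <- n2[1] / sum(n2)
  w1 <- sum(n1) / (sum(n1) + sum(n2))
  w1 * fn(p1) + (1 - w1) * fn(p2)
}
splitA <- list(c(300, 100), c(100, 300))  # children (300 Red, 100 Green) / (100 Red, 300 Green)
splitB <- list(c(200, 400), c(200, 0))    # children (200 Red, 400 Green) / (200 Red, 0 Green)
sapply(list(gini = gini_fn, entropy = entropy_fn, error = error_fn), function(fn)
  c(A = weighted_impurity(splitA[[1]], splitA[[2]], fn),
    B = weighted_impurity(splitB[[1]], splitB[[2]], fn)))
# The weighted classification error is 0.25 for both splits, while the Gini index
# and entropy are lower for split B, so only Gini and entropy distinguish the two.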
# Install the required packages if they are not already available
if (!require(rpart)) install.packages("rpart")
if (!require(rpart.plot)) install.packages("rpart.plot")
## Loading required package: rpart
library(rpart.plot)
library(rpart)
# Define a fake dataset to represent the regions and allow rpart to draw the tree
set.seed(42)
x1 <- runif(100, 0, 1)
x2 <- runif(100, 0, 1)
y <- ifelse(x2 > 0.5,
15,
ifelse(x1 < 0.25, 3,
ifelse(x1 < 0.5,
ifelse(x2 < 0.25, 10, 0), 5)))
df <- data.frame(x1 = x1, x2 = x2, y = y)
# Fit the model
tree_fit <- rpart(y ~ x1 + x2, data = df, control = rpart.control(cp = 0, minsplit = 2))
rpart.plot(tree_fit, main = "Tree for Left Panel Partition", extra = 1)
The left panel of Figure 8.14 shows a recursive binary partitioning of the predictor space defined by two features, X1 and X2. The fitted rpart tree reproduces the structure corresponding to this partition. The tree:
- First splits on X2 < 0.51, dividing the space into upper (X2 ≥
0.51) and lower (X2 < 0.51) regions.
- Then recursively splits based on values of X1 and X2 to isolate
smaller rectangular regions.
- Each terminal node of the tree shows the average response Y in that
region, matching the numbers in the partition diagram (0, 3, 5, 10,
15).
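As a quick check (an addition beyond the original output), printing the fitted rpart object lists each split threshold and the mean response in every terminal node, which can be compared against the values in the partition diagram:
# Print the split rules and terminal-node means of the fitted tree
print(tree_fit)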
# Create a partition plot manually for the right-panel tree's splits
plot(NA, xlim = c(-3, 3), ylim = c(-3, 3), xlab = "X1", ylab = "X2", main = "Partition for Right Panel Tree")
# First split: horizontal line at X2 = 1 across the whole range
abline(h = 1, col = "gray")
# Within X2 < 1: vertical split at X1 = 1
segments(1, -3, 1, 1, col = "gray")
# Within X2 >= 1: horizontal split at X2 = 2, then vertical split at X1 = 0 for 1 <= X2 < 2
abline(h = 2, col = "gray")
segments(0, 1, 0, 2, col = "gray")
# Region labels: mean predictions from the tree's terminal nodes
text(-1, -1, "-1.80", col = "blue", cex = 1.2)
text(2, -1, "0.63", col = "blue", cex = 1.2)
text(-1.5, 1.5, "-1.06", col = "blue", cex = 1.2)
text(1.5, 1.5, "0.21", col = "blue", cex = 1.2)
text(0, 2.5, "2.49", col = "blue", cex = 1.2)
The right panel of Figure 8.14 shows a regression tree, and the
corresponding partition has been manually recreated by drawing
horizontal and vertical lines at the tree’s split thresholds:
- The tree splits first at X2 < 1, creating a horizontal division of the space.
- Within the region X2 < 1 it then splits at X1 < 1; within the region X2 ≥ 1 it splits at X2 < 2 and then at X1 < 0, producing 5 distinct regions in total.
- The resulting partitioning of the X1-X2 space exactly mirrors the
decision boundaries in the tree.
- The labels in the plot (e.g., -1.80, 0.63, etc.) are the mean
predictions for Y in each region and match the terminal node values from
the regression tree.
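To make the correspondence explicit, the tree described above can be written as a small prediction function. This is a sketch assuming the split structure and node means listed above; predict_right_tree is a hypothetical helper, not part of the original code.
# Encode the right-panel tree as a nested set of splits
predict_right_tree <- function(x1, x2) {
  if (x2 < 1) {
    if (x1 < 1) -1.80 else 0.63
  } else if (x2 < 2) {
    if (x1 < 0) -1.06 else 0.21
  } else {
    2.49
  }
}
predict_right_tree(-2, 0)    # falls in the region with mean -1.80
predict_right_tree(0.5, 2.5) # falls in the region with mean 2.49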
We will analyze two common methods of aggregating these probability estimates: the majority vote and the average probability.
Under the majority vote approach, each model whose predicted probability exceeds 0.5 votes for Red and otherwise votes for Green; the class receiving the most votes across the ten models becomes the final prediction.
# Vector of estimated probabilities
prob_estimates <- c(0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75)
# Classify each estimate based on threshold 0.5
class_labels <- ifelse(prob_estimates > 0.5, "Red", "Green")
# Count how many votes for each class
table(class_labels)
## class_labels
## Green Red
## 4 6
# Determine majority class
majority_vote_class <- names(which.max(table(class_labels)))
majority_vote_class
## [1] "Red"
Under the average probability approach, we instead take the mean of the ten predicted probabilities; if the average is greater than 0.5, we classify the observation as Red, otherwise as Green.
# Average of the probability estimates
avg_prob <- mean(c(0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75))
avg_prob
## [1] 0.45
# Classify based on average probability
avg_vote_class <- ifelse(avg_prob > 0.5, "Red", "Green")
avg_vote_class
## [1] "Green"
These two methods can yield different results.
- Majority vote counts only which side of the 0.5 threshold each estimate falls on, so an estimate of 0.55 carries the same weight as one of 0.95; it is robust to a few extreme probabilities but ignores how confident each model is.
- Average probability uses the magnitude of every estimate, so a handful of very low (or very high) probabilities can pull the average across the threshold even when most models vote the other way. That is what happens here: six of the ten estimates exceed 0.5, yet the four small estimates drag the mean down to 0.45.
In this case, majority vote classifies the observation as "Red", while the average probability classifies it as "Green", illustrating how the choice of aggregation method can change the final decision.
A regression tree is a decision tree used for predicting continuous outcomes. The algorithm fits the tree by recursively splitting the data into regions that minimize the residual sum of squares (RSS). Below is a detailed explanation of the steps involved in the algorithm.
1. Start with the full dataset: consider all \(n\) observations and all \(p\) predictor variables.
2. Select the best predictor and split point: for each predictor \(X_j\) and each candidate split point \(s\), define the pair of half-planes \(R_1(j, s) = \{X \mid X_j < s\}\) and \(R_2(j, s) = \{X \mid X_j \ge s\}\), and choose the pair \((j, s)\) that minimizes the total RSS
\[
\sum_{i:\, x_i \in R_1(j,s)} \left(y_i - \hat{y}_{R_1}\right)^2 + \sum_{i:\, x_i \in R_2(j,s)} \left(y_i - \hat{y}_{R_2}\right)^2,
\]
where \(\hat{y}_{R_1}\) and \(\hat{y}_{R_2}\) are the mean responses in the two regions.
3. Split the data: apply the best split \((j, s)\) and divide the data into the two resulting regions.
4. Repeat the process recursively: treat each resulting region as a new dataset and search again for the best split within it.
5. Stop splitting when a stopping criterion is met: typical stopping conditions include a minimum number of observations in a node, a maximum tree depth, or no candidate split that reduces the RSS by a sufficient amount.
6. Assign a prediction to each terminal node: for each final region \(R_m\), the prediction is the mean response in that region,
\[
\hat{y}_{R_m} = \frac{1}{|R_m|} \sum_{i:\, x_i \in R_m} y_i.
\]
The regression tree algorithm works in a top-down, greedy fashion: greedy because it makes the split that is best at the current step without considering whether a different split might lead to a better tree in later steps. The tree partitions the predictor space so that observations within each region have similar response values, which keeps the within-region prediction error small.
This approach is easy to interpret and visualize, making regression trees a popular choice for modeling nonlinear relationships.
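As an illustration of the greedy split search in step 2, here is a minimal sketch (not part of the original solution) that scans every predictor and every candidate split point and returns the pair with the smallest total RSS; best_split is a hypothetical helper, and the example call reuses the df data frame defined earlier.
# Greedy search for the single best (predictor, split point) pair by RSS
best_split <- function(X, y) {
  best <- list(rss = Inf, var = NA, split = NA)
  for (j in seq_len(ncol(X))) {
    for (s in sort(unique(X[[j]]))) {
      left  <- y[X[[j]] <  s]
      right <- y[X[[j]] >= s]
      if (length(left) == 0 || length(right) == 0) next
      # RSS of the two regions created by splitting X_j at s
      rss <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
      if (rss < best$rss) best <- list(rss = rss, var = names(X)[j], split = s)
    }
  }
  best
}
# Example: the first split chosen on the fake dataset used for the tree above
best_split(df[, c("x1", "x2")], df$y)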