Let \(\hat{p}_{m1}\) denote the proportion of training observations in region \(m\) that belong to class 1 (i.e., the estimated probability of class 1 in that region) in a two-class classification problem.
We consider the following impurity measures as functions of \(\hat{p}_{m1}\):
\[ \text{Gini Index} = 2 \hat{p}_{m1}(1 - \hat{p}_{m1}) \]
\[ \text{Entropy} = - \left[ \hat{p}_{m1} \log_2(\hat{p}_{m1}) + (1 - \hat{p}_{m1}) \log_2(1 - \hat{p}_{m1}) \right] \]
\[ \text{Classification Error} = 1 - \max(\hat{p}_{m1}, 1 - \hat{p}_{m1}) \]
# Load required packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::expand() masks ggtree::expand()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Define range of probabilities for class 1
p <- seq(0, 1, length.out = 100)
# Create a data frame with impurity measures
impurity_df <- data.frame(
  `p_hat_m1` = p,
  `Gini Index` = 2 * p * (1 - p),
  `Entropy` = -(p * log2(p) + (1 - p) * log2(1 - p)),
  `Classification Error` = 1 - pmax(p, 1 - p),
  check.names = FALSE
)
# Handle NaN due to log(0)
impurity_df[is.na(impurity_df)] <- 0
# Reshape the data for plotting
impurity_long <- pivot_longer(impurity_df, !p_hat_m1, names_to = "Measure", values_to = "Value")
# Plot
ggplot(impurity_long, aes(x = p_hat_m1, y = Value, color = Measure)) +
  geom_line(linewidth = 1.2) +
  labs(
    title = "Gini Index, Entropy, and Classification Error vs p̂_m1",
    x = expression(hat(p)[m1]),
    y = "Value",
    color = "Impurity Measure"
  ) +
  theme_minimal(base_size = 14)
All three impurity measures reach their maximum at \(\hat{p}_{m1} = 0.5\), indicating highest uncertainty. Entropy is more sensitive to probability changes than Gini, while classification error is the least sensitive. At extreme values of \(\hat{p}_{m1}\) (near 0 or 1), all measures approach 0, reflecting high certainty.
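As a quick numerical check of these statements, the three measures can be evaluated at \(\hat{p}_{m1} = 0.5\) and at a nearly pure node; the small helper function below is defined only for this illustration.
# Evaluate all three impurity measures at a given class-1 proportion
impurity <- function(p) {
  c(
    `Gini Index` = 2 * p * (1 - p),
    Entropy = -(p * log2(p) + (1 - p) * log2(1 - p)),
    `Classification Error` = 1 - max(p, 1 - p)
  )
}
impurity(0.5)  # 0.50, 1.00, 0.50 -- all three at their maximum
impurity(0.9)  # 0.18, ~0.47, 0.10 -- all shrink as the node becomes purer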
library(ggtree)  # provides ggtree(), geom_tiplab(), geom_text2()

# Encode the decision tree as a Newick string; tip labels are the terminal-node
# values and the internal-node labels hold the split rules
tree <- ape::read.tree(text = "(((3:1.5,(10:1,0:1)A:1)B:1,15:2)C:1,5:2)D;")
tree$node.label <- c("X1 < 1", "X2 < 1", "X1 < 0", "X2 < 0")
ggtree(tree, ladderize = FALSE) + scale_x_reverse() + coord_flip() +
  geom_tiplab(vjust = 2, hjust = 0.5) +
  geom_text2(aes(label = label, subset = !isTip), hjust = -0.1, vjust = -1)
plot(NULL, xlab = "X1", ylab = "X2", xlim = c(-1, 2), ylim = c(0, 3), xaxs = "i", yaxs = "i")
abline(h = 1, col = "red", lty = 2) # horizontal line at X2 = 1
lines(c(1, 1), c(0, 1), col = "blue", lty = 2) # vertical line at X1 = 1 (only below X2=1)
lines(c(-1, 2), c(2, 2), col = "red", lty = 2) # horizontal at X2 = 2
lines(c(0, 0), c(1, 2), col = "blue", lty = 2) # vertical at X1 = 0 (between X2 = 1 and 2)
text(
  c(0, 1.5, -0.5, 1, 0.5),     # x positions (region centres)
  c(0.5, 0.5, 1.5, 1.5, 2.5),  # y positions (region centres)
  labels = c("-1.80", "0.63", "-1.06", "0.21", "2.49")  # predicted value in each region
)
Majority vote
x <- c(0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75)
ifelse(mean(x > 0.5) > 0.5, "red", "green")
## [1] "red"
Average probability
ifelse(mean(x) > 0.5, "red", "green")
## [1] "green"
Final Answer:
Majority Vote: Red
Average Probability: Green
The two methods can give different results because the majority vote depends only on how many of the individual estimates exceed 0.5, whereas the average also depends on how far each estimate is from the threshold; here the four small estimates pull the average below 0.5 even though a majority of the trees favor red.
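The disagreement is easy to see from the raw quantities: the share of estimates above 0.5 and the mean of the estimates fall on opposite sides of the threshold.
mean(x > 0.5)  # 0.6  -> 6 of the 10 trees vote red
mean(x)        # 0.45 -> the average probability is below 0.5, so green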
To fit a regression tree, we begin with binary recursive splitting, where the dataset is repeatedly split into two regions based on the predictor and cutpoint that minimize the residual sum of squares (RSS). This process continues until a stopping condition is met, such as a minimum number of observations in each terminal node. The resulting tree is typically large and may overfit the data, so we apply cost-complexity pruning, which introduces a penalty term controlled by a parameter α to balance model fit and complexity. A series of subtrees is generated by varying α, and K-fold cross-validation is used to select the value of α that minimizes the estimated test error. Finally, the regression tree is refit on the entire dataset using the selected α, producing the final pruned tree that aims to generalize well to unseen data.
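As a concrete illustration of this workflow, the sketch below uses the rpart package; the data frame `df` and response `y` are placeholders rather than objects from this document, and rpart's complexity parameter `cp` plays the role of the cost-complexity penalty \(\alpha\).
library(rpart)

# Hypothetical data frame `df` with a numeric response `y` and several predictors
fit <- rpart(
  y ~ ., data = df, method = "anova",
  control = rpart.control(cp = 0, minsplit = 10, xval = 10)  # grow a large tree; 10-fold CV
)

# rpart's built-in cross-validation fills in the xerror column of the complexity
# table; pick the cp that minimizes the cross-validated error and prune to it
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
A common refinement is the one-standard-error rule, which selects the simplest subtree whose cross-validated error is within one standard error (the `xstd` column of the complexity table) of the minimum.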