Let \(\hat{p}_{m1}\) denote the proportion of training observations in region \(m\) that belong to class 1 (i.e., the estimated probability of class 1 in that region) in a two-class classification problem.
We consider the following impurity measures as functions of \(\hat{p}_{m1}\):
\[ \text{Gini Index} = 2 \hat{p}_{m1}(1 - \hat{p}_{m1}) \]
\[ \text{Entropy} = - \left[ \hat{p}_{m1} \log_2(\hat{p}_{m1}) + (1 - \hat{p}_{m1}) \log_2(1 - \hat{p}_{m1}) \right] \]
\[ \text{Classification Error} = 1 - \max(\hat{p}_{m1}, 1 - \hat{p}_{m1}) \]
# Load required packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::expand() masks ggtree::expand()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Define range of probabilities for class 1
p <- seq(0, 1, length.out = 100)
# Create a data frame with impurity measures
impurity_df <- data.frame(
  `p_hat_m1` = p,
  `Gini Index` = 2 * p * (1 - p),
  `Entropy` = -(p * log2(p) + (1 - p) * log2(1 - p)),
  `Classification Error` = 1 - pmax(p, 1 - p),
  check.names = FALSE
)
# Handle NaN due to log(0)
impurity_df[is.na(impurity_df)] <- 0
# Reshape the data for plotting
impurity_long <- pivot_longer(impurity_df, !p_hat_m1, names_to = "Measure", values_to = "Value")
# Plot
ggplot(impurity_long, aes(x = p_hat_m1, y = Value, color = Measure)) +
  geom_line(linewidth = 1.2) +
  labs(
    title = "Gini Index, Entropy, and Classification Error vs p̂_m1",
    x = expression(hat(p)[m1]),
    y = "Value",
    color = "Impurity Measure"
  ) +
  theme_minimal(base_size = 14)
All three impurity measures reach their maximum at \(\hat{p}_{m1} = 0.5\), indicating highest uncertainty. Entropy is more sensitive to probability changes than Gini, while classification error is the least sensitive. At extreme values of \(\hat{p}_{m1}\) (near 0 or 1), all measures approach 0, reflecting high certainty.
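As a quick numerical check of these statements, the three measures can be evaluated at \(\hat{p}_{m1} = 0.5\) and at a nearly pure node; the small helper function below is defined only for this illustration.
# Evaluate all three impurity measures at a given class-1 proportion
impurity <- function(p) {
  c(
    `Gini Index` = 2 * p * (1 - p),
    Entropy = -(p * log2(p) + (1 - p) * log2(1 - p)),
    `Classification Error` = 1 - max(p, 1 - p)
  )
}
impurity(0.5)  # 0.50, 1.00, 0.50 -- all three at their maximum
impurity(0.9)  # 0.18, ~0.47, 0.10 -- all shrink as the node becomes purer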
library(ggtree)  # provides ggtree(), geom_tiplab(), geom_text2()

# Encode the decision tree as a Newick string; tip labels are the terminal-node
# values and the internal-node labels hold the split rules
tree <- ape::read.tree(text = "(((3:1.5,(10:1,0:1)A:1)B:1,15:2)C:1,5:2)D;")
tree$node.label <- c("X1 < 1", "X2 < 1", "X1 < 0", "X2 < 0")
ggtree(tree, ladderize = FALSE) + scale_x_reverse() + coord_flip() +
  geom_tiplab(vjust = 2, hjust = 0.5) +
  geom_text2(aes(label = label, subset = !isTip), hjust = -0.1, vjust = -1)
plot(NULL, xlab = "X1", ylab = "X2", xlim = c(-1, 2), ylim = c(0, 3), xaxs = "i", yaxs = "i")
abline(h = 1, col = "red", lty = 2) # horizontal line at X2 = 1
lines(c(1, 1), c(0, 1), col = "blue", lty = 2) # vertical line at X1 = 1 (only below X2=1)
lines(c(-1, 2), c(2, 2), col = "red", lty = 2) # horizontal at X2 = 2
lines(c(0, 0), c(1, 2), col = "blue", lty = 2) # vertical at X1 = 0 (between X2 = 1 and 2)
text(
  c(0, 1.5, -0.5, 1, 0.5),     # x positions (region centres)
  c(0.5, 0.5, 1.5, 1.5, 2.5),  # y positions (region centres)
  labels = c("-1.80", "0.63", "-1.06", "0.21", "2.49")  # predicted value in each region
)
Majority vote
x <- c(0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75)
ifelse(mean(x > 0.5) > 0.5, "red", "green")
## [1] "red"
Average probability
ifelse(mean(x) > 0.5, "red", "green")
## [1] "green"
Final Answer:
Majority Vote: Red
Average Probability: Green
The two methods can give different results because the majority vote depends only on how many of the individual estimates exceed 0.5, whereas the average also depends on how far each estimate is from the threshold; here the four small estimates pull the average below 0.5 even though a majority of the trees favor red.
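The disagreement is easy to see from the raw quantities: the share of estimates above 0.5 and the mean of the estimates fall on opposite sides of the threshold.
mean(x > 0.5)  # 0.6  -> 6 of the 10 trees vote red
mean(x)        # 0.45 -> the average probability is below 0.5, so green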
To fit a regression tree, we begin with binary recursive splitting, where the dataset is repeatedly split into two regions based on the predictor and cutpoint that minimize the residual sum of squares (RSS). This process continues until a stopping condition is met, such as a minimum number of observations in each terminal node. The resulting tree is typically large and may overfit the data, so we apply cost-complexity pruning, which introduces a penalty term controlled by a parameter α to balance model fit and complexity. A series of subtrees is generated by varying α, and K-fold cross-validation is used to select the value of α that minimizes the estimated test error. Finally, the regression tree is refit on the entire dataset using the selected α, producing the final pruned tree that aims to generalize well to unseen data.
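As a concrete illustration of this workflow, the sketch below uses the rpart package; the data frame `df` and response `y` are placeholders rather than objects from this document, and rpart's complexity parameter `cp` plays the role of the cost-complexity penalty \(\alpha\).
library(rpart)

# Hypothetical data frame `df` with a numeric response `y` and several predictors
fit <- rpart(
  y ~ ., data = df, method = "anova",
  control = rpart.control(cp = 0, minsplit = 10, xval = 10)  # grow a large tree; 10-fold CV
)

# rpart's built-in cross-validation fills in the xerror column of the complexity
# table; pick the cp that minimizes the cross-validated error and prune to it
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
A common refinement is the one-standard-error rule, which selects the simplest subtree whose cross-validated error is within one standard error (the `xstd` column of the complexity table) of the minimum.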