In a binary classification setting, let \(\hat{p}_{m1}\) denote the proportion of training observations in region \(m\) that belong to class 1, and let \(\hat{p}_{m2} = 1 - \hat{p}_{m1}\) be the proportion belonging to class 2.
The three impurity measures are the Gini index \(G\), the classification error \(E\), and the entropy (cross-entropy) \(D\).
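Written out for the two-class case, and matching the R code below (which uses a base-2 logarithm for the entropy), the three measures are
\[
G = 2\,\hat{p}_{m1}\left(1 - \hat{p}_{m1}\right), \qquad
E = 1 - \max\!\left(\hat{p}_{m1},\, 1 - \hat{p}_{m1}\right), \qquad
D = -\hat{p}_{m1}\log_2 \hat{p}_{m1} - \left(1 - \hat{p}_{m1}\right)\log_2\!\left(1 - \hat{p}_{m1}\right).
\]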
# Sequence of probabilities for class 1
p <- seq(0, 1, length.out = 200)
p2 <- 1 - p
# Gini index
gini <- 2 * p * (1 - p)
# Classification error
class_error <- 1 - pmax(p, p2)
# Entropy (handle log(0) by replacing NaNs with 0)
entropy <- -(p * log2(p) + p2 * log2(p2))
entropy[is.nan(entropy)] <- 0 # define 0*log(0) as 0
# Plot
plot(p, gini, type = "l", col = "blue", ylim = c(0, 1), ylab = "Measure", xlab = expression(hat(p)[m1]), lwd = 2,
main = "Gini Index, Classification Error, and Entropy")
lines(p, class_error, col = "red", lwd = 2)
lines(p, entropy, col = "darkgreen", lwd = 2)
legend("top", legend = c("Gini Index", "Classification Error", "Entropy"),
col = c("blue", "red", "darkgreen"), lwd = 2)
The plot illustrates how the three impurity measures behave in a binary classification setting as \(\hat{p}_{m1}\) ranges from 0 to 1:
Entropy (green curve) reaches its maximum when \(\hat{p}_{m1} = 0.5\), meaning uncertainty is highest when the two classes are equally probable. It decreases symmetrically as one class becomes more dominant (\(\hat{p}_{m1}\) closer to 0 or 1), and it is the most sensitive of the three to changes in the class distribution near \(\hat{p}_{m1} = 0.5\).
The Gini index (blue curve) also peaks at \(\hat{p}_{m1} = 0.5\), reflecting maximum impurity when both classes are equally likely. Its shape is similar to that of entropy but slightly lower in magnitude and curvature, and it is computationally simpler, which is why it is the default in decision tree algorithms such as CART.
Classification error (red curve) is the least sensitive of the three. It is defined as \(1 - \max(\hat{p}_{m1}, \hat{p}_{m2})\), so it changes only linearly as one class becomes dominant, peaking at \(\hat{p}_{m1} = 0.5\) where the class probabilities are equal. It captures only the misclassification rate and does not reflect finer changes in the class distribution the way entropy and the Gini index do; the numerical example below illustrates this.
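As a concrete (made-up) illustration of this difference in sensitivity, consider a node containing 400 observations of each class and two candidate splits: split A produces children with counts (300, 100) and (100, 300), while split B produces (200, 400) and (200, 0). Both splits have the same weighted classification error, but the Gini index and entropy prefer split B because it creates a pure child node. A short sketch, using hypothetical helper functions not in the original code:
# Hypothetical example (not from the text): compare two candidate splits of a
# node containing 400 observations of class Red and 400 of class Green.
gini_fn    <- function(p) 2 * p * (1 - p)
entropy_fn <- function(p) ifelse(p %in% c(0, 1), 0, -(p * log2(p) + (1 - p) * log2(1 - p)))
error_fn   <- function(p) 1 - pmax(p, 1 - p)
# Weighted impurity of a split; n1 and n2 are (Red, Green) counts in the two children
weighted_impurity <- function(n1, n2, fn) {
  p1 <- n1[1] / sum(n1); p2 <- n2[1] / sum(n2)
  w1 <- sum(n1) / (sum(n1) + sum(n2))
  w1 * fn(p1) + (1 - w1) * fn(p2)
}
splitA <- list(c(300, 100), c(100, 300))  # children (300 Red, 100 Green) / (100 Red, 300 Green)
splitB <- list(c(200, 400), c(200, 0))    # children (200 Red, 400 Green) / (200 Red, 0 Green)
sapply(list(gini = gini_fn, entropy = entropy_fn, error = error_fn), function(fn)
  c(A = weighted_impurity(splitA[[1]], splitA[[2]], fn),
    B = weighted_impurity(splitB[[1]], splitB[[2]], fn)))
# The weighted classification error is 0.25 for both splits, while the Gini index
# and entropy are lower for split B, so only Gini and entropy distinguish the two.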
# Install the required packages if they are not already available
if (!require(rpart)) install.packages("rpart")
if (!require(rpart.plot)) install.packages("rpart.plot")
## Loading required package: rpart
library(rpart.plot)
library(rpart)
# Define a fake dataset to represent the regions and allow rpart to draw the tree
set.seed(42)
x1 <- runif(100, 0, 1)
x2 <- runif(100, 0, 1)
y <- ifelse(x2 > 0.5,
15,
ifelse(x1 < 0.25, 3,
ifelse(x1 < 0.5,
ifelse(x2 < 0.25, 10, 0), 5)))
df <- data.frame(x1 = x1, x2 = x2, y = y)
# Fit the model
tree_fit <- rpart(y ~ x1 + x2, data = df, control = rpart.control(cp = 0, minsplit = 2))
rpart.plot(tree_fit, main = "Tree for Left Panel Partition", extra = 1)
The left panel of Figure 8.14 shows a recursive binary partitioning of the predictor space defined by two features, X1 and X2. The fitted rpart tree reproduces the structure corresponding to this partition. The tree:
- First splits on X2 < 0.51, dividing the space into upper (X2 ≥
0.51) and lower (X2 < 0.51) regions.
- Then recursively splits based on values of X1 and X2 to isolate
smaller rectangular regions.
- Each terminal node of the tree shows the average response Y in that
region, matching the numbers in the partition diagram (0, 3, 5, 10,
15).
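As a quick check (an addition beyond the original output), printing the fitted rpart object lists each split threshold and the mean response in every terminal node, which can be compared against the values in the partition diagram:
# Print the split rules and terminal-node means of the fitted tree
print(tree_fit)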
# Create a partition plot manually for the right-panel tree's splits
plot(NA, xlim = c(-3, 3), ylim = c(-3, 3), xlab = "X1", ylab = "X2", main = "Partition for Right Panel Tree")
# First split: horizontal line at X2 = 1 across the whole range
abline(h = 1, col = "gray")
# Within X2 < 1: vertical split at X1 = 1
segments(1, -3, 1, 1, col = "gray")
# Within X2 >= 1: horizontal split at X2 = 2, then vertical split at X1 = 0 for 1 <= X2 < 2
abline(h = 2, col = "gray")
segments(0, 1, 0, 2, col = "gray")
# Region labels: mean predictions from the tree's terminal nodes
text(-1, -1, "-1.80", col = "blue", cex = 1.2)
text(2, -1, "0.63", col = "blue", cex = 1.2)
text(-1.5, 1.5, "-1.06", col = "blue", cex = 1.2)
text(1.5, 1.5, "0.21", col = "blue", cex = 1.2)
text(0, 2.5, "2.49", col = "blue", cex = 1.2)
The right panel of Figure 8.14 shows a regression tree, and the
corresponding partition has been manually recreated by drawing
horizontal and vertical lines at the tree’s split thresholds:
- The tree splits first at X2 < 1, creating a horizontal division of the space.
- Within the region X2 < 1 it then splits at X1 < 1; within the region X2 ≥ 1 it splits at X2 < 2 and then at X1 < 0, producing 5 distinct regions in total.
- The resulting partitioning of the X1-X2 space exactly mirrors the
decision boundaries in the tree.
- The labels in the plot (e.g., -1.80, 0.63, etc.) are the mean
predictions for Y in each region and match the terminal node values from
the regression tree.
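To make the correspondence explicit, the tree described above can be written as a small prediction function. This is a sketch assuming the split structure and node means listed above; predict_right_tree is a hypothetical helper, not part of the original code.
# Encode the right-panel tree as a nested set of splits
predict_right_tree <- function(x1, x2) {
  if (x2 < 1) {
    if (x1 < 1) -1.80 else 0.63
  } else if (x2 < 2) {
    if (x1 < 0) -1.06 else 0.21
  } else {
    2.49
  }
}
predict_right_tree(-2, 0)    # falls in the region with mean -1.80
predict_right_tree(0.5, 2.5) # falls in the region with mean 2.49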
We will analyze two common methods of aggregating these probability estimates: the majority vote and the average probability.
Under the majority vote approach, each model whose predicted probability exceeds 0.5 votes for Red and otherwise votes for Green; the class receiving the most votes across the ten models becomes the final prediction.
# Vector of estimated probabilities
prob_estimates <- c(0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75)
# Classify each estimate based on threshold 0.5
class_labels <- ifelse(prob_estimates > 0.5, "Red", "Green")
# Count how many votes for each class
table(class_labels)
## class_labels
## Green Red
## 4 6
# Determine majority class
majority_vote_class <- names(which.max(table(class_labels)))
majority_vote_class
## [1] "Red"
Under the average probability approach, we instead take the mean of the ten predicted probabilities; if the average is greater than 0.5, we classify the observation as Red, otherwise as Green.
# Average of the probability estimates
avg_prob <- mean(c(0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75))
avg_prob
## [1] 0.45
# Classify based on average probability
avg_vote_class <- ifelse(avg_prob > 0.5, "Red", "Green")
avg_vote_class
## [1] "Green"
These two methods can yield different results.
- Majority vote counts only which side of the 0.5 threshold each estimate falls on, so an estimate of 0.55 carries the same weight as one of 0.95; it is robust to a few extreme probabilities but ignores how confident each model is.
- Average probability uses the magnitude of every estimate, so a handful of very low (or very high) probabilities can pull the average across the threshold even when most models vote the other way. That is what happens here: six of the ten estimates exceed 0.5, yet the four small estimates drag the mean down to 0.45.
In this case, majority vote classifies the observation as "Red", while the average probability classifies it as "Green", illustrating how the choice of aggregation method can change the final decision.
A regression tree is a decision tree used for predicting continuous outcomes. The algorithm fits the tree by recursively splitting the data into regions that minimize the residual sum of squares (RSS). Below is a detailed explanation of the steps involved in the algorithm.
1. Start with the full dataset: consider all \(n\) observations and all \(p\) predictor variables.
2. Select the best predictor and split point: for each predictor \(X_j\) and each candidate split point \(s\), define the pair of half-planes \(R_1(j, s) = \{X \mid X_j < s\}\) and \(R_2(j, s) = \{X \mid X_j \ge s\}\), and choose the pair \((j, s)\) that minimizes the total RSS
\[
\sum_{i:\, x_i \in R_1(j,s)} \left(y_i - \hat{y}_{R_1}\right)^2 + \sum_{i:\, x_i \in R_2(j,s)} \left(y_i - \hat{y}_{R_2}\right)^2,
\]
where \(\hat{y}_{R_1}\) and \(\hat{y}_{R_2}\) are the mean responses in the two regions.
3. Split the data: apply the best split \((j, s)\) and divide the data into the two resulting regions.
4. Repeat the process recursively: treat each resulting region as a new dataset and search again for the best split within it.
5. Stop splitting when a stopping criterion is met: typical stopping conditions include a minimum number of observations in a node, a maximum tree depth, or no candidate split that reduces the RSS by a sufficient amount.
6. Assign a prediction to each terminal node: for each final region \(R_m\), the prediction is the mean response in that region,
\[
\hat{y}_{R_m} = \frac{1}{|R_m|} \sum_{i:\, x_i \in R_m} y_i.
\]
The regression tree algorithm works in a top-down, greedy fashion: greedy because it makes the split that is best at the current step without considering whether a different split might lead to a better tree in later steps. The tree partitions the predictor space so that observations within each region have similar response values, which keeps the within-region prediction error small.
This approach is easy to interpret and visualize, making regression trees a popular choice for modeling nonlinear relationships.
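As an illustration of the greedy split search in step 2, here is a minimal sketch (not part of the original solution) that scans every predictor and every candidate split point and returns the pair with the smallest total RSS; best_split is a hypothetical helper, and the example call reuses the df data frame defined earlier.
# Greedy search for the single best (predictor, split point) pair by RSS
best_split <- function(X, y) {
  best <- list(rss = Inf, var = NA, split = NA)
  for (j in seq_len(ncol(X))) {
    for (s in sort(unique(X[[j]]))) {
      left  <- y[X[[j]] <  s]
      right <- y[X[[j]] >= s]
      if (length(left) == 0 || length(right) == 0) next
      # RSS of the two regions created by splitting X_j at s
      rss <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
      if (rss < best$rss) best <- list(rss = rss, var = names(X)[j], split = s)
    }
  }
  best
}
# Example: the first split chosen on the fake dataset used for the tree above
best_split(df[, c("x1", "x2")], df$y)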