Question 3

Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of \(\hat{p}_{m1}\). The x-axis should display \(\hat{p}_{m1}\), ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy.

Hint: In a setting with two classes, \(\hat{p}_{m1} = 1 - \hat{p}_{m2}\). You could make this plot by hand, but it will be much easier to make in R.

# Sequence of probabilities from 0 to 1
p <- seq(0, 1, length = 100)

# Gini index
gini <- 2 * p * (1 - p)

# Classification error
classification_error <- 1 - pmax(p, 1 - p)

# Entropy (with handling for log(0))
entropy <- -p * log2(p) - (1 - p) * log2(1 - p)
entropy[is.nan(entropy)] <- 0  # Replace NaN with 0 at p = 0 or 1

# Plot all three
plot(p, gini, type = "l", col = "blue", ylim = c(0, 1),
     ylab = "Impurity", xlab = expression(hat(p)[m1]),
     main = "Gini, Entropy, and Classification Error vs. p-hat")
lines(p, entropy, col = "red", lty = 2)
lines(p, classification_error, col = "darkgreen", lty = 3)
legend("top", legend = c("Gini", "Entropy", "Classification Error"),
       col = c("blue", "red", "darkgreen"), lty = c(1, 2, 3))

Question 4

This question relates to the plots in Figure 8.14.

Part A

Sketch the tree corresponding to the partition of the predictor space illustrated in the left-hand panel of Figure 8.14. The numbers inside the boxes indicate the mean of Y within each region.

# Plot settings
plot(NULL, xlim = c(0, 1), ylim = c(0, 1),
     xlab = expression(X[1]), ylab = expression(X[2]),
     main = "Exercise 4(a): Partition to Tree")
rect(0, 0, 0.5, 0.5)         # region with mean 3
rect(0, 0.5, 0.5, 1)         # region with mean 15
rect(0.5, 0, 0.75, 0.5)      # region with mean 10
rect(0.75, 0, 1, 0.5)        # region with mean 0
rect(0.5, 0.5, 1, 1)         # region with mean 5

# Add region means
text(0.25, 0.25, "3")
text(0.25, 0.75, "15")
text(0.625, 0.25, "10")
text(0.875, 0.25, "0")
text(0.75, 0.75, "5")
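
The code above reproduces the partition; the tree itself can also be sketched in base R. The following is a minimal sketch that reads the split points (0.5 and 0.75) off the rect() calls above rather than off the book's figure, so the cut values are assumptions:

# Sketch of the tree corresponding to the partition drawn above
# (left branch = condition true, right branch = condition false)
plot(NULL, xlim = c(0, 1), ylim = c(0, 1), axes = FALSE, xlab = "", ylab = "",
     main = "Exercise 4(a): Tree Sketch")

# Internal nodes
text(0.50, 0.95, expression(X[1] < 0.5))
text(0.25, 0.70, expression(X[2] < 0.5))
text(0.75, 0.70, expression(X[2] < 0.5))
text(0.65, 0.45, expression(X[1] < 0.75))

# Branches
segments(0.50, 0.92, c(0.25, 0.75), 0.73)
segments(0.25, 0.67, c(0.15, 0.35), 0.50)
segments(0.75, 0.67, c(0.65, 0.85), 0.50)
segments(0.65, 0.42, c(0.55, 0.75), 0.25)

# Terminal nodes (region means)
text(0.15, 0.47, "3")
text(0.35, 0.47, "15")
text(0.85, 0.47, "5")
text(0.55, 0.22, "10")
text(0.75, 0.22, "0")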

Part B

Create a diagram similar to the left-hand panel of Figure 8.14, using the tree illustrated in the right-hand panel of the same figure. You should divide up the predictor space into the correct regions, and indicate the mean for each region.

# Load the tree-fitting and plotting packages
library(rpart)
library(rpart.plot)

# Simulate predictors that cover every split point in the right-hand tree
# (the ranges below are assumed; any values spanning X1 < 0 through X1 >= 1
# and X2 < 1 through X2 >= 2 reproduce the same splits)
set.seed(1)
X1 <- runif(500, -1, 2)
X2 <- runif(500, 0, 3)

# Create Y based on the tree splits:
# If X2 < 1 → if X1 < 1 → -1.80 else 0.63
# Else → if X2 < 2 → if X1 < 0 → -1.06 else 0.21
#        else → 2.49
Y <- ifelse(X2 < 1,
            ifelse(X1 < 1, -1.80, 0.63),
            ifelse(X2 < 2,
                   ifelse(X1 < 0, -1.06, 0.21),
                   2.49))

data <- data.frame(X1 = X1, X2 = X2, Y = Y)

# Fit a regression tree
tree_model <- rpart(Y ~ X1 + X2, data = data, method = "anova",
                    control = rpart.control(cp = 0.001, minsplit = 2))

# Plot the tree
rpart.plot(tree_model, type = 3, fallen.leaves = TRUE, roundint = FALSE,
           main = "Exercise 4(b): Tree Diagram")
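
The code above reconstructs and re-plots the tree itself; the partition diagram that Part B asks for can also be drawn directly. Below is a minimal base-R sketch; only the split points X1 = 0, 1 and X2 = 1, 2 and the leaf means come from the tree, while the axis limits are assumed:

# Partition implied by the right-hand tree of Figure 8.14
# (axis limits are assumed; only the split points matter)
plot(NULL, xlim = c(-1, 2), ylim = c(0, 3), xlab = expression(X[1]),
     ylab = expression(X[2]), main = "Exercise 4(b): Partition Diagram")

rect(-1, 0, 1, 1)    # X2 < 1,      X1 < 1  -> mean -1.80
rect( 1, 0, 2, 1)    # X2 < 1,      X1 >= 1 -> mean  0.63
rect(-1, 1, 0, 2)    # 1 <= X2 < 2, X1 < 0  -> mean -1.06
rect( 0, 1, 2, 2)    # 1 <= X2 < 2, X1 >= 0 -> mean  0.21
rect(-1, 2, 2, 3)    # X2 >= 2              -> mean  2.49

# Add region means
text( 0.0, 0.5, "-1.80")
text( 1.5, 0.5, "0.63")
text(-0.5, 1.5, "-1.06")
text( 1.0, 1.5, "0.21")
text( 0.5, 2.5, "2.49")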

Question 5

Suppose we produce ten bootstrapped samples from a data set containing red and green classes. We then apply a classification tree to each bootstrapped sample and, for a specific value of X, produce 10 estimates of P(Class is Red|X):

0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, and 0.75.

There are two common ways to combine these results together into a single class prediction. One is the majority vote approach discussed in this chapter. The second approach is to classify based on the average probability. In this example, what is the final classification under each of these two approaches?

# Given bootstrap estimates of P(Class = Red | X)
probs <- c(0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75)

# Approach 1: Classification by average probability
avg_prob <- mean(probs)
class_avg <- ifelse(avg_prob > 0.5, "Red", "Green")

# Approach 2: Classification by majority vote
# Each tree votes "Red" if P(Red | X) > 0.5
votes_red <- sum(probs > 0.5)
votes_green <- length(probs) - votes_red
class_majority <- ifelse(votes_red > votes_green, "Red", "Green")

# Report the results
cat("Average Probability:", avg_prob, "\n")
cat("Classification (Average Probability):", class_avg, "\n")
cat("Votes for Red:", votes_red, "\n")
cat("Votes for Green:", votes_green, "\n")
cat("Classification (Majority Vote):", class_majority, "\n")
## Average Probability: 0.45
## Classification (Average Probability): Green
## Votes for Red: 6
## Votes for Green: 4
## Classification (Majority Vote): Red

Question 6

Provide a detailed explanation of the algorithm that is used to fit a regression tree.

Regression Tree Fitting Algorithm (CART)

A regression tree is a type of decision tree used when the response variable is continuous. It recursively partitions the predictor space and fits a simple model (typically the mean) in each region.


Step-by-Step Algorithm

1. Start with the Full Dataset

We begin with a dataset containing predictors \(X_1, X_2, \ldots, X_p\) and a continuous response \(Y\). The goal is to divide the space into regions \(R_1, R_2, \ldots, R_M\), each with a predicted value \(\hat{y}_m\) equal to the mean of \(Y\) in that region.


2. Recursive Binary Splitting

This is the fundamental process used to grow the tree:

  • At each step, consider all predictors \(X_j\) and all possible split points \(s\).
  • Split the data into two regions: \[ R_1(j, s) = \{ X \mid X_j \leq s \}, \quad R_2(j, s) = \{ X \mid X_j > s \} \]
  • Choose the pair \((j, s)\) that minimizes the sum of squared residuals (SSR): \[ \text{SSR} = \sum_{i:\, x_i \in R_1(j, s)} (y_i - \hat{y}_{R_1})^2 + \sum_{i:\, x_i \in R_2(j, s)} (y_i - \hat{y}_{R_2})^2 \]

This is a greedy algorithm — it chooses the best split locally, not globally.
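
As a concrete illustration, here is a minimal R sketch of a single greedy split search written directly from the SSR criterion above. The function and variable names are illustrative, not from any package, and real implementations typically split at midpoints between sorted values rather than at observed values:

# Find the single best split (j, s) by exhaustive search over the SSR criterion.
# X is a data frame of numeric predictors, y a numeric response.
best_split <- function(X, y) {
  best <- list(ssr = Inf, var = NA, split = NA)
  for (j in names(X)) {
    for (s in sort(unique(X[[j]]))) {
      left  <- y[X[[j]] <= s]
      right <- y[X[[j]] > s]
      if (length(left) == 0 || length(right) == 0) next
      ssr <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
      if (ssr < best$ssr) best <- list(ssr = ssr, var = j, split = s)
    }
  }
  best
}

# Example: best first split on a toy data set
set.seed(1)
toy <- data.frame(X1 = runif(50), X2 = runif(50))
y <- ifelse(toy$X1 < 0.4, 2, 8) + rnorm(50, sd = 0.5)
best_split(toy, y)   # should recover a split on X1 near 0.4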


3. Repeat the Splitting

  • Apply the same splitting procedure to each resulting region (node).
  • Continue recursively until a stopping condition is met (see the rpart.control() sketch after this list):
    • Minimum number of observations in a node (minsplit)
    • Maximum depth
    • Minimum reduction in error
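
These stopping rules correspond directly to control parameters in the rpart package. A brief illustrative call follows; the specific values are arbitrary:

library(rpart)

# Stopping rules expressed as rpart control parameters (values are illustrative)
ctrl <- rpart.control(minsplit = 20,   # minimum observations needed to attempt a split
                      maxdepth = 5,    # maximum depth of any node
                      cp       = 0.01) # minimum improvement required to keep a split

# The control object is then passed to rpart(), e.g. rpart(Y ~ ., data, control = ctrl)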

4. Assign Predictions to Terminal Nodes

Each terminal node (region \(R_m\)) receives a prediction: \[ \hat{y}_m = \frac{1}{|R_m|} \sum_{x_i \in R_m} y_i \]

That is, the mean of \(Y\) in that region.
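
A quick way to see this in R: the fitted value for each region is simply a group mean, for example via tapply(). The vectors below are made up for illustration:

# Leaf predictions are just per-region means of the response.
# 'region' is an illustrative factor recording which leaf each observation falls in.
y      <- c(1.2, 0.8, 1.0, 5.0, 4.7, 5.3)
region <- factor(c("R1", "R1", "R1", "R2", "R2", "R2"))
tapply(y, region, mean)   # R1 = 1, R2 = 5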


5. (Optional) Tree Pruning

Fully grown trees may overfit. To avoid this, we use cost-complexity pruning: \[ C_\alpha(T) = \sum_{m=1}^{|T|} \sum_{x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T| \]

  • \(|T|\): number of terminal nodes (leaves)
  • \(\alpha\): complexity parameter controlling tree size vs. fit

Cross-validation is used to choose the optimal \(\alpha\).
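
In R, the rpart package implements cost-complexity pruning; its cp argument plays the role of \(\alpha\) (rescaled by the root-node error), and cross-validated errors for each candidate cp are stored in the fitted tree. A minimal sketch on simulated data follows; the data set and parameter values are illustrative:

library(rpart)

# Grow a deliberately large tree, then prune it back by cost complexity
set.seed(1)
df   <- data.frame(x1 = runif(200), x2 = runif(200))
df$y <- ifelse(df$x1 < 0.5, 3, 8) + rnorm(200)

big_tree <- rpart(y ~ x1 + x2, data = df, method = "anova",
                  control = rpart.control(cp = 0, minsplit = 5))

# Cross-validation results computed by rpart, one row per candidate cp
printcp(big_tree)

# Prune at the cp with the smallest cross-validated error
best_cp <- big_tree$cptable[which.min(big_tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(big_tree, cp = best_cp)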


Summary Table

Step                  Description
1. Initialize         Start with the full dataset
2. Split              Find the best predictor and split point to minimize SSR
3. Recurse            Apply the same splitting process to each resulting node
4. Stop               Stop based on minimum node size, depth, or error improvement
5. Predict            Assign the mean response of each region as its prediction
6. Prune (optional)   Reduce overfitting using cost-complexity pruning

Regression trees are easy to interpret and useful for capturing nonlinear relationships — especially when combined into ensembles like bagging or boosting.