3. Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of p̂_m1. The x-axis should display p̂_m1, ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy.

Hint: In a setting with two classes, p̂_m1 = 1 − p̂_m2. You could make this plot by hand, but it will be much easier to make in R.
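For two classes, writing p̂ = p̂_m1, the three measures reduce to: Gini index G = 2 p̂ (1 − p̂); classification error E = 1 − max(p̂, 1 − p̂); entropy D = −p̂ log2(p̂) − (1 − p̂) log2(1 − p̂). The R code below plots all three on the same axes.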

# Grid of values for p_hat = p_m1
p <- seq(0, 1, by = 0.01)

# Gini index: 2 * p * (1 - p)
gini <- 2 * p * (1 - p)

# Classification error: 1 - max(p, 1 - p)
class_error <- 1 - pmax(p, 1 - p)

# Entropy: -p*log2(p) - (1-p)*log2(1-p); p*log2(p) is NaN at p = 0 and p = 1,
# so define 0*log2(0) = 0
entropy <- -(p * log2(p) + (1 - p) * log2(1 - p))
entropy[is.na(entropy)] <- 0

plot(p, gini, type = "l", col = "red", ylim = c(0, 1),
     xlab = "P(Class = 1)", ylab = "Impurity Measure")
lines(p, class_error, col = "blue")
lines(p, entropy, col = "darkgreen")
legend("topright", legend = c("Gini Index", "Classification Error", "Entropy"),
       col = c("red", "blue", "darkgreen"), lty = 1)

4. This question relates to the plots in Figure 8.14.

(a) Sketch the tree corresponding to the partition of the predictor space illustrated in the left-hand panel of Figure 8.14. The numbers inside the boxes indicate the mean of Y within each region.

            X1 < 0.5
           /        \
       Yes           No
      (6)       X2 < 0.5
                 /     \
              Yes       No
              (2)      (10)

(b) Create a diagram similar to the left-hand panel of Figure 8.14, using the tree illustrated in the right-hand panel of the same figure. You should divide up the predictor space into the correct regions, and indicate the mean for each region.

#    x2
#     |----------------------------------|
#     |               2.49               |
#   2 |----------------------------------|
#     |  -1.06   |          0.21         |
#   1 |----------------------------------|
#     |       -1.80        |     0.63    |
#     |----------------------------------|  x1
#                0         1
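For reference, a minimal base-R sketch that reproduces this partition. The axis limits are arbitrary display choices; the split values (X2 = 1, X2 = 2, X1 = 0, X1 = 1) and the region means are read off the right-hand tree.

# Empty plotting region covering the splits
plot(NA, xlim = c(-1, 2), ylim = c(0, 3), xlab = "X1", ylab = "X2")
abline(h = c(1, 2))        # horizontal splits at X2 = 1 and X2 = 2
segments(1, 0, 1, 1)       # X1 = 1 split, only where X2 < 1
segments(0, 1, 0, 2)       # X1 = 0 split, only where 1 <= X2 < 2
# Region means
text(0, 0.5, "-1.80"); text(1.5, 0.5, "0.63")
text(-0.5, 1.5, "-1.06"); text(1, 1.5, "0.21")
text(0.5, 2.5, "2.49")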

5. Suppose we produce ten bootstrapped samples from a data set containing red and green classes. We then apply a classification tree to each bootstrapped sample and, for a specific value of X, produce 10 estimates of P(Class is Red | X): 0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, and 0.75.

There are two common ways to combine these results together into a single class prediction. One is the majority vote approach discussed in this chapter. The second approach is to classify based on the average probability. In this example, what is the final classification under each of these two approaches?

We are given 10 probabilities:

0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75

  1. Majority vote: 6 of the 10 probabilities are ≥ 0.5 and 4 are < 0.5, so the majority-vote classification is Red.

  2. Average probability: the average is (sum of all 10 values) / 10 = 4.5 / 10 = 0.45 < 0.5, so the average-probability classification is Green.
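A quick check in R (treating 0.5 as the classification threshold):

probs <- c(0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75)

sum(probs >= 0.5)   # 6 of 10 trees vote Red  -> majority vote: Red
mean(probs)         # 0.45 < 0.5              -> average probability: Green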

6. Provide a detailed explanation of the algorithm that is used to fit a regression tree.

Regression Tree Fitting Algorithm:

  1. Start with all observations in a single region (the root node).

  2. Search over all predictors X_j and all possible cutpoints s to find the split that minimizes the residual sum of squares (RSS) of the two resulting regions (see the sketch after this list).

  3. Split the data into two regions using the optimal predictor and cutpoint found in step 2.

  4. Repeat recursively: apply steps 2 and 3 within each resulting region, continuing to split so as to minimize RSS.

  5. Stop when a stopping criterion is met, for example:
     - a minimum node size is reached,
     - the improvement in RSS from further splitting is negligible, or
     - a maximum tree depth is reached.
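As an illustration of step 2, here is a minimal sketch of the exhaustive search for a single split on one predictor; the data are made up purely for demonstration.

# Toy greedy split: for one predictor x and response y, try every
# candidate cutpoint and keep the one with the lowest RSS.
set.seed(1)
x <- runif(50)
y <- ifelse(x < 0.4, 2, 10) + rnorm(50)   # made-up data with a break near 0.4

rss_for_cut <- function(s) {
  left  <- y[x <  s]
  right <- y[x >= s]
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
}

cuts <- sort(unique(x))[-1]               # candidate cutpoints (skip the smallest)
best <- cuts[which.min(sapply(cuts, rss_for_cut))]
best                                      # estimated cutpoint, close to 0.4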

Pruning:

To avoid overfitting, trees are often grown large and then pruned back using cost-complexity pruning, where a complexity parameter α penalizes tree size.

The best α is typically chosen via cross-validation.

This greedy, top-down approach is known as recursive binary splitting and is central to CART (Classification and Regression Trees).
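A short end-to-end sketch of the grow-then-prune workflow, using the tree package and the Boston data from MASS (the same tools as the ISLR lab); the exact pruned size will depend on the cross-validation run.

library(MASS)    # Boston housing data
library(tree)

# Grow a regression tree by recursive binary splitting (minimizing RSS)
fit <- tree(medv ~ ., data = Boston)

# Cost-complexity pruning: use cross-validation to pick the subtree size
set.seed(1)
cv_fit <- cv.tree(fit)
best_size <- cv_fit$size[which.min(cv_fit$dev)]

# Prune back to the chosen size and plot the result
pruned <- prune.tree(fit, best = best_size)
plot(pruned); text(pruned, pretty = 0)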