1. Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of p̂_m1. The x-axis should display p̂_m1, ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy. Hint: In a setting with two classes, p̂_m1 = 1 − p̂_m2. You could make this plot by hand, but it will be much easier to make in R.
# Grid of p_m1 values; 0 and 1 are excluded so that log(0) does not produce NaN
df <- data.frame(pm1 = seq(0.01, 0.99, by = 0.01))

# Gini index: 2 * p_m1 * (1 - p_m1)
df$Gini <- 2 * df$pm1 * (1 - df$pm1)

# Cross-entropy (natural log, so its maximum is log(2), about 0.693, at p_m1 = 0.5)
df$Entropy <- -df$pm1 * log(df$pm1) - (1 - df$pm1) * log(1 - df$pm1)

# Classification error: 1 - max(p_m1, 1 - p_m1)
df$Error <- 1 - pmax(df$pm1, 1 - df$pm1)

library(ggplot2)
ggplot(df, aes(x = pm1)) +
  geom_line(aes(y = Gini, color = "Gini Index")) +
  geom_line(aes(y = Entropy, color = "Entropy")) +
  geom_line(aes(y = Error, color = "Classification Error")) +
  labs(
    title = "Impurity Measures vs. Class Probability",
    x = expression(hat(p)[m1]),
    y = "Measure Value",
    color = "Impurity Metric"
  ) +
  theme_minimal()

We can observe from this plot that classification error is less sensitive to changes in the node probabilities than entropy and the Gini index, which makes it less informative for guiding tree splits. Entropy and the Gini index both reach their maximum at p̂_m1 = 0.5, where the two classes are evenly mixed, and both approach zero as one class dominates, i.e., the node is purest when p̂_m1 is close to 0 or 1.
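
As a hedged illustration of why this matters for splitting (the node counts below are hypothetical and not part of the exercise), two candidate splits of a parent node with 400 observations of each class can have identical classification error even though the Gini index clearly prefers the split that produces a pure child:

# Two-class node impurity as a function of the proportion p of one class
gini  <- function(p) 2 * p * (1 - p)
error <- function(p) 1 - pmax(p, 1 - p)

# Weighted impurity of two child nodes with sizes n1, n2 and proportions p1, p2
weighted <- function(f, n1, p1, n2, p2) (n1 * f(p1) + n2 * f(p2)) / (n1 + n2)

# Hypothetical parent: 400 red, 400 green.
# Split A -> children (300 red, 100 green) and (100 red, 300 green)
# Split B -> children (200 red, 400 green) and (200 red, 0 green)
c(error_A = weighted(error, 400, 300/400, 400, 100/400),  # 0.25
  error_B = weighted(error, 600, 200/600, 200, 200/200))  # 0.25 (tied)
c(gini_A  = weighted(gini, 400, 300/400, 400, 100/400),   # 0.375
  gini_B  = weighted(gini, 600, 200/600, 200, 200/200))   # about 0.333 (prefers B)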

  1. This question relates to the plots in Figure 8.14. (Figure 8.14. Left: a partition of the predictor space corresponding to Exercise 4a. Right: a tree corresponding to Exercise 4b.)
  1. Sketch the tree corresponding to the partition of the predictor space illustrated in the left-hand panel of Figure 8.14. The numbers inside the boxes indicate the mean of Y within each region.

A: See the accompanying sketch of the decision tree for this partition.

  1. Create a diagram similar to the left-hand panel of Figure 8.14, using the tree illustrated in the right-hand panel of the same figure. You should divide up the predictor space into the correct regions, and indicate the mean for each region.
ggplot() +
  xlim(-1, 2) + ylim(0, 3) +
  geom_segment(aes(x = -1, xend = 2, y = 1, yend = 1)) + # Horizontal boundary at X2 = 1 (all X1)
  geom_segment(aes(x = 1, xend = 1, y = 0, yend = 1)) +  # Vertical boundary at X1 = 1, for X2 < 1
  geom_segment(aes(x = 0, xend = 0, y = 1, yend = 2)) +  # Vertical boundary at X1 = 0, for 1 <= X2 < 2
  geom_segment(aes(x = -1, xend = 2, y = 2, yend = 2)) + # Horizontal boundary at X2 = 2 (all X1)
  geom_text(aes(x = 0, y = 0.5, label = "-1.80")) +
  geom_text(aes(x = 1.5, y = 0.5, label = "0.63")) +
  geom_text(aes(x = -0.5, y = 1.5, label = "-1.06")) +
  geom_text(aes(x = 1, y = 1.5, label = "0.21")) +
  geom_text(aes(x = 0.5, y = 2.5, label = "2.49")) +
  xlab(expression(X[1])) + ylab(expression(X[2])) +
  theme_minimal()
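
To make the correspondence with the tree explicit, here is a small sketch (the function name predict_region is my own, not part of the exercise) that encodes the same regions as nested splits and returns the mean shown in each box:

# Nested splits matching the boundaries drawn above (X2 = 1, X2 = 2, X1 = 1, X1 = 0)
predict_region <- function(x1, x2) {
  if (x2 < 1) {
    if (x1 < 1) -1.80 else 0.63
  } else if (x2 < 2) {
    if (x1 < 0) -1.06 else 0.21
  } else {
    2.49
  }
}

predict_region(0.5, 0.5)  # lower-left region: -1.80
predict_region(0.5, 2.5)  # top region: 2.49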

  1. Suppose we produce ten bootstrapped samples from a data set containing red and green classes. We then apply a classification tree to each bootstrapped sample and, for a specific value of X, produce 10 estimates of P(Class is Red | X): 0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, and 0.75. There are two common ways to combine these results together into a single class prediction. One is the majority vote approach discussed in this chapter. The second approach is to classify based on the average probability. In this example, what is the final classification under each of these two approaches? A: We have 10 bootstrapped estimates of P(Red | X):

0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75

Under the majority vote approach, each tree's prediction is classified as Red when its estimated probability exceeds 0.5. Here, 6 out of 10 trees favor Red, so the final classification is Red. Now consider the second approach, based on the average probability:

mean(c(0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75))  
## [1] 0.45

Since the average probability is below 0.5, the final classification is Green. This shows how different ensemble strategies can lead to different classifications.
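
The short sketch below (variable names are my own) carries out both combination rules side by side:

probs <- c(0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75)

# Majority vote: each tree votes Red when its estimated probability exceeds 0.5
sum(probs > 0.5)                                              # 6 of 10 votes for Red
ifelse(sum(probs > 0.5) > length(probs) / 2, "Red", "Green")  # "Red"

# Average probability
ifelse(mean(probs) > 0.5, "Red", "Green")                     # "Green" (the mean is 0.45)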

  1. Provide a detailed explanation of the algorithm that is used to fit a regression tree.

A: The process of building a regression tree includes growing and pruning:

Step 1: Grow a large tree. Starting with the entire data set, use recursive binary splitting: at each node, choose the predictor and cutpoint whose split gives the greatest reduction in the residual sum of squares (RSS), and stop splitting once a minimum node size is reached.

Step 2: Apply cost-complexity pruning. Introduce a penalty parameter alpha on the number of terminal nodes. At each pruning step, collapse the internal node whose removal causes the smallest increase in RSS, which produces a sequence of candidate subtrees of decreasing size (one for each value of alpha).

Step 3: Use cross-validation to select the optimal tree. Perform K-fold CV; for each fold, grow the large tree on the training portion, prune it to generate the sequence of subtrees, and compute the test MSE on the held-out fold. Average the MSEs across folds for each tree size (equivalently, each alpha).

Step 4: Choose the tree size (value of alpha) with the lowest cross-validated error and return the corresponding subtree fit on the full data set.

Step 5: Prediction. The final tree can now be used for regression on new data: an observation's prediction is the mean of the training responses in the terminal node into which it falls.
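
As a minimal sketch of this workflow in R (assuming the tree and MASS packages are available and using the Boston housing data purely as an illustration; none of these choices come from the exercise):

library(tree)  # provides tree(), cv.tree(), prune.tree()
library(MASS)  # Boston data set, used only for illustration

set.seed(1)
train <- sample(seq_len(nrow(Boston)), nrow(Boston) / 2)

# Step 1: grow a large tree, with splits chosen to minimize RSS
big_tree <- tree(medv ~ ., data = Boston, subset = train)

# Steps 2-3: cost-complexity pruning evaluated by cross-validation
cv_out    <- cv.tree(big_tree)             # deviance (RSS) for each subtree size
best_size <- cv_out$size[which.min(cv_out$dev)]

# Step 4: prune back to the size with the lowest CV error
final_tree <- prune.tree(big_tree, best = best_size)

# Step 5: predict on held-out data and compute the test MSE
preds <- predict(final_tree, newdata = Boston[-train, ])
mean((preds - Boston$medv[-train])^2)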