Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of \(\hat{p}_{m1}\). The x-axis should display \(\hat{p}_{m1}\), ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy.
Hint: In a setting with two classes, \(\hat{p}_{m1} = 1 - \hat{p}_{m2}\). You could make this plot by hand, but it will be much easier to make in R.
# Sequence of probabilities from 0 to 1
p <- seq(0, 1, length = 100)
# Gini index
gini <- 2 * p * (1 - p)
# Classification error
classification_error <- 1 - pmax(p, 1 - p)
# Entropy (with handling for log(0))
entropy <- -p * log2(p) - (1 - p) * log2(1 - p)
entropy[is.nan(entropy)] <- 0 # Replace NaN with 0 at p = 0 or 1
# Plot all three
plot(p, gini, type = "l", col = "blue", ylim = c(0, 1),
ylab = "Impurity", xlab = expression(hat(p)[m1]),
main = "Gini, Entropy, and Classification Error vs. p-hat")
lines(p, entropy, col = "red", lty = 2)
lines(p, classification_error, col = "darkgreen", lty = 3)
legend("top", legend = c("Gini", "Entropy", "Classification Error"),
col = c("blue", "red", "darkgreen"), lty = c(1, 2, 3))
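As a quick check (not part of the original question), all three measures can be evaluated at \(\hat{p}_{m1} = 0.5\), where each curve reaches its maximum:
# Values at p = 0.5: Gini = 0.5, classification error = 0.5, entropy = 1
p0 <- 0.5
c(gini = 2 * p0 * (1 - p0),
  error = 1 - pmax(p0, 1 - p0),
  entropy = -p0 * log2(p0) - (1 - p0) * log2(1 - p0))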
This question relates to the plots in Figure 8.14.
Sketch the tree corresponding to the partition of the predictor space illustrated in the left-hand panel of Figure 8.14. The numbers inside the boxes indicate the mean of Y within each region.
# Plot settings
plot(NULL, xlim = c(0, 1), ylim = c(0, 1),
     xlab = expression(X[1]), ylab = expression(X[2]),
     main = "Exercise 4(a) - Partition to Tree")
rect(0, 0, 0.5, 0.5) # region with mean 3
rect(0, 0.5, 0.5, 1) # region with mean 15
rect(0.5, 0, 0.75, 0.5) # region with mean 10
rect(0.75, 0, 1, 0.5) # region with mean 0
rect(0.5, 0.5, 1, 1) # region with mean 5
# Add region means
text(0.25, 0.25, "3")
text(0.25, 0.75, "15")
text(0.625, 0.25, "10")
text(0.875, 0.25, "0")
text(0.75, 0.75, "5")
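For completeness, the corresponding tree can also be sketched in base R. The layout below is only a rough drawing: the split values (X1 = 0.5, X2 = 0.5, X1 = 0.75) are taken from the rect() calls above, which are themselves approximate placements rather than values read off Figure 8.14. By convention, the left branch is taken when the split condition is TRUE.
# Hand-drawn tree for the partition above (coordinates chosen for layout only)
plot(NULL, xlim = c(0, 10), ylim = c(0, 10), axes = FALSE, xlab = "", ylab = "",
     main = "Tree corresponding to the partition")
# Root: split on X1 < 0.5
segments(5, 9, c(2.5, 7.5), 7)
text(5, 9.4, expression(X[1] < 0.5))
# Left branch (X1 < 0.5): split on X2 < 0.5 into leaves 3 and 15
segments(2.5, 7, c(1.5, 3.5), 5)
text(2.5, 7.4, expression(X[2] < 0.5))
text(c(1.5, 3.5), 4.6, c("3", "15"))
# Right branch (X1 >= 0.5): split on X2 < 0.5; the FALSE side is leaf 5
segments(7.5, 7, c(6, 9), 5)
text(7.5, 7.4, expression(X[2] < 0.5))
text(9, 4.6, "5")
# The TRUE side splits again on X1 < 0.75 into leaves 10 and 0
segments(6, 5, c(5, 7), 3)
text(6, 5.4, expression(X[1] < 0.75))
text(c(5, 7), 2.6, c("10", "0"))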
Create a diagram similar to the left-hand panel of Figure 8.14, using the tree illustrated in the right-hand panel of the same figure. You should divide up the predictor space into the correct regions, and indicate the mean for each region.
# The rpart and rpart.plot packages are used below
library(rpart)
library(rpart.plot)
# Simulate predictors covering the split points of the right-hand tree
# (the ranges are assumptions; Figure 8.14 does not fix them)
set.seed(42)
X1 <- runif(500, -1, 2)
X2 <- runif(500, 0, 3)
# Create Y based on the tree splits:
# If X2 < 1: if X1 < 1 then -1.80, else 0.63
# Else: if X2 < 2: if X1 < 0 then -1.06, else 0.21
#       else 2.49
Y <- ifelse(X2 < 1,
            ifelse(X1 < 1, -1.80, 0.63),
            ifelse(X2 < 2,
                   ifelse(X1 < 0, -1.06, 0.21),
                   2.49))
data <- data.frame(X1 = X1, X2 = X2, Y = Y)
# Fit a regression tree, which should recover the same splits
tree_model <- rpart(Y ~ X1 + X2, data = data, method = "anova",
                    control = rpart.control(cp = 0.001, minsplit = 2))
# Plot the tree
rpart.plot(tree_model, type = 3, fallen.leaves = TRUE, roundint = FALSE,
main = "Exercise 4(b): Tree Diagram")
Suppose we produce ten bootstrapped samples from a data set containing red and green classes. We then apply a classification tree to each bootstrapped sample and, for a specific value of X, produce 10 estimates of P(Class is Red|X):
0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, and 0.75.
There are two common ways to combine these results together into a single class prediction. One is the majority vote approach discussed in this chapter. The second approach is to classify based on the average probability. In this example, what is the final classification under each of these two approaches?
# Given bootstrap estimates of P(Class = Red | X)
probs <- c(0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75)
# Approach 1: Classification by average probability
avg_prob <- mean(probs)
class_avg <- ifelse(avg_prob > 0.5, "Red", "Green")
# Approach 2: Classification by majority vote
# Each tree votes "Red" if P(Red | X) > 0.5
votes_red <- sum(probs > 0.5)
votes_green <- length(probs) - votes_red
class_majority <- ifelse(votes_red > votes_green, "Red", "Green")
# Report the results
cat("Average Probability:", avg_prob, "\n")
cat("Classification (Average Probability):", class_avg, "\n")
cat("Votes for Red:", votes_red, "\n")
cat("Votes for Green:", votes_green, "\n")
cat("Classification (Majority Vote):", class_majority, "\n")
## Average Probability: 0.45
## Classification (Average Probability): Green
## Votes for Red: 6
## Votes for Green: 4
## Classification (Majority Vote): Red
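The two approaches disagree here: the four small estimates pull the average probability down to 0.45, which is below 0.5 and hence classified as Green, while six of the ten trees individually assign more than a 0.5 probability to Red, so the majority vote is Red. This shows that averaging probabilities and majority voting can give different answers for the same ensemble.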
Provide a detailed explanation of the algorithm that is used to fit a regression tree.
A regression tree is a type of decision tree used when the response variable is continuous. It recursively partitions the predictor space and fits a simple model (typically the mean) in each region.
We begin with a dataset containing predictors \(X_1, X_2, \ldots, X_p\) and a continuous response \(Y\). The goal is to divide the space into regions \(R_1, R_2, \ldots, R_M\), each with a predicted value \(\hat{y}_m\) equal to the mean of \(Y\) in that region.
The tree is grown by recursive binary splitting. At each step, for every predictor \(X_j\) and every possible cutpoint \(s\), we consider the pair of regions \(R_1(j, s) = \{X \mid X_j < s\}\) and \(R_2(j, s) = \{X \mid X_j \ge s\}\), and choose the \(j\) and \(s\) that minimize the total sum of squared residuals \[ \sum_{i:\, x_i \in R_1(j,s)} (y_i - \hat{y}_{R_1})^2 + \sum_{i:\, x_i \in R_2(j,s)} (y_i - \hat{y}_{R_2})^2 . \]
This is a greedy algorithm: it chooses the best split locally, not globally. Splitting is repeated within each resulting region until a stopping rule is met, such as a minimum number of observations per node (the minsplit control in rpart).
Each terminal node (region \(R_m\)) then receives a prediction \[ \hat{y}_m = \frac{1}{|R_m|} \sum_{x_i \in R_m} y_i, \] that is, the mean of \(Y\) in that region.
Fully grown trees tend to overfit. To avoid this, we use cost-complexity pruning, which selects the subtree \(T\) minimizing \[ C_\alpha(T) = \sum_{m=1}^{|T|} \sum_{x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T|, \] where \(|T|\) is the number of terminal nodes and \(\alpha \ge 0\) penalizes tree size.
Cross-validation is used to choose the optimal \(\alpha\).
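As a rough illustration of this grow-then-prune workflow in R with the rpart package (the data frame data from Exercise 4(b) is reused here purely as a stand-in for a real dataset):
library(rpart)
# Grow a deliberately large tree (small cp), then apply cost-complexity pruning
big_tree <- rpart(Y ~ X1 + X2, data = data, method = "anova",
                  control = rpart.control(cp = 0.0001, minsplit = 2))
# Cross-validated error for each value of the complexity parameter cp
printcp(big_tree)
# Choose the cp with the smallest cross-validated error (xerror) ...
best_cp <- big_tree$cptable[which.min(big_tree$cptable[, "xerror"]), "CP"]
# ... and prune back to the corresponding subtree
pruned_tree <- prune(big_tree, cp = best_cp)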
| Step | Description |
|---|---|
| 1. Initialize | Start with the full dataset |
| 2. Split | Find the best predictor and split point to minimize the SSR |
| 3. Recurse | Apply the same process to the resulting nodes |
| 4. Stop | Stop based on minimum node size, depth, or improvement |
| 5. Predict | Assign the mean response of each region as its prediction |
| 6. Prune (optional) | Reduce overfitting using cost-complexity pruning |
Regression trees are easy to interpret and useful for capturing nonlinear relationships — especially when combined into ensembles like bagging or boosting.