# Define the range of p^m1 from 0 to 1
p_m1 <- seq(0, 1, by = 0.01)
# Calculate p^m2 using the relationship p^m1 + p^m2 = 1
p_m2 <- 1 - p_m1
# Calculate Gini index
gini <- 2 * p_m1 * p_m2
# Calculate classification error: 1 minus the proportion of the majority class
classification_error <- 1 - pmax(p_m1, p_m2)
# Calculate entropy, treating 0 * log2(0) as 0 at the endpoints
entropy <- ifelse(p_m1 == 0 | p_m1 == 1, 0, -p_m1 * log2(p_m1) - p_m2 * log2(p_m2))
# Plotting
plot(p_m1, gini, type = "l", col = "blue", ylim = c(0, 1), xlab = "p^m1", ylab = "Value", main = "Comparison of Gini Index, Classification Error, and Entropy")
lines(p_m1, classification_error, type = "l", col = "red")
lines(p_m1, entropy, type = "l", col = "green")
legend("topright", legend = c("Gini Index", "Classification Error", "Entropy"), col = c("blue", "red", "green"), lty = 1)
library(ggplot2)
# Create an empty ggplot object
plot <- ggplot() +
xlim(0, 1) + ylim(0, 1) + # Set plot limits
theme_void() # Remove axis labels, ticks, and gridlines
# Add segments representing decision boundaries
plot <- plot +
geom_segment(aes(x = 0.5, xend = 0.5, y = 0.95, yend = 0.85)) +
geom_segment(aes(x = 0.5, xend = 0.3, y = 0.85, yend = 0.85)) +
geom_segment(aes(x = 0.5, xend = 0.7, y = 0.85, yend = 0.85)) +
geom_segment(aes(x = 0.7, xend = 0.7, y = 0.85, yend = 0.75)) +
geom_segment(aes(x = 0.3, xend = 0.3, y = 0.85, yend = 0.75)) +
geom_segment(aes(x = 0.15, xend = 0.3, y = 0.75, yend = 0.75)) +
geom_segment(aes(x = 0.3, xend = 0.45, y = 0.75, yend = 0.75)) +
geom_segment(aes(x = 0.15, xend = 0.15, y = 0.75, yend = 0.6)) +
geom_segment(aes(x = 0.45, xend = 0.45, y = 0.75, yend = 0.6)) +
geom_segment(aes(x = 0.15, xend = 0.10, y = 0.6, yend = 0.6)) +
geom_segment(aes(x = 0.15, xend = 0.2, y = 0.6, yend = 0.6)) +
geom_segment(aes(x = 0.1, xend = 0.1, y = 0.6, yend = 0.45)) +
geom_segment(aes(x = 0.2, xend = 0.2, y = 0.6, yend = 0.45)) +
geom_segment(aes(x = 0.1, xend = 0.05, y = 0.45, yend = 0.45)) +
geom_segment(aes(x = 0.1, xend = 0.15, y = 0.45, yend = 0.45)) +
geom_segment(aes(x = 0.05, xend = 0.05, y = 0.45, yend = 0.30)) +
geom_segment(aes(x = 0.15, xend = 0.15, y = 0.45, yend = 0.30))
# Add text labels for decision rules and leaf values
plot <- plot +
geom_text(aes(label = "X[1] < 1", x = 0.45, y = 0.875), parse = TRUE) +
geom_text(aes(label = "X[2] < 1", x = 0.25, y = 0.775), parse = TRUE) +
geom_text(aes(label = "X[1] > 0", x = 0.1, y = 0.625), parse = TRUE) +
geom_text(aes(label = "X[2] > 0", x = 0.05, y = 0.475), parse = TRUE) +
geom_text(aes(label = "5", x = 0.7, y = 0.7)) +
geom_text(aes(label = "15", x = 0.45, y = 0.55)) +
geom_text(aes(label = "3", x = 0.2, y = 0.4)) +
geom_text(aes(label = "0", x = 0.05, y = 0.25)) +
geom_text(aes(label = "10", x = 0.15, y = 0.25))
# Display the plot
print(plot)
library(ggplot2)
# Create an empty ggplot object
plot <- ggplot() +
xlim(-1, 2) + ylim(0, 3) + # Set plot limits
theme(panel.background = element_blank(), # Remove panel background
panel.border = element_rect(colour = "black", fill=NA), # Set panel border
axis.title = element_text(size = 12), # Set axis title size
axis.text = element_text(size = 10)) # Set axis text size
# Add line segments representing splits
plot <- plot +
geom_segment(aes(x = -1, xend = 2, y = 1, yend = 1)) +
geom_segment(aes(x = 1, xend = 1, y = 0, yend = 1)) +
geom_segment(aes(x = -1, xend = 2, y = 2, yend = 2)) +
geom_segment(aes(x = 0, xend = 0, y = 2, yend = 1))
# Add text labels for split points
plot <- plot +
geom_text(aes(label = "-1.80", x = 0, y = 0.5)) +
geom_text(aes(label = "0.63", x = 1.5, y = 0.5)) +
geom_text(aes(label = "2.49", x = 0.5, y = 2.5)) +
geom_text(aes(label = "-1.06", x = -0.5, y = 1.5)) +
geom_text(aes(label = "0.21", x = 1, y = 1.5))
# Add axis labels
plot <- plot +
ylab(expression(X[2])) +
xlab(expression(X[1]))
# Display the plot
print(plot)
Under the majority vote approach, we would classify based on the class that receives the most votes among the ten estimates of P(Class is Red|X).
Given the following ten estimates: 0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, and 0.75.
Each estimate votes Red if it exceeds 0.5 and Green otherwise, so the vote counts are:
Votes for Red: 6 (0.55, 0.6, 0.6, 0.65, 0.7, 0.75). Votes for Green: 4 (0.1, 0.15, 0.2, 0.2). Since a majority of the estimates exceed 0.5, the majority vote approach would classify the observation as Red.
Under the average probability approach, we would calculate the average of the ten estimates and then classify based on whether this average probability is greater than or equal to 0.5.
The average probability is: (0.1 + 0.15 + 0.2 + 0.2 + 0.55 + 0.6 + 0.6 + 0.65 + 0.7 + 0.75) / 10 = 0.45
Since the average probability is less than 0.5, the average probability approach would classify the observation as Green.
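As a quick sanity check, both rules can be reproduced in a few lines of R; the vector p below simply holds the ten estimates listed above.
# The ten bootstrapped estimates of P(Class is Red | X) listed above
p <- c(0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75)
sum(p > 0.5)  # 6 of 10 estimates exceed 0.5 -> majority vote classifies Red
mean(p)       # 0.45 < 0.5 -> average probability classifies Green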
Start with the Root Node: At the beginning, the entire dataset is treated as a single region or node. This node is the root of the tree.
Splitting the Node: The dataset is split into two subsets based on the feature and cutpoint that give the best binary split. The goal is to minimize the variance (equivalently, the residual sum of squares) of the target variable within each subset. The splitting process considers every possible cutpoint on every feature and selects the one that maximally reduces this quantity (a minimal sketch of this search appears after these steps).
Recursive Splitting: After the initial split, the same process of splitting is applied recursively to each of the resulting subsets. This process continues until a stopping criterion is reached, such as a maximum tree depth, minimum number of samples in a node, or inability to find a split that reduces the variance.
Stopping Criteria: There are several stopping criteria to prevent overfitting and control the size of the tree. Some common stopping criteria include:
Maximum depth of the tree: Limiting the depth of the tree helps prevent overfitting.
Minimum samples per leaf node: If the number of samples in a node falls below a certain threshold, further splitting is not allowed.
Minimum improvement in variance: If a split does not lead to a sufficient reduction in the variance of the target variable, it is not considered.
Prediction at Terminal Nodes (Leaves): Once the tree is fully grown according to the stopping criteria, each terminal node (also called a leaf node) contains a subset of the data. The prediction for a new data point is made by taking the average (or weighted average) of the target variable within the leaf node that the data point falls into.
Tree Pruning (Optional): After the tree is grown, pruning techniques (most commonly cost-complexity pruning) can be applied to remove splits that do not significantly improve the tree's predictive performance. Pruning helps prevent overfitting and can lead to simpler, more interpretable trees.
Output: The output of the regression tree algorithm is a binary tree structure where each internal node represents a decision based on a feature and each leaf node represents a prediction.
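As referenced in the splitting step above, the following is a minimal illustrative sketch (not the textbook's code) of the greedy search for a single best cutpoint on one numeric predictor; the function name best_split and the toy data are purely hypothetical.
# Illustrative sketch only: greedy search for the best split of one numeric
# predictor x against a numeric response y, scoring each candidate cutpoint
# by the total residual sum of squares (RSS) within the two resulting nodes
best_split <- function(x, y) {
  xs <- sort(unique(x))
  # Candidate cutpoints are midpoints between consecutive observed values
  candidates <- (head(xs, -1) + tail(xs, -1)) / 2
  rss <- sapply(candidates, function(s) {
    left <- y[x < s]
    right <- y[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(cutpoint = candidates[which.min(rss)], rss = min(rss))
}
# Toy data: the response jumps at x = 0.4, so the chosen cutpoint should land near 0.4
set.seed(1)
x <- runif(50)
y <- ifelse(x < 0.4, 2, 5) + rnorm(50, sd = 0.3)
best_split(x, y)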
In summary, the regression tree algorithm recursively partitions the feature space into regions, making predictions based on the mean of the target variable within each region. It aims to find the splits that minimize the variance of the target variable, resulting in a tree that captures the underlying patterns in the data.
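For completeness, here is a short illustration of the full workflow using the rpart package (assuming rpart and MASS are installed; this is a sketch, not the only way to fit such a tree). The control arguments correspond to the stopping criteria described above, and prune() applies cost-complexity pruning.
# Assumes the rpart and MASS packages are installed
library(rpart)
# Grow a regression tree for median house value on the Boston housing data;
# maxdepth, minbucket, and cp encode the stopping criteria discussed above
fit <- rpart(medv ~ ., data = MASS::Boston, method = "anova",
             control = rpart.control(maxdepth = 5, minbucket = 10, cp = 0.01))
printcp(fit)  # cross-validated error for each candidate subtree
# Prune back to the subtree with the smallest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best_cp)
# Predictions are the mean response of the leaf each observation falls into
predict(pruned, newdata = MASS::Boston[1:5, ])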