library(tidyverse)
library(ISLR2)
library(tree)
library(randomForest)
library(BART)
library(caret)
Tree Based Methods
3
Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of \(\hat{p}_{m1}\). The x-axis should display \(\hat{p}_{m1}\), ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy.
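For two classes, with \(\hat{p}_{m2} = 1 - \hat{p}_{m1}\), the quantities computed in the code below are

\[
E = 1 - \max(\hat{p}_{m1}, \hat{p}_{m2}), \qquad
G = \hat{p}_{m1}(1 - \hat{p}_{m1}) + \hat{p}_{m2}(1 - \hat{p}_{m2}), \qquad
D = -\hat{p}_{m1}\log\hat{p}_{m1} - \hat{p}_{m2}\log\hat{p}_{m2}.
\]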
A <- seq(0, 1, .01)                 # candidate values of p-hat_m1
B <- 1 - A                          # p-hat_m2 = 1 - p-hat_m1
error <- 1 - pmax(A, B)             # classification error
gini <- A * (1 - A) + B * (1 - B)   # Gini index
entropy <- -A * log(A) - B * log(B)
entropy[is.nan(entropy)] <- 0       # treat 0 * log(0) as 0 at the endpoints
df <- cbind(A,error,gini,entropy) |>
as_tibble()
df |>
ggplot(aes(x = A)) +
geom_line(aes(y = error, color = "Error"), linewidth = 1.2) +
geom_line(aes(y = gini, color = "Gini"), linewidth = 1.2) +
geom_line(aes(y = entropy, color = "Entropy"), linewidth = 1.2) +
labs(x = "Probability of Class A", y = "Value") +
scale_color_manual(name = "", values = c("Error" = "orange", "Gini" = "pink", "Entropy" = "purple"))
8A
Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable. Split the data set into a training set and a test set.
data("Carseats")
Carseats <- as_tibble(Carseats)
# 80/20 train/test split (no seed is set, so exact results vary between runs)
idx <- sample(nrow(Carseats), nrow(Carseats) * .8, replace = FALSE)
train <- Carseats[idx,]
test <- Carseats[-idx,]
8B
Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?
car.tree <- tree::tree(Sales ~ . , train)
plot(car.tree)
text(car.tree, pretty = 0, cex = 0.6)
The plot shows the fitted decision splits. The tree has 17 terminal nodes, each displaying the average value of Sales within that region of the predictor space.
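For a quick numeric summary of the fit (variables actually used, number of terminal nodes, residual mean deviance), the tree package's summary method can be called on the fitted object; a minimal sketch:
summary(car.tree)  # text summary of the regression tree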
car.preds <- predict(car.tree, newdata = test)
mse <- mean((car.preds - test$Sales)^2)
paste(c("MSE:",mse))[1] "MSE:" "4.51748396042299"
8C
Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?
car.tree.cv <- cv.tree(car.tree)
plot(car.tree.cv$size, car.tree.cv$dev, type = "b")
df <- cbind(car.tree.cv$size, car.tree.cv$dev) |>
as.data.frame()
colnames(df) <- c("size","dev")
df |>
filter(dev == min(dev))
  size      dev
1   12 1408.359
We can see that the tree size with the lowest CV deviance is 12; below we prune to a nearby size of 11 terminal nodes.
prune.car.tree <- prune.tree(car.tree, best = 11)
car.preds <- predict(prune.car.tree, newdata = test)
mse <- mean((car.preds - test$Sales)^2)
paste(c("MSE:",mse))[1] "MSE:" "4.31493839066439"
In this instance, pruning slightly decreases the test MSE (from about 4.52 to about 4.31).
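For reference, the pruned tree can be plotted the same way as the unpruned one; a minimal sketch using the prune.car.tree object above:
plot(prune.car.tree)
text(prune.car.tree, pretty = 0, cex = 0.6)  # label splits and leaf means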
8D
Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important.
Here, we use the randomForest() function to perform bagging: a random forest is identical to bagging when mtry, the number of predictors considered at each split, equals the full number of predictors.
car.random <- randomForest(formula = Sales ~ ., data = Carseats, mtry = ncol(Carseats) - 1,
importance = TRUE, subset = idx)
car.preds <- predict(car.random, newdata = test)
mse <- mean((car.preds - test$Sales)^2)
paste(c("MSE:",mse))[1] "MSE:" "2.50997131340607"
Bagging reduces the test MSE to about 2.51.
importance(car.random)
               %IncMSE IncNodePurity
CompPrice   32.0652336     238.68276
Income       8.5529854     127.64685
Advertising 25.5721610     199.23256
Population  -1.7227863      82.87033
Price       69.8381400     734.11747
ShelveLoc   77.5298259     757.82306
Age         24.8215568     246.10722
Education    2.9018100      75.67995
Urban       -0.7037512      11.55332
US           4.0235057      20.32574
Based on %IncMSE and IncNodePurity, ShelveLoc and Price are the most important variables.
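The same ranking can be checked visually with randomForest's built-in importance plot; a minimal sketch using the fitted car.random object above:
varImpPlot(car.random, main = "Variable importance (bagging)")  # dot plots of %IncMSE and IncNodePurity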
8E
Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.
car.random <- randomForest(formula = Sales ~ ., data = Carseats, importance = TRUE, subset = idx)
car.preds <- predict(car.random, newdata = test)
mse <- mean((car.preds - test$Sales)^2)
paste(c("MSE:",mse))[1] "MSE:" "2.99029532767346"
Compared to bagging, the random forest performs slightly worse here, with a test MSE of about 2.99. When growing each tree, a random forest considers only a random subset of \(m\) of the \(p\) predictors at each split; randomForest() defaults to \(m \approx p/3\) for regression (and \(m \approx \sqrt{p}\) for classification). The effect of varying \(m\) is sketched after the importance table below.
importance(car.random)
               %IncMSE IncNodePurity
CompPrice   16.5296908     227.60906
Income       4.3387596     178.91001
Advertising 20.4895341     217.05455
Population  -0.3604669     151.03092
Price       45.4410721     596.73902
ShelveLoc   50.8324689     575.28036
Age         21.7056302     286.93399
Education    4.9649687     108.42047
Urban       -0.9810651      18.98387
US           7.3233199      53.98749
Again, ShelveLoc and Price are the most important predictors.
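To illustrate the effect of \(m\) directly, the forest can be refit over a grid of mtry values while tracking the test MSE. A minimal sketch, assuming the idx/test split above (exact numbers will vary from run to run):
p <- ncol(Carseats) - 1  # number of predictors
mse_by_m <- sapply(seq_len(p), function(m) {
  fit <- randomForest(Sales ~ ., data = Carseats, mtry = m, subset = idx)
  mean((predict(fit, newdata = test) - test$Sales)^2)
})
plot(seq_len(p), mse_by_m, type = "b", xlab = "mtry (m)", ylab = "Test MSE")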
9A
data(OJ)
idx <- sample(nrow(OJ), 800, replace = FALSE)
train <- OJ[idx,]
test <- OJ[-idx,]
9B
oj.tree <- tree(Purchase ~ . , data = train)
summary(oj.tree)
Classification tree:
tree(formula = Purchase ~ ., data = train)
Variables actually used in tree construction:
[1] "LoyalCH" "SalePriceMM" "SpecialCH" "ListPriceDiff"
[5] "PriceDiff"
Number of terminal nodes: 8
Residual mean deviance: 0.7444 = 589.5 / 792
Misclassification error rate: 0.1575 = 126 / 800
There are 8 terminal nodes, with a training misclassification error rate of 0.1575 (126/800).
9C
oj.tree
node), split, n, deviance, yval, (yprob)
* denotes terminal node
 1) root 800 1069.000 CH ( 0.61125 0.38875 )
   2) LoyalCH < 0.5036 346 404.800 MM ( 0.27168 0.72832 )
     4) LoyalCH < 0.280875 162 108.800 MM ( 0.10494 0.89506 ) *
     5) LoyalCH > 0.280875 184 250.200 MM ( 0.41848 0.58152 )
      10) SalePriceMM < 2.04 96 105.700 MM ( 0.23958 0.76042 )
        20) SpecialCH < 0.5 75 65.950 MM ( 0.16000 0.84000 ) *
        21) SpecialCH > 0.5 21 29.060 CH ( 0.52381 0.47619 ) *
      11) SalePriceMM > 2.04 88 117.400 CH ( 0.61364 0.38636 ) *
   3) LoyalCH > 0.5036 454 350.800 CH ( 0.87004 0.12996 )
     6) LoyalCH < 0.764572 183 208.500 CH ( 0.74317 0.25683 )
      12) ListPriceDiff < 0.18 53 73.300 MM ( 0.47170 0.52830 ) *
      13) ListPriceDiff > 0.18 130 108.200 CH ( 0.85385 0.14615 )
        26) PriceDiff < -0.165 5 5.004 MM ( 0.20000 0.80000 ) *
        27) PriceDiff > -0.165 125 91.730 CH ( 0.88000 0.12000 ) *
     7) LoyalCH > 0.764572 271 98.270 CH ( 0.95572 0.04428 ) *
Taking terminal node 4 as an example: if LoyalCH is less than 0.5036 and then also less than 0.280875, the tree predicts MM; this node contains 162 training observations, about 89.5% of which purchased MM.
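The same information can be pulled out programmatically from the fitted object; a minimal sketch listing the terminal nodes stored in the tree's frame component:
# Terminal (leaf) nodes: sample size, deviance, and predicted class
oj.tree$frame[oj.tree$frame$var == "<leaf>", c("n", "dev", "yval")]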
9D
plot(oj.tree)
text(oj.tree, pretty = 0, cex = 0.6)
9E
preds <- predict(oj.tree, newdata = test, type = "class")
confusionMatrix(preds, test$Purchase)
Confusion Matrix and Statistics
          Reference
Prediction  CH  MM
        CH 142  24
        MM  22  82
Accuracy : 0.8296
95% CI : (0.7794, 0.8725)
No Information Rate : 0.6074
P-Value [Acc > NIR] : 2.083e-15
Kappa : 0.6416
Mcnemar's Test P-Value : 0.8828
Sensitivity : 0.8659
Specificity : 0.7736
Pos Pred Value : 0.8554
Neg Pred Value : 0.7885
Prevalence : 0.6074
Detection Rate : 0.5259
Detection Prevalence : 0.6148
Balanced Accuracy : 0.8197
'Positive' Class : CH
The test error rate is 1 - 0.8296, about 17%.
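The same error rate can be computed directly from the predictions, without going through confusionMatrix(); a minimal sketch:
mean(preds != test$Purchase)  # test misclassification rate of the unpruned tree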
9F
oj.tree.cv <- cv.tree(oj.tree)
9G
plot(oj.tree.cv$size, oj.tree.cv$dev, type = "b")
9H
A tree size of 7 gives the lowest cross-validated deviance.
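Since Purchase is categorical, the cross-validation can also be guided by the classification error rate rather than the default deviance; a minimal sketch using the tree package's prune.misclass:
oj.cv.misclass <- cv.tree(oj.tree, FUN = prune.misclass)
plot(oj.cv.misclass$size, oj.cv.misclass$dev, type = "b")  # dev here counts CV misclassifications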
9I
oj.pruned <- prune.tree(oj.tree, best = 7)
9J
The training misclassification error rate for the pruned model is 130/800 (about 16.3%), slightly higher than the unpruned tree's 126/800 (15.75%).
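This can be read off directly from the pruned fit; a minimal sketch (the misclass component of the tree summary holds the misclassification count and the training-set size):
summary(oj.pruned)$misclass  # misclassified count and n on the training data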
9K
preds <- predict(oj.pruned, newdata = test, type = "class")
confusionMatrix(preds, test$Purchase)
Confusion Matrix and Statistics
          Reference
Prediction  CH  MM
        CH 136  22
        MM  28  84
Accuracy : 0.8148
95% CI : (0.7633, 0.8593)
No Information Rate : 0.6074
P-Value [Acc > NIR] : 1.693e-13
Kappa : 0.6156
Mcnemar's Test P-Value : 0.4795
Sensitivity : 0.8293
Specificity : 0.7925
Pos Pred Value : 0.8608
Neg Pred Value : 0.7500
Prevalence : 0.6074
Detection Rate : 0.5037
Detection Prevalence : 0.5852
Balanced Accuracy : 0.8109
'Positive' Class : CH
The pruned tree's test accuracy is about 81.5%, slightly lower than the unpruned tree's 83%, so the test error rate increases slightly (from roughly 17% to roughly 18.5%).
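A compact side-by-side comparison of the two test error rates, assuming the objects fitted above:
c(unpruned = mean(predict(oj.tree, newdata = test, type = "class") != test$Purchase),
  pruned = mean(predict(oj.pruned, newdata = test, type = "class") != test$Purchase))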