Let’s talk briefly about these terms.
Gini Index: given by the equation \(G = \sum_{k=1}^{K}{\hat{p}_{mk}(1-\hat{p}_{mk})}\), this is a measure of the total variance across all K classes. The Gini index is smallest when every \(\hat{p}_{mk}\) is close to either 0 or 1, which is why it is also considered a measure of node purity - a small Gini index means the node contains mostly observations from a single class.
Classification Error: given by the equation \(E = 1 - \max_k(\hat{p}_{mk})\), this is the fraction of the training observations in a region that don't belong to the most common class. This measure isn't sensitive enough to changes in node purity for tree-growing, so it's typically not used for that purpose (instead we use the Gini index or cross-entropy).
Cross-Entropy: given by the equation \(D = -\sum_{k=1}^{K}{\hat{p}_{mk}\log\hat{p}_{mk}}\), this is similar to the Gini index: if each \(\hat{p}_{mk}\) takes on a value near 1 or 0, the cross-entropy will be near 0. In other words, the cross-entropy is near zero when the node is very pure.
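As a quick numeric illustration (a sketch of my own, not part of the original analysis), the hypothetical helper functions below just implement the three formulas above and compare a nearly pure node with a 50/50 node in the two-class case:
gini <- function(p) sum(p * (1 - p))
cross.entropy <- function(p) -sum(ifelse(p > 0, p * log(p), 0)) # treat 0 * log(0) as 0
class.error <- function(p) 1 - max(p)
pure <- c(0.95, 0.05)  # nearly pure node
mixed <- c(0.50, 0.50) # maximally mixed node
c(gini(pure), cross.entropy(pure), class.error(pure))    # all three are small
c(gini(mixed), cross.entropy(mixed), class.error(mixed)) # all three are at their maximum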
Here’s the plot:
p = seq(0, 1, 0.01)
class.error = 1 - pmax(p, 1 - p)
gini.index = 2 * p * (1-p)
Xentropy = - (p * log(p) + (1 - p) * log(1 - p))
matplot(p, cbind(class.error, gini.index, Xentropy), type = "l", lwd = 3,
        col = c("red", "blue", "green"), pch = 16, xlim = c(0, 1.2),
        xlab = "P-hat", ylab = "Resulting Value")
legend("topright",pch=16, title = "Quality Measure", col=c("red", "blue", "green"), legend=c("Classification Error", "Gini Index", "Cross-Entropy"), box.lty = 1)
# Packages used below: ISLR (Carseats and OJ data), caret (createDataPartition),
# tree (tree, cv.tree, prune.tree, prune.misclass), randomForest (randomForest, importance, varImpPlot)
library(ISLR)
library(caret)
library(tree)
library(randomForest)
set.seed(2021)
carseats = Carseats
inTrain <- createDataPartition(carseats$Sales, p = 0.7, list = FALSE, times = 1)
carseats.train <- carseats[inTrain, ]
carseats.test <- carseats[-inTrain, ]
set.seed(2021)
tree <- tree(Sales ~ ., carseats.train)
summary(tree)
##
## Regression tree:
## tree(formula = Sales ~ ., data = carseats.train)
## Variables actually used in tree construction:
## [1] "ShelveLoc" "Price" "Age" "CompPrice" "Income"
## Number of terminal nodes: 16
## Residual mean deviance: 2.476 = 656.1 / 265
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.068000 -1.015000 -0.006471 0.000000 0.913500 4.546000
plot(tree); text(tree, pretty = 0)
The variables used in tree construction are ShelveLoc, Price, Age, CompPrice, and Income. There are 16 terminal nodes, and the residual mean deviance is 2.48. Based on its position at the top of the plot, ShelveLoc is the most important variable (the first split separates the Good shelf locations from the rest), followed by Price.
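To confirm that without squinting at the plot, one option (an extra check of my own, not part of the original workflow) is to look at the first few rows of the fitted object's frame; the rows are ordered from the root down, so the first row's var is the top split:
head(tree$frame[, c("var", "n", "dev")]) # root split variable, node sizes, and deviances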
set.seed(2021)
tree.preds <- predict(tree, newdata = carseats.test)
tree.MSE <- mean((tree.preds - carseats.test$Sales)^2)
cat('The decision tree MSE is: ', tree.MSE)
## The decision tree MSE is: 5.119811
The decision tree MSE is approximately 5.12.
set.seed(2021)
CVtree = cv.tree(tree)
CVtree
## $size
## [1] 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
##
## $dev
## [1] 1198.868 1220.312 1211.394 1207.512 1198.210 1206.164 1237.239 1302.599
## [9] 1422.524 1415.249 1438.069 1425.824 1525.997 1576.075 1817.700 2392.918
##
## $k
## [1] -Inf 23.35243 26.76040 28.34406 28.51928 28.64549 38.54721
## [8] 43.54840 57.44161 60.57303 69.37522 105.02149 134.19967 174.08867
## [15] 251.33441 588.26787
##
## $method
## [1] "deviance"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
plot(CVtree$size, CVtree$dev, type = "b")
min.dev <- which.min(CVtree$dev)
min.dev.val <- CVtree$dev[min.dev]
min.dev.size <- CVtree$size[min.dev]
cat('The model with the minimum dev value is', min.dev.val)
## The model with the minimum dev value is 1198.21
cat('\nThe model size with the minimum dev value is' ,min.dev.size)
##
## The model size with the minimum dev value is 12
Pruning the tree for optimal size:
set.seed(2021)
prunetree = prune.tree(tree, best = min.dev.size)
plot(prunetree)
text(prunetree, pretty = 0)
CVtree.preds <- predict(prunetree, newdata = carseats.test)
CVtree.MSE <- mean((CVtree.preds - carseats.test$Sales)^2)
cat('The CV decision tree MSE is: ', CVtree.MSE)
## The CV decision tree MSE is: 5.231266
The MSE for the CV-pruned decision tree is 5.23, which is higher than the un-pruned decision tree MSE of 5.12.
We typically use randomForest() for bagging applications. Setting mtry = 10 tells it to consider the full set of 10 predictors at every split, which is what makes it bagging rather than a random forest, and we'll use ntree = 1000.
set.seed(2021)
bagmodel <- randomForest(Sales ~ ., carseats.train, mtry = 10, importance = TRUE, ntree = 1000)
bagmodel
##
## Call:
## randomForest(formula = Sales ~ ., data = carseats.train, mtry = 10, importance = TRUE, ntree = 1000)
## Type of random forest: regression
## Number of trees: 1000
## No. of variables tried at each split: 10
##
## Mean of squared residuals: 2.318885
## % Var explained: 71.84
varImpPlot(bagmodel)
The bagged model explains 71.84% of the variance in Sales in the data we fed it. Based on the importance plot, ShelveLoc and Price are the two most important variables.
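To back up the importance plot with numbers, we can print the importance matrix sorted by %IncMSE (a small extra step of my own):
bag.imp <- importance(bagmodel)
bag.imp[order(bag.imp[, "%IncMSE"], decreasing = TRUE), ]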
set.seed(2021)
bagmodel.preds <- predict(bagmodel, newdata = carseats.test)
bagmodel.MSE <- mean((bagmodel.preds - carseats.test$Sales)^2)
cat('The bagged model MSE is ',bagmodel.MSE)
## The bagged model MSE is 2.747382
The bagged model has an MSE of 2.75, which is about half of the MSE from the decision tree tried above. Heck yeah, bagged models! Whoo, random forests!
set.seed(2021)
RF1 <- randomForest(Sales ~ ., carseats.train, mtry=5, ntree = 1000, importance = TRUE)
RF1
##
## Call:
## randomForest(formula = Sales ~ ., data = carseats.train, mtry = 5, ntree = 1000, importance = TRUE)
## Type of random forest: regression
## Number of trees: 1000
## No. of variables tried at each split: 5
##
## Mean of squared residuals: 2.434538
## % Var explained: 70.44
importance(RF1)
## %IncMSE IncNodePurity
## CompPrice 35.2811766 214.84678
## Income 13.6836596 150.61959
## Advertising 21.5526008 164.64507
## Population 0.2415301 107.30226
## Price 79.8599871 621.45955
## ShelveLoc 81.5668189 617.46293
## Age 31.3461235 248.92422
## Education 1.1764927 72.31147
## Urban -1.3948924 14.71352
## US 6.0714316 15.37581
RF1.preds <- predict(RF1, newdata = carseats.test)
RF1.MSE <- mean((RF1.preds - carseats.test$Sales)^2)
cat('Random Forest MSE is ',RF1.MSE)
## Random Forest MSE is 2.613064
A random forest with mtry left at R's default (floor(p/3) = 3 for this regression) explains about 66.3% of the variance in Sales and has a test MSE of 2.77 (run not shown). With mtry = 5, as fit above, the model explains 70.44% of the variance and the test MSE drops to 2.61. The top three variables by importance are ShelveLoc, Price, and Age. There appears to be a sweet spot in mtry where the test MSE is minimized.
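To look for that sweet spot more systematically, here is a sketch of my own (somewhat slow at ntree = 1000) that sweeps mtry from 1 to 10 and records the resulting test MSE:
set.seed(2021)
mtry.mse <- sapply(1:10, function(m) {
  rf <- randomForest(Sales ~ ., data = carseats.train, mtry = m, ntree = 1000)
  mean((predict(rf, newdata = carseats.test) - carseats.test$Sales)^2)
})
plot(1:10, mtry.mse, type = "b", xlab = "mtry", ylab = "Test MSE")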
Now on to the OJ data set, which is part of the ISLR package.
set.seed(2021)
oj <- OJ
inTrain <- createDataPartition(oj$Purchase, p=0.747, list=FALSE, times = 1)
oj.train <- oj[inTrain, ]
oj.test <- oj[-inTrain, ]
dim(oj.train)
## [1] 800 18
dim(oj.test)
## [1] 270 18
The exercise asks for a tree with Purchase as the response and all of the other variables except Buy as predictors, then to use the summary() function to produce summary statistics about the tree and describe the results, including the training error rate and the number of terminal nodes. I'm sorry, Dave, but I can't do that. Why? Because Buy doesn't exist in the OJ dataset. Maybe it did in a previous edition, but not in this one. So you're getting all of the variables as predictors. Deal with it.
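A quick check of my own confirms that:
"Buy" %in% names(oj) # FALSE - OJ has 18 columns: Purchase plus 17 predictors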
oj.tree <- tree(Purchase ~ ., data = oj.train)
summary(oj.tree)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = oj.train)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff"
## Number of terminal nodes: 7
## Residual mean deviance: 0.7709 = 611.3 / 793
## Misclassification error rate: 0.1838 = 147 / 800
This tree uses only two of the predictors, LoyalCH and PriceDiff, and has 7 terminal nodes. The training error rate (misclassification error rate) is 18.4%.
oj.tree
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 1070.000 CH ( 0.61000 0.39000 )
## 2) LoyalCH < 0.50395 352 422.000 MM ( 0.28693 0.71307 )
## 4) LoyalCH < 0.136698 101 45.520 MM ( 0.05941 0.94059 ) *
## 5) LoyalCH > 0.136698 251 333.000 MM ( 0.37849 0.62151 )
## 10) PriceDiff < 0.05 99 87.580 MM ( 0.16162 0.83838 ) *
## 11) PriceDiff > 0.05 152 210.500 CH ( 0.51974 0.48026 ) *
## 3) LoyalCH > 0.50395 448 356.500 CH ( 0.86384 0.13616 )
## 6) LoyalCH < 0.764572 182 212.000 CH ( 0.73077 0.26923 )
## 12) PriceDiff < -0.065 45 60.570 MM ( 0.40000 0.60000 ) *
## 13) PriceDiff > -0.065 137 120.700 CH ( 0.83942 0.16058 )
## 26) PriceDiff < 0.31 94 99.860 CH ( 0.77660 0.22340 ) *
## 27) PriceDiff > 0.31 43 9.499 CH ( 0.97674 0.02326 ) *
## 7) LoyalCH > 0.764572 266 97.820 CH ( 0.95489 0.04511 ) *
Picking terminal node 4. The splitting variable here is LoyalCH, and the split value is approximately 0.14. There are 101 training observations in this node, and its deviance is 45.52. The * marks it as a terminal node, and it predicts Purchase = MM: approximately 6% of the observations in the node are CH and the remaining 94% are MM.
plot(oj.tree); text(oj.tree, pretty = TRUE)
LoyalCH is the most important variable and is used in the first three splits. The second-most important variable is PriceDiff. If LoyalCH < 0.504 and LoyalCH < 0.14, the tree predicts MM. If 0.14 < LoyalCH < 0.504, it becomes a contest of PriceDiff: PriceDiff < 0.05 is predicted to be MM, otherwise CH. On the other side, if LoyalCH > 0.76 the tree predicts CH; otherwise (0.504 < LoyalCH < 0.76) it comes down to PriceDiff again.
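Just to make those rules concrete, here is a hypothetical helper of my own that hand-codes the fitted splits above (it is not produced by the tree package):
oj.hand.rules <- function(LoyalCH, PriceDiff) {
  if (LoyalCH < 0.50395) {
    if (LoyalCH < 0.136698) return("MM")   # node 4
    if (PriceDiff < 0.05) "MM" else "CH"   # nodes 10 / 11
  } else {
    if (LoyalCH > 0.764572) return("CH")   # node 7
    if (PriceDiff < -0.065) "MM" else "CH" # node 12 / nodes 26-27 (both CH)
  }
}
oj.hand.rules(LoyalCH = 0.3, PriceDiff = 0.2) # "CH", following nodes 1 -> 2 -> 5 -> 11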
set.seed(2021)
oj.preds <- predict(oj.tree, newdata = oj.test, type = 'class')
conf.matrix <- table(oj.preds, oj.test$Purchase)
#Test error rate
tree.acc <- sum(diag(conf.matrix))/sum(conf.matrix)
tree.err <- 1 - tree.acc
cat('Overall tree model accuracy is: ', tree.acc)
## Overall tree model accuracy is: 0.7740741
cat('\nOverall tree model test error rate is: ', tree.err)
##
## Overall tree model test error rate is: 0.2259259
The overall tree model accuracy is 77.4%, and the overall test error rate is 22.6%.
set.seed(2021)
CV.oj.tree <- cv.tree(oj.tree, FUN = prune.misclass)
CV.oj.tree
## $size
## [1] 7 6 4 2 1
##
## $dev
## [1] 173 171 174 170 312
##
## $k
## [1] -Inf 0.0 3.0 4.5 150.0
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
Looks like a tree of size 2 has the lowest CV deviance, based on the information above. We could also look at a size of 6, which is nearly as good.
Plotting this we get:
plot(CV.oj.tree)
plot(CV.oj.tree$size, CV.oj.tree$dev, type = "l", lwd = 2, col = "red",
main = "CV Classification Error vs. Tree Size",
xlab = "Tree Size",
ylab = "CV Classification Error")
There are two options we can go with. A tree with 2 terminal nodes has a CV error of 170 misclassifications, while a tree with 6 terminal nodes has 171. I'll try both in the next step to see which one really has the best performance.
pruned.oj.tree2 <- prune.misclass(oj.tree, best = 2)
summary(pruned.oj.tree2)
##
## Classification tree:
## snip.tree(tree = oj.tree, nodes = 2:3)
## Variables actually used in tree construction:
## [1] "LoyalCH"
## Number of terminal nodes: 2
## Residual mean deviance: 0.9756 = 778.5 / 798
## Misclassification error rate: 0.2025 = 162 / 800
If I use 2 terminal nodes, the misclassification error rate is 20.25%.
pruned.oj.tree6 <- prune.misclass(oj.tree, best = 6)
summary(pruned.oj.tree6)
##
## Classification tree:
## snip.tree(tree = oj.tree, nodes = 13L)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff"
## Number of terminal nodes: 6
## Residual mean deviance: 0.7843 = 622.7 / 794
## Misclassification error rate: 0.1838 = 147 / 800
If I use 6 terminal nodes, the misclassification error rate is 18.4%, which is better than the 2-terminal-node solution. So let’s go with the 6-TN model and call it good.
The training error rate for the pruned 6-terminal-node tree is 18.38%, identical to the 18.38% of the un-pruned tree. That isn't too surprising: the un-pruned tree has only 7 terminal nodes to begin with, and pruning to 6 only collapsed a single split.
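If you'd rather compute those training error rates directly than read them off the summaries, one way (my own addition) is:
mean(predict(oj.tree, newdata = oj.train, type = "class") != oj.train$Purchase)         # un-pruned
mean(predict(pruned.oj.tree6, newdata = oj.train, type = "class") != oj.train$Purchase) # pruned to 6 nodes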
Let’s unleash the pruned decision tree upon the test data and see what kind of performance it has.
pruned.oj.preds <- predict(pruned.oj.tree6, newdata = oj.test, type = 'class')
pruned.conf.matrix <- table(pruned.oj.preds, oj.test$Purchase)
pruned.tree.acc <- sum(diag(pruned.conf.matrix))/sum(pruned.conf.matrix)
pruned.tree.err <- 1 - pruned.tree.acc
cat('The unpruned tree model confusion matrix looks like this: \n')
## The unpruned tree model confusion matrix looks like this:
conf.matrix
##
## oj.preds CH MM
## CH 142 38
## MM 23 67
cat('\nOverall unpruned tree model accuracy is: ', tree.acc)
##
## Overall unpruned tree model accuracy is: 0.7740741
cat('\nOverall unpruned tree model test error rate is: ', tree.err)
##
## Overall unpruned tree model test error rate is: 0.2259259
cat('The pruned tree model confusion matrix looks like this: \n')
## The pruned tree model confusion matrix looks like this:
pruned.conf.matrix
##
## pruned.oj.preds CH MM
## CH 142 38
## MM 23 67
cat('\nOverall pruned tree model accuracy is: ', pruned.tree.acc)
##
## Overall pruned tree model accuracy is: 0.7740741
cat('\nOverall pruned tree model test error rate is: ', pruned.tree.err)
##
## Overall pruned tree model test error rate is: 0.2259259
There is no difference in test performance between the un-pruned and pruned decision trees. This does not come as a surprise: the only split that pruning removed (node 13) had two child nodes that both predicted CH, so the two trees make identical predictions on every observation.
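As a final sanity check (my own addition), we can verify that the two sets of test predictions really are identical:
all(oj.preds == pruned.oj.preds) # TRUE when the two trees agree on every test observation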