Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of p̂m1. The x-axis should display p̂m1, ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy. Hint: In a setting with two classes, p̂m1 = 1 − p̂m2. You could make this plot by hand, but it will be much easier to make in R.
The Gini index is a measure of the total variance across the K classes, and it takes a small value when each p̂mk (the proportion of training observations in the mth region that are from the kth class) is close to either 0 or 1; the same can be said for the entropy. The Gini index is the sum over the K classes of p̂mk multiplied by its complement, G = Σ_k p̂mk(1 − p̂mk), while the entropy is the negative sum over the K classes of p̂mk multiplied by its logarithm, D = −Σ_k p̂mk log p̂mk. The classification error is simply the fraction of training observations in the region that do NOT belong to the most common class, E = 1 − max_k p̂mk. With two classes and p̂ = p̂m1 (so p̂m2 = 1 − p̂), these become G = 2p̂(1 − p̂), D = −p̂ log p̂ − (1 − p̂) log(1 − p̂), and E = 1 − max(p̂, 1 − p̂), which is what the code below plots.
p <- seq(0, 1, 0.01)
gini_index <- 2 * p * (1 - p) #this is because we have 2 classes
class_error <- 1 - pmax(p, 1 - p) #pmax takes the element-wise maximum of p and 1 - p
entropy <- - (p * log(p) + (1 - p) * log(1 - p))
par(bg = "orchid3")
matplot(p, cbind(gini_index, class_error, entropy), type = "b", pch = c(15, 17, 19),
        col = c("ghostwhite", "gray0", "gray"),
        ylab = "gini.index, class.error, cross.entropy")
legend("bottom", inset = 0.01, legend = c("gini.index", "class.error", "cross.entropy"),
       col = c("ghostwhite", "gray0", "gray"), pch = c(15, 17, 19))
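Note that at p̂m1 = 0 or 1 the entropy expression above evaluates to NaN (0 * log(0)), so matplot silently drops those endpoints. A minimal sketch that defines 0 · log 0 as 0, using a hypothetical helper safe_entropy() (not from any package):
safe_entropy <- function(p) {
  # treat 0 * log(0) as 0 so the curve is defined at p = 0 and p = 1 as well
  term <- function(q) ifelse(q == 0, 0, q * log(q))
  -(term(p) + term(1 - p))
}
entropy <- safe_entropy(p)  # can be used in place of the entropy vector above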
In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.
(a) Split the data set into a training set and a test set.
library(ISLR)          # Carseats and OJ data sets (if not already loaded earlier)
library(tree)          # tree(), cv.tree(), prune.tree()
library(randomForest)  # bagging and random forests
set.seed(24)
train_division <- sample(1:nrow(Carseats), nrow(Carseats) * 0.75)
training_carseats <- Carseats[train_division, ]
test_carseat <- Carseats[-train_division, ]
(b) Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?
tree_carseats <- tree(Sales ~ ., data = training_carseats)
plot(tree_carseats)
text(tree_carseats, pretty = 1)
To obtain the test MSE, we compute the predicted sales for the test data, subtract the observed sales, square the differences, and average them. Thus, we do the following:
yhat_tree <- predict(tree_carseats, test_carseat)
obs_sales <- test_carseat$Sales
mean((yhat_tree - obs_sales)^2)
## [1] 5.064223
Therefore, our test MSE is about 5.064.
(c) Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?
cross_validation_carseats <- cv.tree(tree_carseats)
cross_validation_carseats
## $size
## [1] 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
##
## $dev
## [1] 1466.526 1495.742 1548.589 1511.621 1501.165 1546.583 1530.132 1538.571
## [9] 1512.517 1533.302 1542.787 1574.725 1512.818 1512.818 1469.126 1498.218
## [17] 1706.665 1708.358 1856.341 2330.926
##
## $k
## [1] -Inf 24.58710 28.41399 30.78746 32.30501 35.19482 36.22118
## [8] 36.60542 37.96271 38.27342 40.64363 43.52142 54.20696 54.49613
## [15] 63.35705 83.46101 130.39388 132.78774 232.13313 537.53515
##
## $method
## [1] "deviance"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
plot(cross_validation_carseats$size, cross_validation_carseats$dev,
xlab = "size",
ylab = "cv error")
abline(h = min(cross_validation_carseats$dev) + 0.2*sd(cross_validation_carseats$dev))
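The cross-validation output above has its lowest deviance at the full tree (size 20), with size 6 nearly as low (1469.1 vs. 1466.5), so pruning is not expected to improve the test MSE by much. A sketch of how one could check directly, pruning to 6 terminal nodes and recomputing the test MSE (results not reproduced here):
pruned_carseats <- prune.tree(tree_carseats, best = 6)  # prune to 6 terminal nodes
plot(pruned_carseats)
text(pruned_carseats, pretty = 1)
yhat_pruned <- predict(pruned_carseats, test_carseat)
mean((yhat_pruned - test_carseat$Sales)^2)  # compare with the unpruned test MSE of ~5.06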
(d) Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important.
bag_carseats <- randomForest(Sales ~ ., data = training_carseats, mtry = 10, ntree = 500,
                             importance = TRUE)
bag_pred <- predict(bag_carseats, test_carseat)
mean((test_carseat$Sales - bag_pred)^2)
## [1] 2.659297
Bagging gives a test MSE of about 2.66, a substantial improvement over the single regression tree's test MSE of about 5.06.
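Part (d) also asks for the importance() function. A minimal sketch for the bagged fit above; its numeric output is essentially the same as the importance table shown for the mtry = 10 fit in part (e) below:
importance(bag_carseats)  # %IncMSE and IncNodePurity for each predictor
varImpPlot(bag_carseats)  # dot plot of the same importance measures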
(e) Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.
set.seed(24)
bag_carseats<-randomForest(Sales ~ ., data = training_carseats, mtry = 10, importance = TRUE)
yhat_bag<-predict(bag_carseats, newdata = test_carseat)
mean((yhat_bag - test_carseat$Sales)^2)
## [1] 2.651329
importance(bag_carseats)
## %IncMSE IncNodePurity
## CompPrice 36.5403195 256.47385
## Income 7.0174602 110.17012
## Advertising 20.2839775 171.36030
## Population 0.6991205 87.06898
## Price 75.2681369 648.05673
## ShelveLoc 68.1472870 618.35931
## Age 29.8947733 270.79924
## Education 6.3758695 68.78396
## Urban -1.7614733 10.72299
## US 1.9791749 10.25786
The test MSE obtained by this fit is about 2.65, almost the same as with the bagging approach; this is expected, since mtry = 10 means all ten predictors are considered at every split, which is exactly bagging. According to the importance() output, the most important variables are Price and ShelveLoc, followed by CompPrice and Age. The sketch below shows how one could vary m to study its effect on the error.
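Because mtry = 10 uses every one of the ten predictors at each split, the fit above is effectively bagging rather than a random forest with m < p. A sketch (results not reproduced here) of how the test MSE could be traced out as a function of m, the number of predictors considered at each split:
set.seed(24)
# refit the forest for each value of m and record the resulting test MSE
test_mse <- sapply(1:10, function(m) {
  rf_fit <- randomForest(Sales ~ ., data = training_carseats, mtry = m)
  yhat <- predict(rf_fit, newdata = test_carseat)
  mean((yhat - test_carseat$Sales)^2)
})
plot(1:10, test_mse, type = "b", xlab = "m (mtry)", ylab = "Test MSE")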
This problem involves the OJ data set which is part of the ISLR package. (a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
oj <- ISLR::OJ
train_division_oj <- sample(1:nrow(oj), nrow(oj)*0.75) #0.75 of 1070 gives 802 observations, close to the requested 800
training_oj <- oj[train_division_oj, ]
test_oj <- oj[-train_division_oj, ]
(b) Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?
tree_oj <- tree(Purchase ~ ., data = training_oj)
summary(tree_oj)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = training_oj)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff"
## Number of terminal nodes: 8
## Residual mean deviance: 0.7264 = 576.8 / 794
## Misclassification error rate: 0.1534 = 123 / 802
The tree has 8 terminal nodes and uses only 2 variables: LoyalCH and PriceDiff. The training error rate is 0.1534 (123 of the 802 training observations are misclassified).
(c) Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.
tree_oj
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 802 1065.00 CH ( 0.61970 0.38030 )
## 2) LoyalCH < 0.48285 298 322.50 MM ( 0.23154 0.76846 )
## 4) LoyalCH < 0.0356415 52 0.00 MM ( 0.00000 1.00000 ) *
## 5) LoyalCH > 0.0356415 246 292.00 MM ( 0.28049 0.71951 )
## 10) PriceDiff < 0.195 111 95.05 MM ( 0.15315 0.84685 ) *
## 11) PriceDiff > 0.195 135 180.00 MM ( 0.38519 0.61481 )
## 22) LoyalCH < 0.277977 63 64.14 MM ( 0.20635 0.79365 ) *
## 23) LoyalCH > 0.277977 72 99.31 CH ( 0.54167 0.45833 ) *
## 3) LoyalCH > 0.48285 504 427.50 CH ( 0.84921 0.15079 )
## 6) LoyalCH < 0.764572 233 275.90 CH ( 0.72103 0.27897 )
## 12) PriceDiff < -0.165 34 39.30 MM ( 0.26471 0.73529 ) *
## 13) PriceDiff > -0.165 199 199.70 CH ( 0.79899 0.20101 )
## 26) PriceDiff < 0.265 111 133.30 CH ( 0.71171 0.28829 ) *
## 27) PriceDiff > 0.265 88 53.62 CH ( 0.90909 0.09091 ) *
## 7) LoyalCH > 0.764572 271 92.04 CH ( 0.95941 0.04059 ) *
We will interpret the node labelled “12”. The splitting variable for this node is PriceDiff and the split value is -0.165. The node contains 34 training observations and its deviance is about 39.3. The * at the end of the line denotes that it is a terminal node, and the prediction at this node is MM. About 26% of the observations in this node have Purchase equal to CH, while the remaining 74% have Purchase equal to MM.
(d) Create a plot of the tree, and interpret the results.
plot(tree_oj)
text(tree_oj, pretty = 1)
LoyalCH is the most important variable in the tree; in fact, the top three splits all use LoyalCH. If LoyalCH < 0.035, the tree predicts MM. If LoyalCH > 0.76, the tree predicts CH. For intermediate values of LoyalCH, the decision also depends on the value of PriceDiff.
(e) Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?
oj_pred = predict(tree_oj, test_oj, type = "class")
table(test_oj$Purchase, oj_pred)
## oj_pred
## CH MM
## CH 137 19
## MM 37 75
The test error rate for this model is the number of off-diagonal entries divided by the total: (19 + 37) / 268 = 56 / 268 ≈ 0.209, so the test error rate is about 21%.
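Equivalently, the test error rate can be computed directly from the confusion matrix rather than by hand:
conf_matrix <- table(test_oj$Purchase, oj_pred)
1 - sum(diag(conf_matrix)) / sum(conf_matrix)  # (19 + 37) / 268, about 0.209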
(f) Apply the cv.tree() function to the training set in order to determine the optimal tree size.
cv_oj = cv.tree(tree_oj, FUN = prune.tree)
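Since Purchase is qualitative, cv.tree() can instead be asked to guide pruning by the classification error rate; the plot below uses the deviance from the call above. A sketch of the alternative:
# FUN = prune.misclass makes cv.tree() report cross-validated misclassification
# counts in $dev instead of the deviance
cv_oj_misclass <- cv.tree(tree_oj, FUN = prune.misclass)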
(g) Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.
plot(cv_oj$size, cv_oj$dev, type = "b", xlab = "Tree Size", ylab = "Deviance")
(h) Which tree size corresponds to the lowest cross-validated classification error rate?
The tree size with the lowest cross-validated error (deviance, in this case) is 7.
(i) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.
oj_pruned = prune.tree(tree_oj, best = 7)
(j) Compare the training error rates between the pruned and unpruned trees. Which is higher?
summary(oj_pruned)
##
## Classification tree:
## snip.tree(tree = tree_oj, nodes = 13L)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff"
## Number of terminal nodes: 7
## Residual mean deviance: 0.7416 = 589.6 / 795
## Misclassification error rate: 0.1534 = 123 / 802
As we can see from the pruned tree summary, the training error rate is exactly the same as for the unpruned tree: 0.1534 (123/802). Neither is higher, although the residual mean deviance of the pruned tree is slightly larger (0.7416 vs. 0.7264), as expected when nodes are removed.
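A quick way to confirm this from the fitted objects, assuming the misclass component that summary.tree() returns for classification trees (the misclassification count followed by the number of training observations):
summary(tree_oj)$misclass    # unpruned tree: 123 misclassified out of 802
summary(oj_pruned)$misclass  # pruned tree:   123 misclassified out of 802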
(k) Compare the test error rates between the pruned and unpruned trees. Which is higher?
pred_unpruned = predict(tree_oj, test_oj, type = "class")
misclass_unpruned = sum(test_oj$Purchase != pred_unpruned)
misclass_unpruned/length(pred_unpruned)
## [1] 0.2089552
pred_pruned = predict(oj_pruned, test_oj, type = "class")
misclass_pruned = sum(test_oj$Purchase != pred_pruned)
misclass_pruned/length(pred_pruned)
## [1] 0.2089552
As we can see, the test error rate for the pruned and unpruned trees is exactly the same: about 0.209 (20.9%). Pruning to 7 terminal nodes therefore neither helps nor hurts the test error here.
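For reference, the two test error rates can be collected side by side from the objects computed above:
data.frame(tree = c("unpruned", "pruned"),
           test_error = c(misclass_unpruned / length(pred_unpruned),
                          misclass_pruned / length(pred_pruned)))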