Question 3)
Consider the Gini index, classification error, and entropy in a simple
classification setting with two classes. Create a single plot that
displays each of these quantities as a function of p̂_m1. The x-axis
should display p̂_m1, ranging from 0 to 1, and the y-axis should display
the value of the Gini index, classification error, and entropy.
Hint: In a setting with two classes, p̂_m1 = 1 − p̂_m2. You could make
this plot by hand, but it will be much easier to make in R.
P = seq(0, 1, 0.001)
gini = 2 * P * (1 - P)
class_error = 1 - pmax(P, 1 - P)
entropy = -(P * log(P) + (1 - P) * log(1 - P))  # natural log; the NaN at P = 0 and P = 1 is simply not drawn
matplot(P, cbind(gini, class_error, entropy), type = "l", lty = 1, lwd = 2,
        xlab = expression(hat(p)[m1]), ylab = "Gini Index, Classification Error, Entropy",
        col = c("#9ac483", "#50aed3", "#e4b031"))
legend("topright", legend = c("Gini Index", "Classification Error", "Entropy"),
       col = c("#9ac483", "#50aed3", "#e4b031"), lty = 1, lwd = 2, bty = "n")
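Solution: The Gini index and entropy are smooth concave curves maximized at p̂_m1 = 0.5, while the classification error is piecewise linear with its maximum at the same point. As a quick sanity check on the formulas, the values at p̂_m1 = 0.5 can be computed directly; this is a minimal sketch reusing the expressions above (the name p is just an illustrative scalar):
p = 0.5
# Expected: Gini = 2(0.5)(0.5) = 0.5, classification error = 0.5, entropy = log(2) ≈ 0.693
c(gini = 2 * p * (1 - p), class_error = 1 - max(p, 1 - p), entropy = -(p * log(p) + (1 - p) * log(1 - p)))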
Question 8)
In the lab, a classification tree was applied to the Carseats data set
after converting Sales into a qualitative response variable. Now we will
seek to predict Sales using regression trees and related approaches,
treating the response as a quantitative variable.
library(ISLR2)        # Carseats and OJ data sets
library(tree)         # tree(), cv.tree(), prune.tree()
library(randomForest) # randomForest() for bagging and random forests
attach(Carseats)
set.seed(1)
train = sample(nrow(Carseats), nrow(Carseats)/2)
car_train = Carseats[train, ]
car_test = Carseats[-train, ]
car_tree = tree(Sales ~ ., data = car_train)
summary(car_tree)
##
## Regression tree:
## tree(formula = Sales ~ ., data = car_train)
## Variables actually used in tree construction:
## [1] "ShelveLoc" "Price" "Age" "Advertising" "CompPrice"
## [6] "US"
## Number of terminal nodes: 18
## Residual mean deviance: 2.167 = 394.3 / 182
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.88200 -0.88200 -0.08712 0.00000 0.89590 4.09900
plot(car_tree)
text(car_tree, pretty = 0, cex = 0.55)
car_pred = predict(car_tree, car_test)
(car_error <- mean((car_test$Sales - car_pred)^2))
## [1] 4.922039
Solution: The test MSE obtained is 4.922
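For context, this can be compared with a trivial baseline that predicts the mean of the training-set Sales for every test observation; a minimal sketch (the name baseline_mse is illustrative and its value is not part of the original output):
# Test MSE when always predicting the training mean of Sales
baseline_mse = mean((car_test$Sales - mean(car_train$Sales))^2)
baseline_mse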
cv_car = cv.tree(car_tree)
plot(cv_car$size, cv_car$dev, type = "b")
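Here a subtree size is read off the plot before pruning; the size that minimizes the cross-validated deviance could also be extracted programmatically. A minimal sketch reusing cv_car (the name best_size is illustrative; ties go to the first minimum):
best_size = cv_car$size[which.min(cv_car$dev)]
best_size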
prune_car = prune.tree(car_tree, best = 10)
plot(prune_car)
text(prune_car, pretty = 0)
prune_pred = predict(prune_car, car_test)
(prune_mse = mean((car_test$Sales - prune_pred)^2))
## [1] 4.918134
Solution: Pruning the tree to 10 terminal nodes changed the test MSE only slightly, from 4.922 to 4.918.
# Carseats has 10 predictors, so mtry = 10 considers all of them at every split, i.e., bagging
car_bag = randomForest(Sales ~ ., data = car_train, mtry = 10, ntree = 500, importance = TRUE)
pred_bag = predict(car_bag, car_test)
(bag_mse=mean((car_test$Sales - pred_bag)^2))
## [1] 2.657296
importance(car_bag)
## %IncMSE IncNodePurity
## CompPrice 23.07909904 171.185734
## Income 2.82081527 94.079825
## Advertising 11.43295625 99.098941
## Population -3.92119532 59.818905
## Price 54.24314632 505.887016
## ShelveLoc 46.26912996 361.962753
## Age 14.24992212 159.740422
## Education -0.07662320 46.738585
## Urban 0.08530119 8.453749
## US 4.34349223 15.157608
Solution: Bagging lowered the test MSE to 2.657, roughly half that of the single pruned tree. Looking at the importance table, we can see that Price, ShelveLoc, CompPrice, and Age are the most important variables.
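The same importance information can also be displayed graphically with the randomForest package's variable-importance plot; a minimal sketch:
# Dot charts of %IncMSE and IncNodePurity for the bagged model
varImpPlot(car_bag)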
# Note: mtry = 10 again uses every predictor at each split, so this is effectively a second
# bagged fit; the randomForest default for regression would be mtry = p/3 (about 3 here)
car_random = randomForest(Sales ~ ., data = car_train, mtry = 10, importance = TRUE)
car_pred_random = predict(car_random, newdata = car_test)
mean((car_pred_random - car_test$Sales)^2)
## [1] 2.610527
importance(car_random)
## %IncMSE IncNodePurity
## CompPrice 27.2185992 175.364033
## Income 7.2671214 92.458328
## Advertising 12.2426915 101.904768
## Population -1.0615424 58.937256
## Price 59.1988997 506.231577
## ShelveLoc 48.2092806 376.678459
## Age 17.1341327 155.920772
## Education 1.6004779 46.229529
## Urban -0.3397478 9.263352
## US 3.9928997 12.917736
Solution: This fit gives a test MSE of about 2.611, essentially the same as bagging, which is expected because mtry = 10 again considers every predictor at each split. Looking at the importance table, CompPrice, Price, ShelveLoc, and Age remain the most important variables. A true random forest would use a smaller mtry; the effect of that choice is sketched below.
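To examine the effect of m, the number of variables considered at each split, one could refit the model over a range of mtry values and compare test MSEs. This is a minimal sketch only (the name mtry_mse is illustrative, results are not shown, and the exact values depend on the random seed):
# Test MSE for mtry = 1, ..., 10
mtry_mse = sapply(1:10, function(m) {
  fit = randomForest(Sales ~ ., data = car_train, mtry = m, ntree = 500)
  mean((car_test$Sales - predict(fit, car_test))^2)
})
plot(1:10, mtry_mse, type = "b", xlab = "mtry", ylab = "Test MSE")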
Question 9) This problem involves the OJ data set which is part of the ISLR2 package.
library(caret) # confusionMatrix()
attach(OJ)
set.seed(1)
train = sample(1:nrow(OJ), 800)
oj_train = OJ[train, ]
oj_test = OJ[-train, ]
oj_tree = tree(Purchase ~ ., data = oj_train)
summary(oj_tree)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = oj_train)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "SpecialCH" "ListPriceDiff"
## [5] "PctDiscMM"
## Number of terminal nodes: 9
## Residual mean deviance: 0.7432 = 587.8 / 791
## Misclassification error rate: 0.1588 = 127 / 800
plot(oj_tree)
text(oj_tree, pretty = 0)
Solution: The tree has 9 terminal nodes and a training error rate of 0.1588. The variables actually used in constructing the tree are "LoyalCH", "PriceDiff", "SpecialCH", "ListPriceDiff", and "PctDiscMM".
oj_tree
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 1073.00 CH ( 0.60625 0.39375 )
## 2) LoyalCH < 0.5036 365 441.60 MM ( 0.29315 0.70685 )
## 4) LoyalCH < 0.280875 177 140.50 MM ( 0.13559 0.86441 )
## 8) LoyalCH < 0.0356415 59 10.14 MM ( 0.01695 0.98305 ) *
## 9) LoyalCH > 0.0356415 118 116.40 MM ( 0.19492 0.80508 ) *
## 5) LoyalCH > 0.280875 188 258.00 MM ( 0.44149 0.55851 )
## 10) PriceDiff < 0.05 79 84.79 MM ( 0.22785 0.77215 )
## 20) SpecialCH < 0.5 64 51.98 MM ( 0.14062 0.85938 ) *
## 21) SpecialCH > 0.5 15 20.19 CH ( 0.60000 0.40000 ) *
## 11) PriceDiff > 0.05 109 147.00 CH ( 0.59633 0.40367 ) *
## 3) LoyalCH > 0.5036 435 337.90 CH ( 0.86897 0.13103 )
## 6) LoyalCH < 0.764572 174 201.00 CH ( 0.73563 0.26437 )
## 12) ListPriceDiff < 0.235 72 99.81 MM ( 0.50000 0.50000 )
## 24) PctDiscMM < 0.196197 55 73.14 CH ( 0.61818 0.38182 ) *
## 25) PctDiscMM > 0.196197 17 12.32 MM ( 0.11765 0.88235 ) *
## 13) ListPriceDiff > 0.235 102 65.43 CH ( 0.90196 0.09804 ) *
## 7) LoyalCH > 0.764572 261 91.20 CH ( 0.95785 0.04215 ) *
Solution: I selected terminal node 21. This node is reached by the split SpecialCH > 0.5 and contains 15 observations with a deviance of 20.19. The overall prediction for the node is "CH": 60% of its observations are CH and 40% are MM.
plot(oj_tree)
text(oj_tree, pretty=0)
Solution: The tree first splits on whether LoyalCH is above or below 0.5036, and then splits on LoyalCH again on both sides, indicating that brand loyalty is the most important variable in this data set. On the left side (LoyalCH < 0.5036), observations with LoyalCH < 0.280875 are split once more on LoyalCH; both resulting leaves predict "MM", so this split only purifies the nodes rather than changing the prediction. If LoyalCH is greater than 0.280875, the data are split on PriceDiff: when PriceDiff < 0.05 the prediction depends on SpecialCH (MM if SpecialCH < 0.5, CH otherwise), and when PriceDiff > 0.05 the prediction is CH. On the right side (LoyalCH > 0.5036), observations with LoyalCH > 0.764572 are predicted CH. For LoyalCH below 0.764572, the data are split on ListPriceDiff: if ListPriceDiff > 0.235 the prediction is CH, while if ListPriceDiff < 0.235 the data are split on PctDiscMM, predicting CH when PctDiscMM < 0.196 and MM when PctDiscMM > 0.196.
oj_pred = predict(oj_tree, oj_test, type = 'class')
# Note: confusionMatrix(data, reference) expects the predictions first; the arguments here are
# swapped, which transposes the table but leaves the accuracy unchanged
confusionMatrix(oj_test$Purchase, oj_pred)
## Confusion Matrix and Statistics
##
## Reference
## Prediction CH MM
## CH 160 8
## MM 38 64
##
## Accuracy : 0.8296
## 95% CI : (0.7794, 0.8725)
## No Information Rate : 0.7333
## P-Value [Acc > NIR] : 0.0001259
##
## Kappa : 0.6154
##
## Mcnemar's Test P-Value : 1.904e-05
##
## Sensitivity : 0.8081
## Specificity : 0.8889
## Pos Pred Value : 0.9524
## Neg Pred Value : 0.6275
## Prevalence : 0.7333
## Detection Rate : 0.5926
## Detection Prevalence : 0.6222
## Balanced Accuracy : 0.8485
##
## 'Positive' Class : CH
##
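The confusion matrix gives a test accuracy of 0.8296, i.e., a test error rate of about 0.170. The same figure can be computed directly from the stored predictions; a minimal one-line check:
# Fraction of test observations misclassified by the unpruned tree
mean(oj_pred != oj_test$Purchase)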
oj_cvtree = cv.tree(oj_tree)
oj_cvtree
## $size
## [1] 9 8 7 6 5 4 3 2 1
##
## $dev
## [1] 685.6493 698.8799 702.8083 702.8083 714.1093 725.4734 780.2099
## [8] 790.0301 1074.2062
##
## $k
## [1] -Inf 12.62207 13.94616 14.35384 26.21539 35.74964 43.07317
## [8] 45.67120 293.15784
##
## $method
## [1] "deviance"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
plot(oj_cvtree$size, oj_cvtree$dev, type = "b", xlab = "Tree Size", ylab = "Cross-Validated Deviance")
Solution: Because cv.tree() was called without FUN = prune.misclass, the y-axis here shows the cross-validated deviance rather than the classification error rate. The deviance is lowest for the full tree of size 9 (about 686); among the smaller trees, sizes 6 and 7 tie at about 703, so a tree of size 6 is a reasonable choice for pruning.
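To base the curve on the cross-validated classification error rate itself, as the exercise asks, cv.tree can be rerun with FUN = prune.misclass; a minimal sketch (the name oj_cv_misclass is illustrative and the output is not shown here):
# With FUN = prune.misclass, $dev holds the number of CV misclassifications per tree size
oj_cv_misclass = cv.tree(oj_tree, FUN = prune.misclass)
plot(oj_cv_misclass$size, oj_cv_misclass$dev, type = "b", xlab = "Tree Size", ylab = "CV Misclassifications")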
oj_prune = prune.tree(oj_tree, best = 6)
oj_prune
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 1073.00 CH ( 0.60625 0.39375 )
## 2) LoyalCH < 0.5036 365 441.60 MM ( 0.29315 0.70685 )
## 4) LoyalCH < 0.280875 177 140.50 MM ( 0.13559 0.86441 ) *
## 5) LoyalCH > 0.280875 188 258.00 MM ( 0.44149 0.55851 )
## 10) PriceDiff < 0.05 79 84.79 MM ( 0.22785 0.77215 ) *
## 11) PriceDiff > 0.05 109 147.00 CH ( 0.59633 0.40367 ) *
## 3) LoyalCH > 0.5036 435 337.90 CH ( 0.86897 0.13103 )
## 6) LoyalCH < 0.764572 174 201.00 CH ( 0.73563 0.26437 )
## 12) ListPriceDiff < 0.235 72 99.81 MM ( 0.50000 0.50000 ) *
## 13) ListPriceDiff > 0.235 102 65.43 CH ( 0.90196 0.09804 ) *
## 7) LoyalCH > 0.764572 261 91.20 CH ( 0.95785 0.04215 ) *
plot(oj_prune)
text(oj_prune, pretty=0)
summary(oj_prune)
##
## Classification tree:
## snip.tree(tree = oj_tree, nodes = c(10L, 4L, 12L))
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "ListPriceDiff"
## Number of terminal nodes: 6
## Residual mean deviance: 0.7919 = 628.8 / 794
## Misclassification error rate: 0.1788 = 143 / 800
Solution: The training error rate of the unpruned tree was 0.1588, while the training error rate of the pruned tree is 0.1788. The pruned tree therefore has the higher training error rate.
oj_tree_pred = predict(oj_tree, newdata = oj_test, type = "class")
oj_tree_error = sum(oj_test$Purchase != oj_tree_pred)
oj_tree_error/length(oj_tree_pred)
## [1] 0.1703704
oj_pruned_pred = predict(oj_prune, oj_test, type = "class")
oj_pruned_error = sum(oj_test$Purchase != oj_pruned_pred)
oj_pruned_error/length(oj_pruned_pred)
## [1] 0.1851852
Solution: The unpruned tree has a test error rate of about 0.170 while the pruned tree has a test error rate of about 0.185. We can conclude that the pruned tree also has the higher test error rate.