Chapter 08 (page 332): 3, 8, 9
Q3. Consider the Gini index, classification error, and cross-entropy in a simple setting with two classes. Create a single plot that displays each of these quantities as a function of p^m1. The x-axis should display p^m1, ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy.
p <- seq(0, 1, 0.01)
gini.index <- 2 * p * (1 - p)
class.error <- 1 - pmax(p, 1 - p)
cross.entropy <- - (p * log(p) + (1 - p) * log(1 - p))
matplot(p, cbind(gini.index, class.error, cross.entropy), col = c("red", "green", "blue"))

Q9. This problem involves the “OJ” data set which is part of the “ISLR” package.
(a)Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
library(ISLR)
## Warning: package 'ISLR' was built under R version 3.5.3
library(caret)
## Warning: package 'caret' was built under R version 3.5.3
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.5.3
library(knitr)
## Warning: package 'knitr' was built under R version 3.5.3
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.5.3
## -- Attaching packages --------------------------------------------- tidyverse 1.2.1 --
## v tibble 2.1.1 v purrr 0.3.2
## v tidyr 0.8.3 v dplyr 0.8.0.1
## v readr 1.3.1 v stringr 1.4.0
## v tibble 2.1.1 v forcats 0.4.0
## Warning: package 'tibble' was built under R version 3.5.3
## Warning: package 'tidyr' was built under R version 3.5.3
## Warning: package 'readr' was built under R version 3.5.3
## Warning: package 'purrr' was built under R version 3.5.3
## Warning: package 'dplyr' was built under R version 3.5.3
## Warning: package 'forcats' was built under R version 3.5.3
## -- Conflicts ------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x purrr::lift() masks caret::lift()
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.5.3
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 3.5.3
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 3.5.3
data(OJ)
set.seed(1)
inTrain <- createDataPartition(OJ$Purchase, p = 800/1070, list = FALSE)
training <- OJ[inTrain,]
testing <- OJ[-inTrain,]
(b) Fit a tree to the training data, with “Purchase” as the response and the other variables except for “Buy” as predictors. Use the “summary()” function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate ? How many terminal nodes does the tree have ?
rpart_model <- rpart(Purchase ~ ., data = training, method = 'class',
control = rpart.control(cp = 0))
summary(rpart_model, cp = 1)
## Call:
## rpart(formula = Purchase ~ ., data = training, method = "class",
## control = rpart.control(cp = 0))
## n= 801
##
## CP nsplit rel error xerror xstd
## 1 0.522435897 0 1.0000000 1.0000000 0.04423447
## 2 0.023504274 1 0.4775641 0.5128205 0.03626754
## 3 0.016025641 4 0.4070513 0.4583333 0.03473842
## 4 0.009615385 5 0.3910256 0.4583333 0.03473842
## 5 0.006410256 9 0.3493590 0.4358974 0.03405724
## 6 0.001068376 10 0.3429487 0.4679487 0.03502081
## 7 0.000000000 13 0.3397436 0.4903846 0.03565844
##
## Variable importance
## LoyalCH PriceDiff SalePriceMM ListPriceDiff PriceMM
## 53 8 7 7 4
## PriceCH DiscMM PctDiscMM STORE StoreID
## 3 3 3 2 2
## WeekofPurchase DiscCH SalePriceCH PctDiscCH
## 2 2 2 1
##
## Node number 1: 801 observations
## predicted class=CH expected loss=0.3895131 P(node) =1
## class counts: 489 312
## probabilities: 0.610 0.390
The summary shows us that the variable LoyalCH is by far the most important for determining which orange juice a customer will buy. It also tells us that if we were to only guess the majority class we would get a baseline accuracy of 61.8%.
postResample(predict(rpart_model, training, type = 'class'), training$Purchase)%>%
kable
|
|
x
|
|
Accuracy
|
0.8676654
|
|
Kappa
|
0.7217437
|
The simple tree model beats the baselinewth an accuracy of 85.6%.
(c) Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.
rpart_model
## n= 801
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 801 312 CH (0.61048689 0.38951311)
## 2) LoyalCH>=0.48285 500 80 CH (0.84000000 0.16000000)
## 4) LoyalCH>=0.7645725 255 10 CH (0.96078431 0.03921569) *
## 5) LoyalCH< 0.7645725 245 70 CH (0.71428571 0.28571429)
## 10) ListPriceDiff>=0.235 141 17 CH (0.87943262 0.12056738) *
## 11) ListPriceDiff< 0.235 104 51 MM (0.49038462 0.50961538)
## 22) PriceDiff>=0.085 42 11 CH (0.73809524 0.26190476) *
## 23) PriceDiff< 0.085 62 20 MM (0.32258065 0.67741935)
## 46) STORE>=3.5 11 3 CH (0.72727273 0.27272727) *
## 47) STORE< 3.5 51 12 MM (0.23529412 0.76470588) *
## 3) LoyalCH< 0.48285 301 69 MM (0.22923588 0.77076412)
## 6) LoyalCH>=0.282272 126 49 MM (0.38888889 0.61111111)
## 12) PriceDiff>=0.195 68 31 CH (0.54411765 0.45588235)
## 24) LoyalCH< 0.3084325 7 0 CH (1.00000000 0.00000000) *
## 25) LoyalCH>=0.3084325 61 30 MM (0.49180328 0.50819672)
## 50) WeekofPurchase>=248.5 40 17 CH (0.57500000 0.42500000)
## 100) STORE< 1.5 26 9 CH (0.65384615 0.34615385) *
## 101) STORE>=1.5 14 6 MM (0.42857143 0.57142857) *
## 51) WeekofPurchase< 248.5 21 7 MM (0.33333333 0.66666667) *
## 13) PriceDiff< 0.195 58 12 MM (0.20689655 0.79310345) *
## 7) LoyalCH< 0.282272 175 20 MM (0.11428571 0.88571429)
## 14) LoyalCH>=0.051325 110 19 MM (0.17272727 0.82727273)
## 28) LoyalCH< 0.203377 65 15 MM (0.23076923 0.76923077)
## 56) LoyalCH>=0.180654 7 3 CH (0.57142857 0.42857143) *
## 57) LoyalCH< 0.180654 58 11 MM (0.18965517 0.81034483) *
## 29) LoyalCH>=0.203377 45 4 MM (0.08888889 0.91111111) *
## 15) LoyalCH< 0.051325 65 1 MM (0.01538462 0.98461538) *
The root is split into nodes using the varable LoyalCH which was also determined to be the most important variable for the model in the summary. If a customer scored LoyalCH???0.51 they are predicted to be in the class CH. That means we expect them to buy Citrus Hill instead of Minute Maid.
##We can see from the output that the accuracy of that node is 81.9% ## The summary shows us that the variable LoyalCH is by far the most important for determining which orange juice a customer will buy. It also tells us that if we were to only guess the majority class we would get a baseline accuracy of 61.8%.
(d)Create a plot of the tree, and interpret the results.
rpart.plot(rpart_model)
## The plot shows us the same results as the output above. If we have information about a particular customer we can use the plot to predict which brand of orange juice they will buy.
(e) Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate ?
pred <- predict(rpart_model, testing, type = 'class')
caret::confusionMatrix(pred, testing$Purchase)
## Confusion Matrix and Statistics
##
## Reference
## Prediction CH MM
## CH 140 25
## MM 24 80
##
## Accuracy : 0.8178
## 95% CI : (0.7664, 0.8621)
## No Information Rate : 0.6097
## P-Value [Acc > NIR] : 1.354e-13
##
## Kappa : 0.6166
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8537
## Specificity : 0.7619
## Pos Pred Value : 0.8485
## Neg Pred Value : 0.7692
## Prevalence : 0.6097
## Detection Rate : 0.5204
## Detection Prevalence : 0.6134
## Balanced Accuracy : 0.8078
##
## 'Positive' Class : CH
##
The test accuracy is found to be 78.2%.
(f) Apply the “cv.tree()” function to the training set in order to determine the optimal size tree.
rpart_cv_model <- train(training[,-1], training[,1],
method = 'rpart',
trControl = trainControl(method = 'cv', number = 10),
tuneGrid = expand.grid(cp = seq(0, 0.5, length.out = 10)))
rpart_cv_model
## CART
##
## 801 samples
## 17 predictor
## 2 classes: 'CH', 'MM'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 721, 721, 720, 722, 721, 721, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.00000000 0.8077877 0.5948491
## 0.05555556 0.8065227 0.5922651
## 0.11111111 0.8065227 0.5922651
## 0.16666667 0.8065227 0.5922651
## 0.22222222 0.8065227 0.5922651
## 0.27777778 0.8065227 0.5922651
## 0.33333333 0.8065227 0.5922651
## 0.38888889 0.8065227 0.5922651
## 0.44444444 0.8065227 0.5922651
## 0.50000000 0.7902727 0.5375197
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.
(g) Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.
plot(rpart_cv_model)

(h) Which tree size corresponds to the lowest cross-validated classification error rate ?
rpart_cv_model$bestTune %>% kable
rpart_cv_model$results %>% kable
|
cp
|
Accuracy
|
Kappa
|
AccuracySD
|
KappaSD
|
|
0.0000000
|
0.8077877
|
0.5948491
|
0.0320064
|
0.0700690
|
|
0.0555556
|
0.8065227
|
0.5922651
|
0.0431349
|
0.0914759
|
|
0.1111111
|
0.8065227
|
0.5922651
|
0.0431349
|
0.0914759
|
|
0.1666667
|
0.8065227
|
0.5922651
|
0.0431349
|
0.0914759
|
|
0.2222222
|
0.8065227
|
0.5922651
|
0.0431349
|
0.0914759
|
|
0.2777778
|
0.8065227
|
0.5922651
|
0.0431349
|
0.0914759
|
|
0.3333333
|
0.8065227
|
0.5922651
|
0.0431349
|
0.0914759
|
|
0.3888889
|
0.8065227
|
0.5922651
|
0.0431349
|
0.0914759
|
|
0.4444444
|
0.8065227
|
0.5922651
|
0.0431349
|
0.0914759
|
|
0.5000000
|
0.7902727
|
0.5375197
|
0.0750970
|
0.2092606
|
(i) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.
set.seed(1)
rpart_tuned <- rpart(Purchase ~ ., data = training, method = 'class',
control = rpart.control(cp = 0.02))
rpart_tuned
## n= 801
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 801 312 CH (0.61048689 0.38951311)
## 2) LoyalCH>=0.48285 500 80 CH (0.84000000 0.16000000)
## 4) LoyalCH>=0.7645725 255 10 CH (0.96078431 0.03921569) *
## 5) LoyalCH< 0.7645725 245 70 CH (0.71428571 0.28571429)
## 10) ListPriceDiff>=0.235 141 17 CH (0.87943262 0.12056738) *
## 11) ListPriceDiff< 0.235 104 51 MM (0.49038462 0.50961538)
## 22) PriceDiff>=0.085 42 11 CH (0.73809524 0.26190476) *
## 23) PriceDiff< 0.085 62 20 MM (0.32258065 0.67741935) *
## 3) LoyalCH< 0.48285 301 69 MM (0.22923588 0.77076412) *
We now have a pruned tree modelled on the whole training set by using the value of Cp obtained from cross-validation.
rpart.plot(rpart_tuned)

(j) Compare the training error rates between the pruned and unpruned trees. Which is higher ?
postResample(predict(rpart_model,
training,
type = 'class'), training$Purchase) %>% kable
|
|
x
|
|
Accuracy
|
0.8676654
|
|
Kappa
|
0.7217437
|
postResample(predict(rpart_tuned,
training,
type = 'class'), training$Purchase) %>% kable
|
|
x
|
|
Accuracy
|
0.8414482
|
|
Kappa
|
0.6761968
|
The unpruned model has higher training accuracy. This does not mean we should never prune. It just means that the training set is well representative of the testing set.
(k) Compare the test error rates between the pruned and unpruned trees. Which is higher ?
postResample(predict(rpart_model,
testing,
type = 'class'), testing$Purchase) %>% kable
|
|
x
|
|
Accuracy
|
0.8178439
|
|
Kappa
|
0.6166196
|
postResample(predict(rpart_tuned,
testing,
type = 'class'), testing$Purchase) %>% kable
|
|
x
|
|
Accuracy
|
0.8104089
|
|
Kappa
|
0.6090228
|
Again the unpruned model has higher accuracy and again this does not mean we should never prune.