Chapter 08 (page 332): 3, 8, 9

Q3. Consider the Gini index, classification error, and cross-entropy in a simple setting with two classes. Create a single plot that displays each of these quantities as a function of p^m1. The x-axis should display p^m1, ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy.

p <- seq(0, 1, 0.01)
gini.index <- 2 * p * (1 - p)
class.error <- 1 - pmax(p, 1 - p)
cross.entropy <- - (p * log(p) + (1 - p) * log(1 - p))
matplot(p, cbind(gini.index, class.error, cross.entropy), col = c("red", "green", "blue"))

Q9. This problem involves the “OJ” data set which is part of the “ISLR” package.

(a)Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.

library(ISLR)

## Warning: package 'ISLR' was built under R version 3.5.3

library(caret)

## Warning: package 'caret' was built under R version 3.5.3

## Loading required package: lattice

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 3.5.3

library(knitr)

## Warning: package 'knitr' was built under R version 3.5.3

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 3.5.3

## -- Attaching packages --------------------------------------------- tidyverse 1.2.1 --

## v tibble  2.1.1       v purrr   0.3.2  
## v tidyr   0.8.3       v dplyr   0.8.0.1
## v readr   1.3.1       v stringr 1.4.0  
## v tibble  2.1.1       v forcats 0.4.0

## Warning: package 'tibble' was built under R version 3.5.3

## Warning: package 'tidyr' was built under R version 3.5.3

## Warning: package 'readr' was built under R version 3.5.3

## Warning: package 'purrr' was built under R version 3.5.3

## Warning: package 'dplyr' was built under R version 3.5.3

## Warning: package 'forcats' was built under R version 3.5.3

## -- Conflicts ------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x purrr::lift()   masks caret::lift()

library(rpart)
library(rpart.plot)

## Warning: package 'rpart.plot' was built under R version 3.5.3

library(kableExtra)

## Warning: package 'kableExtra' was built under R version 3.5.3

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

library(ggthemes)

## Warning: package 'ggthemes' was built under R version 3.5.3

data(OJ)
set.seed(1)
inTrain <- createDataPartition(OJ$Purchase, p = 800/1070, list = FALSE)
training <- OJ[inTrain,]
testing <- OJ[-inTrain,]

(b) Fit a tree to the training data, with “Purchase” as the response and the other variables except for “Buy” as predictors. Use the “summary()” function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate ? How many terminal nodes does the tree have ?

rpart_model <- rpart(Purchase ~ ., data = training, method = 'class', 
                     control = rpart.control(cp = 0))
summary(rpart_model, cp = 1)

## Call:
## rpart(formula = Purchase ~ ., data = training, method = "class", 
##     control = rpart.control(cp = 0))
##   n= 801 
## 
##            CP nsplit rel error    xerror       xstd
## 1 0.522435897      0 1.0000000 1.0000000 0.04423447
## 2 0.023504274      1 0.4775641 0.5128205 0.03626754
## 3 0.016025641      4 0.4070513 0.4583333 0.03473842
## 4 0.009615385      5 0.3910256 0.4583333 0.03473842
## 5 0.006410256      9 0.3493590 0.4358974 0.03405724
## 6 0.001068376     10 0.3429487 0.4679487 0.03502081
## 7 0.000000000     13 0.3397436 0.4903846 0.03565844
## 
## Variable importance
##        LoyalCH      PriceDiff    SalePriceMM  ListPriceDiff        PriceMM 
##             53              8              7              7              4 
##        PriceCH         DiscMM      PctDiscMM          STORE        StoreID 
##              3              3              3              2              2 
## WeekofPurchase         DiscCH    SalePriceCH      PctDiscCH 
##              2              2              2              1 
## 
## Node number 1: 801 observations
##   predicted class=CH  expected loss=0.3895131  P(node) =1
##     class counts:   489   312
##    probabilities: 0.610 0.390

The summary shows us that the variable LoyalCH is by far the most important for determining which orange juice a customer will buy. It also tells us that if we were to only guess the majority class we would get a baseline accuracy of 61.8%.

postResample(predict(rpart_model, training, type = 'class'), training$Purchase)%>%
kable

	x
Accuracy	0.8676654
Kappa	0.7217437

The simple tree model beats the baselinewth an accuracy of 85.6%.

(c) Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.

rpart_model

## n= 801 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 801 312 CH (0.61048689 0.38951311)  
##     2) LoyalCH>=0.48285 500  80 CH (0.84000000 0.16000000)  
##       4) LoyalCH>=0.7645725 255  10 CH (0.96078431 0.03921569) *
##       5) LoyalCH< 0.7645725 245  70 CH (0.71428571 0.28571429)  
##        10) ListPriceDiff>=0.235 141  17 CH (0.87943262 0.12056738) *
##        11) ListPriceDiff< 0.235 104  51 MM (0.49038462 0.50961538)  
##          22) PriceDiff>=0.085 42  11 CH (0.73809524 0.26190476) *
##          23) PriceDiff< 0.085 62  20 MM (0.32258065 0.67741935)  
##            46) STORE>=3.5 11   3 CH (0.72727273 0.27272727) *
##            47) STORE< 3.5 51  12 MM (0.23529412 0.76470588) *
##     3) LoyalCH< 0.48285 301  69 MM (0.22923588 0.77076412)  
##       6) LoyalCH>=0.282272 126  49 MM (0.38888889 0.61111111)  
##        12) PriceDiff>=0.195 68  31 CH (0.54411765 0.45588235)  
##          24) LoyalCH< 0.3084325 7   0 CH (1.00000000 0.00000000) *
##          25) LoyalCH>=0.3084325 61  30 MM (0.49180328 0.50819672)  
##            50) WeekofPurchase>=248.5 40  17 CH (0.57500000 0.42500000)  
##             100) STORE< 1.5 26   9 CH (0.65384615 0.34615385) *
##             101) STORE>=1.5 14   6 MM (0.42857143 0.57142857) *
##            51) WeekofPurchase< 248.5 21   7 MM (0.33333333 0.66666667) *
##        13) PriceDiff< 0.195 58  12 MM (0.20689655 0.79310345) *
##       7) LoyalCH< 0.282272 175  20 MM (0.11428571 0.88571429)  
##        14) LoyalCH>=0.051325 110  19 MM (0.17272727 0.82727273)  
##          28) LoyalCH< 0.203377 65  15 MM (0.23076923 0.76923077)  
##            56) LoyalCH>=0.180654 7   3 CH (0.57142857 0.42857143) *
##            57) LoyalCH< 0.180654 58  11 MM (0.18965517 0.81034483) *
##          29) LoyalCH>=0.203377 45   4 MM (0.08888889 0.91111111) *
##        15) LoyalCH< 0.051325 65   1 MM (0.01538462 0.98461538) *

The root is split into nodes using the varable LoyalCH which was also determined to be the most important variable for the model in the summary. If a customer scored LoyalCH???0.51 they are predicted to be in the class CH. That means we expect them to buy Citrus Hill instead of Minute Maid.

##We can see from the output that the accuracy of that node is 81.9% ## The summary shows us that the variable LoyalCH is by far the most important for determining which orange juice a customer will buy. It also tells us that if we were to only guess the majority class we would get a baseline accuracy of 61.8%.

(d)Create a plot of the tree, and interpret the results.

rpart.plot(rpart_model)

## The plot shows us the same results as the output above. If we have information about a particular customer we can use the plot to predict which brand of orange juice they will buy.

(e) Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate ?

pred <- predict(rpart_model, testing, type = 'class')

caret::confusionMatrix(pred, testing$Purchase)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  CH  MM
##         CH 140  25
##         MM  24  80
##                                           
##                Accuracy : 0.8178          
##                  95% CI : (0.7664, 0.8621)
##     No Information Rate : 0.6097          
##     P-Value [Acc > NIR] : 1.354e-13       
##                                           
##                   Kappa : 0.6166          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8537          
##             Specificity : 0.7619          
##          Pos Pred Value : 0.8485          
##          Neg Pred Value : 0.7692          
##              Prevalence : 0.6097          
##          Detection Rate : 0.5204          
##    Detection Prevalence : 0.6134          
##       Balanced Accuracy : 0.8078          
##                                           
##        'Positive' Class : CH              
##

The test accuracy is found to be 78.2%.

(f) Apply the “cv.tree()” function to the training set in order to determine the optimal size tree.

rpart_cv_model <- train(training[,-1], training[,1],
                        method = 'rpart',
                        trControl = trainControl(method = 'cv', number = 10),
                        tuneGrid = expand.grid(cp = seq(0, 0.5, length.out = 10)))
rpart_cv_model

## CART 
## 
## 801 samples
##  17 predictor
##   2 classes: 'CH', 'MM' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 721, 721, 720, 722, 721, 721, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa    
##   0.00000000  0.8077877  0.5948491
##   0.05555556  0.8065227  0.5922651
##   0.11111111  0.8065227  0.5922651
##   0.16666667  0.8065227  0.5922651
##   0.22222222  0.8065227  0.5922651
##   0.27777778  0.8065227  0.5922651
##   0.33333333  0.8065227  0.5922651
##   0.38888889  0.8065227  0.5922651
##   0.44444444  0.8065227  0.5922651
##   0.50000000  0.7902727  0.5375197
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.

(g) Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.

plot(rpart_cv_model)

(h) Which tree size corresponds to the lowest cross-validated classification error rate ?

rpart_cv_model$bestTune %>% kable

cp
0

rpart_cv_model$results %>% kable

cp	Accuracy	Kappa	AccuracySD	KappaSD
0.0000000	0.8077877	0.5948491	0.0320064	0.0700690
0.0555556	0.8065227	0.5922651	0.0431349	0.0914759
0.1111111	0.8065227	0.5922651	0.0431349	0.0914759
0.1666667	0.8065227	0.5922651	0.0431349	0.0914759
0.2222222	0.8065227	0.5922651	0.0431349	0.0914759
0.2777778	0.8065227	0.5922651	0.0431349	0.0914759
0.3333333	0.8065227	0.5922651	0.0431349	0.0914759
0.3888889	0.8065227	0.5922651	0.0431349	0.0914759
0.4444444	0.8065227	0.5922651	0.0431349	0.0914759
0.5000000	0.7902727	0.5375197	0.0750970	0.2092606

(i) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.

set.seed(1)
rpart_tuned <- rpart(Purchase ~ ., data = training, method = 'class',
                     control = rpart.control(cp = 0.02))
rpart_tuned

## n= 801 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 801 312 CH (0.61048689 0.38951311)  
##    2) LoyalCH>=0.48285 500  80 CH (0.84000000 0.16000000)  
##      4) LoyalCH>=0.7645725 255  10 CH (0.96078431 0.03921569) *
##      5) LoyalCH< 0.7645725 245  70 CH (0.71428571 0.28571429)  
##       10) ListPriceDiff>=0.235 141  17 CH (0.87943262 0.12056738) *
##       11) ListPriceDiff< 0.235 104  51 MM (0.49038462 0.50961538)  
##         22) PriceDiff>=0.085 42  11 CH (0.73809524 0.26190476) *
##         23) PriceDiff< 0.085 62  20 MM (0.32258065 0.67741935) *
##    3) LoyalCH< 0.48285 301  69 MM (0.22923588 0.77076412) *

We now have a pruned tree modelled on the whole training set by using the value of Cp obtained from cross-validation.

rpart.plot(rpart_tuned)

(j) Compare the training error rates between the pruned and unpruned trees. Which is higher ?

postResample(predict(rpart_model, 
                     training,
                     type = 'class'), training$Purchase) %>% kable

	x
Accuracy	0.8676654
Kappa	0.7217437

postResample(predict(rpart_tuned, 
                     training,
                     type = 'class'), training$Purchase) %>% kable

	x
Accuracy	0.8414482
Kappa	0.6761968

The unpruned model has higher training accuracy. This does not mean we should never prune. It just means that the training set is well representative of the testing set.

(k) Compare the test error rates between the pruned and unpruned trees. Which is higher ?

postResample(predict(rpart_model, 
                     testing,
                     type = 'class'), testing$Purchase) %>% kable

	x
Accuracy	0.8178439
Kappa	0.6166196

postResample(predict(rpart_tuned, 
                     testing,
                     type = 'class'), testing$Purchase) %>% kable

	x
Accuracy	0.8104089
Kappa	0.6090228

Data miningHW 7