3. Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of p̂m1. The x-axis should display p̂m1, ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy.

p <- seq(0, 1, 0.01)
gini <- 2 * p * (1 - p)                          # Gini index
class.err <- 1 - pmax(p, 1 - p)                  # classification error
entropy <- -(p * log(p) + (1 - p) * log(1 - p))  # entropy; NaN at p = 0 and p = 1, so those points are dropped
matplot(p, cbind(gini, class.err, entropy), pch = c(15, 17, 19),
        col = c("orange", "purple", "black"),
        xlab = expression(hat(p)[m1]), ylab = "Value")
legend("topleft", legend = c("Gini index", "Classification error", "Entropy"),
       pch = c(15, 17, 19), col = c("orange", "purple", "black"))

library(rattle)
## Warning: package 'rattle' was built under R version 4.0.5
## Loading required package: tibble
## Loading required package: bitops
## Warning: package 'bitops' was built under R version 4.0.3
## Rattle: A free graphical interface for data science with R.
## Version 5.4.0 Copyright (c) 2006-2020 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(tree)
## Warning: package 'tree' was built under R version 4.0.5
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.5
library(rpart)
library(caret)
## Warning: package 'caret' was built under R version 4.0.5
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.0.3
set.seed(1)

8. In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.

(a) Split the data set into a training set and a test set.

train = sample(1:nrow(Carseats), nrow(Carseats) / 2)
Car.train = Carseats[train, ]
Car.test = Carseats[-train,]

(b) Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?

tree.CarseatSales = rpart(Sales~.,data = Car.train)
summary(tree.CarseatSales)
## Call:
## rpart(formula = Sales ~ ., data = Car.train)
##   n= 200 
## 
##            CP nsplit rel error    xerror       xstd
## 1  0.21466548      0 1.0000000 1.0181022 0.09266075
## 2  0.10160639      1 0.7853345 0.8557999 0.07675451
## 3  0.07420352      2 0.6837281 0.8699907 0.08193945
## 4  0.06332768      3 0.6095246 0.8191473 0.07619093
## 5  0.04846408      4 0.5461969 0.7646509 0.06843043
## 6  0.03342767      5 0.4977328 0.7154996 0.06535876
## 7  0.02591390      6 0.4643052 0.7330992 0.06471245
## 8  0.02512748      8 0.4124774 0.7470824 0.06578629
## 9  0.02007232      9 0.3873499 0.7381530 0.06588700
## 10 0.01843725     10 0.3672776 0.7288800 0.06420764
## 11 0.01307522     11 0.3488403 0.6927729 0.06322899
## 12 0.01080663     12 0.3357651 0.6895313 0.06299303
## 13 0.01027186     13 0.3249585 0.6801025 0.06260240
## 14 0.01000000     14 0.3146866 0.6801025 0.06260240
## 
## Variable importance
##       Price   ShelveLoc   CompPrice         Age Advertising          US 
##          27          22          14          10           9           8 
##  Population      Income       Urban   Education 
##           4           4           2           2 
## 
## Node number 1: 200 observations,    complexity param=0.2146655
##   mean=7.57815, MSE=7.863433 
##   left son=2 (158 obs) right son=3 (42 obs)
##   Primary splits:
##       ShelveLoc   splits as  LRL,       improve=0.21466550, (0 missing)
##       Price       < 89.5  to the right, improve=0.17140880, (0 missing)
##       Age         < 61.5  to the right, improve=0.06863684, (0 missing)
##       Advertising < 2.5   to the left,  improve=0.06010481, (0 missing)
##       US          splits as  LR,        improve=0.04966938, (0 missing)
##   Surrogate splits:
##       Price < 163.5 to the left,  agree=0.795, adj=0.024, (0 split)
## 
## Node number 2: 158 observations,    complexity param=0.1016064
##   mean=6.908291, MSE=6.105264 
##   left son=4 (134 obs) right son=5 (24 obs)
##   Primary splits:
##       Price       < 94.5  to the right, improve=0.16565390, (0 missing)
##       Age         < 50.5  to the right, improve=0.10951490, (0 missing)
##       ShelveLoc   splits as  L-R,       improve=0.08044967, (0 missing)
##       Advertising < 9.5   to the left,  improve=0.07610689, (0 missing)
##       Income      < 61.5  to the left,  improve=0.06165666, (0 missing)
##   Surrogate splits:
##       CompPrice < 98.5  to the right, agree=0.88, adj=0.208, (0 split)
## 
## Node number 3: 42 observations,    complexity param=0.07420352
##   mean=10.0981, MSE=6.439368 
##   left son=6 (12 obs) right son=7 (30 obs)
##   Primary splits:
##       Price       < 135   to the right, improve=0.4314934, (0 missing)
##       US          splits as  LR,        improve=0.2079923, (0 missing)
##       Advertising < 0.5   to the left,  improve=0.1864735, (0 missing)
##       Income      < 35    to the left,  improve=0.1510370, (0 missing)
##       Age         < 62.5  to the right, improve=0.1377097, (0 missing)
##   Surrogate splits:
##       CompPrice < 140.5 to the right, agree=0.738, adj=0.083, (0 split)
##       Income    < 85.5  to the right, agree=0.738, adj=0.083, (0 split)
## 
## Node number 4: 134 observations,    complexity param=0.06332768
##   mean=6.482687, MSE=5.178796 
##   left son=8 (59 obs) right son=9 (75 obs)
##   Primary splits:
##       Advertising < 4     to the left,  improve=0.1435166, (0 missing)
##       Price       < 130.5 to the right, improve=0.1265827, (0 missing)
##       Age         < 47.5  to the right, improve=0.1148043, (0 missing)
##       CompPrice   < 124.5 to the left,  improve=0.1140388, (0 missing)
##       ShelveLoc   splits as  L-R,       improve=0.0915152, (0 missing)
##   Surrogate splits:
##       US         splits as  LR,        agree=0.888, adj=0.746, (0 split)
##       Population < 123.5 to the left,  agree=0.664, adj=0.237, (0 split)
##       Income     < 40.5  to the left,  agree=0.604, adj=0.102, (0 split)
##       CompPrice  < 151   to the right, agree=0.597, adj=0.085, (0 split)
##       Age        < 69.5  to the right, agree=0.597, adj=0.085, (0 split)
## 
## Node number 5: 24 observations,    complexity param=0.02512748
##   mean=9.284583, MSE=4.619916 
##   left son=10 (17 obs) right son=11 (7 obs)
##   Primary splits:
##       Age         < 42    to the right, improve=0.35640660, (0 missing)
##       Income      < 62.5  to the left,  improve=0.14511910, (0 missing)
##       Price       < 79    to the right, improve=0.13907090, (0 missing)
##       Population  < 272   to the left,  improve=0.13062380, (0 missing)
##       Advertising < 8.5   to the left,  improve=0.07329618, (0 missing)
##   Surrogate splits:
##       CompPrice  < 122.5 to the left,  agree=0.875, adj=0.571, (0 split)
##       Population < 272   to the left,  agree=0.750, adj=0.143, (0 split)
##       Price      < 93.5  to the left,  agree=0.750, adj=0.143, (0 split)
## 
## Node number 6: 12 observations
##   mean=7.4625, MSE=3.074302 
## 
## Node number 7: 30 observations,    complexity param=0.02007232
##   mean=11.15233, MSE=3.895431 
##   left son=14 (8 obs) right son=15 (22 obs)
##   Primary splits:
##       US          splits as  LR,        improve=0.2701239, (0 missing)
##       Advertising < 0.5   to the left,  improve=0.2663472, (0 missing)
##       Price       < 109.5 to the right, improve=0.2592822, (0 missing)
##       Population  < 361   to the left,  improve=0.2062495, (0 missing)
##       Education   < 12.5  to the right, improve=0.1517277, (0 missing)
##   Surrogate splits:
##       Advertising < 0.5   to the left,  agree=0.9, adj=0.625, (0 split)
##       Age         < 34.5  to the left,  agree=0.8, adj=0.250, (0 split)
## 
## Node number 8: 59 observations,    complexity param=0.0259139
##   mean=5.510678, MSE=3.897203 
##   left son=16 (22 obs) right son=17 (37 obs)
##   Primary splits:
##       Price     < 130   to the right, improve=0.16871680, (0 missing)
##       CompPrice < 124.5 to the left,  improve=0.14002200, (0 missing)
##       ShelveLoc splits as  L-R,       improve=0.09363494, (0 missing)
##       Income    < 101.5 to the left,  improve=0.06549027, (0 missing)
##       Education < 16.5  to the right, improve=0.06262016, (0 missing)
##   Surrogate splits:
##       Income      < 34    to the left,  agree=0.678, adj=0.136, (0 split)
##       CompPrice   < 140   to the right, agree=0.661, adj=0.091, (0 split)
##       Age         < 28.5  to the left,  agree=0.661, adj=0.091, (0 split)
##       Advertising < 0.5   to the right, agree=0.644, adj=0.045, (0 split)
## 
## Node number 9: 75 observations,    complexity param=0.04846408
##   mean=7.247333, MSE=4.859054 
##   left son=18 (32 obs) right son=19 (43 obs)
##   Primary splits:
##       Age        < 58.5  to the right, improve=0.20914580, (0 missing)
##       Price      < 124.5 to the right, improve=0.14042090, (0 missing)
##       CompPrice  < 123.5 to the left,  improve=0.12533240, (0 missing)
##       ShelveLoc  splits as  L-R,       improve=0.09377620, (0 missing)
##       Population < 264.5 to the right, improve=0.06002403, (0 missing)
##   Surrogate splits:
##       CompPrice  < 125.5 to the left,  agree=0.653, adj=0.187, (0 split)
##       Price      < 112.5 to the left,  agree=0.627, adj=0.125, (0 split)
##       Income     < 46.5  to the left,  agree=0.613, adj=0.094, (0 split)
##       Population < 73.5  to the left,  agree=0.613, adj=0.094, (0 split)
##       US         splits as  LR,        agree=0.600, adj=0.062, (0 split)
## 
## Node number 10: 17 observations
##   mean=8.461176, MSE=3.345846 
## 
## Node number 11: 7 observations
##   mean=11.28429, MSE=2.06871 
## 
## Node number 14: 8 observations
##   mean=9.45125, MSE=4.676611 
## 
## Node number 15: 22 observations
##   mean=11.77091, MSE=2.176481 
## 
## Node number 16: 22 observations,    complexity param=0.01080663
##   mean=4.459091, MSE=2.536654 
##   left son=32 (15 obs) right son=33 (7 obs)
##   Primary splits:
##       CompPrice  < 137.5 to the left,  improve=0.30454290, (0 missing)
##       Price      < 140   to the left,  improve=0.15588450, (0 missing)
##       Age        < 59.5  to the right, improve=0.13454920, (0 missing)
##       Population < 141.5 to the right, improve=0.09637008, (0 missing)
##       Education  < 13.5  to the right, improve=0.04917852, (0 missing)
##   Surrogate splits:
##       Price      < 158.5 to the left,  agree=0.818, adj=0.429, (0 split)
##       Population < 69    to the right, agree=0.773, adj=0.286, (0 split)
##       Education  < 16.5  to the left,  agree=0.727, adj=0.143, (0 split)
## 
## Node number 17: 37 observations,    complexity param=0.0259139
##   mean=6.135946, MSE=3.657694 
##   left son=34 (16 obs) right son=35 (21 obs)
##   Primary splits:
##       CompPrice  < 123.5 to the left,  improve=0.31562490, (0 missing)
##       ShelveLoc  splits as  L-R,       improve=0.29669210, (0 missing)
##       Population < 69.5  to the left,  improve=0.14733500, (0 missing)
##       Education  < 16.5  to the right, improve=0.12712460, (0 missing)
##       Age        < 54.5  to the right, improve=0.07070531, (0 missing)
##   Surrogate splits:
##       Price      < 98.5  to the left,  agree=0.730, adj=0.375, (0 split)
##       Population < 81.5  to the left,  agree=0.676, adj=0.250, (0 split)
##       Age        < 60.5  to the right, agree=0.676, adj=0.250, (0 split)
##       Income     < 33    to the left,  agree=0.622, adj=0.125, (0 split)
##       US         splits as  RL,        agree=0.622, adj=0.125, (0 split)
## 
## Node number 18: 32 observations,    complexity param=0.01843725
##   mean=6.07875, MSE=4.673117 
##   left son=36 (24 obs) right son=37 (8 obs)
##   Primary splits:
##       Urban      splits as  RL,        improve=0.19390170, (0 missing)
##       CompPrice  < 123.5 to the left,  improve=0.17818000, (0 missing)
##       Price      < 131.5 to the right, improve=0.09564456, (0 missing)
##       Income     < 85.5  to the left,  improve=0.09025031, (0 missing)
##       Population < 258   to the right, improve=0.06684518, (0 missing)
##   Surrogate splits:
##       CompPrice  < 145   to the left,  agree=0.812, adj=0.250, (0 split)
##       Income     < 113.5 to the left,  agree=0.812, adj=0.250, (0 split)
##       US         splits as  RL,        agree=0.812, adj=0.250, (0 split)
##       Population < 399   to the left,  agree=0.781, adj=0.125, (0 split)
## 
## Node number 19: 43 observations,    complexity param=0.03342767
##   mean=8.116977, MSE=3.224896 
##   left son=38 (18 obs) right son=39 (25 obs)
##   Primary splits:
##       Price       < 127   to the right, improve=0.37910910, (0 missing)
##       Advertising < 10.5  to the left,  improve=0.16269530, (0 missing)
##       ShelveLoc   splits as  L-R,       improve=0.13220550, (0 missing)
##       Population  < 415.5 to the left,  improve=0.08891671, (0 missing)
##       Education   < 15.5  to the right, improve=0.05145156, (0 missing)
##   Surrogate splits:
##       CompPrice   < 134.5 to the right, agree=0.721, adj=0.333, (0 split)
##       Education   < 11.5  to the left,  agree=0.674, adj=0.222, (0 split)
##       Advertising < 5.5   to the left,  agree=0.651, adj=0.167, (0 split)
##       Income      < 111   to the right, agree=0.605, adj=0.056, (0 split)
##       Age         < 52.5  to the right, agree=0.605, adj=0.056, (0 split)
## 
## Node number 32: 15 observations
##   mean=3.858667, MSE=1.900452 
## 
## Node number 33: 7 observations
##   mean=5.745714, MSE=1.472024 
## 
## Node number 34: 16 observations
##   mean=4.905, MSE=2.865988 
## 
## Node number 35: 21 observations
##   mean=7.07381, MSE=2.226852 
## 
## Node number 36: 24 observations,    complexity param=0.01027186
##   mean=5.529167, MSE=3.831483 
##   left son=72 (9 obs) right son=73 (15 obs)
##   Primary splits:
##       CompPrice  < 118.5 to the left,  improve=0.17567620, (0 missing)
##       Population < 258   to the right, improve=0.13673580, (0 missing)
##       Education  < 13.5  to the right, improve=0.07801690, (0 missing)
##       Price      < 130.5 to the right, improve=0.07542213, (0 missing)
##       Income     < 59.5  to the left,  improve=0.06896229, (0 missing)
##   Surrogate splits:
##       Price       < 111.5 to the left,  agree=0.833, adj=0.556, (0 split)
##       Advertising < 9.5   to the left,  agree=0.750, adj=0.333, (0 split)
##       Education   < 12.5  to the left,  agree=0.750, adj=0.333, (0 split)
##       Income      < 51.5  to the left,  agree=0.708, adj=0.222, (0 split)
##       Age         < 74    to the right, agree=0.667, adj=0.111, (0 split)
## 
## Node number 37: 8 observations
##   mean=7.7275, MSE=3.573519 
## 
## Node number 38: 18 observations
##   mean=6.813889, MSE=1.691102 
## 
## Node number 39: 25 observations,    complexity param=0.01307522
##   mean=9.0552, MSE=2.226377 
##   left son=78 (13 obs) right son=79 (12 obs)
##   Primary splits:
##       CompPrice   < 129.5 to the left,  improve=0.3694473, (0 missing)
##       Advertising < 10.5  to the left,  improve=0.2354112, (0 missing)
##       ShelveLoc   splits as  L-R,       improve=0.2285101, (0 missing)
##       Income      < 50    to the left,  improve=0.1770412, (0 missing)
##       Education   < 12.5  to the right, improve=0.1040487, (0 missing)
##   Surrogate splits:
##       Population  < 327   to the right, agree=0.72, adj=0.417, (0 split)
##       Age         < 35.5  to the left,  agree=0.72, adj=0.417, (0 split)
##       Income      < 74.5  to the right, agree=0.64, adj=0.250, (0 split)
##       Education   < 11.5  to the right, agree=0.64, adj=0.250, (0 split)
##       Advertising < 9     to the left,  agree=0.60, adj=0.167, (0 split)
## 
## Node number 72: 9 observations
##   mean=4.47, MSE=2.295778 
## 
## Node number 73: 15 observations
##   mean=6.164667, MSE=3.675945 
## 
## Node number 78: 13 observations
##   mean=8.183846, MSE=1.448731 
## 
## Node number 79: 12 observations
##   mean=9.999167, MSE=1.355224
fancyRpartPlot(tree.CarseatSales)

plot(tree.CarseatSales)
text(tree.CarseatSales, pretty = 0)

yhat = predict(tree.CarseatSales,newdata = Car.test)
mean((yhat - Car.test$Sales)^2)
## [1] 5.165182

The estimated test MSE is approximately 5.17.

(c) Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?

plotcp(tree.CarseatSales)

printcp(tree.CarseatSales)
## 
## Regression tree:
## rpart(formula = Sales ~ ., data = Car.train)
## 
## Variables actually used in tree construction:
## [1] Advertising Age         CompPrice   Price       ShelveLoc   Urban      
## [7] US         
## 
## Root node error: 1572.7/200 = 7.8634
## 
## n= 200 
## 
##          CP nsplit rel error  xerror     xstd
## 1  0.214665      0   1.00000 1.01810 0.092661
## 2  0.101606      1   0.78533 0.85580 0.076755
## 3  0.074204      2   0.68373 0.86999 0.081939
## 4  0.063328      3   0.60952 0.81915 0.076191
## 5  0.048464      4   0.54620 0.76465 0.068430
## 6  0.033428      5   0.49773 0.71550 0.065359
## 7  0.025914      6   0.46431 0.73310 0.064712
## 8  0.025127      8   0.41248 0.74708 0.065786
## 9  0.020072      9   0.38735 0.73815 0.065887
## 10 0.018437     10   0.36728 0.72888 0.064208
## 11 0.013075     11   0.34884 0.69277 0.063229
## 12 0.010807     12   0.33577 0.68953 0.062993
## 13 0.010272     13   0.32496 0.68010 0.062602
## 14 0.010000     14   0.31469 0.68010 0.062602
tree.CarseatSales$cptable[which.min(tree.CarseatSales$cptable[,"xerror"]),"CP"]
## [1] 0.01027186
carseats.prune=prune(tree.CarseatSales,cp=tree.CarseatSales$cptable[which.min(tree.CarseatSales$cptable[,"xerror"]),"CP"])
fancyRpartPlot(carseats.prune, uniform=TRUE, main="Pruned Tree")

tree.pred <- predict(carseats.prune, Carseats[-train, ])
mean((Carseats[-train, 'Sales'] - tree.pred) ^ 2)
## [1] 5.249431

The pruned tree gives a slightly higher test MSE (about 5.25 versus 5.17), so pruning does not improve the test MSE here.

(d) Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important.

library(randomForest)
## Warning: package 'randomForest' was built under R version 4.0.5
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:rattle':
## 
##     importance
set.seed(1)
Carseats.bag <- train(Sales ~ ., data = Car.train, method = "rf",
                      trControl = trainControl("cv", number = 10), importance = TRUE)
Carseats.bag
## Random Forest 
## 
## 200 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 180, 179, 180, 180, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE     
##    2    2.003846  0.6174311  1.657741
##    6    1.741101  0.6626650  1.430498
##   11    1.696706  0.6642040  1.387246
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 11.
varImp(Carseats.bag)
## rf variable importance
## 
##                 Overall
## Price           100.000
## ShelveLocGood    82.102
## CompPrice        47.497
## ShelveLocMedium  32.811
## Age              31.915
## Advertising      26.258
## Income           11.761
## USYes            11.519
## Education         3.428
## UrbanYes          2.207
## Population        0.000
bag.pred <- predict(Carseats.bag, Carseats[-train, ])
mean((Carseats[-train, 'Sales'] - bag.pred) ^ 2)
## [1] 2.656762

Bagging improves the test MSE substantially, to about 2.66. Note that caret selected mtry = 11, i.e. all dummy-coded predictors, which corresponds to bagging, and the importance output shows Price and ShelveLoc as the dominant variables.
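
For comparison, bagging can also be fit directly with randomForest() by setting mtry to the full set of 10 predictors (a sketch reusing the split above; the object names are illustrative and the exact test MSE will differ slightly from the caret result):

set.seed(1)
bag.Carseats <- randomForest(Sales ~ ., data = Car.train, mtry = 10, importance = TRUE)
bag.yhat <- predict(bag.Carseats, newdata = Car.test)
mean((bag.yhat - Car.test$Sales)^2)   # test MSE for bagging
importance(bag.Carseats)              # Price and ShelveLoc should again rank highest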

(e) Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.

set.seed(1)
rf.CarseatsSales = randomForest(Sales~.,data=Car.train,mtry = 3, importance = TRUE)
yhat.rf = predict(rf.CarseatsSales,newdata=Car.test)
mean((yhat.rf-Car.test$Sales)^2)
## [1] 2.960559

With m = 3 the test MSE is about 2.96, slightly higher than the bagging result of about 2.66, so reducing m below the full set of predictors does not help on this data set. The importance measures and the effect of m are examined below.
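
importance() can be applied to the fitted forest, and the effect of m can be examined by refitting the forest over a grid of mtry values (a sketch; the individual MSE values depend on the random seed):

importance(rf.CarseatsSales)   # Price and ShelveLoc should dominate, as with bagging
varImpPlot(rf.CarseatsSales)

# Effect of m: refit for m = 1, ..., 10 and track the test MSE
test.mse <- sapply(1:10, function(m) {
  set.seed(1)
  fit <- randomForest(Sales ~ ., data = Car.train, mtry = m)
  mean((predict(fit, newdata = Car.test) - Car.test$Sales)^2)
})
plot(1:10, test.mse, type = "b", xlab = "m (predictors tried at each split)", ylab = "Test MSE")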

9. This problem involves the OJ data set which is part of the ISLR package.

(a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.

attach(OJ)
str(OJ)
## 'data.frame':    1070 obs. of  18 variables:
##  $ Purchase      : Factor w/ 2 levels "CH","MM": 1 1 1 2 1 1 1 1 1 1 ...
##  $ WeekofPurchase: num  237 239 245 227 228 230 232 234 235 238 ...
##  $ StoreID       : num  1 1 1 1 7 7 7 7 7 7 ...
##  $ PriceCH       : num  1.75 1.75 1.86 1.69 1.69 1.69 1.69 1.75 1.75 1.75 ...
##  $ PriceMM       : num  1.99 1.99 2.09 1.69 1.69 1.99 1.99 1.99 1.99 1.99 ...
##  $ DiscCH        : num  0 0 0.17 0 0 0 0 0 0 0 ...
##  $ DiscMM        : num  0 0.3 0 0 0 0 0.4 0.4 0.4 0.4 ...
##  $ SpecialCH     : num  0 0 0 0 0 0 1 1 0 0 ...
##  $ SpecialMM     : num  0 1 0 0 0 1 1 0 0 0 ...
##  $ LoyalCH       : num  0.5 0.6 0.68 0.4 0.957 ...
##  $ SalePriceMM   : num  1.99 1.69 2.09 1.69 1.69 1.99 1.59 1.59 1.59 1.59 ...
##  $ SalePriceCH   : num  1.75 1.75 1.69 1.69 1.69 1.69 1.69 1.75 1.75 1.75 ...
##  $ PriceDiff     : num  0.24 -0.06 0.4 0 0 0.3 -0.1 -0.16 -0.16 -0.16 ...
##  $ Store7        : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 2 2 2 2 2 ...
##  $ PctDiscMM     : num  0 0.151 0 0 0 ...
##  $ PctDiscCH     : num  0 0 0.0914 0 0 ...
##  $ ListPriceDiff : num  0.24 0.24 0.23 0 0 0.3 0.3 0.24 0.24 0.24 ...
##  $ STORE         : num  1 1 1 1 0 0 0 0 0 0 ...
inTrain=createDataPartition(OJ$Purchase,p=0.75,list=FALSE)
OJ.train = OJ[inTrain,]
OJ.test = OJ[-inTrain, ]
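
Note that createDataPartition() with p = 0.75 yields 803 training observations rather than the 800 the exercise asks for. An exact 800-observation split could be drawn instead (a sketch; the names OJ.train800/OJ.test800 are illustrative, and the analysis below keeps the caret split):

set.seed(1)
train800 <- sample(nrow(OJ), 800)
OJ.train800 <- OJ[train800, ]
OJ.test800 <- OJ[-train800, ]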

(b) Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?

OJtrain.tree = rpart(Purchase~.,data=OJ.train, method = "class", control = rpart.control(minsplit=15, cp=0.01))
summary(OJtrain.tree)
## Call:
## rpart(formula = Purchase ~ ., data = OJ.train, method = "class", 
##     control = rpart.control(minsplit = 15, cp = 0.01))
##   n= 803 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.46325879      0 1.0000000 1.0000000 0.04415380
## 2 0.03035144      1 0.5367412 0.5591054 0.03737634
## 3 0.02875399      3 0.4760383 0.5623003 0.03745312
## 4 0.01597444      4 0.4472843 0.5143770 0.03624752
## 5 0.01000000      6 0.4153355 0.4952077 0.03573150
## 
## Variable importance
##        LoyalCH      PriceDiff        StoreID    SalePriceMM         DiscMM 
##             46              9              8              7              6 
##      PctDiscMM        PriceMM WeekofPurchase        PriceCH      SpecialMM 
##              6              5              4              3              3 
##  ListPriceDiff    SalePriceCH          STORE 
##              2              2              1 
## 
## Node number 1: 803 observations,    complexity param=0.4632588
##   predicted class=CH  expected loss=0.3897883  P(node) =1
##     class counts:   490   313
##    probabilities: 0.610 0.390 
##   left son=2 (450 obs) right son=3 (353 obs)
##   Primary splits:
##       LoyalCH   < 0.5036    to the right, improve=125.47740, (0 missing)
##       StoreID   < 3.5       to the right, improve= 42.81586, (0 missing)
##       Store7    splits as  RL, improve= 21.60041, (0 missing)
##       STORE     < 0.5       to the left,  improve= 21.60041, (0 missing)
##       PriceDiff < 0.015     to the right, improve= 19.99179, (0 missing)
##   Surrogate splits:
##       StoreID        < 3.5       to the right, agree=0.653, adj=0.210, (0 split)
##       WeekofPurchase < 248.5     to the right, agree=0.604, adj=0.099, (0 split)
##       PriceMM        < 1.89      to the right, agree=0.590, adj=0.068, (0 split)
##       SpecialMM      < 0.5       to the left,  agree=0.582, adj=0.048, (0 split)
##       PriceCH        < 1.72      to the right, agree=0.580, adj=0.045, (0 split)
## 
## Node number 2: 450 observations,    complexity param=0.02875399
##   predicted class=CH  expected loss=0.1422222  P(node) =0.5603985
##     class counts:   386    64
##    probabilities: 0.858 0.142 
##   left son=4 (421 obs) right son=5 (29 obs)
##   Primary splits:
##       PriceDiff   < -0.39     to the right, improve=16.312060, (0 missing)
##       LoyalCH     < 0.705699  to the right, improve=13.957150, (0 missing)
##       DiscMM      < 0.47      to the left,  improve= 8.912502, (0 missing)
##       PctDiscMM   < 0.227263  to the left,  improve= 8.912502, (0 missing)
##       SalePriceMM < 1.585     to the right, improve= 8.673878, (0 missing)
##   Surrogate splits:
##       DiscMM      < 0.72      to the left,  agree=0.969, adj=0.517, (0 split)
##       SalePriceMM < 1.435     to the right, agree=0.969, adj=0.517, (0 split)
##       PctDiscMM   < 0.3342595 to the left,  agree=0.969, adj=0.517, (0 split)
##       SalePriceCH < 2.075     to the left,  agree=0.944, adj=0.138, (0 split)
## 
## Node number 3: 353 observations,    complexity param=0.03035144
##   predicted class=MM  expected loss=0.2946176  P(node) =0.4396015
##     class counts:   104   249
##    probabilities: 0.295 0.705 
##   left son=6 (190 obs) right son=7 (163 obs)
##   Primary splits:
##       LoyalCH   < 0.2761415 to the right, improve=15.437140, (0 missing)
##       StoreID   < 3.5       to the right, improve= 8.247835, (0 missing)
##       PriceDiff < 0.31      to the right, improve= 5.579407, (0 missing)
##       DiscCH    < 0.255     to the right, improve= 4.700327, (0 missing)
##       PctDiscCH < 0.132882  to the right, improve= 4.700327, (0 missing)
##   Surrogate splits:
##       STORE       < 1.5       to the left,  agree=0.601, adj=0.135, (0 split)
##       SalePriceCH < 1.875     to the left,  agree=0.589, adj=0.110, (0 split)
##       PriceCH     < 1.925     to the left,  agree=0.581, adj=0.092, (0 split)
##       StoreID     < 3.5       to the right, agree=0.561, adj=0.049, (0 split)
##       PriceMM     < 2.205     to the left,  agree=0.555, adj=0.037, (0 split)
## 
## Node number 4: 421 observations
##   predicted class=CH  expected loss=0.1068884  P(node) =0.5242839
##     class counts:   376    45
##    probabilities: 0.893 0.107 
## 
## Node number 5: 29 observations,    complexity param=0.01597444
##   predicted class=MM  expected loss=0.3448276  P(node) =0.03611457
##     class counts:    10    19
##    probabilities: 0.345 0.655 
##   left son=10 (11 obs) right son=11 (18 obs)
##   Primary splits:
##       LoyalCH   < 0.742157  to the right, improve=5.1842560, (0 missing)
##       STORE     < 1.5       to the right, improve=1.4266810, (0 missing)
##       SpecialCH < 0.5       to the left,  improve=0.7527989, (0 missing)
##       Store7    splits as  LR, improve=0.6402904, (0 missing)
##       StoreID   < 5.5       to the left,  improve=0.6402904, (0 missing)
##   Surrogate splits:
##       PriceCH     < 2.04      to the right, agree=0.759, adj=0.364, (0 split)
##       DiscMM      < 0.47      to the left,  agree=0.759, adj=0.364, (0 split)
##       SalePriceMM < 1.64      to the right, agree=0.759, adj=0.364, (0 split)
##       SalePriceCH < 2.04      to the right, agree=0.759, adj=0.364, (0 split)
##       PctDiscMM   < 0.2224545 to the left,  agree=0.759, adj=0.364, (0 split)
## 
## Node number 6: 190 observations,    complexity param=0.03035144
##   predicted class=MM  expected loss=0.4315789  P(node) =0.2366127
##     class counts:    82   108
##    probabilities: 0.432 0.568 
##   left son=12 (113 obs) right son=13 (77 obs)
##   Primary splits:
##       PriceDiff   < 0.05      to the right, improve=12.967750, (0 missing)
##       SalePriceMM < 1.94      to the right, improve=10.809290, (0 missing)
##       DiscMM      < 0.22      to the left,  improve= 7.498238, (0 missing)
##       PctDiscMM   < 0.0729725 to the left,  improve= 6.619742, (0 missing)
##       SpecialMM   < 0.5       to the left,  improve= 4.495811, (0 missing)
##   Surrogate splits:
##       SalePriceMM   < 1.94      to the right, agree=0.942, adj=0.857, (0 split)
##       DiscMM        < 0.08      to the left,  agree=0.832, adj=0.584, (0 split)
##       PctDiscMM     < 0.038887  to the left,  agree=0.832, adj=0.584, (0 split)
##       ListPriceDiff < 0.135     to the right, agree=0.795, adj=0.494, (0 split)
##       PriceMM       < 2.04      to the right, agree=0.763, adj=0.416, (0 split)
## 
## Node number 7: 163 observations
##   predicted class=MM  expected loss=0.1349693  P(node) =0.2029888
##     class counts:    22   141
##    probabilities: 0.135 0.865 
## 
## Node number 10: 11 observations
##   predicted class=CH  expected loss=0.2727273  P(node) =0.01369863
##     class counts:     8     3
##    probabilities: 0.727 0.273 
## 
## Node number 11: 18 observations
##   predicted class=MM  expected loss=0.1111111  P(node) =0.02241594
##     class counts:     2    16
##    probabilities: 0.111 0.889 
## 
## Node number 12: 113 observations,    complexity param=0.01597444
##   predicted class=CH  expected loss=0.4159292  P(node) =0.1407223
##     class counts:    66    47
##    probabilities: 0.584 0.416 
##   left son=24 (100 obs) right son=25 (13 obs)
##   Primary splits:
##       SpecialMM < 0.5       to the left,  improve=2.2441930, (0 missing)
##       LoyalCH   < 0.303104  to the left,  improve=2.1923740, (0 missing)
##       StoreID   < 3.5       to the right, improve=1.7916910, (0 missing)
##       SpecialCH < 0.5       to the right, improve=1.0263660, (0 missing)
##       STORE     < 1.5       to the left,  improve=0.8888444, (0 missing)
##   Surrogate splits:
##       WeekofPurchase < 275       to the left,  agree=0.929, adj=0.385, (0 split)
##       DiscCH         < 0.485     to the left,  agree=0.920, adj=0.308, (0 split)
##       PriceDiff      < 0.195     to the right, agree=0.894, adj=0.077, (0 split)
##       ListPriceDiff  < 0.195     to the right, agree=0.894, adj=0.077, (0 split)
## 
## Node number 13: 77 observations
##   predicted class=MM  expected loss=0.2077922  P(node) =0.09589041
##     class counts:    16    61
##    probabilities: 0.208 0.792 
## 
## Node number 24: 100 observations
##   predicted class=CH  expected loss=0.38  P(node) =0.124533
##     class counts:    62    38
##    probabilities: 0.620 0.380 
## 
## Node number 25: 13 observations
##   predicted class=MM  expected loss=0.3076923  P(node) =0.01618929
##     class counts:     4     9
##    probabilities: 0.308 0.692

(c) Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.

OJtrain.tree
## n= 803 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 803 313 CH (0.6102117 0.3897883)  
##    2) LoyalCH>=0.5036 450  64 CH (0.8577778 0.1422222)  
##      4) PriceDiff>=-0.39 421  45 CH (0.8931116 0.1068884) *
##      5) PriceDiff< -0.39 29  10 MM (0.3448276 0.6551724)  
##       10) LoyalCH>=0.742157 11   3 CH (0.7272727 0.2727273) *
##       11) LoyalCH< 0.742157 18   2 MM (0.1111111 0.8888889) *
##    3) LoyalCH< 0.5036 353 104 MM (0.2946176 0.7053824)  
##      6) LoyalCH>=0.2761415 190  82 MM (0.4315789 0.5684211)  
##       12) PriceDiff>=0.05 113  47 CH (0.5840708 0.4159292)  
##         24) SpecialMM< 0.5 100  38 CH (0.6200000 0.3800000) *
##         25) SpecialMM>=0.5 13   4 MM (0.3076923 0.6923077) *
##       13) PriceDiff< 0.05 77  16 MM (0.2077922 0.7922078) *
##      7) LoyalCH< 0.2761415 163  22 MM (0.1349693 0.8650307) *

Consider terminal node 4, which is reached when LoyalCH >= 0.5036 and PriceDiff >= -0.39. It contains 421 training observations, of which 45 are misclassified; the predicted class is CH, with class proportions of about 0.89 for CH and 0.11 for MM.

(d) Create a plot of the tree, and interpret the results.

fancyRpartPlot(OJtrain.tree)

LoyalCH is by far the most important variable for predicting Purchase: it defines the first split and two further splits lower in the tree. The remaining splits use PriceDiff and SpecialMM.

(e) Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?

OJtrain.pred = predict(OJtrain.tree, OJ.test, type = "class")
table(OJ.test$Purchase, OJtrain.pred)
##     OJtrain.pred
##       CH  MM
##   CH 150  13
##   MM  31  73
(150+73)/267
## [1] 0.835206

The test error rate is about 16.5% ((13 + 31)/267 ≈ 0.165).
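
Equivalently, the test error rate can be computed directly from the predictions (a sketch):

mean(OJtrain.pred != OJ.test$Purchase)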

(f) Apply the cv.tree() function to the training set in order to determine the optimal tree size.

OJ.tree = tree(Purchase~., data=OJ.train)
cv.OJ = cv.tree(OJ.tree, FUN = prune.tree)
cv.OJ
## $size
## [1] 8 7 6 5 4 3 2 1
## 
## $dev
## [1]  719.1450  711.2059  710.0345  736.7886  742.4371  770.8040  820.5206
## [8] 1074.6414
## 
## $k
## [1]      -Inf  13.40952  14.63721  27.68891  31.87708  39.16834  57.67995
## [8] 277.77625
## 
## $method
## [1] "deviance"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"

(g) Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.

plot(cv.OJ$size,cv.OJ$dev,type='b', xlab = "Tree size", ylab = "Deviance")

(h) Which tree size corresponds to the lowest cross-validated classification error rate?

In this case, a tree of size 6 has the lowest cross-validated deviance (about 710), so size 6 is optimal, with size 7 close behind.
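
The optimal size can also be read off the cv.OJ object programmatically:

cv.OJ$size[which.min(cv.OJ$dev)]
## [1] 6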

(i) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.

OJtrain.prune=prune(OJtrain.tree,cp=OJtrain.tree$cptable[which.min(OJtrain.tree$cptable[,"xerror"]),"CP"])
fancyRpartPlot(OJtrain.prune, uniform=TRUE, main="Pruned Classification Tree")

OJtrain.prune
## n= 803 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 803 313 CH (0.6102117 0.3897883)  
##    2) LoyalCH>=0.5036 450  64 CH (0.8577778 0.1422222)  
##      4) PriceDiff>=-0.39 421  45 CH (0.8931116 0.1068884) *
##      5) PriceDiff< -0.39 29  10 MM (0.3448276 0.6551724)  
##       10) LoyalCH>=0.742157 11   3 CH (0.7272727 0.2727273) *
##       11) LoyalCH< 0.742157 18   2 MM (0.1111111 0.8888889) *
##    3) LoyalCH< 0.5036 353 104 MM (0.2946176 0.7053824)  
##      6) LoyalCH>=0.2761415 190  82 MM (0.4315789 0.5684211)  
##       12) PriceDiff>=0.05 113  47 CH (0.5840708 0.4159292)  
##         24) SpecialMM< 0.5 100  38 CH (0.6200000 0.3800000) *
##         25) SpecialMM>=0.5 13   4 MM (0.3076923 0.6923077) *
##       13) PriceDiff< 0.05 77  16 MM (0.2077922 0.7922078) *
##      7) LoyalCH< 0.2761415 163  22 MM (0.1349693 0.8650307) *
printcp(OJtrain.prune)
## 
## Classification tree:
## rpart(formula = Purchase ~ ., data = OJ.train, method = "class", 
##     control = rpart.control(minsplit = 15, cp = 0.01))
## 
## Variables actually used in tree construction:
## [1] LoyalCH   PriceDiff SpecialMM
## 
## Root node error: 313/803 = 0.38979
## 
## n= 803 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.463259      0   1.00000 1.00000 0.044154
## 2 0.030351      1   0.53674 0.55911 0.037376
## 3 0.028754      3   0.47604 0.56230 0.037453
## 4 0.015974      4   0.44728 0.51438 0.036248
## 5 0.010000      6   0.41534 0.49521 0.035731
plotcp(OJtrain.prune)
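
Because cross-validation keeps the full tree here (the minimum xerror occurs at cp = 0.01), OJtrain.prune is identical to the unpruned tree. Following the exercise, a tree with five terminal nodes can still be produced, for example from the tree-package fit of part (f) (a sketch):

OJ.prune5 <- prune.misclass(OJ.tree, best = 5)
plot(OJ.prune5)
text(OJ.prune5, pretty = 0)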

(j) Compare the training error rates between the pruned and unpruned trees. Which is higher?

tree.pred = predict(OJtrain.prune, newdata = OJ.test, type = "class")
table(tree.pred,OJ.test$Purchase)
##          
## tree.pred  CH  MM
##        CH 150  31
##        MM  13  73

Because the pruned tree is identical to the unpruned tree, the training error rates are exactly the same: 130 of the 803 training observations are misclassified by both (about 16.2%), so neither is higher. Note that the confusion matrix above is computed on the test set; the training error can be checked directly as sketched below.
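
The training error rates can also be computed directly (a sketch; predict() without newdata returns fitted values for the training set):

mean(predict(OJtrain.tree, type = "class") != OJ.train$Purchase)    # unpruned
mean(predict(OJtrain.prune, type = "class") != OJ.train$Purchase)   # pruned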

(k) Compare the test error rates between the pruned and unpruned trees. Which is higher?

The test error rates are also identical: since the pruned tree equals the unpruned tree, both give the same test confusion matrix, with a test error rate of (13 + 31)/267 ≈ 16.5%. A direct check is sketched below.
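
A direct comparison of the two test error rates (a sketch):

mean(predict(OJtrain.tree, newdata = OJ.test, type = "class") != OJ.test$Purchase)    # unpruned
mean(predict(OJtrain.prune, newdata = OJ.test, type = "class") != OJ.test$Purchase)   # pruned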