# Compare the three node-impurity measures for a two-class problem as a
# function of pm1, the proportion of observations in class 1.
p <- seq(0, 1, 0.01)
gini <- 2 * p * (1 - p)
classerror <- 1 - pmax(p, 1 - p)
# cross-entropy is NaN at p = 0 and p = 1 (0 * log(0)); lines() simply skips those points
crossentropy <- -(p * log(p) + (1 - p) * log(1 - p))
plot(NA, NA, xlim = c(0, 1), ylim = c(0, 1),
     xlab = 'pm1 (proportion in class 1)', ylab = 'impurity')
lines(p, gini, col = 'purple')
lines(p, classerror, col = 'darkgreen')
lines(p, crossentropy, col = 'darkblue')
legend(x = 'top', legend = c('gini', 'class error', 'cross entropy'),
       col = c('purple', 'darkgreen', 'darkblue'), lty = 1, text.width = .4)
grid()
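Because the raw cross-entropy peaks at log 2 ≈ 0.693 while the Gini index and classification error both peak at 0.5, an optional rescaled cross-entropy curve can make the three measures easier to compare; a minimal sketch (not required by the exercise):
# optional overlay: rescale cross-entropy so its maximum (log 2 at p = 0.5)
# lines up with the 0.5 peak of the Gini and classification-error curves
lines(p, crossentropy * 0.5 / log(2), col = 'darkblue', lty = 2)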
library(ISLR)    # provides the OJ data set
library(rpart)   # rpart() classification trees used below
attach(OJ)       # optional here, since every fit below passes data = explicitly
(a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
dim(OJ)
## [1] 1070 18
set.seed(1)
inTrain=sample(1:nrow(OJ),800)
train=OJ[inTrain,]
test=OJ[-inTrain,]
(b) Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?
# a control object like this could be passed via rpart(..., control = control);
# the fit below (and the output that follows) uses rpart's defaults
control <- rpart.control(minsplit = 15, cp = 0.1)
tree.oj <- rpart(Purchase ~ ., data = train, method = "class")
summary(tree.oj)
## Call:
## rpart(formula = Purchase ~ ., data = train, method = "class")
## n= 800
##
## CP nsplit rel error xerror xstd
## 1 0.50476190 0 1.0000000 1.0000000 0.04387030
## 2 0.01904762 1 0.4952381 0.5365079 0.03665241
## 3 0.01587302 4 0.4253968 0.5174603 0.03616662
## 4 0.01269841 6 0.3936508 0.5174603 0.03616662
## 5 0.01000000 8 0.3682540 0.5047619 0.03583208
##
## Variable importance
## LoyalCH PriceDiff SalePriceMM PriceMM StoreID
## 46 8 8 6 6
## DiscMM WeekofPurchase PctDiscMM PriceCH ListPriceDiff
## 5 5 5 3 2
## STORE SpecialCH SpecialMM Store7 SalePriceCH
## 2 2 1 1 1
##
## Node number 1: 800 observations, complexity param=0.5047619
## predicted class=CH expected loss=0.39375 P(node) =1
## class counts: 485 315
## probabilities: 0.606 0.394
## left son=2 (489 obs) right son=3 (311 obs)
## Primary splits:
## LoyalCH < 0.48285 to the right, improve=133.25810, (0 missing)
## StoreID < 3.5 to the right, improve= 44.40685, (0 missing)
## Store7 splits as RL, improve= 28.30298, (0 missing)
## STORE < 0.5 to the left, improve= 28.30298, (0 missing)
## PriceDiff < 0.015 to the right, improve= 22.29786, (0 missing)
## Surrogate splits:
## StoreID < 3.5 to the right, agree=0.646, adj=0.090, (0 split)
## PriceMM < 1.89 to the right, agree=0.628, adj=0.042, (0 split)
## WeekofPurchase < 236.5 to the right, agree=0.625, adj=0.035, (0 split)
## PriceCH < 1.72 to the right, agree=0.622, adj=0.029, (0 split)
## SpecialMM < 0.5 to the left, agree=0.619, adj=0.019, (0 split)
##
## Node number 2: 489 observations, complexity param=0.01904762
## predicted class=CH expected loss=0.1635992 P(node) =0.61125
## class counts: 409 80
## probabilities: 0.836 0.164
## left son=4 (261 obs) right son=5 (228 obs)
## Primary splits:
## LoyalCH < 0.7645725 to the right, improve=16.514490, (0 missing)
## PriceDiff < 0.015 to the right, improve=14.720370, (0 missing)
## SalePriceMM < 1.84 to the right, improve=10.965130, (0 missing)
## ListPriceDiff < 0.255 to the right, improve= 8.289196, (0 missing)
## SpecialMM < 0.5 to the left, improve= 7.093301, (0 missing)
## Surrogate splits:
## WeekofPurchase < 257.5 to the right, agree=0.607, adj=0.158, (0 split)
## SalePriceMM < 1.84 to the right, agree=0.595, adj=0.132, (0 split)
## PriceMM < 2.04 to the right, agree=0.593, adj=0.127, (0 split)
## PriceDiff < 0.015 to the right, agree=0.593, adj=0.127, (0 split)
## PriceCH < 1.825 to the right, agree=0.589, adj=0.118, (0 split)
##
## Node number 3: 311 observations, complexity param=0.01587302
## predicted class=MM expected loss=0.244373 P(node) =0.38875
## class counts: 76 235
## probabilities: 0.244 0.756
## left son=6 (134 obs) right son=7 (177 obs)
## Primary splits:
## LoyalCH < 0.280875 to the right, improve=9.721989, (0 missing)
## PriceDiff < 0.49 to the right, improve=6.531048, (0 missing)
## STORE < 1.5 to the left, improve=6.506024, (0 missing)
## StoreID < 3.5 to the right, improve=6.184411, (0 missing)
## Store7 splits as RL, improve=5.771670, (0 missing)
## Surrogate splits:
## STORE < 1.5 to the left, agree=0.624, adj=0.127, (0 split)
## StoreID < 1.5 to the left, agree=0.617, adj=0.112, (0 split)
## SalePriceCH < 1.775 to the left, agree=0.595, adj=0.060, (0 split)
## PriceDiff < 0.325 to the right, agree=0.592, adj=0.052, (0 split)
## WeekofPurchase < 275.5 to the right, agree=0.588, adj=0.045, (0 split)
##
## Node number 4: 261 observations
## predicted class=CH expected loss=0.04214559 P(node) =0.32625
## class counts: 250 11
## probabilities: 0.958 0.042
##
## Node number 5: 228 observations, complexity param=0.01904762
## predicted class=CH expected loss=0.3026316 P(node) =0.285
## class counts: 159 69
## probabilities: 0.697 0.303
## left son=10 (148 obs) right son=11 (80 obs)
## Primary splits:
## PriceDiff < 0.015 to the right, improve=18.285490, (0 missing)
## ListPriceDiff < 0.235 to the right, improve=16.816390, (0 missing)
## SalePriceMM < 1.84 to the right, improve=13.398910, (0 missing)
## SpecialMM < 0.5 to the left, improve= 8.988505, (0 missing)
## DiscMM < 0.15 to the left, improve= 8.823708, (0 missing)
## Surrogate splits:
## SalePriceMM < 1.84 to the right, agree=0.961, adj=0.888, (0 split)
## PctDiscMM < 0.1155095 to the left, agree=0.890, adj=0.688, (0 split)
## DiscMM < 0.15 to the left, agree=0.873, adj=0.638, (0 split)
## PriceMM < 2.04 to the right, agree=0.794, adj=0.413, (0 split)
## ListPriceDiff < 0.18 to the right, agree=0.789, adj=0.400, (0 split)
##
## Node number 6: 134 observations, complexity param=0.01587302
## predicted class=MM expected loss=0.3880597 P(node) =0.1675
## class counts: 52 82
## probabilities: 0.388 0.612
## left son=12 (58 obs) right son=13 (76 obs)
## Primary splits:
## SalePriceMM < 2.04 to the right, improve=8.030176, (0 missing)
## PriceDiff < 0.05 to the right, improve=5.930605, (0 missing)
## DiscMM < 0.22 to the left, improve=4.398151, (0 missing)
## PctDiscMM < 0.0729725 to the left, improve=4.080526, (0 missing)
## SpecialCH < 0.5 to the right, improve=4.027225, (0 missing)
## Surrogate splits:
## PriceDiff < 0.135 to the right, agree=0.896, adj=0.759, (0 split)
## PriceMM < 2.04 to the right, agree=0.799, adj=0.534, (0 split)
## DiscMM < 0.08 to the left, agree=0.784, adj=0.500, (0 split)
## PctDiscMM < 0.038887 to the left, agree=0.784, adj=0.500, (0 split)
## WeekofPurchase < 244 to the right, agree=0.739, adj=0.397, (0 split)
##
## Node number 7: 177 observations
## predicted class=MM expected loss=0.1355932 P(node) =0.22125
## class counts: 24 153
## probabilities: 0.136 0.864
##
## Node number 10: 148 observations
## predicted class=CH expected loss=0.1554054 P(node) =0.185
## class counts: 125 23
## probabilities: 0.845 0.155
##
## Node number 11: 80 observations, complexity param=0.01904762
## predicted class=MM expected loss=0.425 P(node) =0.1
## class counts: 34 46
## probabilities: 0.425 0.575
## left son=22 (38 obs) right son=23 (42 obs)
## Primary splits:
## StoreID < 3.5 to the right, improve=6.177694, (0 missing)
## ListPriceDiff < 0.235 to the right, improve=4.729091, (0 missing)
## WeekofPurchase < 240.5 to the left, improve=4.130644, (0 missing)
## DiscMM < 0.47 to the left, improve=3.141026, (0 missing)
## PctDiscMM < 0.227263 to the left, improve=3.141026, (0 missing)
## Surrogate splits:
## Store7 splits as RL, agree=0.850, adj=0.684, (0 split)
## STORE < 0.5 to the left, agree=0.850, adj=0.684, (0 split)
## WeekofPurchase < 238 to the left, agree=0.775, adj=0.526, (0 split)
## SpecialMM < 0.5 to the left, agree=0.725, adj=0.421, (0 split)
## PriceCH < 1.825 to the left, agree=0.688, adj=0.342, (0 split)
##
## Node number 12: 58 observations
## predicted class=CH expected loss=0.4137931 P(node) =0.0725
## class counts: 34 24
## probabilities: 0.586 0.414
##
## Node number 13: 76 observations, complexity param=0.01269841
## predicted class=MM expected loss=0.2368421 P(node) =0.095
## class counts: 18 58
## probabilities: 0.237 0.763
## left son=26 (12 obs) right son=27 (64 obs)
## Primary splits:
## SpecialCH < 0.5 to the right, improve=5.265351, (0 missing)
## PriceDiff < -0.24 to the right, improve=1.925297, (0 missing)
## WeekofPurchase < 234 to the left, improve=1.903618, (0 missing)
## DiscMM < 0.47 to the left, improve=1.598684, (0 missing)
## PctDiscMM < 0.227263 to the left, improve=1.598684, (0 missing)
## Surrogate splits:
## DiscCH < 0.25 to the right, agree=0.868, adj=0.167, (0 split)
## SalePriceCH < 1.49 to the left, agree=0.868, adj=0.167, (0 split)
## PctDiscCH < 0.1366045 to the right, agree=0.868, adj=0.167, (0 split)
##
## Node number 22: 38 observations, complexity param=0.01269841
## predicted class=CH expected loss=0.3684211 P(node) =0.0475
## class counts: 24 14
## probabilities: 0.632 0.368
## left son=44 (30 obs) right son=45 (8 obs)
## Primary splits:
## WeekofPurchase < 272.5 to the left, improve=2.950877, (0 missing)
## PriceCH < 1.89 to the left, improve=1.455639, (0 missing)
## PriceMM < 2.04 to the left, improve=1.455639, (0 missing)
## LoyalCH < 0.5039495 to the right, improve=1.455639, (0 missing)
## SalePriceCH < 1.89 to the left, improve=1.455639, (0 missing)
## Surrogate splits:
## PriceCH < 1.89 to the left, agree=0.947, adj=0.75, (0 split)
## PriceMM < 2.04 to the left, agree=0.947, adj=0.75, (0 split)
## SalePriceCH < 1.89 to the left, agree=0.947, adj=0.75, (0 split)
## PriceDiff < -0.25 to the right, agree=0.947, adj=0.75, (0 split)
## DiscMM < 0.47 to the left, agree=0.895, adj=0.50, (0 split)
##
## Node number 23: 42 observations
## predicted class=MM expected loss=0.2380952 P(node) =0.0525
## class counts: 10 32
## probabilities: 0.238 0.762
##
## Node number 26: 12 observations
## predicted class=CH expected loss=0.3333333 P(node) =0.015
## class counts: 8 4
## probabilities: 0.667 0.333
##
## Node number 27: 64 observations
## predicted class=MM expected loss=0.15625 P(node) =0.08
## class counts: 10 54
## probabilities: 0.156 0.844
##
## Node number 44: 30 observations
## predicted class=CH expected loss=0.2666667 P(node) =0.0375
## class counts: 22 8
## probabilities: 0.733 0.267
##
## Node number 45: 8 observations
## predicted class=MM expected loss=0.25 P(node) =0.01
## class counts: 2 6
## probabilities: 0.250 0.750
The tree has 9 terminal nodes and uses LoyalCH, PriceDiff, SalePriceMM, SpecialCH, StoreID, and WeekofPurchase. The training error rate is the root-node error times the final relative error, 0.39375 × 0.3683 ≈ 0.145. The cross-validated relative error of 0.5048 corresponds to an error rate of about 0.39375 × 0.5048 ≈ 0.199.
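A quick sanity check of the training error rate, re-using the tree.oj and train objects from above:
# training error rate: fraction of training observations misclassified by the full tree
train.pred <- predict(tree.oj, newdata = train, type = "class")
mean(train.pred != train$Purchase)   # roughly 0.145 (= 0.39375 * 0.3683)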
(c) Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.
tree.oj
## n= 800
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 315 CH (0.60625000 0.39375000)
## 2) LoyalCH>=0.48285 489 80 CH (0.83640082 0.16359918)
## 4) LoyalCH>=0.7645725 261 11 CH (0.95785441 0.04214559) *
## 5) LoyalCH< 0.7645725 228 69 CH (0.69736842 0.30263158)
## 10) PriceDiff>=0.015 148 23 CH (0.84459459 0.15540541) *
## 11) PriceDiff< 0.015 80 34 MM (0.42500000 0.57500000)
## 22) StoreID>=3.5 38 14 CH (0.63157895 0.36842105)
## 44) WeekofPurchase< 272.5 30 8 CH (0.73333333 0.26666667) *
## 45) WeekofPurchase>=272.5 8 2 MM (0.25000000 0.75000000) *
## 23) StoreID< 3.5 42 10 MM (0.23809524 0.76190476) *
## 3) LoyalCH< 0.48285 311 76 MM (0.24437299 0.75562701)
## 6) LoyalCH>=0.280875 134 52 MM (0.38805970 0.61194030)
## 12) SalePriceMM>=2.04 58 24 CH (0.58620690 0.41379310) *
## 13) SalePriceMM< 2.04 76 18 MM (0.23684211 0.76315789)
## 26) SpecialCH>=0.5 12 4 CH (0.66666667 0.33333333) *
## 27) SpecialCH< 0.5 64 10 MM (0.15625000 0.84375000) *
## 7) LoyalCH< 0.280875 177 24 MM (0.13559322 0.86440678) *
Terminal node 4 is reached by splitting on LoyalCH at the root (LoyalCH >= 0.48285) and again at node 2 (LoyalCH >= 0.7645725). It contains 261 training observations, of which 250 are CH and 11 are MM, so it predicts CH with a node misclassification rate of about 4.2%.
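The same per-node counts can be read off the fitted rpart object's frame; a small sketch (row names such as "4" correspond to the node numbers printed above):
# node-level summary: splitting variable, number of observations, misclassified
# count (dev), and fitted class code (yval) for each node in the printed tree
tree.oj$frame[, c("var", "n", "dev", "yval")]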
(d) Create a plot of the tree, and interpret the results.
library(rattle)
fancyRpartPlot(tree.oj)
Five terminal nodes classify as CH and four as MM. LoyalCH dominates the tree: observations with low Citrus Hill loyalty, especially in combination with an unfavorable price difference or sale price, tend to be classified as MM.
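If the rpart.plot package is available, it offers an alternative rendering that annotates each node with class probabilities and the share of training observations; an optional sketch, not required by the exercise:
# alternative plot: extra = 104 shows the predicted class, class probabilities,
# and the percentage of observations falling in each node
library(rpart.plot)
rpart.plot(tree.oj, type = 2, extra = 104)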
(e) Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?
oj.pred = predict(tree.oj, newdata = test, type = "class")
table(oj.pred, test$Purchase)
##
## oj.pred CH MM
## CH 154 35
## MM 14 67
(35+14)/270
## [1] 0.1814815
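The same test error rate can be computed directly from the prediction vector rather than by reading counts off the confusion matrix; a quick check using the objects above:
# test error rate = proportion of test observations with a wrong predicted label
mean(oj.pred != test$Purchase)   # 0.1814815, matching (35 + 14) / 270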
(f) Apply the cv.tree() function to the training set in order to determine the optimal tree size.
library(tree)
oj.tree2<-tree(Purchase~., data=train, method = "class")
cv.oj.tree2<-cv.tree(oj.tree2, FUN=prune.misclass)
summary(oj.tree2)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = train, method = "class")
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "SpecialCH" "ListPriceDiff"
## [5] "PctDiscMM"
## Number of terminal nodes: 9
## Residual mean deviance: 0.7432 = 587.8 / 791
## Misclassification error rate: 0.1588 = 127 / 800
cv.oj.tree2
## $size
## [1] 9 8 7 4 2 1
##
## $dev
## [1] 147 147 154 159 168 311
##
## $k
## [1] -Inf 0.000000 3.000000 4.333333 10.500000 151.000000
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
(g) Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.
plot(cv.oj.tree2$size, cv.oj.tree2$dev, type = "b", xlab = "Size", ylab = "CV Error Rate")
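To make the optimum easier to spot, the sizes with the lowest cross-validated deviance can be highlighted on the same plot (a small optional addition using the cv.oj.tree2 object above):
# mark every tree size that attains the minimum CV deviance (here, sizes 8 and 9)
best <- cv.oj.tree2$dev == min(cv.oj.tree2$dev)
points(cv.oj.tree2$size[best], cv.oj.tree2$dev[best], col = "red", pch = 19)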
(h) Which tree size corresponds to the lowest cross-validated classification error rate?
Tree sizes of 8 and 9 tie for the lowest cross-validated classification error (147 misclassified out of 800), so the smallest tree attaining the minimum has 8 terminal nodes.
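The same answer can be extracted programmatically from the cv.tree() result:
# smallest tree size whose CV deviance equals the minimum
min(cv.oj.tree2$size[cv.oj.tree2$dev == min(cv.oj.tree2$dev)])   # 8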
(i) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.
oj.tree2.prune<-prune.misclass(oj.tree2, best=7)
summary(oj.tree2.prune)
##
## Classification tree:
## snip.tree(tree = oj.tree2, nodes = c(4L, 10L))
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "ListPriceDiff" "PctDiscMM"
## Number of terminal nodes: 7
## Residual mean deviance: 0.7748 = 614.4 / 793
## Misclassification error rate: 0.1625 = 130 / 800
plot(oj.tree2.prune)
text(oj.tree2.prune)
(j) Compare the training error rates between the pruned and unpruned trees. Which is higher?
Pruned tree misclassification error rate: 0.1625 = 130 / 800. Unpruned tree misclassification error rate: 0.1588 = 127 / 800.
The pruned tree has the higher training error rate. This is expected: pruning removes splits that were chosen to fit the training data, so training error can only stay the same or increase.
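The two training error rates can also be computed directly from the fitted objects, re-using oj.tree2, oj.tree2.prune, and train from above:
# training error rates: unpruned vs. pruned tree
mean(predict(oj.tree2,       newdata = train, type = "class") != train$Purchase)  # 0.1588
mean(predict(oj.tree2.prune, newdata = train, type = "class") != train$Purchase)  # 0.1625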
(k) Compare the test error rates between the pruned and unpruned trees. Which is higher?
oj.tree2.prune.pred<-predict(oj.tree2.prune, newdata = test, type="class")
table(oj.tree2.prune.pred, test$Purchase)
##
## oj.tree2.prune.pred CH MM
## CH 160 36
## MM 8 66
#Pruned Test Error
(36+8)/270
## [1] 0.162963
oj.tree2.pred<-predict(oj.tree2, newdata = test, type="class")
table(oj.tree2.pred, test$Purchase)
##
## oj.tree2.pred CH MM
## CH 160 38
## MM 8 64
#Non-Pruned test error
(38+8)/270
## [1] 0.1703704
The unpruned tree has the higher test error (0.170 vs. 0.163 for the pruned tree), so pruning slightly improved test performance here.
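For a compact side-by-side comparison, both test error rates can be computed directly from the prediction vectors defined above:
# test error rates: pruned vs. unpruned, computed from the prediction vectors
c(pruned   = mean(oj.tree2.prune.pred != test$Purchase),   # 0.1630
  unpruned = mean(oj.tree2.pred       != test$Purchase))   # 0.1704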