library(ISLR2)
library(tree)
library(rpart)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
library(BART)
## Loading required package: nlme
## Loading required package: nnet
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
p = seq(0, 1, 0.01)
gini = p * (1 - p) * 2
entropy = -(p * log(p) + (1 - p) * log(1 - p))
class.err = 1 - pmax(p, 1 - p)
matplot(p, cbind(gini, entropy, class.err), col = c("blue", "red", "green"))
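The entropy expression above evaluates 0 * log(0) at p = 0 and p = 1, so its first and last values are NaN and drop out of the plot. A minimal sketch that sets those endpoints to zero under the convention 0 log 0 = 0; the helper name `binary_entropy` is hypothetical and not part of the original solution.

```r
# Hypothetical helper: two-class entropy with the convention 0 * log(0) = 0.
binary_entropy <- function(p) {
  term <- function(q) ifelse(q > 0, q * log(q), 0)
  -(term(p) + term(1 - p))
}

p <- seq(0, 1, 0.01)
gini <- 2 * p * (1 - p)
entropy <- binary_entropy(p)
class.err <- 1 - pmax(p, 1 - p)

# All three measures are zero for a pure node and largest at p = 0.5.
peaks <- c(gini = max(gini), entropy = max(entropy), class.err = max(class.err))
print(peaks)
```

The cleaned-up vectors can be passed to the same matplot() call as before.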
This problem involves the OJ data set, which is part of the ISLR2 package.
set.seed(1)
train = sample(nrow(OJ), 800)
OJtrain = OJ[train, ]
OJtest = OJ[-train, ]
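The split relies on negative indexing to put every row not drawn by sample() into the test set. The sketch below illustrates the same pattern on a stand-in data frame so that it runs without ISLR2; the object names `toy`, `toy_train` and `toy_test` are hypothetical, and the row count of 1070 is an assumption inferred from 800 training rows plus the 270 test rows that appear in the later confusion matrices.

```r
set.seed(1)
n <- 1070                          # assumed stand-in for nrow(OJ)
toy <- data.frame(id = seq_len(n)) # placeholder data, not the OJ data

train <- sample(n, 800)            # 800 distinct row indices
toy_train <- toy[train, , drop = FALSE]
toy_test <- toy[-train, , drop = FALSE]

# The two sets should be disjoint and together cover every row.
print(c(train = nrow(toy_train), test = nrow(toy_test)))
print(length(intersect(toy_train$id, toy_test$id)))
```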
Fit a tree to the training data, with Purchase as
the response and the other variables as predictors. Use the
summary() function to produce summary statistics about the
tree, and describe the results obtained. What is the training error
rate? How many terminal nodes does the tree have?
tree.OJ = tree(Purchase ~ ., data = OJtrain)
summary(tree.OJ)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = OJtrain)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "SpecialCH" "ListPriceDiff"
## [5] "PctDiscMM"
## Number of terminal nodes: 9
## Residual mean deviance: 0.7432 = 587.8 / 791
## Misclassification error rate: 0.1588 = 127 / 800
The tree uses five of the predictors (LoyalCH, PriceDiff, SpecialCH, ListPriceDiff and PctDiscMM) and has 9 terminal nodes. The training error rate is 0.1588 (127 of 800 observations misclassified).
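Both ratios reported by summary() can be reproduced by hand. The sketch below only redoes the arithmetic with numbers copied from the output above; the residual mean deviance divides the total deviance by the number of observations minus the number of terminal nodes. The variable names are illustrative.

```r
n_train <- 800
n_leaves <- 9
misclassified <- 127
total_deviance <- 587.8   # rounded value as printed by summary()

train_error <- misclassified / n_train
resid_mean_dev <- total_deviance / (n_train - n_leaves)

print(c(train_error = train_error, resid_mean_dev = resid_mean_dev))
```

Because 587.8 is itself rounded, the recomputed residual mean deviance can differ from the printed 0.7432 in the last digit.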
tree.OJ
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 1073.00 CH ( 0.60625 0.39375 )
## 2) LoyalCH < 0.5036 365 441.60 MM ( 0.29315 0.70685 )
## 4) LoyalCH < 0.280875 177 140.50 MM ( 0.13559 0.86441 )
## 8) LoyalCH < 0.0356415 59 10.14 MM ( 0.01695 0.98305 ) *
## 9) LoyalCH > 0.0356415 118 116.40 MM ( 0.19492 0.80508 ) *
## 5) LoyalCH > 0.280875 188 258.00 MM ( 0.44149 0.55851 )
## 10) PriceDiff < 0.05 79 84.79 MM ( 0.22785 0.77215 )
## 20) SpecialCH < 0.5 64 51.98 MM ( 0.14062 0.85938 ) *
## 21) SpecialCH > 0.5 15 20.19 CH ( 0.60000 0.40000 ) *
## 11) PriceDiff > 0.05 109 147.00 CH ( 0.59633 0.40367 ) *
## 3) LoyalCH > 0.5036 435 337.90 CH ( 0.86897 0.13103 )
## 6) LoyalCH < 0.764572 174 201.00 CH ( 0.73563 0.26437 )
## 12) ListPriceDiff < 0.235 72 99.81 MM ( 0.50000 0.50000 )
## 24) PctDiscMM < 0.196196 55 73.14 CH ( 0.61818 0.38182 ) *
## 25) PctDiscMM > 0.196196 17 12.32 MM ( 0.11765 0.88235 ) *
## 13) ListPriceDiff > 0.235 102 65.43 CH ( 0.90196 0.09804 ) *
## 7) LoyalCH > 0.764572 261 91.20 CH ( 0.95785 0.04215 ) *
Consider node 5, labelled LoyalCH > 0.280875. The split that leads to this node is on
LoyalCH at the value 0.280875, so the node holds the 188 training observations with
LoyalCH between 0.280875 and 0.5036, and its deviance is 258.0. The node predicts MM: 55.9% of its
observations are MM and 44.1% are CH. It carries no asterisk, so it is an internal
node, and it is split further on PriceDiff.
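The deviance printed for a node can be recovered from its class counts as -2 times the sum of n_k log(p_k). The sketch below does this for node 5; the counts are not printed by tree, so they are inferred here from the reported proportions and n = 188, which is an assumption.

```r
n_node <- 188
probs <- c(CH = 0.44149, MM = 0.55851)   # yprob as printed for node 5

counts <- round(n_node * probs)          # inferred class counts (assumption)
node_dev <- -2 * sum(counts * log(counts / n_node))

print(counts)
print(node_dev)
```

The result should agree with the printed deviance of 258.0 up to rounding.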
plot(tree.OJ)
text(tree.OJ, pretty = 0)
The plotted tree splits on LoyalCH, PriceDiff, SpecialCH,
PctDiscMM and ListPriceDiff. LoyalCH is the most important of these:
it defines the root split and both splits at the next level.
pred.OJ = predict(tree.OJ, OJtest, type = "class")
confusionMatrix(OJtest$Purchase, pred.OJ)
## Confusion Matrix and Statistics
##
## Reference
## Prediction CH MM
## CH 160 8
## MM 38 64
##
## Accuracy : 0.8296
## 95% CI : (0.7794, 0.8725)
## No Information Rate : 0.7333
## P-Value [Acc > NIR] : 0.0001259
##
## Kappa : 0.6154
##
## Mcnemar's Test P-Value : 1.904e-05
##
## Sensitivity : 0.8081
## Specificity : 0.8889
## Pos Pred Value : 0.9524
## Neg Pred Value : 0.6275
## Prevalence : 0.7333
## Detection Rate : 0.5926
## Detection Prevalence : 0.6222
## Balanced Accuracy : 0.8485
##
## 'Positive' Class : CH
##
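The test error rate is 1 - 0.8296 = 0.1704 (46 of 270 test observations misclassified). Note that confusionMatrix() expects the predicted classes as its first argument and the observed classes as its second. The call above passes them the other way round, so the rows labelled Prediction are actually the observed classes. The accuracy is unaffected, but the class-specific statistics (Sensitivity, Specificity and the predictive values) are computed with the two roles reversed.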
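The error rate can be checked directly from the four cell counts. In the sketch below the rows are taken to be the observed classes, because the call passes OJtest$Purchase first; that reading of the table, and the variable names, are assumptions of this sketch. It also computes sensitivity and specificity with the observed labels as the reference and CH as the positive class.

```r
# Counts copied from the table above; matrix() fills column by column.
cm <- matrix(c(160, 38, 8, 64), nrow = 2,
             dimnames = list(observed = c("CH", "MM"),
                             predicted = c("CH", "MM")))

accuracy <- sum(diag(cm)) / sum(cm)
test_error <- 1 - accuracy

# Observed labels as reference, CH as the positive class.
sensitivity <- cm["CH", "CH"] / sum(cm["CH", ])   # observed CH predicted as CH
specificity <- cm["MM", "MM"] / sum(cm["MM", ])   # observed MM predicted as MM

print(c(test_error = test_error, sensitivity = sensitivity,
        specificity = specificity))
```

Under this orientation the sensitivity and specificity coincide with the values printed above as Pos Pred Value and Neg Pred Value.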
OJ.cv = cv.tree(tree.OJ, FUN = prune.misclass)
OJ.cv
## $size
## [1] 9 8 7 4 2 1
##
## $dev
## [1] 150 150 149 158 172 315
##
## $k
## [1] -Inf 0.000000 3.000000 4.333333 10.500000 151.000000
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
plot(OJ.cv$size, OJ.cv$dev, type = "b", xlab = "Tree Size", ylab = "CV misclassifications")
The tree with 7 terminal nodes has the lowest cross-validation error (149 misclassifications).
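The best size can also be read off programmatically rather than from the plot. The sketch below uses the $size and $dev values copied from the output above instead of the OJ.cv object, so it runs on its own. The folds in cv.tree() are random, so a rerun may give different counts.

```r
cv_size <- c(9, 8, 7, 4, 2, 1)
cv_dev <- c(150, 150, 149, 158, 172, 315)   # CV misclassification counts

# which.min() returns the first position of the smallest value.
best_size <- cv_size[which.min(cv_dev)]
print(best_size)
```

With the fitted object, the equivalent expression would be OJ.cv$size[which.min(OJ.cv$dev)].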
prune.OJ = prune.tree(tree.OJ, best = 7)
summary(prune.OJ)
##
## Classification tree:
## snip.tree(tree = tree.OJ, nodes = c(10L, 4L))
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "ListPriceDiff" "PctDiscMM"
## Number of terminal nodes: 7
## Residual mean deviance: 0.7748 = 614.4 / 793
## Misclassification error rate: 0.1625 = 130 / 800
The pruned tree's training error rate is 0.1625 (130 of 800), slightly higher than the unpruned tree's 0.1588 (127 of 800).
unpruned_pred = predict(tree.OJ, OJtest, type = "class")
confusionMatrix(OJtest$Purchase, unpruned_pred)
## Confusion Matrix and Statistics
##
## Reference
## Prediction CH MM
## CH 160 8
## MM 38 64
##
## Accuracy : 0.8296
## 95% CI : (0.7794, 0.8725)
## No Information Rate : 0.7333
## P-Value [Acc > NIR] : 0.0001259
##
## Kappa : 0.6154
##
## Mcnemar's Test P-Value : 1.904e-05
##
## Sensitivity : 0.8081
## Specificity : 0.8889
## Pos Pred Value : 0.9524
## Neg Pred Value : 0.6275
## Prevalence : 0.7333
## Detection Rate : 0.5926
## Detection Prevalence : 0.6222
## Balanced Accuracy : 0.8485
##
## 'Positive' Class : CH
##
pruned_pred = predict(prune.OJ, OJtest, type = "class")
confusionMatrix(OJtest$Purchase, pruned_pred)
## Confusion Matrix and Statistics
##
## Reference
## Prediction CH MM
## CH 160 8
## MM 36 66
##
## Accuracy : 0.837
## 95% CI : (0.7875, 0.879)
## No Information Rate : 0.7259
## P-Value [Acc > NIR] : 1.185e-05
##
## Kappa : 0.6336
##
## Mcnemar's Test P-Value : 4.693e-05
##
## Sensitivity : 0.8163
## Specificity : 0.8919
## Pos Pred Value : 0.9524
## Neg Pred Value : 0.6471
## Prevalence : 0.7259
## Detection Rate : 0.5926
## Detection Prevalence : 0.6222
## Balanced Accuracy : 0.8541
##
## 'Positive' Class : CH
##
The pruned tree has a slightly lower test error rate than the unpruned tree: 1 - 0.837 = 0.163 (44 of 270 misclassified) versus 0.1704 (46 of 270).
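To put the training and test comparisons side by side, the sketch below rebuilds both confusion tables from the printed counts and tabulates the four error rates. The helper `error_rate` is hypothetical, and the training figures are the misclassification counts reported by the two summary() calls.

```r
# Hypothetical helper: misclassification rate from a square table of counts.
error_rate <- function(tab) 1 - sum(diag(tab)) / sum(tab)

classes <- c("CH", "MM")
unpruned <- matrix(c(160, 38, 8, 64), nrow = 2,
                   dimnames = list(classes, classes))
pruned <- matrix(c(160, 36, 8, 66), nrow = 2,
                 dimnames = list(classes, classes))

comparison <- rbind(
  train = c(unpruned = 127 / 800, pruned = 130 / 800),
  test = c(unpruned = error_rate(unpruned), pruned = error_rate(pruned))
)
print(round(comparison, 4))
```

The test-set gap corresponds to only two observations out of 270, so it is modest evidence in favour of the pruned tree.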