x = seq(0,1,.01)
class_err = 1 - pmax(x, 1-x)
gini = 2 * x * (1 - x) # x(1-x) + (1-x)(1-(1-x)) simplifies to 2x(1-x), since the second term reduces to x(1-x)
entropy = -(x * log(x) + (1 - x) * log(1 - x)) # NaN at the endpoints (0 * log(0)); those points are simply skipped when plotted
plot(x, class_err, xlim = c(0, 1), ylim = c(0, 1), col = 'red', type = 'l', ylab = 'measure', main = 'Values of metrics across different classification rates')
lines(x, gini, col = 'blue')
lines(x, entropy, col = 'green')
legend(x = 'top', legend = c('gini', 'class error', 'entropy'), col = c('blue', 'red', 'green'), lty = 1)
To build a regression tree, a top-down, greedy approach is used to split the predictor space into smaller and smaller regions; every observation that falls in a given region receives the same prediction, namely the mean response of the training observations in that region. Each split is chosen to minimize the RSS. Further splits are determined in the same way, each one subdividing one of the existing regions, until a stopping criterion is reached. This criterion could be a minimum number of observations per region or a minimum improvement in the fit; a small sketch of the greedy RSS criterion for a single split is shown below.
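The following is a minimal sketch of that greedy criterion on simulated data (x_sim, y_sim, and rss_for_split are made-up names and data, not part of the exercise): it scans every candidate cutpoint on one predictor and keeps the one that minimizes the combined RSS of the two resulting regions.
# simulate one predictor with a break in the mean response at 0.4
set.seed(1)
x_sim = runif(100)
y_sim = ifelse(x_sim < 0.4, 2, 5) + rnorm(100, sd = 0.5)
# RSS of splitting at cutpoint s: each side predicts its own mean
rss_for_split = function(s) {
  left = y_sim[x_sim < s]; right = y_sim[x_sim >= s]
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
}
cutpoints = sort(x_sim)[-1]            # skip the smallest value so both regions are non-empty
rss = sapply(cutpoints, rss_for_split)
cutpoints[which.min(rss)]              # the greedy first split, close to the true break at 0.4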
library(ISLR) # OJ data
library(tree) # tree(), cv.tree(), prune.misclass()
dat = OJ
set.seed(1)
trainset = sample(nrow(dat), 800)
train = dat[trainset,]
test = dat[-trainset,]
The tree uses five of the predictors (LoyalCH, PriceDiff, SpecialCH, ListPriceDiff, PctDiscMM) and has 9 terminal nodes, with a training error rate of 15.9% (127/800).
tree = tree(Purchase ~ ., train, method = 'class') # fit a classification tree to the training data
summary(tree)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = train, method = "class")
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "SpecialCH" "ListPriceDiff"
## [5] "PctDiscMM"
## Number of terminal nodes: 9
## Residual mean deviance: 0.7432 = 587.8 / 791
## Misclassification error rate: 0.1588 = 127 / 800
Terminal node 8 contains the 59 training observations with LoyalCH below 0.0356; 98.3% of them (all but one) are MM, so the node predicts MM. A quick check of these counts follows the printout below.
tree
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 1073.00 CH ( 0.60625 0.39375 )
## 2) LoyalCH < 0.5036 365 441.60 MM ( 0.29315 0.70685 )
## 4) LoyalCH < 0.280875 177 140.50 MM ( 0.13559 0.86441 )
## 8) LoyalCH < 0.0356415 59 10.14 MM ( 0.01695 0.98305 ) *
## 9) LoyalCH > 0.0356415 118 116.40 MM ( 0.19492 0.80508 ) *
## 5) LoyalCH > 0.280875 188 258.00 MM ( 0.44149 0.55851 )
## 10) PriceDiff < 0.05 79 84.79 MM ( 0.22785 0.77215 )
## 20) SpecialCH < 0.5 64 51.98 MM ( 0.14062 0.85938 ) *
## 21) SpecialCH > 0.5 15 20.19 CH ( 0.60000 0.40000 ) *
## 11) PriceDiff > 0.05 109 147.00 CH ( 0.59633 0.40367 ) *
## 3) LoyalCH > 0.5036 435 337.90 CH ( 0.86897 0.13103 )
## 6) LoyalCH < 0.764572 174 201.00 CH ( 0.73563 0.26437 )
## 12) ListPriceDiff < 0.235 72 99.81 MM ( 0.50000 0.50000 )
## 24) PctDiscMM < 0.196197 55 73.14 CH ( 0.61818 0.38182 ) *
## 25) PctDiscMM > 0.196197 17 12.32 MM ( 0.11765 0.88235 ) *
## 13) ListPriceDiff > 0.235 102 65.43 CH ( 0.90196 0.09804 ) *
## 7) LoyalCH > 0.764572 261 91.20 CH ( 0.95785 0.04215 ) *
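As a quick check of the node-8 numbers (an illustrative verification, not part of the original analysis), we can select the node's region directly; since node 8's ancestors also split on LoyalCH at larger cutpoints, the single condition LoyalCH < 0.0356415 defines it.
node8 = train[train$LoyalCH < 0.0356415, ] # the region belonging to terminal node 8
nrow(node8)                                # 59 observations, matching n in the printout
table(node8$Purchase)                      # all but one are MM
mean(node8$Purchase == 'MM')               # about 0.983, matching the printed yprob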
From the plot we see that LoyalCH is the most important predictor. When LoyalCH is greater than 0.5036, a few additional criteria (ListPriceDiff and PctDiscMM) are checked to pick out some MM purchasers, but otherwise these observations are classified as CH. The same pattern holds on the other side of the tree: when LoyalCH is less than 0.5036, a few other predictors (PriceDiff and SpecialCH) pick out some CH purchasers, but otherwise these observations are classified as MM. A hypothetical customer is traced through these rules after the plot.
plot(tree)
text(tree)
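To make the rules concrete, here is a hypothetical customer (the values 0.6 and 0.30 are made up for illustration) traced through the fitted tree: LoyalCH = 0.6 goes right at the root, left at the LoyalCH < 0.7646 split, and ListPriceDiff = 0.30 then lands in terminal node 13, which predicts CH.
new_obs = train[1, ]                   # borrow a row so all predictors and factor levels are present
new_obs$LoyalCH = 0.6
new_obs$ListPriceDiff = 0.30
predict(tree, new_obs, type = 'class') # root -> node 3 -> node 6 -> node 13, returning CH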
The test error rate of the unpruned tree is 17.0% (46 of the 270 test observations are misclassified).
preds = predict(tree, test, type = 'class')
table(test$Purchase, preds)
## preds
## CH MM
## CH 160 8
## MM 38 64
1 - sum(preds == test$Purchase) / nrow(test)
## [1] 0.1703704
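Equivalently, the same error rate can be read off the confusion matrix or computed with mean():
(8 + 38) / nrow(test)        # the off-diagonal counts of the confusion matrix
mean(preds != test$Purchase) # the same 17.0% test error rate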
From the cross-validation results we see that 4 terminal nodes are sufficient: the 4-node tree misclassifies only one more cross-validated observation than the 8- or 9-node trees (146 vs. 145). Therefore I will prune to 4 terminal nodes rather than 8; a programmatic version of this selection follows the CV output below.
set.seed(1)
cv.tree = cv.tree(tree, FUN = prune.misclass)
cv.tree
## $size
## [1] 9 8 7 4 2 1
##
## $dev
## [1] 145 145 146 146 167 315
##
## $k
## [1] -Inf 0.000000 3.000000 4.333333 10.500000 151.000000
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
plot(cv.tree$size, cv.tree$dev / nrow(train), type = 'l', xlab = 'tree size (terminal nodes)', ylab = 'CV misclassification rate')
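A small helper (my own selection rule, not part of the original workflow) that picks the smallest tree whose CV misclassification count is within one of the minimum makes the same choice:
best_size = min(cv.tree$size[cv.tree$dev <= min(cv.tree$dev) + 1]) # smallest size within 1 misclassification of the best
best_size                                                          # 4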
The pruned tree misclassifies about 2 percentage points more of the training data than the unpruned tree (17.9% vs. 15.9%) but only about 0.7 percentage points more of the test data (17.8% vs. 17.0%). That seems like a good trade-off, since 4 terminal nodes are much easier to interpret than 9. The fact that the pruned tree's training and test errors are nearly equal, while the unpruned tree's test error is noticeably higher than its training error, also suggests the unpruned tree was overfitting the training data; the four error rates are collected in a small comparison at the end of this section.
prune.tree = prune.misclass(tree, best = 4)
prune.pred.train = predict(prune.tree, train, type = 'class')
table(train$Purchase, prune.pred.train)
## prune.pred.train
## CH MM
## CH 443 42
## MM 101 214
1 - sum(prune.pred.train == train$Purchase) / nrow(train)
## [1] 0.17875
prune.pred.test = predict(prune.tree, test, type = 'class')
table(test$Purchase,prune.pred.test)
## prune.pred.test
## CH MM
## CH 161 7
## MM 41 61
1 - sum(prune.pred.test == test$Purchase) / nrow(test)
## [1] 0.1777778
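For reference, the four error rates discussed above can be collected into one small table, recomputed from the objects fit earlier in this section (expected values, roughly: unpruned 0.159 train / 0.170 test, pruned 0.179 train / 0.178 test):
data.frame(
  model = c('unpruned', 'pruned'),
  train_error = c(mean(predict(tree, train, type = 'class') != train$Purchase),
                  mean(prune.pred.train != train$Purchase)),
  test_error = c(mean(preds != test$Purchase),
                 mean(prune.pred.test != test$Purchase))
)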