Exercise 3

x = seq(0,1,.01)
class_err = 1 - pmax(x, 1-x)
gini = 2 * x * (1 - x) # p(1-p) + (1-p)p simplifies to 2p(1-p) when there are only two classes
entropy = -(x * log(x) + (1 - x) * log(1 - x)) # NaN at x = 0 and x = 1; lines() simply skips those points

plot(x,class_err,xlim = c(0, 1), ylim = c(0, 1),col = 'red', type = 'l', ylab = 'measure', main = 'Values of metrics across different classification rates')
lines(x,gini, col = 'blue')
lines(x,entropy, col = 'green')
legend(x = 'top', legend = c('gini', 'class error', 'entropy'), col = c('blue', 'red', 'green'), lty = 1)

Exercise 8

To build a regression tree, a top-down, greedy approach (recursive binary splitting) is used to split the predictor space into smaller and smaller regions; every observation that falls in a given region receives the same prediction, namely the mean response of the training observations in that region. At each step the tree chooses the predictor and cutpoint whose split minimizes the RSS. Further splits are found in the same way, only now splitting one of the existing regions, and the process continues until a stopping criterion is reached, such as a minimum number of observations per region or a minimum improvement in RSS (a small sketch of the split search follows below).
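
To make the greedy RSS criterion concrete, here is a minimal sketch (not part of the exercise) of how a single split is chosen for one numeric predictor: scan candidate cutpoints and keep the one that minimizes the combined RSS of the two resulting regions. The function name best_split and the simulated xs and ys are hypothetical; tree() and rpart() apply this idea recursively over all predictors with additional stopping rules.

# one greedy split on a single numeric predictor, chosen by minimizing RSS
best_split = function(x, y) {
  cuts = sort(unique(x))
  cuts = (head(cuts, -1) + tail(cuts, -1)) / 2 # midpoints between adjacent values
  rss = sapply(cuts, function(s) {
    left = y[x < s]; right = y[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2) # RSS of the two regions
  })
  list(cutpoint = cuts[which.min(rss)], rss = min(rss))
}

# example: the recovered cutpoint should be close to 0.5
set.seed(1)
xs = runif(200); ys = ifelse(xs < 0.5, 1, 3) + rnorm(200, sd = 0.3)
best_split(xs, ys)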

Exercise 9

A

library(ISLR) # provides the OJ data
library(tree)

dat = OJ
set.seed(1)
trainset = sample(nrow(dat), 800)
train = dat[trainset,]
test = dat[-trainset,]

B

The tree has 9 terminal nodes with a training error rate of 15.9% (127/800).

tree = tree(Purchase ~ ., train, method = 'class') # fit a classification tree (Purchase is a factor, so tree() performs classification)
summary(tree)
## 
## Classification tree:
## tree(formula = Purchase ~ ., data = train, method = "class")
## Variables actually used in tree construction:
## [1] "LoyalCH"       "PriceDiff"     "SpecialCH"     "ListPriceDiff"
## [5] "PctDiscMM"    
## Number of terminal nodes:  9 
## Residual mean deviance:  0.7432 = 587.8 / 791 
## Misclassification error rate: 0.1588 = 127 / 800

C

Terminal node 8 corresponds to the region LoyalCH < 0.0356415 and contains 59 training observations. Its predicted class is MM, and 98.3% of the observations in the node (all but one) actually purchased MM, which is why its deviance (10.14) is so low. (A small verification sketch follows the printed tree below.)

tree
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 800 1073.00 CH ( 0.60625 0.39375 )  
##    2) LoyalCH < 0.5036 365  441.60 MM ( 0.29315 0.70685 )  
##      4) LoyalCH < 0.280875 177  140.50 MM ( 0.13559 0.86441 )  
##        8) LoyalCH < 0.0356415 59   10.14 MM ( 0.01695 0.98305 ) *
##        9) LoyalCH > 0.0356415 118  116.40 MM ( 0.19492 0.80508 ) *
##      5) LoyalCH > 0.280875 188  258.00 MM ( 0.44149 0.55851 )  
##       10) PriceDiff < 0.05 79   84.79 MM ( 0.22785 0.77215 )  
##         20) SpecialCH < 0.5 64   51.98 MM ( 0.14062 0.85938 ) *
##         21) SpecialCH > 0.5 15   20.19 CH ( 0.60000 0.40000 ) *
##       11) PriceDiff > 0.05 109  147.00 CH ( 0.59633 0.40367 ) *
##    3) LoyalCH > 0.5036 435  337.90 CH ( 0.86897 0.13103 )  
##      6) LoyalCH < 0.764572 174  201.00 CH ( 0.73563 0.26437 )  
##       12) ListPriceDiff < 0.235 72   99.81 MM ( 0.50000 0.50000 )  
##         24) PctDiscMM < 0.196197 55   73.14 CH ( 0.61818 0.38182 ) *
##         25) PctDiscMM > 0.196197 17   12.32 MM ( 0.11765 0.88235 ) *
##       13) ListPriceDiff > 0.235 102   65.43 CH ( 0.90196 0.09804 ) *
##      7) LoyalCH > 0.764572 261   91.20 CH ( 0.95785 0.04215 ) *
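
The counts for node 8 can be checked directly against the training data. This is just a verification sketch, not part of the original answer; it relies on the fact that the three splits on the path to node 8 are all on LoyalCH with nested thresholds, so the region reduces to LoyalCH < 0.0356415.

node8 = train[train$LoyalCH < 0.0356415, ] # region of terminal node 8
nrow(node8)                                # 59 observations
mean(node8$Purchase == 'MM')               # proportion that actually bought MM (0.98305 above)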

D

From the plot we see that LoyalCH is the most important predictor. If LoyalCH is greater than 0.5036, we check a few other variables (ListPriceDiff and PctDiscMM) to pick out some MM purchasers, but otherwise these observations are classified as CH. On the other side of the tree, where LoyalCH is less than 0.5036, we check PriceDiff and SpecialCH to pick out a few CH purchasers, but otherwise these observations are classified as MM.

plot(tree)
text(tree)

E

The test error rate is 17.0%.

preds = predict(tree, test, type = 'class')
table(test$Purchase, preds)
##     preds
##       CH  MM
##   CH 160   8
##   MM  38  64
1 - sum(preds == test$Purchase) / nrow(test)
## [1] 0.1703704

F, G, and H

From the cross-validation results we see that a tree with 4 terminal nodes is essentially as good as the full tree: the 8- and 9-node trees misclassify only one fewer observation (CV deviance of 145 vs. 146). Since the simpler tree loses almost nothing, I will prune to 4 terminal nodes rather than 8 or 9 (a programmatic check of this choice follows the plot code below).

set.seed(1)
cv.tree = cv.tree(tree, FUN = prune.misclass) # 10-fold CV, using misclassification rate to guide pruning
cv.tree
## $size
## [1] 9 8 7 4 2 1
## 
## $dev
## [1] 145 145 146 146 167 315
## 
## $k
## [1]       -Inf   0.000000   3.000000   4.333333  10.500000 151.000000
## 
## $method
## [1] "misclass"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"
plot(cv.tree$size, cv.tree$dev / nrow(train), type = 'l', xlab = 'tree size (terminal nodes)', ylab = 'CV error rate')
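
As a sanity check on the size chosen above, the selection can also be done programmatically: take the smallest tree whose CV deviance is within one misclassification of the minimum. This is a sketch added for illustration, reusing the cv.tree object fitted above.

# smallest tree size whose CV error is within one misclassification of the best
within_one = cv.tree$size[cv.tree$dev <= min(cv.tree$dev) + 1]
min(within_one) # 4, matching the choice above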

I, J, and K

The pruned tree misclassifies about 2% more of the training data than the unpruned tree (17.9% vs. 15.9%) and about 0.7% more of the test data (17.8% vs. 17.0%). This seems like a good trade-off: the pruned tree has 4 terminal nodes instead of 9, so it is much easier to interpret. The fact that the gap between the two trees is much smaller on the test data than on the training data also suggests that the extra splits in the unpruned tree were partly overfitting the training data (the error rates are collected side by side after the code below).

prune.tree = prune.misclass(tree, best = 4) # prune to 4 terminal nodes

prune.pred.train = predict(prune.tree, train, type = 'class')
table(train$Purchase, prune.pred.train)
##     prune.pred.train
##       CH  MM
##   CH 443  42
##   MM 101 214
1 - sum(prune.pred.train == train$Purchase) / nrow(train)
## [1] 0.17875
prune.pred.test = predict(prune.tree, test, type = 'class')
table(test$Purchase,prune.pred.test)
##     prune.pred.test
##       CH  MM
##   CH 161   7
##   MM  41  61
1 - sum(prune.pred.test == test$Purchase) / nrow(test)
## [1] 0.1777778
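
For a side-by-side comparison, the four error rates quoted above can be collected into one small table. This sketch simply reuses the fitted trees and the predictions computed earlier.

# training and test error rates for the unpruned and pruned trees
data.frame(
  tree  = c('unpruned', 'pruned'),
  train = c(mean(predict(tree, train, type = 'class') != train$Purchase),
            mean(prune.pred.train != train$Purchase)),
  test  = c(mean(preds != test$Purchase),
            mean(prune.pred.test != test$Purchase))
)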