Problem 3
In this chunk, I plot the Gini index, the classification error, and the entropy as functions of the class proportion p in a two-class classification setting.
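With p denoting the proportion of training observations in the first class, the three quantities plotted below are
$$G = 2\,p(1-p), \qquad E = 1 - \max(p,\, 1-p), \qquad D = -\,p\log p - (1-p)\log(1-p),$$
which match the expressions used in the code.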
p <- seq(0,1,0.01)
gini.index <- p*(1-p)*2                  # Gini index: 2p(1-p)
class.error <- 1-pmax(p,1-p)             # classification error: 1 - max(p, 1-p)
entropy <- -(p*log(p)+(1-p)*log(1-p))    # cross-entropy; NaN at p = 0 and p = 1, so lines() skips those endpoints
plot(NA, xlim=c(0,1), ylim=c(0,1), xlab='p', ylab='Gini, Class Error, Entropy')
legend(x='top', legend=c('Gini Index', 'Class Error', 'Entropy'), col=c('green', 'blue', 'red'), lty=1)
lines(p, gini.index, col='green')
lines(p, class.error, col='blue')
lines(p, entropy, col='red')
Problem 8 - Part A
library(ISLR)           # Carseats and OJ data sets
library(tree)           # tree(), cv.tree(), prune.tree()
library(randomForest)   # randomForest(), importance(), varImpPlot()
carseats <- Carseats
attach(carseats)
In this chunk, I split the data into equally sized training and test sets.
set.seed(1)
split <- sample(1:nrow(carseats),nrow(carseats)/2)
carseatsTrain <- carseats[split,]
carseatsTest <- carseats[-split,]
Problem 8 - Part B
The test MSE for the tree model is approximately 4.92. The plot shows that shelving location (ShelveLoc) is the most important predictor of Sales, with Price the second most important. The other variables used in constructing the tree are Age, Advertising, CompPrice, and US, and the tree has 18 terminal nodes.
set.seed(1)
carseatsTree <- tree(Sales~., data=carseatsTrain)
summary(carseatsTree)
##
## Regression tree:
## tree(formula = Sales ~ ., data = carseatsTrain)
## Variables actually used in tree construction:
## [1] "ShelveLoc" "Price" "Age" "Advertising" "CompPrice"
## [6] "US"
## Number of terminal nodes: 18
## Residual mean deviance: 2.167 = 394.3 / 182
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.88200 -0.88200 -0.08712 0.00000 0.89590 4.09900
plot(carseatsTree)
text(carseatsTree, pretty=0)
carseatsPred <- predict(carseatsTree, carseatsTest)
mean((carseatsPred-carseatsTest$Sales)^2)
## [1] 4.922039
Problem 8 - Part C
Cross-validation indicates that the optimal level of tree complexity is 12 terminal nodes (the candidate sizes can also be read directly from the cv.tree object, as sketched after the code below). In this instance, pruning the tree does not improve the test MSE relative to the un-pruned tree (approximately 4.97 versus 4.92).
set.seed(10)
carseatsTree.cv <- cv.tree(carseatsTree)
plot(carseatsTree.cv)
carseatsPrune <- prune.tree(carseatsTree, best=12)
plot(carseatsPrune)
text(carseatsPrune, pretty=0)
carseatsPred.2 <- predict(carseatsPrune, carseatsTest)
mean((carseatsPred.2-carseatsTest$Sales)^2)
## [1] 4.966929
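A minimal sketch of reading the optimal size directly from the carseatsTree.cv object fitted above, rather than off the plot:
# tree size(s) attaining the minimum cross-validated deviance
carseatsTree.cv$size[carseatsTree.cv$dev == min(carseatsTree.cv$dev)]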
Problem 8 - Part D
The test MSE for the bagging approach is approximately 2.65. Bagging is a random forest in which all ten predictors are candidates at every split (mtry = ncol(carseats) - 1 = 10). The importance() output indicates that Price, ShelveLoc, CompPrice, and Age are the most important variables, consistent with the trees in Part B and Part C.
set.seed(1)
carseatsBag <- randomForest(Sales~., data=carseatsTrain, ntree=200, mtry=(ncol(carseats)-1), importance=TRUE)
carseatsPred.3 <- predict(carseatsBag, carseatsTest)
mean((carseatsPred.3-carseatsTest$Sales)^2)
## [1] 2.648186
importance(carseatsBag)
## %IncMSE IncNodePurity
## CompPrice 15.7222058 169.659205
## Income 2.7594485 93.101909
## Advertising 8.1845163 98.322261
## Population -1.7007384 61.499757
## Price 37.0191656 490.324947
## ShelveLoc 32.3103292 374.849405
## Age 11.8377967 155.842428
## Education -0.4093829 45.196607
## Urban -0.2230521 9.522458
## US 2.6574292 16.327555
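The two importance measures above can also be visualized with varImpPlot() from the randomForest package; a minimal sketch using the bagged model fit above:
varImpPlot(carseatsBag, main='Variable importance, bagged model')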
Problem 8 - Part E
The importance() output for the random forest model indicates that ShelveLoc, Price, Age, and CompPrice are the most important variables. The test MSE varies between approximately 2.6 and 4.7 depending on m, the number of predictors considered at each split; the plot of test MSE against m below indicates that m = 7 gives the lowest value. The test MSE obtained with the m = 7 model is approximately 2.61.
carseatsMSE <- c()
set.seed(1)
# fit a random forest for each value of m (mtry) from 1 to 10 and record the test MSE
for(x in 1:10) {
  carseatsRF <- randomForest(Sales~., data=carseatsTrain, mtry=x, ntree=200, importance = TRUE)
  carseatsPred.4 <- predict(carseatsRF, carseatsTest)
  carseatsMSE <- c(carseatsMSE, mean((carseatsPred.4-carseatsTest$Sales)^2))
}
plot(1:10, carseatsMSE, type='b')
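The value of m with the lowest test MSE can also be identified programmatically rather than read off the plot; a minimal sketch using the carseatsMSE vector from the loop above:
which.min(carseatsMSE)   # position in the vector equals the mtry value tried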
carseatsRF.2 <- randomForest(Sales~., data=carseatsTrain, mtry=7, ntree=200, importance=TRUE)
carseatsPred.5 <- predict(carseatsRF.2, carseatsTest)
mean((carseatsPred.5-carseatsTest$Sales)^2)
## [1] 2.611588
importance(carseatsRF.2)
## %IncMSE IncNodePurity
## CompPrice 15.4993135 162.327912
## Income 3.5085424 105.838861
## Advertising 6.8856713 103.505137
## Population -2.2433850 63.309175
## Price 33.4575763 467.446439
## ShelveLoc 29.0120778 363.172690
## Age 9.6010973 159.393029
## Education -0.9275776 52.217507
## Urban -0.6025475 9.700305
## US 4.1227908 22.023457
Problem 9 - Part A
oj <- OJ
attach(oj)
In this chunk, I am creating a training set with 800 observations and assigning the remaining observations to a test set.
set.seed(1)
split.2 <- sample(1:nrow(oj), 800)
ojTrain <- oj[split.2,]
ojTest <- oj[-split.2,]
Problem 9 - Part B
The results of the tree summary below indicate that there are 9 terminal nodes in the tree. The variables used in the tree construction are LoyalCH, PriceDiff, SpecialCH, ListPriceDiff and PctDiscMM. The training classification error rate is 15.88%.
ojTree <- tree(Purchase~., data=ojTrain)
summary(ojTree)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = ojTrain)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "SpecialCH" "ListPriceDiff"
## [5] "PctDiscMM"
## Number of terminal nodes: 9
## Residual mean deviance: 0.7432 = 587.8 / 791
## Misclassification error rate: 0.1588 = 127 / 800
Problem 9 - Part C
For terminal node 8 in the output below, the split variable is LoyalCH and the split is LoyalCH < 0.0356. There are 59 training observations in this node, and their deviance is 10.14. The label MM indicates that Minute Maid is the predicted value of Purchase at this node: approximately 98.3% of the observations in the node (58 of 59) purchased Minute Maid and 1.7% purchased Citrus Hill (these values can also be extracted programmatically, as sketched after the printed tree).
ojTree
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 1073.00 CH ( 0.60625 0.39375 )
## 2) LoyalCH < 0.5036 365 441.60 MM ( 0.29315 0.70685 )
## 4) LoyalCH < 0.280875 177 140.50 MM ( 0.13559 0.86441 )
## 8) LoyalCH < 0.0356415 59 10.14 MM ( 0.01695 0.98305 ) *
## 9) LoyalCH > 0.0356415 118 116.40 MM ( 0.19492 0.80508 ) *
## 5) LoyalCH > 0.280875 188 258.00 MM ( 0.44149 0.55851 )
## 10) PriceDiff < 0.05 79 84.79 MM ( 0.22785 0.77215 )
## 20) SpecialCH < 0.5 64 51.98 MM ( 0.14062 0.85938 ) *
## 21) SpecialCH > 0.5 15 20.19 CH ( 0.60000 0.40000 ) *
## 11) PriceDiff > 0.05 109 147.00 CH ( 0.59633 0.40367 ) *
## 3) LoyalCH > 0.5036 435 337.90 CH ( 0.86897 0.13103 )
## 6) LoyalCH < 0.764572 174 201.00 CH ( 0.73563 0.26437 )
## 12) ListPriceDiff < 0.235 72 99.81 MM ( 0.50000 0.50000 )
## 24) PctDiscMM < 0.196197 55 73.14 CH ( 0.61818 0.38182 ) *
## 25) PctDiscMM > 0.196197 17 12.32 MM ( 0.11765 0.88235 ) *
## 13) ListPriceDiff > 0.235 102 65.43 CH ( 0.90196 0.09804 ) *
## 7) LoyalCH > 0.764572 261 91.20 CH ( 0.95785 0.04215 ) *
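A minimal sketch of pulling node 8 out of the fitted object programmatically; the row names of the frame component of a tree object are the node numbers:
ojTree$frame["8", ]   # split variable, n, deviance, predicted class and class probabilities for node 8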
Problem 9 - Part D
In this plot, it is apparent that LoyalCH (loyalty to Citrus Hill) is the most important predictor in the decision tree. When LoyalCH is low, Minute Maid is predicted to be purchased; when LoyalCH is high, Citrus Hill is predicted. For LoyalCH values greater than 0.280875 in the left branch, PriceDiff and SpecialCH become important variables for predicting which brand will be purchased. For LoyalCH values less than 0.764572 in the right branch, ListPriceDiff and PctDiscMM become important variables for predicting which brand will be purchased.
plot(ojTree)
text(ojTree, pretty=0)
Problem 9 - Part E
ojPred <- predict(ojTree, ojTest, type="class")
table(ojPred, ojTest$Purchase)
##
## ojPred CH MM
## CH 160 38
## MM 8 64
The test error rate is approximately 17.04%: 46 of the 270 test observations (38 + 8 in the confusion matrix above) are misclassified.
mean(ojPred != ojTest$Purchase)
## [1] 0.1703704
Problem 9 - Part F
Tree sizes 9, 8, and 7 are tied for the smallest cross-validated deviance (163), so the optimal tree size is taken to be 7, the smallest tree that attains the minimum.
set.seed(4)
ojTree.cv <- cv.tree(ojTree, FUN=prune.misclass)
ojTree.cv
## $size
## [1] 9 8 7 4 2 1
##
## $dev
## [1] 163 163 163 171 173 309
##
## $k
## [1] -Inf 0.000000 3.000000 4.333333 10.500000 151.000000
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
Problem 9 - Part G
In this chunk, I plot the cross-validated classification error against tree size (with FUN=prune.misclass, the $dev component of the cv.tree object counts cross-validated misclassifications); an explicit size-versus-error version is sketched after the default plot.
plot(ojTree.cv)
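An explicit version with tree size on the x-axis and the cross-validated error on the y-axis, using the ojTree.cv object from Part F:
plot(ojTree.cv$size, ojTree.cv$dev, type='b', xlab='Tree size', ylab='CV misclassifications')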
Problem 9 - Part H
The smallest tree size corresponding to the lowest cross-validated classification error rate is 7 (sizes 8 and 9 tie with the same error); a sketch of extracting it follows.
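A minimal sketch of extracting that size from the ojTree.cv object, taking the smallest tree among the ties for the minimum error:
min(ojTree.cv$size[ojTree.cv$dev == min(ojTree.cv$dev)])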
Problem 9 - Part I
ojPrune <- prune.tree(ojTree, best=7)
Problem 9 - Part J
The training error rate of the pruned tree (16.25%) is higher than that of the un-pruned tree (15.88%).
summary(ojTree)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = ojTrain)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "SpecialCH" "ListPriceDiff"
## [5] "PctDiscMM"
## Number of terminal nodes: 9
## Residual mean deviance: 0.7432 = 587.8 / 791
## Misclassification error rate: 0.1588 = 127 / 800
summary(ojPrune)
##
## Classification tree:
## snip.tree(tree = ojTree, nodes = c(10L, 4L))
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "ListPriceDiff" "PctDiscMM"
## Number of terminal nodes: 7
## Residual mean deviance: 0.7748 = 614.4 / 793
## Misclassification error rate: 0.1625 = 130 / 800
Problem 9 - Part K
The test error rate of the pruned tree (approximately 16.30%) is slightly lower than that of the un-pruned tree (approximately 17.04%).
ojPred <- predict(ojTree, ojTest, type="class")
mean(ojPred != ojTest$Purchase)
## [1] 0.1703704
ojPred.prune <- predict(ojPrune, ojTest, type="class")
mean(ojPred.prune != ojTest$Purchase)
## [1] 0.162963