Problem 3

In this chunk, I am plotting the Gini index, classification error and entropy as functions of p, the proportion of observations in the first class, for a classification setting with two classes.

p <- seq(0,1,0.01)
gini.index <- 2*p*(1-p)
class.error <- 1-pmax(p,1-p)
entropy <- -(p*log(p)+(1-p)*log(1-p))  # NaN at p = 0 and p = 1, so those two endpoints are simply not drawn

plot(NA, xlim=c(0,1), ylim=c(0,1), xlab='p', ylab='Gini, Class Error, Entropy')
legend(x='top', legend=c('Gini Index', 'Class Error', 'Entropy'), col=c('green', 'blue', 'red'), lty=1)
lines(p, gini.index, col='green')
lines(p, class.error, col='blue')
lines(p, entropy, col='red')
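
As a quick sanity check (a sketch added here, not part of the original code), all three measures should reach their maximum at p = 0.5, where the two classes are equally likely; with the natural-log entropy used above, the maxima are 0.5, 0.5 and log(2), approximately 0.693.

# Sketch: all three impurity measures peak at p = 0.5 (p.half is just an illustrative name)
p.half <- 0.5
2*p.half*(1-p.half)                                # Gini index: 0.5
1 - pmax(p.half, 1-p.half)                         # classification error: 0.5
-(p.half*log(p.half) + (1-p.half)*log(1-p.half))   # entropy: log(2), approximately 0.693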

Problem 8 - Part A

library(ISLR)          # provides the Carseats and OJ data sets
library(tree)          # tree(), cv.tree(), prune.tree(), prune.misclass()
library(randomForest)  # randomForest(), importance(), varImpPlot()

carseats <- Carseats
attach(carseats)

In this chunk, I am splitting the Carseats data into a training set and a test set of equal size.

set.seed(1)
split <- sample(1:nrow(carseats),nrow(carseats)/2)
carseatsTrain <- carseats[split,]
carseatsTest <- carseats[-split,]
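
As a quick sanity check (a sketch, not part of the assigned answer), the Carseats data has 400 observations, so each half of the split should contain 200 rows:

# Sanity check: the 400 Carseats observations should split into 200 training and 200 test rows
nrow(carseatsTrain)
nrow(carseatsTest)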

Problem 8 - Part B

The test MSE for the tree model is approximately 4.92. The plot shows that shelf location (ShelveLoc) is the most important predictor of Sales and Price is the second most important. The other variables used in constructing the tree are Age, Advertising, CompPrice and US, and the tree has 18 terminal nodes.

set.seed(1)
carseatsTree <- tree(Sales~., data=carseatsTrain)
summary(carseatsTree)
## 
## Regression tree:
## tree(formula = Sales ~ ., data = carseatsTrain)
## Variables actually used in tree construction:
## [1] "ShelveLoc"   "Price"       "Age"         "Advertising" "CompPrice"  
## [6] "US"         
## Number of terminal nodes:  18 
## Residual mean deviance:  2.167 = 394.3 / 182 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -3.88200 -0.88200 -0.08712  0.00000  0.89590  4.09900
plot(carseatsTree)
text(carseatsTree, pretty=0)

carseatsPred <- predict(carseatsTree, carseatsTest)
mean((carseatsPred-carseatsTest$Sales)^2)
## [1] 4.922039
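
To interpret the test MSE on the scale of the response (Sales is recorded in thousands of units), its square root gives a rough average prediction error; this is a small sketch, not part of the assigned answer:

# Sketch: root of the test MSE, roughly the typical size of a prediction error in thousands of units
sqrt(mean((carseatsPred-carseatsTest$Sales)^2))  # approximately sqrt(4.92), i.e. about 2.2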

Problem 8 - Part C

Cross-validation indicates that the optimal level of tree complexity is 12. In this instance, pruning the tree to 12 terminal nodes does not improve the test MSE relative to the un-pruned tree (approximately 4.97 versus 4.92).

set.seed(10)
carseatsTree.cv <- cv.tree(carseatsTree)
plot(carseatsTree.cv)

carseatsPrune <- prune.tree(carseatsTree, best=12)
plot(carseatsPrune)
text(carseatsPrune, pretty=0)

carseatsPred.2 <- predict(carseatsPrune, carseatsTest)
mean((carseatsPred.2-carseatsTest$Sales)^2)
## [1] 4.966929
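
As an optional follow-up (a sketch; the object names prune.sizes and prune.mse are introduced here for illustration), the test MSE can be computed for a range of pruned subtree sizes to see how it changes with tree complexity:

# Sketch: test MSE for pruned subtrees with 2 through 18 terminal nodes
prune.sizes <- 2:18
prune.mse <- sapply(prune.sizes, function(b) {
  pruned <- prune.tree(carseatsTree, best=b)
  mean((predict(pruned, carseatsTest)-carseatsTest$Sales)^2)
})
plot(prune.sizes, prune.mse, type='b', xlab='Terminal nodes', ylab='Test MSE')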

Problem 8 - Part D

The test MSE for the bagging approach is approximately 2.65. The importance function indicates that ShelveLoc, Price, Age and CompPrice are the most important variables (similar to the tree models in Part B and Part C).

set.seed(1)
carseatsBag <- randomForest(Sales~., data=carseatsTrain, ntree=200, mtry=(ncol(carseats)-1), importance=TRUE)
carseatsPred.3 <- predict(carseatsBag, carseatsTest)
mean((carseatsPred.3-carseatsTest$Sales)^2)
## [1] 2.648186
importance(carseatsBag)
##                %IncMSE IncNodePurity
## CompPrice   15.7222058    169.659205
## Income       2.7594485     93.101909
## Advertising  8.1845163     98.322261
## Population  -1.7007384     61.499757
## Price       37.0191656    490.324947
## ShelveLoc   32.3103292    374.849405
## Age         11.8377967    155.842428
## Education   -0.4093829     45.196607
## Urban       -0.2230521      9.522458
## US           2.6574292     16.327555
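
The same importance measures can be displayed graphically with randomForest's varImpPlot() function (an optional sketch):

# Sketch: dot charts of %IncMSE and IncNodePurity for the bagged model
varImpPlot(carseatsBag)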

Problem 8 - Part E

The importance function for the random forest model indicates that ShelveLoc, Price, Age and CompPrice are the most important variables. The test MSE varies between approximately 2.6 and 4.7 depending on the value of m, and the plot of test MSE against m below indicates that m=7 gives the lowest value. The test MSE obtained for the m=7 model is approximately 2.61.

# Fit a random forest for each m = 1, ..., 10 and record the corresponding test MSE
carseatsMSE <- numeric(10)
set.seed(1)
for(x in 1:10) {
  carseatsRF <- randomForest(Sales~., data=carseatsTrain, mtry=x, ntree=200, importance = TRUE)
  carseatsPred.4 <- predict(carseatsRF, carseatsTest)
  carseatsMSE[x] <- mean((carseatsPred.4-carseatsTest$Sales)^2)
}
plot(1:10, carseatsMSE, type='b', xlab='m (mtry)', ylab='Test MSE')
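
Rather than reading the best m off the plot, it can be selected programmatically (a sketch; under the seed used above this should agree with the m = 7 chosen below):

# Sketch: value of m (mtry) with the smallest test MSE
which.min(carseatsMSE)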

carseatsRF.2 <- randomForest(Sales~., data=carseatsTrain, mtry=7, ntree=200, importance=TRUE)
carseatsPred.5 <- predict(carseatsRF.2, carseatsTest)
mean((carseatsPred.5-carseatsTest$Sales)^2)
## [1] 2.611588
importance(carseatsRF.2)
##                %IncMSE IncNodePurity
## CompPrice   15.4993135    162.327912
## Income       3.5085424    105.838861
## Advertising  6.8856713    103.505137
## Population  -2.2433850     63.309175
## Price       33.4575763    467.446439
## ShelveLoc   29.0120778    363.172690
## Age          9.6010973    159.393029
## Education   -0.9275776     52.217507
## Urban       -0.6025475      9.700305
## US           4.1227908     22.023457

Problem 9 - Part A

oj <- OJ
attach(oj)

In this chunk, I am creating a training set with 800 observations and assigning the remaining observations to a test set.

set.seed(1)
split.2 <- sample(1:nrow(oj), 800)
ojTrain <- oj[split.2,]
ojTest <- oj[-split.2,]
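
As a quick sanity check (a sketch), the OJ data has 1,070 observations, so the test set should contain the remaining 270 rows:

# Sanity check: 800 training observations and 1070 - 800 = 270 test observations
nrow(ojTrain)
nrow(ojTest)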

Problem 9 - Part B

The results of the tree summary below indicate that there are 9 terminal nodes in the tree. The variables used in the tree construction are LoyalCH, PriceDiff, SpecialCH, ListPriceDiff and PctDiscMM. The training classification error rate is 15.88%.

ojTree <- tree(Purchase~., data=ojTrain)
summary(ojTree)
## 
## Classification tree:
## tree(formula = Purchase ~ ., data = ojTrain)
## Variables actually used in tree construction:
## [1] "LoyalCH"       "PriceDiff"     "SpecialCH"     "ListPriceDiff"
## [5] "PctDiscMM"    
## Number of terminal nodes:  9 
## Residual mean deviance:  0.7432 = 587.8 / 791 
## Misclassification error rate: 0.1588 = 127 / 800
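
The training error rate reported above can also be recovered from a confusion matrix on the training set (a sketch; the name trainTab is introduced here for illustration):

# Sketch: training confusion matrix; 127 misclassified out of 800 gives 0.1588
trainTab <- table(predict(ojTree, ojTrain, type="class"), ojTrain$Purchase)
trainTab
1 - sum(diag(trainTab))/sum(trainTab)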

Problem 9 - Part C

For terminal node 8 in the output below, the split occurs on LoyalCH at the value 0.0356415 (observations with LoyalCH < 0.0356415 fall into this node). The node contains 59 observations with a deviance of 10.14. The label MM indicates that Minute Maid is the predicted value of Purchase at this node; approximately 98.3% of the observations in the node purchased Minute Maid and 1.7% purchased Citrus Hill.

ojTree
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 800 1073.00 CH ( 0.60625 0.39375 )  
##    2) LoyalCH < 0.5036 365  441.60 MM ( 0.29315 0.70685 )  
##      4) LoyalCH < 0.280875 177  140.50 MM ( 0.13559 0.86441 )  
##        8) LoyalCH < 0.0356415 59   10.14 MM ( 0.01695 0.98305 ) *
##        9) LoyalCH > 0.0356415 118  116.40 MM ( 0.19492 0.80508 ) *
##      5) LoyalCH > 0.280875 188  258.00 MM ( 0.44149 0.55851 )  
##       10) PriceDiff < 0.05 79   84.79 MM ( 0.22785 0.77215 )  
##         20) SpecialCH < 0.5 64   51.98 MM ( 0.14062 0.85938 ) *
##         21) SpecialCH > 0.5 15   20.19 CH ( 0.60000 0.40000 ) *
##       11) PriceDiff > 0.05 109  147.00 CH ( 0.59633 0.40367 ) *
##    3) LoyalCH > 0.5036 435  337.90 CH ( 0.86897 0.13103 )  
##      6) LoyalCH < 0.764572 174  201.00 CH ( 0.73563 0.26437 )  
##       12) ListPriceDiff < 0.235 72   99.81 MM ( 0.50000 0.50000 )  
##         24) PctDiscMM < 0.196197 55   73.14 CH ( 0.61818 0.38182 ) *
##         25) PctDiscMM > 0.196197 17   12.32 MM ( 0.11765 0.88235 ) *
##       13) ListPriceDiff > 0.235 102   65.43 CH ( 0.90196 0.09804 ) *
##      7) LoyalCH > 0.764572 261   91.20 CH ( 0.95785 0.04215 ) *

Problem 9 - Part D

In this plot, it is apparent that brand loyalty to Citrus Hill (LoyalCH) is the most important predictor in the decision tree. When LoyalCH is low, Minute Maid is predicted to be purchased; when LoyalCH is high, Citrus Hill is predicted to be purchased. For LoyalCH values greater than 0.280875 in the left branch, PriceDiff and SpecialCH become important variables for predicting which brand will be purchased. For LoyalCH values less than 0.764572 in the right branch, ListPriceDiff and PctDiscMM become important variables.

plot(ojTree)
text(ojTree, pretty=0)

Problem 9 - Part E

ojPred <- predict(ojTree, ojTest, type="class")
table(ojPred, ojTest$Purchase)
##       
## ojPred  CH  MM
##     CH 160  38
##     MM   8  64

The test error rate is approximately 17.04%: 46 of the 270 test observations (the 38 + 8 off-diagonal counts in the confusion matrix above) are misclassified.

mean(ojPred != ojTest$Purchase)
## [1] 0.1703704
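
Equivalently (a sketch, with ojTab introduced here for illustration), the same error rate follows from the off-diagonal entries of the confusion matrix:

# Sketch: misclassified test observations divided by the test set size, (38 + 8) / 270
ojTab <- table(ojPred, ojTest$Purchase)
(ojTab[1, 2] + ojTab[2, 1]) / sum(ojTab)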

Problem 9 - Part F

Tree sizes 9, 8 and 7 are tied for the smallest cross-validated deviance (163), so the optimal tree size is taken to be 7, the smallest tree that attains this minimum.

set.seed(4)
ojTree.cv <- cv.tree(ojTree, FUN=prune.misclass)
ojTree.cv
## $size
## [1] 9 8 7 4 2 1
## 
## $dev
## [1] 163 163 163 171 173 309
## 
## $k
## [1]       -Inf   0.000000   3.000000   4.333333  10.500000 151.000000
## 
## $method
## [1] "misclass"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"

Problem 9 - Part G

In this chunk, I am producing a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.

plot(ojTree.cv)
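
plot(ojTree.cv) shows the error against both tree size (bottom axis) and the cost-complexity parameter k (top axis); a plot restricted to tree size on the x-axis, as the question asks, can be drawn directly from the cv.tree components (a sketch):

# Sketch: cross-validated classification errors against tree size only
plot(ojTree.cv$size, ojTree.cv$dev, type='b',
     xlab='Tree size', ylab='CV classification errors')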

Problem 9 - Part H

Tree sizes 9, 8 and 7 all correspond to the lowest cross-validated classification error rate; the smallest of these, 7, is selected.

Problem 9 - Part I

ojPrune <- prune.tree(ojTree, best=7)
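
prune.tree() prunes on the deviance by default, while the cross-validation above used FUN = prune.misclass; an alternative (a sketch, with ojPrune.mc introduced for illustration) is to prune with the misclassification criterion, which may yield a slightly different 7-node subtree:

# Sketch: prune using the misclassification criterion that was used in cross-validation
ojPrune.mc <- prune.misclass(ojTree, best=7)
summary(ojPrune.mc)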

Problem 9 - Part J

The training error rate of the pruned tree (16.25%) is higher than that of the un-pruned tree (15.88%).

summary(ojTree)
## 
## Classification tree:
## tree(formula = Purchase ~ ., data = ojTrain)
## Variables actually used in tree construction:
## [1] "LoyalCH"       "PriceDiff"     "SpecialCH"     "ListPriceDiff"
## [5] "PctDiscMM"    
## Number of terminal nodes:  9 
## Residual mean deviance:  0.7432 = 587.8 / 791 
## Misclassification error rate: 0.1588 = 127 / 800
summary(ojPrune)
## 
## Classification tree:
## snip.tree(tree = ojTree, nodes = c(10L, 4L))
## Variables actually used in tree construction:
## [1] "LoyalCH"       "PriceDiff"     "ListPriceDiff" "PctDiscMM"    
## Number of terminal nodes:  7 
## Residual mean deviance:  0.7748 = 614.4 / 793 
## Misclassification error rate: 0.1625 = 130 / 800

Problem 9 - Part K

The test error rate of the pruned tree (approximately 16.30%) is slightly lower than that of the un-pruned tree (approximately 17.04%).

ojPred <- predict(ojTree, ojTest, type="class")
mean(ojPred != ojTest$Purchase)
## [1] 0.1703704
ojPred.prune <- predict(ojPrune, ojTest, type="class")
mean(ojPred.prune != ojTest$Purchase)
## [1] 0.162963