Chapter 8

Exercise 3

Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of \(\hat{p_{m1}}\). The x- axis should display \(\hat{p_{m1}}\), ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy.

p <- seq(0, 1, 0.01)
gini.index <- 2 * p * (1 - p)
class.error <- 1 - pmax(p, 1 - p)
cross.entropy <- - (p * log(p) + (1 - p) * log(1 - p))

matplot(p, cbind(gini.index, class.error, cross.entropy), pch=c(14,15,16) ,ylab = "gini.index, class.error, cross.entropy",col = c("royalblue" , "salmon1", "red3"), type = 'l')
legend('bottom', inset=.01, legend = c('Gini index', 'Classification error', 'Cross entropy'), col = c("royalblue" , "salmon1", "red3"), pch=c(14,15,16))

Exercise 8

  • (a) Split the data set into a training set and a test set.
library(ISLR)
library(tree)
## Warning: package 'tree' was built under R version 4.2.1
data(Carseats)

set.seed(1)

s <- sample(1:nrow(Carseats), nrow(Carseats)/2)
train <- Carseats[s, ]
test <- Carseats[-s, ]
  • (b) Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?
t=tree(Sales~.,train)
summary(t)
## 
## Regression tree:
## tree(formula = Sales ~ ., data = train)
## Variables actually used in tree construction:
## [1] "ShelveLoc"   "Price"       "Age"         "Advertising" "CompPrice"  
## [6] "US"         
## Number of terminal nodes:  18 
## Residual mean deviance:  2.167 = 394.3 / 182 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -3.88200 -0.88200 -0.08712  0.00000  0.89590  4.09900
plot(t)
text(t,pretty=0,cex=0.5)

pred <- predict(t, newdata = test)
mean((pred - test$Sales)^2)
## [1] 4.922039

So the MSE is 4.922.

  • (c) Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?
cv_c<-  cv.tree(t)
par(mfrow=c(1, 2))
plot(cv_c$size, cv_c$dev, type="b")
plot(cv_c$k, cv_c$dev, type="b")

par(mfrow = c(1,1))
prune <- prune.tree(t, best = 8)
plot(prune)
text(prune, pretty = 0)

predict <- predict(prune, newdata = test)
mean((predict - test$Sales)^2)
## [1] 5.113254

In this case, pruning the tree actually makes the MSE worse - it goes from 4.92 to 5.11. This isn’t always the case - it depends on the test and training sets, sometimes the pruned tree will result in a better MSE than the unpruned tree (when the unpruned tree has been overfit).

  • (d) Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important.
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(randomForest)
## randomForest 4.7-1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(ggplot2)
library(lattice)
set.seed(1)
Carseats_fit <- train(Sales ~ ., data = train,
                       method = "rf",
                       trControl = trainControl(method = "none"),
                       tuneGrid = data.frame(mtry = 10),
                       importance = TRUE)

preds_carseats <- predict(Carseats_fit, newdata = test)
mean((test$Sales - preds_carseats)^2)
## [1] 2.713437
ggplot(varImp(Carseats_fit, scale = FALSE))

As seen above, bagging improves the test MSE to 2.71. Also, it looks like the price of the carseat and where it is located on the shelf are the most important predictors of how a carseat will sell. competitor price, Age and advertising budget also appear to have an effect, but all other variables seem to be less important.

  • (e) Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.
mse.vec <- NA
for (a in 1:10){
  rf.carseats <-  randomForest(Sales ~ . , data=train, 
                             mtry=a,importance=TRUE)
  rf.pred <-  predict(rf.carseats, test)
  mse.vec[a] <- mean((test$Sales - rf.pred)^2)
}

# best model
which.min(mse.vec)
## [1] 7
mse.vec[which.min(mse.vec)]
## [1] 2.590518

We see that the best model uses 7 variables at each split. That model decreases test error (2.59) compared to bagging.

set.seed(1)
Carseats_fit_rf <- train(Sales ~ ., data = train,
                       method = "rf",
                       trControl = trainControl(method = "none"),
                       tuneGrid = data.frame(mtry = 7),
                       importance = TRUE)

preds_carseats_rf <- predict(Carseats_fit_rf, newdata = test) 
mean((test$Sales - preds_carseats_rf)^2)
## [1] 2.734402
ggplot(varImp(Carseats_fit_rf, scale = FALSE))

The price of the carseat and where it is located on the shelf are still the most important predictors of how a carseat will sell.

Exercise 9

  • (a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
set.seed(1)
data(OJ)
x <- sample(1:nrow(OJ), 800)
train <- OJ[x, ]
test <- OJ[-x, ]
  • (b) Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?
t<- tree(Purchase ~ ., data =train)
summary(t)
## 
## Classification tree:
## tree(formula = Purchase ~ ., data = train)
## Variables actually used in tree construction:
## [1] "LoyalCH"       "PriceDiff"     "SpecialCH"     "ListPriceDiff"
## [5] "PctDiscMM"    
## Number of terminal nodes:  9 
## Residual mean deviance:  0.7432 = 587.8 / 791 
## Misclassification error rate: 0.1588 = 127 / 800

The summary shows that fitted tree has a training error rate of 0.158 and it has 9 terminal nodes.

  • (c) Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.
t
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 800 1073.00 CH ( 0.60625 0.39375 )  
##    2) LoyalCH < 0.5036 365  441.60 MM ( 0.29315 0.70685 )  
##      4) LoyalCH < 0.280875 177  140.50 MM ( 0.13559 0.86441 )  
##        8) LoyalCH < 0.0356415 59   10.14 MM ( 0.01695 0.98305 ) *
##        9) LoyalCH > 0.0356415 118  116.40 MM ( 0.19492 0.80508 ) *
##      5) LoyalCH > 0.280875 188  258.00 MM ( 0.44149 0.55851 )  
##       10) PriceDiff < 0.05 79   84.79 MM ( 0.22785 0.77215 )  
##         20) SpecialCH < 0.5 64   51.98 MM ( 0.14062 0.85938 ) *
##         21) SpecialCH > 0.5 15   20.19 CH ( 0.60000 0.40000 ) *
##       11) PriceDiff > 0.05 109  147.00 CH ( 0.59633 0.40367 ) *
##    3) LoyalCH > 0.5036 435  337.90 CH ( 0.86897 0.13103 )  
##      6) LoyalCH < 0.764572 174  201.00 CH ( 0.73563 0.26437 )  
##       12) ListPriceDiff < 0.235 72   99.81 MM ( 0.50000 0.50000 )  
##         24) PctDiscMM < 0.196196 55   73.14 CH ( 0.61818 0.38182 ) *
##         25) PctDiscMM > 0.196196 17   12.32 MM ( 0.11765 0.88235 ) *
##       13) ListPriceDiff > 0.235 102   65.43 CH ( 0.90196 0.09804 ) *
##      7) LoyalCH > 0.764572 261   91.20 CH ( 0.95785 0.04215 ) *

The root is split into nodes using the varable LoyalCH which was also determined to be the most important variable for the model in the summary. Lets take the terminal node labeled 8 is used. The split criterion is LoyalCH<0.035, the number of observations in the branch is 59 with a deviance of 10.14 and an overall prediction for the branch of MM. That means we expect them to buy Minute Maid instead o fCitrus Hill.

  • (d) Create a plot of the tree, and interpret the results.
plot(t)
text(t,pretty=0)

Based on the plot above, the most important splitting variables is LoyalCH. The reason is the top 3 nodes have splitting variable LoyalCH.

  • (e) Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?
pred <- predict(t, test, type = "class")
caret::confusionMatrix(pred, test$Purchase)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  CH  MM
##         CH 160  38
##         MM   8  64
##                                           
##                Accuracy : 0.8296          
##                  95% CI : (0.7794, 0.8725)
##     No Information Rate : 0.6222          
##     P-Value [Acc > NIR] : 8.077e-14       
##                                           
##                   Kappa : 0.6154          
##                                           
##  Mcnemar's Test P-Value : 1.904e-05       
##                                           
##             Sensitivity : 0.9524          
##             Specificity : 0.6275          
##          Pos Pred Value : 0.8081          
##          Neg Pred Value : 0.8889          
##              Prevalence : 0.6222          
##          Detection Rate : 0.5926          
##    Detection Prevalence : 0.7333          
##       Balanced Accuracy : 0.7899          
##                                           
##        'Positive' Class : CH              
## 
#test error
mean(pred != test$Purchase)  
## [1] 0.1703704

The test error rate is about 17%.

  • (f) Apply the cv.tree() function to the training set in order to determine the optimal tree size.
cv <- cv.tree(t )
cv
## $size
## [1] 9 8 7 6 5 4 3 2 1
## 
## $dev
## [1]  685.6493  698.8799  702.8083  702.8083  714.1093  725.4734  780.2099
## [8]  790.0301 1074.2062
## 
## $k
## [1]      -Inf  12.62207  13.94616  14.35384  26.21539  35.74964  43.07317
## [8]  45.67120 293.15784
## 
## $method
## [1] "deviance"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"
  • (g) Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.
plot(cv$size, cv$dev, type = "b", xlab = "Tree size", ylab = "cross-validated classification error rate")

  • (h) Which tree size corresponds to the lowest cross-validated classification error rate?

Based on the plot, Size of 5 has the lowest error.

  • (i) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.
prune.oj <- prune.misclass(t, best = 7)
plot(prune.oj)
text(prune.oj, pretty = 0)

  • (j) Compare the training error rates between the pruned and unpruned trees. Which is higher?
summary(t)
## 
## Classification tree:
## tree(formula = Purchase ~ ., data = train)
## Variables actually used in tree construction:
## [1] "LoyalCH"       "PriceDiff"     "SpecialCH"     "ListPriceDiff"
## [5] "PctDiscMM"    
## Number of terminal nodes:  9 
## Residual mean deviance:  0.7432 = 587.8 / 791 
## Misclassification error rate: 0.1588 = 127 / 800
summary(prune.oj)
## 
## Classification tree:
## snip.tree(tree = t, nodes = c(4L, 10L))
## Variables actually used in tree construction:
## [1] "LoyalCH"       "PriceDiff"     "ListPriceDiff" "PctDiscMM"    
## Number of terminal nodes:  7 
## Residual mean deviance:  0.7748 = 614.4 / 793 
## Misclassification error rate: 0.1625 = 130 / 800

The training error rates of prune tree is higher than un-pruned tree. 0.7432 for un-pruned tree and 0.7748 for pruned tree. The reason is model of pruned tree is more flexiable compared to model of un-pruned tree.

  • (k) Compare the test error rates between the pruned and unpruned trees. Which is higher?
prune.pred <- predict(prune.oj,test, type = "class")
caret::confusionMatrix(prune.pred, test$Purchase)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  CH  MM
##         CH 160  36
##         MM   8  66
##                                          
##                Accuracy : 0.837          
##                  95% CI : (0.7875, 0.879)
##     No Information Rate : 0.6222         
##     P-Value [Acc > NIR] : 8.697e-15      
##                                          
##                   Kappa : 0.6336         
##                                          
##  Mcnemar's Test P-Value : 4.693e-05      
##                                          
##             Sensitivity : 0.9524         
##             Specificity : 0.6471         
##          Pos Pred Value : 0.8163         
##          Neg Pred Value : 0.8919         
##              Prevalence : 0.6222         
##          Detection Rate : 0.5926         
##    Detection Prevalence : 0.7259         
##       Balanced Accuracy : 0.7997         
##                                          
##        'Positive' Class : CH             
## 
#test error
mean(prune.pred != test$Purchase) 
## [1] 0.162963

The test error rates of pruned tree is slighly lower than test error rates of un-pruned tree. The reason is model of prune tree is less overfitted compare to un-prune tree.