\(\textbf{Chapter 8}\)

\(\textbf{Question 3:}\) Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of \(\hat{p}_{m1}\). The x-axis should display \(\hat{p}_{m1}\), ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy.

Hint: In a setting with two classes, \(\hat{p}_{m1} = 1 - \hat{p}_{m2}\). You could make this plot by hand, but it will be much easier to make in R.
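For two classes, writing \(\hat{p} = \hat{p}_{m1}\) (so that \(\hat{p}_{m2} = 1 - \hat{p}\)), the three quantities reduce to

\[
G = 2\,\hat{p}(1-\hat{p}), \qquad E = 1 - \max(\hat{p},\, 1-\hat{p}), \qquad D = -\hat{p}\log\hat{p} - (1-\hat{p})\log(1-\hat{p}),
\]

which is what the R code below computes.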

p <- seq(0, 1, 0.01)
gi <- 2 * p * (1 - p)                             # Gini index
c_err <- 1 - pmax(p, 1 - p)                       # classification error
entropy <- -(p * log(p) + (1 - p) * log(1 - p))   # NaN at p = 0 and 1; those points are simply not drawn
matplot(p, cbind(gi, c_err, entropy), type = "l", lty = 1,
        col = c("blue", "grey", "black"),
        xlab = expression(hat(p)[m1]), ylab = "Value")
legend("topright", legend = c("Gini index", "Classification error", "Entropy"),
       col = c("blue", "grey", "black"), lty = 1)

\(\textbf{Question 8:}\) In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.

library(tree)
## Warning: package 'tree' was built under R version 4.0.4
library(ISLR)
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.0.5
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
  1. Split the data set into a training set and a test set.
set.seed(5)
training_sample <- sample(1:nrow(Carseats), nrow(Carseats) / 2)
training_ds <- Carseats[training_sample,]
test_ds <- Carseats[-training_sample,]
  2. Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?
tree_fit <- tree(Sales ~ ., data = training_ds)
plot(tree_fit)
text(tree_fit, pretty = 0)

prediction = predict(tree_fit, newdata = test_ds)
mean((prediction - test_ds$Sales)^2)
## [1] 5.041825

\(\textbf{Response:}\) The resulting test MSE is 5.041825.
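For scale, it can help to compare this against the trivial model that always predicts the training-set mean of Sales. This is an added comparison, not part of the original exercise; it reuses the training_ds and test_ds objects created above.

# Test MSE of a constant prediction (the training-set mean of Sales)
mean((mean(training_ds$Sales) - test_ds$Sales)^2)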

  3. Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?
set.seed(6)
tree_cv <- cv.tree(tree_fit)
plot(tree_cv$size, tree_cv$dev, type = "b", xlab = "Tree size", ylab = "CV deviance")
points(tree_cv$size[which.min(tree_cv$dev)], min(tree_cv$dev), col = "blue", cex = 2, pch = 20)

prune_tree = prune.tree(tree_fit, best = 9)
plot(prune_tree)
text(prune_tree, pretty = 0)

prediction = predict(prune_tree, newdata = test_ds)
mean((prediction - test_ds$Sales)^2)
## [1] 4.774217

\(\textbf{Response:}\) The test MSE decreased to 4.774217 after pruning the tree to 9 terminal nodes, so pruning did improve the test MSE here.
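To see how sensitive this result is to the chosen size, here is a minimal sketch that records the test MSE for each pruned tree; the size range 2:10 is an arbitrary choice, and the tree_fit and test_ds objects from above are reused.

# Test MSE as a function of the number of terminal nodes kept after pruning
sizes <- 2:10
prune_mse <- sapply(sizes, function(k) {
  pruned <- prune.tree(tree_fit, best = k)                      # prune to (about) k terminal nodes
  mean((predict(pruned, newdata = test_ds) - test_ds$Sales)^2)  # test MSE for this size
})
cbind(size = sizes, test_mse = prune_mse)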

  4. Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important.
set.seed(6)
# mtry = 10 uses all 10 predictors at each split, i.e. bagging
bagging <- randomForest(Sales ~ ., data = training_ds, mtry = 10, importance = TRUE)
prediction <- predict(bagging, newdata = test_ds)
mean((prediction - test_ds$Sales)^2)
## [1] 2.568033
importance(bagging)
##                %IncMSE IncNodePurity
## CompPrice   24.1507039    195.333645
## Income      10.9173085    140.555948
## Advertising 15.4635744    116.386749
## Population  -0.9377838     53.144288
## Price       42.8334785    352.388293
## ShelveLoc   56.6800070    531.259260
## Age         14.8566179    139.218229
## Education    2.0203536     51.393441
## Urban        1.6053019      8.602717
## US           1.8223005      6.866832
varImpPlot(bagging)

\(\textbf{Response:}\) As observed in the output, it appears that ShelveLoc and Price are the most important variables.
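To rank the predictors directly rather than reading the table by eye, the importance matrix returned by importance() can be sorted on its %IncMSE column (this reuses the bagging object fitted above):

# Predictors ordered by decrease in accuracy (%IncMSE)
imp <- importance(bagging)
imp[order(imp[, "%IncMSE"], decreasing = TRUE), ]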

  5. Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.
ms <- 1:8                          # candidate values of m (predictors tried at each split)
mse <- rep(NA, length(ms))
for (i in 1:length(ms)) {
  m <- ms[i]
  rf_fit <- randomForest(Sales ~ ., data = training_ds, mtry = m, importance = TRUE)
  print(m)
  print(importance(rf_fit))        # variable importance for this value of m
  prediction <- predict(rf_fit, newdata = test_ds)
  mse[i] <- mean((prediction - test_ds$Sales)^2)   # test MSE for this m
}
## [1] 1
##                  %IncMSE IncNodePurity
## CompPrice    9.471782418     135.25511
## Income       5.523634254     120.56066
## Advertising  7.514128967     108.82321
## Population  -0.952666520      96.80034
## Price       17.551273056     169.87415
## ShelveLoc   23.861029135     210.72491
## Age          9.902839611     129.78439
## Education   -0.127561028      72.90789
## Urban       -0.004639537      22.18199
## US           3.294371377      33.54436
## [1] 2
##                %IncMSE IncNodePurity
## CompPrice   15.1376052     180.01142
## Income       7.3622947     157.28786
## Advertising  7.8944897     142.00347
## Population  -3.8387095     104.22104
## Price       25.7064486     279.67362
## ShelveLoc   34.3383852     319.62708
## Age          9.8895830     166.64765
## Education   -2.2860866      84.58807
## Urban       -0.7065511      19.73474
## US           3.7810012      28.33808
## [1] 3
##                %IncMSE IncNodePurity
## CompPrice   15.4977462     182.83494
## Income       8.7248590     155.52146
## Advertising  9.2564508     146.19840
## Population  -2.8426640      97.02143
## Price       30.3251786     303.77527
## ShelveLoc   40.8622773     380.66805
## Age         12.3746064     156.89713
## Education    0.1390324      71.65719
## Urban        1.9395501      13.36642
## US           3.0428562      21.76415
## [1] 4
##                %IncMSE IncNodePurity
## CompPrice   16.8635237     184.53394
## Income       8.4133380     163.58147
## Advertising 11.7747188     138.75331
## Population  -2.0190914      77.51777
## Price       33.0928308     317.86257
## ShelveLoc   44.1168563     419.76485
## Age         11.3157301     159.36913
## Education    0.5436397      65.71724
## Urban       -0.1330979      12.97418
## US           4.4624163      17.32562
## [1] 5
##                %IncMSE IncNodePurity
## CompPrice   18.8809593     186.85501
## Income       7.7980845     152.54766
## Advertising 11.4851745     129.75165
## Population  -1.5748711      72.02456
## Price       36.7320399     339.98411
## ShelveLoc   47.5037691     443.59907
## Age         14.3102077     147.21637
## Education    1.3500123      61.26393
## Urban       -0.5680042      11.45582
## US           3.6183442      13.54315
## [1] 6
##                %IncMSE IncNodePurity
## CompPrice   19.8944066    192.180115
## Income      10.0528597    151.349181
## Advertising 11.8066517    124.075077
## Population  -1.4861804     66.802037
## Price       36.3697614    336.195955
## ShelveLoc   50.8270188    467.038269
## Age         12.8520494    147.919400
## Education    1.6136479     59.442598
## Urban       -0.1202717      9.730166
## US           3.8022388     12.221410
## [1] 7
##                %IncMSE IncNodePurity
## CompPrice   22.9356355    191.882851
## Income       8.6450110    146.875049
## Advertising 13.6284419    119.662174
## Population  -2.4806891     60.792529
## Price       40.3931279    345.296481
## ShelveLoc   55.8851733    494.066654
## Age         15.8031536    146.705284
## Education    3.3911822     51.593242
## Urban        0.5074606      8.938265
## US           3.5761544     10.430808
## [1] 8
##                %IncMSE IncNodePurity
## CompPrice   23.0633877    192.830665
## Income       9.3654437    148.955729
## Advertising 13.6498786    116.669006
## Population  -1.1591034     55.396291
## Price       39.0987042    341.801328
## ShelveLoc   55.0896660    503.280454
## Age         13.9036826    148.871036
## Education    1.3259643     50.651824
## Urban        0.2491894      7.774526
## US           3.7409931      8.157653
mse
## [1] 4.830256 3.457590 3.050524 2.855002 2.709845 2.702094 2.626212 2.593132

\(\textbf{Response:}\) As can be observed above, the test MSE decreased as m increased. That said, for m = \(\sqrt{p} \approx 3\), the MSE was 3.050524, still a bit higher than the 2.568033 obtained with bagging. Additionally, the importance measures of ShelveLoc and Price, which again appear to be the most important variables, increased as m increased.
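A quick way to visualize this effect, using the ms and mse vectors computed above and the bagging test MSE of 2.568033 as a reference line:

# Test MSE versus m, with the bagging result (m = 10) as a dashed reference line
plot(ms, mse, type = "b", xlab = "m (variables tried at each split)", ylab = "Test MSE")
abline(h = 2.568033, lty = 2, col = "blue")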

\(\textbf{Question 9:}\) This problem involves the OJ data set, which is part of the ISLR package.

  1. Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
set.seed(5)
training_sample <- sample(1:nrow(OJ), 800)
training_ds <- OJ[training_sample,]
test_ds <- OJ[-training_sample,]
  2. Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?
tree_fit <- tree(Purchase ~ ., data = training_ds)
summary(tree_fit)
## 
## Classification tree:
## tree(formula = Purchase ~ ., data = training_ds)
## Variables actually used in tree construction:
## [1] "LoyalCH"       "PriceDiff"     "ListPriceDiff"
## Number of terminal nodes:  9 
## Residual mean deviance:  0.7347 = 581.1 / 791 
## Misclassification error rate: 0.1662 = 133 / 800

\(\textbf{Response:}\) As shown above, the tree has 9 terminal nodes and a training error rate of 0.1662; only LoyalCH, PriceDiff, and ListPriceDiff were actually used in constructing the tree.

  3. Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.
tree_fit
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 800 1068.00 CH ( 0.61250 0.38750 )  
##    2) LoyalCH < 0.5036 346  412.40 MM ( 0.28324 0.71676 )  
##      4) LoyalCH < 0.280875 164  125.50 MM ( 0.12805 0.87195 )  
##        8) LoyalCH < 0.0356415 56   10.03 MM ( 0.01786 0.98214 ) *
##        9) LoyalCH > 0.0356415 108  103.50 MM ( 0.18519 0.81481 ) *
##      5) LoyalCH > 0.280875 182  248.00 MM ( 0.42308 0.57692 )  
##       10) PriceDiff < 0.05 71   67.60 MM ( 0.18310 0.81690 ) *
##       11) PriceDiff > 0.05 111  151.30 CH ( 0.57658 0.42342 ) *
##    3) LoyalCH > 0.5036 454  362.00 CH ( 0.86344 0.13656 )  
##      6) PriceDiff < -0.39 31   40.32 MM ( 0.35484 0.64516 )  
##       12) LoyalCH < 0.638841 10    0.00 MM ( 0.00000 1.00000 ) *
##       13) LoyalCH > 0.638841 21   29.06 CH ( 0.52381 0.47619 ) *
##      7) PriceDiff > -0.39 423  273.70 CH ( 0.90071 0.09929 )  
##       14) LoyalCH < 0.705326 135  143.00 CH ( 0.77778 0.22222 )  
##         28) ListPriceDiff < 0.255 67   89.49 CH ( 0.61194 0.38806 ) *
##         29) ListPriceDiff > 0.255 68   30.43 CH ( 0.94118 0.05882 ) *
##       15) LoyalCH > 0.705326 288   99.77 CH ( 0.95833 0.04167 ) *

\(\textbf{Response:}\) In the output above, * marks the terminal nodes. Taking node 10 as an example: it contains 71 training observations (those with LoyalCH between 0.280875 and 0.5036 and PriceDiff < 0.05), its deviance is 67.60, its predicted class is MM, and about 82% of the observations in the node are MM (18% CH).
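As a sanity check on the printed deviance, assuming the node deviance reported by tree is the multinomial deviance \(-2 \sum_k n_k \log \hat{p}_k\), node 10's value can be reproduced from its size and class proportions:

# Reproduce the deviance of node 10 from its 71 observations and class proportions
n <- 71
p_hat <- c(CH = 0.18310, MM = 0.81690)
-2 * sum(n * p_hat * log(p_hat))   # roughly 67.6, matching the printed deviance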

  4. Create a plot of the tree, and interpret the results.
plot(tree_fit)
text(tree_fit, pretty = 0)

\(\textbf{Response:}\) I find it interesting that after the LoyalCH split at the top, every terminal node but one on the left classifies as MM and every terminal node but one on the right classifies as CH. This suggests that LoyalCH is the most influential variable in the classification.

  5. Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?
tree_prediction <- predict(tree_fit, test_ds, type = "class") 
conf_mat <- table(test_ds$Purchase, tree_prediction)
conf_mat
##     tree_prediction
##       CH  MM
##   CH 148  15
##   MM  32  75
(conf_mat[2, 1] + conf_mat[1, 2]) / sum(conf_mat)
## [1] 0.1740741

\(\textbf{Response:}\) The test error rate is 0.1740741 (47 of the 270 test observations are misclassified).
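Breaking the errors down by true class, using the conf_mat object above:

# Per-class error rates from the confusion matrix
conf_mat[1, 2] / sum(conf_mat[1, ])   # true CH predicted as MM: 15 / 163
conf_mat[2, 1] / sum(conf_mat[2, ])   # true MM predicted as CH: 32 / 107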

  6. Apply the cv.tree() function to the training set in order to determine the optimal tree size.
set.seed(6)
tree_cv <- cv.tree(tree_fit)
tree_cv
## $size
## [1] 9 8 7 6 5 4 3 2 1
## 
## $dev
## [1]  699.2249  689.8235  690.5285  742.8358  765.6732  761.3058  775.4678
## [8]  787.4929 1068.9050
## 
## $k
## [1]      -Inf  11.25968  11.98017  23.10013  29.11546  30.91262  38.92794
## [8]  47.97460 293.76700
## 
## $method
## [1] "deviance"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"

\(\textbf{Response:}\) A tree of size 8 gives the lowest cross-validated deviance (689.8), but size 7 is very close (690.5) and is a simpler model.
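The optimal size can also be pulled out of the tree_cv object programmatically:

# Tree size with the lowest cross-validated deviance
tree_cv$size[which.min(tree_cv$dev)]   # 8 for the results shown above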

  7. Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.
# tree_cv was computed with cv.tree()'s default deviance criterion; to cross-validate the
# misclassification rate itself, rerun with cv.tree(tree_fit, FUN = prune.misclass).
plot(tree_cv$size, tree_cv$dev, type = "b", xlab = "Tree size", ylab = "CV deviance")
points(tree_cv$size[which.min(tree_cv$dev)], min(tree_cv$dev), col = "blue", cex = 2, pch = 20)

  8. Which tree size corresponds to the lowest cross-validated classification error rate?

\(\textbf{Response:}\) Though size 8 corresponds to the lowest cross-validated deviance (used here in place of the classification error rate), size 7 is comparable yet simpler.

  9. Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.
tree_pruned = prune.tree(tree_fit, best = 7)
summary(tree_pruned)
## 
## Classification tree:
## snip.tree(tree = tree_fit, nodes = c(6L, 4L))
## Variables actually used in tree construction:
## [1] "LoyalCH"       "PriceDiff"     "ListPriceDiff"
## Number of terminal nodes:  7 
## Residual mean deviance:  0.7622 = 604.4 / 793 
## Misclassification error rate: 0.1675 = 134 / 800

\(\textbf{Response:}\) Given that size 7 is comparable to size 8 but simpler, I opted for 7, with the motivation that it might generalize better to the test data.

  10. Compare the training error rates between the pruned and unpruned trees. Which is higher?

\(\textbf{Response:}\) The training error rate is slightly higher for the pruned tree: 0.1675 (134/800) versus 0.1662 (133/800) for the unpruned tree.
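These rates can also be computed directly from training-set predictions; this small sketch reuses the fitted tree_fit and tree_pruned objects and should match the summary() values of 0.1662 and 0.1675.

# Training error rates: unpruned vs. pruned tree
mean(predict(tree_fit, training_ds, type = "class") != training_ds$Purchase)
mean(predict(tree_pruned, training_ds, type = "class") != training_ds$Purchase)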

  11. Compare the test error rates between the pruned and unpruned trees. Which is higher?
# pruned tree error
tree_pruned_prediction <- predict(tree_pruned, test_ds, type = "class") 
conf_mat <- table(test_ds$Purchase, tree_pruned_prediction)
(conf_mat[2, 1] + conf_mat[1, 2]) / sum(conf_mat)
## [1] 0.1777778

\(\textbf{Response:}\) The test error rate of the pruned tree, 0.1777778, is slightly higher than the 0.1740741 of the unpruned tree. Out of curiosity, I ran the same calculation for sizes 6 and 8 as well; both also gave 0.1777778.
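For completeness, a short sketch that repeats this comparison across several pruned sizes; the range 2:9 is an arbitrary choice, and it mirrors the prune.tree() call used above rather than prune.misclass().

# Test error rate as a function of the number of terminal nodes kept
sizes <- 2:9
test_err <- sapply(sizes, function(k) {
  pruned <- prune.tree(tree_fit, best = k)
  mean(predict(pruned, test_ds, type = "class") != test_ds$Purchase)
})
cbind(size = sizes, test_error = test_err)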