Tree-Based Methods

Author

Justin Pons

3

Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of \(\hat{p}_{m1}\). The x-axis should display \(\hat{p}_{m1}\), ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy.
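For two classes, writing \(\hat{p}_{m1} = p\) so that \(\hat{p}_{m2} = 1 - p\), the three measures reduce to \(E = 1 - \max(p,\, 1 - p)\) for the classification error, \(G = 2p(1 - p)\) for the Gini index, and \(D = -p \log p - (1 - p)\log(1 - p)\) for the entropy. All three equal zero at \(p = 0\) and \(p = 1\) (using the convention \(0 \log 0 = 0\)) and are maximized at \(p = 0.5\).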

library(tidyverse)
library(ISLR2)
library(tree)
library(randomForest)
library(BART)
library(caret)
# Grid of values for p-hat_m1 and the complementary proportion p-hat_m2
A <- seq(0, 1, .01)
B <- 1 - A

# Classification error, Gini index, and entropy for a two-class node
error <- 1 - pmax(A, B)
gini <- A * (1 - A) + B * (1 - B)
entropy <- -A * log(A) - B * log(B)
entropy[A == 0 | A == 1] <- 0  # convention: 0 * log(0) = 0

df <- cbind(A, error, gini, entropy) |> 
  as_tibble()

df |> 
  ggplot(aes(x = A)) +
  geom_line(aes(y = error, color = "Error"), linewidth = 1.2) +
  geom_line(aes(y = gini, color = "Gini"), linewidth = 1.2) +
  geom_line(aes(y = entropy, color = "Entropy"), linewidth = 1.2) +
  labs(x = "Probability of Class A", y = "Value") +
  scale_color_manual(name = "", values = c("Error" = "orange", "Gini" = "pink", "Entropy" = "purple"))

8A

Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable. Split the data set into a training set and a test set.

data("Carseats")
Carseats <- as_tibble(Carseats)
idx <- sample(nrow(Carseats), nrow(Carseats) * .8, replace = FALSE)
train <- Carseats[idx,] 
test <- Carseats[-idx,]
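Note that sample() is used here without a seed, so the exact split (and therefore the MSE values reported below) will vary between runs. For reproducibility, a seed could be fixed before sampling:

set.seed(42)  # illustrative value; any fixed seed makes the split reproducible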

8B

Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?

car.tree <- tree::tree(Sales ~ . , train)
plot(car.tree)
text(car.tree, pretty = 0, cex = 0.6)

The plot shows the fitted regression tree. It has 17 terminal nodes, each of which reports the mean value of Sales for the observations falling in the corresponding region of the predictor space.
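The number of terminal nodes and the training deviance can also be checked with summary():

summary(car.tree)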

car.preds <- predict(car.tree, newdata = test)
mse <- mean((car.preds - test$Sales)^2)
paste(c("MSE:",mse))
[1] "MSE:"             "4.51748396042299"

8C

Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?

car.tree.cv <- cv.tree(car.tree)
plot(car.tree.cv$size, car.tree.cv$dev, type = "b")

df <- cbind(car.tree.cv$size, car.tree.cv$dev) |> 
  as.data.frame()
colnames(df) <- c("size","dev")
df |>  
  filter(dev == min(dev))
  size      dev
1   12 1408.359

The CV results show that the lowest deviance occurs at a tree size of 12. Below, the tree is pruned to 11 terminal nodes.
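The optimal size can also be extracted programmatically from the object returned by cv.tree():

best.size <- car.tree.cv$size[which.min(car.tree.cv$dev)]  # size with the lowest CV deviance
best.size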

prune.car.tree <- prune.tree(car.tree, best = 11)
car.preds <- predict(prune.car.tree, newdata = test)
mse <- mean((car.preds - test$Sales)^2)
paste(c("MSE:",mse))
[1] "MSE:"             "4.31493839066439"

In this instance, pruning slightly reduces the test MSE (from about 4.52 to 4.31).
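If desired, the pruned tree can be plotted in the same way as the full tree to see which branches were removed:

plot(prune.car.tree)
text(prune.car.tree, pretty = 0, cex = 0.6)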

8D

Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important.

Here, we use the randomForest() function to perform bagging: a random forest is identical to bagging when mtry, the number of predictors considered at each split, equals the full number of predictors.

car.random <- randomForest(formula = Sales ~ ., data = Carseats, mtry = ncol(Carseats) - 1,
                           importance = TRUE, subset = idx)

car.preds <- predict(car.random, newdata = test)
mse <- mean((car.preds - test$Sales)^2)
paste(c("MSE:",mse))
[1] "MSE:"             "2.50997131340607"

Bagging reduces the test MSE to about 2.51.

importance(car.random)
               %IncMSE IncNodePurity
CompPrice   32.0652336     238.68276
Income       8.5529854     127.64685
Advertising 25.5721610     199.23256
Population  -1.7227863      82.87033
Price       69.8381400     734.11747
ShelveLoc   77.5298259     757.82306
Age         24.8215568     246.10722
Education    2.9018100      75.67995
Urban       -0.7037512      11.55332
US           4.0235057      20.32574

Based on %IncMSE and IncNodePurity, ShelveLoc and Price are the most important variables.
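The same importance measures can be displayed graphically with varImpPlot() from randomForest:

varImpPlot(car.random)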

8E

Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.

car.random <- randomForest(formula = Sales ~ ., data = Carseats, importance = TRUE, subset = idx)

car.preds <- predict(car.random, newdata = test)
mse <- mean((car.preds - test$Sales)^2)
paste(c("MSE:",mse))
[1] "MSE:"             "2.99029532767346"

Compared with bagging, the random forest performs slightly worse here, with a test MSE of about 2.99. The difference is that when splitting each node, a random forest considers only a random subset of \(m\) of the \(p\) predictors; for regression, randomForest() uses \(m = p/3\) by default.
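To examine the effect of \(m\) more directly, one could refit the forest over a range of mtry values and track the test MSE. A minimal sketch, with illustrative object names:

m.values <- 1:(ncol(Carseats) - 1)
test.mse <- sapply(m.values, function(m) {
  fit <- randomForest(Sales ~ ., data = Carseats, mtry = m, subset = idx)
  mean((predict(fit, newdata = test) - test$Sales)^2)
})
plot(m.values, test.mse, type = "b", xlab = "m (mtry)", ylab = "Test MSE")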

importance(car.random)
               %IncMSE IncNodePurity
CompPrice   16.5296908     227.60906
Income       4.3387596     178.91001
Advertising 20.4895341     217.05455
Population  -0.3604669     151.03092
Price       45.4410721     596.73902
ShelveLoc   50.8324689     575.28036
Age         21.7056302     286.93399
Education    4.9649687     108.42047
Urban       -0.9810651      18.98387
US           7.3233199      53.98749

Again, ShelveLoc and Price are the most important predictors.

9A

data(OJ)
idx <- sample(nrow(OJ), 800, replace = FALSE)
train <- OJ[idx,]
test <- OJ[-idx,]

9B

oj.tree <- tree(Purchase ~ . , data = train)
summary(oj.tree)

Classification tree:
tree(formula = Purchase ~ ., data = train)
Variables actually used in tree construction:
[1] "LoyalCH"       "SalePriceMM"   "SpecialCH"     "ListPriceDiff"
[5] "PriceDiff"    
Number of terminal nodes:  8 
Residual mean deviance:  0.7444 = 589.5 / 792 
Misclassification error rate: 0.1575 = 126 / 800 

The tree uses five of the predictors and has 8 terminal nodes, with a training misclassification error rate of 15.75% (126/800).

9C

oj.tree
node), split, n, deviance, yval, (yprob)
      * denotes terminal node

 1) root 800 1069.000 CH ( 0.61125 0.38875 )  
   2) LoyalCH < 0.5036 346  404.800 MM ( 0.27168 0.72832 )  
     4) LoyalCH < 0.280875 162  108.800 MM ( 0.10494 0.89506 ) *
     5) LoyalCH > 0.280875 184  250.200 MM ( 0.41848 0.58152 )  
      10) SalePriceMM < 2.04 96  105.700 MM ( 0.23958 0.76042 )  
        20) SpecialCH < 0.5 75   65.950 MM ( 0.16000 0.84000 ) *
        21) SpecialCH > 0.5 21   29.060 CH ( 0.52381 0.47619 ) *
      11) SalePriceMM > 2.04 88  117.400 CH ( 0.61364 0.38636 ) *
   3) LoyalCH > 0.5036 454  350.800 CH ( 0.87004 0.12996 )  
     6) LoyalCH < 0.764572 183  208.500 CH ( 0.74317 0.25683 )  
      12) ListPriceDiff < 0.18 53   73.300 MM ( 0.47170 0.52830 ) *
      13) ListPriceDiff > 0.18 130  108.200 CH ( 0.85385 0.14615 )  
        26) PriceDiff < -0.165 5    5.004 MM ( 0.20000 0.80000 ) *
        27) PriceDiff > -0.165 125   91.730 CH ( 0.88000 0.12000 ) *
     7) LoyalCH > 0.764572 271   98.270 CH ( 0.95572 0.04428 ) *

Take terminal node 4 (marked with an asterisk): it contains the 162 training observations with LoyalCH < 0.280875. Its deviance is 108.8, the predicted class is MM, and about 89.5% of the observations in this node purchased MM, while only about 10.5% purchased CH.

9D

plot(oj.tree)
text(oj.tree, pretty = 0, cex = 0.6)

9E

preds <- predict(oj.tree, newdata = test, type = "class")
confusionMatrix(preds,test$Purchase)
Confusion Matrix and Statistics

          Reference
Prediction  CH  MM
        CH 142  24
        MM  22  82
                                          
               Accuracy : 0.8296          
                 95% CI : (0.7794, 0.8725)
    No Information Rate : 0.6074          
    P-Value [Acc > NIR] : 2.083e-15       
                                          
                  Kappa : 0.6416          
                                          
 Mcnemar's Test P-Value : 0.8828          
                                          
            Sensitivity : 0.8659          
            Specificity : 0.7736          
         Pos Pred Value : 0.8554          
         Neg Pred Value : 0.7885          
             Prevalence : 0.6074          
         Detection Rate : 0.5259          
   Detection Prevalence : 0.6148          
      Balanced Accuracy : 0.8197          
                                          
       'Positive' Class : CH              
                                          

The test accuracy is 0.8296, so the test error rate is 1 - 0.8296, roughly 17%.
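The same error rate can also be computed directly from the predictions:

mean(preds != test$Purchase)  # test misclassification rate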

9F

oj.tree.cv <- cv.tree(oj.tree)
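Since Purchase is categorical, the cross-validation could alternatively be guided by the misclassification rate rather than the default deviance by passing FUN = prune.misclass (stored under a separate name here so the results above are unchanged):

oj.tree.cv.miss <- cv.tree(oj.tree, FUN = prune.misclass)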

9G

plot(oj.tree.cv$size, oj.tree.cv$dev, type = "b")

9H

A tree size of 7 gives the lowest cross-validated deviance.

9I

oj.pruned <- prune.tree(oj.tree, best = 7)

9J

The training misclassification error rate of the pruned tree is 130/800 = 16.25%, slightly higher than the 126/800 = 15.75% of the unpruned tree, as expected for a less flexible tree.
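The pruned tree's training error rate can be read from its summary, in the same format as for the unpruned tree:

summary(oj.pruned)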

9K

preds <- predict(oj.pruned, newdata = test, type = "class")
confusionMatrix(preds,test$Purchase)
Confusion Matrix and Statistics

          Reference
Prediction  CH  MM
        CH 136  22
        MM  28  84
                                          
               Accuracy : 0.8148          
                 95% CI : (0.7633, 0.8593)
    No Information Rate : 0.6074          
    P-Value [Acc > NIR] : 1.693e-13       
                                          
                  Kappa : 0.6156          
                                          
 Mcnemar's Test P-Value : 0.4795          
                                          
            Sensitivity : 0.8293          
            Specificity : 0.7925          
         Pos Pred Value : 0.8608          
         Neg Pred Value : 0.7500          
             Prevalence : 0.6074          
         Detection Rate : 0.5037          
   Detection Prevalence : 0.5852          
      Balanced Accuracy : 0.8109          
                                          
       'Positive' Class : CH              
                                          

The pruned tree's test accuracy is 81.5%, i.e. a test error rate of about 18.5%, slightly higher than the roughly 17% error rate of the unpruned tree.
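As a final check, the two test error rates can be computed side by side:

unpruned.err <- mean(predict(oj.tree, newdata = test, type = "class") != test$Purchase)
pruned.err <- mean(predict(oj.pruned, newdata = test, type = "class") != test$Purchase)
c(unpruned = unpruned.err, pruned = pruned.err)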