Classification trees

Overview

Tree-based models have advantages

  • Interpretability + ease of use + accuracy
  • Usable both for classification decisions and for numeric predictions (regression)

Build a classification tree

Library setup

library(rpart)
library(rpart.plot)
library(tidymodels)
library(ModelMetrics)
library(caret)

  • A classification tree is a decision tree that performs a classification (rather than a regression) task.

Data setup

# Read in the data
recipe <- read.csv("recipe_site_traffic_2212.csv")

# Clean and prepare the data
recipe_data_new <- recipe |>
  # servings is stored as text; keep only the numeric part
  mutate(servings = readr::parse_number(servings)) |>
  # recipes with a missing high_traffic value are treated as low traffic
  mutate(high_traffic = ifelse(is.na(high_traffic), "low", high_traffic)) |>
  mutate(category = if_else(category == "Chicken Breast", "Chicken", category)) |>
  # impute missing numeric values with the column means
  mutate(sugar = replace_na(sugar, mean(sugar, na.rm = TRUE)),
         calories = replace_na(calories, mean(calories, na.rm = TRUE)),
         protein = replace_na(protein, mean(protein, na.rm = TRUE)),
         carbohydrate = replace_na(carbohydrate, mean(carbohydrate, na.rm = TRUE))) |>
  # encode the response as a two-level factor: 1 = high traffic, 0 = not
  mutate(high_traffic = if_else(high_traffic == "High", 1, 0)) |>
  mutate(high_traffic = as.factor(high_traffic)) |>
  # drop the recipe ID column
  select(-recipe)

str(recipe_data_new)
## 'data.frame':    947 obs. of  7 variables:
##  $ calories    : num  435.9 35.5 914.3 97 27.1 ...
##  $ carbohydrate: num  35.07 38.56 42.68 30.56 1.85 ...
##  $ sugar       : num  9.05 0.66 3.09 38.63 0.8 ...
##  $ protein     : num  24.15 0.92 2.88 0.02 0.53 ...
##  $ category    : chr  "Pork" "Potato" "Breakfast" "Beverages" ...
##  $ servings    : num  6 4 1 4 4 2 4 4 6 2 ...
##  $ high_traffic: Factor w/ 2 levels "0","1": 2 2 1 2 1 2 1 1 2 1 ...

# Create the model
recipe_model <- rpart(formula = high_traffic ~ ., 
                      data = recipe_data_new, 
                      method = "class")

# Display the results
rpart.plot(x = recipe_model, yesno = 2, type = 0, extra = 0)

Back to the basics

Overview of the modeling process

# Total number of rows in the recipe data frame
n <- nrow(recipe_data_new)

# Number of rows for the training set
n_train <- round(0.80*n)

# set a random seed for reproducibility
set.seed(123)

# Create a vector of indices which is an 80% random sample 
train_indices <- sample(1:n, n_train)

# Subset the data frame to training indices only
recipe_train <- recipe_data_new[train_indices, ]

# Exclude the training indices to create the test set
recipe_test <- recipe_data_new[-train_indices, ]

# Train the model on the training set to predict the binary response
recipe_model <- rpart(formula = high_traffic ~ ., 
                      data = recipe_train, 
                      method = "class")

Train/test split

For this exercise, you will randomly split the recipe dataset into two pieces: a training set (80%) called recipe_train and a test set (20%) called recipe_test.
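
Since tidymodels is already loaded, the same 80/20 split can also be sketched with rsample, as an optional alternative to the base-R indexing above (the object names recipe_split, recipe_train2, and recipe_test2 are illustrative, not from the original exercise):

# Alternative: an 80/20 split with rsample (part of tidymodels)
set.seed(123)
recipe_split <- initial_split(recipe_data_new, prop = 0.80)
recipe_train2 <- training(recipe_split)
recipe_test2 <- testing(recipe_split)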

Train a classification tree model

  • To train a classification tree in R, you must specify the formula, the data, and the method.

# Train the model (to predict high_traffic)
recipe_model <- rpart(formula = high_traffic~., 
                      data = recipe_train, 
                      method = "class")

# Look at the model output                      
print(recipe_model)
## n= 758 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 758 294 1 (0.3878628 0.6121372)  
##    2) category=Beverages,Breakfast,Chicken 298  90 0 (0.6979866 0.3020134)  
##      4) protein< 7.825 121  15 0 (0.8760331 0.1239669) *
##      5) protein>=7.825 177  75 0 (0.5762712 0.4237288)  
##       10) calories< 519.04 118  42 0 (0.6440678 0.3559322)  
##         20) protein< 98.04 110  36 0 (0.6727273 0.3272727) *
##         21) protein>=98.04 8   2 1 (0.2500000 0.7500000) *
##       11) calories>=519.04 59  26 1 (0.4406780 0.5593220)  
##         22) protein>=45.665 26  10 0 (0.6153846 0.3846154) *
##         23) protein< 45.665 33  10 1 (0.3030303 0.6969697) *
##    3) category=Dessert,Lunch/Snacks,Meat,One Dish Meal,Pork,Potato,Vegetable 460  86 1 (0.1869565 0.8130435) *
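
Reading the output: each row shows the node number, the split condition, the number of observations in the node, the number misclassified (loss), the predicted class (yval), and the class probabilities (yprob). For example, node 3 contains 460 training recipes in the listed categories and predicts class 1 (high traffic) with an estimated probability of about 0.81.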

Evaluate model performance

\[ \text{Accuracy} = \frac{\text{number of correct predictions}}{\text{total number of data points}} \]

Compute confusion matrix

A confusion matrix is a convenient way to examine the per-class error rates for all classes at once.

The confusionMatrix() function from the caret package prints both the confusion matrix and a number of other useful classification metrics such as “Accuracy” (fraction of correctly classified instances).

# calculate the confusion matrix for the test set
class_pred <- predict(object = recipe_model,
                      newdata = recipe_test,
                      type = "class")

caret::confusionMatrix(data = class_pred,
                       reference = recipe_test$high_traffic)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 43 18
##          1 36 92
##                                           
##                Accuracy : 0.7143          
##                  95% CI : (0.6442, 0.7775)
##     No Information Rate : 0.582           
##     P-Value [Acc > NIR] : 0.0001141       
##                                           
##                   Kappa : 0.3933          
##                                           
##  Mcnemar's Test P-Value : 0.0207003       
##                                           
##             Sensitivity : 0.5443          
##             Specificity : 0.8364          
##          Pos Pred Value : 0.7049          
##          Neg Pred Value : 0.7188          
##              Prevalence : 0.4180          
##          Detection Rate : 0.2275          
##    Detection Prevalence : 0.3228          
##       Balanced Accuracy : 0.6903          
##                                           
##        'Positive' Class : 0               
## 
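
As a sanity check, the reported accuracy matches the definition above, computed directly from the predictions (135 of 189 test recipes classified correctly):

# Accuracy from the definition: fraction of correct predictions
mean(class_pred == recipe_test$high_traffic)
## [1] 0.7142857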

Splitting criterion in trees

A classification tree uses split conditions to predict class labels based on one or more input variables. The classification process starts at the root node of the tree; at each node, the process checks whether the input value should continue recursively down the right or the left sub-branch according to the split condition, and it stops when a leaf (terminal) node is reached. The idea behind classification trees is to split the data into subsets where each subset belongs to only one class. This is accomplished by dividing the input space into pure regions, that is, regions containing samples from only one class.

  • With real data, completely pure regions may not be possible, so the decision tree will do the best it can to create regions that are as pure as possible.

  • Boundaries separating these regions are called decision boundaries, and the decision tree model makes classification decisions based on them. The goal is to partition the data at each node into subsets that are as pure as possible.

  • Therefore, we need a way to measure the purity of a split in order to compare different ways of partitioning a set of data. It works out better mathematically to measure impurity rather than purity: the impurity measure of a node specifies how mixed the resulting subsets are.

  • Since we want the resulting subsets to have homogeneous class labels rather than mixed ones, we want the split that minimizes the impurity measure. Two common impurity measures are listed below (see the sketch after this list).

  • Gini index: a higher value means a less pure (more mixed) node.

  • Misclassification rate: the fraction of observations in a node that do not belong to the node’s majority class.
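
As a minimal sketch (not part of the original exercise), both impurity measures can be computed directly from a node’s class proportions: for proportions p_k, the Gini index is 1 - sum(p_k^2) and the misclassification rate is 1 - max(p_k).

# Impurity measures for a single node, given its vector of class labels
gini_index <- function(labels) {
  p <- prop.table(table(labels))  # class proportions in the node
  1 - sum(p^2)                    # 0 = perfectly pure; larger = more mixed
}

misclass_rate <- function(labels) {
  p <- prop.table(table(labels))
  1 - max(p)                      # error rate of predicting the majority class
}

# Example: the root node above has class proportions (0.388, 0.612)
gini_index(c(rep(0, 388), rep(1, 612)))     # ~0.475
misclass_rate(c(rep(0, 388), rep(1, 612)))  # 0.388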

Compare models with a different splitting criterion

Train two models that use different splitting criteria and use the validation set to choose a “best” model from the group. To do this, you’ll use the parms argument of the rpart() function. This argument takes a named list of parameters that change how the model is trained. Set the split parameter to control the splitting criterion.

# Train a gini-based model
recipe_model1 <- rpart(formula = high_traffic ~ ., 
                       data = recipe_train, 
                       method = "class",
                       parms = list(split = "gini"))

# Train an information-based model
recipe_model2 <- rpart(formula = high_traffic ~ ., 
                       data = recipe_train, 
                       method = "class",
                       parms = list(split = "information"))

# Generate predictions on the validation set using the gini model
pred1 <- predict(object = recipe_model1, 
                 newdata = recipe_test,
                 type = "class")    

# Generate predictions on the validation set using the information model
pred2 <- predict(object = recipe_model2, 
                 newdata = recipe_test,
                 type = "class")

# Compare the classification error of the two models
ce(actual = recipe_test$high_traffic, 
   predicted = pred1)
## [1] 0.2857143
ce(actual = recipe_test$high_traffic, 
   predicted = pred2)  
## [1] 0.2857143

Both splitting criteria yield the same test-set classification error here, so either model is a reasonable choice for this data.

Bagged Trees

  • One of the main drawbacks of decision trees is their high variance: often a small change in the data can result in a very different series of splits, which can also make model interpretation somewhat precarious.

  • Bagging, and in particular bagged trees, averages many trees to reduce this variance. Combining several models into one is called an ensemble model, and averaging is one of the easiest ways to create an ensemble from a collection of models.

  • In addition to reducing variance, it can also help avoid overfitting.

  • Bagging is an ensemble method; the term “bagging” is shorthand for bootstrap aggregation. Bagging uses bootstrap sampling and aggregates the individual models by averaging.

  • Bootstrap means sampling rows at random from the training dataset, with replacement. When we draw samples with replacement, it is possible to draw a single training example more than once.

This results in a modified version of the training set where some rows are represented multiple times and some rows are absent. This lets you generate new data that is similar to the data you started with.

By doing this, we can fit many different, but similar models.

  • Step 1: Draw B samples with replacement from the original training set, where B is a number less than or equal to N, the total number of samples in the training set.

  • Step 2: Train a decision tree on the newly created bootstrapped sample.

Repeat steps 1 and 2 as many times as you like: 20, 100, or 1,000 times. Typically, the more trees, the better the model.

Suppose we use 1,000 trees. In this case each tree is trained on a different bootstrap sample, and to generate a prediction with the bagged tree model you generate a prediction from each of the 1,000 trees and then simply average those predictions together to get the final prediction.

The bagged, or ensemble prediction is the average prediction across the bootstrapped trees. Bagging can dramatically reduce the variance of unstable models such as trees, leading to improved prediction.

This means averaging reduces variance while leaving bias unchanged.
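
To make the procedure concrete, here is a minimal hand-rolled sketch of bagging with rpart (illustrative only: each bootstrap sample here has the same size as the training set, which is the usual convention, and the ipred bagging() call shown below does all of this for you):

# A hand-rolled bagging sketch: B bootstrapped trees, averaged predictions
set.seed(123)
B <- 25
n <- nrow(recipe_train)

# Steps 1 and 2, repeated B times: bootstrap sample, then fit a tree on it
trees <- lapply(1:B, function(b) {
  boot_idx <- sample(1:n, size = n, replace = TRUE)
  rpart(high_traffic ~ ., data = recipe_train[boot_idx, ], method = "class")
})

# Ensemble prediction: average the per-tree probabilities of class "1"
probs <- sapply(trees, function(tree) predict(tree, newdata = recipe_test)[, "1"])
bagged_prob <- rowMeans(probs)
bagged_class <- factor(ifelse(bagged_prob > 0.5, "1", "0"))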

bagging(formula = response ~ ., data = dat)

Train a bagged tree model

Let’s start by training a bagged tree model. You’ll be using the bagging() function from the ipred package. The number of bagged trees can be specified using the nbagg parameter, but here we will use the default (25).

If we want to estimate the model’s accuracy using the “out-of-bag” (OOB) samples, we can set the coob parameter to TRUE. The OOB samples are the training observations that were not selected into a given bootstrapped sample. Since these observations were not used in training, they can be used instead to evaluate the accuracy of the model (this is done automatically inside the bagging() function).
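
For intuition, the OOB rows for a single bootstrap sample are simply the rows that the sample happened to miss (a sketch; ipred tracks this internally across all replications):

# Rows not drawn into one bootstrap sample form its out-of-bag set
set.seed(123)
boot_idx <- sample(1:nrow(recipe_train), replace = TRUE)
oob_idx <- setdiff(1:nrow(recipe_train), unique(boot_idx))
length(oob_idx) / nrow(recipe_train)  # typically about a third of the rows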

library(ipred)
set.seed(123)

# Train a bagged model
recipe_model <- bagging(formula = high_traffic ~ ., 
                        data = recipe_train,
                        coob = TRUE)

# Print the model
print(recipe_model)
## 
## Bagging classification trees with 25 bootstrap replications 
## 
## Call: bagging.data.frame(formula = high_traffic ~ ., data = recipe_train, 
##     coob = TRUE)
## 
## Out-of-bag estimate of misclassification error:  0.2955

Evaluating the bagged tree performance

We have to pass the model object, the test dataset, and the type of prediction we want. In this example, we set the type argument to "class" since we want the function to return a vector of class predictions.

If you want to take a peek at the predictions, you can print the class predictions and you will see that classification labels were indeed returned. After making predictions, it is time to evaluate model performance. We can use confusionMatrix() from the caret package.

It’s always good to take a look at the output using the print() function.

# Generate predicted classes using the model object
class_prediction <- predict(object = recipe_model,    
                            newdata = recipe_test,  
                            type = "class")  # return classification labels


# Calculate the confusion matrix for the test set
caret::confusionMatrix(data = class_prediction,       
                       reference = recipe_test$high_traffic)  
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 48 19
##          1 31 91
##                                           
##                Accuracy : 0.7354          
##                  95% CI : (0.6665, 0.7968)
##     No Information Rate : 0.582           
##     P-Value [Acc > NIR] : 8.189e-06       
##                                           
##                   Kappa : 0.4444          
##                                           
##  Mcnemar's Test P-Value : 0.1198          
##                                           
##             Sensitivity : 0.6076          
##             Specificity : 0.8273          
##          Pos Pred Value : 0.7164          
##          Neg Pred Value : 0.7459          
##              Prevalence : 0.4180          
##          Detection Rate : 0.2540          
##    Detection Prevalence : 0.3545          
##       Balanced Accuracy : 0.7174          
##                                           
##        'Positive' Class : 0               
## 

Predict on a test set and compute AUC

In binary classification problems, we can predict numeric values (such as class probabilities) instead of class labels. In fact, class labels are created only after the model predicts a raw numeric value for a test point.

# Generate predictions on the test set (a matrix of class probabilities)
pred <- predict(object = recipe_model,
                newdata = recipe_test,
                type = "prob")

                
# Compute the AUC (`actual` must be a binary (or 1/0 numeric) vector).
# Pass the predicted probability of the positive class "1"; passing the
# probability of class "0" instead would return 1 - AUC.
auc(actual = recipe_test$high_traffic, 
    predicted = pred[, "1"])      
## [1] 0.8113349



# yardstick::roc_curve(); yardstick treats the first factor level ("0")
# as the event by default, so here we pass the probability of class "0"
bind_cols(
  actual = recipe_test$high_traffic,
  predicted = pred[, "0"]
) %>%
  roc_curve(truth = actual, predicted)

# yardstick::roc_auc()
bind_cols(
  actual = recipe_test$high_traffic,
  predicted = pred[, "0"]
) %>%
  roc_auc(truth = actual, predicted)
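
Since yardstick treats class "0" as the event here, and AUC is symmetric between the two classes when the matching probability column is used, roc_auc() should agree with the ModelMetrics auc() value computed above.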