Tree-based models have several advantages: they are easy to interpret and visualise, they handle both numeric and categorical predictors, and they require relatively little data preprocessing.
library(rpart)
library(rpart.plot)
library(tidymodels)
library(ModelMetrics)
library(caret)
# Load the data
recipe <- read.csv("recipe_site_traffic_2212.csv")

# Clean the data: parse servings, fill missing values, and recode the response
recipe_data_new <- recipe |>
  mutate(servings = readr::parse_number(servings)) |>
  mutate(high_traffic = ifelse(is.na(high_traffic), "low", high_traffic)) |>
  mutate(category = if_else(category == "Chicken Breast", "Chicken", category)) |>
  mutate(sugar = replace_na(sugar, mean(sugar, na.rm = TRUE)),
         calories = replace_na(calories, mean(calories, na.rm = TRUE)),
         protein = replace_na(protein, mean(protein, na.rm = TRUE)),
         carbohydrate = replace_na(carbohydrate, mean(carbohydrate, na.rm = TRUE))) |>
  mutate(high_traffic = if_else(high_traffic == "High", 1, 0)) |>
  mutate(high_traffic = as.factor(high_traffic)) |>
  select(-recipe)

# Look at the structure of the cleaned data
str(recipe_data_new)
## 'data.frame': 947 obs. of 7 variables:
## $ calories : num 435.9 35.5 914.3 97 27.1 ...
## $ carbohydrate: num 35.07 38.56 42.68 30.56 1.85 ...
## $ sugar : num 9.05 0.66 3.09 38.63 0.8 ...
## $ protein : num 24.15 0.92 2.88 0.02 0.53 ...
## $ category : chr "Pork" "Potato" "Breakfast" "Beverages" ...
## $ servings : num 6 4 1 4 4 2 4 4 6 2 ...
## $ high_traffic: Factor w/ 2 levels "0","1": 2 2 1 2 1 2 1 1 2 1 ...
# Create the model
recipe_model <- rpart(formula = high_traffic ~ .,
                      data = recipe_data_new,
                      method = "class")
# Display the results
rpart.plot(x = recipe_model, yesno = 2, type = 0, extra = 0)
# Total number of rows in the recipe data frame
n <- nrow(recipe_data_new)

# Number of rows for the training set (80% of the data)
n_train <- round(0.80 * n)

# Set a random seed for reproducibility
set.seed(123)

# Create a vector of indices which is an 80% random sample
train_indices <- sample(1:n, n_train)

# Subset the data frame to training indices only
recipe_train <- recipe_data_new[train_indices, ]

# Exclude the training indices to create the test set
recipe_test <- recipe_data_new[-train_indices, ]
For this exercise, we randomly split the recipe dataset into two pieces: a training set (80%) called recipe_train and a test set (20%) called recipe_test.
# Train the model on the training set to predict high_traffic
recipe_model <- rpart(formula = high_traffic ~ .,
                      data = recipe_train,
                      method = "class")
# Look at the model output
print(recipe_model)
## n= 758
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 758 294 1 (0.3878628 0.6121372)
## 2) category=Beverages,Breakfast,Chicken 298 90 0 (0.6979866 0.3020134)
## 4) protein< 7.825 121 15 0 (0.8760331 0.1239669) *
## 5) protein>=7.825 177 75 0 (0.5762712 0.4237288)
## 10) calories< 519.04 118 42 0 (0.6440678 0.3559322)
## 20) protein< 98.04 110 36 0 (0.6727273 0.3272727) *
## 21) protein>=98.04 8 2 1 (0.2500000 0.7500000) *
## 11) calories>=519.04 59 26 1 (0.4406780 0.5593220)
## 22) protein>=45.665 26 10 0 (0.6153846 0.3846154) *
## 23) protein< 45.665 33 10 1 (0.3030303 0.6969697) *
## 3) category=Dessert,Lunch/Snacks,Meat,One Dish Meal,Pork,Potato,Vegetable 460 86 1 (0.1869565 0.8130435) *
\[ \text{Accuracy} = \frac{\text{number of correct predictions}}{\text{total number of data points}} \]
A confusion matrix is a convenient way to examine the per-class error rates for all classes at once. The confusionMatrix() function from the caret package prints both the confusion matrix and a number of other useful classification metrics, such as "Accuracy" (the fraction of correctly classified instances).
# Calculate the confusion matrix for the test set
class_pred <- predict(object = recipe_model,
                      newdata = recipe_test,
                      type = "class")

caret::confusionMatrix(data = class_pred,
                       reference = recipe_test$high_traffic)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 43 18
## 1 36 92
##
## Accuracy : 0.7143
## 95% CI : (0.6442, 0.7775)
## No Information Rate : 0.582
## P-Value [Acc > NIR] : 0.0001141
##
## Kappa : 0.3933
##
## Mcnemar's Test P-Value : 0.0207003
##
## Sensitivity : 0.5443
## Specificity : 0.8364
## Pos Pred Value : 0.7049
## Neg Pred Value : 0.7188
## Prevalence : 0.4180
## Detection Rate : 0.2275
## Detection Prevalence : 0.3228
## Balanced Accuracy : 0.6903
##
## 'Positive' Class : 0
##
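As a quick sanity check, the reported accuracy matches the formula above applied to the printed counts: the diagonal entries (43 and 92) are the correct predictions out of the 189 test rows.

# Reproduce the reported accuracy from the printed confusion matrix
(43 + 92) / (43 + 18 + 36 + 92)
## [1] 0.7142857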
A classification tree uses split conditions to predict class labels from one or more input variables. The classification process starts at the root node of the tree, and at each node it checks whether the input value should continue recursively down the right or the left sub-branch according to the split condition. The process stops when a leaf (terminal) node is reached.

The idea behind classification trees is to split the data into subsets where each subset belongs to only one class. This is accomplished by dividing the input space into pure regions, that is, regions with samples from only one class.
With real data, completely pure regions may not be possible, so the decision tree will do the best it can to create regions that are as pure as possible.
Boundaries separating these regions are called decision boundaries, and the decision tree model makes classification decisions based on them. The goal is to partition the data at a node into subsets that are as pure as possible.
Therefore, we need a way to measure the purity of a split in order to compare different ways of partitioning a set of data. It works out better mathematically to measure impurity rather than purity: the impurity measure of a node specifies how mixed the resulting subsets are.
Since we want the resulting subsets to have homogeneous class labels, not mixed class labels, we want the split that minimizes the impurity measure.
Gini index: a higher value means a less pure node
Entropy: the basis of the "information" splitting criterion used below
Misclassification rate: the error rate if the node predicted its majority class

A short sketch computing these measures on the training data follows.
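To make these measures concrete, here is a minimal sketch; the helper functions gini_impurity(), entropy_impurity() and misclass_rate() are purely illustrative (they are not part of rpart) and simply apply the standard definitions to a vector of class labels.

# Illustrative impurity helpers (not part of rpart): each takes a vector of
# class labels and returns the impurity of that node
gini_impurity <- function(labels) {
  p <- prop.table(table(labels))   # class proportions
  1 - sum(p^2)                     # Gini: 0 = pure, larger = more mixed
}
entropy_impurity <- function(labels) {
  p <- prop.table(table(labels))
  p <- p[p > 0]                    # avoid 0 * log2(0)
  -sum(p * log2(p))                # entropy, used by the "information" split
}
misclass_rate <- function(labels) {
  p <- prop.table(table(labels))
  1 - max(p)                       # error when predicting the majority class
}

# Impurity of the root node of the training data
gini_impurity(recipe_train$high_traffic)
entropy_impurity(recipe_train$high_traffic)
misclass_rate(recipe_train$high_traffic)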
Train two models that use different splitting criteria and use the validation set to choose a "best" model from this group. To do this you'll use the parms argument of the rpart() function. This argument takes a named list containing values of different parameters you can use to change how the model is trained. Set the parameter split to control the splitting criterion.
# Train a gini-based model
recipe_model1 <- rpart(formula = high_traffic ~ .,
                       data = recipe_train,
                       method = "class",
                       parms = list(split = "gini"))

# Train an information-based model
recipe_model2 <- rpart(formula = high_traffic ~ .,
                       data = recipe_train,
                       method = "class",
                       parms = list(split = "information"))

# Generate predictions on the validation set using the gini model
pred1 <- predict(object = recipe_model1,
                 newdata = recipe_test,
                 type = "class")

# Generate predictions on the validation set using the information model
pred2 <- predict(object = recipe_model2,
                 newdata = recipe_test,
                 type = "class")
# Compare classification error
ce(actual = recipe_test$high_traffic,
predicted = pred1)
## [1] 0.2857143
ce(actual = recipe_test$high_traffic,
predicted = pred2)
## [1] 0.2857143
One of the main drawbacks of decision trees is their high variance: a small change in the data can result in a very different series of splits, which also makes model interpretation somewhat precarious.
Bagging, and in particular bagged trees, averages many trees to reduce this variance. Combining several models into one is called an ensemble model, and averaging is one of the easiest ways to create an ensemble from a collection of models.
In addition to reducing variance, it can also help avoid overfitting.
Bagging is an ensemble method; the term "bagging" is shorthand for bootstrap aggregation. Bagging uses bootstrap sampling and aggregates the individual models by averaging.
Bootstrap means sampling rows at random from the training dataset, with replacement. When we draw samples with replacement, that means it is possible that you will draw a single training example more than once.
This results in a modified version of the training set where some rows are represented multiple times and some rows are absent. This lets you generate new data that is similar to the data you started with.
By doing this, we can fit many different, but similar models.
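For illustration, one bootstrap sample of the training rows can be drawn with sample() and replace = TRUE; the object names below are only for demonstration.

set.seed(42)  # illustrative seed, separate from the modelling seed above
n_rows <- nrow(recipe_train)

# Draw one bootstrap sample: same size as the training set, with replacement
boot_indices <- sample(n_rows, size = n_rows, replace = TRUE)

# Some rows are drawn more than once ...
sum(duplicated(boot_indices))

# ... and some rows are not drawn at all (these become the "out-of-bag" rows)
length(setdiff(1:n_rows, boot_indices))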
Repeat steps 1 and 2 (draw a bootstrap sample, then fit a tree to it) as many times as you like: 20 times, 100 times, or 1000. Typically, the more trees, the better the model.
Suppose we use 1000 trees for the model construction. In this case each tree is grown on a different bootstrap sample and may use different splits, and to generate a prediction with a bagged tree model you generate predictions from each of the 1000 trees and then simply average those predictions to get a final prediction.
The bagged, or ensemble prediction is the average prediction across the bootstrapped trees. Bagging can dramatically reduce the variance of unstable models such as trees, leading to improved prediction.
This means averaging reduces variance and leaves bias unchanged.
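Before turning to a dedicated package, here is a minimal hand-rolled sketch of the idea, assuming 25 trees and the illustrative object names below: grow each tree on a bootstrap sample and average the predicted probabilities on the test set.

set.seed(123)
n_trees <- 25
prob_matrix <- matrix(NA, nrow = nrow(recipe_test), ncol = n_trees)

for (b in 1:n_trees) {
  # Step 1: draw a bootstrap sample of the training data
  idx <- sample(nrow(recipe_train), replace = TRUE)
  boot_sample <- recipe_train[idx, ]

  # Step 2: fit a classification tree to the bootstrap sample
  tree_b <- rpart(high_traffic ~ ., data = boot_sample, method = "class")

  # Store the predicted probability of the high-traffic class ("1")
  prob_matrix[, b] <- predict(tree_b, newdata = recipe_test, type = "prob")[, "1"]
}

# The bagged prediction is the average probability across all trees,
# thresholded at 0.5 to obtain class labels
bagged_prob  <- rowMeans(prob_matrix)
bagged_class <- factor(ifelse(bagged_prob > 0.5, "1", "0"), levels = c("0", "1"))

In practice you would not write this loop yourself; the ipred package shown next does the same thing (and computes the out-of-bag error for you).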
# Basic usage of ipred::bagging()
bagging(formula = response ~ ., data = dat)
Let's start by training a bagged tree model. You'll be using the bagging() function from the ipred package. The number of bagged trees can be specified using the nbagg parameter, but here we will use the default (25).

If we want to estimate the model's accuracy using the "out-of-bag" (OOB) samples, we can set the coob parameter to TRUE. The OOB samples are the training observations that were not selected into the bootstrapped sample used in training. Since these observations were not used in training, we can use them to evaluate the accuracy of the model (this is done automatically inside the bagging() function).
library(ipred)
set.seed(123)
# Train a bagged model
recipe_model <- bagging(formula = high_traffic ~ .,
                        data = recipe_train,
                        coob = TRUE)
# Print the model
print(recipe_model)
##
## Bagging classification trees with 25 bootstrap replications
##
## Call: bagging.data.frame(formula = high_traffic ~ ., data = recipe_train,
## coob = TRUE)
##
## Out-of-bag estimate of misclassification error: 0.2955
We have to pass the model object, the test dataset, and the type of prediction we want. In this example, we set the argument type to "class" since we want the function to return a vector of class predictions.
If you want to take a peek at the predictions, you can print the class predictions and you will see that classification labels were indeed returned. After making predictions, it is time to evaluate the model's performance. We can use confusionMatrix() from the caret package. It's always good to take a look at the output using the print() function.
# Generate predicted classes using the model object
class_prediction <- predict(object = recipe_model,
                            newdata = recipe_test,
                            type = "class")  # return classification labels

# Calculate the confusion matrix for the test set
caret::confusionMatrix(data = class_prediction,
                       reference = recipe_test$high_traffic)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 48 19
## 1 31 91
##
## Accuracy : 0.7354
## 95% CI : (0.6665, 0.7968)
## No Information Rate : 0.582
## P-Value [Acc > NIR] : 8.189e-06
##
## Kappa : 0.4444
##
## Mcnemar's Test P-Value : 0.1198
##
## Sensitivity : 0.6076
## Specificity : 0.8273
## Pos Pred Value : 0.7164
## Neg Pred Value : 0.7459
## Prevalence : 0.4180
## Detection Rate : 0.2540
## Detection Prevalence : 0.3545
## Balanced Accuracy : 0.7174
##
## 'Positive' Class : 0
##
In binary classification problems, we can also predict numeric values (class probabilities) instead of class labels. In fact, class labels are created only after the model produces a raw, numeric predicted value for a test point, which is then converted to a label (for example by thresholding at 0.5).
# Generate predictions on the test set
pred <- predict(object = recipe_model,
                newdata = recipe_test,
                type = "prob")

# Compute the AUC (`actual` must be a binary or 1/0 numeric vector)
auc(actual = recipe_test$high_traffic,
    predicted = pred[, "0"])
## [1] 0.1886651
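This AUC is well below 0.5, which suggests the score direction and the positive class are mismatched here: pred[, "0"] is the probability of the low-traffic class. Since flipping the scores flips the ranking, the AUC based on the probability of the high-traffic class is simply the complement, roughly 1 - 0.189 ≈ 0.81; as a one-line sketch using the same auc() call:

# AUC for the high-traffic ("1") class -- the complement of the value above
auc(actual = recipe_test$high_traffic,
    predicted = pred[, "1"])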
# yardstick::roc_curve()
bind_cols(
  actual = recipe_test$high_traffic,
  predicted = pred[, "0"]
) %>%
  mutate(actual = as.factor(actual)) %>%
  roc_curve(truth = actual, predicted)

# yardstick::roc_auc()
bind_cols(
  actual = recipe_test$high_traffic,
  predicted = pred[, "0"]
) %>%
  mutate(actual = as.factor(actual)) %>%
  roc_auc(truth = actual, predicted)
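If you also want to visualise the ROC curve, the output of yardstick::roc_curve() has an autoplot() method (a quick sketch; ggplot2 is loaded with tidymodels):

# Plot the ROC curve returned by yardstick::roc_curve()
bind_cols(
  actual = recipe_test$high_traffic,
  predicted = pred[, "0"]
) %>%
  mutate(actual = as.factor(actual)) %>%
  roc_curve(truth = actual, predicted) %>%
  autoplot()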