The “tips” dataset is a commonly used example in tutorials and documentation to illustrate data visualization techniques and exploratory analysis. It provides information about tables at a bar or restaurant. Let’s do an exercise where we will try to predict the tip amount given by the tables using a regression tree modeled on this dataset.

Package installation

packages <- c('tidyverse','rpart','rpart.plot','gtools','Rmisc','scales','viridis','caret','AMR','randomForest','fastDummies','rattle','xgboost','ggpubr','reshape2')

if(sum(as.numeric(!packages %in% installed.packages())) != 0){
  instalador <- packages[!packages %in% installed.packages()]
  for(i in 1:length(instalador)) {
    install.packages(instalador, dependencies = T)
    break()}
  sapply(packages, require, character = T) 
} else {
  sapply(packages, require, character = T) 
}

Loading the dataset

data(tips)

tips %>% head
##   total_bill  tip    sex smoker day   time size
## 1      16.99 1.01 Female     No Sun Dinner    2
## 2      10.34 1.66   Male     No Sun Dinner    3
## 3      21.01 3.50   Male     No Sun Dinner    3
## 4      23.68 3.31   Male     No Sun Dinner    2
## 5      24.59 3.61 Female     No Sun Dinner    4
## 6      25.29 4.71   Male     No Sun Dinner    4

1.1 Building the tree

tree <- rpart(tip~., 
              data=tips,
              control=rpart.control(maxdepth = 4, cp=0))

Saving predicted values (p) and error (r) in the dataset

tips['p'] = predict(tree, tips)
tips$p %>% tail # olhando a previsão
## [1] 4.330000 4.330000 2.928182 3.637619 2.791000 2.791000
tips['r'] = tips$tip - tips$p

Creating a function to plot the tree

plot <- function(arvore_){
  paleta <- scales::viridis_pal(begin=.75, end=1)(20)
  plot <- rpart.plot::rpart.plot(arvore_,
                                 box.palette = paleta)}

Plotting the tree

plot(tree)

1.2 Calculating evaluation metrics

Calculation of SSE, MSE, SST, MSR, and R²

Creating a function for evaluation

metrics <- function(tips_in, p_var, tip_var){
  n <- dim(tips_in)[1]
  SSE <- sum((tips_in[tip_var] - tips[tip_var])^2)
  MSE <- SSE/n
  
  # Calculation of SSE (Sum of Squares Total)
SST <- sum((tips_in[tip_var] - (tips_in[tip_var] %>% sum)/n)^2)
MSR <- SST/n

# Calculation of R-squared
R_squared <- 1 - SSE/SST

# Printing the results
cat("Sum of Squares of Errors (SSE): ", SSE, "\n") 
cat("Mean Square Error (MSE): ", MSE, "\n")
cat("Sum of Squares Total (SST): ", SST, "\n") 
cat("Mean Square Total (MSR): ", MSR, "\n")
cat("R-squared (R²): ", R_squared, "\n")
  
}
metrics(tips, "p", "tip")
## Sum of Squares of Errors (SSE):  0 
## Mean Square Error (MSE):  0 
## Sum of Squares Total (SST):  465.2125 
## Mean Square Total (MSR):  1.906609 
## R-squared (R²):  1

1.3 Graphical analysis

Function to plot predicted values on x, observed values on y, and error represented by colors

grafico1 <- function(data, x_var, y_var, r_var) {
  ggplot(data) +
    geom_point(aes(x = !!sym(x_var), y = !!sym(y_var), color = !!sym(r_var))) +
    theme(legend.position="bottom") +
    ggtitle("Scatterplot") +
    scale_color_viridis_c()
}

Plotting

grafico1(tips, "p", "tip", "r")

Despite our tree already showing good metrics, we will perform a procedure to improve its accuracy for study purposes. We will create a new tree with ‘maxdepth = 30’ and evaluate its ‘complexity path.’ From this evaluation, we can create another, more ‘optimized’ tree.

2.1 Training the tree without constraints

tree_hm <- rpart(tip~.,
                 data=tips[, !(names(tips) %in% c("p", "r"))],
                 xval=10,
                 control = rpart.control(cp = 0, 
                                         minsplit = 2,
                                         maxdepth = 30)
)
tips['p_hm'] = predict(tree_hm, tips)
tips$p %>% tail # investigar a previsão
## [1] 4.330000 4.330000 2.928182 3.637619 2.791000 2.791000
tips['r_hm'] = tips$tip - tips$p_hm

2.2 Evaluating the tree_hm

metrics(tips, "p_hm", "tip")
## Sum of Squares of Errors (SSE):  0 
## Mean Square Error (MSE):  0 
## Sum of Squares Total (SST):  465.2125 
## Mean Square Total (MSR):  1.906609 
## R-squared (R²):  1
grafico1(tips, "p_hm", "tip", "r_hm")

n plotaremos essa árvore para evitar travamentos no computador devido a extensão da mesma.

Complexity of the paths

tab_cp <- rpart::printcp(tree_hm)
rpart::plotcp(tree_hm)

Choosing the path that optimizes impurity in cross-validation

tab_cp[which.min(tab_cp[,'xerror']),]
##         CP     nsplit  rel error     xerror       xstd 
## 0.01556767 6.00000000 0.42020104 0.59689617 0.07520624
cp_min <- tab_cp[which.min(tab_cp[,'xerror']),'CP']
cp_min
## [1] 0.01556767

Modeling the pruned/tuned tree

tree_tune <- rpart(tip~., 
                   data=tips[, !(names(tips) %in% c("p", "r", "p_hm", "r_hm"))],
                   xval=0,
                   control = rpart.control(cp = cp_min, 
                                           maxdepth = 30)
)

From the analysis of the complexity path, we identified the value that represents the point on the complexity path with the lowest relative error (xerror) obtained during cross-validation and saved it in an object called ‘cp_min.’

Next, when creating the optimized tree ‘tree_tune,’ the parameter ‘cp’ was set to ‘cp_min,’ which means that the model was tuned up to the specified point on the complexity path, resulting in a tree with improved accuracy.

Predicted values

tips['p_tune'] = predict(tree_tune, tips)
tips$p %>% tail # investigar a previsão
## [1] 4.330000 4.330000 2.928182 3.637619 2.791000 2.791000
tips['r_tune'] = tips$tip - tips$p_tune

Evaluating the tuned tree

metrics(tips, "p_tune", "tip")
## Sum of Squares of Errors (SSE):  0 
## Mean Square Error (MSE):  0 
## Sum of Squares Total (SST):  465.2125 
## Mean Square Total (MSR):  1.906609 
## R-squared (R²):  1
grafico1(tips, "p_tune", "tip", "r_tune")

plot(tree_tune)

We can see that our optimized/tuned tree presents the same metrics as our initial tree. This happens because our initial tree was already well-fitted to the data’s variability, probably due to the low complexity in the relationships of these data. However, this exercise serves to demonstrate how a regression tree optimization process can be applied. In summary, in the example of this exercise, we could use our initial tree to try to predict tip amounts without compromising accuracy. But in databases where variables have more complex relationships, the procedure presented here could be of great value