The “tips” dataset is a commonly used example in tutorials and documentation to illustrate data visualization techniques and exploratory analysis. It provides information about tables at a bar or restaurant. Let’s do an exercise where we will try to predict the tip amount given by the tables using a regression tree modeled on this dataset.
packages <- c('tidyverse','rpart','rpart.plot','gtools','Rmisc','scales','viridis','caret','AMR','randomForest','fastDummies','rattle','xgboost','ggpubr','reshape2')
if(sum(as.numeric(!packages %in% installed.packages())) != 0){
instalador <- packages[!packages %in% installed.packages()]
for(i in 1:length(instalador)) {
install.packages(instalador, dependencies = T)
break()}
sapply(packages, require, character = T)
} else {
sapply(packages, require, character = T)
}
data(tips)
tips %>% head
## total_bill tip sex smoker day time size
## 1 16.99 1.01 Female No Sun Dinner 2
## 2 10.34 1.66 Male No Sun Dinner 3
## 3 21.01 3.50 Male No Sun Dinner 3
## 4 23.68 3.31 Male No Sun Dinner 2
## 5 24.59 3.61 Female No Sun Dinner 4
## 6 25.29 4.71 Male No Sun Dinner 4
tree <- rpart(tip~.,
data=tips,
control=rpart.control(maxdepth = 4, cp=0))
tips['p'] = predict(tree, tips)
tips$p %>% tail # olhando a previsão
## [1] 4.330000 4.330000 2.928182 3.637619 2.791000 2.791000
tips['r'] = tips$tip - tips$p
plot <- function(arvore_){
paleta <- scales::viridis_pal(begin=.75, end=1)(20)
plot <- rpart.plot::rpart.plot(arvore_,
box.palette = paleta)}
plot(tree)
metrics <- function(tips_in, p_var, tip_var){
n <- dim(tips_in)[1]
SSE <- sum((tips_in[tip_var] - tips[tip_var])^2)
MSE <- SSE/n
# Calculation of SSE (Sum of Squares Total)
SST <- sum((tips_in[tip_var] - (tips_in[tip_var] %>% sum)/n)^2)
MSR <- SST/n
# Calculation of R-squared
R_squared <- 1 - SSE/SST
# Printing the results
cat("Sum of Squares of Errors (SSE): ", SSE, "\n")
cat("Mean Square Error (MSE): ", MSE, "\n")
cat("Sum of Squares Total (SST): ", SST, "\n")
cat("Mean Square Total (MSR): ", MSR, "\n")
cat("R-squared (R²): ", R_squared, "\n")
}
metrics(tips, "p", "tip")
## Sum of Squares of Errors (SSE): 0
## Mean Square Error (MSE): 0
## Sum of Squares Total (SST): 465.2125
## Mean Square Total (MSR): 1.906609
## R-squared (R²): 1
grafico1 <- function(data, x_var, y_var, r_var) {
ggplot(data) +
geom_point(aes(x = !!sym(x_var), y = !!sym(y_var), color = !!sym(r_var))) +
theme(legend.position="bottom") +
ggtitle("Scatterplot") +
scale_color_viridis_c()
}
grafico1(tips, "p", "tip", "r")
Despite our tree already showing good metrics, we will perform a procedure to improve its accuracy for study purposes. We will create a new tree with ‘maxdepth = 30’ and evaluate its ‘complexity path.’ From this evaluation, we can create another, more ‘optimized’ tree.
tree_hm <- rpart(tip~.,
data=tips[, !(names(tips) %in% c("p", "r"))],
xval=10,
control = rpart.control(cp = 0,
minsplit = 2,
maxdepth = 30)
)
tips['p_hm'] = predict(tree_hm, tips)
tips$p %>% tail # investigar a previsão
## [1] 4.330000 4.330000 2.928182 3.637619 2.791000 2.791000
tips['r_hm'] = tips$tip - tips$p_hm
metrics(tips, "p_hm", "tip")
## Sum of Squares of Errors (SSE): 0
## Mean Square Error (MSE): 0
## Sum of Squares Total (SST): 465.2125
## Mean Square Total (MSR): 1.906609
## R-squared (R²): 1
grafico1(tips, "p_hm", "tip", "r_hm")
n plotaremos essa árvore para evitar travamentos no computador devido a extensão da mesma.
tab_cp <- rpart::printcp(tree_hm)
rpart::plotcp(tree_hm)
tab_cp[which.min(tab_cp[,'xerror']),]
## CP nsplit rel error xerror xstd
## 0.01556767 6.00000000 0.42020104 0.59689617 0.07520624
cp_min <- tab_cp[which.min(tab_cp[,'xerror']),'CP']
cp_min
## [1] 0.01556767
tree_tune <- rpart(tip~.,
data=tips[, !(names(tips) %in% c("p", "r", "p_hm", "r_hm"))],
xval=0,
control = rpart.control(cp = cp_min,
maxdepth = 30)
)
From the analysis of the complexity path, we identified the value that represents the point on the complexity path with the lowest relative error (xerror) obtained during cross-validation and saved it in an object called ‘cp_min.’
Next, when creating the optimized tree ‘tree_tune,’ the parameter ‘cp’ was set to ‘cp_min,’ which means that the model was tuned up to the specified point on the complexity path, resulting in a tree with improved accuracy.
tips['p_tune'] = predict(tree_tune, tips)
tips$p %>% tail # investigar a previsão
## [1] 4.330000 4.330000 2.928182 3.637619 2.791000 2.791000
tips['r_tune'] = tips$tip - tips$p_tune
metrics(tips, "p_tune", "tip")
## Sum of Squares of Errors (SSE): 0
## Mean Square Error (MSE): 0
## Sum of Squares Total (SST): 465.2125
## Mean Square Total (MSR): 1.906609
## R-squared (R²): 1
grafico1(tips, "p_tune", "tip", "r_tune")
plot(tree_tune)
We can see that our optimized/tuned tree presents the same metrics as our initial tree. This happens because our initial tree was already well-fitted to the data’s variability, probably due to the low complexity in the relationships of these data. However, this exercise serves to demonstrate how a regression tree optimization process can be applied. In summary, in the example of this exercise, we could use our initial tree to try to predict tip amounts without compromising accuracy. But in databases where variables have more complex relationships, the procedure presented here could be of great value