Decision trees - exercise solutions

dr. Annelies Agten

2026-04-27

Multivariate Statistics - StatUa

Questions:

Use createDataPartition to split the penguins data.

Why is a training/test split even more critical for Decision Trees than for LDA?

Fit a classification tree using the rpart package.

Visualize the tree.

Which variable is at the very top (the “Root Node”)?

Why did the algorithm choose this variable first?

Examine the Complexity Parameter (CP) table.

Identify the ‘elbow’ where adding more splits does not significantly reduce the cross-validation error.

Should you “prune” this tree back, or is the default size appropriate?

Use your tree to predict the test set. Compare the results to your previous LDA model.

Does the tree struggle with the same species as the LDA?

Look at the “Leaf Nodes”—are there any “pure” groups?

If the tree says “If Bill Length < 43.3mm, then Adelie,” but a biological textbook says the cutoff is 40mm, what do you do? How do you incorporate “theoretical knowledge” into a machine-learning model?

Decision trees are highly flexible models that can easily overfit the training data. Unlike LDA, which imposes a linear structure, trees can create very specific splits that capture noise rather than general patterns. A test set is therefore essential to evaluate how well the model generalizes.
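The overfitting risk described above can be made visible with a small experiment. The sketch below is an illustration only: it uses the built-in iris data as a stand-in for the penguins so that it runs with just the rpart package, uses a plain random split instead of createDataPartition, and the names train_demo, test_demo, fit_default, fit_deep and accuracy are hypothetical helpers, not part of the exercise.

```r
# Sketch only: iris stands in for the penguins data so this runs with rpart alone.
library(rpart)

set.seed(22)
# Simple 80/20 random split with base R (not stratified like createDataPartition)
idx <- sample(nrow(iris), size = round(0.8 * nrow(iris)))
train_demo <- iris[idx, ]
test_demo  <- iris[-idx, ]

# Default tree versus a deliberately overgrown tree (no complexity penalty)
fit_default <- rpart(Species ~ ., data = train_demo, method = "class")
fit_deep    <- rpart(Species ~ ., data = train_demo, method = "class",
                     control = rpart.control(cp = 0, minsplit = 2, minbucket = 1))

# Proportion of correct class predictions
accuracy <- function(fit, data) {
  mean(predict(fit, data, type = "class") == data$Species)
}

# Rows = model, columns = data set the accuracy was measured on
acc <- rbind(
  default = c(train = accuracy(fit_default, train_demo),
              test  = accuracy(fit_default, test_demo)),
  deep    = c(train = accuracy(fit_deep, train_demo),
              test  = accuracy(fit_deep, test_demo))
)
print(round(acc, 3))
```

If the overgrown tree scores higher on the training rows without a matching gain on the test rows, that gap is the overfitting that only a held-out test set can reveal.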

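# Packages used throughout: data, wrangling, splitting, trees, plotting
# Assumption: penguins comes from palmerpenguins (the column names match that package)
library(palmerpenguins)
library(dplyr)
library(caret)
library(rpart)
library(rpart.plot)

data(penguins, package = "palmerpenguins")

# Drop rows with any missing value, then keep species and the four measurements
df <- na.omit(penguins)

df <- df %>%
  select(species, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)

set.seed(22)

# Stratified 80/20 split on species
trainIndex <- createDataPartition(df$species, p = 0.8, list = FALSE)

train <- df[trainIndex, ]

test <- df[-trainIndex, ]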

The root node is the first split in the tree and the only one that partitions the whole training set. It is chosen because, among all candidate splits, it gives the largest reduction in impurity (e.g., Gini index). In this example, flipper length (flipper_length_mm, split at 206.5 mm) is at the root because it best separates the species at the first split.

tree_model <- rpart(species ~ ., data = train, method = "class") 

rpart.plot(tree_model)


tree_model
#> n= 268 
#> 
#> node), split, n, loss, yval, (yprob)
#>       * denotes terminal node
#> 
#> 1) root 268 151 Adelie (0.436567164 0.205223881 0.358208955)  
#>   2) flipper_length_mm< 206.5 168  52 Adelie (0.690476190 0.303571429 0.005952381)  
#>     4) bill_length_mm< 43.35 118   5 Adelie (0.957627119 0.042372881 0.000000000) *
#>     5) bill_length_mm>=43.35 50   4 Chinstrap (0.060000000 0.920000000 0.020000000) *
#>   3) flipper_length_mm>=206.5 100   5 Gentoo (0.010000000 0.040000000 0.950000000)  
#>     6) bill_depth_mm>=17.05 7   3 Chinstrap (0.142857143 0.571428571 0.285714286) *
#>     7) bill_depth_mm< 17.05 93   0 Gentoo (0.000000000 0.000000000 1.000000000) *

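Flipper length is chosen first because it provides the best initial split in terms of impurity reduction (e.g., Gini index). In other words, it separates the species most effectively at the first step: the branch with flippers of 206.5 mm or longer contains 100 penguins, 95% of which are Gentoo.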
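The impurity reduction of the root split can be recomputed by hand. The sketch below recovers the class counts from the printed tree (n times the class proportions of nodes 1, 2 and 3) and applies the Gini index, which is rpart's default splitting criterion for classification. The helper gini and the variable names are introduced here for illustration.

```r
# Class counts (Adelie, Chinstrap, Gentoo) recovered from the printed tree:
# n times the class proportions of nodes 1, 2 and 3.
root  <- c(Adelie = 117, Chinstrap = 55, Gentoo = 96)  # node 1, n = 268
left  <- c(Adelie = 116, Chinstrap = 51, Gentoo = 1)   # node 2, flipper_length_mm <  206.5
right <- c(Adelie = 1,   Chinstrap = 4,  Gentoo = 95)  # node 3, flipper_length_mm >= 206.5

# The two children must add up to the root
stopifnot(all(left + right == root))

# Gini impurity: 1 minus the sum of squared class proportions
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Impurity after the split = child impurities weighted by child size
n <- sum(root)
weighted_children <- (sum(left) * gini(left) + sum(right) * gini(right)) / n
reduction <- gini(root) - weighted_children

round(c(root = gini(root), children = weighted_children, reduction = reduction), 3)
```

The root impurity of about 0.64 drops to about 0.31 after this single split, a reduction of roughly 0.33. The algorithm evaluates every candidate variable and cut point in this way and keeps the split with the largest reduction.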

printcp(tree_model)
#> 
#> Classification tree:
#> rpart(formula = species ~ ., data = train, method = "class")
#> 
#> Variables actually used in tree construction:
#> [1] bill_depth_mm     bill_length_mm    flipper_length_mm
#> 
#> Root node error: 151/268 = 0.56343
#> 
#> n= 268 
#> 
#>         CP nsplit rel error  xerror     xstd
#> 1 0.622517      0  1.000000 1.00000 0.053770
#> 2 0.284768      1  0.377483 0.37748 0.044364
#> 3 0.013245      2  0.092715 0.11258 0.026425
#> 4 0.010000      3  0.079470 0.12583 0.027825

plotcp(tree_model)

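The CP table shows the trade-off between model complexity and predictive performance. The “elbow” is the point where further splits no longer meaningfully reduce the cross-validated error (xerror). Here xerror falls from 1.00 to 0.38 to 0.11 over the first two splits and then rises slightly to 0.13 with the third, so the elbow is at two splits (CP = 0.013245). Pruning back to two splits is therefore reasonable, although the difference from the default three-split tree is well within one standard error (xstd of about 0.03).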
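The elbow can also be located numerically. The sketch below copies the values from the printcp() output above into a data frame, computes how much each extra split lowers xerror, and applies the one-standard-error rule: pick the smallest tree whose xerror lies within one xstd of the minimum. This rule is a common alternative to simply taking the minimum xerror; it is shown here as a cross-check, not as the method this solution uses. The names cp_tab, i_min, threshold and i_1se are hypothetical.

```r
# CP table values copied from the printcp() output above
cp_tab <- data.frame(
  CP     = c(0.622517, 0.284768, 0.013245, 0.010000),
  nsplit = c(0, 1, 2, 3),
  xerror = c(1.00000, 0.37748, 0.11258, 0.12583),
  xstd   = c(0.053770, 0.044364, 0.026425, 0.027825)
)

# Drop in cross-validated error gained by each additional split
# (negative values mean the error went up)
cp_tab$gain <- c(NA, -diff(cp_tab$xerror))

# One-standard-error rule: smallest tree whose xerror is within
# one xstd of the minimum xerror (rows are ordered from small to large trees)
i_min     <- which.min(cp_tab$xerror)
threshold <- cp_tab$xerror[i_min] + cp_tab$xstd[i_min]
i_1se     <- which(cp_tab$xerror <= threshold)[1]

print(cp_tab)
cp_tab[i_1se, c("CP", "nsplit")]
```

For this table both criteria agree on the tree with two splits: the third split has a negative gain, and the one-split tree lies far outside the one-standard-error band.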

best_cp <- tree_model$cptable[which.min(tree_model$cptable[,"xerror"]), "CP"] 

pruned_tree <- prune(tree_model, cp = best_cp) 

rpart.plot(pruned_tree)

Pruning the tree can reduce overfitting, simplify interpretation, and improve how well the results generalize. In the pruned tree, however, none of the leaf nodes is “pure”: the three leaves are about 96% Adelie, 92% Chinstrap, and 95% Gentoo, which means that the species overlap somewhat on these measurements.

pred_tree <- predict(pruned_tree, test, type = "class") 

confusionMatrix(pred_tree, test$species)
#> Confusion Matrix and Statistics
#> 
#>            Reference
#> Prediction  Adelie Chinstrap Gentoo
#>   Adelie        27         0      0
#>   Chinstrap      1        12      0
#>   Gentoo         1         1     23
#> 
#> Overall Statistics
#>                                          
#>                Accuracy : 0.9538         
#>                  95% CI : (0.871, 0.9904)
#>     No Information Rate : 0.4462         
#>     P-Value [Acc > NIR] : <2e-16         
#>                                          
#>                   Kappa : 0.9277         
#>                                          
#>  Mcnemar's Test P-Value : 0.3916         
#> 
#> Statistics by Class:
#> 
#>                      Class: Adelie Class: Chinstrap Class: Gentoo
#> Sensitivity                 0.9310           0.9231        1.0000
#> Specificity                 1.0000           0.9808        0.9524
#> Pos Pred Value              1.0000           0.9231        0.9200
#> Neg Pred Value              0.9474           0.9808        1.0000
#> Prevalence                  0.4462           0.2000        0.3538
#> Detection Rate              0.4154           0.1846        0.3538
#> Detection Prevalence        0.4154           0.2000        0.3846
#> Balanced Accuracy           0.9655           0.9519        0.9762
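The headline figures in this output can be recomputed directly from the confusion matrix, which helps when reading the per-class statistics. The sketch below copies the matrix from the output above (rows are predictions, columns are the reference) and uses base R only. The object names cm, sensitivity and ppv are chosen here for illustration.

```r
# Confusion matrix copied from the output above (rows = prediction, columns = reference)
species <- c("Adelie", "Chinstrap", "Gentoo")
cm <- matrix(
  c(27,  0,  0,
     1, 12,  0,
     1,  1, 23),
  nrow = 3, byrow = TRUE,
  dimnames = list(Prediction = species, Reference = species)
)

accuracy    <- sum(diag(cm)) / sum(cm)   # correct predictions / all predictions
sensitivity <- diag(cm) / colSums(cm)    # per species: correctly found / actually present
ppv         <- diag(cm) / rowSums(cm)    # per species: correct / predicted as that species

round(accuracy, 4)
round(rbind(sensitivity, ppv), 4)
```

Sensitivity is computed down the columns and the positive predictive value along the rows, which is why Gentoo has perfect sensitivity (all 23 Gentoo were found) but a positive predictive value of 0.92 (2 of the 25 penguins predicted as Gentoo were another species).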

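When looking at the confusion matrix, we see that the tree achieves about 95% accuracy on the test set (62 of 65 penguins classified correctly). The three errors are two Adelie penguins (one predicted as Chinstrap, one as Gentoo) and one Chinstrap predicted as Gentoo. This is slightly worse than the LDA analysis. Decision trees generally perform well but slightly worse than LDA. This is because: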

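If the model finds a cutoff that differs from established theory, do NOT blindly trust the model; integrate domain knowledge instead. The tree's cutoff is estimated from a single training sample and would shift with a different sample, so you always have to consider the model's limitations!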
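One practical way to weigh a textbook cutoff against the tree's cutoff is to score both rules on held-out data. The sketch below is a minimal illustration under stated assumptions: rule_accuracy is a hypothetical helper (not part of rpart or caret), and the toy vectors are invented values, not penguin measurements, so the numbers they produce carry no evidential weight.

```r
# Accuracy of the one-variable rule "bill length < cutoff => Adelie".
# rule_accuracy() is a hypothetical helper, not part of rpart or caret.
rule_accuracy <- function(bill_length, is_adelie, cutoff) {
  mean((bill_length < cutoff) == is_adelie)
}

# Toy values invented for illustration only (not real penguin measurements)
toy_bill   <- c(36.2, 38.9, 39.5, 41.1, 42.0, 44.5, 46.3, 49.8)
toy_adelie <- c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)

# The two competing cutoffs in mm: textbook value versus the tree's split
cutoffs <- c(textbook = 40, tree = 43.35)
acc_by_cutoff <- sapply(cutoffs, function(k) rule_accuracy(toy_bill, toy_adelie, k))
print(acc_by_cutoff)

# With the objects from this document, the same comparison would be run on
# the held-out rows of the short-flipper branch, for example:
# branch <- subset(test, flipper_length_mm < 206.5)
# sapply(cutoffs, function(k)
#   rule_accuracy(branch$bill_length_mm, branch$species == "Adelie", k))
```

If the two cutoffs perform similarly on the test set, the theoretically motivated value is usually the easier one to defend. If they differ clearly, it is worth asking whether the sample or the textbook population is the unusual one before choosing.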