29/06/2021

Classification trees

A classification tree is a statistical method for finding rules that assign observations to known classes or categories.
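In R, for example, such a tree can be grown with the rpart package. A minimal sketch on Fisher's iris data (one possible implementation among several):

library(rpart)

# Grow a classification tree: Species is the known class and the
# four flower measurements are the candidate predictors
tree <- rpart(Species ~ ., data = iris, method = "class")

# Print the splitting rules the tree found
print(tree)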

Example: Prof. X and his three students

Prof. X noticed that three species of Iris differed in the size of their flowers and asked his three favorite students to measure individuals and find out how to discriminate the species based on measurements.

Rule A…

Student A measured the sepal length and petal width of 33 individuals, and concluded “this is easy”. With a simple rule he got 100% of the individuals in the right species.

Rule B…

Student B measured the sepal width and petal length of only 15 individuals. He figured out that, using only petal length, he could assign most of the individuals to the right species.
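His exact thresholds are not reported; a one-variable rule in the spirit of his solution, with hypothetical cut-offs, might look like this in R:

# Hypothetical cut-offs, for illustration only
classify_by_petal_length <- function(petal_length) {
  if (petal_length < 2.5) {
    "setosa"
  } else if (petal_length < 4.9) {
    "versicolor"
  } else {
    "virginica"
  }
}

classify_by_petal_length(1.4)  # "setosa"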

Rule C…

Student C measured the sepal length and width of 100 individuals but was not satisfied with any solution he could think of:

[Figure: Rule C]

He said: “Maybe I should use a classification tree to solve this problem”. So he did, and found a solution that got at least 79% right:

[Figure: Rule C, the classification tree solution]
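A sketch of how Student C's tree could be grown in R; his actual sample is not available, so the full iris data stands in:

library(rpart)

# Only sepal length and width were measured by Student C
tree_C <- rpart(Species ~ Sepal.Length + Sepal.Width,
                data = iris, method = "class")

# Proportion of individuals assigned to the right species
mean(predict(tree_C, type = "class") == iris$Species)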

Three different trees…

In fact, each student used a different classification tree.

Compare the three results

They all came to Prof. X and presented their results.

He said: “Each one of you measured different individuals and different variables, and came to different classification rules. But what can we learn from this?”

If we take each student's rule (classification tree) and test it across the three datasets, we get the following proportions of correct assignments:

##           Data A    Data B  Data C 
## Rule A  1.0000000 1.0000000    0.95
## Rule B  0.9090909 0.9333333    0.88
## Rule C  0.7575758 0.6666667    0.79
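A sketch of how such a table could be computed, assuming the fitted trees and the students' samples are available under the hypothetical names tree_A, tree_B, tree_C and data_A, data_B, data_C:

# Hypothetical objects: the students' fitted rpart trees and samples
trees    <- list("Rule A" = tree_A, "Rule B" = tree_B, "Rule C" = tree_C)
datasets <- list("Data A" = data_A, "Data B" = data_B, "Data C" = data_C)

# Proportion of correct assignments of every rule on every dataset
accuracy <- sapply(datasets, function(d) {
  sapply(trees, function(tr) {
    mean(predict(tr, newdata = d, type = "class") == d$Species)
  })
})
accuracy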

What have we learnt?

  • Some variables are more informative than others
  • The values of your rules (thresholds) depend on the data you sampled
  • If you pick the ‘right’ variable and the ‘right’ thresholds, your classification tree will be simple and robust (everybody wants to be Student A!)

But if three trees are not good enough…

Now imagine that instead of 3 students we have 30, each with a different sample of individuals and each choosing two variables at random to measure (a sketch in code follows the list):

  • we will have 30 slightly different trees to classify the species!
  • we can test each tree with each (sub-)data set
  • we can compare which variables produce better trees (variable importance)
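A sketch of this thought experiment in R (the sample size of 50 individuals per student is an arbitrary choice):

library(rpart)
set.seed(1)

predictors <- names(iris)[1:4]   # the four flower measurements

# Each "student" draws a random sample of individuals and
# picks two variables at random to measure
trees_30 <- lapply(1:30, function(i) {
  rows <- sample(nrow(iris), size = 50)
  vars <- sample(predictors, size = 2)
  rpart(reformulate(vars, response = "Species"),
        data = iris[rows, ], method = "class")
})

# Test every tree on the full dataset
accuracies <- sapply(trees_30, function(tr) {
  mean(predict(tr, newdata = iris, type = "class") == iris$Species)
})
summary(accuracies)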

Random Forest

So a Random Forest is an ensemble of decision trees (hence “forest”) in which each tree is grown independently, using two randomization techniques:

  1. Bootstrap aggregation (Bagging) of observations, and
  2. Variable randomization

Bootstrap aggregation

Bagging consists of three steps:

  • The original data is sampled with replacement; this sample goes IN THE BAG.
  • The model is fitted with the data IN THE BAG.
  • The model is used to predict the values of the samples OUT OF THE BAG (OOB).

In a Random Forest the “model” is a single decision tree, and this procedure is repeated for each tree: every time, each observation falls either in or out of the bag.
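One bagging round can be sketched in R like this, with iris standing in as the dataset:

library(rpart)
set.seed(42)
n <- nrow(iris)

# Sample observation indices with replacement: these go IN THE BAG
in_bag_idx <- sample(n, size = n, replace = TRUE)
in_bag <- iris[in_bag_idx, ]

# Observations never drawn stay OUT OF THE BAG (about 37% on average)
oob <- iris[-unique(in_bag_idx), ]

# Fit one tree with the data in the bag and test it on the OOB data
tree <- rpart(Species ~ ., data = in_bag, method = "class")
mean(predict(tree, newdata = oob, type = "class") == oob$Species)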

Variable randomization

Decision trees use an optimality criterion to choose which variable to split on at each node. This means that decision trees grown on similar data and variables will converge to similar solutions.

In Random Forests only a random subset of the variables is considered at each split. This introduces variability among the trees, reduces their correlation, and allows the algorithm to explore different solutions.
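In the randomForest package this subset size is the mtry argument (for classification it defaults to the square root of the number of variables); a minimal sketch:

library(randomForest)
set.seed(1)

# Grow 500 trees; at each split only mtry = 2 of the 4 variables
# are candidates for the splitting rule
rf <- randomForest(Species ~ ., data = iris,
                   ntree = 500, mtry = 2, importance = TRUE)

rf$confusion   # out-of-bag confusion matrix
importance(rf) # which variables produce better trees?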