Trees

Jake

11/08/2021

Decision Trees

  • Decision trees mimic the human decision-making process and produce a clear graphical representation, so they are often considered the most interpretable off-the-shelf machine learning model.
    • They can easily handle qualitative predictors without the need for dummy variables.
    • However, a single tree is often weak in terms of predictive accuracy, so we typically aggregate many trees.
  • Decision trees perform well when the decision boundary is highly complex and the relationship between the features and the response is non-linear.

CART Algo

  • The CART algorithm can build both regression and classification trees, while ID3 can only build classification trees.
  • CART can handle continuous predictors while ID3 cannot.
  • CART uses the Gini index as its splitting criterion, while ID3 uses entropy/information gain.

Regression Trees

  • A regression tree partitions the predictor space into leaves, and the prediction for a leaf is the average response of the training observations within it.
    • The partition is chosen to minimise the RSS, where \(\hat{y}_{R_j}\) is the mean response of the training observations in region \(R_j\):

\[ RSS = \sum^J_{j=1}\sum_{i\in R_j} (y_i-\hat{y}_{R_j})^2\]

  • The following is an example of a decision tree
[Figure: Decision tree]

[Figure: Partitioned predictor space]

  • It is computationally infeasible to consider every possible partition of the predictor space, therefore a top-down, greedy approach is taken where only the best split at the current step is considered.
  • Binary, recursive splits are made until a stopping criterion is reached (a code sketch follows this list).
    • Minimum observations in a leaf, maximum number of nodes, maximum number of branches, etc.
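A minimal sketch of growing a regression tree by greedy binary splitting, using scikit-learn's DecisionTreeRegressor on simulated data (the dataset and the stopping parameters here are illustrative assumptions, not from the notes):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Simulated regression data (purely illustrative).
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Greedy, top-down binary splitting; growth stops once max_depth is hit
# or a leaf would contain fewer than min_samples_leaf observations.
tree = DecisionTreeRegressor(max_depth=4, min_samples_leaf=10, random_state=0)
tree.fit(X_train, y_train)

# Each prediction is the mean response of the training observations
# in the leaf the test point falls into.
print("Test R^2:", round(tree.score(X_test, y_test), 3))
```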

Tree Pruning

  • A decision tree tends to overfit the data and become overly complex.
    • A smaller tree has less variance at the cost of some added bias.
  • The favoured strategy is to grow a large tree and prune back the branches to create a smaller subtree.
  • A common method is cost complexity pruning, which chooses the subtree that minimises the following criterion:
    • \(|T|\) is the number of terminal nodes
    • \(\alpha\) is a tuning parameter that controls the strength of the complexity penalty, and should be chosen through CV (a sketch of this follows the pruning figure below)

\[ \sum^{|T|}_{m=1}\sum_{x_i\in R_m}(y_i-\hat{y}_{R_m})^2+\alpha|T|\]

  • An example of a pruning algorithm for regression trees is as follows:

[Figure: Pruning algorithm]
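The notes say \(\alpha\) should be chosen through CV; a minimal sketch of doing this with scikit-learn's cost_complexity_pruning_path and a grid search over ccp_alpha (the data and CV settings are illustrative assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

# Grow a large tree, then recover the sequence of effective alphas that
# correspond to successively smaller pruned subtrees.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# Choose alpha by cross-validation over those candidate values.
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"ccp_alpha": path.ccp_alphas},
    cv=5,
)
search.fit(X, y)
print("Best alpha:", search.best_params_["ccp_alpha"])
```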

Classification Trees

  • Similar to regression trees, except the prediction is based on a majority vote within a leaf.
  • We consider different metrics to determine how good a leaf is, as RSS does not work in the classification case.
    • The classification error rate can be used, however it is typically not sensitive enough for tree growing.
  • Note that \(\hat{p}_{mk}\) is the proportion of training observations in region \(R_m\) that are from class \(k\):

\[ \hat{p}_{mk}=\frac{1}{N_m}\sum_{x_i\in R_m}I(y_i=k)\]

  • The Gini index is a measure of total variance across the \(K\) classes. In the two-class case it is minimised at \(\hat{p}=0\) or \(1\) and maximised at \(\hat{p}=0.5\):
    • The Gini index is the measure normally quoted, and the lower the value the purer the node.

\[ G = \sum^K_{k=1}\hat{p}_{mk}(1-\hat{p}_{mk})\]

  • Cross entropy is a similar score to the Gini index and gives very similar results (a comparison sketch follows the formula below):
    • Slightly slower to compute due to the log
    • Tends to produce more balanced trees
    • Often quoted as information gain, the reduction in entropy from a split, where larger values are better

\[ D = -\sum^K_{k=1}\hat{p}_{mk}\log(\hat{p}_{mk})\]
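A small sketch comparing the two impurity measures on some hypothetical leaf class proportions (the proportions are made up for illustration); the same choice is exposed in scikit-learn's DecisionTreeClassifier via its criterion parameter:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def gini(p):
    """Gini index: sum_k p_k (1 - p_k)."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1 - p)))

def entropy(p):
    """Cross entropy: -sum_k p_k log(p_k), ignoring empty classes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# A nearly pure node versus a maximally mixed node (two classes).
for props in ([0.9, 0.1], [0.5, 0.5]):
    print(props, "Gini:", round(gini(props), 3), "Entropy:", round(entropy(props), 3))

# The same criteria are available when growing a classification tree:
tree_gini = DecisionTreeClassifier(criterion="gini")
tree_entropy = DecisionTreeClassifier(criterion="entropy")
```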

Bagging

  • Bagging is a technique that can be applied to almost any statistical learning method, however it is most commonly used with trees, as their high-variance/low-bias nature works well when averaged.

  • Decision trees often suffer from high variance. We can reduce this variance by creating multiple trees on bootstrapped samples and averaging the result.

    • We bootstrap samples to simulate getting new training samples in order to create different trees.
  • Another method is pasting, which trains each tree on a subsample drawn without replacement.

    • Bagging has slightly higher bias than pasting but creates less correlated trees, therefore it often gives better results.

\[ \hat{f}_{\text{bag}}(x) = \frac{1}{B}\sum^B_{b=1}\hat{f}^{*b}(x)\]

  • For regression trees we average the predictions across trees, and for classification trees we take a majority vote from the resulting ensemble.

  • The number of trees, \(B\), is not a critical hyperparameter, as increasing \(B\) does not cause overfitting (a sketch of bagging versus pasting follows below).
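A minimal sketch of bagging versus pasting using scikit-learn's BaggingRegressor, whose default base estimator is a decision tree (the data and n_estimators values are illustrative assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Bagging: each tree is fit on a bootstrap sample (drawn with replacement).
bagging = BaggingRegressor(n_estimators=200, bootstrap=True, random_state=0)

# Pasting: each tree is fit on a subsample drawn without replacement.
pasting = BaggingRegressor(n_estimators=200, bootstrap=False, max_samples=0.7, random_state=0)

for name, model in [("bagging", bagging), ("pasting", pasting)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: CV R^2 = {scores.mean():.3f}")
```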

OOB Error

  • On average about \(\frac{2}{3}\) of the original training set appears in each bootstrapped sample, therefore around \(B/3\) trees, on average, will not contain a given observation \(i\).
    • We predict each observation using only the trees that did not include it, which is similar to LOOCV (see the sketch below).
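A small sketch of the OOB error estimate, assuming a bagged ensemble fit with scikit-learn (for regressors the oob_score_ attribute reports the OOB \(R^2\)):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# oob_score=True scores each observation using only the trees whose
# bootstrap samples did not include it.
model = BaggingRegressor(n_estimators=200, bootstrap=True, oob_score=True, random_state=0)
model.fit(X, y)
print("OOB R^2:", round(model.oob_score_, 3))
```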

Variable Importance

  • To aid interpretation we calculate variable importance measures by recording the RSS/Gini reduction due to splits on each predictor and standardising the results (illustrated below).
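A small sketch of extracting impurity-based importances from a fitted forest via scikit-learn's feature_importances_ attribute (the simulated data and model settings are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=6, n_informative=3, noise=5.0, random_state=0)

forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X, y)

# Impurity-based importances: the average RSS reduction attributable to
# splits on each predictor, normalised to sum to 1.
for idx in np.argsort(forest.feature_importances_)[::-1]:
    print(f"X{idx}: {forest.feature_importances_[idx]:.3f}")
```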

Random Forests

  • At each split only \(m\) of the \(p\) predictors are considered as split candidates

    • Strong predictors therefore appear in far fewer splits, so the effects of all predictors can be measured.
    • This reduces the correlation between trees and therefore results in a lower variance for the ensemble compared to ordinary bagging.
  • Like bagging, the trees are fit to bootstrapped datasets, therefore we can compute similar OOB error estimates.

  • On average, \(\frac{(p-m)}{p}\) of splits don’t consider a given strong predictor.

  • Like bagging, a random forest doesn't overfit as the number of trees increases (see the sketch below).
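A minimal sketch of a random forest in scikit-learn, where max_features plays the role of \(m\) (setting \(m = p\) recovers ordinary bagging; the data and the values of \(m\) are illustrative assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=12, noise=10.0, random_state=0)

# max_features is m, the number of predictors considered at each split.
# m = p recovers ordinary bagging; smaller m decorrelates the trees.
for m in (12, 6, 4):
    forest = RandomForestRegressor(
        n_estimators=200, max_features=m, oob_score=True, random_state=0
    )
    forest.fit(X, y)
    print(f"m={m}: OOB R^2 = {forest.oob_score_:.3f}")
```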

Extra Trees

  • We can add even more randomness to a random forest through extremely randomised trees (extra trees)
    • Trees are made more random by also using random split thresholds for each candidate feature rather than searching for the best possible threshold
  • Further increases bias for a decrease in variance.
  • Faster to train, as searching for the optimal threshold is computationally costly (see the comparison sketch below).
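A small sketch contrasting a random forest with extra trees via scikit-learn's ExtraTreesRegressor, which draws split thresholds at random (the simulated data is an illustrative assumption):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

models = {
    # Random forest: searches for the best threshold among the m candidate features.
    "random forest": RandomForestRegressor(n_estimators=200, random_state=0),
    # Extra trees: thresholds are drawn at random for each candidate feature,
    # trading a little extra bias for lower variance and faster training.
    "extra trees": ExtraTreesRegressor(n_estimators=200, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: CV R^2 = {scores.mean():.3f}")
```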

Boosting

  • Boosting, like bagging, can be applied to many statistical learning methods.

  • Boosting grows trees sequentially, each fit on a modified version of the original dataset, therefore no bootstrapping is required.

    • Each new tree is fit to the current residuals.
    • This improves \(\hat{f}\) in the areas where it is not performing well.
  • The algorithm is as follows:

[Figure: Boosting algorithm]

  • Parameters to consider (see the sketch below):
    • The number of trees is important here: since trees are built on residuals, there is potential for overfitting.
    • The shrinkage parameter \(\lambda\) controls the rate at which boosting learns from each new tree.
    • The number of splits, \(d\), determines how complex each tree is. \(d=1\) typically works well both empirically and for interpretability, as it results in an additive model.
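A minimal sketch of these three parameters using scikit-learn's GradientBoostingRegressor, where learning_rate corresponds to \(\lambda\) and max_depth to \(d\) (the data and parameter values are illustrative assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators is the number of trees B, learning_rate is the shrinkage
# parameter lambda, and max_depth=1 gives d=1 stumps (an additive model).
boost = GradientBoostingRegressor(
    n_estimators=500, learning_rate=0.05, max_depth=1, random_state=0
)
boost.fit(X_train, y_train)

# staged_predict shows test error as trees are added sequentially,
# which makes any overfitting in B visible.
for b, y_pred in enumerate(boost.staged_predict(X_test), start=1):
    if b % 100 == 0:
        mse = ((y_test - y_pred) ** 2).mean()
        print(f"B={b:4d}: test MSE = {mse:.1f}")
```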