Decision Trees
- Decision trees mimic the human decision-making process and produce a clear graphical representation, and are therefore often considered the most interpretable off-the-shelf machine learning model.
- They can easily handle qualitative predictors without the need for dummy variables.
- However, a single tree is often weak in terms of predictive accuracy, so we often aggregate many trees.
- Decision trees perform well when the decision boundary is highly complex and there is a complex relationship between the features and the response.
CART Algo
- CART can be used to create both regression and classification trees, while ID3 can only create classification trees.
- CART can handle continuous predictors, while ID3 cannot.
- CART uses the Gini index, while ID3 uses entropy/information gain.
Regression Trees
- The tree partitions the predictor space into regions (leaves); the prediction for each leaf is the mean response of the training observations within it.
- The partition is chosen to minimise the RSS, where \(\hat{y}_{R_j}\) is the mean response of the training observations in region \(R_j\):
\[ RSS = \sum^J_{j=1}\sum_{i\in R_j} (y_i-\hat{y}_{R_j})^2\]
- The following is an example of a decision tree and its corresponding partitioned predictor space.
[Figure: Decision tree]
[Figure: Partitioned space]
- It is computationally infeasible to consider every possible partition of the predictor space, so a top-down, greedy approach is taken in which only the best split at the current step is considered (a minimal split-search sketch follows this list).
- Binary, recursive splits are made until a stopping criterion is reached.
- Typical stopping criteria: minimum observations in a leaf, number of nodes, number of branches, etc.
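As a concrete illustration of the greedy search, here is a minimal sketch in Python/NumPy (not from any particular library; `rss` and `best_split` are illustrative names) that finds the single best binary split by minimising RSS:

```python
import numpy as np

def rss(y):
    """Residual sum of squares of a node that predicts its mean."""
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def best_split(X, y):
    """Greedy search: find the (feature, threshold) pair that minimises
    the total RSS of the two resulting child nodes."""
    best = (None, None, np.inf)          # (feature index, threshold, total RSS)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            total = rss(left) + rss(right)
            if total < best[2]:
                best = (j, t, total)
    return best

# Recursive binary splitting would call best_split on each child in turn,
# stopping once a criterion (e.g. minimum observations per leaf) is met.
X = np.array([[1.0], [2.0], [3.0], [10.0]])
y = np.array([1.2, 1.0, 1.1, 5.0])
print(best_split(X, y))   # splits the outlying observation at x = 10 into its own leaf
```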
Tree Pruning
- A decision tree tends to overfit the data and become overly complex.
- A smaller tree has less variance at the cost of some added bias.
- The favoured strategy is to grow a large tree and prune back the branches to create a smaller subtree.
- A common method is cost complexity pruning:
- \(|T|\) is the number of terminal nodes
- \(\alpha\) is a tuning parameter that manages the complexity penalty, and should be chosen through CV
\[ \sum^{|T|}_{m=1}\sum_{x_i\in R_m}(y_i-\hat{y}_{R_m})^2+\alpha|T|\]
- An example of a pruning algorithm for regression trees is as follows:
[Figure: Pruning algorithm for regression trees]
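As a practical sketch, assuming scikit-learn and a synthetic dataset, cost complexity pruning can be carried out by extracting the sequence of effective \(\alpha\) values from a large tree and choosing among them by cross-validation; the dataset and parameter values here are illustrative only:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Grow a large tree first, then extract the sequence of effective alphas
# produced by weakest-link (cost complexity) pruning.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# Choose alpha by cross-validation, as suggested above.
scores = [
    cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0), X, y, cv=5).mean()
    for a in path.ccp_alphas
]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
pruned = DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print(f"alpha = {best_alpha:.3f}, leaves = {pruned.get_n_leaves()}")
```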
Classification Trees
- Similar to regression trees, except the prediction is based on a majority vote within each leaf.
- We consider different metrics to determine how good a leaf is, as RSS does not work in the classification case.
- The classification error rate can be considered; however, it is typically not sensitive enough for tree growing.
- Note that \(\hat{p}_{mk}\) is the proportion of training observations in region \(m\) that are from class \(k\):
\[ \hat{p}_{mk}=\frac{1}{N_m}\sum_{x_i\in R_m}I(y_i=k)\]
- The Gini index is a measure of total variance across the \(K\) classes; it is minimised when \(\hat{p}_{mk}\) is 0 or 1 and maximised at \(\hat{p}_{mk}=0.5\):
- Note that results are usually quoted directly as the Gini index, and the lower the better (purer nodes).
\[ G = \sum^K_{k=1}\hat{p}_{mk}(1-\hat{p}_{mk})\]
- Cross entropy is a similar score to Gini and has very similar results:
- Slightly longer to compute due to the log
- Tends to produce more balanced trees
- Often quoted as information gain (the reduction in entropy after a split), which we want to maximise (a small numerical sketch of both measures follows).
\[ D = -\sum^K_{k=1}\hat{p}_{mk}\log(\hat{p}_{mk})\]
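To make the two impurity measures concrete, here is a small numerical sketch (illustrative only) computing both for a node's class proportions:

```python
import numpy as np

def gini(p):
    """Gini index: sum_k p_k (1 - p_k)."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1 - p)))

def entropy(p):
    """Cross entropy: -sum_k p_k log(p_k), skipping empty classes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # avoid log(0)
    return float(-np.sum(p * np.log(p)))

# A pure node scores 0 on both; a 50/50 node scores the maximum.
print(gini([1.0, 0.0]), entropy([1.0, 0.0]))   # 0.0 0.0
print(gini([0.5, 0.5]), entropy([0.5, 0.5]))   # 0.5 ~0.693
```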
Bagging
Bagging is a general technique that can be applied to many statistical learning methods; however, it is most commonly used with trees, since they provide a high-variance, low-bias base learner that works well when averaged.
Decision trees often suffer from high variance. We can reduce this variance by creating multiple trees on bootstrapped samples and averaging the result.
- We bootstrap samples to simulate getting new training samples in order to create different trees.
Another method is pasting, which draws subsamples without replacement.
- Bagging has slightly higher bias than pasting but creates less correlated trees, and therefore often gives better results.
\[ \hat{f}_{\text{bag}}(x) = \frac{1}{B}\sum^B_{b=1}\hat{f}^{*b}(x)\]
- For regression trees we average the predicted values; for classification trees we take a majority vote across the resulting ensemble of trees.
- The number of trees, \(B\), is not a critical hyperparameter, as increasing \(B\) does not lead to overfitting (see the sketch below).
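A minimal sketch of bagging versus pasting, assuming scikit-learn's `BaggingRegressor` (whose default base learner is a decision tree) and a synthetic dataset; the subsample fraction used for pasting is an arbitrary illustrative choice:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

# bootstrap=True -> bagging (sampling with replacement)
bagging = BaggingRegressor(n_estimators=200, bootstrap=True, random_state=0)

# bootstrap=False with a subsample -> pasting (sampling without replacement)
pasting = BaggingRegressor(n_estimators=200, bootstrap=False,
                           max_samples=0.63, random_state=0)

for name, model in [("bagging", bagging), ("pasting", pasting)]:
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean CV R^2 = {score:.3f}")
```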
OOB Error
- On average, about \(\frac{2}{3}\) of the original training observations appear in each bootstrapped sample, so on average roughly \(B/3\) trees will not contain the \(i\)th observation.
- We predict each observation with the trees that do not include the observation. This is similar to LOOCV.
Variable Importance
- To aid interpretation we calculate variable importance measures by recording the total RSS (regression) or Gini (classification) reduction due to splits over each predictor, and standardising the results.
Random Forests
At each split, only a random subset of \(m\) out of the \(p\) predictors is considered as split candidates (typically \(m\approx\sqrt{p}\)).
- Strong predictors are therefore used in far fewer splits, so the effects of all predictors can be measured.
- This reduces the correlation between trees and therefore results in a lower variance for the ensemble when compared to normal bagging.
Like bagging, random forests are built on bootstrapped datasets, so we can compute similar OOB error estimates.
On average, \(\frac{(p-m)}{p}\) of splits don’t consider a given strong predictor.
Like bagging, a random forest does not overfit as the number of trees increases (see the sketch below).
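As a sketch, assuming scikit-learn and synthetic data, here is a random forest with \(m\approx\sqrt{p}\) candidate predictors per split, an OOB error estimate, and the variable importance measure described above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=16,
                           n_informative=4, random_state=0)

forest = RandomForestClassifier(
    n_estimators=500,        # B: more trees does not overfit
    max_features="sqrt",     # m ~ sqrt(p) predictors considered per split
    oob_score=True,          # OOB error estimate, no separate validation set
    random_state=0,
).fit(X, y)

print("OOB accuracy:", round(forest.oob_score_, 3))
# Mean decrease in Gini impurity per predictor, normalised to sum to 1.
print("Variable importances:", forest.feature_importances_.round(3))
```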
Extra Trees
- We can add even more randomness to a random forest by using extremely randomised trees (Extra Trees).
- Trees are made more random by also drawing random thresholds for each candidate feature, rather than searching for the best possible threshold.
- This further increases bias in exchange for a decrease in variance.
- It is also computationally faster, since searching for the optimal threshold is the costly part of growing a tree (a brief comparison sketch follows).
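For comparison, a brief sketch (again assuming scikit-learn and synthetic data) of extremely randomised trees alongside a random forest; the hyperparameter values are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=16,
                           n_informative=4, random_state=0)

for Model in (RandomForestClassifier, ExtraTreesClassifier):
    # Extra Trees draws random split thresholds instead of searching for
    # the best one, trading a little bias for lower variance and faster fits.
    score = cross_val_score(Model(n_estimators=300, random_state=0), X, y, cv=5).mean()
    print(Model.__name__, round(score, 3))
```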
Boosting
Boosting, like bagging, can be applied to many statistical learning methods.
Boosting grows trees sequentially, each fit on a modified version of the original dataset, so no bootstrapping is required.
- Fit sequential trees on the current residuals.
- This improves \(\hat{f}\) in the areas where it does not perform well.
The algorithm is as follows:
[Figure: Boosting algorithm]
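Separately from the algorithm shown above, here is a minimal sketch of boosting small regression trees on residuals, assuming scikit-learn trees and synthetic data; `B`, `lam`, and `d` correspond to the tuning parameters listed below, and the specific values are illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=400, n_features=5, noise=5.0, random_state=0)

B, lam, d = 500, 0.01, 1          # number of trees, shrinkage, splits per tree
prediction = np.zeros(len(y))     # f_hat starts at 0
residuals = y.copy()              # r starts as the raw response
trees = []

for _ in range(B):
    # Fit a small tree to the *current* residuals, not the original response.
    # max_depth is used as a stand-in for d splits (max_depth=1 gives a stump).
    tree = DecisionTreeRegressor(max_depth=d).fit(X, residuals)
    update = lam * tree.predict(X)
    prediction += update          # f_hat <- f_hat + lambda * f_b
    residuals -= update           # r <- r - lambda * f_b(x)
    trees.append(tree)

print("training RMSE:", round(float(np.sqrt(np.mean((y - prediction) ** 2))), 3))
```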
- Parameters to consider:
- The number of trees, \(B\), is important in this case: because trees are built on the residuals, boosting can overfit if \(B\) is too large.
- The shrinkage parameter \(\lambda\) controls the rate at which boosting learns from each new tree.
- The number of splits per tree, \(d\), determines how complex each tree is; \(d=1\) typically works well both empirically and for interpretability, as it results in an additive model.
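The same three tuning parameters map onto scikit-learn's `GradientBoostingRegressor`, which with its default squared-error loss is essentially the residual-fitting procedure sketched above; the values used here are illustrative, not recommendations:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=400, n_features=5, noise=5.0, random_state=0)

boost = GradientBoostingRegressor(
    n_estimators=1000,   # number of trees B -- too many can overfit
    learning_rate=0.01,  # shrinkage lambda
    max_depth=1,         # d = 1 split -> additive model of stumps
    random_state=0,
)
print("mean CV R^2:", round(cross_val_score(boost, X, y, cv=5).mean(), 3))
```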