Improving model performance
Ensemble Approach
- As an alternative to increasing the performance of a single model, it is possible to combine several models to form a powerful team.
- By intelligently using the talents of several diverse team members, it is possible to create a strong team of multiple weak learners.
Process Diagram
Advantages
Better generalizability to future problems.
Improved performance on massive or minuscule datasets.
The ability to synthesize data from distinct domains.
A more nuanced understanding of difficult learning tasks.
Most popular ensemble methods
- Bagging (Bootstrap Aggregating)
Described by Leo Breiman in 1994
Generates a number of training datasets by bootstrap sampling the original training data.
The models’ predictions are combined using voting (for classification) or averaging (for numeric prediction).
Often used with decision trees, which are relatively unstable learners.
- Boosting
It boosts the performance of weak learners so that, combined, they approach the performance of stronger learners.
The key difference from bagging is that the resampled datasets in boosting are constructed specifically to generate complementary learners, and the vote is weighted by each model’s performance rather than giving each model an equal vote.
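The two combination rules described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a full bagging or boosting implementation: the helper names (`bootstrap_sample`, `majority_vote`, `average`, `weighted_vote`) and the example predictions and weights are assumptions made up for this sketch.

```python
import random
from collections import Counter, defaultdict


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement (one bagging resample)."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]


def majority_vote(predictions):
    """Bagging for classification: every model gets one equal vote."""
    return Counter(predictions).most_common(1)[0][0]


def average(predictions):
    """Bagging for numeric prediction: plain mean of the model outputs."""
    return sum(predictions) / len(predictions)


def weighted_vote(predictions, weights):
    """Boosting-style vote: each model's vote counts in proportion to its weight."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


rng = random.Random(0)
resample = bootstrap_sample(list(range(10)), rng)  # same size as the original data

# Hypothetical predictions from three models for one observation.
predictions = ["spam", "ham", "ham"]
weights = [0.9, 0.3, 0.4]  # made-up performance-based weights

equal_result = majority_vote(predictions)              # "ham": two votes to one
weighted_result = weighted_vote(predictions, weights)  # "spam": 0.9 against 0.7
```

With equal votes the two weaker models win, while the weighted vote follows the single model with the highest weight. How the weights are derived depends on the specific boosting algorithm and is not shown here.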
Random Forest
- Random forests (Breiman, 2001) is a substantial modification of bagging that builds a large collection of de-correlated trees and then averages them.
- Combines the base principles of bagging with random feature selection to add additional diversity to the decision tree models.
- It can handle extremely large datasets.
Random Forest for Regression or Classification Algorithm
1. For \(b=1\) to \(B\):
   (a) Draw a bootstrap sample \(Z^{*}\) of size \(N\) from the training data.
   (b) Grow a random-forest tree \(T_b\) on the bootstrapped data by recursively repeating the following steps for each terminal node of the tree, until the minimum node size \(n_{min}\) is reached.
      i. Select \(m\) variables at random from the \(p\) variables.
      ii. Pick the best variable/split-point among the \(m\).
      iii. Split the node into two daughter nodes.
2. Output the ensemble of trees \(\left\{T_b\right\}_1^B\).
To make a prediction at a new point \(x\):
Regression: \(\hat{f}_{rf}^B(x)=\frac{1}{B}\sum_{b=1}^{B}T_b(x)\).
Classification: Let \(\hat{C}_b(x)\) be the class prediction of the \(b\)th random-forest tree. Then \(\hat{C}_{rf}^B(x)=\text{majority vote}\left\{\hat{C}_b(x)\right\}_1^B\).
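The algorithm above can be sketched in plain Python for the classification case. This is a simplified illustration under several assumptions that are not in the original: Gini impurity as the split criterion, midpoints between observed values as candidate split points, dictionaries as internal nodes, and a tiny made-up dataset. All function names are hypothetical. For regression, the majority vote would be replaced by an average and the split criterion by a squared-error measure.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(X, y, features):
    """Best (feature, threshold) among `features` by weighted Gini, or None."""
    best, best_score = None, gini(y)
    for f in features:
        values = sorted({row[f] for row in X})
        for low, high in zip(values, values[1:]):
            t = (low + high) / 2  # candidate split point between two observed values
            left = [label for row, label in zip(X, y) if row[f] <= t]
            right = [label for row, label in zip(X, y) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def grow_tree(X, y, m, n_min, rng):
    """Grow one unpruned tree; internal nodes are dicts, leaves are labels."""
    majority = Counter(y).most_common(1)[0][0]
    if len(y) <= n_min or len(set(y)) == 1:
        return majority
    features = rng.sample(range(len(X[0])), m)  # i. select m of the p variables
    split = best_split(X, y, features)          # ii. best variable/split point
    if split is None:                           # no candidate improves impurity
        return majority
    f, t = split
    left = [i for i, row in enumerate(X) if row[f] <= t]
    right = [i for i, row in enumerate(X) if row[f] > t]
    return {                                    # iii. two daughter nodes
        "feature": f,
        "threshold": t,
        "left": grow_tree([X[i] for i in left], [y[i] for i in left], m, n_min, rng),
        "right": grow_tree([X[i] for i in right], [y[i] for i in right], m, n_min, rng),
    }


def predict_tree(node, x):
    """Follow the splits from the root until a leaf label is reached."""
    while isinstance(node, dict):
        go_left = x[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node


def random_forest(X, y, B, m, n_min=1, seed=0):
    """Fit B trees, each on its own bootstrap sample of size N."""
    rng = random.Random(seed)
    n = len(y)
    forest = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # draw N indices with replacement
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx], m, n_min, rng))
    return forest


def predict_forest(forest, x):
    """Majority vote over the class predictions of all trees."""
    votes = Counter(predict_tree(tree, x) for tree in forest)
    return votes.most_common(1)[0][0]


# Made-up, well-separated toy data: class "a" near 0, class "b" near 10.
X = [[i, i] for i in range(5)] + [[10 + i, 10 + i] for i in range(5)]
y = ["a"] * 5 + ["b"] * 5
forest = random_forest(X, y, B=25, m=1)
label_low = predict_forest(forest, [1, 1])
label_high = predict_forest(forest, [12, 12])
```

The sketch favors readability over speed: every candidate split rescans the node's data, which a production implementation would avoid.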
Differences from a Standard Decision Tree
- Train each tree on a bootstrap resample of the data.
- Create each resample by drawing N samples with replacement.
- For each split, consider only m randomly selected variables.
- Do not prune.
- Fit B trees and aggregate their results by averaging or majority voting.
Random Forest Variance Decomposition
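Assume each tree \(T_i(x)\) has variance \(\sigma^2\) and every pair of distinct trees has correlation \(\rho\), so that \(\operatorname{Cov}(T_i(x),T_j(x))=\rho\sigma^2\) for \(i\neq j\). Then:
\(\operatorname{Var}\left(\frac{1}{B}\sum_{i=1}^{B}T_i(x)\right)=\frac{1}{B^2}\sum_{i=1}^{B}\sum_{j=1}^{B}\operatorname{Cov}(T_i(x),T_j(x))\)
\(=\frac{1}{B^2}\sum_{i=1}^{B}\left(\sum_{j\neq i}^{B}\operatorname{Cov}(T_i(x),T_j(x))+\operatorname{Var}(T_i(x))\right)\)
\(=\frac{1}{B^2}\sum_{i=1}^{B}\left((B-1)\rho\sigma^2+\sigma^2\right)\)
\(=\frac{B(B-1)\rho\sigma^2+B\sigma^2}{B^2}=\frac{(B-1)\rho\sigma^2}{B}+\frac{\sigma^2}{B}\)
\(=\rho\sigma^2-\frac{\rho\sigma^2}{B}+\frac{\sigma^2}{B}=\rho\sigma^2+\sigma^2\frac{1-\rho}{B}\)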
- \(\rho\sigma^2\) decreases if \(\rho\) decreases (that is, if \(m\) decreases).
- \(\sigma^2\frac{1-\rho}{B}\) decreases if the number of trees \(B\) increases.
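The closed form can be checked numerically against the double sum of covariances it was derived from. The sketch below uses illustrative values for \(\sigma^2\) and \(\rho\) (assumptions chosen for this example only) and shows the variance approaching the floor \(\rho\sigma^2\) as \(B\) grows.

```python
def ensemble_variance(sigma2, rho, B):
    """Closed form: rho * sigma^2 + sigma^2 * (1 - rho) / B."""
    return rho * sigma2 + sigma2 * (1 - rho) / B


def ensemble_variance_from_cov(sigma2, rho, B):
    """Same quantity from the double sum of covariances divided by B^2."""
    total = 0.0
    for i in range(B):
        for j in range(B):
            # Diagonal terms are variances; off-diagonal terms are covariances.
            total += sigma2 if i == j else rho * sigma2
    return total / B ** 2


sigma2, rho = 4.0, 0.25  # illustrative values only
variances = {B: ensemble_variance(sigma2, rho, B) for B in (1, 10, 100, 1000)}
floor = rho * sigma2  # limit as B grows; only a lower rho reduces it
```

With these values the variance falls from 4.0 for a single tree to about 1.3 for ten trees, and then levels off just above the floor of 1.0. Adding more trees cannot push the variance below \(\rho\sigma^2\), which is why de-correlating the trees matters.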
Decision Tree vs. Random Forest
Trees
- fast
- easy to tune parameters
- tends to have a high variance
Random Forest
- (+) Smaller prediction variance.
- (+) Easy to tune parameters.
- (+) Out-of-bag (OOB) error estimate “for free” (no cross-validation needed).
- (-) Slow.
- (-) Black box, hard to explain.
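The OOB error comes "for free" because each bootstrap sample leaves out some of the observations, and those out-of-bag observations can serve as a validation set for the tree grown on that sample. The sketch below only shows how the in-bag and out-of-bag indices arise and what fraction to expect. The function names and the sample size are assumptions for this example, and computing the actual OOB error would also require the fitted trees.

```python
import random


def bootstrap_with_oob(n, rng):
    """Return (in_bag, oob) index lists for one bootstrap sample of size n."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    drawn = set(in_bag)
    oob = [i for i in range(n) if i not in drawn]  # never drawn for this tree
    return in_bag, oob


def expected_oob_fraction(n):
    """Probability that a given observation is missed by all n draws."""
    return (1 - 1 / n) ** n


rng = random.Random(0)
n = 1000
in_bag, oob = bootstrap_with_oob(n, rng)
observed = len(oob) / n
expected = expected_oob_fraction(n)  # close to exp(-1), about 0.368
```

Roughly a third of the observations are out of bag for any given tree, so every observation can be predicted by the subset of trees that never saw it.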