Improving model performance

Ensemble Approach

Instead of improving the performance of a single model, several models can be combined into a more powerful team.
By making intelligent use of the strengths of diverse team members, an ensemble of multiple weak learners can form a strong learner.

Process Diagram

Advantages

Better generalizability to future problems.
Improved performance on massive or minuscule datasets.
The ability to synthesize data from distinct domains.
A more nuanced understanding of difficult learning tasks.

Most popular ensemble methods

Bagging (Bootstrap Aggregating)
- Described by Leo Breiman in 1994.
- Generates a number of training datasets by bootstrap sampling the original training data.
- The models’ predictions are combined using voting (for classification) or averaging (for numeric prediction).
- Often used with decision trees, which are relatively unstable learners.
Boosting
- Boosts the performance of weak learners to attain the performance of stronger learners.
- The key difference from bagging is that the resampled datasets in boosting are constructed specifically to generate complementary learners, and the vote is weighted based on each model’s performance rather than giving each an equal vote.
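
The bagging steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the helper names (`bootstrap_sample`, `bagged_models`, `vote`, `average`) and the one-dimensional threshold learner `fit_stump` are hypothetical and chosen only to keep the example self-contained.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

def bagged_models(data, fit, n_models, seed=0):
    """Fit one model per bootstrap sample; `fit` maps a dataset to a predict function."""
    rng = random.Random(seed)
    return [fit(bootstrap_sample(data, rng)) for _ in range(n_models)]

def vote(models, x):
    """Classification: every model gets one equal vote, the most common class wins."""
    return Counter(model(x) for model in models).most_common(1)[0][0]

def average(models, x):
    """Numeric prediction: plain mean of the individual predictions."""
    predictions = [model(x) for model in models]
    return sum(predictions) / len(predictions)

def fit_stump(sample):
    """Hypothetical base learner: a 1-D threshold rule that predicts 1 when x > t.

    The threshold t is the sample value with the fewest training errors.
    """
    best_t, best_err = None, None
    for t, _ in sample:
        err = sum((x > t) != bool(label) for x, label in sample)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return lambda x, t=best_t: int(x > t)

# Toy data (assumed for illustration): label is 1 when x >= 1.0.
data = [(x / 10, int(x >= 10)) for x in range(20)]
models = bagged_models(data, fit_stump, n_models=25)
print(vote(models, 0.2), vote(models, 1.8))
```

Boosting would differ in two places: the resampling would favor examples the earlier models got wrong, and `vote` would weight each model by its performance.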

Random Forest

Random forests (Breiman, 2001) are a substantial modification of bagging that builds a large collection of de-correlated trees and then averages them.
The method combines the base principles of bagging with random feature selection to add diversity to the decision tree models.
It can handle extremely large datasets.

Random Forest for Regression or Classification Algorithm

For \(b=1\) to \(B\):
1. Draw a bootstrap sample \(Z^{*}\) of size \(N\) from the training data.
2. Grow a random-forest tree \(T_b\) to the bootstrapped data, by recursively repeating the following steps for each terminal node of the tree, until the minimum node size \(n_{min}\) is reached.
  
  i. Select \(m\) variables at random from the \(p\) variables.
  
  ii. Pick the best variable/split-point among the \(m\).
  
  iii. Split the node into two daughter nodes.

Output the ensemble of trees \(\left\{T_b\right\}_1^B\).

To make a prediction at a new point \(x\):

\(Regression\): \(\hat{f}_{rf}^B(x)=\frac{1}{B}\sum_{b=1}^{B}T_b(x)\).

\(Classification\): Let \(\hat{C}_b(x)\) be the class prediction of the \(b\)th random-forest tree. Then \(\hat{C}_{rf}^B(x)=\text{majority vote}\left\{ \hat{C}_b(x)\right\}_1^B\).
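
The algorithm and the regression prediction rule above can be sketched in plain Python. This is a simplified illustration under stated assumptions: all function names are hypothetical, the split criterion is the sum of squared errors, and the stopping rule is interpreted as "never create a node with fewer than \(n_{min}\) observations". Only the regression case is shown; classification would replace the final average with a majority vote.

```python
import random

def avg(values):
    return sum(values) / len(values)

def sse(values):
    """Sum of squared deviations from the mean."""
    mu = avg(values)
    return sum((v - mu) ** 2 for v in values)

def grow_tree(X, y, m, n_min, rng):
    """Return a leaf (a float) or an internal node (feature, threshold, left, right)."""
    if len(y) < 2 * n_min or len(set(y)) == 1:
        return avg(y)
    best = None
    for j in rng.sample(range(len(X[0])), m):          # i. m of the p variables at random
        for t in sorted({row[j] for row in X})[:-1]:   # candidate split points
            left = [i for i, row in enumerate(X) if row[j] <= t]
            right = [i for i, row in enumerate(X) if row[j] > t]
            if len(left) < n_min or len(right) < n_min:
                continue
            cost = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or cost < best[0]:         # ii. best variable/split point
                best = (cost, j, t, left, right)
    if best is None:
        return avg(y)
    _, j, t, left, right = best
    # iii. split the node into two daughter nodes and recurse (no pruning)
    return (j, t,
            grow_tree([X[i] for i in left], [y[i] for i in left], m, n_min, rng),
            grow_tree([X[i] for i in right], [y[i] for i in right], m, n_min, rng))

def predict_tree(node, x):
    while isinstance(node, tuple):
        j, t, left, right = node
        node = left if x[j] <= t else right
    return node

def random_forest(X, y, B, m, n_min=5, seed=0):
    rng = random.Random(seed)
    N = len(X)
    forest = []
    for _ in range(B):
        idx = [rng.randrange(N) for _ in range(N)]     # 1. bootstrap sample of size N
        forest.append(grow_tree([X[i] for i in idx], [y[i] for i in idx], m, n_min, rng))
    return forest

def predict_forest(forest, x):
    """Regression: average the B tree predictions at x."""
    return avg([predict_tree(tree, x) for tree in forest])

# Toy data (assumed for illustration): only the first of p = 3 variables matters.
data_rng = random.Random(1)
X = [[data_rng.random() for _ in range(3)] for _ in range(120)]
y = [10.0 if row[0] > 0.5 else 0.0 for row in X]
forest = random_forest(X, y, B=15, m=2, n_min=5)
print(predict_forest(forest, [0.9, 0.5, 0.5]), predict_forest(forest, [0.1, 0.5, 0.5]))
```

Setting `m` equal to the number of variables turns this sketch into plain bagging of trees, which makes the role of random feature selection easy to isolate.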

Differences from a Standard Decision Tree

Train each tree on a bootstrap resample of the data.
Create each resample by drawing N samples with replacement.
For each split, consider only m randomly selected variables.
Do not prune.
Fit B trees and aggregate their results by averaging or majority voting.

Random Forest Variance Decomposition

Assume each tree has variance \(\sigma^2\) and each pair of distinct trees has correlation \(\rho\), so that \(Cov(T_i(x),T_j(x))=\rho\sigma^2\) for \(i\neq j\).

\(Var(\frac{1}{B}\sum_{i=1}^{B}T_i(x))=\frac{1}{B^2}\sum_{i=1}^{B}\sum_{j=1}^{B}Cov(T_i(x),T_j(x))\)

\(=\frac{1}{B^2}\sum_{i=1}^{B}\left (\sum_{j\neq i}^{B}Cov(T_i(x),T_j(x))+Var(T_i(x)) \right )\)

\(=\frac{1}{B^2}\sum_{i=1}^{B}\left ( (B-1) \sigma^2 \cdot \rho +\sigma ^2\right )\)

\(=\frac{B(B-1)\rho\sigma^2+B\sigma^2}{B^2}=\frac{(B-1)\rho\sigma^2}{B}+\frac{\sigma^2}{B}\)

\(=\rho\sigma^2-\frac{\rho\sigma^2}{B}+\frac{\sigma^2}{B}=\rho\sigma^2+\sigma^2\frac{1-\rho}{B}\)

\(\rho \sigma^2\) decreases if \(\rho\) decreases, which happens when \(m\) is reduced.
\(\sigma^2\frac{1-\rho}{B}\) decreases if the number of trees \(B\) increases.
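
The decomposition can be checked numerically. The sketch below is an illustration only: it does not fit any trees, but stands in for the \(B\) tree predictions with Gaussian variables that share a common component, so that each has variance \(\sigma^2\) and each pair has correlation \(\rho \ge 0\). The function names and the simulation settings are assumptions.

```python
import random

def ensemble_variance(rho, sigma2, B):
    """Closed form from the derivation: rho*sigma^2 + sigma^2*(1 - rho)/B."""
    return rho * sigma2 + sigma2 * (1 - rho) / B

def simulated_variance(rho, sigma2, B, n_draws=20000, seed=0):
    """Empirical variance of the average of B equicorrelated 'tree predictions'."""
    rng = random.Random(seed)
    sigma = sigma2 ** 0.5
    averages = []
    for _ in range(n_draws):
        shared = rng.gauss(0, 1)  # component common to all B predictions
        trees = [sigma * (rho ** 0.5 * shared + (1 - rho) ** 0.5 * rng.gauss(0, 1))
                 for _ in range(B)]
        averages.append(sum(trees) / B)
    mu = sum(averages) / n_draws
    return sum((a - mu) ** 2 for a in averages) / n_draws

# Compare the closed form with the simulation for a growing number of trees.
for B in (1, 10, 100):
    print(B, ensemble_variance(0.5, 4.0, B), simulated_variance(0.5, 4.0, B, n_draws=5000))
```

As \(B\) grows, the second term vanishes and the variance approaches the floor \(\rho\sigma^2\), which is why lowering the correlation between trees matters more than adding trees beyond some point.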

Decision Tree vs. Random Forest

Trees

- fast
- easy to tune parameters
- tends to have a high variance

Random Forest

+ smaller prediction variance
+ easy to tune parameters
+ OOB (out-of-bag) error “for free” (no cross-validation needed)
- slow
- black box, hard to explain
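
The out-of-bag (OOB) error comes "for free" because each bootstrap sample leaves out part of the training data, roughly \((1-1/N)^N \approx 0.37\) of it for large \(N\), and those observations can serve as a validation set for that tree. The sketch below illustrates the bookkeeping under stated assumptions: the function names are hypothetical, the error measure is mean squared error, and `fit_mean` is a deliberately trivial stand-in for a tree learner.

```python
import random

def bootstrap_with_oob(n, rng):
    """Return (in-bag indices drawn with replacement, sorted out-of-bag indices)."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    oob = sorted(set(range(n)) - set(in_bag))
    return in_bag, oob

def oob_error(X, y, fit, B, seed=0):
    """Mean squared OOB error.

    Each observation is predicted only by the models whose bootstrap sample
    did not contain it, so no separate validation split is needed.
    """
    rng = random.Random(seed)
    n = len(X)
    sums, counts = [0.0] * n, [0] * n
    for _ in range(B):
        in_bag, oob = bootstrap_with_oob(n, rng)
        model = fit([X[i] for i in in_bag], [y[i] for i in in_bag])
        for i in oob:
            sums[i] += model(X[i])
            counts[i] += 1
    errors = [(sums[i] / counts[i] - y[i]) ** 2 for i in range(n) if counts[i] > 0]
    return sum(errors) / len(errors)

def fit_mean(X_train, y_train):
    """Hypothetical base learner: always predicts the training mean."""
    mu = sum(y_train) / len(y_train)
    return lambda x: mu

# Share of observations left out of one bootstrap sample.
in_bag, oob = bootstrap_with_oob(1000, random.Random(0))
print(len(oob) / 1000)

# Toy data (assumed for illustration).
X = [[i / 50] for i in range(50)]
y = [2.0 * row[0] for row in X]
print(oob_error(X, y, fit_mean, B=30))
```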

Random Forest

Yan Ping Wu

2018-09-04
