Random Forest: A Study Guide with Mathematical and Coding
Representation
1. Introduction to Random Forest
Definition
Random Forest is an ensemble learning technique that extends
Bagging (Bootstrap Aggregation) by adding feature
randomness in addition to data randomness. It trains multiple decision
trees on different bootstrapped samples of the dataset and combines
their predictions for improved performance and stability.
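Random Forest is widely used for both classification
and regression tasks because it is robust to noisy features, needs
little preprocessing, and is less prone to overfitting than a single
decision tree. Whether missing values are handled natively depends on
the implementation and its version.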
2. Mathematical Representation of Random
Forest
Random Forest works by training multiple decision trees and
aggregating their outputs using:
- Bootstrap Sampling:
- Given a dataset \(D\) with \(N\) samples, multiple subsets \(D_b\) are created by randomly sampling
with replacement.
- Each subset is used to train an individual decision tree.
- Feature Randomness:
- Instead of considering all features at each split, only a
random subset of features is considered.
- Final Prediction:
- For Regression: The final prediction is the
average of all tree outputs:
\[
F(x) = \frac{1}{B} \sum_{b=1}^{B} f_b(x)
\]
- For Classification: The final prediction is the
majority vote among all trees:
\[
F(x) = \text{mode} \{ f_b(x) \}
\]
where:
- \(B\) is the total number of trees in the forest
- \(f_b(x)\) is the prediction from the \(b\)-th tree
By introducing randomness at both data and
feature levels, Random Forest reduces variance while
maintaining accuracy.
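The sampling and aggregation steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a full forest: the tree outputs below are made-up values for a single input \(x\) with \(B = 5\), and the names `regression_outputs` and `class_votes` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Bootstrap sampling: draw N row indices with replacement from {0, ..., N-1}
N = 1000
bootstrap_idx = rng.integers(0, N, size=N)
# Fraction of distinct original samples that made it into this bootstrap sample
unique_fraction = np.unique(bootstrap_idx).size / N

# Hypothetical outputs of B = 5 trees for one input x (illustrative values only)
regression_outputs = np.array([2.0, 2.5, 3.0, 2.5, 2.0])
class_votes = np.array([0, 1, 1, 2, 1])

# Regression: F(x) = (1/B) * sum_b f_b(x)
F_regression = regression_outputs.mean()

# Classification: F(x) = mode{f_b(x)}, the most frequent class label
F_classification = np.bincount(class_votes).argmax()

print(f"Distinct samples in bootstrap: {unique_fraction:.3f}")
print(f"Regression output: {F_regression}")
print(f"Classification output: {F_classification}")
```

Because sampling is done with replacement, each bootstrap sample leaves out roughly a third of the original rows (about \(1 - 1/e \approx 63\%\) of rows appear at least once for large \(N\)), which is one source of diversity among the trees.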
3. Random Forest in Decision Trees
- Bootstrap Sample Creation: Each tree is trained on
a different bootstrap sample drawn with replacement from the training set.
- Feature Selection: At each split, a random
subset of features is considered instead of all features.
- Independent Tree Training: Each tree is trained
independently and makes predictions without interaction.
- Aggregation: The final output is obtained via
majority voting (classification) or averaging (regression).
This feature randomness helps improve generalization by
reducing correlation among trees, making Random Forest
more robust than a single decision tree.
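The four steps above can be sketched by assembling a small forest by hand from individual decision trees. This is an illustrative sketch under stated assumptions, not how any library implements it internally: the choice of 25 trees, the seeds, and the variable names are arbitrary, and hard majority voting is used to match the formula in Section 2.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rng = np.random.default_rng(42)
n_trees = 25  # arbitrary choice for this sketch
trees = []
for b in range(n_trees):
    # Bootstrap sample creation: indices drawn with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # Feature selection: max_features="sqrt" considers a random subset at each split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=b)
    # Independent tree training: each tree sees only its own bootstrap sample
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Aggregation: majority vote over the trees for every test sample
all_preds = np.stack([t.predict(X_test) for t in trees])  # shape (n_trees, n_test)
forest_pred = np.array(
    [np.bincount(votes, minlength=3).argmax() for votes in all_preds.T]
)
manual_accuracy = (forest_pred == y_test).mean()
print(f"Accuracy of hand-built forest ({n_trees} trees): {manual_accuracy:.4f}")
```

Setting `max_features=None` in the loop would turn this into plain Bagging, which makes it easy to compare the two approaches on the same data.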
4. Python Implementation of Random Forest
Training a Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
# Train a Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=42)
rf.fit(X_train, y_train)
# Make predictions
y_pred_rf = rf.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
# Print results
print(f"Accuracy of Random Forest (100 trees): {accuracy_rf:.4f}")
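To connect the fitted model back to the aggregation step, the individual trees can be inspected through the `estimators_` attribute. One detail worth knowing: scikit-learn's `RandomForestClassifier` averages the class probabilities predicted by the trees and picks the class with the highest mean, rather than counting hard votes as in the mode formula. The sketch below refits the same model so that it is self-contained and checks that behaviour; `tree_probas`, `mean_proba`, and `soft_vote` are names introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
rf.fit(X_train, y_train)

# Class probabilities from every tree: shape (n_trees, n_test, n_classes)
tree_probas = np.stack([tree.predict_proba(X_test) for tree in rf.estimators_])

# Average over trees, then take the most probable class for each sample
mean_proba = tree_probas.mean(axis=0)
soft_vote = rf.classes_[mean_proba.argmax(axis=1)]

print("Matches rf.predict_proba:", np.allclose(mean_proba, rf.predict_proba(X_test)))
print("Matches rf.predict:", np.array_equal(soft_vote, rf.predict(X_test)))
```

When the trees are fully grown, their leaves are usually pure, so averaging probabilities and hard majority voting tend to give the same answer in practice.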
Feature Importance in Random Forest
import matplotlib.pyplot as plt
# Get feature importances
feature_importances = rf.feature_importances_
features = iris.feature_names
# Plot feature importances
plt.figure(figsize=(10, 5))
plt.barh(features, feature_importances, color='skyblue')
plt.xlabel("Feature Importance")
plt.ylabel("Feature")
plt.title("Feature Importance in Random Forest")
plt.show()
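When a plot is not convenient, the same importances can be printed as a ranked table. The sketch below refits the model from the previous example so that it runs on its own; the `ranked` list is a name introduced here. Note that `feature_importances_` holds impurity-based scores normalized to sum to 1, so each value reads as a share of the total.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=42)
rf.fit(X_train, y_train)

importances = rf.feature_importances_
# Indices of features sorted from most to least important
order = np.argsort(importances)[::-1]
ranked = [(iris.feature_names[i], float(importances[i])) for i in order]

for name, score in ranked:
    print(f"{name:<20} {score:.3f}")
print(f"Sum of importances: {importances.sum():.3f}")
```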
5. Key Takeaways
- Random Forest extends Bagging by adding feature
randomness, reducing overfitting and improving generalization.
- Each tree is trained on a different bootstrap
sample, ensuring diversity in predictions.
- Only a random subset of features is considered at
each split, preventing dominant features from overshadowing others.
- Random Forest is resistant to overfitting: adding more trees
stabilizes the averaged prediction rather than fitting the training data
more closely, although very deep trees can still pick up noise.
- Feature importance analysis helps identify the most relevant
predictors, making it a valuable tool for feature
selection.
By leveraging Random Forest, we can often improve predictive accuracy
over a single decision tree while maintaining robustness in machine
learning models.