Bagging: A Study Guide with Mathematical and Coding
Representation
1. Introduction to Bagging
Definition
Bagging (Bootstrap Aggregating) is an ensemble learning technique
that improves model stability and accuracy by training multiple models
on bootstrap samples of the training data (random samples drawn with
replacement) and then aggregating their predictions.
Bagging is widely used in algorithms like Random
Forests, where multiple decision trees are trained
independently and combined for improved performance.
2. Mathematical Representation of Bagging
Bagging reduces variance by training models on multiple bootstrapped
subsets and averaging their outputs.
Bootstrap Sampling
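Given a dataset \(D = \{x_1, \dots, x_N\}\) with \(N\) samples, bagging creates multiple
bootstrap samples \(D_b\), each typically of size \(N\), by drawing points from \(D\)
uniformly at random with replacement: \[
D_b = \{ x_{i_1}, \dots, x_{i_N} \}, \quad i_j \sim \text{Uniform}\{1, \dots, N\}
\]
Because points are drawn with replacement, some samples appear several times in \(D_b\) and others not at all.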
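As a minimal sketch of the sampling step, the snippet below draws one bootstrap sample of row indices with NumPy and measures how many distinct points it contains. The helper name `bootstrap_indices`, the seed, and the choice of `N = 10_000` are illustrative assumptions. For large \(N\), the expected fraction of distinct points approaches \(1 - 1/e \approx 0.632\).

```python
import numpy as np

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices uniformly at random, with replacement."""
    return rng.integers(0, n_samples, size=n_samples)

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
N = 10_000                      # illustrative dataset size
idx = bootstrap_indices(N, rng)

# Fraction of the original points that appear at least once in the bootstrap sample
unique_fraction = np.unique(idx).size / N
# Probability that a given point is drawn at least once in N draws
expected_fraction = 1 - (1 - 1 / N) ** N

print(f"Unique fraction in D_b: {unique_fraction:.3f}")
print(f"Expected fraction:      {expected_fraction:.3f}")
```

The points left out of a given \(D_b\) are what keeps the individual models from being identical.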
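Each model is trained on a different subset \(D_b\), and their predictions \(f_b(x)\) are aggregated:
- For Regression (Averaging Outputs): \[
F(x) = \frac{1}{B} \sum_{b=1}^{B} f_b(x)
\]
- For Classification (Majority Voting): \[
F(x) = \text{mode} \{ f_b(x) \}_{b=1}^{B}
\]
where:
- \(B\) is the total number of bootstrapped models
- \(f_b(x)\) is the prediction from the \(b\)-th model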
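The two aggregation rules (averaging for regression, majority voting for classification) can be sketched in NumPy as follows. The function names and the toy prediction arrays are hypothetical, and the voting helper assumes class labels are non-negative integers; ties go to the smallest label.

```python
import numpy as np

def aggregate_regression(predictions):
    """Average over models; predictions has shape (B, n_points)."""
    return np.mean(predictions, axis=0)

def aggregate_classification(predictions):
    """Majority vote over models; integer labels, shape (B, n_points)."""
    predictions = np.asarray(predictions)
    n_classes = predictions.max() + 1
    # votes[c, j] = number of models that predicted class c for point j
    votes = np.stack([(predictions == c).sum(axis=0) for c in range(n_classes)])
    return votes.argmax(axis=0)

# Toy predictions from B = 3 models
reg_preds = np.array([[2.0, 3.0],
                      [4.0, 5.0],
                      [6.0, 1.0]])
clf_preds = np.array([[0, 1, 1],
                      [1, 1, 0],
                      [0, 1, 0]])

reg_result = aggregate_regression(reg_preds)      # [4.0, 3.0]
clf_result = aggregate_classification(clf_preds)  # [0, 1, 0]
print(reg_result, clf_result)
```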
Bagging reduces overfitting and enhances robustness by averaging out the
variance of the individual models; it leaves their bias largely unchanged.
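To see where the variance reduction comes from, consider a simplified setting that is an assumption of this sketch rather than something stated above: every prediction \(f_b(x)\) has the same variance \(\sigma^2\), and any two predictions have the same pairwise correlation \(\rho\). For the averaged prediction \(F(x) = \frac{1}{B} \sum_{b=1}^{B} f_b(x)\):
\[
\operatorname{Var}[F(x)] = \frac{1}{B^2} \left( B\sigma^2 + B(B-1)\rho\sigma^2 \right) = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2
\]
For example, with \(\sigma^2 = 1\), \(\rho = 0.5\), and \(B = 50\), the variance drops from \(1\) to \(0.5 + 0.5/50 = 0.51\). The second term vanishes as \(B\) grows, while the first depends only on how correlated the models are. Under the same assumption of identically distributed models, \(E[F(x)] = E[f_b(x)]\), so averaging does not reduce bias.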
3. Bagging in Decision Trees
Bagging is particularly effective with decision trees: fully grown trees are
high-variance models, which benefit the most from aggregation.
- Bootstrap Sample Creation: Each tree is trained on
its own bootstrap sample of the training data.
- Independent Model Training: Each tree is fit and makes its
predictions without interacting with the other trees.
- Aggregation: The final output is obtained via
majority voting (classification) or averaging (regression).
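The three steps above can be sketched by hand, without `BaggingClassifier`. This is a minimal illustration under stated assumptions, not a reference implementation: the number of trees (25), the seeds, and the variable names are arbitrary choices, and the voting shortcut relies on the labels being 0/1 and the number of trees being odd.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rng = np.random.default_rng(42)
n_trees = 25               # odd, so a binary majority vote cannot tie
n_train = X_train.shape[0]

# Steps 1 and 2: draw a bootstrap sample per tree and fit each tree independently
trees = []
for _ in range(n_trees):
    idx = rng.integers(0, n_train, size=n_train)  # indices sampled with replacement
    tree = DecisionTreeClassifier(random_state=int(rng.integers(0, 2**31 - 1)))
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Step 3: majority vote over the trees' predictions
all_preds = np.stack([t.predict(X_test) for t in trees])  # shape (n_trees, n_test)
y_pred = (all_preds.mean(axis=0) > 0.5).astype(int)       # valid for 0/1 labels only

accuracy = (y_pred == y_test).mean()
print(f"Accuracy of hand-rolled bagging ({n_trees} trees): {accuracy:.4f}")
```

Note that scikit-learn's `BaggingClassifier` averages predicted class probabilities when the base estimator provides `predict_proba`, so its output can differ slightly from this hard vote.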
Bagging is the core technique behind Random Forests,
where additional randomness is introduced by selecting feature subsets
at each split.
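One way to see the effect of per-split feature subsets is to toggle `max_features` on scikit-learn's `RandomForestClassifier`: with `max_features=None` every split may use all features, which amounts to plain bagged trees, while `max_features="sqrt"` restricts each split to a random subset. The dataset, the 50 trees, and the 5-fold cross-validation are illustrative assumptions; which variant scores higher depends on the data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

models = {
    # All features are candidates at every split: bagged decision trees
    "Bagged trees (max_features=None)": RandomForestClassifier(
        n_estimators=50, max_features=None, random_state=42
    ),
    # About sqrt(n_features) random candidate features per split: a Random Forest
    "Random Forest (max_features='sqrt')": RandomForestClassifier(
        n_estimators=50, max_features="sqrt", random_state=42
    ),
}

mean_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # accuracy on each of 5 folds
    mean_scores[name] = scores.mean()
    print(f"{name}: mean CV accuracy = {mean_scores[name]:.4f}")
```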
4. Python Implementation of Bagging
Training a Bagging Classifier
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Load dataset
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, test_size=0.2, random_state=42)
# Train a single Decision Tree (without bagging)
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
accuracy_dt = accuracy_score(y_test, y_pred_dt)
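# Train a BaggingClassifier of 50 trees, each fit on its own bootstrap sample
# Note: `estimator` requires scikit-learn >= 1.2; older versions call it `base_estimator`
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50, random_state=42)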
bagging.fit(X_train, y_train)
y_pred_bagging = bagging.predict(X_test)
accuracy_bagging = accuracy_score(y_test, y_pred_bagging)
# Print results
print(f"Accuracy of single Decision Tree: {accuracy_dt:.4f}")
print(f"Accuracy with Bagging (50 estimators): {accuracy_bagging:.4f}")
Visualizing the Bagging Process
import matplotlib.pyplot as plt
import numpy as np
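# Example of bootstrap sampling (seeded so the figure is reproducible)
original_data = np.arange(1, 11)
rng = np.random.default_rng(42)
bootstrap_sample = rng.choice(original_data, size=10, replace=True)
# Bars of height 0 are values left out; bars above 1 are values drawn repeatedly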
plt.hist(bootstrap_sample, bins=np.arange(1, 12)-0.5, edgecolor='black', alpha=0.7)
plt.xticks(np.arange(1, 11))
plt.xlabel("Sample Values")
plt.ylabel("Frequency")
plt.title("Bootstrap Sampling Example")
plt.show()
5. Key Takeaways
- Bagging reduces variance by training multiple
models on bootstrapped subsets and aggregating predictions.
- Bootstrap sampling introduces diversity among the trained
models, making the aggregated model more robust.
- Bagging works best with high-variance models like
decision trees, enhancing their generalization ability.
- Random Forests extend bagging by adding feature
randomness at each split, which decorrelates the trees and typically
reduces variance further.
By leveraging bagging, we can often improve the predictive performance of
high-variance models while reducing overfitting.