Processing math: 100%

Partition Trees: A Study Guide with Mathematical and Coding Representation

1. Introduction to Partition Trees

Definition

Partition trees, also known as decision trees, are hierarchical models used for classification and regression. They recursively split data into subsets based on feature values to create structured decision paths.

Partition trees form the basis for advanced algorithms such as Random Forests and XGBoost, making them essential in modern data science.


2. Mathematical Representation of Partition Trees

A partition tree works by selecting the best feature to split data at each node. The selection is based on minimizing impurity, often measured using Entropy or Gini Impurity.

Entropy-Based Split

To determine the best split: IG(S,A)=H(S)vA|Sv||S|H(Sv) where: - IG(S,A) is the Information Gain for splitting attribute A - H(S) is the entropy of the dataset S - Sv represents subsets after splitting by A

Gini-Based Split

The Gini Impurity of a node is given by: G(X)=1ki=1p2i where: - G(X) is the impurity score - pi represents the probability of each class in the node

The best split minimizes the weighted Gini impurity across resulting partitions.


3. Partition Trees in Decision Trees

A decision tree model recursively partitions the feature space into subregions.

Example: Iris Dataset Split

A decision tree on the Iris dataset could split based on petal length and width: - If petal length < 2.45 → classify as Setosa - If petal length > 2.45 and petal width < 1.75 → classify as Versicolor - Otherwise, classify as Virginica


4. Python Implementation of Partition Trees

Training a Decision Tree Classifier

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

# Train Decision Tree Classifier
clf = DecisionTreeClassifier(criterion='gini', max_depth=3)
clf.fit(X_train, y_train)

print("Decision Tree Accuracy:", clf.score(X_test, y_test))

Visualizing the Tree

from sklearn import tree
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
tree.plot_tree(clf, feature_names=iris.feature_names, class_names=iris.target_names, filled=True)
plt.show()

5. Key Takeaways

  1. Partition Trees use recursive binary splits to classify data.
  2. Splitting criteria include Entropy and Gini Impurity, chosen based on the dataset and computation efficiency.
  3. Decision trees are prone to overfitting, requiring hyperparameter tuning (e.g., max depth, min samples per leaf).
  4. Partition Trees are the foundation of ensemble models such as Random Forests and Gradient Boosted Trees.
LS0tDQp0aXRsZTogIlBhcnRpdGlvbiBUcmVlcyINCm91dHB1dDogaHRtbF9ub3RlYm9vaw0KLS0tDQoNCioqUGFydGl0aW9uIFRyZWVzOiBBIFN0dWR5IEd1aWRlIHdpdGggTWF0aGVtYXRpY2FsIGFuZCBDb2RpbmcgUmVwcmVzZW50YXRpb24qKg0KDQojIyAqKjEuIEludHJvZHVjdGlvbiB0byBQYXJ0aXRpb24gVHJlZXMqKg0KDQojIyMgKipEZWZpbml0aW9uKioNClBhcnRpdGlvbiB0cmVlcywgYWxzbyBrbm93biBhcyBkZWNpc2lvbiB0cmVlcywgYXJlIGhpZXJhcmNoaWNhbCBtb2RlbHMgdXNlZCBmb3IgY2xhc3NpZmljYXRpb24gYW5kIHJlZ3Jlc3Npb24uIFRoZXkgcmVjdXJzaXZlbHkgc3BsaXQgZGF0YSBpbnRvIHN1YnNldHMgYmFzZWQgb24gZmVhdHVyZSB2YWx1ZXMgdG8gY3JlYXRlIHN0cnVjdHVyZWQgZGVjaXNpb24gcGF0aHMuDQoNClBhcnRpdGlvbiB0cmVlcyBmb3JtIHRoZSBiYXNpcyBmb3IgYWR2YW5jZWQgYWxnb3JpdGhtcyBzdWNoIGFzICoqUmFuZG9tIEZvcmVzdHMqKiBhbmQgKipYR0Jvb3N0KiosIG1ha2luZyB0aGVtIGVzc2VudGlhbCBpbiBtb2Rlcm4gZGF0YSBzY2llbmNlLg0KDQotLS0NCg0KIyMgKioyLiBNYXRoZW1hdGljYWwgUmVwcmVzZW50YXRpb24gb2YgUGFydGl0aW9uIFRyZWVzKioNCg0KQSBwYXJ0aXRpb24gdHJlZSB3b3JrcyBieSBzZWxlY3RpbmcgdGhlIGJlc3QgZmVhdHVyZSB0byBzcGxpdCBkYXRhIGF0IGVhY2ggbm9kZS4gVGhlIHNlbGVjdGlvbiBpcyBiYXNlZCBvbiBtaW5pbWl6aW5nIGltcHVyaXR5LCBvZnRlbiBtZWFzdXJlZCB1c2luZyAqKkVudHJvcHkqKiBvciAqKkdpbmkgSW1wdXJpdHkqKi4NCg0KIyMjICoqRW50cm9weS1CYXNlZCBTcGxpdCoqDQpUbyBkZXRlcm1pbmUgdGhlIGJlc3Qgc3BsaXQ6DQpcWw0KSUcoUywgQSkgPSBIKFMpIC0gXHN1bV97diBcaW4gQX0gXGZyYWN7fFNfdnx9e3xTfH0gSChTX3YpDQpcXQ0Kd2hlcmU6DQotIFwoIElHKFMsIEEpIFwpIGlzIHRoZSAqKkluZm9ybWF0aW9uIEdhaW4qKiBmb3Igc3BsaXR0aW5nIGF0dHJpYnV0ZSBcKCBBIFwpDQotIFwoIEgoUykgXCkgaXMgdGhlIGVudHJvcHkgb2YgdGhlIGRhdGFzZXQgXCggUyBcKQ0KLSBcKCBTX3YgXCkgcmVwcmVzZW50cyBzdWJzZXRzIGFmdGVyIHNwbGl0dGluZyBieSBcKCBBIFwpDQoNCiMjIyAqKkdpbmktQmFzZWQgU3BsaXQqKg0KVGhlIEdpbmkgSW1wdXJpdHkgb2YgYSBub2RlIGlzIGdpdmVuIGJ5Og0KXFsNCkcoWCkgPSAxIC0gXHN1bV97aT0xfV57a30gcF9pXjINClxdDQp3aGVyZToNCi0gXCggRyhYKSBcKSBpcyB0aGUgaW1wdXJpdHkgc2NvcmUNCi0gXCggcF9pIFwpIHJlcHJlc2VudHMgdGhlIHByb2JhYmlsaXR5IG9mIGVhY2ggY2xhc3MgaW4gdGhlIG5vZGUNCg0KVGhlIGJlc3Qgc3BsaXQgbWluaW1pemVzIHRoZSB3ZWlnaHRlZCBHaW5pIGltcHVyaXR5IGFjcm9zcyByZXN1bHRpbmcgcGFydGl0aW9ucy4NCg0KLS0tDQoNCiMjICoqMy4gUGFydGl0aW9uIFRyZWVzIGluIERlY2lzaW9uIFRyZWVzKioNCg0KQSBkZWNpc2lvbiB0cmVlIG1vZGVsIHJlY3Vyc2l2ZWx5IHBhcnRpdGlvbnMgdGhlIGZlYXR1cmUgc3BhY2UgaW50byBzdWJyZWdpb25zLg0KDQotICoqU3RvcHBpbmcgQ3JpdGVyaWEqKjogVGhlIHRyZWUgc3RvcHMgZ3Jvd2luZyB3aGVuOg0KICAtIEEgcHJlZGVmaW5lZCBtYXhpbXVtIGRlcHRoIGlzIHJlYWNoZWQNCiAgLSBBIG1pbmltdW0gbnVtYmVyIG9mIHNhbXBsZXMgcGVyIG5vZGUgaXMgbWV0DQogIC0gVGhlIHNwbGl0IGRvZXMgbm90IHByb3ZpZGUgc2lnbmlmaWNhbnQgaW5mb3JtYXRpb24gZ2Fpbg0KDQojIyMgKipFeGFtcGxlOiBJcmlzIERhdGFzZXQgU3BsaXQqKg0KQSBkZWNpc2lvbiB0cmVlIG9uIHRoZSAqKklyaXMgZGF0YXNldCoqIGNvdWxkIHNwbGl0IGJhc2VkIG9uICoqcGV0YWwgbGVuZ3RoKiogYW5kICoqd2lkdGgqKjoNCi0gKipJZiBwZXRhbCBsZW5ndGggPCAyLjQ1Kiog4oaSIGNsYXNzaWZ5IGFzICpTZXRvc2EqDQotICoqSWYgcGV0YWwgbGVuZ3RoID4gMi40NSBhbmQgcGV0YWwgd2lkdGggPCAxLjc1Kiog4oaSIGNsYXNzaWZ5IGFzICpWZXJzaWNvbG9yKg0KLSAqKk90aGVyd2lzZSoqLCBjbGFzc2lmeSBhcyAqVmlyZ2luaWNhKg0KDQotLS0NCg0KIyMgKio0LiBQeXRob24gSW1wbGVtZW50YXRpb24gb2YgUGFydGl0aW9uIFRyZWVzKioNCg0KIyMjICoqVHJhaW5pbmcgYSBEZWNpc2lvbiBUcmVlIENsYXNzaWZpZXIqKg0KYGBgcHl0aG9uDQpmcm9tIHNrbGVhcm4udHJlZSBpbXBvcnQgRGVjaXNpb25UcmVlQ2xhc3NpZmllcg0KZnJvbSBza2xlYXJuLmRhdGFzZXRzIGltcG9ydCBsb2FkX2lyaXMNCmZyb20gc2tsZWFybi5tb2RlbF9zZWxlY3Rpb24gaW1wb3J0IHRyYWluX3Rlc3Rfc3BsaXQNCg0KIyBMb2FkIGRhdGFzZXQNCmlyaXMgPSBsb2FkX2lyaXMoKQ0KWF90cmFpbiwgWF90ZXN0LCB5X3RyYWluLCB5X3Rlc3QgPSB0cmFpbl90ZXN0X3NwbGl0KGlyaXMuZGF0YSwgaXJpcy50YXJnZXQsIHRlc3Rfc2l6ZT0wLjIsIHJhbmRvbV9zdGF0ZT00MikNCg0KIyBUcmFpbiBEZWNpc2lvbiBUcmVlIENsYXNzaWZpZXINCmNsZiA9IERlY2lzaW9uVHJlZUNsYXNzaWZpZXIoY3JpdGVyaW9uPSdnaW5pJywgbWF4X2RlcHRoPTMpDQpjbGYuZml0KFhfdHJhaW4sIHlfdHJhaW4pDQoNCnByaW50KCJEZWNpc2lvbiBUcmVlIEFjY3VyYWN5OiIsIGNsZi5zY29yZShYX3Rlc3QsIHlfdGVzdCkpDQpgYGANCg0KIyMjICoqVmlzdWFsaXppbmcgdGhlIFRyZWUqKg0KYGBgcHl0aG9uDQpmcm9tIHNrbGVhcm4gaW1wb3J0IHRyZWUNCmltcG9ydCBtYXRwbG90bGliLnB5cGxvdCBhcyBwbHQNCg0KcGx0LmZpZ3VyZShmaWdzaXplPSgxMiwgNikpDQp0cmVlLnBsb3RfdHJlZShjbGYsIGZlYXR1cmVfbmFtZXM9aXJpcy5mZWF0dXJlX25hbWVzLCBjbGFzc19uYW1lcz1pcmlzLnRhcmdldF9uYW1lcywgZmlsbGVkPVRydWUpDQpwbHQuc2hvdygpDQpgYGANCg0KLS0tDQoNCiMjICoqNS4gS2V5IFRha2Vhd2F5cyoqDQoxLiAqKlBhcnRpdGlvbiBUcmVlcyB1c2UgcmVjdXJzaXZlIGJpbmFyeSBzcGxpdHMqKiB0byBjbGFzc2lmeSBkYXRhLg0KMi4gKipTcGxpdHRpbmcgY3JpdGVyaWEgaW5jbHVkZSBFbnRyb3B5IGFuZCBHaW5pIEltcHVyaXR5KiosIGNob3NlbiBiYXNlZCBvbiB0aGUgZGF0YXNldCBhbmQgY29tcHV0YXRpb24gZWZmaWNpZW5jeS4NCjMuICoqRGVjaXNpb24gdHJlZXMgYXJlIHByb25lIHRvIG92ZXJmaXR0aW5nKiosIHJlcXVpcmluZyBoeXBlcnBhcmFtZXRlciB0dW5pbmcgKGUuZy4sIG1heCBkZXB0aCwgbWluIHNhbXBsZXMgcGVyIGxlYWYpLg0KNC4gKipQYXJ0aXRpb24gVHJlZXMgYXJlIHRoZSBmb3VuZGF0aW9uIG9mIGVuc2VtYmxlIG1vZGVscyoqIHN1Y2ggYXMgUmFuZG9tIEZvcmVzdHMgYW5kIEdyYWRpZW50IEJvb3N0ZWQgVHJlZXMuDQoNCg==