Entropy and Gini: A Study Guide with Mathematical and Coding
Representation
1. Introduction to Entropy and Gini
Definition of Entropy
Entropy is a measure of disorder or uncertainty in a system. In the
context of information theory, entropy quantifies the unpredictability
of a random variable. The higher the entropy, the more disorderly the
system, while lower entropy signifies more predictability and order.
Entropy plays a crucial role in fields such as machine learning,
physics, and thermodynamics. In decision trees, entropy helps in
determining the best attribute for splitting data.
Definition of Gini Impurity
Gini impurity is another metric used in decision trees to measure how
often a randomly chosen element would be incorrectly classified. It
represents the probability of misclassification of a randomly chosen
sample.
Unlike entropy, which uses logarithmic calculations, Gini impurity
uses squared probabilities, making it computationally faster and often
preferred in large datasets.
2. Mathematical Representation of Entropy and Gini
Impurity
Entropy
For a discrete random variable \(X\)
with possible outcomes \(x_1, x_2, ...,
x_n\) and corresponding probabilities \(p_1, p_2, ..., p_n\), entropy is defined
as:
\[
H(X) = - \sum_{i=1}^{n} p_i \log_2 p_i
\]
where: - \(H(X)\) represents the
entropy of the variable \(X\) - \(p_i\) is the probability of occurrence of
outcome \(x_i\) - The base of the
logarithm is typically 2, measuring entropy in bits
Gini Impurity
For a classification problem with \(k\) classes and probabilities \(p_1, p_2, ..., p_k\), Gini impurity \(G(X)\) is calculated as:
\[
G(X) = 1 - \sum_{i=1}^{k} p_i^2
\]
where: - \(G(X)\) represents the
impurity of the variable \(X\) - \(p_i\) is the probability of an instance
belonging to class \(i\)
Example: Fair Coin Flip
If \(X\) represents the outcome of a
fair coin flip: - \(p(\text{heads}) =
0.5\) - \(p(\text{tails}) =
0.5\)
Then, \[
H(X) = - (0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1 \, \text{bit}
\] \[
G(X) = 1 - (0.5^2 + 0.5^2) = 0.5
\]
A biased coin with \(p(\text{heads}) =
0.7\) and \(p(\text{tails}) =
0.3\) gives: \[
H(X) = - (0.7 \log_2 0.7 + 0.3 \log_2 0.3) \approx 0.88 \, \text{bits}
\] \[
G(X) = 1 - (0.7^2 + 0.3^2) = 0.42
\]
Lower Gini impurity values indicate purer subsets.
3. Python Implementation of Entropy and Gini
Impurity
Entropy Calculation in Python
import numpy as np
def entropy(probabilities):
return -np.sum(probabilities * np.log2(probabilities))
# Example: Fair coin flip
probs_fair = np.array([0.5, 0.5])
print("Entropy (Fair Coin):", entropy(probs_fair)) # Output: 1.0
# Example: Biased coin
probs_biased = np.array([0.7, 0.3])
print("Entropy (Biased Coin):", entropy(probs_biased)) # Output: ~0.88
Gini Impurity Calculation in Python
def gini_impurity(probabilities):
return 1 - np.sum(probabilities**2)
# Example: Fair coin flip
probs_fair = np.array([0.5, 0.5])
print("Gini Impurity (Fair Coin):", gini_impurity(probs_fair)) # Output: 0.5
# Example: Biased coin
probs_biased = np.array([0.7, 0.3])
print("Gini Impurity (Biased Coin):", gini_impurity(probs_biased)) # Output: ~0.42
4. Entropy and Gini in Decision Trees
Python Implementation in Decision Trees
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# Load dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)
# Train Decision Tree using entropy
clf_entropy = DecisionTreeClassifier(criterion='entropy', max_depth=3)
clf_entropy.fit(X_train, y_train)
print("Decision Tree Accuracy (Entropy):", clf_entropy.score(X_test, y_test))
# Train Decision Tree using Gini Impurity
clf_gini = DecisionTreeClassifier(criterion='gini', max_depth=3)
clf_gini.fit(X_train, y_train)
print("Decision Tree Accuracy (Gini):", clf_gini.score(X_test, y_test))
5. Key Takeaways
- Entropy and Gini Impurity both measure disorder in
a dataset and help guide decision tree splits.
- Entropy requires logarithms, making it
computationally expensive, while Gini uses squared
probabilities, making it faster.
- Information Gain is used to decide the best attribute for
splitting by reducing entropy or Gini impurity.
- Gini impurity tends to work slightly faster, making
it the default in many decision tree implementations.
LS0tDQp0aXRsZTogIkVudHJvcHkgYW5kIEdpbmkiDQpvdXRwdXQ6IGh0bWxfbm90ZWJvb2sNCi0tLQ0KDQoqKkVudHJvcHkgYW5kIEdpbmk6IEEgU3R1ZHkgR3VpZGUgd2l0aCBNYXRoZW1hdGljYWwgYW5kIENvZGluZyBSZXByZXNlbnRhdGlvbioqDQoNCiMjICoqMS4gSW50cm9kdWN0aW9uIHRvIEVudHJvcHkgYW5kIEdpbmkqKg0KDQojIyMgKipEZWZpbml0aW9uIG9mIEVudHJvcHkqKg0KRW50cm9weSBpcyBhIG1lYXN1cmUgb2YgZGlzb3JkZXIgb3IgdW5jZXJ0YWludHkgaW4gYSBzeXN0ZW0uIEluIHRoZSBjb250ZXh0IG9mIGluZm9ybWF0aW9uIHRoZW9yeSwgZW50cm9weSBxdWFudGlmaWVzIHRoZSB1bnByZWRpY3RhYmlsaXR5IG9mIGEgcmFuZG9tIHZhcmlhYmxlLiBUaGUgaGlnaGVyIHRoZSBlbnRyb3B5LCB0aGUgbW9yZSBkaXNvcmRlcmx5IHRoZSBzeXN0ZW0sIHdoaWxlIGxvd2VyIGVudHJvcHkgc2lnbmlmaWVzIG1vcmUgcHJlZGljdGFiaWxpdHkgYW5kIG9yZGVyLg0KDQpFbnRyb3B5IHBsYXlzIGEgY3J1Y2lhbCByb2xlIGluIGZpZWxkcyBzdWNoIGFzIG1hY2hpbmUgbGVhcm5pbmcsIHBoeXNpY3MsIGFuZCB0aGVybW9keW5hbWljcy4gSW4gZGVjaXNpb24gdHJlZXMsIGVudHJvcHkgaGVscHMgaW4gZGV0ZXJtaW5pbmcgdGhlIGJlc3QgYXR0cmlidXRlIGZvciBzcGxpdHRpbmcgZGF0YS4NCg0KIyMjICoqRGVmaW5pdGlvbiBvZiBHaW5pIEltcHVyaXR5KioNCkdpbmkgaW1wdXJpdHkgaXMgYW5vdGhlciBtZXRyaWMgdXNlZCBpbiBkZWNpc2lvbiB0cmVlcyB0byBtZWFzdXJlIGhvdyBvZnRlbiBhIHJhbmRvbWx5IGNob3NlbiBlbGVtZW50IHdvdWxkIGJlIGluY29ycmVjdGx5IGNsYXNzaWZpZWQuIEl0IHJlcHJlc2VudHMgdGhlIHByb2JhYmlsaXR5IG9mIG1pc2NsYXNzaWZpY2F0aW9uIG9mIGEgcmFuZG9tbHkgY2hvc2VuIHNhbXBsZS4NCg0KVW5saWtlIGVudHJvcHksIHdoaWNoIHVzZXMgbG9nYXJpdGhtaWMgY2FsY3VsYXRpb25zLCBHaW5pIGltcHVyaXR5IHVzZXMgc3F1YXJlZCBwcm9iYWJpbGl0aWVzLCBtYWtpbmcgaXQgY29tcHV0YXRpb25hbGx5IGZhc3RlciBhbmQgb2Z0ZW4gcHJlZmVycmVkIGluIGxhcmdlIGRhdGFzZXRzLg0KDQotLS0NCg0KIyMgKioyLiBNYXRoZW1hdGljYWwgUmVwcmVzZW50YXRpb24gb2YgRW50cm9weSBhbmQgR2luaSBJbXB1cml0eSoqDQoNCiMjIyAqKkVudHJvcHkqKg0KRm9yIGEgZGlzY3JldGUgcmFuZG9tIHZhcmlhYmxlIFwoIFggXCkgd2l0aCBwb3NzaWJsZSBvdXRjb21lcyBcKCB4XzEsIHhfMiwgLi4uLCB4X24gXCkgYW5kIGNvcnJlc3BvbmRpbmcgcHJvYmFiaWxpdGllcyBcKCBwXzEsIHBfMiwgLi4uLCBwX24gXCksIGVudHJvcHkgaXMgZGVmaW5lZCBhczoNCg0KXFsNCkgoWCkgPSAtIFxzdW1fe2k9MX1ee259IHBfaSBcbG9nXzIgcF9pDQpcXQ0KDQp3aGVyZToNCi0gXCggSChYKSBcKSByZXByZXNlbnRzIHRoZSBlbnRyb3B5IG9mIHRoZSB2YXJpYWJsZSBcKCBYIFwpDQotIFwoIHBfaSBcKSBpcyB0aGUgcHJvYmFiaWxpdHkgb2Ygb2NjdXJyZW5jZSBvZiBvdXRjb21lIFwoIHhfaSBcKQ0KLSBUaGUgYmFzZSBvZiB0aGUgbG9nYXJpdGhtIGlzIHR5cGljYWxseSAyLCBtZWFzdXJpbmcgZW50cm9weSBpbiBiaXRzDQoNCiMjIyAqKkdpbmkgSW1wdXJpdHkqKg0KRm9yIGEgY2xhc3NpZmljYXRpb24gcHJvYmxlbSB3aXRoIFwoIGsgXCkgY2xhc3NlcyBhbmQgcHJvYmFiaWxpdGllcyBcKCBwXzEsIHBfMiwgLi4uLCBwX2sgXCksIEdpbmkgaW1wdXJpdHkgXCggRyhYKSBcKSBpcyBjYWxjdWxhdGVkIGFzOg0KDQpcWw0KRyhYKSA9IDEgLSBcc3VtX3tpPTF9XntrfSBwX2leMg0KXF0NCg0Kd2hlcmU6DQotIFwoIEcoWCkgXCkgcmVwcmVzZW50cyB0aGUgaW1wdXJpdHkgb2YgdGhlIHZhcmlhYmxlIFwoIFggXCkNCi0gXCggcF9pIFwpIGlzIHRoZSBwcm9iYWJpbGl0eSBvZiBhbiBpbnN0YW5jZSBiZWxvbmdpbmcgdG8gY2xhc3MgXCggaSBcKQ0KDQojIyMgKipFeGFtcGxlOiBGYWlyIENvaW4gRmxpcCoqDQpJZiBcKCBYIFwpIHJlcHJlc2VudHMgdGhlIG91dGNvbWUgb2YgYSBmYWlyIGNvaW4gZmxpcDoNCi0gXCggcChcdGV4dHtoZWFkc30pID0gMC41IFwpDQotIFwoIHAoXHRleHR7dGFpbHN9KSA9IDAuNSBcKQ0KDQpUaGVuLA0KXFsNCkgoWCkgPSAtICgwLjUgXGxvZ18yIDAuNSArIDAuNSBcbG9nXzIgMC41KSA9IDEgXCwgXHRleHR7Yml0fQ0KXF0NClxbDQpHKFgpID0gMSAtICgwLjVeMiArIDAuNV4yKSA9IDAuNQ0KXF0NCg0KQSBiaWFzZWQgY29pbiB3aXRoIFwoIHAoXHRleHR7aGVhZHN9KSA9IDAuNyBcKSBhbmQgXCggcChcdGV4dHt0YWlsc30pID0gMC4zIFwpIGdpdmVzOg0KXFsNCkgoWCkgPSAtICgwLjcgXGxvZ18yIDAuNyArIDAuMyBcbG9nXzIgMC4zKSBcYXBwcm94IDAuODggXCwgXHRleHR7Yml0c30NClxdDQpcWw0KRyhYKSA9IDEgLSAoMC43XjIgKyAwLjNeMikgPSAwLjQyDQpcXQ0KDQpMb3dlciBHaW5pIGltcHVyaXR5IHZhbHVlcyBpbmRpY2F0ZSBwdXJlciBzdWJzZXRzLg0KDQotLS0NCg0KIyMgKiozLiBQeXRob24gSW1wbGVtZW50YXRpb24gb2YgRW50cm9weSBhbmQgR2luaSBJbXB1cml0eSoqDQoNCiMjIyAqKkVudHJvcHkgQ2FsY3VsYXRpb24gaW4gUHl0aG9uKioNCmBgYHB5dGhvbg0KaW1wb3J0IG51bXB5IGFzIG5wDQoNCmRlZiBlbnRyb3B5KHByb2JhYmlsaXRpZXMpOg0KICAgIHJldHVybiAtbnAuc3VtKHByb2JhYmlsaXRpZXMgKiBucC5sb2cyKHByb2JhYmlsaXRpZXMpKQ0KDQojIEV4YW1wbGU6IEZhaXIgY29pbiBmbGlwDQpwcm9ic19mYWlyID0gbnAuYXJyYXkoWzAuNSwgMC41XSkNCnByaW50KCJFbnRyb3B5IChGYWlyIENvaW4pOiIsIGVudHJvcHkocHJvYnNfZmFpcikpICAjIE91dHB1dDogMS4wDQoNCiMgRXhhbXBsZTogQmlhc2VkIGNvaW4NCnByb2JzX2JpYXNlZCA9IG5wLmFycmF5KFswLjcsIDAuM10pDQpwcmludCgiRW50cm9weSAoQmlhc2VkIENvaW4pOiIsIGVudHJvcHkocHJvYnNfYmlhc2VkKSkgICMgT3V0cHV0OiB+MC44OA0KYGBgDQoNCiMjIyAqKkdpbmkgSW1wdXJpdHkgQ2FsY3VsYXRpb24gaW4gUHl0aG9uKioNCmBgYHB5dGhvbg0KZGVmIGdpbmlfaW1wdXJpdHkocHJvYmFiaWxpdGllcyk6DQogICAgcmV0dXJuIDEgLSBucC5zdW0ocHJvYmFiaWxpdGllcyoqMikNCg0KIyBFeGFtcGxlOiBGYWlyIGNvaW4gZmxpcA0KcHJvYnNfZmFpciA9IG5wLmFycmF5KFswLjUsIDAuNV0pDQpwcmludCgiR2luaSBJbXB1cml0eSAoRmFpciBDb2luKToiLCBnaW5pX2ltcHVyaXR5KHByb2JzX2ZhaXIpKSAgIyBPdXRwdXQ6IDAuNQ0KDQojIEV4YW1wbGU6IEJpYXNlZCBjb2luDQpwcm9ic19iaWFzZWQgPSBucC5hcnJheShbMC43LCAwLjNdKQ0KcHJpbnQoIkdpbmkgSW1wdXJpdHkgKEJpYXNlZCBDb2luKToiLCBnaW5pX2ltcHVyaXR5KHByb2JzX2JpYXNlZCkpICAjIE91dHB1dDogfjAuNDINCmBgYA0KDQotLS0NCg0KIyMgKio0LiBFbnRyb3B5IGFuZCBHaW5pIGluIERlY2lzaW9uIFRyZWVzKioNCg0KIyMjICoqSW5mb3JtYXRpb24gR2FpbiBpbiBEZWNpc2lvbiBUcmVlcyoqDQpFbnRyb3B5IGFuZCBHaW5pIGltcHVyaXR5IGFyZSB1c2VkIHRvIG1lYXN1cmUgaG93IHdlbGwgYSBzcGxpdCBzZXBhcmF0ZXMgdGhlIGNsYXNzZXMuIFRoZSBnb2FsIGlzIHRvIG1pbmltaXplIGltcHVyaXR5IG9yIG1heGltaXplIGluZm9ybWF0aW9uIGdhaW4uDQoNClxbDQpJRyhTLCBBKSA9IEgoUykgLSBcc3VtX3t2IFxpbiBBfSBcZnJhY3t8U192fH17fFN8fSBIKFNfdikNClxdDQoNCndoZXJlOg0KLSBcKCBTIFwpIGlzIHRoZSBkYXRhc2V0DQotIFwoIEEgXCkgaXMgdGhlIGF0dHJpYnV0ZSB1c2VkIGZvciBzcGxpdHRpbmcNCi0gXCggU192IFwpIHJlcHJlc2VudHMgdGhlIHN1YnNldCBvZiBcKCBTIFwpIGFmdGVyIHRoZSBzcGxpdA0KDQojIyMgKipQeXRob24gSW1wbGVtZW50YXRpb24gaW4gRGVjaXNpb24gVHJlZXMqKg0KYGBgcHl0aG9uDQpmcm9tIHNrbGVhcm4udHJlZSBpbXBvcnQgRGVjaXNpb25UcmVlQ2xhc3NpZmllcg0KZnJvbSBza2xlYXJuLmRhdGFzZXRzIGltcG9ydCBsb2FkX2lyaXMNCmZyb20gc2tsZWFybi5tb2RlbF9zZWxlY3Rpb24gaW1wb3J0IHRyYWluX3Rlc3Rfc3BsaXQNCg0KIyBMb2FkIGRhdGFzZXQNCmlyaXMgPSBsb2FkX2lyaXMoKQ0KWF90cmFpbiwgWF90ZXN0LCB5X3RyYWluLCB5X3Rlc3QgPSB0cmFpbl90ZXN0X3NwbGl0KGlyaXMuZGF0YSwgaXJpcy50YXJnZXQsIHRlc3Rfc2l6ZT0wLjIsIHJhbmRvbV9zdGF0ZT00MikNCg0KIyBUcmFpbiBEZWNpc2lvbiBUcmVlIHVzaW5nIGVudHJvcHkNCmNsZl9lbnRyb3B5ID0gRGVjaXNpb25UcmVlQ2xhc3NpZmllcihjcml0ZXJpb249J2VudHJvcHknLCBtYXhfZGVwdGg9MykNCmNsZl9lbnRyb3B5LmZpdChYX3RyYWluLCB5X3RyYWluKQ0KcHJpbnQoIkRlY2lzaW9uIFRyZWUgQWNjdXJhY3kgKEVudHJvcHkpOiIsIGNsZl9lbnRyb3B5LnNjb3JlKFhfdGVzdCwgeV90ZXN0KSkNCg0KIyBUcmFpbiBEZWNpc2lvbiBUcmVlIHVzaW5nIEdpbmkgSW1wdXJpdHkNCmNsZl9naW5pID0gRGVjaXNpb25UcmVlQ2xhc3NpZmllcihjcml0ZXJpb249J2dpbmknLCBtYXhfZGVwdGg9MykNCmNsZl9naW5pLmZpdChYX3RyYWluLCB5X3RyYWluKQ0KcHJpbnQoIkRlY2lzaW9uIFRyZWUgQWNjdXJhY3kgKEdpbmkpOiIsIGNsZl9naW5pLnNjb3JlKFhfdGVzdCwgeV90ZXN0KSkNCmBgYA0KDQotLS0NCg0KIyMgKio1LiBLZXkgVGFrZWF3YXlzKioNCjEuICoqRW50cm9weSBhbmQgR2luaSBJbXB1cml0eSoqIGJvdGggbWVhc3VyZSBkaXNvcmRlciBpbiBhIGRhdGFzZXQgYW5kIGhlbHAgZ3VpZGUgZGVjaXNpb24gdHJlZSBzcGxpdHMuDQoyLiAqKkVudHJvcHkgcmVxdWlyZXMgbG9nYXJpdGhtcyoqLCBtYWtpbmcgaXQgY29tcHV0YXRpb25hbGx5IGV4cGVuc2l2ZSwgd2hpbGUgKipHaW5pIHVzZXMgc3F1YXJlZCBwcm9iYWJpbGl0aWVzKiosIG1ha2luZyBpdCBmYXN0ZXIuDQozLiAqKkluZm9ybWF0aW9uIEdhaW4gaXMgdXNlZCB0byBkZWNpZGUgdGhlIGJlc3QgYXR0cmlidXRlIGZvciBzcGxpdHRpbmcqKiBieSByZWR1Y2luZyBlbnRyb3B5IG9yIEdpbmkgaW1wdXJpdHkuDQo0LiAqKkdpbmkgaW1wdXJpdHkgdGVuZHMgdG8gd29yayBzbGlnaHRseSBmYXN0ZXIqKiwgbWFraW5nIGl0IHRoZSBkZWZhdWx0IGluIG1hbnkgZGVjaXNpb24gdHJlZSBpbXBsZW1lbnRhdGlvbnMuDQoNCg==