Entropy: A Study Guide with Mathematical and Coding
Representation
1. Introduction to Entropy
Definition
Entropy is a measure of disorder or uncertainty in a system. In the
context of information theory, entropy quantifies the unpredictability
of a random variable. The higher the entropy, the more disorderly the
system, while lower entropy signifies more predictability and order.
Entropy plays a central role in fields such as information theory,
machine learning, and thermodynamics. In decision trees, entropy is
used to choose the best attribute for splitting the data.
2. Mathematical Representation of Entropy
For a discrete random variable \( X \) with possible outcomes
\( x_1, x_2, \dots, x_n \) and corresponding probabilities
\( p_1, p_2, \dots, p_n \), entropy is defined as:
\[
H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i
\]
where:
- \( H(X) \) is the entropy of the variable \( X \)
- \( p_i \) is the probability of outcome \( x_i \)
- The base of the logarithm is typically 2, which measures entropy in bits
Example: Fair Coin Flip
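If \( X \) represents the outcome of a fair coin flip:
- \( p(\text{heads}) = 0.5 \)
- \( p(\text{tails}) = 0.5 \)
Then:
\[
H(X) = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = -(0.5 \times (-1) + 0.5 \times (-1)) = 1 \text{ bit}
\]
If the coin were biased, say \( p(\text{heads}) = 0.7 \) and
\( p(\text{tails}) = 0.3 \), then the entropy would be:
\[
H(X) = -(0.7 \log_2 0.7 + 0.3 \log_2 0.3) \approx 0.88 \text{ bits}
\]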
This confirms that less randomness (a biased coin) results in lower
entropy.
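To see how entropy changes as the coin becomes more or less biased, the short sketch below sweeps the probability of heads and prints the entropy for each value. The helper name `binary_entropy` and the chosen probabilities are illustrative assumptions, not part of the original guide.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a coin that lands heads with probability p."""
    # By convention 0 * log2(0) is treated as 0, so a certain outcome has zero entropy.
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# Sweep the bias from a certain outcome (p = 0) through a fair coin (p = 0.5) to p = 1.
for p in [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]:
    print(f"p(heads) = {p:.1f} -> H = {binary_entropy(p):.3f} bits")
```

The values rise towards 1 bit at p = 0.5 and fall symmetrically on either side, which matches the fair and biased coin results computed by hand.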
3. Python Implementation of Entropy
We can implement entropy calculation using Python:
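import numpy as np

def entropy(probabilities):
    """Shannon entropy in bits of a discrete probability distribution."""
    p = np.asarray(probabilities, dtype=float)
    p = p[p > 0]  # drop zero-probability outcomes: 0 * log2(0) is taken as 0
    return -np.sum(p * np.log2(p))

# Example: fair coin flip
probs_fair = np.array([0.5, 0.5])
print("Entropy (Fair Coin):", entropy(probs_fair))  # 1.0

# Example: biased coin
probs_biased = np.array([0.7, 0.3])
print("Entropy (Biased Coin):", entropy(probs_biased))  # approximately 0.88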
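The function above expects probabilities, but in practice we often start from raw observations. As a minimal sketch, the probabilities can be estimated from relative frequencies first. The helper name `entropy_from_labels` and the sample data are hypothetical, and the sketch uses only the standard library.

```python
import math
from collections import Counter

def entropy_from_labels(labels):
    """Estimate entropy in bits from a sequence of observed outcomes."""
    counts = Counter(labels)
    total = len(labels)
    # Only outcomes that actually occur appear in `counts`, so log2(0) never arises.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical sample: 6 observations of "A" and 4 of "B".
sample = ["A"] * 6 + ["B"] * 4
print(f"Estimated entropy: {entropy_from_labels(sample):.3f} bits")
```

Working from counts like this is the form of the calculation used when evaluating decision tree splits, where only class counts per group are known.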
4. Entropy in Decision Trees
Example: Decision Tree Split Calculation
Given a dataset with 10 samples:
- 6 belong to Class A
- 4 belong to Class B
Initial entropy:
\[
H(S) = -\left( \frac{6}{10} \log_2 \frac{6}{10} + \frac{4}{10} \log_2 \frac{4}{10} \right) \approx 0.97
\]
Now, assume we split based on a feature into two groups:
- Group 1: 4 samples (3 in Class A, 1 in Class B) → Entropy = 0.81
- Group 2: 6 samples (3 in Class A, 3 in Class B) → Entropy = 1.00
Weighted entropy after the split:
\[
H(S, A) = \frac{4}{10} \times 0.81 + \frac{6}{10} \times 1.00 = 0.924
\]
Information gain is the reduction in entropy produced by the split:
\[
IG(S, A) = H(S) - H(S, A) = 0.97 - 0.924 = 0.046
\]
The attribute that yields the highest information gain is chosen for
splitting.
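The calculation above can be sketched in a few lines of Python. The first candidate split uses the class counts from the worked example. The second is a hypothetical alternative, invented here only to show how two candidates would be compared. The function names `entropy` and `information_gain` are assumptions made for this illustration.

```python
import math

def entropy(counts):
    """Entropy in bits of a class-count list such as [6, 4]."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = [6, 4]  # 6 in Class A, 4 in Class B

# Candidate splits, each a list of [Class A, Class B] counts per group.
candidates = {
    "feature from the example": [[3, 1], [3, 3]],
    "hypothetical alternative": [[5, 1], [1, 3]],  # invented for comparison
}

gains = {name: information_gain(parent, groups) for name, groups in candidates.items()}
for name, gain in gains.items():
    print(f"{name}: IG = {gain:.3f}")

# The split with the highest information gain is the one that would be chosen.
best = max(gains, key=gains.get)
print("Chosen split:", best)
```

For the split from the example this gives an information gain of about 0.046, matching the hand calculation. The hypothetical alternative separates the classes more cleanly, so its gain is higher and it would be preferred.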
Python Implementation in Decision Trees
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load dataset and hold out 20% of the samples for testing
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

# Train a decision tree that chooses splits by entropy (information gain)
clf = DecisionTreeClassifier(criterion='entropy', max_depth=3)
clf.fit(X_train, y_train)

# Report mean accuracy on the held-out test set
print("Decision Tree Accuracy:", clf.score(X_test, y_test))
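To make concrete what an entropy criterion asks a tree to do at each node, here is a simplified from-scratch sketch of a threshold search on one numeric feature. It illustrates the idea only and is not scikit-learn's actual implementation. The function `best_threshold` and the toy data are made up for this example.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, gain) maximising information gain for `value <= threshold`."""
    parent = entropy(labels)
    best = (None, 0.0)
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = parent - weighted
        if gain > best[1]:
            best = (t, gain)
    return best

# Made-up toy data: one numeric feature and two classes.
values = [1.0, 1.5, 2.0, 3.0, 3.5, 4.0]
labels = ["A", "A", "A", "B", "B", "B"]
threshold, gain = best_threshold(values, labels)
print(f"split at x <= {threshold}: IG = {gain:.3f}")
```

On this toy data the two classes are perfectly separable, so the best threshold removes all of the parent node's entropy. A full tree repeats this search over every feature and then recurses into each resulting group.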
5. Key Takeaways
- Entropy quantifies disorder in a system and is
essential in information theory and machine learning.
- Decision trees use entropy to determine the best attribute
for splitting data, maximizing information gain.
- Lower entropy implies higher predictability,
whereas higher entropy means more randomness.
By understanding and applying entropy, we can optimize machine
learning models, especially in classification problems.