Entropy is a fundamental concept in information theory (Shannon, 1948) that measures uncertainty or disorder. In ML, it's used in:

- Decision trees (ID3, C4.5)
- Cross-entropy loss (classification)
- Clustering & probabilistic models
Entropy quantifies how unpredictable a random variable is:

- Low entropy → predictable (e.g., a biased coin that almost always lands heads).
- High entropy → unpredictable (e.g., a fair 50-50 coin).
For a discrete random variable \(X\) with probabilities \(P(x_i)\):
\[ H(X) = - \sum_{i=1}^{n} P(x_i) \log_2 P(x_i) \]
Interpretation:

- \(H(X) = 0\) → no uncertainty.
- \(H(X)\) increases with unpredictability.
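For example, a fair coin with \(P(\text{heads}) = P(\text{tails}) = 0.5\) gives:

\[ H(X) = -\left(0.5 \log_2 0.5 + 0.5 \log_2 0.5\right) = 1 \text{ bit} \]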
```r
# Shannon entropy (in bits) of a vector of probabilities
entropy_calc <- function(probabilities) {
  p <- probabilities[probabilities > 0]  # convention: 0 * log2(0) = 0
  -sum(p * log2(p))
}
```
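A quick usage check (the specific probability vectors are just illustrative): the function reproduces the fair-coin result above and shows how bias lowers entropy.

```r
entropy_calc(c(0.5, 0.5))  # fair coin: 1 bit
entropy_calc(c(0.9, 0.1))  # biased coin: ~0.469 bits
```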
Decision trees use Information Gain (entropy reduction) to split data.
```r
# Example dataset: 5 Cats, 5 Dogs
p_cat <- 0.5
p_dog <- 0.5
entropy_before <- entropy_calc(c(p_cat, p_dog))  # 1 bit
```
Entropy before the split is 1 bit. After a perfect split (all Cats in one node, all Dogs in another), the entropy of each child node becomes 0, so the information gain is 1 bit; the sketch below walks through that calculation.
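A minimal sketch of the information-gain calculation; the helper name `entropy_after_split` and the node weights are illustrative, not part of the original example.

```r
# Weighted average entropy of the child nodes after a split
entropy_after_split <- function(node_probs, node_weights) {
  sum(node_weights * sapply(node_probs, entropy_calc))
}

# Perfect split: one pure Cat node, one pure Dog node, each holding half the data
entropy_after <- entropy_after_split(
  node_probs   = list(c(1), c(1)),  # each child contains a single class
  node_weights = c(0.5, 0.5)
)

info_gain <- entropy_before - entropy_after  # 1 - 0 = 1 bit
```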
Cross-entropy measures the difference between:

- True distribution \(P\)
- Predicted distribution \(Q\)

\[ H(P, Q) = - \sum_{x} P(x) \log Q(x) \]
```r
# Binary cross-entropy (natural log), averaged over observations
binary_ce <- function(y_true, y_pred) {
  -mean(y_true * log(y_pred) + (1 - y_true) * log(1 - y_pred))
}
```
```r
# True label = 1, predicted probability 0.9 (correct) vs. 0.1 (wrong)
loss_correct <- binary_ce(1, 0.9)
loss_wrong <- binary_ce(1, 0.1)
```
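The confident, correct prediction gives a small loss (\(-\log 0.9 \approx 0.105\)), while the confident, wrong prediction is penalized heavily (\(-\log 0.1 \approx 2.303\)).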
KL divergence measures how the true distribution \(P\) diverges from the approximating distribution \(Q\), i.e., the information lost when \(Q\) is used to approximate \(P\):
\[ D_{KL}(P \| Q) = \sum P(x) \log \frac{P(x)}{Q(x)} \]
It is used in GANs and VAEs; a small numerical sketch follows.
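A minimal sketch of the formula in R; the base-2 logarithm (chosen to match the entropy function above) and the example distributions are assumptions for illustration.

```r
# KL divergence D_KL(P || Q), using log base 2 so the result is in bits
kl_divergence <- function(p, q) {
  sum(p * log2(p / q))
}

p <- c(0.5, 0.5)  # true distribution: fair coin
q <- c(0.9, 0.1)  # approximating distribution: biased coin
kl_divergence(p, q)  # ~0.737 bits
kl_divergence(p, p)  # 0: identical distributions diverge by nothing
```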
Key Observations, for a binary outcome with probability \(P\):

- Maximum entropy (1 bit) at \(P = 0.5\) (fair coin).
- Zero entropy at \(P = 0\) or \(P = 1\) (deterministic).
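A quick numerical check of these observations using the `entropy_calc` function from above (the probability grid is just illustrative):

```r
# Bernoulli entropy across a range of probabilities
p_values <- c(0.01, 0.25, 0.5, 0.75, 0.99)
sapply(p_values, function(p) entropy_calc(c(p, 1 - p)))
# ~0.081 0.811 1.000 0.811 0.081  -> peaks at p = 0.5, vanishes near 0 and 1
```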
| Concept | Formula | Use Case |
|---|---|---|
| Entropy | \(H(X) = - \sum P \log P\) | Decision trees |
| Cross-Entropy | \(H(P,Q) = - \sum P \log Q\) | Classification loss |
| KL Divergence | \(D_{KL}(P \| Q)\) | GANs, VAEs |
Takeaway:
✅ Entropy measures uncertainty.
✅ Cross-entropy is the standard loss for classification.
✅ KL divergence compares two distributions.