Entropy is a fundamental concept in information theory (Shannon, 1948) that measures uncertainty or disorder. In ML, it's used in:

- Decision trees (ID3, C4.5)
- Cross-entropy loss (classification)
- Clustering & probabilistic models
Entropy quantifies how unpredictable a random variable is:

- Low entropy → predictable (e.g., a biased coin that almost always lands heads).
- High entropy → unpredictable (e.g., a fair 50-50 coin).
For a discrete random variable \(X\) with probabilities \(P(x_i)\):
\[ H(X) = - \sum_{i=1}^{n} P(x_i) \log_2 P(x_i) \]
Interpretation:

- \(H(X) = 0\) → no uncertainty.
- \(H(X)\) increases with unpredictability.
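For example, a fair coin with \(P(\text{heads}) = P(\text{tails}) = 0.5\) gives:

\[ H(X) = -\left(0.5 \log_2 0.5 + 0.5 \log_2 0.5\right) = 1 \text{ bit} \]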
```r
# Shannon entropy (in bits) of a vector of probabilities
entropy_calc <- function(probabilities) {
  p <- probabilities[probabilities > 0]  # convention: 0 * log2(0) = 0
  -sum(p * log2(p))
}
```
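A quick usage check (the specific probability vectors are just illustrative): the function reproduces the fair-coin result above and shows how bias lowers entropy.

```r
entropy_calc(c(0.5, 0.5))  # fair coin: 1 bit
entropy_calc(c(0.9, 0.1))  # biased coin: ~0.469 bits
```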
Decision trees use Information Gain (entropy reduction) to split data.
```r
# Example dataset: 5 Cats, 5 Dogs
p_cat <- 0.5
p_dog <- 0.5
entropy_before <- entropy_calc(c(p_cat, p_dog))  # 1 bit
```
Entropy before the split is 1 bit. After a perfect split (all Cats in one node, all Dogs in another), the entropy of each child node becomes 0, so the information gain is 1 bit; the sketch below walks through that calculation.
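A minimal sketch of the information-gain calculation; the helper name `entropy_after_split` and the node weights are illustrative, not part of the original example.

```r
# Weighted average entropy of the child nodes after a split
entropy_after_split <- function(node_probs, node_weights) {
  sum(node_weights * sapply(node_probs, entropy_calc))
}

# Perfect split: one pure Cat node, one pure Dog node, each holding half the data
entropy_after <- entropy_after_split(
  node_probs   = list(c(1), c(1)),  # each child contains a single class
  node_weights = c(0.5, 0.5)
)

info_gain <- entropy_before - entropy_after  # 1 - 0 = 1 bit
```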
Cross-entropy measures the difference between:

- True distribution \(P\)
- Predicted distribution \(Q\)

\[ H(P, Q) = - \sum_{x} P(x) \log Q(x) \]
```r
# Binary cross-entropy (natural log), averaged over observations
binary_ce <- function(y_true, y_pred) {
  -mean(y_true * log(y_pred) + (1 - y_true) * log(1 - y_pred))
}
```
```r
# True label = 1, predicted probability 0.9 (correct) vs. 0.1 (wrong)
loss_correct <- binary_ce(1, 0.9)
loss_wrong <- binary_ce(1, 0.1)
```
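The confident, correct prediction gives a small loss (\(-\log 0.9 \approx 0.105\)), while the confident, wrong prediction is penalized heavily (\(-\log 0.1 \approx 2.303\)).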
KL divergence measures how the true distribution \(P\) diverges from the approximating distribution \(Q\), i.e., the information lost when \(Q\) is used to approximate \(P\):
\[ D_{KL}(P \| Q) = \sum P(x) \log \frac{P(x)}{Q(x)} \]
It is used in GANs and VAEs; a small numerical sketch follows.
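A minimal sketch of the formula in R; the base-2 logarithm (chosen to match the entropy function above) and the example distributions are assumptions for illustration.

```r
# KL divergence D_KL(P || Q), using log base 2 so the result is in bits
kl_divergence <- function(p, q) {
  sum(p * log2(p / q))
}

p <- c(0.5, 0.5)  # true distribution: fair coin
q <- c(0.9, 0.1)  # approximating distribution: biased coin
kl_divergence(p, q)  # ~0.737 bits
kl_divergence(p, p)  # 0: identical distributions diverge by nothing
```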
Key Observations, for a binary outcome with probability \(P\):

- Maximum entropy (1 bit) at \(P = 0.5\) (fair coin).
- Zero entropy at \(P = 0\) or \(P = 1\) (deterministic).
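A quick numerical check of these observations using the `entropy_calc` function from above (the probability grid is just illustrative):

```r
# Bernoulli entropy across a range of probabilities
p_values <- c(0.01, 0.25, 0.5, 0.75, 0.99)
sapply(p_values, function(p) entropy_calc(c(p, 1 - p)))
# ~0.081 0.811 1.000 0.811 0.081  -> peaks at p = 0.5, vanishes near 0 and 1
```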
| Concept | Formula | Use Case |
|---|---|---|
| Entropy | \(H(X) = - \sum P \log P\) | Decision trees |
| Cross-Entropy | \(H(P,Q) = - \sum P \log Q\) | Classification loss |
| KL Divergence | \(D_{KL}(P \| Q)\) | GANs, VAEs |
Takeaway:
✅ Entropy measures uncertainty.
✅ Cross-entropy is the standard loss for classification.
✅ KL divergence compares two distributions.