
Predictive Analytics: Entropy

Rasim Muzaffer Musal

Understanding Entropy

  • “Capturing the intangible concept of Information” Soofi, E. S. (1994) Journal of the American Statistical Association, 89, 1243 - 1254.

  • “Measuring Informativeness of Data by Entropy and Variance” Ebrahimi, N., Maasoumi, E., Soofi, E.S. (1999). In: Slottje, D.J. (ed.) Advances in Econometrics, Income Distribution and Scientific Methodology. Physica-Verlag HD.

Entropy in Information Theory

  • An important concept in Information Theory is entropy. It was first used in Physics and is now an important part of Machine Learning / Statistics / Data Science.

  • It can be measured on discrete or continuous variables.

  • Shannon: Discussed quantity of information in message transmissions in telecommunications.

  • Kullback: Divergence between two probability distributions (the Kullback-Leibler divergence discussed later in these slides).

  • Lindley: Measure of the information provided by an experiment.

  • Jaynes: Maximum Entropy Principle

Entropy in Information Theory

  • Shannon 1948
    • Communication channel message input is X, output is Y.
    • How much information can be squeezed into the channel capacity (cable, card, etc.)?
    • Applied the entropy from statistical mechanics to quantify information in a message.
    • To understand the following arguments you need to know marginal, joint and conditional probabilities.

Entropy in Information Theory

  • Shannon: Brought the idea of Entropy to communications theory. The probability distribution of one letter/word following another letter/word has consequences for the encoding of message transmission from origin X to reception Y.

“Suppose we have a set of possible events whose probabilities of occurrence are \(p_1, p_2, \ldots, p_n\). These probabilities are known but that is all we know concerning which event will occur. Can we find a measure of how much “choice” is involved in the selection of the event or of how uncertain we are of the outcome?”

Entropy in Information Theory

  • If there is such a measure \(H(p_{1},\dots,p_{n})\), “we” expect:
    • “H(p) should be continuous.”
    • “If all the \(p_i\) are equal, \(p_i\) = \(\frac{1}{n}\) , then H should be a monotonic increasing function of n. With equally likely events there is more choice, or uncertainty, when there are more possible events”
    • “If a choice be broken down into two successive choices, the original H should be the weighted sum of the individual values of H.” \(H(\frac{1}{2},\frac{1}{3},\frac{1}{6})=H(\frac{1}{2},\frac{1}{2})+\frac{1}{2}H(\frac{2}{3},\frac{1}{3})\)
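
A quick numeric check of the third requirement, using the same natural-log convention as the R code later in these slides (the helper function H below is ours, added for illustration):

# Verify Shannon's decomposition example numerically.
H <- function(p) -sum(p * log(p))
H(c(1/2, 1/3, 1/6))                       # ~1.0114
H(c(1/2, 1/2)) + (1/2) * H(c(2/3, 1/3))   # ~1.0114, the same value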

Entropy in Information Theory

  • The only measure H satisfying these requirements has the following form, where K is a positive constant: \[\begin{align} H(p)=-K\sum_{i}p_{i}log(p_{i}) \end{align} \]

  • The entropy function H has the following properties:

    • H is 0 only if there is no uncertainty in the distribution p, i.e., all probability sits on a single outcome.
    • H is at its maximum when the probability distribution is uniform. Any change that makes p more uniform increases uncertainty (a quick numeric check follows this list).
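
A small illustration of both properties (the helper H is ours, the same one sketched earlier):

H <- function(p) -sum(p * log(p))
H(c(1/3, 1/3, 1/3))    # uniform over 3 outcomes: log(3) ~ 1.0986, the maximum
H(c(0.5, 0.3, 0.2))    # less uniform, smaller entropy (~1.0297)
H(c(0.98, 0.01, 0.01)) # almost no uncertainty, entropy close to 0 (~0.112)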

Entropy Types: Marginal and Joint.

  • There are three entropy terms we need to consider.
  • Marginal Entropy: the symbols \(p\), \(X\), \(Y\) are labels for the distribution in question.
    • \(H(p)\), \(H(X)\), \(H(Y)\)
  • Joint Entropy: measure of uncertainty on X and Y jointly.
    • \(H(X,Y) = -\sum_{i,j} p(x_{i},y_{j})\, log[p(x_{i},y_{j})]\)

Marginal and Joint Entropy Formulations

  • “Suppose there are two events, x and y, \(\ldots\) Let \(p(x_{i}, y_{j})\) be the probability of the joint occurrence of i for the first and j for the second.”

\[ \begin{align} H(x,y) &= - \sum_{i,j} p(x_{i},y_{j}) \times log[p(x_{i},y_{j})] \\ H(x) &= - \sum_{i,j} p(x_{i},y_{j}) \times log\left[\sum_{j} p(x_{i},y_{j})\right] \\ H(y) &= - \sum_{i,j} p(x_{i},y_{j}) \times log\left[\sum_{i} p(x_{i},y_{j})\right] \end{align} \]
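
In R, these three quantities can be computed from a joint probability table; a minimal sketch, using the same joint distribution that appears in the worked examples later in these slides:

# Joint distribution of x (rows) and y (columns).
p_xy <- matrix(c(0.10, 0.25, 0.20,
                 0.15, 0.18, 0.12), nrow = 2, byrow = TRUE)
H_xy <- -sum(p_xy * log(p_xy))                    # joint entropy H(x,y)
H_x  <- -sum(rowSums(p_xy) * log(rowSums(p_xy)))  # marginal entropy H(x)
H_y  <- -sum(colSums(p_xy) * log(colSums(p_xy)))  # marginal entropy H(y)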

Entropy in Information Theory

  • Before we can discuss conditional entropy:
  • Conditional probability can also be written as

\[ \begin{align} p(y|x)=\frac{p(x,y)}{\sum_{j}p(x,y_{j})} \end{align} \]

  • Note the denominator is marginalizing out the values of \(y_{j}\) and is simply \(p(x)\).
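
A one-line illustration in R, again using the joint table from the later slides (rows index x, columns index y):

p_xy <- matrix(c(0.10, 0.25, 0.20,
                 0.15, 0.18, 0.12), nrow = 2, byrow = TRUE)
# p(y | x = 1): first row of the joint table divided by its row sum p(x = 1).
p_xy[1, ] / sum(p_xy[1, ])   # 0.1818 0.4545 0.3636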

Entropy in Information Theory

  • “We define the conditional entropy of y,” \(H(Y|X)\) “as the average of the entropy of y for each value of x, weighted according to the probability of getting that particular x.”

  • The word “average” is key here: we are not talking about a particular value of X, but averaging over all possible values of X. In other words, for a specific value x it is possible that \(H(Y|X=x) > H(Y)\).

  • “This quantity measures how uncertain we are of y on the average when we know x”

Entropy in Information Theory

\[\begin{align} H(Y|X) &= E_{p(X)}\left[H(p(Y|X)) \right]=\sum_{i} p(X=x_{i})\,H(p(Y|X=x_{i}))\\ &=-\sum_{i}P(x_{i})\sum_{j}P(y_{j}|x_{i})log \left[ p(y_{j}|x_{i})\right]\\ &=-\sum_{i,j} p(x_{i},y_{j})log[p(y_{j}|x_{i})]=-\sum_{i,j}p(x_{i},y_{j}) log \frac{p(x_{i},y_{j})}{p(x_{i})}\\ &=-\sum_{i,j}p(x_{i},y_{j})log [p(x_{i},y_{j})] -\sum_{i}p(x_{i})log \frac{1}{p(x_{i})}\\ &=H(X,Y)-H(X) \end{align}\]

Entropy in Information Theory

\[ \begin{align} H(x,y) \le H(x) + H(y) \end{align} \]

  • “The uncertainty of a joint event is less than or equal to the sum of the individual uncertainties.” Equality occurs when x and y are independent.

  • “The uncertainty of y is never increased by knowledge of x. It will be decreased unless x and y are independent events, in which case it is not changed.”

Entropy in Information Theory

“The ratio of the entropy of a source to the maximum value it could have while still restricted to the same symbols will be called its relative entropy. This is the maximum compression possible when we encode into the same alphabet. One minus the relative entropy is the redundancy. The redundancy of ordinary English, not considering statistical structure over greater distances than about eight letters, is roughly 50%. This means that when we write English half of what we write is determined by the structure of the language and half is chosen freely.”

Entropy in Information Theory

  • If the random variable is continuous rather than discrete, the \(\sum\) operator changes to \(\int\) and a couple of properties change:

  • H(X) is not scale invariant: \(H(cX) = log|c| + H(X)\) (see the sketch after this list).

  • H(X) is translation invariant. \(H(c+X) = H(X)\)

  • \(-\infty < H(X) < +\infty\) if the density p(x) is bounded and \(\sigma^{2}_{x} < +\infty\)

  • More precisely, \(-\infty < H(X)\) holds when p(x) is bounded, and if \(\sigma^{2}_{x} < +\infty\) then \(H(X) < +\infty\)
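
A small sketch of the scale and translation properties, using the closed-form differential entropy of a normal distribution, \(0.5\, log(2\pi e \sigma^{2})\) (a standard result not derived on these slides):

# Differential entropy of a Normal(mu, sigma^2) distribution.
h_norm <- function(sigma) 0.5 * log(2 * pi * exp(1) * sigma^2)
a <- 3
h_norm(1)               # H(X) for X ~ N(0, 1), ~1.4189
h_norm(a * 1)           # H(aX): scaling by a multiplies sigma by |a|
h_norm(1) + log(abs(a)) # matches the line above: H(aX) = H(X) + log|a|
# Translation X + a leaves sigma unchanged, so H(a + X) = H(X).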

Entropy in Information Theory

  • H(X) is the entropy of X’s distribution; I(X;Y) is the mutual information between X and Y.
  • We will use these in quantifying how much information we can gain about \(\theta\) from X or,
  • How much information X and Y have on each other.

Mutual Information

  • I(X;Y), Mutual Information \(=\) expected information from X on Y (a helper for computing it from a joint table is sketched after this list). \[ \begin{aligned} & I(X;Y) = I(Y;X) = &\\ & \sum_{y} \sum_{x} P(X=x,Y=y)log\bigg[\frac{P(X=x,Y=y)}{P(X=x)P(Y=y)}\bigg] & \end{aligned} \]
  • More formally: “The expected information in a random draw from X, about an outcome from a random draw Y, is given by mutual information.”
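
A compact way to compute I(X;Y) from a joint probability matrix in R; the mutual_info helper below is our sketch, not a function from the references:

# Mutual information from a joint probability matrix (rows = x, columns = y).
mutual_info <- function(p_xy) {
  p_x <- rowSums(p_xy)
  p_y <- colSums(p_xy)
  indep <- outer(p_x, p_y)            # p(x) * p(y), the independence benchmark
  sum(p_xy * log(p_xy / indep))
}
# e.g. mutual_info(p_xy) with the joint table used in the later examples.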

Uses of Entropy

  • In expectation, observing X never increases your uncertainty about Y.

  • However, once we obtain a particular value of X, our information on Y could in fact decrease compared to the initial state of knowledge on Y.

  • Examples of Entropy applications:

    • Is this model more or less informative for Y?
    • Is it worth eliciting additional data for X to learn about Y?
    • Should we use \(X_{i}\) or \(X_{j}\) to learn about Y?

Entropy Calculations H(Y)

  • Assume Y takes the two values Y=0 and Y=1 with equal probability: P(Y=0)=0.5, P(Y=1)=0.5.
  • Calculate: \(H(Y)= - \sum_{i} P(y_i)\times log P(y_i)\)
-1*(0.5*log(0.5)+0.5*log(0.5))
[1] 0.6931472
  • What if P(Y=0)=1.0 and P(Y=1)=0.0? We have to define \(0 \times log(0) = 0\)
-1*(1.0*log(1)+0)
[1] 0

Entropy Calculations H(Y)

  • We are still assuming two outcomes:
#P_Y creates a vector from 0 to 1 with increments of 0.01.
#P_Y has 101 elements.
P_Y=seq(0,1,.01)
H_Y=-1*(P_Y*log(P_Y)+(1-P_Y)*log(1-(P_Y)))
#Why do we have to manually calculate H_Y[1] and H_Y[101]?
#0*log(0) has to be defined as 0 or recalculated as we do.
H_Y[1]=-1*(0+(1-P_Y[1])*log(1-(P_Y[1])))
H_Y[101]=-1*(P_Y[101]*log(P_Y[101])+0)
which.max(H_Y)
[1] 51
P_Y[which.max(H_Y)]
[1] 0.5
HP=cbind(P_Y,H_Y)
HP=as.data.frame(HP)

Entropy Plot H(Y)
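
The plot itself is not reproduced here; a minimal sketch of how it could be drawn from the HP data frame built above, assuming base R graphics:

# Entropy of a two-outcome variable as a function of P_Y; the curve peaks at 0.5.
plot(HP$P_Y, HP$H_Y, type = "l", xlab = "P_Y", ylab = "H_Y")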

Kullback Leibler Divergence

  • For discrete distributions \[ \begin{aligned} D_{KL}(p(X) \| q(X)) = \sum_{k=1}^{k=K}p(x_{k}) log \frac{p(x_{k})}{q(x_{k})} \end{aligned} \]

  • Where \(p\) and \(q\) are probability distributions defined over the same set of outcomes.

Kullback Leibler Divergence

     X=1    X=2    X=3
p    0.4    0.25   0.35
q    0.1    0.5    0.4

\[ \begin{aligned} 0.4 \times log \left(\frac{0.4}{0.1} \right)+0.25 \times log \left(\frac{0.25}{0.5} \right)+0.35 \times log \left(\frac{0.35}{0.4} \right) \end{aligned} \]

 0.4 *log(0.4/0.1)+0.25 * log(0.25/0.5)+0.35* log(0.35/0.4)
[1] 0.334495
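
The same number can be obtained with a small reusable function; kl_div below is our sketch, not a function from the references:

# KL divergence D_KL(p || q) for discrete distributions on the same outcomes.
kl_div <- function(p, q) sum(p * log(p / q))
p <- c(0.4, 0.25, 0.35)
q <- c(0.1, 0.5, 0.4)
kl_div(p, q)   # 0.334495, matching the hand calculation above
kl_div(q, p)   # different value: KL divergence is not symmetric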

Conditional H(Y|X)

  • Recall \(H(Y|X)\) is \(H(X,Y)-H(X)\)
       Y=1    Y=2    Y=3    p(x)
X=1    0.10   0.25   0.20   0.55
X=2    0.15   0.18   0.12   0.45
p(y)   0.25   0.43   0.32
# Finding H(Y,X)
H_Y_X=-1*(0.1*log(0.1)+0.25*log(0.25)+0.2*log(0.2)+
0.15*log(0.15)+0.18*log(0.18)+0.12*log(0.12))
# Finding H(X)
H_X=-1*(0.55*log(0.55)+0.45*log(0.45))
# Finding H(Y)
H_Y=-1*(0.25*log(0.25)+0.43*log(0.43)+0.32*log(0.32))

H(X), H(Y), H(X|Y), H(Y|X) Calc.

print(H_X)  
[1] 0.6881388
print(H_Y)
[1] 1.0741
H_Y_given_X = H_Y_X - H_X
print(H_Y_given_X)
[1] 1.058244
H_X_given_Y = H_Y_X - H_Y
print(H_X_given_Y)
[1] 0.6722833
  • H(Y|X) = 1.75 - 0.69 = 1.06 (checked directly from the conditional distributions below).
  • H(X|Y) = 1.75 - 1.07 ≈ 0.67
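
As mentioned in the bullet above, H(Y|X) can also be computed directly from its definition as the p(x)-weighted average of the conditional entropies; a small sketch using the same joint table:

p_xy <- matrix(c(0.10, 0.25, 0.20,
                 0.15, 0.18, 0.12), nrow = 2, byrow = TRUE)
p_x <- rowSums(p_xy)                                # 0.55, 0.45
p_y_given_x <- p_xy / p_x                           # each row divided by its row sum
H_rows <- -rowSums(p_y_given_x * log(p_y_given_x))  # H(Y | X = x) for each row
sum(p_x * H_rows)                                   # ~1.058244, same as H_Y_X - H_X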

Mutual Information

       Y=1    Y=2    Y=3    p(x)
X=1    0.10   0.25   0.20   0.55
X=2    0.15   0.18   0.12   0.45
p(y)   0.25   0.43   0.32

\[ \begin{aligned} & \sum_{y} \sum_{x} P(X=x,Y=y)log\bigg[\frac{P(X=x,Y=y)}{P(X=x)P(Y=y)}\bigg] & \end{aligned} \]

Mutual Information Computation

\[ 0.1 \times log \left[\frac{0.1}{0.55\times0.25} \right]+ 0.25 \times log \left[\frac{0.25}{0.55\times 0.43} \right]+\\ 0.2 \times log \left[\frac{0.2}{0.55\times 0.32} \right]+ 0.15 \times log \left[\frac{0.15}{0.45\times 0.25} \right]+\\ 0.18 \times log \left[\frac{0.18}{0.45\times 0.43} \right]+ 0.12 \times log \left[\frac{0.12}{0.45\times 0.32} \right] \]

MI=0.1 * log(0.1/(0.55*0.25))+0.25 * log(0.25/(0.55* 0.43))+
0.2 * log(0.2/(0.55* 0.32))+0.15 * log(0.15/(0.45* 0.25))+
0.18 * log (0.18/(0.45* 0.43))+0.12 *log(0.12/(0.45* 0.32))
print(MI)
[1] 0.01585548
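
As a consistency check against the identities on the next slide, the same value equals H(Y) - H(Y|X) and H(X) - H(X|Y), using the quantities computed earlier:

H_Y - H_Y_given_X   # ~0.01585548, matches MI up to rounding
H_X - H_X_given_Y   # same value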

Normalized Mutual Information (NMI)

\[\begin{align} & I(X;Y) = H(X) - H(X|Y) \le H(X)\\ & I(X;Y) = H(Y) - H(Y|X) \le H(Y)\\ & 0 \le I(X;Y) \le min(H(X),H(Y))\\ & NMI(X,Y) = \frac{I(X;Y)}{min(H(X),H(Y))}\le 1 \end{align} \]
NMI=MI/min(H_Y,H_X)
# Weak relationship
print(NMI)
[1] 0.02304111