Soofi, E. S. (1994). "Capturing the Intangible Concept of Information." Journal of the American Statistical Association, 89, 1243-1254.
Ebrahimi, N., Maasoumi, E., Soofi, E. S. (1999). "Measuring Informativeness of Data by Entropy and Variance." In: Slottje, D. J. (ed.), Advances in Econometrics, Income Distribution and Scientific Methodology. Physica-Verlag HD.
Entropy is an important concept in Information Theory. First used in Physics, it is now an important part of Machine Learning, Statistics, and Data Science. It can be measured on discrete or continuous variables.
- Shannon: quantified the amount of information in messages transmitted over telecommunication channels.
- Kullback: relative entropy (the Kullback-Leibler divergence) for discriminating between probability distributions.
- Lindley: the expected information provided by an experiment, from a Bayesian viewpoint.
- Jaynes: the Maximum Entropy Principle.
“Suppose we have a set of possible events whose probabilities of occurrence are \(p_1, p_2, \ldots, p_n\). These probabilities are known but that is all we know concerning which event will occur. Can we find a measure of how much “choice” is involved in the selection of the event or of how uncertain we are of the outcome?”
The following equation, where \(K\) is a positive constant, defines the entropy: \[\begin{align} H(p)=-K\sum_{i}p_{i}\,log(p_{i}) \end{align} \]
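As a quick sketch (not from the original text), the formula maps directly onto a few lines of R; the function name `entropy` and the choice \(K=1\) with natural logarithms (entropy in nats) are assumptions of this example:

```r
# Shannon entropy H(p) = -K * sum(p_i * log(p_i)) of a discrete distribution.
# K = 1 with natural logs gives nats; log base 2 would give bits.
entropy <- function(p, K = 1) {
  p <- p[p > 0]          # drop zero-probability events: 0 * log(0) is taken as 0
  -K * sum(p * log(p))
}

entropy(c(0.5, 0.5))     # fair coin: log(2), about 0.693 nats
entropy(c(0.9, 0.1))     # biased coin: lower entropy, about 0.325 nats
```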
The entropy function \(H\) extends to a pair of random variables \((X,Y)\) with joint probabilities \(p(x_{i},y_{j})\); the joint and marginal entropies are:
\[ \begin{align} H(x,y) &= - \sum_{i,j} p(x_{i},y_{j}) \, log\left[p(x_{i},y_{j})\right] \\ H(x) &= - \sum_{i,j} p(x_{i},y_{j}) \, log\Big[\sum_{j} p(x_{i},y_{j})\Big] \\ H(y) &= - \sum_{i,j} p(x_{i},y_{j}) \, log\Big[\sum_{i} p(x_{i},y_{j})\Big] \end{align} \]
\[ \begin{align} p(y|x)=\frac{p(x,y)}{\sum_{j}p(x,y_{j})} \end{align} \] - Note the denominator is marginalizing out the values of \(y_{j}\) and is simply \(p(x)\).
“We define the conditional entropy of y,” \(H(Y|X)\) “as the average of the entropy of y for each value of x, weighted according to the probability of getting that particular x.”
The word “average” is key here: we are not referring to a particular value of X but to all possible values of X. In other words, it is possible that \(H(Y|X=x) > H(Y)\) for some particular value \(x\), even though the average \(H(Y|X)\) can never exceed \(H(Y)\).
“This quantity measures how uncertain we are of y on the average when we know x”
\[\begin{align} H(Y|X) &= E_{p(X)}\left[H(p(Y|X)) \right]=\sum_{i} p(x_{i})\,H(Y|X=x_{i})\\ &=-\sum_{i}p(x_{i})\sum_{j}p(y_{j}|x_{i})\,log \left[ p(y_{j}|x_{i})\right] \\ &=-\sum_{i,j} p(x_{i},y_{j})\,log[p(y_{j}|x_{i})]=-\sum_{i,j}p(x_{i},y_{j})\, log \frac{p(x_{i},y_{j})}{p(x_{i})}\\ &=-\sum_{i,j}p(x_{i},y_{j})\,log [p(x_{i},y_{j})] -\sum_{i}p(x_{i})\,log \frac{1}{p(x_{i})}\\ &=H(X,Y)-H(X) \end{align}\]
\[ \begin{align} H(x,y) \le H(x) + H(y) \end{align} \] - “The uncertainty of a joint event is less than or equal to the sum of the individual uncertainties.” Equality occurs when x and y are independent.
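These relations are easy to check numerically. Below is a minimal R sketch using a small made-up 2x2 joint distribution (the matrix `P_XY` and all object names are assumptions of this example, not values from the text):

```r
# A small made-up joint distribution P(X = i, Y = j); rows are x, columns are y.
P_XY <- matrix(c(0.30, 0.10,
                 0.20, 0.40), nrow = 2, byrow = TRUE)

P_X <- rowSums(P_XY)   # marginal p(x)
P_Y <- colSums(P_XY)   # marginal p(y)

H <- function(p) -sum(p * log(p))   # entropy of a set of probabilities (> 0)

H_XY <- H(P_XY)                     # joint entropy H(X,Y)
H_X  <- H(P_X)
H_Y  <- H(P_Y)

# Conditional entropy via the chain rule, H(Y|X) = H(X,Y) - H(X) ...
H_Y_given_X <- H_XY - H_X
# ... agrees with the weighted-average definition sum_x p(x) H(Y | X = x).
# (P_XY / P_X divides each row i of the joint table by p(x_i).)
H_Y_given_X_direct <- -sum(P_XY * log(P_XY / P_X))
c(H_Y_given_X, H_Y_given_X_direct)

# Subadditivity: H(X,Y) <= H(X) + H(Y), with equality only under independence.
H_XY <= H_X + H_Y
```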
“The ratio of the entropy of a source to the maximum value it could have while still restricted to the same symbols will be called its relative entropy. This is the maximum compression possible when we encode into the same alphabet. One minus the relative entropy is the redundancy. The redundancy of ordinary English, not considering statistical structure over greater distances than about eight letters, is roughly 50%. This means that when we write English half of what we write is determined by the structure of the language and half is chosen freely.”
If the random variable is continuous rather than discrete, the \(\sum\) operator changes to \(\int\) (giving the so-called differential entropy) and a couple of properties change (see the short sketch after the list below):
- H(X) is not scale invariant: \(H(cX) = H(X) + log|c|\).
- H(X) is translation invariant: \(H(c+X) = H(X)\).
- \(-\infty < H(X) < +\infty\) when the probability density \(p(x)\) is bounded and \(\sigma^{2}_{x} < +\infty\): boundedness of \(p(x)\) rules out \(H(X) = -\infty\), and a finite variance rules out \(H(X) = +\infty\).
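A short sketch of the continuous case, using the closed-form differential entropy of a Normal distribution, \(H = \frac{1}{2}log(2\pi e \sigma^{2})\); the helper `h_normal` and the scale factor below are my own choices for illustration:

```r
# Differential entropy of a Normal(mu, sd) in nats: 0.5 * log(2 * pi * e * sd^2).
h_normal <- function(sd) 0.5 * log(2 * pi * exp(1) * sd^2)

c_scale <- 3
h_normal(1)                        # H(X) for X ~ Normal(0, 1)
h_normal(c_scale * 1)              # H(cX) = H(X) + log(|c|): not scale invariant
h_normal(1) + log(abs(c_scale))    # matches the line above
# Translation invariance: shifting the mean leaves sd, and hence H, unchanged.
```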
- On average, observing X never increases our uncertainty about Y: \(H(Y|X) \le H(Y)\).
- For a particular observed value \(X=x\), however, our uncertainty about Y can in fact increase relative to our initial state of knowledge of Y, i.e. \(H(Y|X=x) > H(Y)\) is possible, as the sketch below illustrates.
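A small made-up example of this effect (all numbers below are assumptions chosen for illustration): X=2 is rare, and when it occurs Y becomes a fair coin flip, so observing X=2 increases our uncertainty about Y even though conditioning on X helps on average:

```r
H <- function(p) -sum(p * log(p))   # entropy of a probability vector (entries > 0)

# Made-up joint distribution: rows are X = 1, 2; columns are Y = 1, 2.
# X = 1 is common and makes Y almost certain; X = 2 is rare and leaves Y 50/50.
P_XY <- matrix(c(0.9 * 0.99, 0.9 * 0.01,
                 0.1 * 0.50, 0.1 * 0.50), nrow = 2, byrow = TRUE)

P_X <- rowSums(P_XY)
P_Y <- colSums(P_XY)

H(P_Y)                               # H(Y)     ~ 0.224
H(P_XY[2, ] / P_X[2])                # H(Y|X=2) = log(2) ~ 0.693 > H(Y)
sum(P_X * c(H(P_XY[1, ] / P_X[1]),
            H(P_XY[2, ] / P_X[2])))  # H(Y|X)   ~ 0.120 <= H(Y), as guaranteed
```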
Examples of Entropy applications:
#P_Y is a vector from 0 to 1 in increments of 0.01 (101 elements):
#the candidate success probabilities of a Bernoulli random variable.
P_Y=seq(0,1,.01)
#Entropy of a Bernoulli(p) variable: H = -[p*log(p) + (1-p)*log(1-p)]
H_Y=-1*(P_Y*log(P_Y)+(1-P_Y)*log(1-P_Y))
#Why do we have to manually recalculate H_Y[1] and H_Y[101]?
#Because 0*log(0) evaluates to NaN in R, so it has to be defined as 0 (its limit), as we do here.
H_Y[1]=-1*(0+(1-P_Y[1])*log(1-P_Y[1]))
H_Y[101]=-1*(P_Y[101]*log(P_Y[101])+0)
which.max(H_Y)
[1] 51
P_Y[which.max(H_Y)]
[1] 0.5
A Bernoulli variable is therefore most uncertain when \(p=0.5\), where its entropy reaches its maximum of \(log(2) \approx 0.6931472\).
For discrete distributions \[ \begin{aligned} D_{KL}(p(X) \| q(X)) = \sum_{k=1}^{k=K}p(x_{k}) log \frac{p(x_{k})}{q(x_{k})} \end{aligned} \]
Where \(p\) and \(q\) are two distributions defined over the same support, i.e. the same set of possible values of \(X\).
|   | X=1 | X=2 | X=3 |
|---|---|---|---|
| p | 0.4 | 0.25 | 0.35 |
| q | 0.1 | 0.5 | 0.4 |
\[ \begin{aligned} 0.4 \times log \left(\frac{0.4}{0.1} \right)+0.25 \times log \left(\frac{0.25}{0.5} \right)+0.35 \times log \left(\frac{0.35}{0.4} \right) \end{aligned} \]
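A minimal R sketch of this calculation (the vectors `p` and `q` are simply the rows of the table above; the helper `KL` is my own naming):

```r
# Distributions over the same support {1, 2, 3}, taken from the table above.
p <- c(0.4, 0.25, 0.35)
q <- c(0.1, 0.5, 0.4)

# Kullback-Leibler divergence D_KL(p || q) = sum p * log(p / q), in nats.
KL <- function(p, q) sum(p * log(p / q))

KL(p, q)   # approximately 0.3345
KL(q, p)   # approximately 0.2614 -- the divergence is not symmetric
```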
|   | Y=1 | Y=2 | Y=3 | p(X) |
|---|---|---|---|---|
| X=1 | 0.1 | 0.25 | 0.20 | 0.55 |
| X=2 | 0.15 | 0.18 | 0.12 | 0.45 |
| p(Y) | 0.25 | 0.43 | 0.32 | |
From this joint distribution we can compute the marginal entropies and the conditional entropy (using natural logarithms).

H(X), the entropy of the marginal distribution of X:
[1] 0.6881388

H(Y), the entropy of the marginal distribution of Y:
[1] 1.0741

H(Y|X), which takes the same value whether computed as \(H(X,Y)-H(X)\) or as the weighted average \(\sum_{i}p(x_{i})H(Y|X=x_{i})\):
[1] 1.058244
[1] 1.058244
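A minimal R sketch that reproduces these numbers from the joint table (object names are my own):

```r
# Joint distribution from the table: rows are X = 1, 2; columns are Y = 1, 2, 3.
P_XY <- matrix(c(0.10, 0.25, 0.20,
                 0.15, 0.18, 0.12), nrow = 2, byrow = TRUE)

P_X <- rowSums(P_XY)            # (0.55, 0.45)
P_Y <- colSums(P_XY)            # (0.25, 0.43, 0.32)

H <- function(p) -sum(p * log(p))

H(P_X)                          # 0.6881388
H(P_Y)                          # 1.0741
H(P_XY) - H(P_X)                # H(Y|X) via the chain rule: 1.058244
-sum(P_XY * log(P_XY / P_X))    # H(Y|X) computed directly:  1.058244
```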
The same joint distribution can also be used to compute the mutual information between X and Y:

|   | Y=1 | Y=2 | Y=3 | p(X) |
|---|---|---|---|---|
| X=1 | 0.1 | 0.25 | 0.20 | 0.55 |
| X=2 | 0.15 | 0.18 | 0.12 | 0.45 |
| p(Y) | 0.25 | 0.43 | 0.32 | |
\[ \begin{aligned} & \sum_{y} \sum_{x} P(X=x,Y=y)log\bigg[\frac{P(X=x,Y=y)}{P(X=x)P(Y=y)}\bigg] & \end{aligned} \]
\[ \begin{aligned} & 0.1 \times log \left[\frac{0.1}{0.55\times0.25} \right]+ 0.25 \times log \left[\frac{0.25}{0.55\times 0.43} \right]+\\ & 0.2 \times log \left[\frac{0.2}{0.55\times 0.32} \right]+ 0.15 \times log \left[\frac{0.15}{0.45\times 0.25} \right]+\\ & 0.18 \times log \left[\frac{0.18}{0.45\times 0.43} \right]+ 0.12 \times log \left[\frac{0.12}{0.45\times 0.32} \right] \end{aligned} \]
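A short R sketch of this sum (again with my own object names); the result also equals \(H(Y) - H(Y|X) = 1.0741 - 1.058244\), tying the mutual information back to the entropies computed above:

```r
P_XY <- matrix(c(0.10, 0.25, 0.20,
                 0.15, 0.18, 0.12), nrow = 2, byrow = TRUE)
P_X <- rowSums(P_XY)
P_Y <- colSums(P_XY)

# Mutual information I(X;Y) = sum p(x,y) * log[ p(x,y) / (p(x) p(y)) ], in nats.
I_XY <- sum(P_XY * log(P_XY / outer(P_X, P_Y)))
I_XY                            # approximately 0.0159

# Equivalent identity: I(X;Y) = H(Y) - H(Y|X).
H <- function(p) -sum(p * log(p))
H(P_Y) - (H(P_XY) - H(P_X))     # approximately 0.0159
```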