This section contains the setup and the various utility functions used throughout.
Libraries used:
library(data.table)
library(ggplot2)
library(mvtnorm)
library(extraDistr)
rstan::rstan_options(auto_write = TRUE)
Sys.setenv(LOCAL_CPPFLAGS = '-march=native')
options(mc.cores = 1)
Compiled code (any models used are shown later):
# mod1 <- rstan::stan_model(".//stan//logit_notrand.stan", verbose = FALSE)
# rstan::expose_stan_functions(mod1)
In 1948 Claude Shannon published a paper1 laying out a theory of information in the context of the statistical theory of communication. He was, in part, interested in how to transfer messages in the most efficient way. While Shannon’s formulation relates to telecommunications, the subject is also a re-interpretation of probability in a way that gives new insights. This is the perspective I take here to introduce some basic concepts.
Fundamentally, information theory helps us describe the uncertainty inherent in the state of a random variable. This measure of uncertainty should be high when a low probability event occurs, but low for events that are virtually certain to occur.
How can we formalise the uncertainty we have regarding an event with probability \(p(A) = p_A\)? The measure proposed by Shannon was:
\[ h = \log_2(1/p) = -\log_2(p) \]
d <- data.table(
p = seq(0, 1, len = 100)
)
ggplot(d, aes(x = p, y = -log2(p)))+
geom_line() +
scale_x_continuous("Probability of event",
breaks = seq(0, 1, by = 0.1)) +
scale_y_continuous("h") +
theme_bw()
Note - \(\log_2\) is commonly used, but any base can be used.
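For example (a quick sketch; the probability of 0.2 is arbitrary), the same uncertainty can be expressed in bits (base 2) or nats (base \(e\)), the two differing only by a factor of \(\log(2)\):
# Uncertainty of an event with probability 0.2 in bits and in nats
p_event <- 0.2
h_bits <- -log2(p_event)
h_nats <- -log(p_event)
# change of base: nats = bits * log(2)
all.equal(h_nats, h_bits * log(2))
## [1] TRUE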
The measure is the information content, or surprisal, of the event; its expectation over a distribution, introduced below, is the Shannon entropy. It increases as probability decreases, so likely events have low information content and vice versa.
For two events, \(A\) and \(B\), we need to consider their joint probability, \(p(A, B) = p(A)p(B|A) = p_{A,B}\). To compute the uncertainty we use:
\[ \begin{aligned} h(A,B) &= - \log_2(p_{A,B}) \\ &= - \log_2(p_A p_{B|A}) \\ &= -\log_2(p_A) - \log_2(p_{B|A}) \\ &= h(A) + h(B|A) \end{aligned} \]
which is \(h(A,B) = h(A) + h(B)\) when \(A\) and \(B\) are independent. Thus \(h\) is additive.
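As a small numerical sketch (the two events and their probabilities are made up purely for illustration), consider independent events with \(p_A = 0.25\) and \(p_B = 0.5\); the uncertainty of the joint event is the sum of the individual uncertainties:
# Two independent events, so h is additive
p_A <- 0.25; p_B <- 0.5
h_A <- -log2(p_A)          # 2 bits
h_B <- -log2(p_B)          # 1 bit
h_AB <- -log2(p_A * p_B)   # 3 bits = h_A + h_B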
We can extend this idea to probability distributions. If we have a random variable \(X\) with PMF \(p(X)\), then the probability of outcome \(X_i\) is \(p_i\) and it has uncertainty \(h(X_i) = \log_2(1/p_i)\). The expected value of \(h(X)\) is:
\[ H(X) = E(h(X)) = - \sum_i p_i \log_2(p_i) \]
where \(H\) is used to distinguish this from the \(h\) that we linked to single events, and \(0 \log(0) = 0\) is assumed for convenience.
Note that this definition does not depend on the values of the random variable, only the probabilities.
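A small helper function (a sketch of my own; shannon_entropy is not from any package) makes this calculation reusable for an arbitrary PMF, applying the \(0 \log(0) = 0\) convention:
# Shannon entropy (bits) of a discrete PMF supplied as a vector of probabilities
shannon_entropy <- function(p){
  stopifnot(all(p >= 0), abs(sum(p) - 1) < 1e-8)
  lp <- ifelse(p == 0, 0, log2(p))
  -sum(p * lp)
}
# e.g. a fair six-sided die has entropy log2(6), about 2.58 bits
shannon_entropy(rep(1/6, 6))
## [1] 2.584963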
One interpretation of \(H\) is the average number of y/n questions you would need to ask to find out which of the \(K\) possible outcomes from \(X\) occurred via a probability weighted binary search2.
As an example, for a Bernoulli random variable with probability of success equal to \(p\), the Shannon Entropy over the range of possible values of \(p\) is shown below. At the extremes of the range for \(p\), the Shannon Entropy is zero; there is no uncertainty regarding what the outcome will be. The maximum value for Shannon Entropy occurs when \(p=0.5\).
d <- data.table(
p = seq(0, 1, len = 100)
)
d[, h_1 := -ifelse(p == 0, 0, p * log2(p))]
d[, h_0 := -ifelse(p == 1, 0, (1-p) * log2(1-p))]
d[, H := h_1 + h_0]
ggplot(d, aes(x = p, y = H))+
geom_line() +
scale_x_continuous("Probability of event",
breaks = seq(0, 1, by = 0.1)) +
scale_y_continuous("H") +
theme_bw()
If we have two discrete random variables \(X\) and \(Y\) with \(n\) and \(m\) outcomes respectively, and a joint PMF with values \(p_{ij}\), then the joint Shannon entropy is
\[ H(X, Y) = - \sum_{i = 1}^n \sum_{j=1}^m p_{ij} \log_2(p_{ij}) \]
For example, consider a random sample of adults from which we obtain an empirical view of the bivariate association between decadal age \(X_i\) and the number of times married \(Y_i\).
p <- rbind(
c(90,10,0,0),
c(50,45,5,0),
c(42,48,10,2),
c(30,55,10,4),
c(25,57,12,5)
)
p <- p / sum(p)
rownames(p) <- c("10-19","20-29","30-39","40-49","50-59")
colnames(p) <- c("0", "1", "2", "3")
p
## 0 1 2 3
## 10-19 0.180 0.020 0.000 0.000
## 20-29 0.100 0.090 0.010 0.000
## 30-39 0.084 0.096 0.020 0.004
## 40-49 0.060 0.110 0.020 0.008
## 50-59 0.050 0.114 0.024 0.010
The joint uncertainty associated with this bivariate distribution is
lp <- ifelse(p == 0, 0, log2(p))
-sum(p * lp)
## [1] 3.570189
We might be interested in the uncertainty inherent in the number of times married \(Y_i\) when we already know the relevant age category \(X_i\). To do this we would need the conditional distribution of \(Y\).
\[ H_i = H(Y|X_i) = - \sum_{j=1}^m g_i(j) \log_2(g_i(j)) \]
where \(g_i(j) = \frac{p_{ij}}{\sum_{k=1}^m p_{ik}}\) is the conditional probability of \(Y = j\) given a known value \(X_i\). The denominator sums out \(Y\) to give the marginal probability of the given \(X_i\). For example, say that we are interested in the 30-39 age group; what is the conditional entropy?
First we compute the probability of \(X_i\) and then use it to renormalise the joint probabilities for the number of times married.
# Conditional probability of times married given age 30-39
p_a <- rowSums(p)
(p_m <- p["30-39",]/p_a["30-39"])
## 0 1 2 3
## 0.41176471 0.47058824 0.09803922 0.01960784
And now we compute the entropy using this conditional probability. Once we know the age, our uncertainty regarding number of times married has decreased.
-sum(p_m * log2(p_m))
## [1] 1.478555
While the above relates to a specific value of \(X\), we can also average across the whole of \(X\). The average conditional uncertainty for the number of times married is the expectation of \(H_i\) over the distribution of \(X\).
\[ H(Y|X) = \sum_{i=1}^n H_i f_i \]
where \(f_i = \sum_{j=1}^m p_{ij}\).
# Conditional probability of times married given each age
(p_c <- sweep(p, 1, p_a, FUN="/"))
## 0 1 2 3
## 10-19 0.9000000 0.1000000 0.00000000 0.00000000
## 20-29 0.5000000 0.4500000 0.05000000 0.00000000
## 30-39 0.4117647 0.4705882 0.09803922 0.01960784
## 40-49 0.3030303 0.5555556 0.10101010 0.04040404
## 50-59 0.2525253 0.5757576 0.12121212 0.05050505
Compute the entropy for each row.
# H_i for each row
H_i <- apply(p_c, 1, function(z){
lp <- ifelse(z == 0, 0, log2(z))
- sum(z * lp)
})
# Conditional entropy
sum(H_i * p_a)
## [1] 1.248347
We can also obtain the marginal uncertainty for the number of times married, i.e. disregarding age.
\[ H(Y) = - \sum_{j = 1}^m g(j) \log_2 g(j) \]
where \(g(j) = \sum_{i = 1}^n p_{ij}\). The marginal probability is simply
(p_m <- colSums(p))
## 0 1 2 3
## 0.474 0.430 0.074 0.022
thus the marginal uncertainty is
(H_y <- -sum(p_m * log2(p_m)))
## [1] 1.43319
Another way in which the joint entropy can be defined is relative to the conditional entropy. Specifically, we have the chain rule:
\[ H(X, Y) = H(X) + H(Y|X) \]
which can be derived by noting that \(p(x,y) = p(x)p(y|x)\) and substituting this into the equation for joint entropy \(H(X,Y) = - \sum_x \sum_y p(x,y) \log_2 p(x,y)\).
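As a numerical check (a sketch that reuses the joint PMF p, the row marginals p_a and the row-wise entropies H_i computed above), the chain rule can be verified against the joint entropy obtained earlier:
# Chain rule check: H(X,Y) = H(X) + H(Y|X)
H_x <- -sum(p_a * log2(p_a))   # marginal entropy of age
H_yx <- sum(H_i * p_a)         # conditional entropy of times married given age
H_x + H_yx                     # should match the joint entropy from above (about 3.57)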
The entropy of a random variable is a measure of uncertainty. The relative entropy (Kullback–Leibler divergence, sometimes called the Kullback–Leibler distance) is a measure of the discrepancy between two distributions; note that it is not symmetric and so is not a true distance. For two PMFs \(p(x)\) and \(q(x)\) we have
\[ D(p||q) = \sum_{x \in \mathcal{X}} p(x) \log(\frac{p(x)}{q(x)}) \]
again, this is just an expectation, this time of the log of \(p/q\) taken with respect to \(p\), with the conventions \(0 \log(0/q) = 0\) and \(p \log(p/0) = \infty\). In other words, it is the expected logarithm of the likelihood ratio. The KL divergence is always nonnegative and equals zero if and only if \(p = q\).
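As a small worked sketch (the choice of a uniform \(q\) over the four outcomes is mine, purely for illustration), the relative entropy between the marginal PMF for times married, p_m, and a uniform PMF is:
# KL divergence (bits) between the marginal PMF of times married and a uniform q
q <- rep(1/4, 4)
sum(ifelse(p_m == 0, 0, p_m * log2(p_m / q)))
# for a uniform q over K outcomes this equals log2(K) - H(Y), about 0.57 bits here
log2(4) - H_y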
In a similar vein, the amount of information that one RV contains about another RV is the mutual information. It is defined as the relative entropy between the joint distribution and the product distribution \(p(x)p(y)\):
\[ I(X; Y) = \sum_{x\in \mathcal{X}}\sum_{y\in \mathcal{Y}} p(x, y) \log \frac{p(x,y)}{p(x)p(y)} \]
More usefully, the mutual information can be rewritten as
\[ \begin{aligned} I(X; Y) &= \sum_{x,y} p(x, y) \log \frac{p(x|y)}{p(x)} \\ &= -\sum_{x,y} p(x, y) \log(p(x)) + \sum_{x,y} p(x, y) \log(p(x|y)) \\ &= H(X) - H(X|Y) \\ &= H(X) + H(Y) - H(X,Y) \end{aligned} \]
with the last line following from the chain rule \(H(X,Y) = H(Y) + H(X|Y)\). Thus, the mutual information \(I(X;Y)\) represents the reduction in the uncertainty of \(X\) due to knowledge of \(Y\).
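Using the age and times-married example again (a sketch reusing p, p_a, p_m and H_i from above), the mutual information can be computed directly from the definition and compared with \(H(Y) - H(Y|X)\):
# Mutual information (bits) between age X and times married Y
p_prod <- outer(p_a, p_m)   # product distribution p(x)p(y)
sum(ifelse(p == 0, 0, p * log2(p / p_prod)))
# equivalently (mutual information is symmetric), the reduction in uncertainty
# about times married from knowing age, about 0.18 bits
H_y - sum(H_i * p_a)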
For a continuous random variable, the analogous measure of entropy is known as the differential entropy3. For a continuous random variable \(X\) with density \(f(x)\), the differential entropy is defined as
\[ h(X) = -\int_S f(x) \log(f(x)) dx \]
where \(S\) is the support set of \(X\) such that \(f(x)>0\). For example, a uniform random variable with support \([0,a]\) and \(p(x) = 1/a\) for all \(x\in[0,a]\) has a differential entropy
\[ \begin{aligned} h(X) &= -\int_0^a \frac{1}{a} \log(1/a) dx \\ &= - x \frac{\log(1/a)}{a} \Big\vert_{x=0}^{a} \\ &= -\log(1/a) \\ &= \log(a) \end{aligned} \]
Clearly, unlike Shannon entropy, the differential entropy can be negative, for example if \(a<1\) in the above. Shannon discusses some of the similarities and differences in the 1948 paper.
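As a quick numerical check of this (my own sketch, using integrate with a uniform density and \(a = 1/2\)), the negative value shows up directly:
# Differential entropy of Uniform(0, a) with a = 0.5; analytic value is log(0.5)
a_unif <- 0.5
res <- integrate(function(x) dunif(x, 0, a_unif) * dunif(x, 0, a_unif, log = TRUE), 0, a_unif)
-res$value
## [1] -0.6931472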
As an example of the calculations, say we have a beta random variable with \(a = 40\) and \(b = 50\). We can compute the differential entropy as
a <- 40; b <- 50
beta1 <- function(x){
dbeta(x, shape1=a, shape2=b) * dbeta(x, shape1=a, shape2=b, log = T)
}
res <- integrate(beta1, 0, 1)
-res$value
## [1] -1.536131
which corresponds to the following analytic form
\[ \log(B(\alpha, \beta)) - (\alpha-1)\phi(\alpha) - (\beta-1)\phi(\beta) + (\alpha + \beta - 2)\phi(\alpha + \beta) \]
where \(B\) denotes the beta function and \(\phi\) is the digamma function \(\phi(x) = \Gamma^\prime(x)/ \Gamma(x)\), giving:
(h_x <- log(beta(a,b)) - (a-1)*digamma(a) - (b-1)*digamma(b) + (a+b-2)*digamma(a+b))
## [1] -1.536131
Another way that this integral can be computed is as follows. Recall that we want to compute the differential entropy over the support \(S\) of the pdf \(f(x)\) for a continuous random variable \(X\), i.e.
\[ \begin{aligned} h &= -\int_S f(x) \log(f(x)) dx \\ \end{aligned} \]
First, compute \(p_i\) for \(n\) segments, all with width \(\Delta\), over \(S\)
\[ \begin{aligned} p_i = \mathsf{Pr}(i\Delta < X < (i+1)\Delta) = \int_{i\Delta}^{(i+1)\Delta} f(x) dx \quad i \in \{0, 1, 2,\dots, (n-1) \} \end{aligned} \]
then compute density at midpoints of each segment
\[ \begin{aligned} f_i = f(x_i) \quad x_i = \Delta(i + 1/2) \end{aligned} \]
then use the following to approximate the integral
\[ \begin{aligned} - \sum p_i \log(f_i) \approx h \end{aligned} \]
which can be implemented as follows:
x <- seq(0, 1, length.out = 1e3)
x_mid <- (x[-1] + x[-length(x)]) / 2
p <- pbeta(x[-1], a, b) - pbeta(x[-length(x)], a, b)
lp <- dbeta(x_mid, a, b, log = TRUE)
-sum(p * lp)
## [1] -1.536115
Another perspective is to quantise the continuous random variable into a discrete random variable. This approach follows a similar development to the outline above, but heads off in a different direction. Again, consider a random variable \(X\) with density \(f(x)\) and suppose the range of \(X\) is divided into bins with width \(\Delta\). By the mean value theorem, there exists \(x_i\) such that
\[ f(x_i)\Delta = \int_{i\Delta}^{(i+1)\Delta}f(x)dx \]
Now consider a quantised version \(X^\Delta\) defined by
\[ X^\Delta = x_i \quad \mathsf{if} \quad i\Delta \le X < (i+1)\Delta \]
thus the probability that \(X^\Delta = x_i\) is
\[ p_i = f(x_i) \Delta \]
The entropy of the quantised version is
\[ \begin{aligned} H(X^\Delta) &= -\sum p_i \log(p_i) \\ &= -\sum f(x_i)\Delta \log(f(x_i)\Delta) \\ &= -\sum f(x_i)\Delta \log(f(x_i)) - \sum f(x_i)\Delta \log(\Delta) \\ &= -\sum f(x_i)\Delta \log(f(x_i)) - \log(\Delta) \end{aligned} \]
which implies that if the pdf \(f(x)\) is Riemann integrable, then
\[ H(X^\Delta) + \log \Delta \rightarrow h(X) = h, \quad \mathsf{as\ } \Delta \rightarrow 0 \]
So, say we quantise the \([0,1]\) domain of our beta distribution into equal-width chunks (in the code below \(\Delta \approx 0.01\)) and take the \(x_i\) to be the mid-points.
Note: using the mid-points is not exactly correct, but hopefully it will suffice.
nn <- 100
p <- seq(0, 1, len = nn)
p_mid <- (p[-1] + p[-nn])/2
delta <- diff(p)[1]
from which we can calculate the PMF of the quantised variable
# Density -> PMF
f_mid <- dbeta(p_mid, shape1=a, shape2=b)
p_i <- f_mid * delta
sum(p_i)
## [1] 1
d <- data.table(p_i = p_i, p_mid = p_mid)
ggplot(d, aes(x = p_mid, y = p_i)) +
geom_point() +
geom_linerange(aes(ymin = 0, ymax = p_i)) +
scale_x_continuous("x") +
scale_y_continuous("Pr(X = x)") +
theme_bw()
and the entropy of the quantised version is then
(H_delta <- -sum(p_i * log(p_i)))
## [1] 3.058989
# And the following should approximate the value for h_x
H_delta + log(delta)
## [1] -1.536131
h_x
## [1] -1.536131
Shannon 1948, A Mathematical Theory of Communication↩︎
A probability weighted binary search splits the search space in half at each step with reference to the probabilities rather than the usual \(n/2\).↩︎
Technically this isn’t quite correct. Interested readers should look into the limiting density of discrete points.↩︎