This section contains the setup and the various utility functions used throughout.
Libraries used:
library(data.table)
library(ggplot2)
library(mvtnorm)
library(extraDistr)
rstan::rstan_options(auto_write = TRUE)
Sys.setenv(LOCAL_CPPFLAGS = '-march=native')
options(mc.cores = 1)
Compiled code (any models used are shown later):
# mod1 <- rstan::stan_model(".//stan//logit_notrand.stan", verbose = FALSE)
# rstan::expose_stan_functions(mod1)
In 1948 Claude Shannon published a paper1 laying out a theory of information in the context of the statistical theory of communication. He was, in part, interested in how to transfer messages in the most efficient way. While Shannon’s formulation relates to telecommunications, the subject is also a re-interpretation of probability in a way that gives new insights. This is the perspective I take here to introduce some basic concepts.
Fundamentally, information theory helps us describe the uncertainty inherent in the state of a random variable. This measure of uncertainty should be high when a low probability event occurs, but low for events that are virtually certain to occur.
How can we formalise the uncertainty we have regarding an event with probability \(p(A) = p_A\)? The measure proposed by Shannon was:
\[ h = \log_2(1/p) = -\log_2(p) \]
d <- data.table(
p = seq(0, 1, len = 100)
)
ggplot(d, aes(x = p, y = -log2(p)))+
geom_line() +
scale_x_continuous("Probability of event",
breaks = seq(0, 1, by = 0.1)) +
scale_y_continuous("h") +
theme_bw()
Note - \(\log_2\) is commonly used, but any base can be used.
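For example (a quick sketch; the probability of 0.2 is arbitrary), the same uncertainty can be expressed in bits (base 2) or nats (base \(e\)), the two differing only by a factor of \(\log(2)\):
# Uncertainty of an event with probability 0.2 in bits and in nats
p_event <- 0.2
h_bits <- -log2(p_event)
h_nats <- -log(p_event)
# change of base: nats = bits * log(2)
all.equal(h_nats, h_bits * log(2))
## [1] TRUE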
The measure is the information content, or surprisal, of the event; its expectation over a distribution, introduced below, is the Shannon entropy. It increases as probability decreases, so likely events have low information content and vice versa.
For two events, \(A\) and \(B\), we need to consider their joint probability, \(p(A, B) = p(A)p(B|A) = p_{A,B}\). To compute the uncertainty we use:
\[ \begin{aligned} h(A,B) &= - \log_2(p_{A,B}) \\ &= - \log_2(p_A p_{B|A}) \\ &= -\log_2(p_A) - \log_2(p_{B|A}) \\ &= h(A) + h(B|A) \end{aligned} \]
which is \(h(A,B) = h(A) + h(B)\) when \(A\) and \(B\) are independent. Thus \(h\) is additive.
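As a small numerical sketch (the two events and their probabilities are made up purely for illustration), consider independent events with \(p_A = 0.25\) and \(p_B = 0.5\); the uncertainty of the joint event is the sum of the individual uncertainties:
# Two independent events, so h is additive
p_A <- 0.25; p_B <- 0.5
h_A <- -log2(p_A)          # 2 bits
h_B <- -log2(p_B)          # 1 bit
h_AB <- -log2(p_A * p_B)   # 3 bits = h_A + h_B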
We can extend this idea to probability distributions. If we have a random variable \(X\) with PMF \(p(X)\), then the probability of outcome \(X_i\) is \(p_i\) and it has uncertainty \(h(X_i) = \log_2(1/p_i)\). The expected value of \(h(X)\) is:
\[ H(X) = E(h(X)) = - \sum_i p_i \log_2(p_i) \]
where \(H\) is used to distinguish this from the \(h\) that we linked to single events, and \(0 \log(0) = 0\) is assumed for convenience.
Note that this definition does not depend on the values of the random variable, only the probabilities.
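A small helper function (a sketch of my own; shannon_entropy is not from any package) makes this calculation reusable for an arbitrary PMF, applying the \(0 \log(0) = 0\) convention:
# Shannon entropy (bits) of a discrete PMF supplied as a vector of probabilities
shannon_entropy <- function(p){
  stopifnot(all(p >= 0), abs(sum(p) - 1) < 1e-8)
  lp <- ifelse(p == 0, 0, log2(p))
  -sum(p * lp)
}
# e.g. a fair six-sided die has entropy log2(6), about 2.58 bits
shannon_entropy(rep(1/6, 6))
## [1] 2.584963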
One interpretation of \(H\) is the average number of y/n questions you would need to ask to find out which of the \(K\) possible outcomes from \(X\) occurred via a probability weighted binary search2.
As an example, for a Bernoulli random variable with probability of success equal to \(p\), the Shannon Entropy over the range of possible values of \(p\) is shown below. At the extremes of the range for \(p\), the Shannon Entropy is zero; there is no uncertainty regarding what the outcome will be. The maximum value for Shannon Entropy occurs when \(p=0.5\).
d <- data.table(
p = seq(0, 1, len = 100)
)
d[, h_1 := -ifelse(p == 0, 0, p * log2(p))]
d[, h_0 := -ifelse(p == 1, 0, (1-p) * log2(1-p))]
d[, H := h_1 + h_0]
ggplot(d, aes(x = p, y = H))+
geom_line() +
scale_x_continuous("Probability of event",
breaks = seq(0, 1, by = 0.1)) +
scale_y_continuous("H") +
theme_bw()
If we have two discrete random variables \(X\) and \(Y\) with \(n\) and \(m\) outcomes respectively, and a joint PMF with values \(p_{ij}\), then the joint Shannon entropy is
\[ H(X, Y) = - \sum_{i = 1}^n \sum_{j=1}^m p_{ij} \log_2(p_{ij}) \]
For example, consider a random sample of adults from which we obtain an empirical view of the bivariate association between decadal age \(X_i\) and the number of times married \(Y_i\).
p <- rbind(
c(90,10,0,0),
c(50,45,5,0),
c(42,48,10,2),
c(30,55,10,4),
c(25,57,12,5)
)
p <- p / sum(p)
rownames(p) <- c("10-19","20-29","30-39","40-49","50-59")
colnames(p) <- c("0", "1", "2", "3")
p
## 0 1 2 3
## 10-19 0.180 0.020 0.000 0.000
## 20-29 0.100 0.090 0.010 0.000
## 30-39 0.084 0.096 0.020 0.004
## 40-49 0.060 0.110 0.020 0.008
## 50-59 0.050 0.114 0.024 0.010
The joint uncertainty associated with this bivariate distribution is
lp <- ifelse(p == 0, 0, log2(p))
-sum(p * lp)
## [1] 3.570189
We might be interested in the uncertainty inherent in the number of times married \(Y_i\) when we already know the relevant age category \(X_i\). To do this we would need the conditional distribution of \(Y\).
\[ H_i = H(Y|X_i) = - \sum_{j=1}^m g_i(j) \log_2(g_i(j)) \]
where \(g_i(j) = \frac{p_{ij}}{\sum_{k=1}^m p_{ik}}\) is the conditional probability of \(Y = j\) given a known value \(X_i\). The denominator sums out \(Y\) to give the marginal probability of the given \(X_i\). For example, say that we are interested in the 30-39 age group; what is the conditional entropy?
First we compute the probability of \(X_i\) and then use it to renormalise the joint probabilities for the number of times married.
# Conditional probability of times married given age 30-39
p_a <- rowSums(p)
(p_m <- p["30-39",]/p_a["30-39"])
## 0 1 2 3
## 0.41176471 0.47058824 0.09803922 0.01960784
And now we compute the entropy using this conditional probability. Once we know the age, our uncertainty regarding number of times married has decreased.
-sum(p_m * log2(p_m))
## [1] 1.478555
While the above relates to a specific value of \(X\), we can also average across the whole of \(X\). The average conditional uncertainty for the number of times married is the expectation of \(H_i\) over the distribution of \(X\).
\[ H(Y|X) = \sum_{i=1}^n H_i f_i \]
where \(f_i = \sum_{j=1}^m p_{ij}\).
# Conditional probability of times married given each age
(p_c <- sweep(p, 1, p_a, FUN="/"))
## 0 1 2 3
## 10-19 0.9000000 0.1000000 0.00000000 0.00000000
## 20-29 0.5000000 0.4500000 0.05000000 0.00000000
## 30-39 0.4117647 0.4705882 0.09803922 0.01960784
## 40-49 0.3030303 0.5555556 0.10101010 0.04040404
## 50-59 0.2525253 0.5757576 0.12121212 0.05050505
Compute the entropy for each row.
# H_i for each row
H_i <- apply(p_c, 1, function(z){
lp <- ifelse(z == 0, 0, log2(z))
- sum(z * lp)
})
# Conditional entropy
sum(H_i * p_a)
## [1] 1.248347
We can also obtain the marginal uncertainty for the number of times married, i.e. disregarding age.
\[ H(Y) = - \sum_{j = 1}^m g(j) \log_2 g(j) \]
where \(g(j) = \sum_{i = 1}^n p_{ij}\). The marginal probability is simply
(p_m <- colSums(p))
## 0 1 2 3
## 0.474 0.430 0.074 0.022
thus the marginal uncertainty is
(H_y <- -sum(p_m * log2(p_m)))
## [1] 1.43319
Another way in which the joint entropy can be defined is relative to the conditional entropy. Specifically, we have the chain rule:
\[ H(X, Y) = H(X) + H(Y|X) \]
which can be derived by noting that \(p(x,y) = p(x)p(y|x)\) and substituting this into the equation for joint entropy \(H(X,Y) = - \sum_x \sum_y p(x,y) \log_2 p(x,y)\).
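As a numerical check (a sketch that reuses the joint PMF p, the row marginals p_a and the row-wise entropies H_i computed above), the chain rule can be verified against the joint entropy obtained earlier:
# Chain rule check: H(X,Y) = H(X) + H(Y|X)
H_x <- -sum(p_a * log2(p_a))   # marginal entropy of age
H_yx <- sum(H_i * p_a)         # conditional entropy of times married given age
H_x + H_yx                     # should match the joint entropy from above (about 3.57)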
The entropy of a random variable is a measure of uncertainty. The relative entropy (Kullback–Leibler divergence, sometimes called the Kullback–Leibler distance) is a measure of the discrepancy between two distributions; note that it is not symmetric and so is not a true distance. For two PMFs \(p(x)\) and \(q(x)\) we have
\[ D(p||q) = \sum_{x \in \mathcal{X}} p(x) \log(\frac{p(x)}{q(x)}) \]
again, this is just an expectation, this time of the log of \(p/q\) taken with respect to \(p\), with the conventions \(0 \log(0/q) = 0\) and \(p \log(p/0) = \infty\). In other words, it is the expected logarithm of the likelihood ratio. The KL divergence is always nonnegative and equals zero if and only if \(p = q\).
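As a small worked sketch (the choice of a uniform \(q\) over the four outcomes is mine, purely for illustration), the relative entropy between the marginal PMF for times married, p_m, and a uniform PMF is:
# KL divergence (bits) between the marginal PMF of times married and a uniform q
q <- rep(1/4, 4)
sum(ifelse(p_m == 0, 0, p_m * log2(p_m / q)))
# for a uniform q over K outcomes this equals log2(K) - H(Y), about 0.57 bits here
log2(4) - H_y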
In a similar vein, the amount of information that one RV contains about another RV is the mutual information. It is defined as the relative entropy between the joint distribution and the product distribution \(p(x)p(y)\):
\[ I(X; Y) = \sum_{x\in \mathcal{X}}\sum_{y\in \mathcal{Y}} p(x, y) \log \frac{p(x,y)}{p(x)p(y)} \]
More usefully, the mutual information can be rewritten as
\[ \begin{aligned} I(X; Y) &= \sum_{x,y} p(x, y) \log \frac{p(x|y)}{p(x)} \\ &= -\sum_{x,y} p(x, y) \log(p(x)) + \sum_{x,y} p(x, y) \log(p(x|y)) \\ &= H(X) - H(X|Y) \\ &= H(X) + H(Y) - H(X,Y) \end{aligned} \]
with the last line following from the chain rule \(H(X,Y) = H(Y) + H(X|Y)\). Thus, the mutual information \(I(X;Y)\) represents the reduction in the uncertainty of \(X\) due to knowledge of \(Y\).
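Using the age and times-married example again (a sketch reusing p, p_a, p_m and H_i from above), the mutual information can be computed directly from the definition and compared with \(H(Y) - H(Y|X)\):
# Mutual information (bits) between age X and times married Y
p_prod <- outer(p_a, p_m)   # product distribution p(x)p(y)
sum(ifelse(p == 0, 0, p * log2(p / p_prod)))
# equivalently (mutual information is symmetric), the reduction in uncertainty
# about times married from knowing age, about 0.18 bits
H_y - sum(H_i * p_a)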
For a continuous random variable, the analogous measure of entropy is known as the differential entropy3. For a continuous random variable \(X\) with density \(f(x)\), the differential entropy is defined as
\[ h(X) = -\int_S f(x) \log(f(x)) dx \]
where \(S\) is the support set of \(X\) such that \(f(x)>0\). For example, a uniform random variable with support \([0,a]\) and \(p(x) = 1/a\) for all \(x\in[0,a]\) has a differential entropy
\[ \begin{aligned} h(X) &= -\int_0^a \frac{1}{a} \log(1/a) dx \\ &= - x \frac{\log(1/a)}{a} \Big\vert_{x=0}^{a} \\ &= -\log(1/a) \\ &= \log(a) \end{aligned} \]
Clearly, unlike Shannon entropy, the differential entropy can be negative, for example if \(a<1\) in the above. Shannon discusses some of the similarities and differences in the 1948 paper.
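As a quick numerical check of this (my own sketch, using integrate with a uniform density and \(a = 1/2\)), the negative value shows up directly:
# Differential entropy of Uniform(0, a) with a = 0.5; analytic value is log(0.5)
a_unif <- 0.5
res <- integrate(function(x) dunif(x, 0, a_unif) * dunif(x, 0, a_unif, log = TRUE), 0, a_unif)
-res$value
## [1] -0.6931472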
As an example of the calculations, say we have a beta random variable with \(a = 40\) and \(b = 50\). We can compute the differential entropy as
a <- 40; b <- 50
beta1 <- function(x){
dbeta(x, shape1=a, shape2=b) * dbeta(x, shape1=a, shape2=b, log = T)
}
res <- integrate(beta1, 0, 1)
-res$value
## [1] -1.536131
which corresponds to the following analytic form
\[ \log(B(\alpha, \beta)) - (\alpha-1)\phi(\alpha) - (\beta-1)\phi(\beta) + (\alpha + \beta - 2)\phi(\alpha + \beta) \]
where \(B\) denotes the beta function and \(\phi\) is the digamma function \(\phi(x) = \Gamma^\prime(x)/ \Gamma(x)\), giving:
(h_x <- log(beta(a,b)) - (a-1)*digamma(a) - (b-1)*digamma(b) + (a+b-2)*digamma(a+b))
## [1] -1.536131
Another way that this integral can be computed is as follows. Recall that we want to compute the differential entropy over the support \(S\) of the pdf \(f(x)\) for a continuous random variable \(X\), i.e.
\[ \begin{aligned} h &= -\int_S f(x) \log(f(x)) dx \\ \end{aligned} \]
First, compute \(p_i\) for \(n\) segments, all with width \(\Delta\), over \(S\)
\[ \begin{aligned} p_i = \mathsf{Pr}(i\Delta < X < (i+1)\Delta) = \int_{i\Delta}^{(i+1)\Delta} f(x) dx \quad i \in \{0, 1, 2,\dots, (n-1) \} \end{aligned} \]
then compute density at midpoints of each segment
\[ \begin{aligned} f_i = f(x_i) \quad x_i = \Delta(i + 1/2) \end{aligned} \]
then use the following to approximate the integral
\[ \begin{aligned} - \sum p_i \log(f_i) \approx h \end{aligned} \]
which can be implemented as follows:
x <- seq(0, 1, length.out = 1e3)
x_mid <- (x[-1] + x[-length(x)]) / 2
p <- pbeta(x[-1], a, b) - pbeta(x[-length(x)], a, b)
lp <- dbeta(x_mid, a, b, log = TRUE)
-sum(p * lp)
## [1] -1.536115
Another perspective is to quantise the continuous random variable into a discrete random variable. This approach follows a similar development to the outline above, but heads off in a different direction. Again, consider a random variable \(X\) with density \(f(x)\) and suppose the range of \(X\) is divided into bins with width \(\Delta\). By the mean value theorem, there exists \(x_i\) such that
\[ f(x_i)\Delta = \int_{i\Delta}^{(i+1)\Delta}f(x)dx \]
Now consider a quantised version \(X^\Delta\) defined by
\[ X^\Delta = x_i \quad \mathsf{if} \quad i\Delta \le X < (i+1)\Delta \]
thus the probability that \(X^\Delta = x_i\) is
\[ p_i = f(x_i) \Delta \]
The entropy of the quantised version is
\[ \begin{aligned} H(X^\Delta) &= -\sum p_i \log(p_i) \\ &= -\sum f(x_i)\Delta \log(f(x_i)\Delta) \\ &= -\sum f(x_i)\Delta \log(f(x_i)) - \sum f(x_i)\Delta \log(\Delta) \\ &= -\sum f(x_i)\Delta \log(f(x_i)) - \log(\Delta) \end{aligned} \]
which implies that if the pdf \(f(x)\) is Riemann integrable, then
\[ H(X^\Delta) + \log \Delta \rightarrow h(X) = h, \quad \mathsf{as\ } \Delta \rightarrow 0 \]
So, say we quantise the \([0,1]\) domain of our beta distribution into equal-width chunks (in the code below \(\Delta \approx 0.01\)) and take the \(x_i\) to be the mid-points.
Note: using the mid-points is not exactly correct, but hopefully it will suffice.
nn <- 100
p <- seq(0, 1, len = nn)
p_mid <- (p[-1] + p[-nn])/2
delta <- diff(p)[1]
from which we can calculate the PMF of the quantised variable
# Density -> PMF
f_mid <- dbeta(p_mid, shape1=a, shape2=b)
p_i <- f_mid * delta
sum(p_i)
## [1] 1
d <- data.table(p_i = p_i, p_mid = p_mid)
ggplot(d, aes(x = p_mid, y = p_i)) +
geom_point() +
geom_linerange(aes(ymin = 0, ymax = p_i)) +
scale_x_continuous("x") +
scale_y_continuous("Pr(X = x)") +
theme_bw()
and the entropy of the quantised version is then
(H_delta <- -sum(p_i * log(p_i)))
## [1] 3.058989
# And the following should approximate the value for h_x
H_delta + log(delta)
## [1] -1.536131
h_x
## [1] -1.536131
Shannon 1948, A Mathematical Theory of Communication↩︎
A probability weighted binary search splits the search space in half at each step with reference to the probabilities rather than the usual \(n/2\).↩︎
Technically this isn’t quite correct. Interested readers should look into the limiting density of discrete points.↩︎