Setup

This section contains the setup and the various utility functions used throughout.

Libraries used:

library(data.table)
library(mvtnorm)
library(ggplot2)
library(MASS)
library(cubature) 

Compiled code (any models used are shown later):

# mod1 <- rstan::stan_model(".//stan//logit_notrand.stan", verbose = FALSE)
# rstan::expose_stan_functions(mod1)

Joint entropy

Previously, various foundational concepts for entropy were introduced. By the chain rule, we have:

\[ \begin{aligned} H(X_1, X_2, \dots, X_n) &= \sum_{i=1}^n H(X_i | X_{i-1}, \dots, X_1) \\ &\le \sum_{i=1}^n H(X_i) \end{aligned} \]

Equality holds if and only if the \(X_i\) are mutually independent, i.e. \(X_i\) is independent of \(X_{i-1}, \dots, X_1\) for all \(i\).

Assume a K-arm parallel group trial where the responses are from independent normal distributions. Below I generate data using rmvnorm with a covariance matrix that has the off-diagonals set to zero, i.e. the samples in each arm are independent. These data are simply meant to be indicative of the samples you would obtain from a posterior distribution after observing the data; in other words, they are what you would have to work with after fitting a Bayesian model via MCMC.

mu <- qlogis(c(0.15, 0.20, 0.27))
K <- length(mu)

# Start from the identity and fill in variances and covariances
S <- diag(K)

S2 <- S
S2[1,1] <- 0.2
S2[2,2] <- 0.5
S2[1,2] <- 0.8
S2[2,3] <- 0.7
S2[1,3] <- 0.5
S2[2,1] <- 0.8
S2[3,2] <- 0.7
S2[3,1] <- 0.5
# ensure the result is positive semi-definite
S2 <- S2 %*% t(S2)

# S1 keeps only the diagonal, i.e. zero covariance between the arms
S1 <- S2
S1[upper.tri(S1)] <- 0
S1[lower.tri(S1)] <- 0

# samples under independence
m1 <- rmvnorm(1e3, mu, S1)
pairs(m1)
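
A quick way to sanity check this is the sample correlation matrix, which should have near-zero off-diagonal entries:

# off-diagonal correlations should be near zero under independence
round(cor(m1), 2)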

Because independence is satisfied, the entropy of the joint distribution can be computed as the sum of the entropies from the marginals.

While the random variables are continuous, I compute the Shannon entropy by estimating a kernel density and then discretising the continuous distribution over a grid. An alternative is to use a histogram as the approximation, albeit one that is perhaps more choppy than the kernel; a sketch of that approach is shown after the kernel-based version below.

entropy1 <- function(m){
  # For each column (marginal) of random variates:
  H <- sum(unlist(lapply(1:ncol(m), function(ii){
    # Kernel density estimate of the marginal
    z <- density(m[, ii], from = -5, to = 5)
    # Grid spacing
    delta <- diff(z$x)[1]
    # Probability mass in each grid cell
    y <- z$y * delta
    # Log probability
    ly <- log(z$y * delta)
    # Avoid 0 * log(0) issues by setting these terms to zero
    ly[y == 0] <- 0
    # Shannon entropy of the discretised marginal
    -sum(y * ly)
  })))
  H
}
entropy1(m1)
## [1] 16.56226
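
For reference, the histogram alternative mentioned above might look like the sketch below. The function name and bin count are arbitrary choices, and since the discrete entropy depends on the bin width, the value will not match the kernel-based figure exactly.

# Histogram-based sketch: sum of marginal entropies from binned counts
entropy1_hist <- function(m, breaks = 50){
  sum(unlist(lapply(1:ncol(m), function(ii){
    h <- hist(m[, ii], breaks = breaks, plot = FALSE)
    # convert counts to probabilities and drop empty bins
    p <- h$counts / sum(h$counts)
    p <- p[p > 0]
    -sum(p * log(p))
  })))
}
entropy1_hist(m1)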

Compare this to a direct calculation based on the joint distribution (also using a grid approximation). The result is similar to the above.

# humongous grid spanning the support of all three marginals
z <- density(m1[, 1], from = -5, to = 5)
x <- z$x
grid <- expand.grid(
  x1 = x, x2 = x, x3 = x
)
# discretise the joint density over the grid and normalise to probabilities
p <- dmvnorm(grid, mu, S1)
p <- p / sum(p)
lp <- log(p)
lp[p == 0] <- 0
-sum(p * lp)
## [1] 16.44398

However, in other situations, the K arms could be dependent, as in dose-ranging studies.

m2 <- rmvnorm(1e3, mu, S2)
pairs(m2)
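
For reference, the sample covariance should roughly recover S2, confirming that the arms are now correlated:

# sample covariance should be close to S2
round(cov(m2), 2)
round(S2, 2)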

In such a setting, calculating the entropy of the joint distribution as the sum of the entropies from the marginals will overstate its true value.

entropy1(m2)
## [1] 16.505

Instead, we should compute the joint entropy directly from the joint distribution. This will produce a lower value:

# humongous grid
z <- density(m2[, 1], from = -5, to = 5)
x <- z$x
grid <- expand.grid(
  x1 = x, x2 = x, x3 = x
)
p <- dmvnorm(grid, mu, S2)
p <- p / sum(p)
lp <- log(p)
lp[p == 0] <- 0
-sum(p * lp)
## [1] 14.45336
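
The gap between the two is the total correlation (the multivariate generalisation of mutual information), which is non-negative and zero only under independence. A rough check, reusing the grid quantities just computed (rough because the marginal entropies come from kernel density estimates of the samples while the joint uses the exact density):

# Total correlation: sum of marginal entropies minus joint entropy
H_joint <- -sum(p * lp)
entropy1(m2) - H_joint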

Shannon tried to extend the definition of entropy to the continuous setting via the differential entropy

\[ h(f) = -\int_{\mathcal{X}} f(x) \log(f(x)) dx \]

which essentially just replaced the summation with an integral. However, Jaynes went on to define the limiting density of discrete points (LDDP), which is an adjustment to the differential entropy aimed at addressing some of its limitations. But we will ignore the LDDP for now and just work with differential entropy. For the multivariate normal distribution, the differential entropy is

\[ \frac{k}{2} + \frac{k}{2} \log(2\pi) + \frac{1}{2}\log(|\Sigma|) \]

where \(|\Sigma|\) denotes the determinant of the covariance matrix and \(k\) the dimension. So for the two settings described above, we would obtain:

K/2 + K/2 * log(2*pi) + 1/2 * log(det(S1))
## [1] 4.658515
K/2 + K/2 * log(2*pi) + 1/2 * log(det(S2))
## [1] 2.662266
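
Since the differential entropy is just \(E[-\log f(X)]\) with \(X \sim f\), a simple Monte Carlo alternative (a sketch using the samples drawn earlier) is to average the negative log density over the draws:

# Monte Carlo estimate of differential entropy: mean of -log f(X) over draws from f
-mean(dmvnorm(m1, mu, S1, log = TRUE))
-mean(dmvnorm(m2, mu, S2, log = TRUE))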

These values can also be approximated via numerical integration:

# Integrand for the differential entropy (could be vectorised to speed this up)
ent_norm <- function(x, mu, S){
  # note the negative sign
  - dmvnorm(x, mu, S) * dmvnorm(x, mu, S, log = TRUE)
}

# Integrate over a box that contains essentially all of the probability mass
lwr <- c(-5, -5, -5)
upr <- c(5, 5, 5)
Hmv1 <- adaptIntegrate(ent_norm, lwr, upr, mu = mu, S = S1)
Hmv2 <- adaptIntegrate(ent_norm, lwr, upr, mu = mu, S = S2)
Hmv1$integral
## [1] 4.632769
Hmv2$integral
## [1] 2.648844
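
This also explains the large gap between the differential entropies here and the grid-based Shannon entropies computed earlier: discretising a density onto a grid with spacing \(\delta\) in each of the \(k\) dimensions adds roughly \(-k \log(\delta)\) to the differential entropy, which is closely related to the issue the LDDP is intended to address. A quick check, reusing the grid spacing from the density() call above:

# grid spacing from density() (512 points over [-5, 5])
delta <- diff(z$x)[1]
# discrete entropy is roughly differential entropy - K * log(delta)
Hmv1$integral - K * log(delta)   # compare with the grid-based value for S1
Hmv2$integral - K * log(delta)   # compare with the grid-based value for S2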