Setup

This section contains the setup and the various utility functions used throughout.

Libraries used:

library(data.table)
library(ggplot2)

# Cache compiled Stan models and allow hardware-specific optimisation
rstan::rstan_options(auto_write = TRUE)
Sys.setenv(LOCAL_CPPFLAGS = '-march=native')
options(mc.cores = 1)

Compiled code (any models used are shown later):

# mod1 <- rstan::stan_model(".//stan//logit_notrand.stan", verbose = FALSE)
# rstan::expose_stan_functions(mod1)

Introduction

John Maynard Keynes coined the term “Principle of Indifference” for Laplace’s more convoluted “principle of insufficient reason”. The rule gives us a strategy for assigning probabilities when we have no special knowledge of a situation: in the absence of any relevant evidence, agents should distribute their degrees of belief equally among all possible outcomes. The approach has some well-known problems, but they are not discussed here. The point is simply to introduce the idea of maximising uncertainty (Shannon entropy) as an analogous principle.
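As a quick illustration of that connection (an addition with made-up PMFs, not part of the original text), the Shannon entropy of a probability mass function is largest when the distribution is uniform and shrinks as belief concentrates:

# Shannon entropy in bits, adopting the convention 0 * log2(0) = 0
entropy <- function(p) -sum(ifelse(p == 0, 0, p * log2(p)))

entropy(c(1/3, 1/3, 1/3))  # uniform over 3 outcomes: log2(3), about 1.58
entropy(c(0.5, 0.3, 0.2))  # less even: about 1.49
entropy(c(1, 0, 0))        # complete certainty: 0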

MaxEnt

Edwin Jaynes was primarily responsible for developing the ideas of Maximum Entropy. Broadly speaking, it is an approach for estimating input probabilities given knowledge of the output event. It was originally motivated by the problem of relating macroscopic properties of a physical system to a characterisation of its microstate. MaxEnt is a more general approach than Bayes' theorem for achieving this goal. It avoids making any unnecessary assumptions and produces a probability distribution consistent with whatever constraints are known.

An example may give some insight. Consider a manufacturer that is known to produce three products, \(X_1\), \(X_2\) and \(X_3\), for which they charge A$3, A$2 and A$1 respectively.
Their average sale is A$1.70; what is the probability of a customer buying each of these products?

As constraints, we have

\[ \begin{aligned} 1 &= \sum_{i=1}^3 p_i \\ 1.7 &= 3 p_1 + 2 p_2 + 1 p_3 \end{aligned} \]

Given that we have three unknowns but only two equations, we are in a bit of a bind. Each of the possible values we could ascribe to the PMF over \(X\) yields a different amount of entropy. The Principle of MaxEnt says we should select the probability distribution that leaves us with the largest uncertainty, so that we do not make any unwarranted assumptions. Subtracting the first constraint from the second gives \(2 p_1 + p_2 = 0.7\), which lets us eliminate terms and express each probability as a function of \(p_1\):

\[ \begin{aligned} p_2 &= 0.7 - 2 p_1 \\ p_3 &= 0.3 + p_1 \end{aligned} \]

which, since each \(p_i\) must lie in \([0, 1]\), implies

\[ \begin{aligned} 0 \le &p_1 \le 0.35 \\ 0 \le &p_2 \le 0.7 \\ 0.3 \le &p_3 \le 0.65 \end{aligned} \]
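A quick numerical spot check (illustrative only, using an arbitrary \(p_1 = 0.2\)) confirms that any \(p_1\) in this range yields a valid PMF satisfying both constraints:

p1_chk <- 0.2
p2_chk <- 0.7 - 2 * p1_chk
p3_chk <- 0.3 + p1_chk

c(total = p1_chk + p2_chk + p3_chk,                # should equal 1
  mean_sale = 3 * p1_chk + 2 * p2_chk + p3_chk)    # should equal 1.7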

Substituting into the equation for Shannon entropy gives:

\[ \begin{aligned} H = - \left[ p_1 \log_2(p_1) + (0.7 - 2 p_1) \log_2 (0.7 - 2 p_1) + (0.3 + p_1) \log_2 (0.3 + p_1) \right] \end{aligned} \]

# Grid of candidate values for p1 across its feasible range
p <- seq(0, 0.35, len = 1000)

p1 <- p
p2 <- (0.7 - 2*p)
p3 <- (0.3 + p)

# Adopt the convention 0 * log2(0) = 0 to avoid -Inf at the endpoints
lp1 <- ifelse(p1 == 0, 0, log2(p1))
lp2 <- ifelse(p2 == 0, 0, log2(p2))
lp3 <- ifelse(p3 == 0, 0, log2(p3))

# Shannon entropy (in bits) at each candidate value of p1
H <- -(
  p1 * lp1 + p2 * lp2 + p3 * lp3
)

d <- data.table(p1 = p1, H = H)

# Entropy as a function of p1; the maximum lies in the interior of the range
ggplot(d, aes(x = p1, y = H)) +
  geom_line() +
  scale_x_continuous("p1") +
  scale_y_continuous("H (bits)") +
  theme_bw()
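The plot shows a single interior maximum. As an added check (not part of the original analysis), R's built-in optimize() can locate it directly; the interval endpoints are nudged inside \((0, 0.35)\) so the logarithms stay finite:

H_fun <- function(p) {
  -(p * log2(p) +
      (0.7 - 2 * p) * log2(0.7 - 2 * p) +
      (0.3 + p) * log2(0.3 + p))
}

optimize(H_fun, interval = c(1e-6, 0.35 - 1e-6), maximum = TRUE)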

# Index of the grid point with the highest entropy
idx <- which.max(H)

p1_maxent <- p1[idx]
p2_maxent <- 0.7 - 2*p1[idx]
p3_maxent <- 0.3 + p1[idx]

# Check that the mean-sale constraint is recovered (should be close to 1.7)
# 3 * p1_maxent + 2 * p2_maxent + p3_maxent

A simple grid approximation suggests that the Shannon entropy is maximised at \(p_1 =\) 0.1947948, which implies \(p_2 =\) 0.3104104 and \(p_3 =\) 0.4947948. These are not necessarily the ‘true’ values, but they are the maximally noncommittal estimates, i.e. the ones that minimise the assumptions associated with our allocation of probabilities.
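We can also corroborate the grid estimate analytically (again an added cross-check, not in the original analysis). Differentiating \(H\) with respect to \(p_1\) and setting the result to zero gives \(\log_2 p_1 - 2 \log_2(0.7 - 2 p_1) + \log_2(0.3 + p_1) = 0\), i.e. \(p_1 (0.3 + p_1) = (0.7 - 2 p_1)^2\), which rearranges to the quadratic \(3 p_1^2 - 3.1 p_1 + 0.49 = 0\). Its root inside \([0, 0.35]\) matches the grid value:

# Coefficients of 0.49 - 3.1 p + 3 p^2 in increasing order of degree
roots <- Re(polyroot(c(0.49, -3.1, 3)))
roots[roots >= 0 & roots <= 0.35]  # ~0.1948, agreeing with the grid search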