…
As a simple example, let us turn back to a case of inferring the ordering preference of an English binomial, such as {radio, television}. The words in this particular binomial differ in length (quantified as, for example, number of syllables), and numerous authors have suggested that a short-before-long metrical constraint is one determinant of ordering preferences for English binomials (Cooper and Ross, 1975; Pinker and Birdsong, 1979, inter alia). Our prior knowledge therefore inclines us to expect a preference for the ordering radio and television (abbreviated as r) more strongly than a preference for the ordering television and radio (t), but we may be relatively agnostic as to the particular strength of the ordering preference. A natural probabilistic model here would be the binomial distribution with success parameter \(\pi\), and a natural prior might be one which is uniform within each of the ranges \(0 \leq \pi \leq 0.5\) and \(0.5 < \pi < 1\), but with twice as much probability mass in the latter range as in the former. This would be the following prior: \[ p(\pi = x) = \begin{cases} 2/3 & 0 \leq x \leq 0.5 \\ 4/3 & 0.5 < x < 1 \\ 0 & \text{otherwise} \end{cases} \] which is a step function, illustrated here:
d = function(v) sapply(v, FUN=function(y) {
  if (y < 0) 0           # no density below 0
  else if (y <= .5) 2/3  # density on [0, .5]
  else if (y < 1) 4/3    # density on (.5, 1)
  else 0                 # no density at or above 1
})
curve(d(x), col='blue', xlim=c(-.2,1.2), xlab='parameter', ylab='prior density')
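As a quick sanity check (not part of the original notes), we can verify that this density integrates to 1:

integrate(d, 0, 1)$value  # (2/3)(1/2) + (4/3)(1/2) = 1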
In such cases, there are typically no closed-form expressions for the posterior or predictive distributions given arbitrary observed data y. However, these distributions can very often be approximated using general-purpose sampling-based approaches, under which samples (in principle independent of one another) are drawn for the quantities that are unknown in the model. These samples can then be used, in combination with density estimation techniques, to approximate any probability density of interest.
For example, suppose we obtain data \(y\) consisting of ten binomial tokens - five of r and five of t - and are interested in approximating the following distributions:

1. the posterior distribution over the success parameter \(\pi\) given \(y\);
2. the posterior predictive distribution over the outcome of the next (11th) token;
3. the posterior predictive distribution over the outcomes of the next ten tokens.
We can use a variety of techniques to sample from these distributions. The simplest is rejection sampling. This is generally not an efficient way to produce samples, but it gives a useful insight into what sampling techniques are accomplishing. There is a variety of other sampling techniques with cool-sounding names - Markov Chain Monte Carlo, Gibbs Sampling, Particle Filters, … - that are usually more efficient, though sometimes less precise.
The intuition behind rejection sampling is simple. As you’ve already seen a number of times in this class, if you can generate a large enough number of samples \(S\) from the joint distribution of \(A\) and \(B\), you can approximate \(P(A = a, B = b)\) by the fraction of samples in which both \(A=a\) and \(B=b\): \[ P(A=a,B=b) \approx \frac{|\{s \in S\ |\ A(s) = a\ \&\ B(s) = b \}|}{|S|} \]
Similarly, we can approximate the conditional distribution \(P(A=a|B=b)\) by throwing away the samples where \(B \neq b\), and then considering the fraction of the remaining samples in which \(A=a\). \[ P(A=a|B=b) \approx \frac{|\{s \in S\ |\ A(s) = a\ \&\ B(s) = b \}|}{|\{s \in S \ |\ B(s) = b\}|} \] This works on the same principle as conditionalization generally: the distribution conditional on \(B = b\) is the one you get if you temporarily ignore all the possibilities in which \(B \neq b\), and then re-adjust probabilities so that the total probability mass of the \(B=b\) portion of logical space is 1.
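To make this concrete, here is a minimal sketch (using a made-up dice example, not from the original text) of both approximations:

n = 100000
die1 = sample(1:6, n, replace=TRUE)  # simulate two fair dice
die2 = sample(1:6, n, replace=TRUE)
# joint probability P(die1 = 3, die1 + die2 = 7): fraction of all samples
mean(die1 == 3 & die1 + die2 == 7)   # true value: 1/36
# conditional probability P(die1 = 3 | die1 + die2 = 7): throw away samples
# where the sum is not 7, then take the fraction with die1 = 3
mean(die1[die1 + die2 == 7] == 3)    # true value: 1/6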
Suppose, then, that we want to approximate the three quantities mentioned above, given the model and the first ten data points (with five r and five t tokens). The first of these is the posterior distribution \[ p(\pi \mid y, I) \] where \(I\) describes our prior beliefs, including the form of the model and the step-function prior we’re using. Here is one way to sample from this distribution in R. I’ve arbitrarily chosen to associate radio and television with 1 (‘success’/‘heads’), and television and radio with 0 (‘failure’/‘tails’).
sampler = function(i) { # i is a dummy index so that we can use sapply() below
  if (runif(1,0,1) < 1/3) {   # with 1/3 probability, do this:
    parameter = runif(1,0,.5) # get a random number in [0, .5]
  } else {                    # with 2/3 probability, do this:
    parameter = runif(1,.5,1) # get a random number in (.5, 1)
  }
  n.data.points = 10
  # now simulate some coin flips
  simulated.heads = rbinom(n=1, size=n.data.points, prob=parameter)
  return(c(parameter, simulated.heads))
}
n.samples = 1000
my.samples = t(sapply(1:n.samples, sampler))
colnames(my.samples) = c('parameter', 'n.heads')
head(my.samples)
## parameter n.heads
## [1,] 0.6327543 7
## [2,] 0.9541039 10
## [3,] 0.9723376 10
## [4,] 0.5308931 7
## [5,] 0.3435114 3
## [6,] 0.7488496 7
par(mfrow=c(1,2))
plot(table(my.samples[,'n.heads']), main='number of heads in samples')
plot(density(my.samples[,'parameter']),
     xlab='parameter', ylab='unconditional density', main='')
To approximate the conditional distribution given our data, we can look at the subset of the samples in which the simulated number of ‘heads’ is the same as in our data:
observed.heads = 5
accurate.samples = subset(my.samples, my.samples[,'n.heads'] == observed.heads)
head(accurate.samples)
## parameter n.heads
## [1,] 0.6060713 5
## [2,] 0.5867212 5
## [3,] 0.6902469 5
## [4,] 0.3790515 5
## [5,] 0.6589818 5
## [6,] 0.6864334 5
plot(density(accurate.samples[,'parameter']),
     xlab='parameter', ylab='conditional density given data', main='')
Note the considerable difference between prior and posterior. Also, even though \(\pi = .5\) is the parameter value most likely to have produced the observed data, the prior is skewed toward values greater than \(.5\), so the posterior mean comes out a bit greater than \(.5\):
mean(accurate.samples[,'parameter'])
## [1] 0.5416498
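As a cross-check (a sketch, not part of the original analysis), we can compute the exact posterior mean by numerical integration, reusing the step-density function d defined above:

unnorm = function(p) d(p) * dbinom(5, 10, p)          # prior density times likelihood
Z = integrate(unnorm, 0, 1)$value                     # normalizing constant
integrate(function(p) p * unnorm(p), 0, 1)$value / Z  # comes out near 0.54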
But this version is problematic because we can’t control the number of samples we end up with - it depends on the prior probability of the conditioning statement. In this case, we decided to take 1000 samples from the unconditional distribution, but we actually only got 101 samples from the conditional distribution:
dim(accurate.samples)
## [1] 101 2
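This acceptance rate is in fact predictable: it is just the prior probability of the conditioning statement, i.e. the prior predictive probability of observing exactly five heads in ten flips. A quick check by numerical integration (again assuming the d function from above):

# by the symmetry of dbinom(5, 10, p) around p = .5, this works out to
# exactly 1/11 (about 0.09), consistent with 101 acceptances out of 1000
integrate(function(p) d(p) * dbinom(5, 10, p), 0, 1)$value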
If we wanted to be able to set in advance the number of samples we get from the conditional distribution, we could do it using a for- or while-loop. For example, the following code will keep sampling from the conditional distribution until it gets n.samples samples where the conditioning statement is satisfied.
n.samples = 1000
my.samples = rep(NA, n.samples)
n.accepted = 0
n.observations = 10
observed.heads = 5
while (n.accepted < n.samples) {
  if (runif(1,0,1) < 1/3) {
    parameter = runif(1,0,.5)
  } else {
    parameter = runif(1,.5,1)
  }
  # now simulate some coin flips
  simulated.heads = rbinom(n=1, size=n.observations, prob=parameter)
  if (simulated.heads == observed.heads) { # where conditionalization happens!
    n.accepted = n.accepted + 1
    my.samples[n.accepted] = parameter
  }
}
my.samples[1:20]
## [1] 0.4068706 0.6488521 0.6887383 0.4396943 0.6328017 0.7660318 0.4286530
## [8] 0.5677014 0.5352043 0.5115680 0.4732038 0.3927676 0.4483952 0.6241509
## [15] 0.6795873 0.5309849 0.7031046 0.5154484 0.6745366 0.6310864
plot(density(my.samples),
     xlab='parameter', ylab='conditional density', main='')
The if-statement inside the while-loop is what determines the condition: in effect, we sample from the joint distribution and then, if the conditioning statement is not satisfied, we don’t record anything and just run the while-loop again.
The quality of the approximation that you get from any sampling technique depends on the number of samples; in rejection sampling, this is the only thing we have to worry about. (Other techniques are a bit more finicky.) The worry is this: if we take \(N\) samples, how many will remain after we throw out the ones where \(B \neq b\)? The reason that rejection sampling can be inefficient is that, if the prior probability \(P(B=b)\) is low, then we will throw out most of the samples we take, and we will have to take a large number of samples from the joint distribution to get a reasonable approximation to \(P(A=a|B=b)\). On average, if we want to end up with \(m\) samples and the prior probability of the conditioning statement is \(q\), we will have to take \(m/q\) samples from the unconditional distribution. This is OK if \(q\) is, say, \(.5\) - we just take \(2 \times m\) samples. But if \(q = .001\) - not an uncommon scenario in complicated models - we’ll throw away 99.9% of our samples, and so we have to take around \(1000\) times as many samples from the joint distribution as the number of samples we want (\(m\)). This gets very annoying. The reason we generally prefer other, fancier sampling techniques is that they allow us to avoid throwing away so much work.
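To make the arithmetic concrete with our running example (using the acceptance probability of about 1/11 computed above):

q = 1/11   # prior probability of the conditioning statement (computed above)
m = 1000   # number of conditional samples we want
m / q      # expected number of unconditional draws needed: about 11000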
Sticking with rejection sampling for the time being: given what we’ve observed, what should we expect the 11th flip to be like? What about the next ten?
n.samples = 1000
my.samples = matrix(NA, nrow=n.samples, ncol=3,
                    dimnames=list(1:n.samples, c('parameter', '11th', 'next.ten')))
n.accepted = 0
n.observations = 10
observed.heads = 5
while (n.accepted < n.samples) {
  if (runif(1,0,1) < 1/3) {
    parameter = runif(1,0,.5)
  } else {
    parameter = runif(1,.5,1)
  }
  simulated.heads = rbinom(n=1, size=n.observations, prob=parameter)
  if (simulated.heads == observed.heads) { # where conditionalization happens!
    n.accepted = n.accepted + 1
    next.flip = rbinom(n=1, size=1, prob=parameter)
    next.ten = rbinom(n=1, size=10, prob=parameter)
    this.sample = c(parameter, next.flip, next.ten)
    # replace whole row with sample characteristics
    my.samples[n.accepted,] = this.sample
  }
}
my.samples[1:10,]
## parameter 11th next.ten
## 1 0.5828332 0 7
## 2 0.5239234 1 6
## 3 0.2919434 0 2
## 4 0.3869844 1 2
## 5 0.5683187 0 5
## 6 0.5941923 1 7
## 7 0.6094865 0 6
## 8 0.6496907 1 7
## 9 0.5636408 0 4
## 10 0.6090278 1 5
Sanity check:
plot(density(my.samples[,'parameter']),
     xlab='parameter', ylab='conditional density', main='')
Posterior predictive probability of getting ‘heads’ on the 11th flip:
mean(my.samples[,'11th'])
## [1] 0.552
This should be very close to the mean parameter estimate we got above - why?
Posterior predictive distribution over number of ‘heads’ in ten more flips:
plot(table(my.samples[,'next.ten']), main='heads in 10 more flips',
     xlab='number of heads', ylab='frequency')
For comparison, here’s what we get using the same technique with a uniform prior:
n.samples = 1000
my.samples = matrix(NA, nrow=n.samples, ncol=3,
                    dimnames=list(1:n.samples, c('parameter', '11th', 'next.ten')))
n.accepted = 0
n.observations = 10
observed.heads = 5
while (n.accepted < n.samples) {
  parameter = runif(1,0,1) # uniform prior on the binomial parameter
  simulated.heads = rbinom(n=1, size=n.observations, prob=parameter)
  if (simulated.heads == observed.heads) { # where conditionalization happens!
    n.accepted = n.accepted + 1
    next.flip = rbinom(n=1, size=1, prob=parameter)
    next.ten = rbinom(n=1, size=10, prob=parameter)
    this.sample = c(parameter, next.flip, next.ten)
    # replace whole row with sample characteristics
    my.samples[n.accepted,] = this.sample
  }
}
mean(my.samples[,'parameter'])
## [1] 0.5043839
mean(my.samples[,'11th'])
## [1] 0.489
plot(table(my.samples[,'next.ten']), main='heads in 10 more flips',
     xlab='number of heads', ylab='frequency')
Note that 5 is a more likely outcome than 6 with this prior, but they were about equally likely with the step-function prior.
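We can confirm this pattern without sampling. The following sketch (not part of the original notes) computes the exact posterior predictive distribution over heads in ten more flips under each prior by numerical integration, again reusing the step-density d defined at the start of the section:

post.pred = function(prior) {
  # normalizing constant: prior predictive probability of the observed data
  Z = integrate(function(p) prior(p) * dbinom(5, 10, p), 0, 1)$value
  # posterior predictive probability of k heads in ten more flips, k = 0..10
  sapply(0:10, function(k)
    integrate(function(p) prior(p) * dbinom(5, 10, p) * dbinom(k, 10, p),
              0, 1)$value / Z)
}
round(rbind(step = post.pred(d),
            uniform = post.pred(function(p) rep(1, length(p)))), 3)

Under the uniform prior the predictive distribution is symmetric around 5, so 5 beats 6; the step-function prior shifts mass toward larger parameter values, which is why 5 and 6 come out roughly tied.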