Andrew Ellis
05/09/2013
Imagine we are repeatedly tossing a coin (for some reason). We can think of the outcome of this coin toss as a Bernoulli process, with the probability that the coin will come up “heads'' (success) given by
\( p(success) \sim Bern(\theta) \)
where \( \theta \) is the rate parameter, which means that we are modelling each individual trial.
Alternatively, we can consider \( N \) trials, and model the probability of obtaining \( k \) successful trials out of \( N \), in which case we need a binomal distribution:
\( p(k) \sim Bin(\theta, N) \).
Let's simulate throwing this coin \( 20 \) times (a lot faster than actually tossing the coin). FOr this example, we will use a fair coin, which means that the rate parameter \( \theta \) should be \( 0.5 \):
N <- 20
theta <- 0.5
k = rbinom(n = 1, size = N, prob = theta)
k
[1] 8
We have obtained 8 successes out of 20 tosses.
Now suppose that we are told that a given coin has produced 8 successes out of 20 tosses. How should we go about inferring the rate parameter, given our observations? The intuitive answer is, of course, to calculate the relative frequency: \( k/N \), which gives us an estimate of 0.4.
We will see later that this corresponds to a special case of inference (also maximum likelihood estimate).
At this point you might be asking yourself: what's the big deal deal with coins? Isn't this all rather trivial? It might help to consider a coin toss as a simple metaphor for any process that generates binary responses.
A very important application would be a participant producing binary responses in a 2-alternative forced choice experiment. The k choices are out of N trials are distributed as \( Bin(\theta, N) \), and \( \theta \) is a parameter that we want to infer from the participants choices.
The process currently under consideration models the choice behaviour as simply being generated by a rate, without considering any experimental manipulations. This correpsonds to inferring the participant's bias.
By thinking about it this way, we will be able to build a model of the participants behaviour, and the deep connection to generalized linear models will become clear. In case, we are building a linear model (with just an intercept), and, using GLM terminology, we are assuming a logit link function.
This correpsonds to logistic regression.
Let's continue our example:
Using Bayes theorem, we can write an equation for the conditional probability of the rate \( \theta \), given our observed data, which we denote \( D \): \[ P(\theta | D) = \frac{P(D|\theta) * P(\theta)}{\int P(D | \theta) * P(\theta) \ \mathrm{d}\theta} \] We can solve this analytically, but we will do that later. For now, we are only considering computation approaches. Unfortunately, however, integrals look rather scary to non-mathematicians, and it would be nice if we could just close our eyes and make them go away.
The good news is that this is precisely what we can do, by using Markov Chain Monte Carlo sampling. We will explore what this means later. At the moment, it will suffice to consider it as magic that will allow us to ignore the denominator, and write:
\[ P(\theta | D) \propto P(D|\theta) * P(\theta) \]
What this means is that we will infer the rate \( \theta \) by considering the data, given by \( P(D|\theta) \), and the prior probability of \( \theta \), given by \( P(\theta) \).
The data distribution \( P(D|\theta) \) is referred to as the Likelihood function, in which case it is a function of \( \theta \) for the observed data, and is not a probability distribution.
Let's generate some more data, this time using a not so fair coin:
N <- 20
theta <- 0.75
k = rbinom(n = 1, size = N, prob = theta)
k
[1] 14
This time, have obtained 14 successes out of 20 tosses, but we know that the coin is unfair. Now let's try to infer the probability of the coin having a certain rate, given the observed data.
All we need to do for now is think about our prior distribution for the coin's rate; without any specific knowledge of the coin, we have no reason to assume that the coin has any particular bias. Normally, coins should be more or less fair, which means we have a prior that \( \theta = 0.5 \), but actually, \( \theta \) shouldn't be exaclty 0.5, because that would imply that the coin is manufactured with amazing precision. In other words, we expect even a fair coin to have range of possible rates.
Is seems sensible, for the moment, to assume that any value betewen \( 0 \) and \( 1 \) is possible, so we will use a uniform prior over the interval [0,1].
A uniform prior can be written as:
\( P(\theta) ~ Unif(0, 1) \)
but we can also write:
\( P(\theta) ~ Beta(1, 1) \)
where the two parameters of the Beta distribution, \( a \) and \( b \), can be understood as the prior successes and failures. Therefore, setting both \( a \) and \( b \) to \( 1 \) means that our prior says we have observed 2 coin tosses, with 1 success and 1 failure.
library(R2jags)
library(ggmcmc)
First, we wil specify our model in jags code:
model.jags <- "
model {
# prior distribution for theta
theta ~ dbeta(1, 1)
# likelihood function (observed data)
k ~ dbin(theta, n)
}
"
Next, we need to tell jags where to find the data, and jags will infer from this which variables in the model, are observed, and which are hidden:
data.jags <- list (n = N,
k = k)
inits.jags <- function(){
list(theta = runif(n = 1,
min = 0,
max = 1))
}
Now, we can tell jags which parameters we are interested in. Jags will then return the samples from the posterior distribution for those parameters.
In this case, we only have one parameter, \( \theta \):
parameters = c("theta")
samples = jags(data = data.jags,
inits = inits.jags,
parameters.to.save = parameters,
model.file = textConnection(model.jags),
n.chains = 1,
n.iter = 1000,
n.burnin = 100,
n.thin = 3,
DIC = T)
samples
Inference for Bugs model at "5", fit using jags,
1 chains, each with 1000 iterations (first 100 discarded), n.thin = 3
n.sims = 300 iterations saved
mu.vect sd.vect 2.5% 25% 50% 75% 97.5%
theta 0.687 0.099 0.482 0.619 0.701 0.755 0.855
deviance 4.270 1.445 3.305 3.366 3.742 4.576 7.435
DIC info (using the rule, pD = var(deviance)/2)
pD = 1.0 and DIC = 5.3
DIC is an estimate of expected predictive error (lower deviance is better).