Binomial with unknown probability and sample size: some of the
difficulties with setting prior distributions in multiparameter models
can be illustrated with the simple binomial distribution. Consider data
\(y_1,...,y_n\) modeled as independent
\(Bin(N, θ)\), with both \(N\) and \(θ\) unknown. Defining a convenient family
of prior distributions on \((N, θ)\) is
difficult, partly because of the discreteness of \(N\).
Raftery (1988) considers a hierarchical approach based on assigning the
parameter \(N\) a Poisson distribution
with unknown mean \(µ\). To define a
prior distribution on \((θ, N)\),
Raftery defines \(λ = µθ\) and
specifies a prior distribution on \((λ,
θ)\). The prior distribution is specified in terms of \(λ\) rather than \(µ\) because ‘it would seem easier to
formulate prior information about \(λ\), the unconditional expectation of the
observations, than about \(µ\), the
mean of the unobserved quantity \(N\).’
a. A suggested noninformative prior distribution is \(p(λ, θ) ∝ \frac{1}{\lambda}\). What is a
motivation for this noninformative distribution? Is the distribution
improper? Transform to determine \(p(N,
θ)\).
Solution
A motivation for the prior distribution \(p(\lambda,\theta) \propto \frac{1}{\lambda}\) can be described as follows. First, following Raftery's reasoning, prior information is more easily formulated about \(\lambda\), the unconditional expectation of the observations, than about \(\mu\), so it is more natural to assume that \(\lambda\) and \(\theta\) are independent a priori (rather than \(\mu\) and \(\theta\)). As a result, we can write \(p(\lambda,\theta) = p(\lambda)p(\theta)\).
To express our lack of initial knowledge, we take \(\theta \sim U[0,1]\), since \(\theta\) is the probability of success in each of the \(N\) Bernoulli trials, so \(\theta \in [0,1]\). We express our ignorance about \(\lambda\) a little differently: since \(\lambda = \mu\theta\), we can assume \(\lambda > 0\) (the case \(\theta = 0\) being trivial) and place a uniform prior on its logarithm, i.e. \(p(\log\lambda) \propto 1\). If \(\phi = \log\lambda\), then \(p(\lambda) = p(\phi)\left|\frac{d\phi}{d\lambda}\right| \propto 1\cdot\frac{1}{\lambda}\).
We deduce that \(p(\lambda,\theta) = p(\lambda)p(\theta) \propto \frac{1}{\lambda}\). This prior is improper, since \(\int_{0}^{1}\int_{0}^{+\infty}\frac{1}{\lambda}\,d\lambda\,d\theta = \ln\lambda\,\big|_0^{+\infty}\) diverges. Now, let's transform to determine \(p(N,\theta)\):
Since \(N|\lambda,\theta \sim Pois(\mu)\) and \(\mu = \frac{\lambda}{\theta}\), we have that \(p(N,\lambda,\theta) = p(N|\lambda,\theta)p(\lambda,\theta) \propto \frac{(\lambda/\theta)^N e^{-\lambda/\theta}}{N!}\frac{1}{\lambda}\),
so \(\begin{eqnarray*} p(N,\theta) &=&
\int_0^{+\infty}p(N,\lambda,\theta)\,d\lambda\\ &\propto&
\frac{\Gamma(N)}{N!}\int_0^{+\infty}\frac{1}{\Gamma(N)\theta^N}\lambda^{N-1}e^{-\lambda/\theta}\,d\lambda\\
&\propto& \frac{1}{N}\end{eqnarray*}\)
because \(f(\lambda) = \frac{1}{\Gamma(N)\theta^N}\lambda^{N-1}e^{-\lambda/\theta}\) is
the p.d.f. of the \(\mathrm{Gamma}(N,\theta)\) distribution (shape \(N\), scale \(\theta\)), which integrates to 1.
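As a quick numerical sanity check, we can confirm in R that the integral over \(\lambda\) above equals \(1/N\) for any \(\theta\), so multiplying it by \(N\) should return values very close to 1.
# unnormalized joint p(N, lambda, theta): Poisson(lambda/theta) term times the 1/lambda prior
joint <- function(lambda, N, theta) {
  (lambda/theta)^N * exp(-lambda/theta) / (factorial(N) * lambda)
}
# N times the integral over lambda should be about 1 for every (N, theta) pair
round(sapply(c(0.2, 0.7), function(theta)
  sapply(c(3, 8, 15), function(N) N * integrate(joint, 0, Inf, N = N, theta = theta)$value)), 6)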
b. The Bayesian method is illustrated on counts of waterbuck obtained by remote photography on five separate days in Kruger Park in South Africa. The counts were 53, 57, 66, 67, and 72. Perform the Bayesian analysis on these data and display a scatterplot of posterior simulations of (N, θ). What is the posterior probability that N > 100?
Solution
First, let's find the joint posterior distribution. Let \(y_i,\ 1 \leq i \leq 5\), denote the counts of
waterbuck on each of the five separate days, and let \(y = (y_1,y_2,y_3,y_4,y_5)\):
\(\begin{eqnarray*} p(N,\theta|y)
&\propto& p(N,\theta)p(y|N,\theta)\\ &\propto&
\frac{1}{N}
\prod\limits_{i=1}^5\binom{N}{y_i}\theta^{y_i}(1-\theta)^{N-y_i}\\
&\propto&
\frac{1}{N}\theta^{\sum\limits_{i=1}^5y_i}(1-\theta)^{5N-\sum\limits_{i=1}^5y_i}\prod\limits_{i=1}^5\binom{N}{y_i}\end{eqnarray*}\)
for \(\theta \in [0,1]\) and \(N \geq y_{\max} = 72\). Now, we can find the
unnormalized marginal distribution of \(N\) as follows: \(\begin{eqnarray*} p(N|y) &=&
\int_0^1p(N,\theta|y)d\theta\\ &\propto&
\frac{1}{N}\prod\limits_{i=1}^5\binom{N}{y_i}\int_0^1\theta^{\sum\limits_{i=1}^5y_i}(1-\theta)^{5N-\sum\limits_{i=1}^5y_i}d\theta\\
&\propto&
\frac{1}{N}\prod\limits_{i=1}^5\binom{N}{y_i}\mathrm{B}(\sum\limits_{i=1}^5y_i+1,5N-\sum\limits_{i=1}^5y_i+1)\\
&\propto&
\frac{1}{N}\prod\limits_{i=1}^5\binom{N}{y_i}\frac{\Gamma(5N-\sum\limits_{i=1}^5y_i+1)}{\Gamma(5N+2)}\\
&\propto&
\frac{(5N-\sum\limits_{i=1}^5y_i)!}{(5N+1)!}\frac{1}{N}\prod\limits_{i=1}^5\binom{N}{y_i}\\
&\propto& \frac{1}{\prod\limits_{k = 5N -
\sum\limits_{i=1}^5y_i+1}^{5N+1}k}\frac{1}{N}\prod\limits_{i=1}^5\binom{N}{y_i}
\end{eqnarray*}\)
for \(N \geq 72\). In order to simulate
from the joint posterior, we will also need the conditional posterior of
\(\theta\), which can be obtained
through the joint posterior with \(N\)
held constant: \(p(\theta|N,y) \propto
\theta^{\sum\limits_{i=1}^5y_i}(1-\theta)^{5N-\sum\limits_{i=1}^5y_i}\),
so we deduce that \(\theta|N,y \sim
Beta(\sum\limits_{i=1}^5y_i+1,5N-\sum\limits_{i=1}^5y_i+1)\) for
\(\theta \in [0,1]\).
In order to simulate a draw \((N,\theta)\) from the joint posterior
distribution, we will follow the steps below:
1. Simulate \(N\) from the marginal posterior \(p(N|y)\)
2. Simulate \(\theta\) from the conditional posterior \(p(\theta|N,y)\)
Since we have found the distribution of \(\theta|N,y\), there is only one problem: the marginal distribution of \(N\) does not seem to match any well-known distribution. However, we can simulate from it using the inverse c.d.f. method. For this method to work, we will need the normalized marginal of \(N\), which we obtain by dividing the unnormalized function by the normalizing constant, that is, the sum of its values over every possible \(N\). Since the marginal probability decays as \(N\) increases, here we will truncate the support and let \(N\) range from 1 to 1000 (we check this choice with a plot of the marginal below). Additionally, in order to avoid underflow and overflow, we will compute the logarithm of the unnormalized distribution and subtract off the maximum value before exponentiating and normalizing.
y <- c(53, 57, 66, 67, 72)
# log of the unnormalized marginal posterior p(N|y); the support is N >= max(y) = 72
N_marginal <- function(N) {
  if (N < 72) {
    return(-Inf)
  }
  # log of N * (5N+1)! / (5N - sum(y))!, the denominator of p(N|y)
  den <- log(N)
  for (k in (5*N - sum(y) + 1):(5*N + 1)) {
    den <- den + log(k)
  }
  # sum of the log binomial coefficients
  num <- sum(lchoose(N, y))
  return(num - den)
}
# evaluate the log marginal on the truncated support N = 1,...,1000, then
# exponentiate and normalize, subtracting the maximum for numerical stability
values <- sapply(1:1000, N_marginal)
values <- values - max(values)
norm_const <- sum(exp(values))
normalized <- exp(values)/norm_const
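Equivalently, the same log-scale computation can be written more compactly with base R's lchoose and lbeta, following the Beta-function form of \(p(N|y)\) derived above; the two versions differ only by an additive constant, which cancels after normalization.
log_marginal <- function(N, y) {
  # log p(N|y) up to an additive constant, valid for N >= max(y)
  if (N < max(y)) return(-Inf)
  -log(N) + sum(lchoose(N, y)) + lbeta(sum(y) + 1, 5*N - sum(y) + 1)
}
values2 <- sapply(1:1000, log_marginal, y = y)
normalized2 <- exp(values2 - max(values2))
normalized2 <- normalized2 / sum(normalized2)
max(abs(normalized2 - normalized))  # should be numerically negligible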
Let’s take a look at our normalized marginal function (and verify that we can indeed take \(N=1,...,1000\)).
plot(normalized, col = "skyblue", type = "l",lwd = 2, xlab = "N", ylab = "p(N|y)")
Now, we are ready to simulate from the posterior. We will display a scatterplot of 1000 posterior simulations of \((N,\theta)\):
# Inverse c.d.f. method: draw u ~ Uniform(0,1) and return the smallest k
# such that p[1] + ... + p[k] >= u
inv_method <- function(p) {
  u <- runif(1)
  cdf <- cumsum(p)
  min(which(cdf >= u), length(p))
}
set.seed(123)
iter <- 1000
X <- array(dim=c(iter, 2))
for (i in 1:iter) {
X[i,1] <- inv_method(normalized)
X[i,2] <- rbeta(1,sum(y)+1,5*X[i,1]-sum(y)+1)
}
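Note that, since normalized is just a finite probability vector, the draws of \(N\) could equivalently be produced in a single call to base R's sample(); the inverse-c.d.f. function above makes the mechanism explicit, while the version below gives the same distribution with a different random stream.
# equivalent vectorized simulation: draw all N's at once, then theta given each N
N_draws <- sample(1:1000, size = iter, replace = TRUE, prob = normalized)
theta_draws <- rbeta(iter, sum(y) + 1, 5*N_draws - sum(y) + 1)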
plot(X, xlab = "N", ylab = "θ", col = "purple", main = "Scatterplot of 1000 posterior simulations of (N,θ)")
Now, in order to compute the posterior probability that \(N>100\), we will use our normalized function: we sum its values for \(N>100\), that is, for \(N=101,...,1000\).
p <- sum(normalized[101:1000])
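The same probability can also be estimated directly from the 1000 posterior draws generated above; the two numbers should agree up to Monte Carlo error.
p                  # sum over the truncated support N = 101,...,1000
mean(X[,1] > 100)  # Monte Carlo estimate from the posterior simulations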
c. Why not simply use a Poisson with fixed \(µ\) as a prior distribution for \(N\)?
Solution
As no prior information is available, we would not know which \(\mu\) to pick beforehand, which creates the risk of an unreasonable choice. For example, we did not know a priori that the marginal posterior of \(N\) is approximately zero beyond \(N = 1000\), so choosing \(\mu = 1000\) would amount to a misspecification of the prior. Such a misspecification can eventually be counterbalanced by the influence of the data, but in this problem only 5 observations are available. Thus, the safest option is to use a noninformative prior.
Following this train of thought, let's examine our claim that using a Poisson with fixed \(\mu = 1000\) as a prior distribution for \(N\) could negatively affect our analysis, especially given that we only have 5 data points.
Say that we have not yet obtained any information about the marginal posterior distribution of \(N\), and we want to answer question (b). We can do that with the help of a Metropolis-Hastings algorithm as follows: since we know the conditional posterior distribution of \(\theta\), which is \(Beta(\sum\limits_{i=1}^5y_i+1,5N-\sum\limits_{i=1}^5y_i+1)\), we can implement a hybrid MH algorithm with a Gibbs-type step for \(\theta\). For \(N\), we can use its prior \(Pois(1000)\) as the proposal. As a result, our acceptance probability will be \(ap(x,y) = \min\{\frac{f(y)q(x[1]|y[1])p(x[2])}{f(x)q(y[1]|x[1])p(x[2])},1\}\), where \(x=(x[1],x[2]) =(N_{old},\theta_{old})\) is the current value, \(y=(y[1],y[2]) =(N_{new},\theta_{new})\) is the proposed value, \(f\) is the posterior density of \((N,\theta)\) that we obtained previously, \(q\) is the p.m.f. of \(Pois(1000)\) and \(p\) is the p.d.f. of \(Beta(\sum\limits_{i=1}^5y_i+1,5x[1]-\sum\limits_{i=1}^5y_i+1)\).
The algorithm works as follows, at a generic iteration \(t\):
For a given \(x_{t} = (N_{t},
\theta_{t})\),
1. We simulate \(y_t = (N,\theta)\),
where \(N \sim Pois(1000)\), \(\theta \sim
Beta(\sum\limits_{i=1}^5y_i+1,5N_t-\sum\limits_{i=1}^5y_i+1)\)
2. We choose \(\begin{equation*} x_{t+1} =
\left\{ \begin{array}{ccl} y_t & \text{with prob.} &
ap\left(x_t, y_t\right) \\ x_t & \text{with prob.} &
1-ap\left(x_t, y_t\right) \end{array}\right.\end{equation*}\)
y <- c(53, 57, 66, 67, 72)
# unnormalized posterior p(N, theta | y) under the p(N, theta) ~ 1/N prior; par = c(N, theta)
posterior <- function(par) {
  1/par[1] * prod(dbinom(y, par[1], par[2]))
}
library(coda)
metrop <- function(init, iter = 10000, burnin = 5001) {
  X <- array(dim = c(iter + 1, 2))
  X[1,] <- init
  for (i in 1:iter) {
    # propose N ~ Pois(1000) (independently of the current state) and
    # theta from its conditional Beta given the current N
    Y <- c(rpois(1, 1000), rbeta(1, sum(y) + 1, 5*X[i,1] - sum(y) + 1))
    # acceptance ratio formed from the unnormalized posterior and the Pois(1000) density
    ap <- (posterior(Y)*dpois(Y[1], 1000))/(posterior(X[i,])*dpois(X[i,1], 1000))
    # move to the proposal with probability min(ap, 1), otherwise stay put
    X[i+1,] <- X[i,] + (Y - X[i,])*(runif(1) < ap)
  }
  X <- X[-(1:burnin),]
  X <- data.frame(N = X[,1], theta = X[,2])
  return(X)
}
set.seed(789)
chain <- metrop(c(500,0.5))
plot(chain)
plot(mcmc(chain))
par(mfrow=c(1,2))
acf(chain[,1])
acf(chain[,2])
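In addition to the plots, we can look at the effective sample size reported by coda, which summarises roughly how many independent draws the retained iterations are worth:
effectiveSize(mcmc(chain))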
We observe that our trace and density plots indicate good mixing, while the acf plots for both \(\theta\) and \(N\) look acceptable and drop to zero quite quickly. However, the scatterplot of our simulations from the posterior distribution looks completely different from the one we obtained in (b) and, consequently, displays a contrasting (and misleading) relationship between the parameters \(N\) and \(\theta\). It is notable that, in this case, the MH algorithm has no way of alerting us to our error, as everything appears to be in order. As a result, had we not performed the previous analysis of \(N\), we would have accepted this as our final solution, unaware of its faults. That is why it is often a good idea to test our MH implementation in several ways to be sure of our results. Let's pretend we still don't possess the previous knowledge about \(N\), but this time we make a relatively unbiased choice of proposal for \(N\), and observe the results. We take the proposal for \(N\) at each iteration to be a Poisson distribution whose mean is the current value of \(N\), so that \(N_{new} \sim Pois(N_{old})\), while everything else remains the same. The algorithm works as follows, at a generic iteration \(t\):
For a given \(x_{t} = (N_{t},
\theta_{t})\),
1. We simulate \(y_t = (N,\theta)\),
where \(N \sim Pois(N_t)\), \(\theta \sim
Beta(\sum\limits_{i=1}^5y_i+1,5N_t-\sum\limits_{i=1}^5y_i+1)\)
2. We choose \(\begin{equation*} x_{t+1} =
\left\{ \begin{array}{ccl} y_t & \text{with prob.} &
ap\left(x_t, y_t\right) \\ x_t & \text{with prob.} &
1-ap\left(x_t, y_t\right) \end{array}\right.\end{equation*}\)
metrop2 <- function(init, iter = 10000, burnin = 5001) {
  X <- array(dim = c(iter + 1, 2))
  X[1,] <- init
  for (i in 1:iter) {
    # propose N ~ Pois(N_old) and theta from its conditional Beta given the current N
    Y <- c(rpois(1, X[i,1]), rbeta(1, sum(y) + 1, 5*X[i,1] - sum(y) + 1))
    # acceptance ratio formed from the unnormalized posterior and the Poisson proposal densities
    ap <- (posterior(Y)*dpois(Y[1], X[i,1]))/(posterior(X[i,])*dpois(X[i,1], Y[1]))
    X[i+1,] <- X[i,] + (Y - X[i,])*(runif(1) < ap)
  }
  X <- X[-(1:burnin),]
  X <- data.frame(N = X[,1], theta = X[,2])
  return(X)
}
chain2 <- metrop2(c(100,0.5))
plot(chain2)
plot(mcmc(chain2))
par(mfrow=c(1,2))
acf(chain2[,1])
acf(chain2[,2])
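For comparison, we would expect a much smaller effective sample size for this second chain, reflecting the slow acf decay:
effectiveSize(mcmc(chain2))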
Now we observe that, for the same number of iterations and the same burn-in period, the trace and density plots are not in good shape. Similarly, the acf plots show very slow geometric decay, a behaviour we would normally accept only for a first-order autoregressive process. However, the scatterplot we obtained is much closer to the one in (b) than before.
We conclude that initial assumptions about our model and parameters should be made very carefully. On top of that, even if our analysis seems robust and correct, it is always a good idea to verify, test and experiment in order to be sure that what we obtained is indeed correct, since the tools available for assessing the convergence of an MH algorithm can only tell us with certainty when the algorithm has not converged, and not the other way around. To be more confident in our inference, we can run the chain several times with different starting values and check that the output from the various chains is very similar; a sketch of this check is given below. Additionally, it is always best to run long chains with long burn-in periods. The MH algorithm is clearly a powerful tool, but also one that should be used carefully and with continuous assessment of our analysis and results.
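As a sketch of that multiple-chains check (the starting values below are arbitrary choices for illustration), we can run metrop2 from several dispersed initial points and compare the chains with the Gelman-Rubin diagnostic from coda; potential scale reduction factors close to 1 indicate that the chains agree.
set.seed(321)
starts <- list(c(100, 0.5), c(300, 0.2), c(800, 0.1))
# run one chain per starting value and collect them into an mcmc.list
chains <- mcmc.list(lapply(starts, function(s) mcmc(as.matrix(metrop2(s)))))
# Gelman-Rubin convergence diagnostic across the chains
gelman.diag(chains)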