Posteriors, Priors and Possibilities: Monte Carlo methods for Health Researchers

Author
Affiliation

Bongani Ncube

University of the Witwatersrand (School of Public Health)

Published

May 11, 2025

Keywords

Posterior, Prior, Estimation, Bayesian Statistics, Biostatistics

Introduction

In my previous blog post I gave an introduction to Bayes' theorem and some quite useful expressions. Here is a recap:

What is the Bayesian approach?

  • The approach is based upon the idea that the experimenter begins with some prior beliefs about the system.

  • And then updates these beliefs on the basis of observed data.

  • This updating procedure is based upon Bayes’ theorem:

\[\Pr(A \mid B) = \frac{\Pr(B \mid A) \; \Pr(A)}{\Pr(B)}\]

  • Schematically, if \(A = \theta\) and \(B = \text{data}\), then Bayes’ theorem

\[\Pr(A \mid B) = \frac{\Pr(B \mid A) \; \Pr(A)}{\Pr(B)}\]

  • translates into:

\[\Pr(\theta \mid \text{data}) = \frac{\Pr(\text{data} \mid \theta) \; \Pr(\theta)}{\Pr(\text{data})}\]

\[{\color{red}{\Pr(\theta \mid \text{data})}} = \frac{\color{blue}{\Pr(\text{data} \mid \theta)} \; \color{green}{\Pr(\theta)}}{\color{orange}{\Pr(\text{data})}}\]

  • \(\color{red}{Posterior ~distribution}\): Represents what you know after having seen the data. It is the basis for inference: a distribution, possibly multivariate if there is more than one parameter (\(\theta\)).

  • \(\color{blue}{Likelihood}\): We know this one from before; it is the same as in the MLE approach. It is a function of \(\theta\) that shows how “likely” various parameter values \(\theta\) are to have produced the observed data \(Y\).

Note

The Likelihood Principle of Birnbaum states that (given the data) all of the evidence about \(\theta\) is contained in the likelihood function.

  • \(\color{green}{Prior~ distribution}\): Represents what you know before seeing the data. The source of much discussion about the Bayesian approach.

  • \(\color{orange}{\Pr(\text{data}) = \int L(\text{data} \mid \theta) \;\Pr(\theta) d\theta }\): A possibly high-dimensional integral, difficult if not impossible to calculate. This is one of the reasons why we need simulation (MCMC) methods - more soon.

\[ \color{dodgerblue}{\boxed{\underbrace{P(\theta \mid y)}_{\text{Posterior}} = \frac{\overbrace{P(y \mid \theta)}^{\text{Likelihood}} \cdot \overbrace{P(\theta)}^{\text{Prior}}}{\underbrace{P(y)}_{\text{Normalizing Constant}}}}} \]

Since \(P(y)\) carries no information about \(\theta\), for conciseness we may drop it and write:

\[ \color{dodgerblue}{\boxed{\underbrace{P(\theta \mid y)}_{\text{Posterior}} \propto \overbrace{P(y \mid \theta)}^{\text{Likelihood}} \cdot \overbrace{P(\theta)}^{\text{Prior}}}} \]

The Bayesian approach: Examples (in Stata)

  • We’ve collected data from 70 people in Canberra about the number of parking tickets they’ve gotten in the last year, and we want some idea of the rate at which people get tickets, motivated by the rumour that Canberra is particularly finicky about parking. We will simulate data as though the true number of parking fines per year per person is 1.3.
set seed 1800
set obs 70 
gen y = rpoisson(1.3)

sum y
## Number of observations (_N) was 0, now 70.
## 
## 
## 
##     Variable |        Obs        Mean    Std. dev.       Min        Max
## -------------+---------------------------------------------------------
##            y |         70    1.257143    1.099219          0          3

Defining the prior

  • Suppose we are truly interested in whether the rate of fines is over one per year per person. To start, we need to specify a prior distribution.
set seed 1800
set obs 70 
gen y = rpoisson(1.3)

bayesmh y, likelihood(poisson, noglmtransform) prior({y:_cons}, flat) rseed(7434)

db postest

bayesgraph histogram _all
graph export post.png , replace


bayesgraph kdensity _all
graph export post1.png , replace

bayesgraph diagnostics _all
graph export post2.png , replace

## Number of observations (_N) was 0, now 70.
## 
## 
##   
## Burn-in ...
## Simulation ...
## 
## Model summary
## ------------------------------------------------------------------------------
## Likelihood: 
##   y ~ poisson({y:_cons})
## 
## Prior: 
##   {y:_cons} ~ 1 (flat)
## ------------------------------------------------------------------------------
## 
## Bayesian Poisson regression                      MCMC iterations  =     12,500
## Random-walk Metropolis–Hastings sampling         Burn-in          =      2,500
##                                                  MCMC sample size =     10,000
##                                                  Number of obs    =         70
##                                                  Acceptance rate  =      .4271
## Log marginal-likelihood = -102.22367             Efficiency       =      .2315
##  
## ------------------------------------------------------------------------------
##              |                                                Equal-tailed
##            y |      Mean   Std. dev.     MCSE     Median  [95% cred. interval]
## -------------+----------------------------------------------------------------
##        _cons |  1.274535   .1358713   .002824   1.274437   1.020925   1.548038
## ------------------------------------------------------------------------------
## 
## 
## 
## file post.png saved as PNG format
## 
## 
## file post1.png saved as PNG format
## 
## 
## file post2.png saved as PNG format

Playing with different priors

Suppose we talk to people in different cities and they all agree that the rate of fines should be about 1 every 3 years, with little chance of averaging more than 1 fine per year. Based on this information, a good prior would be Gamma(3, 0.1) (shape 3, scale 0.1, giving a prior mean of 0.3).

In Stata

twoway function y=gammaden(3,0.1,0,x), range(0 1.5)

graph export post4.png , replace
## file post4.png saved as PNG format

In R

library(bayesrules)   # provides plot_gamma()
# plot_gamma(shape, rate) uses the rate parameterization, so the
# Gamma with shape 3 and scale 0.1 corresponds to rate 1/0.1 = 10:
plot_gamma(3, 10)

Specifying the new prior in Stata

set seed 1800
set obs 70 
gen y = rpoisson(1.3)
bayesmh y, likelihood(poisson, noglmtransform) prior({y:_cons}, gamma(3,0.1)) rseed(9983)
## Number of observations (_N) was 0, now 70.
## 
## 
##   
## Burn-in ...
## Simulation ...
## 
## Model summary
## ------------------------------------------------------------------------------
## Likelihood: 
##   y ~ poisson({y:_cons})
## 
## Prior: 
##   {y:_cons} ~ gamma(3,0.1)
## ------------------------------------------------------------------------------
## 
## Bayesian Poisson regression                      MCMC iterations  =     12,500
## Random-walk Metropolis–Hastings sampling         Burn-in          =      2,500
##                                                  MCMC sample size =     10,000
##                                                  Number of obs    =         70
##                                                  Acceptance rate  =      .4313
## Log marginal-likelihood = -107.68681             Efficiency       =      .2345
##  
## ------------------------------------------------------------------------------
##              |                                                Equal-tailed
##            y |      Mean   Std. dev.     MCSE     Median  [95% cred. interval]
## -------------+----------------------------------------------------------------
##        _cons |  1.136429   .1181007   .002439   1.130957   .9216155   1.380885
## ------------------------------------------------------------------------------
Note
  • Stata churns through the MCMC computations to find the posterior distribution

The Bayesian approach Example (Beta-Binomial)

  • Let Y be the number of games (out of 6) that Bongani would win in the tournament against Vusi.

  • We model Y as binomial with parameters \(n = 6\) and success probability \(\theta \in [0,1]\).

  • Since the parameter \(\theta\) is restricted to be between 0 and 1, we should choose a prior distribution with support on \([0,1]\).

  • Let \(f(\theta)\) denote the prior probability density function (pdf) for \(\theta\).

  • Note \(f(\theta)\) has the usual properties of a pdf: It is non-negative everywhere, and it integrates to 1 over its support (which is \([0, 1]\) in this example).

  • Now, the one thing we know about \(\theta\) is that it is a continuous random variable and that it lies between zero and one.

  • Thus, a suitable prior distribution might be the Beta defined on \([0,1]\).

  • What is the Beta distribution?

What is the Beta distribution?

\[ q(\theta \mid \alpha, \beta) = \frac{1}{\text{Beta}(\alpha, \beta)}{\theta^{\alpha-1}} {(1-\theta)^{\beta-1}} \]

with \(\text{Beta}(\alpha, \beta) = \displaystyle{\frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}}\) and \(\Gamma(n) = (n-1)!\) for positive integer \(n\)

  • In a real problem, we need to specify the values of our hyperparameters \(\alpha\) and \(\beta\) of our prior.
  • Ideally our choices of \(\alpha\) and \(\beta\) should reflect our prior beliefs about \(\theta\).
  • If we have no prior idea what \(\theta\) is, we could set \(\alpha = \beta = 1\), which corresponds to a \(Uniform(0,1)\) prior for \(\theta\): completely flat, so that all values of \(\theta\) are equally likely a priori.
  • The plot_beta(a,b) function from the bayesrules package can assist in plotting beta distributions for different \(\alpha\) and \(\beta\)

Plots of different beta distributions:

library(bayesrules)   # provides plot_beta()
library(ggplot2)

b1.5 <- plot_beta(1,5)+ggtitle("Beta(1,5)")
b1.2 <- plot_beta(1,2)+ggtitle("Beta(1,2)")
b3.7 <- plot_beta(3,7)+ggtitle("Beta(3,7)")
b1.1 <- plot_beta(1,1)+ggtitle("Beta(1,1)")
b5.5 <- plot_beta(5,5)+ggtitle("Beta(5,5)")
b20.20 <- plot_beta(20,20)+ggtitle("Beta(20,20)")
b7.3 <- plot_beta(7,3)+ggtitle("Beta(7,3)")
b2.1 <- plot_beta(2,1)+ggtitle("Beta(2,1)")
b5.1 <- plot_beta(5,1)+ggtitle("Beta(5,1)")

# Arrange in a 3-column grid; grid.arrange() from gridExtra is an
# alternative to the Cookbook for R multiplot() helper:
gridExtra::grid.arrange(b1.5, b1.2, b3.7, b1.1, b5.5, b20.20, b7.3, b2.1, b5.1, ncol = 3)

Beta-Binomial Conjugate Prior

  • We assume a priori that (let \(\theta =p\))

\[Prior: ~p\sim Beta(\alpha,\beta)\] so that \(\Pr(p) \propto p^{\alpha-1} (1 - p)^{\beta-1}\)

also :

\[Likelihood~:X|p \sim Binomial(n,p)\]

  • Then we have:

\[L(p) ={\color{blue}{\binom{n}{k}p^k(1-p)^{n-k}}}\]

\[P(p) = {\color{green}{\frac{p^{\alpha-1} (1 - p)^{\beta-1}}{Beta(\alpha,\beta)}}}\]

\[ \begin{aligned} {\color{red}{Pr(p \mid k)}} & \propto {\color{blue}{\binom{n}{k}p^k(1-p)^{n-k}}} \; {\color{green}{p^{\alpha-1} (1 - p)^{\beta-1}}}\\ & \propto {p^{(\alpha+k)-1}} {(1-p)^{(\beta+n-k)-1}} \end{aligned} \]

  • That is:

\[ p \mid k \sim Beta(\alpha+k,\beta+n-k)\]

  • If you take a Beta prior with a Binomial likelihood, you get a Beta posterior (conjugacy)

  • The prior tells us information about the value of \(\theta\), based on our prior knowledge.

  • Candidate example: We believe a priori that the value of \(\theta\) is near 0.45.

  • The likelihood tells us information about the value of \(\theta\), based on information in our data.

  • Candidate example: We believe based on the data that the value of \(\theta\) is near 0.6.

  • The posterior distribution balances the information in the prior and the data.

  • The posterior uses the data information to update the prior information.

  • See the R plots to visually assess the position of the posterior relative to the prior and the likelihood.

# Comparing the prior and the posterior in the political candidate example:

plot_beta_binomial(alpha = 45, beta = 55, y = 30, n = 50)


# prior and the posterior summary statistics in the political candidate example:

summarize_beta_binomial(alpha = 45, beta = 55, y = 30, n = 50)
model      alpha  beta  mean  mode       var        sd
prior         45    55  0.45  0.4489796  0.0024505  0.0495025
posterior     75    75  0.50  0.5000000  0.0016556  0.0406894


# Credit Card Debt Example

## Researchers studied whether doctors completing residency take on high credit card debt.

# Let pi = population proportion of all medical residents with over $3000 in credit card debt

# Prior belief: roughly 20% do

########### Analysis with beta(2,8) prior:

# Plot the Beta(2, 8) prior
plot_beta(2,8,mean=T,mode=T)


# Sample of 115 residents revealed 22 had credit card debt over $3000.

# Comparing the prior and the posterior in the credit card debt example:

plot_beta_binomial(alpha = 2, beta = 8, y = 22, n = 115)


# prior and the posterior summary statistics in the credit card debt example:

summarize_beta_binomial(alpha = 2, beta = 8, y = 22, n = 115)
model      alpha  beta  mean   mode       var        sd
prior          2     8  0.200  0.1250000  0.0145455  0.1206045
posterior     24   101  0.192  0.1869919  0.0012312  0.0350890


########### Analysis with beta(1,1) (i.e., Uniform) prior:


# Sample of 115 residents revealed 22 had credit card debt over $3000.

# Comparing the prior and the posterior in the credit card debt example:

plot_beta_binomial(alpha = 1, beta = 1, y = 22, n = 115)


# prior and the posterior summary statistics in the credit card debt example:

summarize_beta_binomial(alpha = 1, beta = 1, y = 22, n = 115)
model      alpha  beta  mean       mode       var        sd
prior          1     1  0.5000000  NaN        0.0833333  0.2886751
posterior     23    94  0.1965812  0.1913043  0.0013384  0.0365848
Note
  • With Bayesian estimation, as the sample size increases, the likelihood dominates the prior.
Elicitation

Often, one has a belief about the distribution of one’s data. You may think that your data come from a binomial distribution; in that case you typically know \(n\), the number of trials, but you usually do not know \(p\), the probability of success. Or you may think that your data come from a normal distribution, but you do not know the mean \(\mu\) or the standard deviation \(\sigma\) of the normal. Besides the distribution of one’s data, you may also have beliefs about the unknown \(p\) in the binomial or the unknown mean \(\mu\) in the normal.

Bayesians express their belief in terms of personal probabilities. These personal probabilities encapsulate everything a Bayesian knows or believes about the problem. But these beliefs must obey the laws of probability, and be consistent with everything else the Bayesian knows.

So a Bayesian will seek to express their belief about the value of \(p\) through a probability distribution, and a very flexible family of distributions for this purpose is the beta family. A member of the beta family is specified by two parameters, \(\alpha\) and \(\beta\); we denote this as \(p \sim \text{beta}(\alpha, \beta)\). The probability density function is

\[\begin{equation} f(p) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} p^{\alpha-1} (1-p)^{\beta-1} \end{equation}\] where \(0 \leq p \leq 1\), \(\alpha>0\), \(\beta>0\), and \(\Gamma\) is the gamma function, which for a positive integer \(n\) reduces to a factorial:

\[\Gamma(n) = (n-1)! = (n-1) \times (n-2) \times \cdots \times 1\] Here is a summary of the key ideas:

  1. Bayesians express their uncertainty through probability distributions.

  2. One can think about the situation and self-elicit a probability distribution that approximately reflects one’s personal probability.

  3. One’s personal probability should change according to Bayes’ rule, as new data are observed.

  4. The beta family of distributions can describe a wide range of prior beliefs.

Conjugacy

Next, let’s introduce the concept of conjugacy in Bayesian statistics.

Suppose we have the prior beliefs about the data as below:

  • Binomial distribution \(\text{Bin}(n,p)\) with \(n\) known and \(p\) unknown

  • Prior belief about \(p\) is \(\text{beta}(\alpha,\beta)\)

Then we observe \(x\) successes in \(n\) trials, and it turns out that Bayes’ rule implies that our new belief about the probability density of \(p\) is also the beta distribution, but with different parameters. In mathematical terms,

\[\begin{equation} p|x \sim \text{beta}(\alpha+x, \beta+n-x). \end{equation}\]

This is an example of conjugacy. Conjugacy occurs when the posterior distribution is in the same family of probability density functions as the prior belief, but with new parameter values, which have been updated to reflect what we have learned from the data.

Why are the beta binomial families conjugate? Here is a mathematical explanation.

Recall the discrete form of the Bayes’ rule:

\[P(A_i|B) = \frac{P(B|A_i)P(A_i)}{\sum^n_{j=1}P(B|A_j)P(A_j)}\]

However, this formula does not apply to continuous random variables, such as the \(p\) which follows a beta distribution, because the denominator sums over all possible values of the random variable (finitely many in the formula above).

But the good news is that \(p\) has a bounded range: it can only take values between 0 and 1. Hence we can perform integration, which is a generalization of summation. Bayes’ rule can also be written in continuous form as:

\[\pi^*(p|x) = \frac{P(x|p)\pi(p)}{\int^1_0 P(x|p)\pi(p) dp}.\]

This is analogous to the discrete form, since the integral in the denominator will also be equal to some constant, just like a summation. This constant ensures that the total area under the curve, i.e. the posterior density function, equals 1.

Note that in the numerator, the first term, \(P(x|p)\), is the data likelihood – the probability of observing the data given a specific value of \(p\). The second term, \(\pi(p)\), is the probability density function that reflects the prior belief about \(p\).

In the beta-binomial case, we have \(P(x|p)=\text{Bin}(n,p)\) and \(\pi(p)=\text{beta}(\alpha,\beta)\).

Plugging in these distributions, we get

\[\begin{aligned} \pi^*(p|x) &= \frac{1}{\text{some number}} \times P(x|p)\pi(p) \\ &= \frac{1}{\text{some number}} \left[ \binom{n}{x} p^x (1-p)^{n-x}\right] \left[\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} p^{\alpha-1} (1-p)^{\beta-1}\right] \\ &= \frac{\Gamma(\alpha+\beta+n)}{\Gamma(\alpha+x)\Gamma(\beta+n-x)} \times p^{\alpha+x-1} (1-p)^{\beta+n-x-1} \end{aligned}\]

Let \(\alpha^* = \alpha + x\) and \(\beta^* = \beta+n-x\), and we get

\[\pi^*(p|x) = \text{beta}(\alpha^*,\beta^*) = \text{beta}(\alpha+x, \beta+n-x),\]

which is the same as the posterior formula stated earlier.

We can recognize the posterior distribution from the numerator \(p^{\alpha+x-1} (1-p)^{\beta+n-x-1}\). Everything else is just a constant, and it must take the unique value needed to ensure that the area under the curve between 0 and 1 equals 1. So it has to be the normalizing constant of the beta with parameters \(\alpha+x\) and \(\beta+n-x\).

This is a cute trick. We can find the answer without doing the integral simply by looking at the form of the numerator.

Without conjugacy, one has to do the integral. Often, the integral is impossible to evaluate. That obstacle is the primary reason that most statistical theory in the 20th century was not Bayesian. The situation did not change until modern computing allowed researchers to compute integrals numerically.

Three Conjugate Families

In this section we meet three conjugate families: the beta-binomial, gamma-Poisson, and normal-normal pairs. Each has its own applications in everyday life.

The Gamma-Poisson Conjugate Families

The Poisson distribution has a single parameter \(\lambda\), and it is denoted as \(X \sim \text{Pois}(\lambda)\) with \(\lambda>0\). The probability mass function is

\[P(X=k) = \frac{\lambda^k}{k!} e^{-\lambda} \text{ for } k=0,1,\cdots,\]

where \(k! = k \times (k-1) \times \cdots \times 1\). This gives the probability of observing a random variable equal to \(k\).

Note that \(\lambda\) is both the mean and the variance of the Poisson random variable. It is obvious that \(\lambda\) must be greater than zero, because it represents the mean number of counts, and the variance should be greater than zero (except for constants, which have zero variance).

The gamma family is flexible; the figure below illustrates a wide range of gamma shapes.

Gamma family

The probability density function for the gamma is indexed by shape \(k\) and scale \(\theta\), denoted as \(\text{Gamma}(k,\theta)\) with \(k,\theta > 0\). The mathematical form of the density is

\[\begin{equation} f(x) = \dfrac{1}{\Gamma(k)\theta^k} x^{k-1} e^{-x/\theta} \end{equation}\] where

\[\begin{equation} \Gamma(z) = \int^{\infty}_0 x^{z-1} e^{-x} dx. \end{equation}\] \(\Gamma(z)\), the gamma function, is simply a constant that ensures the density integrates to 1 over \((0, \infty)\), just like in the beta probability distribution case above. A special case is that \(\Gamma(n) = (n-1)!\) when \(n\) is a positive integer.

However, some books parameterize the gamma distribution in a slightly different way with shape \(\alpha = k\) and rate (inverse scale) \(\beta=1/\theta\):

\[f(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}\]

For this example, we use the \(k\)-\(\theta\) parameterization, but you should always check which parameterization is being used. For example, \(\mathsf{R}\) uses the \(\alpha\)-\(\beta\) (shape-rate) parameterization by default.
In later material we find that using the rate parameterization is more convenient.

For the gamma-Poisson conjugate family, suppose we observed data \(x_1, x_2, \cdots, x_n\) that follow a Poisson distribution. Then, similarly to the previous section, we would recognize the kernel of the gamma when using the gamma-Poisson family. The posterior \(\text{Gamma}(k^*, \theta^*)\) has parameters

\[k^* = k + \sum^n_{i=1} x_i \text{ and } \theta^* = \frac{\theta}{(n\theta+1)}.\]

The Normal-Normal Conjugate Families

There are other conjugate families, and one is the normal-normal pair. If your data come from a normal distribution with known variance \(\sigma^2\) but unknown mean \(\mu\), and if your prior on the mean \(\mu\) has a normal distribution with self-elicited mean \(\nu\) and self-elicited variance \(\tau^2\), then your posterior density for the mean, after seeing a sample of size \(n\) with sample mean \(\bar{x}\), is also normal. In mathematical notation, we have

\[\begin{aligned} x|\mu &\sim N(\mu,\sigma^2) \\ \mu &\sim N(\nu, \tau^2) \end{aligned}\]

As a practical matter, one often does not know \(\sigma^2\), the variance of the normal distribution from which the data come. But there are cases in which it is reasonable to treat \(\sigma^2\) as known.

For the normal-normal conjugate families, assume the prior on the unknown mean follows a normal distribution, i.e. \(\mu \sim N(\nu, \tau^2)\). We also assume that the data \(x_1,x_2,\cdots,x_n\) are independent and come from a normal with variance \(\sigma^2\).

Then the posterior distribution of \(\mu\) is also normal, with mean as a weighted average of the prior mean and the sample mean. We have

\[\mu|x_1,x_2,\cdots,x_n \sim N(\nu^*, \tau^{*2}),\]

where

\[\nu^* = \frac{\nu\sigma^2 + n\bar{x}\tau^2}{\sigma^2 + n\tau^2} \text{ and } \tau^* = \sqrt{\frac{\sigma^2\tau^2}{\sigma^2 + n\tau^2}}.\]

Numerical Example

Suppose we have a sample of size \(n=10\) with sample mean \(\overline{X}=5\), known variance \(\sigma^2=4\), prior mean \(\nu=3\) and prior variance \(\tau^2=2\). The posterior mean and variance are:

\[\nu^* = \frac{\nu\sigma^2 + n\bar{x}\tau^2}{\sigma^2 + n\tau^2} = \frac{3 \cdot 4 + 10 \cdot 5 \cdot 2}{4 + 10 \cdot 2} = \frac{112}{24} \approx 4.67, \qquad \tau^{*2} = \frac{\sigma^2\tau^2}{\sigma^2 + n\tau^2} = \frac{4 \cdot 2}{24} = \frac{1}{3} \approx 0.33.\]

Credible and Confidence Intervals

In this section we discuss credible intervals, the Bayesian alternative to confidence intervals. Confidence intervals are the frequentist way to express uncertainty about an estimate of a population mean, a population proportion, or some other parameter.

A confidence interval has the form of an upper and lower bound.

\[L, U = \text{pe} \pm \text{se} \times \text{cv}\]

where \(L\), \(U\) are the lower bound and upper bound of the confidence interval respectively, \(\text{pe}\) represents the “point estimate”, \(\text{se}\) is the standard error, and \(\text{cv}\) is the critical value.

Most importantly, the interpretation of a 95% confidence interval on the mean is that “95% of similarly constructed intervals will contain the true mean”, not “the probability that true mean lies between \(L\) and \(U\) is 0.95”.

The reason for this frequentist wording is that a frequentist may not express his uncertainty as a probability. The true mean is either within the interval or not, so the probability is zero or one. The problem is that the frequentist does not know which is the case.

On the other hand, Bayesians have no such qualms. It is fine for us to say that “the probability that the true mean is contained within a given interval is 0.95”. To distinguish our intervals from confidence intervals, we call them credible intervals.

Definition 1

A confidence interval of X% for a parameter is an interval \((a, b)\), generated by a repeated sampling procedure, that has probability X% of containing the true value of the parameter, for all possible values of the parameter.

Definition 2

Say you performed a statistical analysis to compare the efficacy of a public policy between two groups, and you obtain a difference between the means of these groups. You can express this difference as a confidence interval. Often we choose 95% confidence. This means that \(\textbf{95 studies out of 100}\) that use the \(\textbf{same sample size and target population}\), performing the \(\textbf{same statistical test}\), will expect to find a result of the mean difference between groups inside the confidence interval.

Generating Samples from Statistical Distributions

  • In general we will want to simulate from more general distributions. For example, suppose we wanted to generate representative heights of people within a population; we know that the normal/Gaussian distribution would be a good choice - how can we do this? We could come up with a whole new random number generator that spits out normally distributed samples, however this isn’t a good general approach to the problem - what if we instead required samples from a Beta distribution? In some instances the “specialist random number generator” approach can be useful where speed is of the utmost concern, but for at least 99% of the Monte Carlo methods we encounter this is not required.

Generalized Inverse Method

  • Instead what we tend to do is look for ways of “converting” the uniform variates (herein we will denote these as \((u_1, u_2, \ldots)\) with \(u_i \sim U[0,1]\) iid) into the distribution we desire.
  • Suppose we want to simulate a variable \(X\) with some distribution. We denote its cumulative distribution function (CDF) as: \[F(x) = \mathbb{P}( X \leq x) \]

We denote the generalized inverse of \(F\) as: \[ F^{-1}(u) = \inf \{ x : F(x) \geq u \} \]

  • In the case of \(X\) being a continuous variable, the generalized inverse function is identical to our usual notion of an “inverse function”.
  • We want to define the generalized inverse on the entire interval \([0,1]\), not just the discrete points - we can just imagine drawing “straight lines” between the discrete points on the graph. We then have the probability integral transform lemma:

For \(U \sim U[0,1]\), the random variable \(F^{-1}(U) \sim F\)

This means that if we can define the cumulative function for a distribution and invert it, we can simulate from the distribution. This is really powerful, since it allows us to simulate from essentially any distribution. Say we have a bunch of data and we want to sample from its empirical distribution: we can even do this by defining the empirical cumulative function. It is simple to prove the probability integral transform lemma:

For all \(u \in [0,1]\) and \(x \in F^{-1}([0,1])\) we have:

\(F(F^{-1}(u)) \geq u\) and \(F^{-1}(F(x)) \leq x\)

Therefore:

\(\{(u,x) : F^{-1}(u) \leq x \} = \{(u,x) : F(x) \geq u\}\)

And so:

\(\mathbb{P}(F^{-1}(U) \leq x) = \mathbb{P}(U \leq F(x)) = F(x)\) \[\square\]

Let’s see how we can use this method with a simple example. If we have \(X \sim Exp(\lambda)\) (an exponential distribution with parameter \(\lambda\)) we have that the cumulative function is: \[F(x) = 1 - e^{-\lambda x}\]

By inverting this function with some basic algebra we get: \[F^{-1}(u) = \frac{-1}{\lambda} ln(1-u) \]

We can note that \(U\) and \(1-U\) have the same distribution, and so we have: \[ \frac{-1}{\lambda} ln(U) \sim F \]

Let’s look at this in action:

# Set parameters
N <- 10000    # Number of simulations
lam <- 3     # Rate parameter

# Set seed for reproducibility
set.seed(123)

# Sample uniform random variates
U <- runif(N)

# Apply probability integral transform
X <- -log(U)/lam

# Create analytic result
x_a <- seq(0, 2.5, length.out = 50)
fx_a <- lam * exp(-lam * x_a)

# Plot histogram of samples
hist(X, freq = FALSE, breaks = 50, col = "grey", 
     xlab = "x", ylab = "f(x)", 
     main = "Empirical Simulated Exponential Distribution Compared to Analytic Result",
     xlim = c(0, 2.5), ylim = c(0, 3))
lines(x_a, fx_a, col = "red", lwd = 2.5)
legend("topright", legend = c("Empirical", "Analytic"), 
       col = c("grey", "red"), lwd = c(NA, 2.5), 
       fill = c("grey", NA), border = c("black", NA))

Relationships Between Distributions

However, in a lot of cases we can rely on relationships between distributions to help us (think back to stats class, where you thought you’d never use this stuff!). For example, if \(X_i \sim Exp(1)\) iid then, for integer shape \(\alpha\), we can define \(Y \sim Gamma(\alpha, \beta)\) (with scale \(\beta\)) via: \[ Y = \beta \sum_{i=1}^{\alpha} X_i\]

Along with many other relations, this lets us avoid having to invert the CDF: \[F(x) = \frac {1}{\Gamma (\alpha )}\gamma (\alpha ,\beta x) \]

This is not particularly easy to do. There is a skill in knowing when and how to use these sorts of transformations. Often the difference in execution time is negligible, so it is better to use the standard inversion method on the CDF, as it makes the code more transparent. In other cases timing matters, and with the right transformation there can be serious time savings available.

Normal distribution: Box-Muller algorithm for two normals from two uniforms

One transform method for generating standard Gaussian variates is the Box-Muller technique. It relies on a polar co-ordinate representation \((r, \theta)\). If \((X_1, X_2)\) are iid Gaussian(0,1) then we can write: \[r^2 = X_1^2 + X_2^2 \sim \chi_2^2 = Exp\left(\tfrac{1}{2}\right)\] (an exponential with rate \(\tfrac{1}{2}\)).

We can then use \(U_1, U_2 \sim U[0,1]\) iid to represent \((X_1, X_2)\) as: \[\begin{align} X_1 &= \sqrt{-2lnU_1}cos(2\pi U_2) \\ X_2 &= \sqrt{-2lnU_1}sin(2\pi U_2) \end{align}\]

This isn’t the fastest-performing algorithm one can find, however: the \(\sin(\cdot)\) and \(\cos(\cdot)\) function calls are computationally quite expensive.

Relying on the relationships between distributions can be problematic. For one thing it might not be possible to construct a nice neat relationship - or if we can construct a relationship it may have strict conditions that means any conversion becomes inefficient.

Acceptance-Rejection Methods

  • There are many distributions for which the inverse transform method, and even the general transformations above, fail to generate the required random variables.
  • Hence we turn to another class of methods: acceptance-rejection methods. Again these rely on taking (pseudo-random) uniform variates as a starting point.
  • The general concept is to “look” at each uniform pseudo-random number and decide whether we believe it could have been generated by the distribution we wish to sample from (and accept it) or not (and reject it).

The main theoretical backbone behind these methods is:

The Fundamental Theorem of Simulation:

Simulating variates with density \(f(x)\): \[ X \sim f(x) \]

Is equivalent to simulating: \[ (X, U) \sim U \{ (x,u) : 0 < u < f(x) \} \]

It is not immediately clear why this is useful at this stage, but it is quite prophetic and underlies many simulation algorithms. To see why, let’s note: \[\begin{align} f(x) &= \int_0^{f(x)} du \\ \mathbb{P}(X \leq x) &= \int_{-\infty}^{x} f(y) dy \end{align}\]

These are essentially just definitions, but they help us enormously in our simulation problem. For example, let’s suppose we have some arbitrary density \(f(\cdot)\) such that: \[\begin{align} \int_a^b f(x) dx &= 1 \\ \sup_x \{ f(x) \} &\leq M \end{align}\]

Then we can simulate from \(f(\cdot)\) by generating \((Y, U)\) such that \(Y \sim U[a,b]\) and \(U|Y=y \sim U[0,M]\); we accept a simulation \(u\) if \(0 < u < f(y)\), otherwise we reject it. This works because: \[\begin{align} \mathbb{P}(X \leq x) &= \mathbb{P}(Y \leq x | U < f(Y)) \\ &= \frac{\int_a^x \int_0^{f(y)} du dy}{\int_a^b \int_0^{f(y)} du dy} \\ &= \int_a^x f(y) dy \end{align}\]

This is all a bit notation heavy so let’s look at a specific example now. We shall generate from a Beta distribution. This has density function: \[ f(x) = \frac{x^{\alpha-1}(1-x)^{\beta-1}} {B(\alpha,\beta)} \]

Where \(B\) is the Beta function defined: \[ B(\alpha ,\beta )={\frac {\Gamma (\alpha )\Gamma (\beta )}{\Gamma (\alpha +\beta )}} \]

The Gamma function \(\Gamma\) being: \[ \Gamma (z)=\int _{0}^{\infty }x^{z-1}e^{-x}\,dx \]

Through differentiation we find the maximum value attained by this function is: \[ max_x (f(x)) = \frac{\left(\frac{\alpha-1}{\alpha+\beta-2}\right)^{\alpha-1}\left(1-\frac{\alpha-1}{\alpha+\beta-2}\right)^{\beta-1}} {B(\alpha,\beta)} = M \]

Assuming \(\alpha > 1\) and \(\beta > 1\) (so that the maximum is attained in the interior of \((0,1)\))

We can use this to sample from the Beta distribution using acceptance/rejection:

Summary

The algorithm follows this idea. Choose \(g(x)\) which is close to \(f(x)\), easy to sample from, and has the same support as the target \(f(x)\). Then:

  1. Generate \(Y \sim g\) and \(U \sim U[0,1]\).
  2. Accept \(X=Y\) if \(U \leq \frac{f(Y)}{Mg(Y)}\).
  3. Otherwise, reject and return to step 1.

In general, accept \(X_{i}\) if \(U_i \leq \frac{f(x_i)}{Mg(x_i)}\); otherwise reject. The accepted draws satisfy \(\underline{X}\sim f(x)\).

Practical idea

# Set parameters
ap <- 3
bt <- 2

# Define beta PDF function
beta_pdf <- function(x, a = ap, b = bt) {
  (x^(a-1) * (1-x)^(b-1)) / beta(a, b)
}

# Calculate maximum value
x_max <- (ap - 1) / (ap + bt - 2)
M <- beta_pdf(x_max)

# Number of samples
N <- 1000

# Generate samples
set.seed(123)  # For reproducibility
Y <- runif(N)
U <- runif(N, 0, M)

# Acceptance/rejection masks
mask_acc <- U < beta_pdf(Y)
mask_rej <- U >= beta_pdf(Y)  # complement of the acceptance mask

# Accepted and rejected samples
Y_acc <- Y[mask_acc]
U_acc <- U[mask_acc]

Y_rej <- Y[mask_rej]
U_rej <- U[mask_rej]

# Create analytic PDF
X <- seq(0, 1, length.out = 100)
PDF <- beta_pdf(X)

# Create plots
par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))

# Plot 1: Acceptance-Rejection Sampling
plot(Y_acc, U_acc, col = "pink", pch = 16, 
     xlab = "x", ylab = "f(x)", 
     main = "Acceptance-Rejection Sampling",
     xlim = c(0, 1), ylim = c(0, M + 0.05))
points(Y_rej, U_rej, col = "grey", pch = 16)
lines(X, PDF, col = "red", lwd = 2)
legend("topright", legend = c("Accepted", "Rejected", "PDF"), 
       col = c("pink", "grey", "red"), pch = c(16, 16, NA), lty = c(NA, NA, 1))

# Plot 2: Empirical Distribution
hist(Y_acc, freq = FALSE, col = "grey", 
     xlab = "x", ylab = "f(x)", 
     main = "Empirical Distribution",
     xlim = c(0, 1), ylim = c(0, M + 0.05))
lines(X, PDF, col = "red", lwd = 2)


# Reset plotting parameters
par(mfrow = c(1, 1))

Example 2

\[f(x)=\frac{3}{2}x^3+\frac{11}{8}x^2+\frac{1}{6}x+\frac{1}{12} ,\quad 0<x<1\] Since \(f(x)\) has support on \((0,1)\), we can generate our proposals from \(g \sim U(0,1)\), as in the sketch below.

Central Limit Theorem for Monte Carlo Integration

We can see that our estimated integral is around \(5.2\), however at this stage we have no idea whether this is a “good” estimate or a “bad” estimate. In the (rather contrived) \(\pi\) estimate example we had some idea of what the value “should” be but in this case we do not have any real intuition. We can look at the graph and try and convince ourselves the estimate is reasonable but how do we know? Thankfully we can rely on some theory to help us gain a better understanding.

To better understand the properties of our estimator we rely on the central limit theorem (CLT):


Theorem: Central Limit Theorem (for Monte Carlo)

Suppose we wish to estimate the integral:

\[\mathbb{E}_f[h(X)] = \int h(x)f(x) dx\]

For some variable \(X\) with pdf \(f\). When \(f\) is the uniform density on \([0,1]\) (as in the examples below), we can approximate this with a uniform sample via:

\[\overline{h}_N = \frac{1}{N} \sum_{i=1}^N h(u_i)\]

For \(u_i\) iid samples from the uniform distribution. We let:

\[\overline{v}_N = \frac{1}{N^2} \sum_{i=1}^N (h(u_i) - \overline{h}_N)^2\]

Then, assuming \(\mathbb{E}_f[h(X)]\) and \(\operatorname{Var}_f[h(X)]\) both exist and are finite:

\[ \frac{\overline{h}_N - \mathbb{E}_f[h(X)] }{\sqrt{\overline{v}_N}} \xrightarrow[N \to \infty]{\text{D}} \mathcal{N}(0,1) \]

That is, the (standardized) sample mean \(\overline{h}_N\) converges in distribution to a Gaussian.

We can use this result in a number of ways, for example we can calculate a 95% confidence interval using our estimator above:

##=============================
## Monte Carlo integration examples
##=============================

# Estimate the integral of the standard normal pdf over [0, 1]:
Integral <- function(n){
  X <- runif(n)                         # g(x) ~ Unif[0,1]
  Y <- exp(-X^2/2)/sqrt(2*pi)           # h(x): the standard normal pdf
  Int <- sum(Y)/n                       # Monte Carlo estimate
  Error <- Int - (pnorm(1) - pnorm(0))  # compare with the true value
  list(Int, Error)
}
Integral(1000)
## [[1]]
## [1] 0.3416247
## 
## [[2]]
## [1] 0.0002799166
# Estimate pi via the integral of 4/(1 + x^2) over [0, 1]:
Integral4 <- function(n){
  a <- pi
  x <- runif(n)
  f <- 4/(1 + x*x)
  Int <- sum(f)/n                       # Monte Carlo estimate
  Error <- Int - a                      # compare with pi
  list(Int, Error)
}
Integral4(1000)
## [[1]]
## [1] 3.139655
## 
## [[2]]
## [1] -0.001937308

Convergence of Monte Carlo Estimates

Markov Chain Monte Carlo (MCMC)

MCMC is a broad class of computational tools for approximating integrals and generating samples from a posterior probability distribution.

Why sample?

  • \(p(\theta \mid y)\) is a distribution but the shape of that distribution is not always directly attainable.
  • For a normal prior on the mean and a normal likelihood function the posterior distribution for the mean is also a normal distribution.
  • The normal is a conjugate prior for the normal likelihood. (The Beta distribution is conjugate for the binomial distribution).
  • For most models an analytic expression of the posterior distribution is not possible.
  • In these situations sampling is needed to approximate the posterior distribution. This is what is offered by software like JAGS, BUGS, Stan, etc.
Note

MCMC is used when it is not possible to sample \(\boldsymbol{\theta}\) directly from the posterior probability \(P(\boldsymbol{\theta} \mid \text{data})\). Instead, we collect samples in an iterative manner, where every step of the process we expect that the distribution which we are sampling from \(P^*(\boldsymbol{\theta}^{(*)} \mid \text{data})\) becomes more similar in every iteration to the posterior \(P(\boldsymbol{\theta} \mid \text{data})\).

All of this is to \(\textbf{eliminate the evaluation}\) (often impossible) of the \(\textbf{denominator}\) \(P(\text{data})\).

Markov Chains

  • We proceed by defining an \(\textbf{ergodic Markov chain}\), meaning that there is a \(\textbf{unique stationary distribution}\), in which the set of possible states is the sample space and the stationary distribution is the distribution to be \(\textit{approximated}\) (or \(\textit{sampled}\)).
  • Let \(X_0, X_1, \dots, X_n\) be a simulation of the chain. The Markov chain \(\textbf{converges to the stationary distribution from any initial state}\) \(X_0\) after a \(\textbf{sufficiently large number of iterations}\) \(r\). The distribution of the state \(X_r\) will be similar to the stationary distribution, hence we can use it as a sample.

MCMC Algorithms – Metropolis-Hastings

These were the first MCMC algorithms. They use an \(\textbf{acceptance/rejection rule for the proposals}\) and are characterized by proposals originating from a random walk in the parameter space. The \(\textbf{Gibbs algorithm}\) can be seen as a \(\textbf{special case}\) of MH, because all of its proposals are automatically accepted.

Asymptotically, the optimal acceptance rate is 23.4%, and the computational cost of every iteration is \(\mathcal{O}(d)\), where \(d\) is the number of dimensions in the parameter space.

MCMC Algorithms – Hamiltonian Monte Carlo

These are currently the most efficient MCMC algorithms. They try to \(\textbf{avoid the random walk behavior by introducing an auxiliary vector of momenta using Hamiltonian dynamics}\). The proposals are “guided” to higher density regions of the sample space. This makes \(\textbf{HMC more efficient by orders of magnitude when compared to MH and Gibbs}\).

Asymptotically, the optimal acceptance rate is 65.1%, and the computational cost of every iteration is \(\mathcal{O}(d^{\frac{1}{4}})\), where \(d\) is the number of dimensions in the parameter space.

Metropolis Algorithm

Metropolis is a random walk through the parameter sample space, where the probability of the Markov chain changing its state is defined as: \[ P_{\text{change}} = \min\left({\frac{P (\boldsymbol{\theta}_{\text{proposed}})}{P (\boldsymbol{\theta}_{\text{current}})}},1\right). \] This means that the Markov chain will only change to a new state based on one of two conditions:

  • when the probability of the random-walk proposed parameters \(P(\boldsymbol{\theta}_{\text{proposed}})\) is higher than the probability of the current state parameters \(P(\boldsymbol{\theta}_{\text{current}})\), we change with 100% probability;

  • when it is lower, we change with probability equal to the ratio of the proposed to the current probability.

Implementing Metropolis-Hastings (MH) Algorithm

  • Initialize the value of \(\bf{X}\) at \(\bf{x}_i\).

  • Generate a new value \(\bf{x}_{i+1}\) from \(\bf{x}_i\) using a proposal distribution \(q(\bf{x}_{i+1}|\bf{x}_i)\) (if the proposal is symmetric, the \(q\) terms below cancel).

  • Compute the probability of accepting the new value as:

\[p_{accept}=p(x_{i+1}|x_i)=\min\left[1,\frac{p(x_{i+1})\;q(x_i|x_{i+1})}{p(x_{i})\;q(x_{i+1}|x_{i})}\right]\]

  • Save the new value \(\bf{x}_{i+1}\) if \(p_{accept}>r\) where \(r\sim U(0,1)\). Otherwise, save the old value \(\bf{x}_i\).

  • Repeat steps \(2-4\) until a sufficiently large sample of values has been generated. A minimal sketch follows below.

Limitations of the Metropolis Algorithms

  • With the proposals randomly generated, it can take a large number of iterations for the Markov chain to enter higher posterior density regions.

  • Even highly efficient MH algorithms sometimes accept less than 25% of the proposals.

  • In lower-dimensional contexts, higher computational power can compensate for the low efficiency up to a point. But in higher-dimensional (and higher-complexity) modeling situations, higher computational power alone is rarely sufficient to overcome the low efficiency.

Gibbs

The essence of the Gibbs algorithm is the sampling of parameters conditioned on the other parameters: \[P(\theta_1 \mid \theta_2, \dots, \theta_p)\]

Gibbs Sampling

More complicated than our Metropolis algorithm, but better for large numbers of parameters, which easily occur in hierarchical modeling. Imagine a two-parameter situation \(\{\theta_1, \theta_2\}\) with independent prior distributions \(p(\theta_1, \theta_2) = p(\theta_1)p(\theta_2)\).

  1. Start at a random point \(\theta^{(0)}\).
  2. For \(t = 1, \dots, T\), repeat the following steps:
  3. Set \(\theta^{(t)} = \theta^{(t-1)}\) as the starting point for this sweep.
  4. For \(j=1,2,\dots,d\), update \(\theta_j\) by drawing \(\theta_j^{(t)} \sim f(\theta_j \mid \theta_{-j}, y)\), where \(\theta_{-j}\) holds the most recent values of the other components.

More formally, define an initial vector \(\boldsymbol{\theta}^{(0)} \in \mathbb{R}^p\) such that \(P\left(\boldsymbol{\theta}^{(0)} \mid \mathbf{y} \right) > 0\). For \(t = 1, 2, \dots\), assign:

\[ \boldsymbol{\theta}^{(t)} = \begin{cases} \theta^{(t)}_1 & \sim P \left(\theta_1 \mid \theta^{(t-1)}_2, \dots, \theta^{(t-1)}_p \right) \\ \theta^{(t)}_2 & \sim P \left(\theta_2 \mid \theta^{(t)}_1, \theta^{(t-1)}_3, \dots, \theta^{(t-1)}_p \right) \\ & \vdots \\ \theta^{(t)}_p & \sim P \left(\theta_p \mid \theta^{(t)}_1, \dots, \theta^{(t)}_{p-1} \right) \end{cases} \]

  • Informative priors aim to reflect information available to the analyst that is gained independently of the data being studied.

  • In the absence of any prior information on one or more model parameters we wish to ensure that this lack of knowledge is properly reflected in the prior.

Summary:

Proposition 1 (Metropolis–Hastings algorithm) We consider sampling from a density function \(p(\boldsymbol{\theta}),\) known up to a normalizing factor not depending on \(\boldsymbol{\theta}.\) We use a (conditional) proposal density \(q(\boldsymbol{\theta} \mid \boldsymbol{\theta}^*),\) which has non-zero probability over the support of \(p(\cdot),\) as the transition kernel to generate proposals.

The Metropolis–Hastings algorithm builds a Markov chain starting from an initial value \(\boldsymbol{\theta}_0:\)

  1. draw a proposal value \(\boldsymbol{\theta}_t^{\star} \sim q(\boldsymbol{\theta} \mid \boldsymbol{\theta}_{t-1}).\)
  2. Compute the acceptance ratio \[ R = \frac{p(\boldsymbol{\theta}_t^{\star})}{p(\boldsymbol{\theta}_{t-1})}\frac{q(\boldsymbol{\theta}_{t-1} \mid \boldsymbol{\theta}_t^{\star} )}{q(\boldsymbol{\theta}_t^{\star} \mid \boldsymbol{\theta}_{t-1})} \tag{1}\]
  3. With probability \(\min\{R, 1\},\) accept the proposal and set \(\boldsymbol{\theta}_t \gets \boldsymbol{\theta}_t^{\star},\) otherwise set the value to the previous state, \(\boldsymbol{\theta}_t \gets \boldsymbol{\theta}_{t-1}.\)

The following theoretical details are provided for completeness only.

Definition 1 (Detailed balance) If our target is \(p(\cdot),\) then the Markov chain satisfies the detailed balance condition with respect to \(p(\cdot)\) if \[\begin{align*} K(\boldsymbol{\theta}^{\text{cur}}, \boldsymbol{\theta}^{\text{prop}})p(\boldsymbol{\theta}^{\text{cur}}) = K(\boldsymbol{\theta}^{\text{prop}}, \boldsymbol{\theta}^{\text{cur}})p(\boldsymbol{\theta}^{\text{prop}}). \end{align*}\] If a Markov chain satisfies detailed balance with respect to \(p(\cdot),\) then \(p(\cdot)\) is necessarily the invariant density of the Markov chain, and the chain is reversible.

Convergence diagnostics

There are several visual checks that are run to validate the output of models based on Markov chain Monte Carlo. Indeed, unlike with exact sampling algorithms, one needs to worry both about the convergence to the posterior and the quality of the mixing, which determines the effective sample size available for inference. Many visual diagnostics can be used to diagnose lack of convergence, or inefficiency: we review key notions that appear in the literature.

Definition 2 (Trace plots) A trace plot is a line plot of the Markov chain as a function of the number of iterations. It should be stable around some values if the posterior is unimodal and the chain has reached stationarity. The ideal shape is that of a ‘fat hairy caterpillar’.

It is useful to inspect the Markov chain visually, as this may indicate several problems. If the chain drifts around without stabilizing around the posterior mode, then we can suspect that it hasn’t reached its stationary distribution (likely due to poor starting values). In such cases, we need to disregard the dubious draws from the chain by discarding the so-called warm up or burn in period. While there are some guarantees of convergence in the long term, silly starting values may translate into tens of thousands of iterations lost wandering around in regions with low posterior mass. Preliminary optimization and plausible starting values help alleviate these problems.

Definition 3 (Trace rank plot) If we run several chains with different starting values, we can monitor convergence by checking whether these chains converge to the same target. A trace rank plot compares the ranks of the values of the different chains at a given iteration: with good mixing, the ranks should switch frequently and be distributed uniformly across integers.

Definition 4 (Burn in period) We term “burn in” the initial steps of the MCMC algorithm that are discarded because the chain has not reached its stationary distribution, due to poor starting values; visual inspection using a trace plot may show that it is necessary to remove additional observations.

Most software will remove the first \(N\) initial values (typically one thousand). Good starting values can reduce the need for a long burn-in period. If visual inspection of the chains reveals that some of the chains for one or more parameters are not stationary until some iteration, we discard all draws up to that point in addition. @Geweke:1992’s test measures whether the distribution of the resulting Markov chain is the same at the beginning and at the end, through a test of equality of means.

Definition 5 (Warmup) The warmup period refers to the initial sampling phase (potentially overlapping with the burn-in period) during which proposals are tuned (for example, by changing the proposal variance to ensure a good acceptance rate or, for Hamiltonian Monte Carlo (HMC), to tune the leapfrog step size). These initial steps should be disregarded.

Definition 6 (Thinning) MCMC algorithms are often run with thinning (i.e., keeping only a fraction of the samples drawn, typically every \(k\)th iteration). This is wasteful, as we can of course get more precise estimates by keeping all posterior draws, whether correlated or not. The only argument in favor of thinning is limited storage capacity: if we run very long chains in a model with hundreds of parameters, we may run out of memory.

Remark 1 (Multiple short chains or longer runs?) Many diagnostics rely on running multiple Markov chains for the same problem, with different starting values. In principle, it is more efficient to run a single Markov chain for longer than multiple shorter chains for a fixed computational budget; a single run means warmup needs to be achieved only once. The benefits of running multiple chains come from different considerations: we can monitor convergence to the stationary distribution (or to different modes for multimodal posteriors) and run diagnostics that rely on between-chain variance. Multiple chains can also be run in parallel and the results combined, a situation more in line with modern computer architecture.

Summary:
  • Metropolis–Hastings generalizes rejection sampling by building a Markov chain and providing a mechanism for sampling.
  • Small proposal variance leads to a high acceptance rate but small step sizes; large-variance proposals lead to many rejections, in which case the previous value is carried forward. Both extreme scenarios lead to large autocorrelation.
  • The proposal density can be anything, but it must ideally account for the support of the target and allow for exploration of the state space.
  • Good initial starting values can be obtained by computing maximum a posteriori estimates.
  • Initializing multiple chains at different starting values can be used to check convergence to the stationary distribution.
  • Mixing will improve if strongly correlated parameters are sampled together.
  • The optimal acceptance rate depends on the dimension, but guidelines for random walk Metropolis are to have 0.44 for a single parameter model and 0.234 for multivariate targets; see @Neal:2011 for a heuristic derivation.
  • To obtain the target acceptance rate, users must tune the variance of the proposal kernel. This is typically achieved by running the chain for some period, computing the empirical acceptance rate and increasing (respectively decreasing) the variance if the acceptance rate is too high (too low).
  • Metropolis-adjusted Langevin algorithm (MALA) uses the gradient information to inform the proposal; it is akin to a Newton step.
  • The detailed balance requires a function \(g\) such that \(g(r) = rg(1/r).\) Taking \(g(r) = \min(1,r)\) as in Metropolis–Hasting rule leads to the lowest asymptotic variance [@Peskun:1973].

Examples

# Chapter 6 R examples:

library(bayesrules)
library(tidyverse)

## Grid Approximation:
## Poisson-Gamma example

# Plot of the gamma(3,1) prior pdf:

plot_gamma(shape=3,rate=1)


# Step 1: Define a grid of 501 lambda values
grid_data   <- data.frame(lambda_grid = seq(from = 0, to = 15, length = 501))

# Step 2: Evaluate the prior & likelihood at each lambda
grid_data <- grid_data %>% 
  mutate(prior = dgamma(lambda_grid, 3, 1),
         likelihood = dpois(2, lambda_grid) * dpois(8, lambda_grid))

# Step 3: Approximate the posterior
grid_data <- grid_data %>% 
  mutate(unnormalized = likelihood * prior,
         posterior = unnormalized / sum(unnormalized))

# Set the seed (just to match the book in this example, typically you wouldn't do this):
set.seed(84735)

# Step 4: sample from the discretized posterior
post_sample <- sample_n(grid_data, size = 10000, 
                        weight = posterior, replace = TRUE)
                        
# Histogram of the grid simulation with posterior pdf 
ggplot(post_sample, aes(x = lambda_grid)) + 
  geom_histogram(aes(y = ..density..), color = "white") + 
  stat_function(fun = dgamma, args = list(13, 3)) + 
  lims(x = c(0, 15))

  
# Approximate Posterior summary statistics:

mean(post_sample$lambda_grid)  # posterior mean
## [1] 4.334358
median(post_sample$lambda_grid)  # posterior median
## [1] 4.2
quantile(post_sample$lambda_grid, probs=c(0.025,0.975))  # 95% credible interval, quantile-based
##  2.5% 97.5% 
##  2.34  6.99
library(TeachingDemos)
emp.hpd(post_sample$lambda_grid, conf=0.95) # 95% credible interval, HPD
## [1] 1.29 6.48

####################################################
##### Example 1, Chapter 6a notes (Birth rate data)
####################################################

# Prior parameters:

alpha.pri <- 2; beta.pri <- 1

# Summary statistics from data:

sum.y1 <- 217; n1 <- 111
sum.y2 <- 66;  n2 <- 44

# Sampling from the posterior distribution:

theta1.post.draws <- rgamma(10000, shape = alpha.pri + sum.y1, rate = beta.pri + n1)
theta2.post.draws <- rgamma(10000, shape = alpha.pri + sum.y2, rate = beta.pri + n2)

# Approximating the posterior probability that theta1 > theta2:

mean(theta1.post.draws > theta2.post.draws)
## [1] 0.9722

# Examining the posterior distribution of the ratio theta1 / theta2:

my.ratio <- theta1.post.draws / theta2.post.draws

plot(density(my.ratio))  # plot of estimated posterior for theta1/theta2


quantile(my.ratio, probs=0.5)  # approximate posterior median
##      50% 
## 1.299798

quantile(my.ratio, probs=c(0.025,0.975) )  # approximate 95% quantile-based interval for theta1/theta2
##      2.5%     97.5% 
## 0.9929235 1.7171388

library(TeachingDemos)

emp.hpd(my.ratio, conf=0.95)  # approximate 95% HPD interval for theta1/theta2
## [1] 0.8304853 1.6412770

####################################################
##### Example 2, Chapter 6a notes (Flu shot data)
####################################################

flu.data <- c(1,1,1,1,1,0,1,1,1,1,0,1,1,0,1,1,0,0,1)
# data for the observed 19 individuals

my.y <- sum(flu.data)  # count of all "successes"

tot.draws <- 10000 # number of sampled values from each full conditional

# Initial values for quantities of interest:
X.20.init <- 1
theta.init <- 0.5

# Dummy Vectors that will hold values for samples of interest:
X.20.vec <- c(X.20.init, rep(NA, times = tot.draws))   # preallocated
theta.vec <- c(theta.init, rep(NA, times = tot.draws)) # preallocated

for (j in 2:(tot.draws+1) ) {
 theta.vec[j] <- rbeta(n=1, my.y + X.20.vec[j-1] + 1, 20 - my.y - X.20.vec[j-1] + 1)
 X.20.vec[j] <- rbinom(n=1, size=1, prob=theta.vec[j])
}

# Remove initial values:
theta.vec <- theta.vec[-1]
X.20.vec <- X.20.vec[-1]

# remove first 2000 sampled values as "burn-in":

theta.post <- theta.vec[-(1:2000)]
X.20.post <- X.20.vec[-(1:2000)]

## Posterior summary for theta:

plot(density(theta.post))  # plot of estimated posterior for theta


mean(theta.post)  # Posterior mean of the other 8000 sampled theta values
## [1] 0.7140283

median(theta.post)  # Posterior median of the other 8000 sampled theta values
## [1] 0.7210063

quantile(theta.post, probs=c(0.025,0.975) )  # approximate 95% quantile-based interval for theta
##      2.5%     97.5% 
## 0.5088270 0.8794538

emp.hpd(theta.post, conf=0.95)  # approximate 95% HPD interval for theta
## [1] 0.2874779 0.8602287

## NOTE:  The observed-data MLE for theta here is:
mean(flu.data)
## [1] 0.7368421
# which is biased in this case because it ignores the missing observation.

## Posterior summary for the missing data value X.20:

mean(X.20.post)  # Posterior mean for X.20
## [1] 0.713125

var(X.20.post) # Posterior variance for X.20
## [1] 0.2046033


####################################################
##### Example 3, Chapter 6a notes (Coal mining data)
####################################################

# The data:
disasters <- c(4,5,4,1,0,4,3,4,0,6,3,3,4,0,2,6,3,3,5,
               4,5,3,1,4,4,1,5,5,3,4,2,5,2,2,3,4,2,1,
               3,2,2,1,1,1,1,3,0,0,1,0,1,1,0,0,3,1,0,
               3,2,2,0,1,1,1,0,1,0,1,0,0,0,2,1,0,0,0,
               1,1,0,2,3,3,1,1,2,1,1,1,1,2,4,2,0,0,0,
               1,4,0,0,0,1,0,0,0,0,0,1,0,0,1,0,1)

my.alpha <- 4; my.beta <- 1  # hyperparameters for prior on lambda
my.gamma <- 1; my.delta <- 2 # hyperparameters for prior on phi


### R function to repeatedly sample values of theta = (lambda,phi,k) using Gibbs sampler:

bcp <- function(theta.matrix, y, a, b, g, d) {
  n <- length(y)
  k.prob <- rep(0, times = n)
  for (i in 2:nrow(theta.matrix)) {
    k.prev <- theta.matrix[(i-1), 3]  # changepoint sampled at the previous iteration

    # Sampling lambda from its full-conditional gamma distribution:
    lambda <- rgamma(1, a + sum(y[1:k.prev]), b + k.prev)

    # Sampling phi from its full-conditional gamma distribution:
    phi <- rgamma(1, g + sum(y[(k.prev+1):n]), d + n - k.prev)

    # Sampling k from its discrete full conditional:
    for (j in 1:n) {
      k.prob[j] <- exp(j*(phi-lambda)) * (lambda/phi)^sum(y[1:j])
    }
    k.prob <- k.prob/sum(k.prob)
    k <- sample(1:n, size = 1, prob = k.prob)

    theta.matrix[i,] <- c(lambda, phi, k)  # storing this Gibbs iteration's parameter set
  }
  return(theta.matrix)
}

#### End of bcp function ###
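
For reference, the unnormalized full conditional for the changepoint k used inside bcp follows from splitting the Poisson likelihood at k and dropping factors that do not involve k (assuming a discrete-uniform prior on k, as the code implies):

\[
p(k \mid \lambda, \phi, y) \;\propto\; e^{k(\phi - \lambda)} \left(\frac{\lambda}{\phi}\right)^{\sum_{i=1}^{k} y_i}, \qquad k = 1, \dots, n,
\]

which is exactly what k.prob[j] computes before normalization.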

tot.draws <- 2000 # number of sampled values from each full conditional

init.param.values <- c(4, 0.5, 55) # initial values for (lambda, phi, k)
# just used prior means as initial values...

init.theta.matrix <- matrix(0, nrow=tot.draws, ncol=3)

init.theta.matrix <- rbind(init.param.values,init.theta.matrix) 

my.Gibbs.samples <- bcp(init.theta.matrix, y=disasters, a=my.alpha, b=my.beta, g=my.gamma, d=my.delta)

# remove first 1000 sampled values as "burn-in":

my.post <- my.Gibbs.samples[-(1:1000),]

# Posterior summaries for lambda, phi, and k:

apply(my.post,2,median)  # Posterior medians for lambda, phi, and k:
## [1]  3.1314663  0.9177656 40.0000000

cbind(apply(my.post, 2, quantile, probs=0.025),   apply(my.post, 2, quantile, probs=0.975) )
##            [,1]      [,2]
## [1,]  2.6368119  3.764314
## [2,]  0.7082999  1.159749
## [3,] 36.0000000 45.000000
# approximate 0.025 and 0.975 quantiles for lambda, phi, and k
# to get approximate 95% quantile-based intervals

rbind( emp.hpd(my.post[,1], conf=0.95), 
emp.hpd(my.post[,2], conf=0.95), 
emp.hpd(my.post[,3], conf=0.95) ) # approximate 95% HPD intervals
##                          
## [1,]  2.3425913  3.661355
## [2,]  0.5431688  1.106828
## [3,] 32.0000000 43.000000
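
# Since k is discrete, a frequency plot of the sampled changepoints is a natural
# way to visualize its posterior (a small added check, not in the summaries above):

plot(table(my.post[,3]), xlab="changepoint k", ylab="frequency")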



####################################################
##### Example 5, Chapter 6a notes (Sparrow data)
####################################################

library(mvtnorm)
# If the mvtnorm package is not installed, first run:
#   install.packages("mvtnorm", dependencies=T)
# (requires an internet connection).

sparrow.data <- read.table("http://www.stat.sc.edu/~hitchcock/sparrowdata.txt", header=T)

y <- sparrow.data[,1]
x <- sparrow.data[,2]; xsq <- x^2

X <- cbind(rep(1,times=length(x)), x, xsq)

p <- dim(X)[2]  # number of columns of matrix X

beta.prior.mean <- rep(0, times=p)
beta.prior.sd <- rep(10,times=p)

proposal.cov.matrix <- var(log(y + 1/2)) * solve(t(X) %*% X)
# A reasonable proposal covariance in this case, since it should be close to (sigma^2)*(X'X)^{-1}

S <- 10000
beta.current <- rep(0,times=p) # initial value for M-H algorithm
acs <- 0 # will be used to track the acceptance rate

beta.values <- matrix(0,nrow=S,ncol=p)  # will store sampled values of beta vector

for (s in 1:S) {
 beta.proposed <- t(rmvnorm(1, beta.current, proposal.cov.matrix))

 log.accept.ratio <- sum(dpois(y,exp(X %*% beta.proposed), log=T)) - sum(dpois(y,exp(X %*% beta.current), log=T)) +
        sum(dnorm(beta.proposed, beta.prior.mean, beta.prior.sd, log=T)) - 
        sum(dnorm(beta.current, beta.prior.mean, beta.prior.sd, log=T) )

  if (log.accept.ratio > log(runif(1)) ) {
    beta.current <- beta.proposed
    acs <- acs + 1
  }

beta.values[s,] <- beta.current
}

acs/S  # gives the acceptance rate
## [1] 0.4208
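# As a rule of thumb, an acceptance rate in the rough 20-50% range is usually
# considered healthy for a random-walk Metropolis sampler, so about 0.42 is reasonable here.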

acf(beta.values[,1])  # plot autocorrelation values for beta_0

acf(beta.values[,2])  # plot autocorrelation values for beta_1

acf(beta.values[,3])  # plot autocorrelation values for beta_2


# The ACF plots show substantial serial dependence in the sampled chains.

# Thinning out the sampled values by taking every 10th row:

beta.values.thin <- beta.values[10*(1:(S/10) ),]
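
# Re-checking the autocorrelations using the thinned chains:

acf(beta.values.thin[,1])
acf(beta.values.thin[,2])
acf(beta.values.thin[,3])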

# Looks much better...

### Posterior summary:

apply(beta.values.thin,2,median)  # Posterior medians for each regression coefficient
## [1]  0.2364186  0.6963181 -0.1366423

cbind(apply(beta.values.thin, 2, quantile, probs=0.025),   apply(beta.values.thin, 2, quantile, probs=0.975) )
##             [,1]        [,2]
## [1,] -0.63664730  1.02144041
## [2,]  0.08254779  1.35967097
## [3,] -0.24932865 -0.03273479
# approximate 0.025 and 0.975 quantiles for each regression coefficient
# to get approximate 95% quantile-based intervals

rbind( emp.hpd(beta.values.thin[,1], conf=0.95), 
emp.hpd(beta.values.thin[,2], conf=0.95), 
emp.hpd(beta.values.thin[,3], conf=0.95) ) # approximate 95% HPD intervals
##            [,1]        [,2]
## [1,] -1.5263254  0.93914441
## [2,] -0.4581291  1.26930721
## [3,] -0.3520363 -0.05107206


# Plots of posterior median for expected offspring for ages 1,2,3,4,5,6:

my.X <- cbind( rep(1,times=6), (1:6), (1:6)^2 )

y.hats <- exp( my.X %*% as.matrix( apply(beta.values.thin,2,median), ncol=1 ) )

plot( (1:6), y.hats, type='b', xlab='age', ylab='expected offspring')


# Trace plots for the sampled beta_0 and beta_1 values:

plot(beta.values.thin[,1], type='l')


plot(beta.values.thin[,2], type='l')


# Geweke diagnostic for checking convergence of an MCMC chain:
# This gives a z-statistic that compares the mean values of
# (by default) the first 10% of the chain vs. the last 50% of the chain:

library(coda)
geweke.diag(beta.values.thin[,1])
## 
## Fraction in 1st window = 0.1
## Fraction in 2nd window = 0.5 
## 
##   var1 
## 0.3151

# P-value associated with the Geweke diagnostic:
# (a small p-value indicates a LACK of convergence):

2*pnorm(abs(geweke.diag(beta.values.thin[,1])[[1]]), lower.tail=FALSE)
##      var1 
## 0.7526716

## Chapter 5 R Examples

library(bayesrules)
library(tidyverse)


## Fraud risk phone call example from textbook:

# Possible priors for lambda:

# Plot the Gamma(5, 1) prior
plot_gamma(shape = 5, rate = 1)


# Plot the Gamma(10, 2) prior
plot_gamma(shape = 10, rate = 2)


# Plot the Gamma(15, 3) prior
plot_gamma(shape = 15, rate = 3)


# Showing the posterior distribution, along with the prior and the likelihood:

plot_gamma_poisson(shape = 10, rate = 2, sum_y = 11, n = 4)


# Summarizing the posterior (and the prior):

summarize_gamma_poisson(shape = 10, rate = 2, sum_y = 11, n = 4)
##       model shape rate mean     mode       var        sd
## 1     prior    10    2  5.0 4.500000 2.5000000 1.5811388
## 2 posterior    21    6  3.5 3.333333 0.5833333 0.7637626
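
# The gamma-Poisson conjugate update can be verified by hand:
# posterior shape = 10 + sum_y = 21 and posterior rate = 2 + n = 6,
# so, for example, the posterior mean is shape/rate:

(10 + 11)/(2 + 4)  # = 21/6 = 3.5, matching the posterior mean above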

######################
## Credible Intervals
######################

### Fraud risk example:

# A 90% quantile-based credible interval for the daily mean number of fraud-risk calls:
round(qgamma(c(0.05,0.95), shape=21, rate=6),3)
## [1] 2.345 4.844

library(TeachingDemos)
# If the TeachingDemos package is not installed, first run:
#   install.packages("TeachingDemos", dependencies=F)
# (requires an internet connection).

# Using hpd function for 90% HPD credible interval for fraud risk call example:

hpd(qgamma, shape=21, rate=6, conf=0.90)
## [1] 2.248895 4.720435

### Coin tossing Example, Chapter 5 notes:

y <- 2; n <- 10

# A 95% quantile-based credible interval for theta:

qbeta( c(.025, .975), y+1, n-y+1)
## [1] 0.06021773 0.51775585

# A 95% HPD credible interval for theta:

y <- 2; n <- 10

hpd(qbeta, shape1 = y+1, shape2= n-y+1, conf=0.95)
## [1] 0.04055517 0.48372366
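
# For this right-skewed posterior the HPD interval shifts toward the mode and is
# a bit narrower than the quantile-based interval; comparing the widths:

diff(qbeta(c(.025, .975), y+1, n-y+1))                     # width of the quantile-based interval
diff(hpd(qbeta, shape1 = y+1, shape2 = n-y+1, conf=0.95))  # width of the HPD interval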



#####################################
# Conjugate Priors with Normal Data #
#####################################

# Example 1, Chapter 3 notes (Midge data)

# Data (in mm):
y <- c(1.64, 1.70, 1.72, 1.74, 1.82, 1.82, 1.82, 1.90, 2.08)

ybar <- mean(y); n <- length(y)

my.seq <- seq(0.5,3,length=300)


#### Case when variance is known:

# Suppose we knew sigma^2 = 0.01:

# prior parameters for normal prior on mu:

my.delta <- 1.9; my.tau <- 0.3
# "95% sure" a priori that mu is between 1.3 and 2.5, say.
# What if we adjust this?

prior.mu <- dnorm(my.seq, mean=my.delta, sd=my.tau) 
# Defining the prior

posterior.mu <- dnorm(my.seq, mean = ( (my.delta/(my.tau^2) + sum(y)/0.01)/(1/(my.tau^2) + n/0.01) ), 
                sd = sqrt( 0.01*(my.tau^2) / (0.01 + n*(my.tau^2)) )  )
# Defining the posterior
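
The mean and sd arguments above implement the usual normal-normal conjugate update with known \(\sigma^2 = 0.01\):

\[
\mu \mid y \;\sim\; N\!\left( \frac{\delta/\tau^2 + \sum_{i=1}^{n} y_i/\sigma^2}{1/\tau^2 + n/\sigma^2},\;\; \frac{\sigma^2 \tau^2}{\sigma^2 + n\tau^2} \right).
\]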

plot(my.seq, prior.mu, ylim=c(-0.7,12), yaxt='n', 
     xlab="Values of mu", ylab="Probability distribution", type='l', lty=3)
# Plotting the prior
lines(my.seq, posterior.mu)
# Plotting the posterior

### Point estimates for mu:

posterior.mean <- ( (my.delta/(my.tau^2) + sum(y)/0.01)/(1/(my.tau^2) + n/0.01) )

posterior.median <- qnorm(0.50, mean = ( (my.delta/(my.tau^2) + sum(y)/0.01)/(1/(my.tau^2) + n/0.01) ), 
                sd = sqrt( 0.01*(my.tau^2) / (0.01 + n*(my.tau^2)) ) )

print(paste("posterior.mean=", round(posterior.mean,3), 
      "posterior.median=", round(posterior.median,3)))
## [1] "posterior.mean= 1.806 posterior.median= 1.806"

### Interval estimate for mu:

library(TeachingDemos) # loading package in order to use the hpd function

hpd.95 <- hpd(qnorm, mean = ( (my.delta/(my.tau^2) + sum(y)/0.01)/(1/(my.tau^2) + n/0.01) ), 
                sd = sqrt( 0.01*(my.tau^2) / (0.01 + n*(my.tau^2)) ), conf=0.95)
round(hpd.95, 3)
## [1] 1.741 1.871

segments(hpd.95[1], 0, hpd.95[2], 0, lwd=4)

# Showing the 95% HPD interval on the graph

# Using the built-in functions from the 'bayesrules' package:

library(bayesrules)
# Showing the posterior distribution, along with the prior and the likelihood:

plot_normal_normal(mean = 1.9, sd = 0.3, sigma = sqrt(0.01), y_bar = 1.804444, n = 9)


# Summarizing the posterior (and the prior):

summarize_normal_normal(mean = 1.9, sd = 0.3, sigma = sqrt(0.01), y_bar = 1.804444, n = 9)
##       model     mean     mode       var        sd
## 1     prior 1.900000 1.900000 0.0900000 0.3000000
## 2 posterior 1.805609 1.805609 0.0010976 0.0331295

# Note these agree with the hand-coded conjugate results above (posterior mean 1.806).


#### Case when mean and variance are both unknown:

# prior parameters for inverse gamma prior on sigma^2:

my.alpha <- 18; my.beta <- .34  # Based on guess of E[sig^2] = .02, var[sig^2] = .005^2

# prior parameters for normal prior on mu:
my.delta <- 1.9; s0 <- 1  # low value of s0 indicates lack of prior knowledge

library(pscl)  # loading pscl package, to use inverse gamma distribution

# If the pscl package is not installed, first run:
#   install.packages("pscl", dependencies=F)
# (requires an internet connection).



### Point estimates for sigma^2:

p.mean.sig.sq <- (my.beta + 0.5*(sum(y^2) - n*(ybar^2)) ) / (my.alpha + n/2 - 0.5 - 1)

p.median.sig.sq <- qigamma(0.50, my.alpha + n/2 - 0.5, my.beta + 0.5*( sum(y^2) - n*(ybar^2) ) )

print(paste("posterior.mean for sigma^2=", round(p.mean.sig.sq,4), 
      "posterior.median for sigma^2=", round(p.median.sig.sq,4) ))
## [1] "posterior.mean for sigma^2= 0.0194 posterior.median for sigma^2= 0.0188"

### Marginal Interval estimate for sigma^2:


hpd.95.sig.sq <- hpd(qigamma, alpha=my.alpha + n/2 - 0.5, beta=my.beta + 0.5*( sum(y^2) - n*(ybar^2) ) )


#### AN APPROACH FOR INFERENCE ABOUT mu:

# Randomly sample many values for the posterior of sigma^2:

sig.sq.values <- rigamma(n=1000000,alpha=my.alpha + n/2 - 0.5, beta=my.beta + 0.5*( sum(y^2) - n*(ybar^2) ) )

# Randomly sample many values from the posterior of mu, GIVEN the sampled values of sigma^2 above:

mu.values <- rnorm(n=1000000,mean=((sum(y)+my.delta*s0)/(n+s0)), sd=sqrt(sig.sq.values/(n+s0)) )
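
This two-step scheme is composition sampling from the joint posterior: because \(p(\mu, \sigma^2 \mid y) = p(\mu \mid \sigma^2, y)\, p(\sigma^2 \mid y)\), drawing \(\sigma^2\) from its marginal posterior and then \(\mu\) from its conditional posterior given that draw yields samples from the joint posterior, and the mu.values alone are draws from the marginal posterior of \(\mu\).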

# Point estimates for sigma^2 and mu:

round(median(sig.sq.values),4)
## [1] 0.0188

round(median(mu.values),4)
## [1] 1.814

# 95% HPD interval estimates for sigma^2 and mu:

round(emp.hpd(sig.sq.values),4)
## [1] 0.0071 0.0274

round(emp.hpd(mu.values),4)
## [1] 1.5768 1.8863



# Concussion/Brain example:

# Picking out the sample of concussed football players:

data(football)
concussion_subjects <- football %>%
  filter(group == "fb_concuss")
  
# Sample mean hippocampal volume:

concussion_subjects %>%
  summarize(mean(volume))
##   mean(volume)
## 1       5.7346
  
# Checking our assumed data model by examining a density estimate of the sample data:
# Does a Normal distribution with sigma=0.5 seem reasonable?

ggplot(concussion_subjects, aes(x = volume)) + 
  geom_density()

  
# Showing the posterior distribution, along with the prior and the likelihood:
# Note the 'mean' argument below is the PRIOR mean (what we call delta in class).
# The 'sd' argument below is the PRIOR standard deviation (what we call tau in class).

plot_normal_normal(mean = 6.5, sd = 0.4, sigma = 0.5,
                   y_bar = 5.735, n = 25)


# Summarizing the posterior (and the prior):

summarize_normal_normal(mean = 6.5, sd = 0.4, sigma = 0.5,
                        y_bar = 5.735, n = 25)
##       model mean mode       var        sd
## 1     prior 6.50 6.50 0.1600000 0.4000000
## 2 posterior 5.78 5.78 0.0094118 0.0970143
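
# Hand-checking the posterior mean via the normal-normal update formula
# (delta = 6.5, tau = 0.4, sigma = 0.5, ybar = 5.735, n = 25):

(6.5/0.4^2 + 25*5.735/0.5^2) / (1/0.4^2 + 25/0.5^2)  # = 5.78, matching the table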

hpd.95.mu <- hpd(qnorm, mean = 5.78, sd= 0.09701425, conf=0.95)
hpd.95.mu
## [1] 5.589856 5.970144