MCMC is the computational tool used to evaluate hierarchical models. Examples here range from gene frequencies and inbreeding depression to continental distribution and abundance. Interpretation focuses on the posterior distribution, moving from MCMC chains to parameter summaries and predictions.


**Inbred Soay sheep individuals are culled with age**. *Inbred individuals (high values of \(F_{ROH}\)) are missing from higher age classes (a), because they have low survival (b, c) (from Stoffel et al., 2021).*


Logistics

For today

  • Review MCMC examples from previous vignette.
  • Multiple chains
  • Multinomial distribution
  • Prior dependence in genotype frequencies

Objectives

  • articulate the concept of a Markov chain
  • determine when posterior simulation is needed
  • write a multinomial model
  • implement multiple MCMC chains
  • incorporate into algorithms the concepts of prior weight and prior dependence

For next time

  • This vignette

Code

source('clarkFunctions2024.r')


Prior dependence

It can be the case that the allowable range for one parameter depends on the value of a second parameter. This prior dependence can be incorporated into the proposals.

Based on counts of genotypes for two alleles at one locus, I want to estimate the frequency of each allele, which depends on inbreeding. An individual can be homozygous for either allele, or it can be heterozygous, having one copy of each. The greater the degree of inbreeding, the lower the fraction of heterozygotes. Examples of isolation effects on inbreeding and fitness include the Isle Royale wolves (Robinson et al., 2019).

In this example, I use a Metropolis step to estimate gene frequency and the inbreeding coefficient, employing multiple chains. First, recall that I have tended to specify a prior distribution that is independent for multiple parameters, e.g., \([\theta_1, \theta_2]\) is given a prior \([\theta_1][\theta_2]\). This prior does not mean that I think the parameter estimates are independent. What it really means is that I might have no idea what the prior dependence should be or, if I did know, I would not know how to specify it. When my prior distribution is independent, the posterior relationship between parameters comes from the data, i.e., the likelihood.

Allele frequencies and inbreeding

Genotypes shed light on inbreeding, because heterozygotes decline as inbreeding increases.

With random mating (\(f = 0\)), the expected frequency of genotypes for two alleles at one locus is

\[\begin{aligned} P_{aa} &= p^2 \\ P_{ab} + P_{ba} &= 2p(1 - p) \\ P_{bb} &= (1 - p)^2 \end{aligned}\] where \(p\) is the frequency of allele \(a\), and \(1 - p\) is the frequency of \(b\). In this case there are only two alleles at this locus.
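As a quick check, here are these frequencies for \(p = 0.6\) (they sum to one):

p0 <- 0.6                                  # example frequency of allele a
c( p0^2, 2*p0*(1 - p0), (1 - p0)^2 )       # (Paa, Pab + Pba, Pbb) under random mating
## [1] 0.36 0.48 0.16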

With a tendency to in-breed or out-breed (\(f \neq 0\)), the frequencies become

\[\begin{aligned} P_{aa} &= p^2 + fp(1 - p) \\ P_{ab} &= 2p(1 - p)(1 - f) \\ P_{bb} &= (1 - p)^2 + fp(1 - p) \end{aligned}\]

With inbreeding (\(f > 0\)), both homozygotes (\(P_{aa}, P_{bb}\)) increase due to the second term that includes \(f\). At the same time, the factor \((1 - f)\) reduces the heterozygote fraction. With complete inbreeding (\(f = 1\)) heterozygotes disappear completely,

\[\begin{aligned} P_{aa} &= p^2 + p(1 - p) \\ P_{ab} &= 0 \\ P_{bb} &= (1 - p)^2 + p(1 - p) \end{aligned}\]

The function pmake( pars ) in clarkFunctions2024.r evaluates these frequencies from a vector pars = c(p, f). pmake returns the frequencies \((P_{aa}, P_{ab}, P_{bb})\),

pmake <- function(pars){  
  
  if( is.list(pars) ){
    p = pars$p
    f = pars$f
  }else{
    p   <- pars[1]                   # frequency of a
    f   <- pars[2]                   # inbreeding coefficient 
  }
  paa <- p^2 + f*p*(1 - p)         # frequency of paa
  pab <- 2*p*(1 - p)*(1 - f)         # frequency of pab
  pbb <- (1 - p)^2 + f*p*(1 - p)   # frequency of pbb
  if( length( paa ) == 1) return( c(paa,pab,pbb) )
  cbind( paa, pab, pbb )
}

For p = 0.5, here is the difference between genotype frequencies for f = 0 and f = 0.2,

pmake( c(.5, 0) )
## [1] 0.25 0.50 0.25
pmake( c(.5, .2) )
## [1] 0.3 0.4 0.3


Note again how inbreeding increases the proportion of homozygotes. Likewise, a tendency to avoid inbreeding increases the frequency of heterozygotes,

pmake( c(.5,-.2) )
## [1] 0.2 0.6 0.2
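Whatever the values of p and f (within allowable limits), the three genotype frequencies sum to one. Here is a quick sanity check over several values, using pmake's list input:

pf <- pmake( list( p = seq(.1, .9, by = .2), f = rep(.2, 5) ) )
rowSums( pf )                 # each row is a genotype distribution
## [1] 1 1 1 1 1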


Conditional prior distribution

In the case of inbreeding and allele frequency, I cannot avoid specifying prior dependence, because prior independence would give impossible values–the range of possible allele frequencies depends on inbreeding, and vice versa. This example demonstrates prior dependence in parameter values.

I know how to write a distribution for \([\theta_1, \theta_2]\) (jointly) only if I factor it into a conditional and marginal, \([\theta_1| \theta_2][\theta_2]\). For gene frequency and the inbreeding coefficient I use the prior distribution to specify a conditional relationship between two parameters, the inbreeding coefficient \(f\) and the frequency of allele \(a\), which is a parameter \(p\),

\[[f, p] = [f | p][p]\] I can impose this conditional relationship at the proposal stage of the Metropolis step, drawing on the genotype frequencies for two alleles at one locus reviewed above.

Prior dependence comes from the fact that some values of \(p\) cannot occur (at equilibrium) if \(f\) is low. This is the reason why allele frequencies shed light on inbreeding. The inbreeding coefficient can be negative and has a lower limit \(f_l\) that depends on \(p\),

\[ f_l(p) = \begin{cases} -(1 - p)/p, & p > (1 - p) \\ -p/(1 - p), & p \leq (1 - p) \\ \end{cases} \] \(f\) can be as low as \(-1\) for equally represented alleles \((p = 0.5)\) and increases to \(0\) as either allele dominates (\(p = 0\) or \(1\)). A function that determines this limit is minf(p), where the argument p is the parameter \(p\),

minf <- function(p){                        # minimum inbreeding coefficient
  lof <- rep( -1, length(p) )
  lop <- -(1 - p)/p
  hip <- -p/(1 - p)
  lof[ p > (1 - p) ] <- lop[ p > (1 - p) ]
  lof[ p < (1 - p) ] <- hip[ p < (1 - p) ]
  lof
}
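For example, the limit is most permissive for equally represented alleles and tightens as either allele dominates:

minf( c(.1, .5, .9) )         # lower limit on f at three allele frequencies
## [1] -0.1111111 -1.0000000 -0.1111111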

Here is a plot of the minimum allowable values of \(f\),

pseq <- seq(.01,.99,length=100)
plot(pseq, minf(pseq), type='l', xlab='p',ylab='minimum f', ylim = c(-1,1))
*Minimum inbreeding coefficient at different equilibrium frequencies of allele a. Negative f means inbreeding avoidance.*


Inbreeding coefficient

The inbreeding coefficient \(F\) identifies an excess of homozygotes over that expected based on allele frequency \(p\),

\[ F = 1 - \frac{H_{obs}}{2p(1-p)} \] where \(H_{obs}\) is the frequency of observed heterozygotes and the denominator is the expected frequency if \(f = 0\).
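As a minimal sketch (this helper is mine, not from clarkFunctions2024.r), \(F\) can be computed from a vector of genotype counts \((n_{aa}, n_{ab}, n_{bb})\):

fhat <- function(counts){                  # counts = c(naa, nab, nbb)
  n    <- sum(counts)
  hObs <- counts[2]/n                      # observed heterozygote fraction
  p    <- (2*counts[1] + counts[2])/(2*n)  # frequency of allele a
  1 - hObs/( 2*p*(1 - p) )                 # excess homozygosity
}
fhat( c(18, 24, 8) )                       # hypothetical counts with no excess
## [1] 0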

I begin by simulating a data set. Simulated genotypes for a pair of \((p, f)\) values and population size 50 can be generated from the multinomial as

m  <- 50                      # individuals in sample
p  <- 0.6                     # frequency of allele a
f  <- -.2                     # inbreeding coefficient
pr <- pmake( c(p, f) )        # values for (Paa,Pab,Pbb)
y  <- rmultinom( 10, m, pr )  # 10 random vectors of size m
gnames <-  c( 'Paa', 'Pab', 'Pbb' )
rownames( y ) <- gnames
y
##     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## Paa   15   16   16   18    9   21   16   16   10    22
## Pab   30   28   33   29   35   23   29   27   35    24
## Pbb    5    6    1    3    6    6    5    7    5     4

The three rows hold the number of individuals having each of the three genotypes. It is important not to select a value of \(f\) that is incompatible with \(p\), which can be checked using minf(p); for p = 0.6, minf(0.6) = -0.67.

I used a value of \(f = -0.2\), which is above this lower limit. For \(n = 20\) populations, with \(m\) individuals per population, a simulated data set can be generated with

npop <- 20                      # no. populations
y    <- rmultinom(npop, m, pr)    # n random vectors of size m

To see that our f is approximated by the inbreeding coefficient \(F\), here it is for this simulated data set:

# inbreeding coefficient
fObs  <- sum( y[2,] )/sum(y)              # fraction of heterozygotes
pObs  <- sum(2*y[1,] + y[2,])/2/sum(y)    # obs p (copies of a)
fExp  <- 2*pObs*(1 - pObs)                # expected fraction if f = 0
Fstat <- 1 - fObs/fExp

Fstat can be taken as an estimate \(\hat{f}\) of \(f\).

Metropolis for two parameters

The goal is to estimate the frequency of allele \(a\), i.e., \(p\), and inbreeding \(f\), based on simulated data for the three genotypes, while imposing prior dependence. A Metropolis step works well for estimating parameters where prior dependence is required.

I can impose prior dependence at the proposal stage, as shown in the algorithm below.

Again, the simulated data y is a matrix, each column of which contains three values, one for each of the three genotypes. I now want to sample from the posterior for this model. The likelihood for population i is

\[\begin{aligned} \mathbf{y}_i &\sim multinom(n_i, \boldsymbol{\theta}_i) \propto \prod_{k=1}^3 \theta_{ik}^{y_{ik}} \\ \boldsymbol{\theta}_i &= \left( P_{aa}, P_{ab}, P_{bb} \right)_i \end{aligned}\]
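Up to the multinomial coefficient, which does not involve the parameters, this log likelihood for one population can be checked against R's dmultinom (the two values differ by an additive constant):

sum( y[,1]*log( pmake( c(p, f) ) ) )       # log likelihood, population 1
dmultinom( y[,1], prob = pmake( c(p, f) ), log = TRUE )   # includes the coefficient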

The three values in \(\boldsymbol{\theta}_i\) depend on the allele frequency and inbreeding. Including the priors for these parameters and the likelihood taken over all \(n\) populations, the posterior is

\[[p,f|\mathbf{Y}] \propto \prod_{i=1}^n multinom(n_i,\boldsymbol{\theta}_i) \, beta(p|p_1, p_2) \, N(f|f_1, f_2) \, I(f_l(p) < f < 1)\]

Note the normal prior for \(f\), truncated below at \(f_l(p)\) and above at 1. The advantage of this prior is that I can propose from a symmetric distribution while building in the prior dependence between \(f\) and \(p\). I will use Metropolis to draw samples from this posterior.

Metropolis acceptance

Recall the acceptance criterion for Metropolis,

\[ a = \frac{[ (p, f)^*]}{[ (p, f)^{(g)}]} \] where \((p, f)^{(g)}\) are the current parameter values, and \((p, f)^*\) are the proposed new values. The distribution is the posterior, i.e., prior times likelihood,

\[ [ (p, f)^*] \propto \prod_{i=1}^n multinom(n_i,\boldsymbol{\theta}^*_i) \, beta(p^*|p_1, p_2) \, N(f^*|f_1, f_2) \, I(f_l(p^*) < f^* < 1) \] Again, \(\boldsymbol{\theta}^* = (P_{aa}, P_{ab}, P_{bb})\) is the vector of genotype frequencies that comes from the proposed parameters \(p^*\) and \(f^*\). I have simply plugged the proposed parameters into the prior and likelihood.


A Metropolis algorithm

Here are some prior and initial parameter values

g1 <- g2 <- 1    # beta prior parameters for p
priorF   <- 0    # prior mean for f
priorFSD <- 1    # prior sd for f
fg       <- .1   # initial value of f
pg       <- .5   # initial value of p


First propose values of \(p\) and \(f\) centered on current values pg and fg. Here is a way this can be done using truncated normals

propp <- .tnorm(1, 0, 1, pg, .002)      # propose p close to current value pg
fl    <- minf(propp)                    # lower f limit given proposed p
propf <- .tnorm(1, fl, 1, fg, .01)      # propose f > fl close to current value fg
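The function .tnorm comes from clarkFunctions2024.r. In case that file is unavailable, here is a minimal stand-in, assuming the signature suggested by the calls above (n draws, bounds lo and hi, mean mu, standard deviation sg), using inverse-CDF sampling:

.tnorm <- function(n, lo, hi, mu, sg){     # truncated normal draws on (lo, hi)
  zlo <- pnorm(lo, mu, sg)                 # CDF at lower bound
  zhi <- pnorm(hi, mu, sg)                 # CDF at upper bound
  qnorm( runif(n, zlo, zhi), mu, sg )      # invert uniform draws between bounds
}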


The log posterior is proportional to the sum of the log multinomial, the beta prior for \(p\), and the normal prior for \(f\). This is done in the function pf.update(y, p, f),

pf.update <- function(y, p, f){               # log posterior
  multlik(y, c(p,f) ) + dbeta( p, g1, g2, log=T) + 
    dnorm(f, priorF, priorFSD, log=T)
}

multlik <- function(y, pars){                 # log multinom likelihood
  npop <- ncol(y)
  p    <- pmake(pars)
  pmat <- matrix(rep(p,npop),3,npop)
  sum(y*log(pmat))
}

update_pf <- function(y, pg, fg){     # metropolis update
  ac     <- 0
  propp  <- .tnorm(1,.02,.98,pg,.002) # propose pa
  fl     <- minf(propp)               # lower f limit for pa
  propf  <- .tnorm(1,fl,1,fg,.01)     # propose f > fl
  pnew   <- pf.update(y, propp, propf)
  pnow   <- pf.update(y, pg, fg)
  a      <- exp(pnew - pnow)          # acceptance criterion
  z      <- runif(1,0,1)
  if(z < a){
    pg <- propp
    fg <- propf
    ac <- 1
  }
  list(pg = pg, fg = fg, accept = ac)
}
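A single call advances the chain by one step from the current values and reports whether the proposal was accepted:

update_pf( y, pg = .5, fg = .1 )      # returns list(pg, fg, accept)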


The Metropolis algorithm is used to propose and accept values based on probabilities calculated using this function. Here it is for multiple chains (outer j loop) with a nested inner loop for a single chain:

ng     <- 5000
nchain <- 10
nt     <- 500
thin   <- round(seq(1, ng, length=nt))  

pgibbs <- matrix(0,nt,nchain)
fgibbs <- pgibbs
accept <- 0

for(j in 1:nchain){
    
  pg <- rbeta(1,1,1)   #draw initial values from prior
  lg <- minf(pg)
  fg <- .tnorm(1,lg,1,priorF,priorFSD)
  k  <- 0
    
  for(g in 1:ng){
    
    pf <- update_pf(y, pg, fg)
    pg <- pf$pg
    fg <- pf$fg
    accept <- accept + pf$accept      # count accepted proposals
  
    if(g %in% thin){
      k <- k + 1
      pgibbs[k,j] <- pg
      fgibbs[k,j] <- fg
    }
  }
}
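The overall acceptance rate, accumulated in the loop, indicates whether the proposal standard deviations need tuning: very low rates mean proposals are too bold, while rates near one mean steps are too timid:

accept/( ng*nchain )     # fraction of proposals accepted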


Each column of pgibbs and fgibbs is a different simulation starting from different initial values. Here is a plot showing all chains:

par( mfrow = c(2,1), mar = c(4,4,1,1), bty='n', las = 1 )
plot( fgibbs[,1], type='l', ylim = c(-1,1), ylab='f' )
for( j in 2:nchain )lines(fgibbs[,j] )
abline(h = f, lty=2 )

plot( pgibbs[,1],type = 'l',ylim = c(0,1), ylab = 'p' )
for( j in 2:nchain )lines(pgibbs[,j] )
abline( h = p, lty = 2)
*Multiple chains for f and p.*


The true values used to simulate the data are shown as dashed lines.
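With multiple chains, convergence can be checked formally; for example, assuming the coda package is installed, the Gelman-Rubin diagnostic compares within- and between-chain variance (values near 1 indicate convergence):

library( coda )
chainList <- mcmc.list( lapply( 1:nchain, function(j)
  mcmc( cbind( p = pgibbs[,j], f = fgibbs[,j] ) ) ) )
gelman.diag( chainList )              # potential scale reduction factors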

Here is a plot showing the joint distribution bounded by the prior dependence, with the convergence from my horrible initial values:

plot(pgibbs, fgibbs, xlim = c(0,1), ylim = c(-1,1), cex = .2, xlab = 'p', ylab = 'f')
lines(pseq, minf(pseq) )
*Progress of the MCMC through parameter space for multiple chains from different initial values.*


Here are density plots with all chains analyzed together, omitting a burn-in of 1000 iterations (the first 100 thinned rows):

fp <- cbind( as.vector(fgibbs[-c(1:100),]), as.vector(pgibbs[-c(1:100),]) )
colnames(fp) <- c('f', 'p')
chain2density( fp, trueValues = c(f, p)  )
*Posterior distribution for (f, p).*

Here is a summary:

chain2tab( fp )
##   estimate      SE  CI_025  CI_975 sig95
## f  -0.1796 0.03026 -0.2395 -0.1192
## p   0.6042 0.00975  0.5857  0.6239


Note that I passed to chain2density and chain2tab the matrix of chains transformed into vectors, omitting the burn-in rows.

Exercise. Simulate populations of size n = 5, 10, 20 individuals with positive inbreeding. Use 10 chains to see if you can recover parameter estimates. Is recovery better, worse, or the same as with negative inbreeding?

Multiple loci

Inbreeding is not limited to a single locus–when inbreeding occurs, it affects the whole genome. At each locus there can be multiple alleles. The parameter \(f\) would generally be evaluated over multiple loci. Here is an analysis involving 10 loci, each with two alleles and a different frequency \(p\). In this example there is out-breeding, a tendency to avoid mating with related individuals (\(f = -0.3\)).

*Chains for f and p at multiple loci, each with a different frequency for two alleles.*

Colors identify the different loci. True values used to simulate data are dashed lines.

*Above are posterior densities for (f, p). Vertical lines indicate true values. Below left is progress of the MCMC through parameter space for multiple chains from different initial values for multiple alleles. Each converges to the same f, but a different p. Shaded backgrounds at left bound parameter space at f = -0.3.*


The allowable frequencies are bounded away from zero and one when there is outbreeding (f < 0). This region is shaded in the figure.

The dashed line in this figure shows the true value of f used to simulate data.

Gibbs sampling and Metropolis as MCMC techniques

Both Gibbs sampling and Metropolis are MCMC techniques–both draw from the posterior distribution conditioned on the current values of parameters. Gibbs sampling refers to the case where each MCMC step includes several pieces, one for each conditional distribution. In this example, I updated both parameters in a single step. It can be efficient to do this, especially when proposals can incorporate information about the posterior distribution.

Sampling from those conditional distributions can be direct when a solution is available for the conditional. It can be indirect when there is no solution, in which case proposals are accepted or rejected. Metropolis is one of the most common indirect sampling methods. When available, direct sampling has the advantage of high efficiency.


Recap

Posterior simulation is done as a last resort, when a posterior distribution cannot be solved by conventional methods. Markov chain Monte Carlo (MCMC) algorithms can be developed for a broad range of inference problems. The most important technique is Gibbs sampling, which allows simulation based on conditional distributions.

The concept of conjugacy is important for determining when a prior \(\times\) likelihood can be sampled directly (efficient) versus indirectly, through methods such as Metropolis. Indirect sampling is less efficient, because it involves proposals, which can be hard to match with the posterior distribution.
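As a reminder of what direct sampling looks like, consider a hypothetical conjugate example: with \(y\) successes in \(n\) binomial trials and a \(beta(p_1, p_2)\) prior, the posterior is \(beta(p_1 + y, p_2 + n - y)\), which can be sampled directly:

yb <- 7; nb <- 10                     # hypothetical binomial data
b1 <- b2 <- 1                         # beta prior parameters
psamp <- rbeta( 5000, b1 + yb, b2 + nb - yb )   # direct posterior draws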

Prior dependence describes the situation where the prior distribution on one parameter may depend on the currently imputed value of another, e.g., \(\beta_1 > \beta_2\). With Metropolis sampling, prior dependence may be achieved through proposals.