MCMC is the computation tool used to evaluate hierarchical models. Examples here range from gene frequencies and inbreeding depression to continental distribution and abundance. Interpretation focusses on the posterior distribution, moving from MCMC chains to parameter summaries and predictions.
Inbred Soay sheep individuals are culled with age. Inbred individuals (high values of \(F_{ROH}\)) in are missing from higher age classes (a), because they have low survival (b, c) (From Stoffel et al., 2021).
Genetic architecture and lifetime dynamics of inbreeding depression in a wild mammal
source('clarkFunctions2024.r')
It can be the case that the allowable range for one parameter depends on the value of a second parameter. This prior dependence can be incorporated into the proposals.
Based on counts of genotypes for two alleles at one locus, I want to estimate the frequency of each allele, which depends on inbreeding. An individual can be homozygous for either allele, or it can be heterozygous, having one copy of each. The greater the degree of inbreeding the lower the fraction of heterozygotes. Example of isolation effects on inbreeding and fitness include the Isle Royale wolves Robinson et al. 2019.
In this example, I use a Metropolis step to estimate gene frequency and the inbreeding coefficient, employing multiple chains. First, recall that I have tended to specify a prior distribution that is independent for multiple parameters, e.g., \([\theta_1, \theta_2]\) is given a prior \([\theta_1][\theta_2]\). This prior does not mean that I think that parameter estimates are independent. What it really means is that I might have no idea what the prior dependence should be or, if I did know, I would not know how to specify it. When my prior distribution is independent, it just means that their posterior relationship is coming from the data, i.e., the likelihood.
Genotypes shed light on inbreeding, because heterozygotes decline as inbreeding increases.
With random mating (\(f = 0\)), the expected frequency of genotypes for two alleles at one locus is
\[\begin{aligned} P_{aa} &= p^2 \\ P_{ab} + P_{ba} &= 2p(1 - p) \\ P_{bb} &= (1 - p)^2 \end{aligned}\] where \(p\) is the frequency of allele \(a\), and \(1 - p\) is the frequency of \(b\). In this case there are only two alleles at this locus.
With a tendency to in-breed or out-breed (\(f \neq 0\)), the frequencies become
\[\begin{aligned} P_{aa} &= p^2 + fp(1 - p) \\ P_{ab} &= 2p(1 - p)(1 - f) \\ P_{bb} &= (1 - p)^2 + fp(1 - p) \end{aligned}\]
With inbreeding (\(f > 0\)), both homozygotes (\(P_{aa}, P_{bb}\)) increase due to the second term that includes \(f\). At the same time, the factor \((1 - f)\) reduces the heterozygote fraction. With complete inbreeding (\(f = 1\)) heterozygotes disappear completely,
\[\begin{aligned} P_{aa} &= p^2 + p(1 - p) \\ P_{ab} &= 0 \\ P_{bb} &= (1 - p)^2 + p(1 - p) \end{aligned}\]
I have written a function that evaluates these values from a vector
\((p, f)\) in
clarkFunctions2024.r, pmake( pars ), where
pars = c(p, f). pmake returns the frequencies
\((P_{aa}, P_{ab}, P_{bb})\),
pmake <- function(pars){
if( is.list(pars) ){
p = pars$p
f = pars$f
}else{
p <- pars[1] # frequency of a
f <- pars[2] # inbreeding coefficient
}
paa <- p^2 + f*p*(1 - p) # frequency of paa
pab <- 2*p*(1 - p)*(1 - f) # frequency of pab
pbb <- (1 - p)^2 + f*p*(1 - p) # frequency of pbb
if( length( paa ) == 1) return( c(paa,pab,pbb) )
cbind( paa, pab, pbb )
}
For p = 0.5, here is the difference between genotype
frequencies for f = 0 and f = 0.2,
pmake( c(.5, 0) )
## [1] 0.25 0.50 0.25
pmake( c(.5, .2) )
## [1] 0.3 0.4 0.3
Note again how inbreeding increases the proportion of homozygotes. Likewise, a tendency to avoid inbreeding increases the frequency of heterozygotes,
pmake( c(.5,-.2) )
## [1] 0.2 0.6 0.2
In the case of inbreeding and allele frequency, I cannot avoid specifying prior dependence, because prior independence would give impossible values–the range of possible allele frequencies depends on inbreeding, and vice versa. This example demonstrates prior dependence in parameter values.
I know how to write a distribution for \([\theta_1, \theta_2]\) (jointly) only if I factor it into a conditional and marginal, \([\theta_1| \theta_2][\theta_2]\). For gene frequency and the inbreeding coefficient I use the prior distribution to specify a conditional relationship between two parameters, the inbreeding coefficient \(f\) and the frequency of allele \(a\), which is a parameter \(p\),
\[[f, p] = [f | p][p]\] I can impose this conditional relationship at the proposal stage of the Metropolis step. First, just a tiny review of genotypes for two alleles at one locus.
Prior dependence comes from the fact that some values of \(p\) cannot occur (at equilibrium) if \(f\) is low. This is the reason why allele frequencies shed light on inbreeding. The inbreeding coefficient can be negative and has a lower limit \(f_l\) that depends on \(p\),
\[
f_l(p) =
\begin{cases}
-(1 - p)/p, & p > (1 - p) \\
-p/(1 - p), & p \leq (1 - p) \\
\end{cases}
\] \(f\) can be as low as \(-1\) for equally represented alleles \((p = 0.5)\) and increases to \(0\) as either allele dominates (\(p = 0\) or \(1\)). A function that
determines this limit is minf(p), where the argument
p is the parameter \(p\),
minf <- function(p){ # minimum inbreeding coefficient
lof <- rep( -1, length(p) )
lop <- -(1 - p)/p
hip <- -p/(1 - p)
lof[ p > (1 - p) ] <- lop[ p > (1 - p) ]
lof[ p < (1 - p) ] <- hip[ p < (1 - p) ]
lof
}
Here is a plot of the minimum allowable values of \(f\),
pseq <- seq(.01,.99,length=100)
plot(pseq, minf(pseq), type='l', xlab='p',ylab='minimum f', ylim = c(-1,1))
Minimum inbreeding coefficient at different equilibrium frequencies of allele a. Negative f means inbreeding avoidance.
The inbreeding coefficient \(F\) identifies an excess of homozygotes over that expected based on allele frequency \(p\),
\[ F = 1 - \frac{H_{obs}}{2p(1-p)} \] where \(H_{obs}\) is the frequency of observed heterozygotes and the denominator is the expected frequency if \(f = 0\).
I begin by simulating a data set. Simulated genotypes for a pair of \((p, f)\) values and population size 50 can be generated from the multinomial as
m <- 50 # individuals in sample
p <- 0.6 # frequency of allele a
f <- -.2 # inbreeding coefficient
pr <- pmake( c(p, f) ) # values for (Paa,Pab,Pbb)
y <- rmultinom( 10, m, pr ) # a random vector of size m
gnames <- c( 'Paa', 'Pab', 'Pbb' )
rownames( y ) <- gnames
y
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## Paa 15 16 16 18 9 21 16 16 10 22
## Pab 30 28 33 29 35 23 29 27 35 24
## Pbb 5 6 1 3 6 6 5 7 5 4
The three rows represent the number of individuals having each of
three genotypes. It is important not to select a value of \(f\) that is incompatible with \(p\), which can be checked using
minf(p) = 0.6.
I used a value of \(f = -0.2\), which is above this lower limit. For \(n = 1\) populations, with \(m\) individuals per population, a simulated data set could be generated with
npop <- 20 # no. populations
y <- rmultinom(npop, m, pr) # n random vectors of size m
To see that our f is approximated by the inbreeding
coefficient \(F\), here it is for this
simulated data set:
# inbreeding coefficient
fObs <- sum( y[2,] )/sum(y) # fraction of heterozygotes
pObs <- sum(2*y[1,] + y[2,])/2/sum(y) # obs p (copies of a)
fExp <- 2*pObs*(1 - pObs) # expected fraction if f = 0
Fstat <- 1 - fObs/fExp
Fstat can be taken as an estimate of \(\hat{f}\).
The goal is to estimate the frequency of allele \(a\), i.e., \(p\), and inbreeding \(f\), based on simulated data for the three genotypes, while imposing prior dependence. A Metropolis step works well for estimating parameters where prior dependence is required.
I can impose prior dependence by proposing values this way:
Again, the simulated data y is a matrix, each column of
which contains three values, one for each of the three genotypes. I now
want to sample from the posterior for this model. The likelihood for
population i is
\[\begin{aligned} \mathbf{y}_i &\sim multinom(n_i,\boldsymbol{\theta})_i \propto \prod_{k=1}^3 \theta_{ik}^{y_{ik}} \\ \boldsymbol{\theta}_i &= \left( P_{aa}, P_{ab}, P_{bb} \right)_i \end{aligned}\]
The three values in \(\boldsymbol{\theta}_i\) depend on the allele frequencies and inbreeding. Including all priors for these parameters and the likelihood taken over all n populations, the posterior is
\[[p,f|\mathbf{Y}] \propto \prod_{i=1}^m multinom(n_i,\boldsymbol{\theta}_i) beta(p|p_1, p_2)N(f|f_1, f_2)I(f_l(p) < f < 1)\]
Note the normal prior for \(f\) truncated at lower and upper values. The advantage of this prior is that I can propose from a symmetric distribution, while building in the prior dependence between \(f\) and \(p\). I will use Metropolis to draw samples for this Metropolis simulation.
Recall the acceptance criterion for Metropolis,
\[ a = \frac{[ (p, f)^*]}{[ (p, f)^{(g)}]} \] where \((p, f)^{(g)}\) is the current parameter values, and \((p, f)^*\) is the proposed new values. The distribution is the posterior, i.e., prior times likelihood,
\[ [ (p, f)^*] \propto \prod_{i=1}^m multinom(n_i,\boldsymbol{\theta}_i) beta(p^*|p_1, p_2)N(f^*|f_1, f_2)I(f_l(p) < f^* < 1) \] Again, \(\theta = (P_{aa}, P_{ab}, P_{bb})\) is the vector of genotype frequencies that comes from parameters \(p\) and \(f\). I have simply plugged into prior and likelihood the parameters that I proposed.
Here are some prior and initial parameter values
g1 <- g2 <- 1 # beta parameters for p
priorF <- 0 # mean and sd for f
priorFSD <- 1
fg <- .1 # initial values of f and p
pg <- .5
First propose values of \(p\) and
\(f\) centered on current values
pg and fg. Here is a way this can be done
using truncated normals
propp <- .tnorm(1, 0, 1, pg, .002) # propose p close to current value p
fl <- minf(propp) # lower f limit for pa
propf <- .tnorm(1, fl, 1, fg, .01) # propose f > fl close to current value fg
The log posterior is proportional to the sum of the log multinomial,
the beta prior for pa, and the normal prior for
f. This is done in the
function pf.update(p,f),
pf.update <- function(y, p, f){ # log posterior
multlik(y, c(p,f) ) + dbeta( p, g1, g2, log=T) +
dnorm(f, priorF, priorFSD, log=T)
}
multlik <- function(y, pars){ # log multinom likelihood
npop <- ncol(y)
p <- pmake(pars)
pmat <- matrix(rep(p,npop),3,npop)
sum(y*log(pmat))
}
update_pf <- function(y, pg, fg){ # metropolis update
ac <- 0
propp <- .tnorm(1,.02,.98,pg,.002) # propose pa
fl <- minf(propp) # lower f limit for pa
propf <- .tnorm(1,fl,1,fg,.01) # propose f > fl
pnew <- pf.update(y, propp, propf)
pnow <- pf.update(y, pg, fg)
a <- exp(pnew - pnow) # acceptance criterion
z <- runif(1,0,1)
if(z < a){
pg <- propp
fg <- propf
ac <- 1
}
list(pg = pg, fg = fg, accept = ac)
}
The Metropolis algorithm is used to propose and accept values based
on probabilities calculated using this function. Here it is
for multiple chains (outer j loop) with a nested inner loop
for a single chain:
ng <- 5000
nchain <- 10
nt <- 500
thin <- round(seq(1, ng, length=nt))
pgibbs <- matrix(0,nt,nchain)
fgibbs <- pgibbs
accept <- 0
for(j in 1:nchain){
pg <- rbeta(1,1,1) #draw initial values from prior
lg <- minf(pg)
fg <- .tnorm(1,lg,1,priorF,priorFSD)
k <- 0
for(g in 1:ng){
pf <- update_pf(y, pg, fg)
pg <- pf$pg
fg <- pf$fg
if(g %in% thin){
k <- k + 1
pgibbs[k,j] <- pg
fgibbs[k,j] <- fg
}
}
}
Each column of pgibbs and fgibbs is a
different simulation starting from different initial values. Here is a
plot showing all chains:
par( mfrow = c(2,1), mar = c(4,4,1,1), bty='n', las = 1 )
plot( fgibbs[,1], type='l', ylim = c(-1,1), ylab='f' )
for( j in 2:nchain )lines(fgibbs[,j] )
abline(h = f, lty=2 )
plot( pgibbs[,1],type = 'l',ylim = c(0,1), ylab = 'p' )
for( j in 2:nchain )lines(pgibbs[,j] )
abline( h = p, lty = 2)
Multiple chains for f and p.
The true values used to simulate the data are shown as dashed lines.
Here is a plot showing the joint distribution bounded by the prior dependence, with the convergence from my horrible initial values:
plot(pgibbs, fgibbs, xlim = c(0,1), ylim = c(-1,1), cex = .2, xlab = 'p', ylab = 'f')
lines(pseq, minf(pseq) )
Progress of the MCMC through parameter space for multiple chains from different initial values.
Here are density plots when all chains analyzed together, omitting a burnin of 1000:
fp <- cbind( as.vector(fgibbs[-c(1:100),]), as.vector(pgibbs[-c(1:100),]) )
colnames(fp) <- c('f', 'p')
chain2density( fp, trueValues = c(f, p) )
Posterior distribution for (f, p).
Here is a summary:
chain2tab( fp )
| estimate | SE | CI_025 | CI_975 | sig95 | |
|---|---|---|---|---|---|
| f | -0.1796 | 0.03026 | -0.2395 | -0.1192 |
|
| p | 0.6042 | 0.00975 | 0.5857 | 0.6239 |
|
Note that I passed to chain2density and
chains2tab the matrix of chains transformed
into vectors, omitting the burnin rows.
Exercise. Simulate populations of size n = 5, 10, 20 individuals with positive inbreeding. Use 10 chains to see if you can recover parameter estimates. Is recovery better, worse, or the same as with negative inbreeding.
Inbreeding is not limited to a single locus–when inbreeding occurs it
affects the genome. At each locus there can be multiple alleles. The
parameter f would generally be evaluated over multiple
loci. Here is an analysis that involves multiple loci (10), each with
two alleles and a different frequency p. In this example
there is out-breeding, a tendency to avoid mating with related
individuals (f = -0.3).
Chains for f and p at multiple loci, each with a different frequency for two alleles.
Colors identify the different loci. True values used to simulate data are dashed lines.
Above are posterior densities for (f, p). Vertical lines indicate true values. Below left is progress of the MCMC through parameter space for multiple chains from different initial values for multiple alleles. Each converges to the same f, but a different p. Shaded backgrounds at left bound parameter space at f = -0.3.
The allowable frequencies are bounded away from zero and one when
there is outbreeding (f < 0). This region is shaded at
right.
The dashed line in this figure shows the true value of f
used to simulate data.
Both Gibbs sampling and Metropolis are MCMC techniques–both draw from the posterior distribution that is conditioned on the the current values of parameters. Gibbs sampling refers to the case where each MCMC step includes several pieces, one for each conditional distribution. In this example, I updated both parameters in a single step. It can be efficient to do this especially when proposals can incorporate information about the posterior distribution.
Sampling from those conditional distributions can be direct when there is a solution available for the conditional. It can be indirect, when there is no solution, in which case there are proposals that can be accepted or not. Metropolis is one of the most common indirect sampling methods. When available, direct sampling has the advantage of high efficiency.
Posterior simulation is done as a last resort, when a posterior distribution cannot be solved by conventional methods. Markov chain Monte Carlo (MCMC) algorithms can be developed for a broad range of inference problems. The most important technique is Gibbs sampling, which allows simulation based on conditional distributions.
The concept of conjugacy is important for determining when a prior \(\times\) likelihood can be sampled directly (efficient) versus indirectly, through methods such as Metropolis. Indirect sampling is less efficient, because it involves proposals, which can be hard to match with the posterior distribution.
Prior dependence describes the situation where the prior distribution on one parameter may depend on the currently imputed value of another, e.g., \(\beta_1 > \beta_2\). With Metropolis sampling, prior dependence may be achieved through proposals.