In this notebook, we’ll see how to simulate DNA evolution along a branch of arbitrary length, under the Jukes and Cantor (1969) model. Then we will compute the likelihood of a phylogenetic tree using such simulations.
Simulating along a branch
# Useful function for drawing a DNA state, for instance at the root
drawNewState <- function (){
return(sample(x=c("A","C","G","T"), size=1))
}
# Function to simulate along a branch using the Gillespie algorithm under the Jukes Cantor model (1969)
# In this case, we consider that there can be a mutation towards the same state.
# So the sum of the rates is 0.25*4=1.0
simulateAlongBranch <- function(startingState, branchLength) {
l <- 0.0
current <- startingState
while (l < branchLength) {
l <- l + rexp(rate=1.0, n=1) # 1.0 is the sum of the rates of all possible events
if (l < branchLength) {
current <- drawNewState()
}
}
return (current)
}
This way we can simulate along a branch. Let’s do it several times, for different branch lengths. First we implement a function to do that:
simulateSitesAndCollectFrequencies <- function (startingState, branchLength, numSites) {
simulatedStates = c()
for (i in 1:numSites) {
simulatedStates = c(simulatedStates, simulateAlongBranch(startingState, branchLength))
}
return(table(simulatedStates)/numSites)
}
Then we do the simulations. Branch length 0.1:
simulateSitesAndCollectFrequencies("A", 0.1, 1000)
simulatedStates
A C G T
0.926 0.024 0.029 0.021
Then we do the simulations. Branch length 1:
simulateSitesAndCollectFrequencies("A", 1, 1000)
simulatedStates
A C G T
0.549 0.145 0.148 0.158
Then we do the simulations. Branch length 10:
simulateSitesAndCollectFrequencies("A", 10, 1000)
simulatedStates
A C G T
0.254 0.217 0.261 0.268
Do you find those proportions expected or surprising?
Computing the likelihood of a tree using simulations
# Compute the probability of a tree
computeTreeProbabilityBySimulating <- function(bl_a1, bl_c, bl_a2, bl_y) {
Nsuccess = 0
Nreps = 10000
iter = 0
while (iter < Nreps) {
x <- drawNewState()
y <- simulateAlongBranch(x, bl_y)
A1 <- simulateAlongBranch(y, bl_a1)
C <- simulateAlongBranch(y, bl_c)
A2 <- simulateAlongBranch(x, bl_a2)
if (A1=="A" && C=="C" && A2=="A") {
Nsuccess <- Nsuccess + 1
}
iter <- iter + 1
}
return (Nsuccess/Nreps)
}
p = computeTreeProbabilityBySimulating(0.1, 0.1, 0.4, 0.3)
print(paste ("Likelihood of ((A:0.1, C:0.1):0.3, A:0.4); for pattern {A,C,A}: ", p, sep=""))
[1] "Likelihood of ((A:0.1, C:0.1):0.3, A:0.4); for pattern {A,C,A}: 0.0039"
print(paste("Log-likelihood of ((A:0.1, C:0.1):0.3, A:0.4); for pattern {A,C,A}: ", log(p), sep=""))
[1] "Log-likelihood of ((A:0.1, C:0.1):0.3, A:0.4); for pattern {A,C,A}: -5.54677872584654"
Comparison to likelihood computed with usual software
In common software, substitutions to the same state are not considered, and therefore branch lengths need to be rescaled. Indeed, branch lengths correspond to expected numbers of substitutions: in our case, we expect 4/3 times more substitutions than in common software. Therefore, to compare the likelihoods, we have to run common software on branch lengths that are 3/4 those of the tree, i.e.: ((A1:0.075, C:0.075):0.225, A2:0.3); When I do that, I get : loglk=-5.480185. So the Gillespie sampling approach is not too far from the Felsenstein algorithm exact approach.
LS0tCnRpdGxlOiAiU2ltdWxhdGlvbiBvZiBhIEROQSBzaXRlIGZvciBsaWtlbGlob29kIGNvbXB1dGF0aW9uIgpvdXRwdXQ6IGh0bWxfbm90ZWJvb2sKLS0tCgpJbiB0aGlzIG5vdGVib29rLCB3ZSdsbCBzZWUgaG93IHRvIHNpbXVsYXRlIEROQSBldm9sdXRpb24gYWxvbmcgYSBicmFuY2ggb2YgYXJiaXRyYXJ5IGxlbmd0aCwgdW5kZXIgdGhlIEp1a2VzIGFuZCBDYW50b3IgKDE5NjkpIG1vZGVsLiBUaGVuIHdlIHdpbGwgY29tcHV0ZSB0aGUgbGlrZWxpaG9vZCBvZiBhIHBoeWxvZ2VuZXRpYyB0cmVlIHVzaW5nIHN1Y2ggc2ltdWxhdGlvbnMuCgoKIyMgU2ltdWxhdGluZyBhbG9uZyBhIGJyYW5jaAoKYGBge3J9CgoKIyBVc2VmdWwgZnVuY3Rpb24gZm9yIGRyYXdpbmcgYSBETkEgc3RhdGUsIGZvciBpbnN0YW5jZSBhdCB0aGUgcm9vdApkcmF3TmV3U3RhdGUgPC0gZnVuY3Rpb24gKCl7CiAgICByZXR1cm4oc2FtcGxlKHg9YygiQSIsIkMiLCJHIiwiVCIpLCBzaXplPTEpKQp9CgojIEZ1bmN0aW9uIHRvIHNpbXVsYXRlIGFsb25nIGEgYnJhbmNoIHVzaW5nIHRoZSBHaWxsZXNwaWUgYWxnb3JpdGhtIHVuZGVyIHRoZSBKdWtlcyBDYW50b3IgbW9kZWwgKDE5NjkpCiMgSW4gdGhpcyBjYXNlLCB3ZSBjb25zaWRlciB0aGF0IHRoZXJlIGNhbiBiZSBhIG11dGF0aW9uIHRvd2FyZHMgdGhlIHNhbWUgc3RhdGUuCiMgU28gdGhlIHN1bSBvZiB0aGUgcmF0ZXMgaXMgMC4yNSo0PTEuMApzaW11bGF0ZUFsb25nQnJhbmNoIDwtIGZ1bmN0aW9uKHN0YXJ0aW5nU3RhdGUsIGJyYW5jaExlbmd0aCkgewogICAgbCA8LSAwLjAKICAgIGN1cnJlbnQgPC0gc3RhcnRpbmdTdGF0ZQogICAgd2hpbGUgKGwgPCBicmFuY2hMZW5ndGgpIHsKICAgICAgICBsIDwtIGwgKyByZXhwKHJhdGU9MS4wLCBuPTEpICMgMS4wIGlzIHRoZSBzdW0gb2YgdGhlIHJhdGVzIG9mIGFsbCBwb3NzaWJsZSBldmVudHMKICAgICAgICBpZiAobCA8IGJyYW5jaExlbmd0aCkgewogICAgICAgICAgICBjdXJyZW50IDwtIGRyYXdOZXdTdGF0ZSgpCiAgICAgICAgfQogICAgfQogICAgcmV0dXJuIChjdXJyZW50KQp9CgpgYGAKCgpUaGlzIHdheSB3ZSBjYW4gc2ltdWxhdGUgYWxvbmcgYSBicmFuY2guIExldCdzIGRvIGl0IHNldmVyYWwgdGltZXMsIGZvciBkaWZmZXJlbnQgYnJhbmNoIGxlbmd0aHMuIApGaXJzdCB3ZSBpbXBsZW1lbnQgYSBmdW5jdGlvbiB0byBkbyB0aGF0OgoKYGBge3J9CnNpbXVsYXRlU2l0ZXNBbmRDb2xsZWN0RnJlcXVlbmNpZXMgPC0gZnVuY3Rpb24gKHN0YXJ0aW5nU3RhdGUsIGJyYW5jaExlbmd0aCwgbnVtU2l0ZXMpIHsKICBzaW11bGF0ZWRTdGF0ZXMgPSBjKCkKICBmb3IgKGkgaW4gMTpudW1TaXRlcykgewogICAgc2ltdWxhdGVkU3RhdGVzID0gYyhzaW11bGF0ZWRTdGF0ZXMsIHNpbXVsYXRlQWxvbmdCcmFuY2goc3RhcnRpbmdTdGF0ZSwgYnJhbmNoTGVuZ3RoKSkKICB9ICAKICByZXR1cm4odGFibGUoc2ltdWxhdGVkU3RhdGVzKS9udW1TaXRlcykKfQoKYGBgCgpUaGVuIHdlIGRvIHRoZSBzaW11bGF0aW9ucy4gQnJhbmNoIGxlbmd0aCAwLjE6CmBgYHtyfQpzaW11bGF0ZVNpdGVzQW5kQ29sbGVjdEZyZXF1ZW5jaWVzKCJBIiwgMC4xLCAxMDAwKQoKYGBgCgpUaGVuIHdlIGRvIHRoZSBzaW11bGF0aW9ucy4gQnJhbmNoIGxlbmd0aCAxOgpgYGB7cn0Kc2ltdWxhdGVTaXRlc0FuZENvbGxlY3RGcmVxdWVuY2llcygiQSIsIDEsIDEwMDApCgpgYGAKClRoZW4gd2UgZG8gdGhlIHNpbXVsYXRpb25zLiBCcmFuY2ggbGVuZ3RoIDEwOgpgYGB7cn0Kc2ltdWxhdGVTaXRlc0FuZENvbGxlY3RGcmVxdWVuY2llcygiQSIsIDEwLCAxMDAwKQoKYGBgCgpEbyB5b3UgZmluZCB0aG9zZSBwcm9wb3J0aW9ucyBleHBlY3RlZCBvciBzdXJwcmlzaW5nPwoKCiMjIENvbXB1dGluZyB0aGUgbGlrZWxpaG9vZCBvZiBhIHRyZWUgdXNpbmcgc2ltdWxhdGlvbnMKCmBgYHtyfQoKIyBDb21wdXRlIHRoZSBwcm9iYWJpbGl0eSBvZiBhIHRyZWUKY29tcHV0ZVRyZWVQcm9iYWJpbGl0eUJ5U2ltdWxhdGluZyA8LSBmdW5jdGlvbihibF9hMSwgYmxfYywgYmxfYTIsIGJsX3kpIHsKICAgIE5zdWNjZXNzID0gMAogICAgTnJlcHMgPSAxMDAwMAogICAgaXRlciA9IDAKICAgIHdoaWxlIChpdGVyIDwgTnJlcHMpIHsKICAgICAgICB4IDwtIGRyYXdOZXdTdGF0ZSgpCiAgICAgICAgeSA8LSBzaW11bGF0ZUFsb25nQnJhbmNoKHgsIGJsX3kpCiAgICAgICAgQTEgPC0gc2ltdWxhdGVBbG9uZ0JyYW5jaCh5LCBibF9hMSkKICAgICAgICBDIDwtIHNpbXVsYXRlQWxvbmdCcmFuY2goeSwgYmxfYykKICAgICAgICBBMiA8LSBzaW11bGF0ZUFsb25nQnJhbmNoKHgsIGJsX2EyKQogICAgICAgIGlmIChBMT09IkEiICYmIEM9PSJDIiAmJiBBMj09IkEiKSB7CiAgICAgICAgICAgIE5zdWNjZXNzIDwtIE5zdWNjZXNzICsgMQogICAgICAgIH0KICAgICAgICBpdGVyIDwtIGl0ZXIgKyAxCiAgICB9CiAgICByZXR1cm4gKE5zdWNjZXNzL05yZXBzKQp9CgoKcCA9IGNvbXB1dGVUcmVlUHJvYmFiaWxpdHlCeVNpbXVsYXRpbmcoMC4xLCAwLjEsIDAuNCwgMC4zKQpwcmludChwYXN0ZSAoIkxpa2VsaWhvb2Qgb2YgKChBOjAuMSwgQzowLjEpOjAuMywgQTowLjQpOyBmb3IgcGF0dGVybiB7QSxDLEF9OiAiLCBwLCBzZXA9IiIpKQpwcmludChwYXN0ZSgiTG9nLWxpa2VsaWhvb2Qgb2YgKChBOjAuMSwgQzowLjEpOjAuMywgQTowLjQpOyBmb3IgcGF0dGVybiB7QSxDLEF9OiAiLCBsb2cocCksIHNlcD0iIikpCgpgYGAKCgojIyBDb21wYXJpc29uIHRvIGxpa2VsaWhvb2QgY29tcHV0ZWQgd2l0aCB1c3VhbCBzb2Z0d2FyZQoKSW4gY29tbW9uIHNvZnR3YXJlLCBzdWJzdGl0dXRpb25zIHRvIHRoZSBzYW1lIHN0YXRlIGFyZSBub3QgY29uc2lkZXJlZCwgYW5kIHRoZXJlZm9yZSBicmFuY2ggbGVuZ3RocyBuZWVkIHRvIGJlIHJlc2NhbGVkLiBJbmRlZWQsIGJyYW5jaCBsZW5ndGhzIGNvcnJlc3BvbmQgdG8gZXhwZWN0ZWQgbnVtYmVycyBvZiBzdWJzdGl0dXRpb25zOiBpbiBvdXIgY2FzZSwgd2UgZXhwZWN0IDQvMyB0aW1lcyBtb3JlIHN1YnN0aXR1dGlvbnMgdGhhbiBpbiBjb21tb24gc29mdHdhcmUuIFRoZXJlZm9yZSwgdG8gY29tcGFyZSB0aGUgbGlrZWxpaG9vZHMsIHdlIGhhdmUgdG8gcnVuIGNvbW1vbiBzb2Z0d2FyZSBvbiBicmFuY2ggbGVuZ3RocyB0aGF0IGFyZSAzLzQgdGhvc2Ugb2YgdGhlIHRyZWUsIGkuZS46ICgoQTE6MC4wNzUsIEM6MC4wNzUpOjAuMjI1LCBBMjowLjMpOyBXaGVuIEkgZG8gdGhhdCwgSSBnZXQgOiBsb2dsaz0tNS40ODAxODUuIFNvIHRoZSBHaWxsZXNwaWUgc2FtcGxpbmcgYXBwcm9hY2ggaXMgbm90IHRvbyBmYXIgZnJvbSB0aGUgRmVsc2Vuc3RlaW4gYWxnb3JpdGhtIGV4YWN0IGFwcHJvYWNoLgoK