In this notebook, we’ll see how to simulate DNA evolution along a branch of arbitrary length, under the Jukes and Cantor (1969) model.
Simulating along a branch
# Useful function for drawing a DNA state, for instance at the root
drawNewState <- function (){
return(sample(x=c("A","C","G","T"), size=1))
}
# Function to simulate along a branch using the Gillespie algorithm under the Jukes Cantor model (1969)
# In this case, we consider that there can be a mutation towards the same state.
# So the sum of the rates is 0.25*4=1.0
simulateAlongBranch <- function(startingState, branchLength) {
l <- 0.0
current <- startingState
while (l < branchLength) {
l <- l + rexp(rate=1.0, n=1) # 1.0 is the sum of the rates of all possible events
if (l < branchLength) {
current <- drawNewState()
}
}
return (current)
}
This way we can simulate along a branch. Let’s do it several times, for different branch lengths. First we implement a function to do that:
simulateSitesAndCollectFrequencies <- function (startingState, branchLength, numSites) {
simulatedStates = c()
for (i in 1:numSites) {
simulatedStates = c(simulatedStates, simulateAlongBranch(startingState, branchLength))
}
return(table(simulatedStates)/numSites)
}
Then we do the simulations. Branch length 0.1:
simulateSitesAndCollectFrequencies("A", 0.1, 1000)
simulatedStates
A C G T
0.931 0.028 0.018 0.023
Then we do the simulations. Branch length 1:
simulateSitesAndCollectFrequencies("A", 1, 1000)
simulatedStates
A C G T
0.503 0.191 0.156 0.150
Then we do the simulations. Branch length 10:
simulateSitesAndCollectFrequencies("A", 10, 1000)
simulatedStates
A C G T
0.255 0.239 0.255 0.251
Do you find those proportions expected or surprising?
LS0tCnRpdGxlOiAiU2ltdWxhdGlvbiBvZiBhIEROQSBzaXRlIgpvdXRwdXQ6IGh0bWxfbm90ZWJvb2sKLS0tCgpJbiB0aGlzIG5vdGVib29rLCB3ZSdsbCBzZWUgaG93IHRvIHNpbXVsYXRlIEROQSBldm9sdXRpb24gYWxvbmcgYSBicmFuY2ggb2YgYXJiaXRyYXJ5IGxlbmd0aCwgdW5kZXIgdGhlIEp1a2VzIGFuZCBDYW50b3IgKDE5NjkpIG1vZGVsLiAKCiMjIFNpbXVsYXRpbmcgYWxvbmcgYSBicmFuY2gKCgpgYGB7cn0KCgojIFVzZWZ1bCBmdW5jdGlvbiBmb3IgZHJhd2luZyBhIEROQSBzdGF0ZSwgZm9yIGluc3RhbmNlIGF0IHRoZSByb290CmRyYXdOZXdTdGF0ZSA8LSBmdW5jdGlvbiAoKXsKICAgIHJldHVybihzYW1wbGUoeD1jKCJBIiwiQyIsIkciLCJUIiksIHNpemU9MSkpCn0KCiMgRnVuY3Rpb24gdG8gc2ltdWxhdGUgYWxvbmcgYSBicmFuY2ggdXNpbmcgdGhlIEdpbGxlc3BpZSBhbGdvcml0aG0gdW5kZXIgdGhlIEp1a2VzIENhbnRvciBtb2RlbCAoMTk2OSkKIyBJbiB0aGlzIGNhc2UsIHdlIGNvbnNpZGVyIHRoYXQgdGhlcmUgY2FuIGJlIGEgbXV0YXRpb24gdG93YXJkcyB0aGUgc2FtZSBzdGF0ZS4KIyBTbyB0aGUgc3VtIG9mIHRoZSByYXRlcyBpcyAwLjI1KjQ9MS4wCnNpbXVsYXRlQWxvbmdCcmFuY2ggPC0gZnVuY3Rpb24oc3RhcnRpbmdTdGF0ZSwgYnJhbmNoTGVuZ3RoKSB7CiAgICBsIDwtIDAuMAogICAgY3VycmVudCA8LSBzdGFydGluZ1N0YXRlCiAgICB3aGlsZSAobCA8IGJyYW5jaExlbmd0aCkgewogICAgICAgIGwgPC0gbCArIHJleHAocmF0ZT0xLjAsIG49MSkgIyAxLjAgaXMgdGhlIHN1bSBvZiB0aGUgcmF0ZXMgb2YgYWxsIHBvc3NpYmxlIGV2ZW50cwogICAgICAgIGlmIChsIDwgYnJhbmNoTGVuZ3RoKSB7CiAgICAgICAgICAgIGN1cnJlbnQgPC0gZHJhd05ld1N0YXRlKCkKICAgICAgICB9CiAgICB9CiAgICByZXR1cm4gKGN1cnJlbnQpCn0KCmBgYAoKClRoaXMgd2F5IHdlIGNhbiBzaW11bGF0ZSBhbG9uZyBhIGJyYW5jaC4gTGV0J3MgZG8gaXQgc2V2ZXJhbCB0aW1lcywgZm9yIGRpZmZlcmVudCBicmFuY2ggbGVuZ3Rocy4gCkZpcnN0IHdlIGltcGxlbWVudCBhIGZ1bmN0aW9uIHRvIGRvIHRoYXQ6CgpgYGB7cn0Kc2ltdWxhdGVTaXRlc0FuZENvbGxlY3RGcmVxdWVuY2llcyA8LSBmdW5jdGlvbiAoc3RhcnRpbmdTdGF0ZSwgYnJhbmNoTGVuZ3RoLCBudW1TaXRlcykgewogIHNpbXVsYXRlZFN0YXRlcyA9IGMoKQogIGZvciAoaSBpbiAxOm51bVNpdGVzKSB7CiAgICBzaW11bGF0ZWRTdGF0ZXMgPSBjKHNpbXVsYXRlZFN0YXRlcywgc2ltdWxhdGVBbG9uZ0JyYW5jaChzdGFydGluZ1N0YXRlLCBicmFuY2hMZW5ndGgpKQogIH0gIAogIHJldHVybih0YWJsZShzaW11bGF0ZWRTdGF0ZXMpL251bVNpdGVzKQp9CgpgYGAKClRoZW4gd2UgZG8gdGhlIHNpbXVsYXRpb25zLiBCcmFuY2ggbGVuZ3RoIDAuMToKYGBge3J9CnNpbXVsYXRlU2l0ZXNBbmRDb2xsZWN0RnJlcXVlbmNpZXMoIkEiLCAwLjEsIDEwMDApCgpgYGAKClRoZW4gd2UgZG8gdGhlIHNpbXVsYXRpb25zLiBCcmFuY2ggbGVuZ3RoIDE6CmBgYHtyfQpzaW11bGF0ZVNpdGVzQW5kQ29sbGVjdEZyZXF1ZW5jaWVzKCJBIiwgMSwgMTAwMCkKCmBgYAoKVGhlbiB3ZSBkbyB0aGUgc2ltdWxhdGlvbnMuIEJyYW5jaCBsZW5ndGggMTA6CmBgYHtyfQpzaW11bGF0ZVNpdGVzQW5kQ29sbGVjdEZyZXF1ZW5jaWVzKCJBIiwgMTAsIDEwMDApCgpgYGAKCgpEbyB5b3UgZmluZCB0aG9zZSBwcm9wb3J0aW9ucyBleHBlY3RlZCBvciBzdXJwcmlzaW5nPwo=