The Weasel Program

In Chapter 3 of The Blind Watchmaker, Richard Dawkins references the infinite monkey theorem:

I don’t know who first pointed out that, given enough time, a monkey bashing away at random on a typewriter could produce all the works of Shakespeare. The operative phrase is, of course, given enough time. Let us limit the task facing our monkey somewhat. Suppose that he has to produce, not the complete works of Shakespeare but just the short sentence ‘Methinks it is like a weasel’, and we shall make it relatively easy by giving him a typewriter with a restricted keyboard, one with just the 26 (capital) letters, and a space bar. How long will he take to write this one little sentence?

The Simpsons: Last Exit to Springfield

Q1 If you have a million monkey at a million typewriters and a monkey could type one phrase 28 characters long every minute, how long would it take?

Dawkins goes on to say:

We again use our computer monkey, but with a crucial difference in its program. It again begins by choosing a random sequence of 28 letters, just as before … it duplicates it repeatedly, but with a certain chance of random error – ‘mutation’ – in the copying. The computer examines the mutant nonsense phrases, the ‘progeny’ of the original phrase, and chooses the one which, however slightly, most resembles the target phrase, METHINKS IT IS LIKE A WEASEL.

The exact time taken by the computer to reach the target doesn’t matter… Computers are a bit faster at this kind of thing than monkeys, but the difference really isn’t significant. What matters is the difference between the time taken by cumulative selection, and the time which the same computer, working flat out at the same rate, would take to reach the target phrase if it were forced to use the other procedure of single-step selection: about a million million million million million years. This is more than a million million million times as long as the universe has so far existed.

Spelling Population Genetics

For the sake of computational time, let’s start with a shorter phrase. Let’s say that we want our monkeys to type “Population Genetics” but we’ll settle for “POPULATIONGENETICS” from a keyboard with only 26 ALL CAPS keys.

Let’s start by creating a random sequences of letters, 18 characters long:

start.message <- sample(LETTERS, 18)

start.message

##  [1] "D" "M" "C" "R" "V" "X" "Q" "I" "F" "O" "B" "A" "L" "H" "J" "U" "W"
## [18] "K"

Then we can go letter by letter and see if each letter matches the letter we want. We can also count up the number of correct letters.

correct.message <- "POPULATIONGENETICS"
correct.message <- strsplit(correct.message, "")[[1]]

start.message==correct.message

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE

sum(start.message==correct.message)

## [1] 1

Q2 What is the probability that this procedure would produce each possible number of correct letters? Try using R to calculate these probabilities. The distribution should look like the one below:

Mutating a Message

Let’s start with a mutation rate of 20% to speed things along. Each letter will have a 20% chance of mutating. If it mutates, it has an equal chance of becoming any letter. To do this in R, we’ll simulate 18 random numbers between 0 and 1. Each one of those numbers that’s lower than the mutation rate, signals that we should mutate that letter in the sequence.

mutation.rate <- 0.20
randos <- runif(18)
randos
randos < mutation.rate

Now, let’s mutate the message at the appropriate locations and compare our message to the start message.

message <- start.message
message[randos < mutation.rate] <- sample(LETTERS, sum(randos < mutation.rate))
message
start.message

(Note: If you happened not to see any mutations, you may want to run this code again to see what it does when there is a mutation.)

A Population of 100 Messages

Let’s create a population of 100 messages each 18 letters long:

start.message <- sample(LETTERS, 18)
for (i in 1:99){
  start.message <- rbind(start.message, sample(LETTERS, 18))
}

message <- start.message
message

For each of these 100 messages, let’s count the number of letters that are correct (that is the number that match the corresponding letter in “POPULATIONGENETICS”). We’ll store these numbers of correct letters in a vector.

correct.counts <- rep(NA, 100)
for(i in 1:100){
  correct.counts[i] <- sum(message[i, ]==correct.message)
}
correct.counts

Next, let’s order these messages from most correct letters to least and select the numbers of the two messages with the most correct letters.

order(correct.counts, decreasing=TRUE)[1:2]

Only these two messages will get to reproduce themselves. We’ll let each of these message create 50 exact copies of themselves to make a new population of 100 messages:

parents <- message[order(correct.counts, decreasing=TRUE)[1:2],]
message <- do.call("rbind", replicate(50, parents, simplify = FALSE))

Now, let’s allow random mutation to happen on each of the members of this second generation of messages:

for (i in 1:100){
  randos <- runif(18)
  message[i, randos < mutation.rate] <- sample(LETTERS, sum(randos < mutation.rate))
}

and, finally, we can count the number of correct letters once again:

correct.counts <- rep(NA, 100)
for(i in 1:100){
  correct.counts[i] <- sum(message[i, ]==correct.message)
}
correct.counts

Q3 How many correct letters to the most correct messages have? What about the least correct messages?

A Message Evolution Function

Let’s put everything we’ve done together in a function. We’ll allow ourselves to simulate any number of generations to change (num.gens), to change the mutation rate, to pick new correct messages and to change the size of the population (2N). It will vector with the average accuracy (proportion of correct letters) of our population in each generation.

simple.evolution <- function(correct.message = "POPULATIONGENETICS", mutation.rate=0.01, num.gens=100, N=50){

correct.message <- strsplit(correct.message, "")[[1]]
num.letters <- length(correct.message)
accuracy <- rep(NA, num.gens)
  
start.message <- sample(LETTERS, 18)
for (i in 1:(2*N-1)){
  start.message <- rbind(start.message, sample(LETTERS, num.letters))
}

message <- start.message

for (g in 1:num.gens){
  
for (i in 1:(2*N)){
  randos <- runif(num.letters)
  message[i, randos < mutation.rate] <- sample(LETTERS, sum(randos < mutation.rate))
  
}

correct.counts <- rep(NA, 2*N)
for(i in 1:(2*N)){
  correct.counts[i] <- sum(message[i, ]==correct.message)
}
accuracy[g] <- mean(correct.counts/num.letters)

parents <- message[order(correct.counts, decreasing=TRUE)[1:2],]
message <- do.call("rbind", replicate(N, parents, simplify = FALSE))
}
return(accuracy)
}

Now, let’s try out a few different mutation rates all for 200 generations.

results <- simple.evolution(mutation.rate=0.2, num.gens=200)
plot(1:length(results), results, type="l", xlab="generation", ylab="Accuracy")

results <- simple.evolution(mutation.rate=0.1, num.gens=200)
plot(1:length(results), results, type="l", xlab="generation", ylab="Accuracy")

results <- simple.evolution(mutation.rate=0.01, num.gens=200)
plot(1:length(results), results, type="l", xlab="generation", ylab="Accuracy")

results <- simple.evolution(mutation.rate=0.001, num.gens=200)
plot(1:length(results), results, type="l", xlab="generation", ylab="Accuracy")

Q4 What happens in these simulations? How does the mutation rate effect the evolution of the population of messages?

Try out a few simulations of your own.

Crossing Over

Let’s give our evolution function one more tool to work with – crossing over.

We can demonstrate this idea by allowing a random cross over between “POPULATIONGENETICS” and the first 18 letters of the alphabet:

crossOverPoint <- sample(2:17, 1)
crossOverPoint
parents <- rbind(correct.message, LETTERS[1:18])
children <- parents
children[1,] <- c(parents[1,1:crossOverPoint],
                  parents[2,(crossOverPoint+1):18])
children[2,] <- c(parents[2,1:crossOverPoint],
                  parents[1,(crossOverPoint+1):18])

children

(Note: We could add realism but assigning a probability of crossover between any two letters and allowing for multiple crossovers but this will do for now.)

A Message Evolution Function with Crossovers

Let’s alter our evolution function so that the two parent messages crossover at a random location when producing each child message.

evolution.crossover <- function(correct.message = "POPULATIONGENETICS", mutation.rate=0.01, num.gens=100, N=50){

correct.message <- strsplit(correct.message, "")[[1]]
num.letters <- length(correct.message)
accuracy <- rep(NA, num.gens)
  
start.message <- sample(LETTERS, 18)
for (i in 1:(2*N-1)){
  start.message <- rbind(start.message, sample(LETTERS, num.letters))
}

message <- start.message

for (g in 1:num.gens){
  
for (i in 1:(2*N)){
  randos <- runif(num.letters)
  message[i, randos < mutation.rate] <- sample(LETTERS, sum(randos < mutation.rate))
  
}

correct.counts <- rep(NA, 2*N)
for(i in 1:(2*N)){
  correct.counts[i] <- sum(message[i, ]==correct.message)
}
accuracy[g] <- mean(correct.counts/num.letters)

parents <- message[order(correct.counts, decreasing=TRUE)[1:2],]

  for (c in 1:N){
    crossOverPoint <- sample(2:(num.letters-1), 1)
    
    message[c,] <- c(parents[1,1:crossOverPoint],
                      parents[2,(crossOverPoint+1):num.letters])
    message[N+c,] <- c(parents[2,1:crossOverPoint],
                      parents[1,(crossOverPoint+1):num.letters])
  }
}
return(accuracy)
}

Now, let’s try several simulations with our new function:

results <- evolution.crossover(mutation.rate=0.2, num.gens=200)
plot(1:length(results), results, type="l", xlab="generation", ylab="Accuracy")

results <- evolution.crossover(mutation.rate=0.1, num.gens=200)
plot(1:length(results), results, type="l", xlab="generation", ylab="Accuracy")

results <- evolution.crossover(mutation.rate=0.01, num.gens=200)
plot(1:length(results), results, type="l", xlab="generation", ylab="Accuracy")

results <- evolution.crossover(mutation.rate=0.001, num.gens=200)
plot(1:length(results), results, type="l", xlab="generation", ylab="Accuracy")

Q4 In our message evolution the effects crossovers might be muted since the two parents are never that different. Additionally, the selection pressures in our imagined scenario are perhaps unrealistically strong. What changes would you propose to this function in order add realism or simply to investigate other possibilities?

The Blind Phrase Maker

Population Genetics

3/19/2017