There’s More than One Right Answer

In a recent lab we saw how selection could lead us closer and closer to spelling “POPULATIONGENETICS”. In this lab we’ll add a little realism by noting that there’s more than one way to be fit. We’ll work with 3-letter sequences and any sequence that spells a word will have a chance to survive.

A Spell Checker

We’ll need to create a simple spell checker to find out if our 3-letter sequences are fit. To do this, we’ll load a file with all of the 3-letter words in the OSPD (Official Scrabble Player’s Dictionary).

load("/home/rstudioshared/shared_files/data/words.RData")

words

To check whether a string is a word we can simple check whether it appears in this words vector:

c("calico", "sqwrzib") %in% words

There are only 961 3-letter words in the OSPD. We can calculate the number of words of every length (2 through 8 letters) and the frequency of words among all possible sequences of that length.

table(nchar(words))

words.per.sequence <- table(nchar(words))/(26^(2:8))

words.per.sequence

Q1: What proportion of 3 letter sequences are words?

Q2: What proportion of 6 letter sequences are words?

Let’s make a vector of only the three letter words to save our spell checker some time:

words3 <- words[nchar(words)==3]


words3

Mutating a Sequence

Let’s start with a mutation rate of 50%. Each letter will have a 50% chance of mutating. If it mutates, it has an equal chance of becoming any letter. To do this in R, we’ll simulate 3 random numbers between 0 and 1. Each one of those numbers that’s lower than the mutation rate, signals that we should mutate that letter in the sequence.

mutation.rate <- 0.50
randos <- runif(3)
randos
randos < mutation.rate

Now, let’s mutate the message at the appropriate locations and compare our message to the start message.

start.message <- "bad"
start.message <- strsplit(start.message, "")[[1]]
message <- start.message
message[randos < mutation.rate] <- sample(letters, sum(randos < mutation.rate))
message <- paste(message, collapse = "")
start.message <- paste(start.message, collapse = "")
message; start.message

(Note: If you happened not to see any mutations, you may want to run this code again to see what it does when there is a mutation.)

A Population of 100 Sequences

Let’s create a population of 100 messages each 3 letters long:

start.message <- "bad"
start.message <- strsplit(start.message, "")[[1]]

message <- start.message
for (i in 1:99){
  message <- rbind(message, start.message)
}
for (i in 1:100){
  randos <- runif(3)
  message[i, randos < mutation.rate] <- sample(letters, sum(randos < mutation.rate))
}

message <- apply(message, 1, paste,collapse="")

message[message %in% words3]

table(message[message %in% words3])
start.message <- "bad"
start.message <- strsplit(start.message, "")[[1]]

message <- start.message
for (i in 1:99){
  message <- rbind(message, start.message)
}

mutation.rate <- 0.2

for (i in 1:10){
for (i in 1:100){
  randos <- runif(3)
  message[i, randos < mutation.rate] <- sample(letters, sum(randos < mutation.rate))
}

survivors <- message[apply(message, 1, paste, collapse="") %in% words3,]
message <- survivors[sample(1:nrow(survivors), 100, replace=TRUE),]
}

sequences <- apply(message, 1, paste,collapse="")
table(sequences)

Selective Advantage (Based on Scrabble Points)

Now, let’s make things more interesting by giving some words a selective advantage. Since we’re using a scrabble dictionary it seems only natural that each word’s fecundity be determined by the number of points it would score.

Let’s tell R the value of every letter in scrabble:

scrabble.values <- c(1,3,3,2,1,4,2,4,1,8,5,1,3,1,1,3,10,1,1,1,1,4,4,8,4,10)

Then we’ll add one row to our previous code and alter another. The new line calculates the scrabble score of every surviving (real) word. We also now sample words (produce the next generation of words) with the probabilities proportional to their scrabble scores.

start.message <- "bad"
start.message <- strsplit(start.message, "")[[1]]

message <- start.message
for (i in 1:99){
  message <- rbind(message, start.message)
}


mutation.rate <- 0.2

for (i in 1:10){
for (i in 1:100){
  randos <- runif(3)
  message[i, randos < mutation.rate] <- sample(letters, sum(randos < mutation.rate))
}

survivors <- message[apply(message, 1, paste, collapse="") %in% words3,]

# calculating survivor values
survivor.values <- rowSums(matrix(scrabble.values[apply(survivors, 2, match, letters)], ncol=3))

# sample with probabilities proporitional to scrabble scores
message <- survivors[sample(1:nrow(survivors), 100, replace=TRUE, prob=survivor.values),]
}

sequences <- apply(message, 1, paste,collapse="")
table(sequences)

Q3 How do the results of this evolution compare the evolution without scrabble scores?

Q4 Can you use this algorithm to find the highest scoring 3-letter scrabble word (with enough generations)? In what situations would this algorithm never find the highest scoring scrabble word and what might this tell us about evolution?

Challenge (due Thursday): Rewrite this program to use four word sequences and describe your results. Does your procedure find the highest scoring 4-letter scrabble word?