Week7 E115L: Genetic Drift

by Grace Yuh Chwen Lee, Spring 2024

Allele frequencies could change due to selection for or against an allele, as what we learned last week. But even in the absence of selection, allele frequencies could still change due to randomness, known as genetic drift. There are difference sources of genetic drift, including

An individual carrying a particular allele can have more or less offspring than the average
Mendelian transmission

Source of genetic drift

Let’s look at what we worked on last week (diploid case) for the first source of randomness.

	A11	A12	A22
rate of survival	100%	100%	80%
number of offspring	50	70	50
absolute fitness	?	?	?
relative fitness	?	?	?

For an individual with A11 genotype, how many offspring can that individual produce?

How about another individual with A11 genotype?

Do you think individuals with A11 genotype ALWAYS produce exactly that number of offspring?

How about for other genotypes?

Let’s look at another source of randomness - Mendelian transmission. For heterozygous A12 individuals…

What is the probability that the A1 allele is passed down? How about A2 allele?

What if an A12 individual only produces one offspring, which allele will be passed down?

We can use the coin tossing example in week 4 (but now, selecting A1 or A2 allele) to illustrate this point.

Try the following code multiple times and see if it gives you the same answer.

alleleS = c('A1', 'A2')
ntoss = 10 #the number of offspring produced
random_allele = sample(alleleS, size = ntoss, replace = TRUE) 
table(random_allele)

random_allele
A1 A2 
 5  5

Let’s write it into a function to make it easier to use.

per_A1_next_gen = function(n_offspring){
   alleleS = c('A1', 'A2')
    random_allele = sample(alleleS, size = n_offspring, replace = TRUE) 
  n_A1 = sum(random_allele == 'A1')
  return(n_A1/n_offspring)
}
    
#only one offspring
per_A1_next_gen(1)

[1] 1

#5 offspring
per_A1_next_gen(5)

[1] 0.8

#ten offspring
per_A1_next_gen(10)

[1] 0.7

#a thousand offspring
per_A1_next_gen(1000)

[1] 0.512

Because of these two sources of randomness, even in the absence of selection/fitness differences, allele frequencies change from generations to generations. In this class, we will write simulations to explore how various factors influence how strong genetic drift changes allele frequencies.

Binomial distribution

Before we write our own simualtion, we will need to first learn about random variables with binomial distribution. A binomial random variable is the number of success in an experiment of N trails.

Let’s take coin tossing as an example (again). The “number of heads with N toss” has a distribution of binomial distribution.

coin_head = function(ntoss){
  coin = c("head", "tail")
  random_coin = sample(coin, size = ntoss, replace = TRUE) 
  no_head = sum(random_coin == "head")
  return(no_head)
}

v_no_head = c()
ntoss = 100
for(i in 1:10000){
  no_head = coin_head(ntoss)
  v_no_head[i] = no_head
}
hist(v_no_head, xlab = "number of heads (100 toss)", main = "number of heads, fair coin")

Let’s consider a case of “unfair” coin, and compare the distributions.

coin_head = function(ntoss){
  coin = c("head", "tail")
  random_coin = sample(coin, size = ntoss, replace = TRUE, prob = c(0.9, 0.1)) 
  no_head = sum(random_coin == "head")
  return(no_head)
}

v_no_head2 = c()
ntoss = 100
for(i in 1:10000){
  no_head = coin_head(ntoss)
  v_no_head2[i] = no_head
}
hist(v_no_head2, xlab = "number of heads (100 toss)", main = "number of heads, unfair coin")

The two alleles, A1 and A2, could be like the “head” and “tail” of a unfair coin, with their probability being sampled following their frequencies.

To generate a random variable with a binomial distribution, we use the rbinom() function.

#the number of individuals with A1 allele, which has allele frequency of 0.9
#the number of individuals in the population is 100
#only do this sampling once
rbinom(n = 1, size = 100, prob = 0.9)

[1] 91

In class exercise - a simulator for genetic drift (this will be part of your lab report)

Goal: A FUNCTION to perform genetic drift with population size N, initial allele frequency p, and for ngen generations

Step 1: (First function) Write a FUNCTION p_next_drift_funfor allele frequency in the next generation using binomial sampling, given allele frequency p of current generation and fixed N population size.

Step 2: Write a FOR LOOP using the p_next_drift_fun function to see how allele frequencies change over the next ngen generations.

Save the results in v_p vector.

Step 3: (Second function) Convert what you wrote in Step 3 as ANOTHER FUNCTION drift_fun(), which takes three values p, N, and ngen. (Note, you can call another function you defined in a function!)

We will use this function in next class to explore various aspects of genetic drift, which will be part of your lab report!

Key characteristic of genetic drift is “randomness”

Let’s try executing your drift_fun() 10 times, save it in a matrix, and plot all the results out in the same figure.

p = 0.5
N = 100
ngen = 200

m_p = matrix(nrow = 10, ncol = ngen)
for(i in 1:10){
  v_p = drift_fun(p, N, ngen)
  m_p[i,] = v_p
}

plot(m_p[1,], type = 'l', ylim = c(0, 1), ylab = "A1 frequency", xlab = "generation")
for(i in 2:10){
  points(m_p[i,], type = 'l')
}

With the same p, N, and ngen, do you think the trajectories are similar between the 10 runs?

Do you think the frequencies are more similar or more different as time goes by?

Impacts of genetic drift over generations

Let’s use some statistics to formally investigate whether the allele frequencies are indeed more different over time. We will learn how to use apply function in R.

?apply

v_mean = apply(m_p, 2, mean)
v_var  = apply(m_p, 2, var)

#mean A1 frequency over time
plot(v_mean, pch = 20, xlab = "generation", ylab = "mean A1 freq.", ylim = c(0, 1), main = "mean A1 freq")
lines(c(0, ngen), c(0.5, 0.5), col = "red")

#var A1 frequency over time
plot(v_var, pch = 20, xlab = "generation", ylab = "var A1 freq.", main = "variance A1 freq")

From the two plots, do you think mean A1 frequency changes much over time?

How about variance of A1 frequency and what does that mean?

Let’s run the simulation even longer to see if we could observe even more dramatic effects.

p = 0.5
N = 100
ngen = 2000

m_p = matrix(nrow = 10, ncol = ngen)
for(i in 1:10){
  v_p = drift_fun(p, N, ngen)
  m_p[i,] = v_p
}

v_mean = apply(m_p, 2, mean)
v_var  = apply(m_p, 2, var)

#mean A1 frequency over time
plot(v_mean, pch = 20, xlab = "generation", ylab = "mean A1 freq.", ylim = c(0, 1), main = "mean A1 freq")
lines(c(0, ngen), c(0.5, 0.5), col = "red")

#var A1 frequency over time
plot(v_var, pch = 20, xlab = "generation", ylab = "var A1 freq.", main = "mean A1 freq")

Why variance does not change anymore? Let’s plot out trajectories of individual run.

plot(m_p[1,], type = 'l', ylim = c(0, 1), ylab = "A1 frequency", xlab = "generation")
for(i in 2:10){
  points(m_p[i,], type = 'l')
}

A1 frequencies either goes to fixation (i.e., p = 1, and A2 is lost) or lost (i.e., p = 0, and A1 is lost) after 500 generations.

Do you think genetic drift enhance or erode genetic diversity?

Impacts of initial allele frequency (p)

Let’s define another function, where it would return nrun number of series of drift simulations as a matrix.

multi_drift_fun = function(p, N, ngen, nrun){
  multi_m_p = matrix(nrow = nrun, ncol = ngen)
  for(i in 1:nrun){
    v_p = drift_fun(p, N, ngen)
    multi_m_p[i,] = v_p
  }
  return(multi_m_p)
}

N = 100
ngen = 200
nrun = 10 #number of simulations

#trying a series of initial allele frequencies
#p = 0.5 (the same as above), 0.1, 0.8
m_p_0.5 = multi_drift_fun(0.5, N, ngen, nrun)
m_p_0.1 = multi_drift_fun(0.1, N, ngen, nrun)
m_p_0.8 = multi_drift_fun(0.8, N, ngen, nrun)

#p = 0.5
plot(m_p_0.5[1,], type = 'l', ylim = c(0, 1), ylab = "A1 frequency", xlab = "generation", main = "p0 = 0.5")
for(i in 2:10){
  points(m_p_0.5[i,], type = 'l')
}

#p = 0.1
plot(m_p_0.1[1,], type = 'l', ylim = c(0, 1), ylab = "A1 frequency", xlab = "generation", main = "p0 = 0.1")
for(i in 2:10){
  points(m_p_0.1[i,], type = 'l')
}

#p = 0.8
plot(m_p_0.8[1,], type = 'l', ylim = c(0, 1), ylab = "A1 frequency", xlab = "generation", main = "p0 = 0.8")
for(i in 2:10){
  points(m_p_0.8[i,], type = 'l')
}

Comparing these three plots with different starting allele frequencies of A1 (p), which one most likely leads to the lost of A1 allele?

Which one most likely leads to the fixation of A1 allele (i.e., lost of A2)?

Impacts of population size (N) - in class exercise (this will be part of your report)

Goal: Explore how the impacts of genetic drift vary with population size N

Step 1: Define a new function mean_var_drift_fun(), which takes in the value of p, N, ngen, nrun, and return mean and variance of allele frequency for each generation as a data frame.

Hint1: similar to what you did here, but just make it into a function!

v_mean = apply(m_p, 2, mean)
v_var  = apply(m_p, 2, var)

Hint2: You can define a new data.frame as mydata = data.frame(mean = v_mean, var = v_var), which comes in handy later.

Step 2: Call your mean_var_drift_fun() function and use three N (10, 1000, 100000).

Step 3: Plots out mean and variance over generations for the three N.

Step 4: Answer the following questions (in your lab report)

Is the impact of genetic drift stronger in small or large populations? Answer this question both with respect to your experimental results and simulations.

Hint: Be mindful of your Y-axis scale - they may be very different for your three N.