Week9 E115L: Genetic Drift and selection

by Grace Yuh Chwen Lee, Spring 2024

Review of Genetic drift

During Week 7, we worked together to define the following functions. Let’s spend some time to use them and see how they help us understand genetic drift.

p_next_drift_fun(p, N) simulates a possible allele frequency in the next generation in a population with size N and frequency of A1 allele p. The changes in allele frequency is due to the effect of genetic drift, and you should get a different answer each time you try it!
- An important thing to note: The construction for haploid and diploid should be different for this function. Which one of the following is for haploid and which one is for diploid?
```
p_next_drift_fun = function(p, N){
  n_next = rbinom(1, size = N, prob = p)
  p_next = n_next/N
  return(p_next)
}

p_next_drift_fun = function(p, N){
  n_next = rbinom(1, size = 2*N, prob = p)
  p_next = n_next/(2*N)
  return(p_next)
}
```
drift_fun(p, N, ngen) that does the binomial sampling ngen time. It returns a vector saving allele frequency of each generation.
- You should call p_next_drift_fun() in this function. Note that you can call population size, allele frequency of whatever name (no need to be N and p, but they have to be consistent between the two functions).
- In class exercise: Let’s use this function to explore the impact of population size (N). Run the following three simulations.
  
  p = 0.5, N = 10, ngen = 100
  
  p = 0.5, N = 100, ngen = 100
  
  p = 0.5, N = 1000, ngen = 100
  
  Make an X-Y plot, overlaying the results from all three simulations in one plot. Which simulation shows the strongest impact of genetic drift?
  
  Hint: mind the scale of Y-axis.
multi_dirft_fun(p, N, ngen, nrun) allows you to run simulations with the same set of parameters multiple times and save the allele frequncies in a matrix.
```
multi_drift_fun = function(p, N, ngen, nrun){
  multi_m_p = matrix(nrow = nrun, ncol = ngen) #this is an empty matrix
  for(i in 1:nrun){
    v_p = drift_fun(p, N, ngen)
    multi_m_p[i,] = v_p
  }
  return(multi_m_p)
}
```
- In class exercise: use this function to simulate 50 times
  
  with p = 0.5, N = 10, ngen = 100. Make a plot that includes all results.
  
  with p = 0.5, N = 1000, ngen = 100. Make a plot that includes all results.
  
  Based on your results, do you think the effect of genetic drift is stronger in small or big population?
mean_var_drift_fun(p, N, ngen, nrun) allows you to run simulations with the same set of parameters multiple times and return the mean and variance of allele frequencies over generations. This function helps us quantitatively investigate the impact of population size.
- In class exercise: use this function to simulate 50 times
  
  with p = 0.5, N = 10, ngen = 100. Make two plots, one for mean and one of variance over time.
  
  with p = 0.5, N = 1000, ngen = 100. Make two plots, one for mean and one of variance over time.
  
  Based on your results, do you think genetic drift can systemtically change allele frequencies? Do you think the effect of genetic drift is stronger in small or big population?

Two important forces changing allele frequencies

Directional selection
Genetic drift

In Week 5, we learned about how population genetic theories could predict the changes in frequency of a selected allele over time.

\(p_{t+1} = \frac{\omega_{1}}{\bar{\omega}}p_{t}\) (haploid)

\(p_{t+1} = \frac{\omega_{11}p_{t}^2+\omega_{12}p_{t}q_{t}}{\bar{\omega}}\) (diploid)

Together, we defined the p_next_fun(p, omega2) for haploid and p_next_fun_diploid(p, h, s) for diploid.

In class exercise: When using the same set of parameters running these two functions, did you get the same or different answers?

Try running p_next_fun for haploid with p = 0.5, and omega2 = 0.9 and repeat for 10 times. Do your answers change?

Hint:

p_next_fun = function(p, omega2){
  p_next = p/(p*1 + (1-p)*omega2) 
  return(p_next)
}

How about p_next_drift_fun? Did you get the same answers each time you try it?

The major difference between these defined functions is one is deterministic while the other is stochastic. These two terms are general to various simulations and you may encounter them in context other than population genetics.

In real population, both forces lead to changes in allele frequencies. But which one dominates depend on the strength of the forces (or “scale” in population genetics). A rule of thumb is to compare s vs 1/N. When s > 1/N, selection dominates and the changes in allele frequency is more or less deterministic. On the other hand, when s < 1/N, drift dominates and changes in allele frequency is stochastic.

Let’s try using simulations to explore these predictions. We will use functions we defined previously.

Here, I am simulating haploid case. These can be easily extended to diploids by using the corresponding diploid functions.

#redfine the drift function to make sure it is for haploid
p_next_drift_fun = function(p, N){
  n_next = rbinom(1, size = N, prob = p)
  p_next = n_next/N
  return(p_next)
}

#set up our parameters
N = 1000
p = 0.5
omega2 = 0.9
ngen = 1000

v_p = c()
for (i in 1:ngen){
  
  #here, we let selection happens first. In real population genetic simulations, the order sometimes matter. 
  #NOTE: did we use N here?
  p_next_after_selection = p_next_fun(p, omega2)
  
  #drift happens after
  #NOTE: did we use N here?
  p_next_after_drift = p_next_drift_fun(p_next_after_selection, N)
  
  v_p[i] = p_next_after_drift
  
  p = p_next_after_drift ######## important ######
}

plot(v_p, type = 'l', ylim = c(0, 1), xlab = "generation", ylab = "allele frequency")

Let’s try defining a new function.

sel_drift_fun = function(p, omega2, N, ngen){
  v_p = c()
  
  for (i in 1:ngen){
    p_next_after_selection = p_next_fun(p, omega2)
    p_next_after_drift = p_next_drift_fun(p_next_after_selection, N)
    v_p[i] = p_next_after_drift
    p = p_next_after_drift 
  }
  return(v_p)
}

Let’s use this new function to explore the relative importance of selection and drift. In this case, s >> 1/N.

p = 0.5
ngen = 100

N = 100
s = 0.1
omega2 = 1-s ## rembmer this?

v_freq = sel_drift_fun(p, omega2, N, ngen)
plot(v_freq, type = 'l', col = "blue", ylim = c(0, 1), xlab = "generation", ylab = "allele frequency")

Let’s try weaker selection, where s ~ 1/N.

p = 0.5
ngen = 100

N = 100
s = 0.01
omega2 = 1-s ## rembmer this?

v_freq = sel_drift_fun(p, omega2, N, ngen)
plot(v_freq, type = 'l', col = "red", ylim = c(0, 1), xlab = "generation", ylab = "allele frequency")

Let’s try running the simulations multiple times (like what we did for drift alone) and explore the trajectories.

For the case where s >> 1/N, do you think selection or drift dominates?

p = 0.5
ngen = 100
nrun = 10

N = 100
s = 0.1
omega2 = 1-s 


v_freq = sel_drift_fun(p, omega2, N, ngen)
plot(v_freq, type = 'l', col = "blue", ylim = c(0, 1), xlab = "generation", ylab = "allele frequency", main = "N = 100, s = 0.1")

for(i in 1:nrun-1){
  v_freq = sel_drift_fun(p, omega2, N, ngen)
  lines(v_freq, type = 'l', col = "blue")
}

For the case where s ~ 1/N, do you think selection or drift dominates?

p = 0.5
ngen = 100
nrun = 10

N = 100
s = 0.01
omega2 = 1-s 


v_freq = sel_drift_fun(p, omega2, N, ngen)
plot(v_freq, type = 'l', col = "red", ylim = c(0, 1), xlab = "generation", ylab = "allele frequency", main = "N = 100, s = 0.01")

for(i in 1:nrun-1){
  v_freq = sel_drift_fun(p, omega2, N, ngen)
  lines(v_freq, type = 'l', col = "red")
}

It is the relative scale of s and N that matters. Dirft would dominate for both N = 100, s = 0.01 and N = 1000, s = 0.001.

par(mfrow = c(1, 2))  #having two figures in one row

p = 0.5
ngen = 1000
nrun = 10

N = 100
s = 0.01
omega2 = 1-s 

v_freq = sel_drift_fun(p, omega2, N, ngen)
plot(v_freq, type = 'l', col = "red", ylim = c(0, 1), xlab = "generation", ylab = "allele frequency", main = "N = 100, s = 0.01")

for(i in 1:nrun-1){
  v_freq = sel_drift_fun(p, omega2, N, ngen)
  lines(v_freq, type = 'l', col = "red")
}



N = 1000
s = 0.001
omega2 = 1-s 

v_freq = sel_drift_fun(p, omega2, N, ngen)
plot(v_freq, type = 'l', col = "red", ylim = c(0, 1), xlab = "generation", ylab = "allele frequency", main = "N = 1000, s = 0.001")

for(i in 1:nrun-1){
  v_freq = sel_drift_fun(p, omega2, N, ngen)
  lines(v_freq, type = 'l', col = "red")
}

You may notice that how quickly the allele frequency changes differ between N = 100 and 1000, despite drift dominates both simulations. The time scale for drift depends on the inverse of population size: the expected time to fixation/lost of an allele due to drift is ~ 1/N. Accordingly, allele frequencies change with faster rate when N is smaller.