Week8 E115L: Diploid selection and genetic drift

by Grace Yuh Chwen Lee, Winter 2025

Diploid selection

Most the organisms you could actually “see” are diploids. So, it is equally important to understand how selection may influence the changes in allele frequencies. The concept is important in many scenarios, e.g. plant and animal breeding (e.g., how fast could farmers select corn with 110% of the yield?), insecticide resistance (e.g., how quickly will the mosquito population all become insecticide resistant), adaptation (e.g., plants being resistant to salt water!), and many other.

A1A1 A1A2 A2A2
genotype frequency \(p^2\) \(2pq\) \(q^2\)
absolute fitness \(W_{11}\) \(W_{12}\) \(W_{22}\)
relative fitness \(\omega_{11}=1\) \(\omega_{12} = 1-hs\) \(\omega_{22} = 1-s\)


One major difference between diploid and haploid selection model is dominance coefficient h. Let’s look at the number of offspring produced by three genotypes. Which allele is dominant?

A1A1 A1A2 A2A2
100 100 50
100 50 50
100 75 50


The h value for the above three cases are 1 (A1 is dominant), 0 (A2 is dominant), and 0.5 (neither allele is dominant over the other allele). Note that h can take values beyond 0 and 1 (stay tuned).

What are the two components of fitness?

What are the possible units of absolute fitness?

Hint: think about what you are counting for your plant experiment.

A11 A12 A22
number of individuals 25 950 9025
genotype frequency ? ? ?
survival component 100% 50% 80%
fertility component 80 70 80
absolute fitness ? ? ?
relative fitness ? ? ?

For the above table, what is the initial allele frequency of A1?

What is the selection coefficient (s)?

What is the dominance coefficient (h)?

Do you think A1 or A2 allele is “selected for”?

Will A1 and A2 increase in allele frequency?



With \(p\) being the frequency of A1 allele and \(q\) being the frequency of A2 allele (and remember \(p+q = 1\)), the frequency of A1 allele in the next generation is:

\(p_{t+1} = \frac{W_{11}p_{t}^2+W_{12}p_{t}q_{t}}{\bar{W}}\)


Again, the absolute fitness is irrelevant; what matters is the relative fitness. The above could be represented as

\(p_{t+1} = \frac{\omega_{11}p_{t}^2+\omega_{12}p_{t}q_{t}}{\bar{\omega}}\)


Changes in allele frequency in one generation is then

\(\Delta p_{t} = p_{t+1}-p_{t} = \frac{\omega_{11}p_{t}^2+\omega_{12}p_{t}q_{t}}{\bar{\omega}} - p_{t}\)


Let’s incorporate \(s\) and \(h\) (i.e., \(\omega_{12} = 1-hs\) and \(\omega_{22} = 1-s\)), the changes in allele frequency can be represented as

\(\Delta p_{t} = \frac{p_{t}hs + q_{t}s(1-h)}{\bar\omega}p_{t}q_{t}\)

When \(s > 0\) (i.e., A1A1 individuals are selected for), can you say whether \(\Delta p_{t}\) is always positive or negative? Compare to the case for haploid.

When \(\Delta p_{t}\) is positive, is the frequency of A1 allele increasing or decreasing? How about when the value is negative?

Let’s define a function to predict A1 allele frequency of the next generation with different s and h. Remember that \(p+q = 1\), so we only need to specify \(p\) in the function.

#define our own function for p_next
p_next_fun_diploid = function(p, s, h){
  mean_fitness = 1-2*p*(1-p)*h*s-(1-p)*(1-p)*s
  p_next = p + (p*h*s + (1-p)*s*(1-h))*p*(1-p)/mean_fitness
  return(p_next)
}
s = 0.6
h = 0.5
p = 0.05
ngen = 100
v_p = c()

for (i in 1:ngen){
  p_next = p_next_fun_diploid(p, s, h)
  v_p[i] = p_next
  p = p_next
}

plot(1:ngen, v_p, xlab = "generation", ylab = "frequency", pch = 20, ylim = c(0, 1))

How \(s\) changes the dynamics of A1 allele is similar to that in the haploid case. Let’s explore a bit about the impacts of \(h\).

s = 0.6
ngen = 50
m = matrix(nrow = 5, ncol = ngen)

v_h = c(0, 0.2, 0.5, 0.8, 1) # testing 5 different h values

for(i in 1:5){
  my_h = v_h[i]
  p = 0.05
  
  for (j in 1:ngen){
   p_next = p_next_fun_diploid(p, s, my_h)
   m[i, j] = p_next
   p = p_next
  }
}

my_color = c("red", "orange", "yellow", "olivedrab", "blue")
plot(1:ngen, m[1,], xlab = "generation", ylab = "A1 frequency", pch = 20, col = my_color[1], type = 'b', ylim = c(0, 1))
for (i in 2:5){ #plot functions can also be loop
  points(1:ngen, m[i,], col = my_color[i], pch = 20, type = 'b')
}

With which \(h\), A1 allele arises in frequency fastest?

Did you notice that the trajectories are also different? Can you guess why?



\(h\) does not need to be between 0 and 1. Let’s look at the following example.

A11 A12 A22
number of individuals 25 950 9025
genotype frequency ? ? ?
rate of survival 100% 100% 80%
number of offspring 50 70 50
absolute fitness ? ? ?
relative fitness ? ? ?

What is the value of \(s\) and \(h\)? Which genotype (A1A1, A1A2, or A2A2) has the highest fitness?



This is called “overdominance,” when heterozygotes have the highest fitness. Let’s see how this influences the dynamics of A1 allele.

p = 0.05
s = 0.2
h = -2
ngen = 100
v_p = c()

for (i in 1:ngen){
  p_next = p_next_fun_diploid(p, s, h)
  v_p[i] = p_next
  p = p_next
}

plot(1:ngen, v_p, xlab = "generation", ylab = "frequency", pch = 20, ylim = c(0, 1))

Do you think A1 allele will ever goes to fixation (i.e., all individuals in the population have that allele)?



Let’s look at another scenario.

A11 A12 A22
number of individuals 25 950 9025
genotype frequency ? ? ?
rate of survival 100% 50% 80%
number of offspring 80 70 80
absolute fitness ? ? ?
relative fitness ? ? ?

What is the value of \(s\) and \(h\)? Which genotype (A1A1, A1A2, or A2A2) has the lowest fitness?

When the heterozygotes have the lowest fitness, it is called “underdominance.” Underdominance also has some interesting behavior….

p = 0.05
s = 0.2
h = 2.8
ngen = 100
v_p = c()

for (i in 1:ngen){
  p_next = p_next_fun_diploid(p, s, h)
  v_p[i] = p_next
  p = p_next
}

plot(1:ngen, v_p, xlab = "generation", ylab = "frequency", pch = 20, ylim = c(0, 1))

When the starting frequency of A1 is 0.05, is A1 eventually fixed or lost in the population?

Let’s look at exactly the same setting for h and s, but the starting frequency is different.

p = 0.5
s = 0.2
h = 2.8
ngen = 100
v_p = c()

for (i in 1:ngen){
  p_next = p_next_fun_diploid(p, s, h)
  v_p[i] = p_next
  p = p_next
}

plot(1:ngen, v_p, xlab = "generation", ylab = "frequency", pch = 20, ylim = c(0, 1))

Is A1 fixed or lost now?

When h is between 0 and 1, do you think the dynamics of A1 depends on initial population frequencies? How does this differ from the haploid case?


Selection and Drift

Let’s take another look at the selection and drift functions.

In class exercise:

Try defining a selection_diaploid(p, s, h, ngen) function and run it for 10 times (i.e., predicting changes in allele frequency for the next 100 generations 10 times), and make a plot including the results from these 10 runs.

Are the trajectories for allele frequencies the same or different for your 10 runs?

#let's simulate the effect of selection for ten times
my_p = 0.1
my_s = 0.2
my_h = 0.5
my_ngen = 100

colors = c("red", "pink", "orange", "yellow", "darkgreen", "green", "blue", "skyblue", "violet", "gray50")


m_p = matrix(nrow = 10, ncol = my_ngen)
for(i in 1:10){
  v_p = selection_diploid(my_p, 1-my_s, my_h, my_ngen)
  m_p[i,] = v_p
}

plot(m_p[1,], type = 'l', ylim = c(0, 1), ylab = "A1 frequency", xlab = "generation", col = colors[1])
for(i in 2:10){
  points(m_p[i,], type = 'l', col = colors[i])
}

In class exercise:

Let’s try the same thing with genetic drift functions (which takes p, N, ngen) you had, but remember to change to diploid case. Do you get similar results?

#let's simulate the effect of drift for ten times
my_p = 0.1
N = 100
my_ngen = 100
colors = c("red", "pink", "orange", "yellow", "darkgreen", "green", "blue", "skyblue", "violet", "gray50")


m_p = matrix(nrow = 10, ncol = my_ngen)
for(i in 1:10){
  v_p = drift_diploid(my_p, N, my_ngen)
  m_p[i,] = v_p
}

plot(m_p[1,], type = 'l', ylim = c(0, 1), ylab = "A1 frequency", xlab = "generation", col = colors[1])
for(i in 2:10){
  points(m_p[i,], type = 'l', col = colors[i])
}

The major difference between the selection vs drift simulation is one is deterministic while the other is stochastic. These two terms are general to various simulations and you may encounter them in context other than population genetics.


Relative importance of directional selection and genetic drift

In real population, both forces lead to changes in allele frequencies. But which one dominates depend on the strength of the forces (or “scale” in population genetics). A rule of thumb is to compare s vs 1/N. When s > 1/N, selection dominates and the changes in allele frequency is more or less deterministic. On the other hand, when s < 1/N, drift dominates and changes in allele frequency is stochastic.

In class exercise:

Try using your p_next_drift_fun() and p_next_fun_diploid() to predict the allele frequency changes for 100 generations with both drift and selection.

#set up our parameters
N = 1000
p = 0.5
h = 0.5
s = 0.1
ngen = 1000

v_p = c()
for (i in 1:ngen){
  
  #here, we let selection happens first. In real population genetic simulations, the order sometimes matter. 
  #NOTE: did we use N here?
  p_next_after_selection = p_next_fun_diploid(p, s, h)
  
  #drift happens after
  #NOTE: did we use N here?
  p_next_after_selection_after_drift = p_next_drift_fun(p_next_after_selection, N)
  
  v_p[i] = p_next_after_selection_after_drift
  
  p = p_next_after_selection_after_drift ######## important ######
}

plot(v_p, type = 'l', ylim = c(0, 1), xlab = "generation", ylab = "allele frequency")

Let’s try defining a new function.

sel_drift_fun = function(p, s, h, N, ngen){
  v_p = c()
  
  for (i in 1:ngen){
    p_next_after_selection = p_next_fun_diploid(p, s, h)
    p_next_after_selection_after_drift = p_next_drift_fun(p_next_after_selection, N)
    v_p[i] = p_next_after_selection_after_drift
    p = p_next_after_selection_after_drift 
  }
  return(v_p)
}

Let’s use this new function to explore the relative importance of selection and drift. In this case, s >> 1/N.

p = 0.5
ngen = 100

N = 100
s = 0.4
h = 0.5

v_freq = sel_drift_fun(p, s, h, N, ngen)
plot(v_freq, type = 'l', col = "blue", ylim = c(0, 1), xlab = "generation", ylab = "allele frequency")

Let’s try weaker selection, where s ~ 1/N.

p = 0.5
ngen = 100

N = 100
s = 0.01
h = 0.5

v_freq = sel_drift_fun(p, s, h, N, ngen)
plot(v_freq, type = 'l', col = "red", ylim = c(0, 1), xlab = "generation", ylab = "allele frequency")

Let’s try running the simulations multiple times (like what we did for drift alone) and explore the trajectories.

For the case where s >> 1/N, do you think selection or drift dominates?

p = 0.5
ngen = 100
nrun = 10

N = 100
s = 0.4
h = 0.5


v_freq = sel_drift_fun(p, s, h, N, ngen)
plot(v_freq, type = 'l', col = "blue", ylim = c(0, 1), xlab = "generation", ylab = "allele frequency", main = "N = 100, s = 0.1")

for(i in 1:nrun-1){
  v_freq = sel_drift_fun(p, s, h, N, ngen)
  lines(v_freq, type = 'l', col = "blue")
}

For the case where s ~ 1/N, do you think selection or drift dominates?

p = 0.5
ngen = 100
nrun = 10

N = 100
s = 0.01
h = 0.5


v_freq = sel_drift_fun(p, s, h, N, ngen)
plot(v_freq, type = 'l', col = "red", ylim = c(0, 1), xlab = "generation", ylab = "allele frequency", main = "N = 100, s = 0.01")

for(i in 1:nrun-1){
  v_freq = sel_drift_fun(p, s, h, N, ngen)
  lines(v_freq, type = 'l', col = "red")
}

It is the relative scale of s and N that matters. Dirft would dominate for both N = 100, s = 0.01 and N = 1000, s = 0.001.

par(mfrow = c(1, 2))  #having two figures in one row

p = 0.5
ngen = 1000
nrun = 10

N = 100
s = 0.01
h = 0.5

v_freq = sel_drift_fun(p, s, h, N, ngen)
plot(v_freq, type = 'l', col = "red", ylim = c(0, 1), xlab = "generation", ylab = "allele frequency", main = "N = 100, s = 0.01")

for(i in 1:nrun-1){
  v_freq = sel_drift_fun(p, s, h, N, ngen)
  lines(v_freq, type = 'l', col = "red")
}



N = 1000
s = 0.001
omega2 = 1-s 

v_freq = sel_drift_fun(p, s, h, N, ngen)
plot(v_freq, type = 'l', col = "red", ylim = c(0, 1), xlab = "generation", ylab = "allele frequency", main = "N = 1000, s = 0.001")

for(i in 1:nrun-1){
  v_freq = sel_drift_fun(p, s, h, N, ngen)
  lines(v_freq, type = 'l', col = "red")
}

You may notice that how quickly the allele frequency changes differ between N = 100 and 1000, despite drift dominates both simulations. The time scale for drift depends on the inverse of population size: the expected time to fixation/lost of an allele due to drift is ~ 1/N. Accordingly, allele frequencies change with faster rate when N is smaller.