#define our own function for p_next
p_next_fun_diploid = function(p, s, h){
mean_fitness = 1-2*p*(1-p)*h*s-(1-p)*(1-p)*s
p_next = p + (p*h*s + (1-p)*s*(1-h))*p*(1-p)/mean_fitness
return(p_next)
}Week8 E115L: Diploid selection and genetic drift
by Grace Yuh Chwen Lee, Winter 2025
Diploid selection
Most the organisms you could actually “see” are diploids. So, it is equally important to understand how selection may influence the changes in allele frequencies. The concept is important in many scenarios, e.g. plant and animal breeding (e.g., how fast could farmers select corn with 110% of the yield?), insecticide resistance (e.g., how quickly will the mosquito population all become insecticide resistant), adaptation (e.g., plants being resistant to salt water!), and many other.
| A1A1 | A1A2 | A2A2 | |
|---|---|---|---|
| genotype frequency | \(p^2\) | \(2pq\) | \(q^2\) |
| absolute fitness | \(W_{11}\) | \(W_{12}\) | \(W_{22}\) |
| relative fitness | \(\omega_{11}=1\) | \(\omega_{12} = 1-hs\) | \(\omega_{22} = 1-s\) |
One major difference between diploid and haploid selection model is dominance coefficient h. Let’s look at the number of offspring produced by three genotypes. Which allele is dominant?
| A1A1 | A1A2 | A2A2 |
|---|---|---|
| 100 | 100 | 50 |
| 100 | 50 | 50 |
| 100 | 75 | 50 |
The h value for the above three cases are 1 (A1 is dominant), 0 (A2 is dominant), and 0.5 (neither allele is dominant over the other allele). Note that h can take values beyond 0 and 1 (stay tuned).
| A11 | A12 | A22 | |
|---|---|---|---|
| number of individuals | 25 | 950 | 9025 |
| genotype frequency | ? | ? | ? |
| survival component | 100% | 50% | 80% |
| fertility component | 80 | 70 | 80 |
| absolute fitness | ? | ? | ? |
| relative fitness | ? | ? | ? |
With \(p\) being the frequency of A1 allele and \(q\) being the frequency of A2 allele (and remember \(p+q = 1\)), the frequency of A1 allele in the next generation is:
\(p_{t+1} = \frac{W_{11}p_{t}^2+W_{12}p_{t}q_{t}}{\bar{W}}\)
Again, the absolute fitness is irrelevant; what matters is the relative fitness. The above could be represented as
\(p_{t+1} = \frac{\omega_{11}p_{t}^2+\omega_{12}p_{t}q_{t}}{\bar{\omega}}\)
Changes in allele frequency in one generation is then
\(\Delta p_{t} = p_{t+1}-p_{t} = \frac{\omega_{11}p_{t}^2+\omega_{12}p_{t}q_{t}}{\bar{\omega}} - p_{t}\)
Let’s incorporate \(s\) and \(h\) (i.e., \(\omega_{12} = 1-hs\) and \(\omega_{22} = 1-s\)), the changes in allele frequency can be represented as
\(\Delta p_{t} = \frac{p_{t}hs + q_{t}s(1-h)}{\bar\omega}p_{t}q_{t}\)
When \(s > 0\) (i.e., A1A1 individuals are selected for), can you say whether \(\Delta p_{t}\) is always positive or negative? Compare to the case for haploid.
When \(\Delta p_{t}\) is positive, is the frequency of A1 allele increasing or decreasing? How about when the value is negative?
Let’s define a function to predict A1 allele frequency of the next generation with different s and h. Remember that \(p+q = 1\), so we only need to specify \(p\) in the function.
s = 0.6
h = 0.5
p = 0.05
ngen = 100
v_p = c()
for (i in 1:ngen){
p_next = p_next_fun_diploid(p, s, h)
v_p[i] = p_next
p = p_next
}
plot(1:ngen, v_p, xlab = "generation", ylab = "frequency", pch = 20, ylim = c(0, 1))How \(s\) changes the dynamics of A1 allele is similar to that in the haploid case. Let’s explore a bit about the impacts of \(h\).
s = 0.6
ngen = 50
m = matrix(nrow = 5, ncol = ngen)
v_h = c(0, 0.2, 0.5, 0.8, 1) # testing 5 different h values
for(i in 1:5){
my_h = v_h[i]
p = 0.05
for (j in 1:ngen){
p_next = p_next_fun_diploid(p, s, my_h)
m[i, j] = p_next
p = p_next
}
}
my_color = c("red", "orange", "yellow", "olivedrab", "blue")
plot(1:ngen, m[1,], xlab = "generation", ylab = "A1 frequency", pch = 20, col = my_color[1], type = 'b', ylim = c(0, 1))
for (i in 2:5){ #plot functions can also be loop
points(1:ngen, m[i,], col = my_color[i], pch = 20, type = 'b')
}With which \(h\), A1 allele arises in frequency fastest?
Did you notice that the trajectories are also different? Can you guess why?
\(h\) does not need to be between 0 and 1. Let’s look at the following example.
| A11 | A12 | A22 | |
|---|---|---|---|
| number of individuals | 25 | 950 | 9025 |
| genotype frequency | ? | ? | ? |
| rate of survival | 100% | 100% | 80% |
| number of offspring | 50 | 70 | 50 |
| absolute fitness | ? | ? | ? |
| relative fitness | ? | ? | ? |
What is the value of \(s\) and \(h\)? Which genotype (A1A1, A1A2, or A2A2) has the highest fitness?
This is called “overdominance,” when heterozygotes have the highest fitness. Let’s see how this influences the dynamics of A1 allele.
p = 0.05
s = 0.2
h = -2
ngen = 100
v_p = c()
for (i in 1:ngen){
p_next = p_next_fun_diploid(p, s, h)
v_p[i] = p_next
p = p_next
}
plot(1:ngen, v_p, xlab = "generation", ylab = "frequency", pch = 20, ylim = c(0, 1))Do you think A1 allele will ever goes to fixation (i.e., all individuals in the population have that allele)?
Let’s look at another scenario.
| A11 | A12 | A22 | |
|---|---|---|---|
| number of individuals | 25 | 950 | 9025 |
| genotype frequency | ? | ? | ? |
| rate of survival | 100% | 50% | 80% |
| number of offspring | 80 | 70 | 80 |
| absolute fitness | ? | ? | ? |
| relative fitness | ? | ? | ? |
What is the value of \(s\) and \(h\)? Which genotype (A1A1, A1A2, or A2A2) has the lowest fitness?
When the heterozygotes have the lowest fitness, it is called “underdominance.” Underdominance also has some interesting behavior….
p = 0.05
s = 0.2
h = 2.8
ngen = 100
v_p = c()
for (i in 1:ngen){
p_next = p_next_fun_diploid(p, s, h)
v_p[i] = p_next
p = p_next
}
plot(1:ngen, v_p, xlab = "generation", ylab = "frequency", pch = 20, ylim = c(0, 1))When the starting frequency of A1 is 0.05, is A1 eventually fixed or lost in the population?
Let’s look at exactly the same setting for h and s, but the starting frequency is different.
p = 0.5
s = 0.2
h = 2.8
ngen = 100
v_p = c()
for (i in 1:ngen){
p_next = p_next_fun_diploid(p, s, h)
v_p[i] = p_next
p = p_next
}
plot(1:ngen, v_p, xlab = "generation", ylab = "frequency", pch = 20, ylim = c(0, 1))Is A1 fixed or lost now?
Selection and Drift
Let’s take another look at the selection and drift functions.
#let's simulate the effect of selection for ten times
my_p = 0.1
my_s = 0.2
my_h = 0.5
my_ngen = 100
colors = c("red", "pink", "orange", "yellow", "darkgreen", "green", "blue", "skyblue", "violet", "gray50")
m_p = matrix(nrow = 10, ncol = my_ngen)
for(i in 1:10){
v_p = selection_diploid(my_p, 1-my_s, my_h, my_ngen)
m_p[i,] = v_p
}
plot(m_p[1,], type = 'l', ylim = c(0, 1), ylab = "A1 frequency", xlab = "generation", col = colors[1])
for(i in 2:10){
points(m_p[i,], type = 'l', col = colors[i])
}#let's simulate the effect of drift for ten times
my_p = 0.1
N = 100
my_ngen = 100
colors = c("red", "pink", "orange", "yellow", "darkgreen", "green", "blue", "skyblue", "violet", "gray50")
m_p = matrix(nrow = 10, ncol = my_ngen)
for(i in 1:10){
v_p = drift_diploid(my_p, N, my_ngen)
m_p[i,] = v_p
}
plot(m_p[1,], type = 'l', ylim = c(0, 1), ylab = "A1 frequency", xlab = "generation", col = colors[1])
for(i in 2:10){
points(m_p[i,], type = 'l', col = colors[i])
}The major difference between the selection vs drift simulation is one is deterministic while the other is stochastic. These two terms are general to various simulations and you may encounter them in context other than population genetics.
Relative importance of directional selection and genetic drift
In real population, both forces lead to changes in allele frequencies. But which one dominates depend on the strength of the forces (or “scale” in population genetics). A rule of thumb is to compare s vs 1/N. When s > 1/N, selection dominates and the changes in allele frequency is more or less deterministic. On the other hand, when s < 1/N, drift dominates and changes in allele frequency is stochastic.
#set up our parameters
N = 1000
p = 0.5
h = 0.5
s = 0.1
ngen = 1000
v_p = c()
for (i in 1:ngen){
#here, we let selection happens first. In real population genetic simulations, the order sometimes matter.
#NOTE: did we use N here?
p_next_after_selection = p_next_fun_diploid(p, s, h)
#drift happens after
#NOTE: did we use N here?
p_next_after_selection_after_drift = p_next_drift_fun(p_next_after_selection, N)
v_p[i] = p_next_after_selection_after_drift
p = p_next_after_selection_after_drift ######## important ######
}
plot(v_p, type = 'l', ylim = c(0, 1), xlab = "generation", ylab = "allele frequency")Let’s try defining a new function.
sel_drift_fun = function(p, s, h, N, ngen){
v_p = c()
for (i in 1:ngen){
p_next_after_selection = p_next_fun_diploid(p, s, h)
p_next_after_selection_after_drift = p_next_drift_fun(p_next_after_selection, N)
v_p[i] = p_next_after_selection_after_drift
p = p_next_after_selection_after_drift
}
return(v_p)
}Let’s use this new function to explore the relative importance of selection and drift. In this case, s >> 1/N.
p = 0.5
ngen = 100
N = 100
s = 0.4
h = 0.5
v_freq = sel_drift_fun(p, s, h, N, ngen)
plot(v_freq, type = 'l', col = "blue", ylim = c(0, 1), xlab = "generation", ylab = "allele frequency")Let’s try weaker selection, where s ~ 1/N.
p = 0.5
ngen = 100
N = 100
s = 0.01
h = 0.5
v_freq = sel_drift_fun(p, s, h, N, ngen)
plot(v_freq, type = 'l', col = "red", ylim = c(0, 1), xlab = "generation", ylab = "allele frequency")Let’s try running the simulations multiple times (like what we did for drift alone) and explore the trajectories.
For the case where s >> 1/N, do you think selection or drift dominates?
p = 0.5
ngen = 100
nrun = 10
N = 100
s = 0.4
h = 0.5
v_freq = sel_drift_fun(p, s, h, N, ngen)
plot(v_freq, type = 'l', col = "blue", ylim = c(0, 1), xlab = "generation", ylab = "allele frequency", main = "N = 100, s = 0.1")
for(i in 1:nrun-1){
v_freq = sel_drift_fun(p, s, h, N, ngen)
lines(v_freq, type = 'l', col = "blue")
}For the case where s ~ 1/N, do you think selection or drift dominates?
p = 0.5
ngen = 100
nrun = 10
N = 100
s = 0.01
h = 0.5
v_freq = sel_drift_fun(p, s, h, N, ngen)
plot(v_freq, type = 'l', col = "red", ylim = c(0, 1), xlab = "generation", ylab = "allele frequency", main = "N = 100, s = 0.01")
for(i in 1:nrun-1){
v_freq = sel_drift_fun(p, s, h, N, ngen)
lines(v_freq, type = 'l', col = "red")
}It is the relative scale of s and N that matters. Dirft would dominate for both N = 100, s = 0.01 and N = 1000, s = 0.001.
par(mfrow = c(1, 2)) #having two figures in one row
p = 0.5
ngen = 1000
nrun = 10
N = 100
s = 0.01
h = 0.5
v_freq = sel_drift_fun(p, s, h, N, ngen)
plot(v_freq, type = 'l', col = "red", ylim = c(0, 1), xlab = "generation", ylab = "allele frequency", main = "N = 100, s = 0.01")
for(i in 1:nrun-1){
v_freq = sel_drift_fun(p, s, h, N, ngen)
lines(v_freq, type = 'l', col = "red")
}
N = 1000
s = 0.001
omega2 = 1-s
v_freq = sel_drift_fun(p, s, h, N, ngen)
plot(v_freq, type = 'l', col = "red", ylim = c(0, 1), xlab = "generation", ylab = "allele frequency", main = "N = 1000, s = 0.001")
for(i in 1:nrun-1){
v_freq = sel_drift_fun(p, s, h, N, ngen)
lines(v_freq, type = 'l', col = "red")
}You may notice that how quickly the allele frequency changes differ between N = 100 and 1000, despite drift dominates both simulations. The time scale for drift depends on the inverse of population size: the expected time to fixation/lost of an allele due to drift is ~ 1/N. Accordingly, allele frequencies change with faster rate when N is smaller.