for (i in 1:15){
print(i)
}[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
[1] 11
[1] 12
[1] 13
[1] 14
[1] 15
by Grace Yuh Chwen Lee, Spring 2024
Selection could be natural or artificial. Different dog breeds are wonderful examples showcasing the impacts of artificial selection. Treating bacteria with antibiotics could also be considered artificial selection (unless the same bacteria strain and the strain producing that antibiotics co-exist in the wild).
We are going to learn how to use simple mathematical expression to predict changes in allele frequencies in the presence of selection.
We will discuss population genetic theories for both (1) haploid and (2) diploid under one locus, two alleles. You will use both for your lab reports.
Is bacteria haploid or diploid? How about human?
Consider one locus with two alleles, A1 and A2.
| A1 | A2 | |
|---|---|---|
| number of individuals at time t | \(P_{t}\) | \(Q_{t}\) |
| allele frequencies at time t | \(p_{t} = P_{t}/(P_{t}+Q_{t})\) | \(q_{t} = Q_{t}/(P_{t}+Q_{t})\) |
| absolute fitness - offspring produced (per ind) | \(W1\) | \(W2\) |
| relative fitness (unitless) | \(\omega_{1} = W_{1}/W_{1} = 1\) | \(\omega_{2} = W2/W1\) |
Let’s look at one possible scenario together:
| A1 | A2 | |
|---|---|---|
| number of individuals at time t | 5 | 95 |
| allele frequencies at time t | ?? | ?? |
| absolute fitness - offspring produced (per ind) | 100 | 95 |
| relative fitness (unit-less) | 1 | ?? |
Several important concepts:
Because total allele frequencies should add up to 1, \(q_{t} = 1 - p_{t}\).
What are the TWO components of fitness?
We usually set the relative fitness of one of the alleles to 1 (in this case, A1). The fitness of A2 could be written as \(\omega_{2} = 1-s\), where s is the well known selection coefficient.
Assuming s > 0, do you think \(p_{t}\) or \(q_{t}\) will increase over time?
Let’s do some derivation to find the allele frequencies of A1 and A2 in the next generation.
At generation t, the mean absolute fitness of the population will be:
\(\bar{W} = W_{1}\frac{Pt}{P_{t}+Q_{t}}+W_{2}\frac{Qt}{P_{t}+Q_{t}} = W_{1}p_{t}+W_{2}q_{t}\)
The mean relative fitness is very similar:
\(\bar{\omega} = \omega_{1}\frac{Pt}{P_{t}+Q_{t}}+\omega_{2}\frac{Qt}{P_{t}+Q_{t}} = \omega_{1}p_{t}+\omega_{2}q_{t}\)
For the next generation, we will have the following number of A1 and A2:
A1: \(W_{1}P_{t}\)
A2: \(W_{2}Q_{t}\)
The allele frequencies of A1 in the next generation will be (i.e., the number of A1 offspring among all A1 + A2 offspring)
\(p_{t+1} = \frac{P_{t+1}}{P_{t+1}+Q_{t+1}} = \frac{W1P_{t}}{W1P_{t}+W2Q_{t}}\)
Remember that \(p_{t} = P_{t}/(P_{t}+Q_{t})\) and \(q_{t} = Q_{t}/(P_{t}+Q_{t})\).
The above could be expressed as
\(p_{t+1} = \frac{W1p_{t}}{W1p_{t}+W2q_{t}} = \frac{W_{1}}{\bar{W}}p_{t}\)
Because \(\omega_{1} = W_{1}/W_{1}\) and \(\omega_{2} = W_{2}/W_{1}\), with some rearrangement, the above becomes
\(p_{t+1} = \frac{\omega_{1}}{\bar{\omega}}p_{t}\)
Any important message we could learn from the above equation? Does change in allele frequencies depend on absolute or relative fitness?
| A1 | A2 | |
|---|---|---|
| number of individuals at time t | \(P_{t} = 2\) | \(Q_{t} = 98\) |
| allele frequencies at time t | \(p_{t} = ?\) | \(q_{t} = ?\) |
| absolute fitness - offspring produced (per ind) | \(W_{1} = 100\) | \(W_{2} = 50\) |
| relative fitness (unitless) | \(\omega_{1} = ?\) | \(\omega_{2} = ?\) |
Hints:
Fill in the ? in the table; then define them in R
Calculate mean relative fitness
Use the above equation. Then you should get the frequency of A1 in next generation.
Can you do this for \(t+2\) and \(t+3\) generations? How about \(t + 1000\) ?
When you need to repeat an action, for loop in R is very helpful. Let’s try a simple example.
for (i in 1:15){
print(i)
}[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
[1] 11
[1] 12
[1] 13
[1] 14
[1] 15
Here, i is a counter that keep track of the progress of the loop (you can use whatever variable names you want), and 1:15 specifies the range.
for (i in 100:115){
print(i)
}[1] 100
[1] 101
[1] 102
[1] 103
[1] 104
[1] 105
[1] 106
[1] 107
[1] 108
[1] 109
[1] 110
[1] 111
[1] 112
[1] 113
[1] 114
[1] 115
mysum = 0
for (i in 100:115){
mysum = mysum + i
}
mysum[1] 1720
#compare it the answer of R built-in function sum()
sum(100:115)[1] 1720
Let’s do it for predicting A1 frequency at generation \(t+20\).
p = 0.05 #inital frequency of A1
omega2 = 0.7 #relative fitness of A2
p_next = p/(p*1 + (1-p)*omega2)
p_next[1] 0.06993007
#do it for the next next generation
p_next_next = p_next/(p_next*1 + (1-p_next)*omega2)How do we put that into a loop? Which parts remain the same while which parts change?
Also,
p_nextis used in the next loop, we need some ways to “pass it on.”
Following is the changes in allele frequency of A1 for the next 20 generations (you can use this to check whether your for loop is right).
[1] 0.06993007
[1] 0.09699321
[1] 0.1330318
[1] 0.1797947
[1] 0.238474
[1] 0.3090874
[1] 0.3899043
[1] 0.4772558
[1] 0.5660211
[1] 0.6507438
[1] 0.7269069
[1] 0.7917754
[1] 0.844531
[1] 0.8858476
[1] 0.9172598
[1] 0.9406076
[1] 0.9576711
[1] 0.9699887
[1] 0.9788012
[1] 0.9850659
But how do we save the output? Let’s try saving it in a vector.
Lets look at the following two blocks of code. Why one is giving you changes in A1 frequency while the other does not?
p = 0.05
omega2 = 0.7
ngen = 20
v_p = c() #you can specify an empty vector
for (i in 1:ngen){
p_next = p/(p*1 + (1-p)*omega2)
v_p[i] = p_next
}
plot(v_p, pch = 20, xlab = "generation", ylab = "A1 frequency")p = 0.05
omega2 = 0.7
ngen = 20
v_p = c() #you can specify an empty vector
for (i in 1:ngen){
p_next = p/(p*1 + (1-p)*omega2)
v_p[i] = p_next
p = p_next
}
plot(v_p, pch = 20, xlab = "generation", ylab = "A1 frequency")
Let’s try doing the same thing, but with different omega2 values.
p = 0.05
omega2 = 0.9
ngen = 20
v_p2 = c() #you can specify an empty vector
for (i in 1:ngen){
p_next = p/(p*1 + (1-p)*omega2)
v_p2[i] = p_next
p = p_next
}
plot(v_p2, pch = 20, xlab = "generation", ylab = "A1 frequency", main = "omega2 = 0.9", ylim = c(0, 1))p = 0.05
omega2 = 0.2
ngen = 20
v_p3 = c() #you can specify an empty vector
for (i in 1:ngen){
p_next = p/(p*1 + (1-p)*omega2)
v_p3[i] = p_next
p = p_next
}
plot(v_p3, pch = 20, xlab = "generation", ylab = "A1 frequency", main = "omega2 = 0.2", ylim = c(0, 1))With omega2 = 0.7, 0.9, and 0.2, what are the corresponding selection coefficient(s)?
Do you think selection is stronger or weaker when the omega2 value is smaller?
Is selection acting for or against individuals with A1 allele?
And when the omega2 value is smaller, is the changes in A1 allele frequencies faster or slower?
What’s the usage of knowing the changes in allele frequencies of A1? Let’s assume A1 is an allele conferring antibiotic resistance (or even COVID-vaccine resistance). Then, do you think being able to predict changes in A1 frequency over time is important?
It would be informative if we could compare the trajectories of A1 under different selection regimes (i.e., selection coefficients). Let’s learn how to do that.
By changing the options of plot(), we could make it looks line line plots. type = 'l' tells it is a line plot and lwd = 2 specifies how thick the line should be.
plot(v_p, pch = 20, xlab = "generation", ylab = "A1 frequency", type = 'l', lwd = 2)Or you can use type = "b" if you want both dots and lines. Note that you need the " " around the l or b option.
plot(v_p, pch = 20, xlab = "generation", ylab = "A1 frequency", type = 'b', lwd = 2)After you make a plot, you can add points or lines to it by points() or lines(). (You used similar concepts when doing abline in the L-D report).
plot(v_p, pch = 20, xlab = "generation", ylab = "A1 frequency")
points(v_p2, pch = 10, col = "red")
points(v_p3, pch = 5, col = "blue")plot(v_p, pch = 20, xlab = "generation", ylab = "A1 frequency", type = 'l')
lines(v_p2, col = "red")
lines(v_p3, col = "blue")plot(v_p, pch = 20, xlab = "generation", ylab = "A1 frequency", type = 'b')
points(v_p2, pch = 10, col = "red", type = 'b')
lines(v_p3, pch = 5, col = "blue", type = 'b')It is important to note that the plot is made first, and the points and lines are added later. So, if the range of the plot is limited, you may not see your points and lines (remember that issue with your L-D report?). See the following example:
plot(v_p2, pch = 20, xlab = "generation", ylab = "A1 frequency", type = 'b', col = "red")
points(v_p, pch = 10, type = 'b')
lines(v_p3, pch = 5, col = "blue", type = 'b')There are many ways you could fix this. One way is to make the plot with the vector with the widest range of values first. But the better way is to simply specify the range of the plots.
y_min = min(c(v_p, v_p2, v_p3))
y_max = max(c(v_p, v_p2, v_p3))
plot(v_p2, pch = 20, xlab = "generation", ylab = "A1 frequency", type = 'b', col = "red", ylim = c(y_min, y_max))
points(v_p, pch = 10, type = 'b')
lines(v_p3, pch = 5, col = "blue", type = 'b')Let’s look at these questions again….
With omega2 = 0.7, 0.9, and 0.2, what are the corresponding selection coefficient (s)?
Do you think selection is stronger or weaker when the omega2 value is smaller?
Is selection acting for or against individuals with A1 allele?
And when the omega2 value is smaller, is the changes in A1 allele frequencies faster or slower?
What’s the usage of knowing the changes in allele frequencies of A1? Let’s assume A1 is an allele conferring antibiotic resistance (or even COVID-vaccine resistance). Then, do you think being able to predict changes in A1 frequency over time is important?
In addition to R built in functions, you could also write your own custom R functions. This is easy if you need to do a certain thing repeatedly. For example, calculating allele frequency in the next generation.
Let’s look at what we just did. We need to calculate p_next and we keep typing the whole equation out.
p = 0.05
omega2 = 0.7
ngen = 20
for (i in 1:ngen){
p_next = p/(p*1 + (1-p)*omega2)
print(p_next)
p = p_next
}To streamline the process, we could make our own p_next()function using function(YOUR VARIABLE){YOUR CALCULATION}.
p_next_fun = function(p, omega2){
p_next = p/(p*1 + (1-p)*omega2)
return(p_next) #****
}Let’s try using both of functions.
#my own sum function
mysum(1:10)[1] 55
#R's built in function
sum(1:10)[1] 55
p_next_fun(0.1, 0.9)Similar to using built-in R functions, if you don’t specify enough variables, R also feels upset! Try this.
p_next_fun(0.1)With our own R functions, we could streamline our codes above.
#define a new function for the allele frequency in next generation
p_next_fun = function(p, omega2){
p_next = p/(p*1 + (1-p)*omega2)
return(p_next)
}
#try different oemga2 value
ngen = 20
m = matrix(nrow = 3, ncol = ngen)
#specifying an empty matrix with 3 rows and ngen columns
#one row will be for one omega2 value
v_omega2 = c(0.7, 0.9, 0.2)
for(i in 1:3){
omega2 = v_omega2[i]
p = 0.05 #think about why we need to specify p in the loop?
for(j in 1:ngen){ #you cannot use i here
p_next = p_next_fun(p, omega2)
m[i, j] = p_next
p = p_next
}
}
plot(1:ngen, m[1,], xlab = "generation", ylab = "allele frequency", pch = 20, ylim = c(0, 1))
points(1:ngen, m[2,], col = "red", pch = 20)
points(1:ngen, m[3,], col = "blue", pch = 20)Think about why we need to specify
pin the loop, instead of likengen, specifying at the beginning of the code?
Most the organisms you could actually “see” are diploids. So, it is equally important to understand how selection may influence the changes in allele frequencies. The concept is important in many scenarios, e.g. plant and animal breeding (e.g., how fast could farmers select corn with 110% of the yield?), insecticide resistance (e.g., how quickly will the mosquito population all become insecticide resistant), adaptation (e.g., plants being resistant to salt water!), and many other.
One major difference between diploid and haploid selection model is dominance coefficient h. Let’s look at the number of offspring produced by three genotypes. Which allele is dominant?
| A1A1 | A1A2 | A2A2 |
|---|---|---|
| 100 | 100 | 50 |
| 100 | 50 | 50 |
| 100 | 75 | 50 |
The h value for the above three cases are 1 (A1 is dominant), 0 (A2 is dominant), and 0.5 (neither allele is dominant over the other allele). Note that h can take values beyond 0 and 1 (stay tuned).
With \(p\) being the frequency of A1 allele and \(q\) being the frequency of A2 allele (and remember \(p+q = 1\)), we can have a similar table for diploids (remember your Hardy-Weinberg….)
| A1A1 | A1A2 | A2A2 | |
|---|---|---|---|
| genotype frequency | \(p^2\) | \(2pq\) | \(q^2\) |
| absolute fitness | \(W_{11}\) | \(W_{12}\) | \(W_{22}\) |
| relative fitness | \(\omega_{11}=1\) | \(\omega_{12} = 1-hs\) | \(\omega_{22} = 1-s\) |
With similar derivation as for the haploid case (but of course, a bit more complex), the frequency of A1 allele in the next generation is:
\(p_{t+1} = \frac{W_{11}p_{t}^2+W_{12}p_{t}q_{t}}{\bar{W}}\)
Again, the absolute fitness is irrelevant; what matters is the relative fitness. The above could be represented as
\(p_{t+1} = \frac{\omega_{11}p_{t}^2+\omega_{12}p_{t}q_{t}}{\bar{\omega}}\)
Changes in allele frequency in one generation is then
\(\Delta p_{t} = p_{t+1}-p_{t} = \frac{\omega_{11}p_{t}^2+\omega_{12}p_{t}q_{t}}{\bar{\omega}} - p_{t}\)
Let’s incorporate \(s\) and \(h\) (i.e., \(\omega_{12} = 1-hs\) and \(\omega_{22} = 1-s\)), the changes in allele frequency can be represented as
\(\Delta p_{t} = \frac{p_{t}hs + q_{t}s(1-h)}{\bar\omega}p_{t}q_{t}\)
Can you write out the expression for \(\bar\omega\) ?
When \(s > 0\) (i.e., A1A1 individuals are selected for), can you say whether \(\Delta p_{t}\) is always positive or negative? Compare to the case for haploid.
When \(\Delta p_{t}\) is positive, is the frequency of A1 allele increasing or decreasing? How about when the value is negative?
How do we go from observations to numbers that we could plug into equations? Consider the following example.
| A11 | A12 | A22 | |
|---|---|---|---|
| number of individuals | 25 | 950 | 9025 |
| genotype frequency | ? | ? | ? |
| rate of survival | 100% | 100% | 80% |
| number of offspring | 100 | 70 | 50 |
| absolute fitness | ? | ? | ? |
| relative fitness | ? | ? | ? |
Let’s predict the frequency of A1 with different \(s\) and \(h\) to explore how their impact.
s = 0.6
h = 0.5
p = 0.05
ngen = 100
v_p = c()
for (i in 1:ngen){
p_next = p_next_fun_diploid(p, s, h)
v_p[i] = p_next
p = p_next
}
plot(1:ngen, v_p, xlab = "generation", ylab = "frequency", pch = 20, ylim = c(0, 1))How \(s\) changes the dynamics of A1 allele is similar to that in the haploid case. Let’s explore a bit about the impacts of \(h\).
s = 0.6
ngen = 50
m = matrix(nrow = 5, ncol = ngen)
v_h = c(0, 0.2, 0.5, 0.8, 1) # testing 5 different h values
for(i in 1:5){
my_h = v_h[i]
p = 0.05
for (j in 1:ngen){
p_next = p_next_fun_diploid(p, s, my_h)
m[i, j] = p_next
p = p_next
}
}
my_color = c("red", "orange", "yellow", "olivedrab", "blue")
plot(1:ngen, m[1,], xlab = "generation", ylab = "A1 frequency", pch = 20, col = my_color[1], type = 'b', ylim = c(0, 1))
for (i in 2:5){ #plot functions can also be loop
points(1:ngen, m[i,], col = my_color[i], pch = 20, type = 'b')
}With which \(h\), A1 allele arises in frequency fastest?
Did you notice that the trajectories are also different? Can you guess why?
\(h\) does not need to be between 0 and 1. Let’s look at the following example.
| A11 | A12 | A22 | |
|---|---|---|---|
| number of individuals | 25 | 950 | 9025 |
| genotype frequency | ? | ? | ? |
| rate of survival | 100% | 100% | 80% |
| number of offspring | 50 | 70 | 50 |
| absolute fitness | ? | ? | ? |
| relative fitness | ? | ? | ? |
What is the value of \(s\) and \(h\)? Which genotype (A1A1, A1A2, or A2A2) has the highest fitness?
This is called “overdominance,” when heterozygotes have the highest fitness. Let’s see how this influences the dynamics of A1 allele.
p = 0.05
s = 0.2
h = -2
ngen = 100
v_p = c()
for (i in 1:ngen){
p_next = p_next_fun_diploid(p, s, h)
v_p[i] = p_next
p = p_next
}
plot(1:ngen, v_p, xlab = "generation", ylab = "frequency", pch = 20, ylim = c(0, 1))Do you think A1 allele will ever goes to fixation (i.e., all individuals in the population have that allele)?
Let’s look at another scenario.
| A11 | A12 | A22 | |
|---|---|---|---|
| number of individuals | 25 | 950 | 9025 |
| genotype frequency | ? | ? | ? |
| rate of survival | 100% | 50% | 80% |
| number of offspring | 80 | 70 | 80 |
| absolute fitness | ? | ? | ? |
| relative fitness | ? | ? | ? |
What is the value of \(s\) and \(h\)? Which genotype (A1A1, A1A2, or A2A2) has the lowest fitness?
When the heterozygotes have the lowest fitness, it is called “underdominance.” Underdominance also has some interesting behavior….
p = 0.05
s = 0.2
h = 2.8
ngen = 100
v_p = c()
for (i in 1:ngen){
p_next = p_next_fun_diploid(p, s, h)
v_p[i] = p_next
p = p_next
}
plot(1:ngen, v_p, xlab = "generation", ylab = "frequency", pch = 20, ylim = c(0, 1))When the starting frequency of A1 is 0.05, is A1 eventually fixed or lost in the population?
Let’s look at exactly the same setting for h and s, but the starting frequency is different.
p = 0.5
s = 0.2
h = 2.8
ngen = 100
v_p = c()
for (i in 1:ngen){
p_next = p_next_fun_diploid(p, s, h)
v_p[i] = p_next
p = p_next
}
plot(1:ngen, v_p, xlab = "generation", ylab = "frequency", pch = 20, ylim = c(0, 1))Is A1 fixed or lost now?