for (i in 1:15){
print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
[1] 11
[1] 12
[1] 13
[1] 14
[1] 15
by Grace Yuh Chwen Lee, Winter 2025
Selection could be natural or artificial. Different dog breeds are wonderful examples showcasing the impacts of artificial selection. Treating bacteria with antibiotics could also be considered artificial selection (unless the same bacteria strain and the strain producing that antibiotics co-exist in the wild).
We are going to learn how to use simple mathematical expression to predict changes in allele frequencies in the presence of selection.
We will discuss population genetic theories for both (1) haploid and (2) diploid under one locus, two alleles. You will use both for your lab reports.
Is bacteria haploid or diploid? How about human? How about the mustard plants?
Consider one locus with two alleles, A1 and A2.
A1 | A2 | |
---|---|---|
number of individuals at time t | \(P_{t}\) | \(Q_{t}\) |
allele frequencies at time t | \(p_{t} = P_{t}/(P_{t}+Q_{t})\) | \(q_{t} = Q_{t}/(P_{t}+Q_{t})\) |
absolute fitness - offspring produced (per ind) | \(W1\) | \(W2\) |
relative fitness (unitless) | \(\omega_{1} = W_{1}/W_{1} = 1\) | \(\omega_{2} = W2/W1\) |
Let’s look at one possible scenario together:
A1 | A2 | |
---|---|---|
number of individuals at time t | 5 | 95 |
allele frequencies at time t | ?? | ?? |
absolute fitness - offspring produced (per ind) | 100 | 95 |
relative fitness (unit-less) | 1 | ?? |
Several important concepts:
Because total allele frequencies should add up to 1, \(q_{t} = 1 - p_{t}\).
What are the TWO components of fitness?
We usually set the relative fitness of one of the alleles to 1 (in this case, A1). The fitness of A2 could be written as \(\omega_{2} = 1-s\), where s is the well known selection coefficient.
Assuming s > 0, do you think \(p_{t}\) or \(q_{t}\) will increase over time?
Let’s do some derivation to find the allele frequencies of A1 and A2 in the next generation.
At generation t, the mean absolute fitness of the population will be:
\(\bar{W} = W_{1}\frac{Pt}{P_{t}+Q_{t}}+W_{2}\frac{Qt}{P_{t}+Q_{t}} = W_{1}p_{t}+W_{2}q_{t}\)
The mean relative fitness is very similar:
\(\bar{\omega} = \omega_{1}\frac{Pt}{P_{t}+Q_{t}}+\omega_{2}\frac{Qt}{P_{t}+Q_{t}} = \omega_{1}p_{t}+\omega_{2}q_{t}\)
For the next generation, we will have the following number of A1 and A2:
A1: \(W_{1}P_{t}\)
A2: \(W_{2}Q_{t}\)
The allele frequencies of A1 in the next generation will be (i.e., the number of A1 offspring among all A1 + A2 offspring)
\(p_{t+1} = \frac{P_{t+1}}{P_{t+1}+Q_{t+1}} = \frac{W1P_{t}}{W1P_{t}+W2Q_{t}}\)
Remember that \(p_{t} = P_{t}/(P_{t}+Q_{t})\) and \(q_{t} = Q_{t}/(P_{t}+Q_{t})\).
The above could be expressed as
\(p_{t+1} = \frac{W1p_{t}}{W1p_{t}+W2q_{t}} = \frac{W_{1}}{\bar{W}}p_{t}\)
Because \(\omega_{1} = W_{1}/W_{1}\) and \(\omega_{2} = W_{2}/W_{1}\), with some rearrangement, the above becomes
\(p_{t+1} = \frac{\omega_{1}}{\bar{\omega}}p_{t}\)
Any important message we could learn from the above equation? Does change in allele frequencies depend on absolute or relative fitness?
A1 | A2 | |
---|---|---|
number of individuals at time t | \(P_{t} = 2\) | \(Q_{t} = 98\) |
allele frequencies at time t | \(p_{t} = ?\) | \(q_{t} = ?\) |
absolute fitness - offspring produced (per ind) | \(W_{1} = 100\) | \(W_{2} = 50\) |
relative fitness (unitless) | \(\omega_{1} = ?\) | \(\omega_{2} = ?\) |
Hints:
Fill in the ? in the table; then define them in R
Calculate mean relative fitness
Use the above equation. Then you should get the frequency of A1 in next generation.
Can you do this for \(t+2\) and \(t+3\) generations? How about \(t + 1000\) ?
When you need to repeat an action, for loop in R is very helpful. Let’s try a simple example.
for (i in 1:15){
print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
[1] 11
[1] 12
[1] 13
[1] 14
[1] 15
Here, i
is a counter that keep track of the progress of the loop (you can use whatever variable names you want), and 1:15
specifies the range.
for (i in 100:115){
print(i)
}
[1] 100
[1] 101
[1] 102
[1] 103
[1] 104
[1] 105
[1] 106
[1] 107
[1] 108
[1] 109
[1] 110
[1] 111
[1] 112
[1] 113
[1] 114
[1] 115
= 0
mysum for (i in 100:115){
= mysum + i
mysum
} mysum
[1] 1720
#compare it the answer of R built-in function sum()
sum(100:115)
[1] 1720
Let’s do it for predicting A1 frequency at generation \(t+20\).
= 0.05 #inital frequency of A1
p = 0.7 #relative fitness of A2
omega2
= p/(p*1 + (1-p)*omega2)
p_next p_next
[1] 0.06993007
#do it for the next next generation
= p_next/(p_next*1 + (1-p_next)*omega2) p_next_next
How do we put that into a loop? Which parts remain the same while which parts change?
Also,
p_next
is used in the next loop, we need some ways to “pass it on.”
Following is the changes in allele frequency of A1 for the next 20 generations (you can use this to check whether your for loop is right).
[1] 0.06993007
[1] 0.09699321
[1] 0.1330318
[1] 0.1797947
[1] 0.238474
[1] 0.3090874
[1] 0.3899043
[1] 0.4772558
[1] 0.5660211
[1] 0.6507438
[1] 0.7269069
[1] 0.7917754
[1] 0.844531
[1] 0.8858476
[1] 0.9172598
[1] 0.9406076
[1] 0.9576711
[1] 0.9699887
[1] 0.9788012
[1] 0.9850659
But how do we save the output? Let’s try saving it in a vector.
Lets look at the following two blocks of code. Why one is giving you changes in A1 frequency while the other does not?
= 0.05
p = 0.7
omega2 = 20
ngen
= c() #you can specify an empty vector
v_p
for (i in 1:ngen){
= p/(p*1 + (1-p)*omega2)
p_next = p_next
v_p[i]
}plot(v_p, pch = 20, xlab = "generation", ylab = "A1 frequency")
= 0.05
p = 0.7
omega2 = 20
ngen
= c() #you can specify an empty vector
v_p
for (i in 1:ngen){
= p/(p*1 + (1-p)*omega2)
p_next = p_next
v_p[i] = p_next
p
}plot(v_p, pch = 20, xlab = "generation", ylab = "A1 frequency")
Let’s try doing the same thing, but with different omega2
values.
= 0.05
p = 0.9
omega2 = 20
ngen = c() #you can specify an empty vector
v_p2
for (i in 1:ngen){
= p/(p*1 + (1-p)*omega2)
p_next = p_next
v_p2[i] = p_next
p
}plot(v_p2, pch = 20, xlab = "generation", ylab = "A1 frequency", main = "omega2 = 0.9", ylim = c(0, 1))
= 0.05
p = 0.2
omega2 = 20
ngen = c() #you can specify an empty vector
v_p3
for (i in 1:ngen){
= p/(p*1 + (1-p)*omega2)
p_next = p_next
v_p3[i] = p_next
p
}plot(v_p3, pch = 20, xlab = "generation", ylab = "A1 frequency", main = "omega2 = 0.2", ylim = c(0, 1))
With omega2 = 0.7, 0.9, and 0.2, what are the corresponding selection coefficient(s)?
Do you think selection is stronger or weaker when the omega2 value is smaller?
Is selection acting for or against individuals with A1 allele?
And when the omega2 value is smaller, is the changes in A1 allele frequencies faster or slower?
What’s the usage of knowing the changes in allele frequencies of A1? Let’s assume A1 is an allele conferring antibiotic resistance (or even COVID-vaccine resistance). Then, do you think being able to predict changes in A1 frequency over time is important?
It would be informative if we could compare the trajectories of A1 under different selection regimes (i.e., selection coefficients). Let’s learn how to do that.
By changing the options of plot()
, we could make it looks line line plots. type = 'l'
tells it is a line plot and lwd = 2
specifies how thick the line should be.
plot(v_p, pch = 20, xlab = "generation", ylab = "A1 frequency", type = 'l', lwd = 2)
Or you can use type = "b"
if you want both dots and lines. Note that you need the " "
around the l or b option.
plot(v_p, pch = 20, xlab = "generation", ylab = "A1 frequency", type = 'b', lwd = 2)
After you make a plot, you can add points or lines to it by points()
or lines()
. (You used similar concepts when doing abline
in the L-D report).
plot(v_p, pch = 20, xlab = "generation", ylab = "A1 frequency")
points(v_p2, pch = 10, col = "red")
points(v_p3, pch = 5, col = "blue")
plot(v_p, pch = 20, xlab = "generation", ylab = "A1 frequency", type = 'l')
lines(v_p2, col = "red")
lines(v_p3, col = "blue")
plot(v_p, pch = 20, xlab = "generation", ylab = "A1 frequency", type = 'b')
points(v_p2, pch = 10, col = "red", type = 'b')
lines(v_p3, pch = 5, col = "blue", type = 'b')
It is important to note that the plot
is made first, and the points
and lines
are added later. So, if the range of the plot
is limited, you may not see your points
and lines
(remember that issue with your L-D report?). See the following example:
plot(v_p2, pch = 20, xlab = "generation", ylab = "A1 frequency", type = 'b', col = "red")
points(v_p, pch = 10, type = 'b')
lines(v_p3, pch = 5, col = "blue", type = 'b')
There are many ways you could fix this. One way is to make the plot with the vector with the widest range of values first. But the better way is to simply specify the range of the plots.
= min(c(v_p, v_p2, v_p3))
y_min = max(c(v_p, v_p2, v_p3))
y_max
plot(v_p2, pch = 20, xlab = "generation", ylab = "A1 frequency", type = 'b', col = "red", ylim = c(y_min, y_max))
points(v_p, pch = 10, type = 'b')
lines(v_p3, pch = 5, col = "blue", type = 'b')
Let’s look at these questions again….
With omega2 = 0.7, 0.9, and 0.2, what are the corresponding selection coefficient (s)?
Do you think selection is stronger or weaker when the omega2 value is smaller?
Is selection acting for or against individuals with A1 allele?
And when the omega2 value is smaller, is the changes in A1 allele frequencies faster or slower?
What’s the usage of knowing the changes in allele frequencies of A1? Let’s assume A1 is an allele conferring antibiotic resistance (or even COVID-vaccine resistance). Then, do you think being able to predict changes in A1 frequency over time is important?
In addition to R built in functions, you could also write your own custom R functions. This is easy if you need to do a certain thing repeatedly. For example, calculating allele frequency in the next generation.
Let’s look at what we just did. We need to calculate p_next
and we keep typing the whole equation out.
= 0.05
p = 0.7
omega2 = 20
ngen
for (i in 1:ngen){
= p/(p*1 + (1-p)*omega2)
p_next print(p_next)
= p_next
p }
To streamline the process, we could make our own p_next()
function using function(YOUR VARIABLE){YOUR CALCULATION}
.
= function(p, omega2){
p_next_fun = p/(p*1 + (1-p)*omega2)
p_next return(p_next) #****
}
Let’s try using both of functions.
#my own sum function
mysum(1:10)
[1] 55
#R's built in function
sum(1:10)
[1] 55
p_next_fun(0.1, 0.9)
Similar to using built-in R functions, if you don’t specify enough variables, R also feels upset! Try this.
p_next_fun(0.1)
With our own R functions, we could streamline our codes above.
#define a new function for the allele frequency in next generation
= function(p, omega2){
p_next_fun = p/(p*1 + (1-p)*omega2)
p_next return(p_next)
}
#try different oemga2 value
= 20
ngen = matrix(nrow = 3, ncol = ngen)
m #specifying an empty matrix with 3 rows and ngen columns
#one row will be for one omega2 value
= c(0.7, 0.9, 0.2)
v_omega2
for(i in 1:3){
= v_omega2[i]
omega2
= 0.05 #think about why we need to specify p in the loop?
p for(j in 1:ngen){ #you cannot use i here
= p_next_fun(p, omega2)
p_next = p_next
m[i, j] = p_next ######
p
}
}
plot(1:ngen, m[1,], xlab = "generation", ylab = "allele frequency", pch = 20, ylim = c(0, 1))
points(1:ngen, m[2,], col = "red", pch = 20)
points(1:ngen, m[3,], col = "blue", pch = 20)
Think about why we need to specify
p
in the loop, instead of likengen
, specifying at the beginning of the code?
Most the organisms you could actually “see” are diploids. So, it is equally important to understand how selection may influence the changes in allele frequencies. The concept is important in many scenarios, e.g. plant and animal breeding (e.g., how fast could farmers select corn with 110% of the yield?), insecticide resistance (e.g., how quickly will the mosquito population all become insecticide resistant), adaptation (e.g., plants being resistant to salt water!), and many other.
One major difference between diploid and haploid selection model is dominance coefficient h
. Let’s look at the number of offspring produced by three genotypes. Which allele is dominant?
A1A1 | A1A2 | A2A2 |
---|---|---|
100 | 100 | 50 |
100 | 50 | 50 |
100 | 75 | 50 |
The h
value for the above three cases are 1 (A1 is dominant), 0 (A2 is dominant), and 0.5 (neither allele is dominant over the other allele). Note that h
can take values beyond 0 and 1 (stay tuned).
With \(p\) being the frequency of A1 allele and \(q\) being the frequency of A2 allele (and remember \(p+q = 1\)), we can have a similar table for diploids (remember your Hardy-Weinberg….)
A1A1 | A1A2 | A2A2 | |
---|---|---|---|
genotype frequency | \(p^2\) | \(2pq\) | \(q^2\) |
absolute fitness | \(W_{11}\) | \(W_{12}\) | \(W_{22}\) |
relative fitness | \(\omega_{11}=1\) | \(\omega_{12} = 1-hs\) | \(\omega_{22} = 1-s\) |
How do we go from observations to numbers that we could plug into equations? Consider the following example.
A11 | A12 | A22 | |
---|---|---|---|
number of individuals | 25 | 950 | 9025 |
genotype frequency | ? | ? | ? |
rate of survival | 100% | 100% | 80% |
number of offspring | 100 | 70 | 50 |
absolute fitness | ? | ? | ? |
relative fitness | ? | ? | ? |
How to predict allele frequencies changes under selection in diploid? Stay tuned! (Week 8)