Week 6 E115 L: selection

by Grace Yuh Chwen Lee, Winter 2025

Selection

Selection could be natural or artificial. Different dog breeds are wonderful examples showcasing the impacts of artificial selection. Treating bacteria with antibiotics could also be considered artificial selection (unless the same bacteria strain and the strain producing that antibiotics co-exist in the wild).

We are going to learn how to use simple mathematical expression to predict changes in allele frequencies in the presence of selection.

We will discuss population genetic theories for both (1) haploid and (2) diploid under one locus, two alleles. You will use both for your lab reports.

Is bacteria haploid or diploid? How about human? How about the mustard plants?

Haploid selection model

Consider one locus with two alleles, A1 and A2.

	A1	A2
number of individuals at time t	\(P_{t}\)	\(Q_{t}\)
allele frequencies at time t	\(p_{t} = P_{t}/(P_{t}+Q_{t})\)	\(q_{t} = Q_{t}/(P_{t}+Q_{t})\)
absolute fitness - offspring produced (per ind)	\(W1\)	\(W2\)
relative fitness (unitless)	\(\omega_{1} = W_{1}/W_{1} = 1\)	\(\omega_{2} = W2/W1\)

Let’s look at one possible scenario together:

	A1	A2
number of individuals at time t	5	95
allele frequencies at time t	??	??
absolute fitness - offspring produced (per ind)	100	95
relative fitness (unit-less)	1	??

Several important concepts:

Because total allele frequencies should add up to 1, \(q_{t} = 1 - p_{t}\).
What are the TWO components of fitness?
We usually set the relative fitness of one of the alleles to 1 (in this case, A1). The fitness of A2 could be written as \(\omega_{2} = 1-s\), where s is the well known selection coefficient.

Assuming s > 0, do you think \(p_{t}\) or \(q_{t}\) will increase over time?
Let’s do some derivation to find the allele frequencies of A1 and A2 in the next generation.

At generation t, the mean absolute fitness of the population will be:

\(\bar{W} = W_{1}\frac{Pt}{P_{t}+Q_{t}}+W_{2}\frac{Qt}{P_{t}+Q_{t}} = W_{1}p_{t}+W_{2}q_{t}\)

The mean relative fitness is very similar:

\(\bar{\omega} = \omega_{1}\frac{Pt}{P_{t}+Q_{t}}+\omega_{2}\frac{Qt}{P_{t}+Q_{t}} = \omega_{1}p_{t}+\omega_{2}q_{t}\)

For the next generation, we will have the following number of A1 and A2:

A1: \(W_{1}P_{t}\)

A2: \(W_{2}Q_{t}\)

The allele frequencies of A1 in the next generation will be (i.e., the number of A1 offspring among all A1 + A2 offspring)

\(p_{t+1} = \frac{P_{t+1}}{P_{t+1}+Q_{t+1}} = \frac{W1P_{t}}{W1P_{t}+W2Q_{t}}\)

Remember that \(p_{t} = P_{t}/(P_{t}+Q_{t})\) and \(q_{t} = Q_{t}/(P_{t}+Q_{t})\).

The above could be expressed as

\(p_{t+1} = \frac{W1p_{t}}{W1p_{t}+W2q_{t}} = \frac{W_{1}}{\bar{W}}p_{t}\)

Because \(\omega_{1} = W_{1}/W_{1}\) and \(\omega_{2} = W_{2}/W_{1}\), with some rearrangement, the above becomes

\(p_{t+1} = \frac{\omega_{1}}{\bar{\omega}}p_{t}\)

Any important message we could learn from the above equation? Does change in allele frequencies depend on absolute or relative fitness?

Let’s put the theories into action. For the following scenario, predict the frequency of A1 allele in generation \(t+1\).

Make sure you try it now! You will do something very similar in your report.

	A1	A2
number of individuals at time t	\(P_{t} = 2\)	\(Q_{t} = 98\)
allele frequencies at time t	\(p_{t} = ?\)	\(q_{t} = ?\)
absolute fitness - offspring produced (per ind)	\(W_{1} = 100\)	\(W_{2} = 50\)
relative fitness (unitless)	\(\omega_{1} = ?\)	\(\omega_{2} = ?\)

Hints:

Fill in the ? in the table; then define them in R
Calculate mean relative fitness
Use the above equation. Then you should get the frequency of A1 in next generation.

Can you do this for \(t+2\) and \(t+3\) generations? How about \(t + 1000\) ?

For loop

When you need to repeat an action, for loop in R is very helpful. Let’s try a simple example.

for (i in 1:15){
  print(i)
}

[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
[1] 11
[1] 12
[1] 13
[1] 14
[1] 15

Here, i is a counter that keep track of the progress of the loop (you can use whatever variable names you want), and 1:15 specifies the range.

for (i in 100:115){
  print(i)
}

[1] 100
[1] 101
[1] 102
[1] 103
[1] 104
[1] 105
[1] 106
[1] 107
[1] 108
[1] 109
[1] 110
[1] 111
[1] 112
[1] 113
[1] 114
[1] 115

mysum = 0
for (i in 100:115){
  mysum = mysum + i
}
mysum

[1] 1720

#compare it the answer of R built-in function sum()
sum(100:115)

[1] 1720

Let’s do it for predicting A1 frequency at generation \(t+20\).

Note that the number used below are DIFFERENT from the above exercise.

p = 0.05 #inital frequency of A1
omega2 = 0.7 #relative fitness of A2

p_next = p/(p*1 + (1-p)*omega2)
p_next

[1] 0.06993007

#do it for the next next generation
p_next_next = p_next/(p_next*1 + (1-p_next)*omega2)

How do we put that into a loop? Which parts remain the same while which parts change?

Also, p_next is used in the next loop, we need some ways to “pass it on.”

Please try! We will discuss how after you try it first!

Following is the changes in allele frequency of A1 for the next 20 generations (you can use this to check whether your for loop is right).

[1] 0.06993007
[1] 0.09699321
[1] 0.1330318
[1] 0.1797947
[1] 0.238474
[1] 0.3090874
[1] 0.3899043
[1] 0.4772558
[1] 0.5660211
[1] 0.6507438
[1] 0.7269069
[1] 0.7917754
[1] 0.844531
[1] 0.8858476
[1] 0.9172598
[1] 0.9406076
[1] 0.9576711
[1] 0.9699887
[1] 0.9788012
[1] 0.9850659

But how do we save the output? Let’s try saving it in a vector.

Lets look at the following two blocks of code. Why one is giving you changes in A1 frequency while the other does not?

p = 0.05
omega2 = 0.7
ngen = 20

v_p = c()  #you can specify an empty vector

for (i in 1:ngen){
  p_next = p/(p*1 + (1-p)*omega2)   
  v_p[i] = p_next
}
plot(v_p, pch = 20, xlab = "generation", ylab = "A1 frequency")

p = 0.05
omega2 = 0.7
ngen = 20

v_p = c()  #you can specify an empty vector

for (i in 1:ngen){
  p_next = p/(p*1 + (1-p)*omega2)   
  v_p[i] = p_next
  p = p_next 
}
plot(v_p, pch = 20, xlab = "generation", ylab = "A1 frequency")

Let’s try doing the same thing, but with different omega2 values.

p = 0.05
omega2 = 0.9
ngen = 20
v_p2 = c()  #you can specify an empty vector

for (i in 1:ngen){
  p_next = p/(p*1 + (1-p)*omega2)   
  v_p2[i] = p_next
  p = p_next 
}
plot(v_p2, pch = 20, xlab = "generation", ylab = "A1 frequency", main = "omega2 = 0.9", ylim = c(0, 1))

p = 0.05
omega2 = 0.2
ngen = 20
v_p3 = c()  #you can specify an empty vector

for (i in 1:ngen){
  p_next = p/(p*1 + (1-p)*omega2)   
  v_p3[i] = p_next
  p = p_next 
}
plot(v_p3, pch = 20, xlab = "generation", ylab = "A1 frequency", main = "omega2 = 0.2", ylim = c(0, 1))

With omega2 = 0.7, 0.9, and 0.2, what are the corresponding selection coefficient(s)?

Do you think selection is stronger or weaker when the omega2 value is smaller?

Is selection acting for or against individuals with A1 allele?

And when the omega2 value is smaller, is the changes in A1 allele frequencies faster or slower?

What’s the usage of knowing the changes in allele frequencies of A1? Let’s assume A1 is an allele conferring antibiotic resistance (or even COVID-vaccine resistance). Then, do you think being able to predict changes in A1 frequency over time is important?

It would be informative if we could compare the trajectories of A1 under different selection regimes (i.e., selection coefficients). Let’s learn how to do that.

Line plots

By changing the options of plot(), we could make it looks line line plots. type = 'l' tells it is a line plot and lwd = 2 specifies how thick the line should be.

plot(v_p, pch = 20, xlab = "generation", ylab = "A1 frequency", type = 'l', lwd = 2)

Or you can use type = "b" if you want both dots and lines. Note that you need the " " around the l or b option.

plot(v_p, pch = 20, xlab = "generation", ylab = "A1 frequency", type = 'b', lwd = 2)

Overlaying points/lines

After you make a plot, you can add points or lines to it by points() or lines(). (You used similar concepts when doing abline in the L-D report).

plot(v_p, pch = 20, xlab = "generation", ylab = "A1 frequency")
points(v_p2, pch = 10, col = "red")
points(v_p3, pch = 5, col = "blue")

plot(v_p, pch = 20, xlab = "generation", ylab = "A1 frequency", type = 'l')
lines(v_p2, col = "red")
lines(v_p3, col = "blue")

plot(v_p, pch = 20, xlab = "generation", ylab = "A1 frequency", type = 'b')
points(v_p2, pch = 10, col = "red", type = 'b')
lines(v_p3, pch = 5, col = "blue", type = 'b')

It is important to note that the plot is made first, and the points and lines are added later. So, if the range of the plot is limited, you may not see your points and lines (remember that issue with your L-D report?). See the following example:

plot(v_p2, pch = 20, xlab = "generation", ylab = "A1 frequency", type = 'b', col = "red")
points(v_p, pch = 10, type = 'b')
lines(v_p3, pch = 5, col = "blue", type = 'b')

There are many ways you could fix this. One way is to make the plot with the vector with the widest range of values first. But the better way is to simply specify the range of the plots.

y_min = min(c(v_p, v_p2, v_p3))
y_max = max(c(v_p, v_p2, v_p3))

plot(v_p2, pch = 20, xlab = "generation", ylab = "A1 frequency", type = 'b', col = "red", ylim = c(y_min, y_max))
points(v_p, pch = 10, type = 'b')
lines(v_p3, pch = 5, col = "blue", type = 'b')

Let’s look at these questions again….

With omega2 = 0.7, 0.9, and 0.2, what are the corresponding selection coefficient (s)?

Do you think selection is stronger or weaker when the omega2 value is smaller?

Is selection acting for or against individuals with A1 allele?

And when the omega2 value is smaller, is the changes in A1 allele frequencies faster or slower?

What’s the usage of knowing the changes in allele frequencies of A1? Let’s assume A1 is an allele conferring antibiotic resistance (or even COVID-vaccine resistance). Then, do you think being able to predict changes in A1 frequency over time is important?

How to write YOUR OWN R functions

In addition to R built in functions, you could also write your own custom R functions. This is easy if you need to do a certain thing repeatedly. For example, calculating allele frequency in the next generation.

Let’s look at what we just did. We need to calculate p_next and we keep typing the whole equation out.

p = 0.05
omega2 = 0.7
ngen = 20

for (i in 1:ngen){
  p_next = p/(p*1 + (1-p)*omega2)   
  print(p_next)
  p = p_next
}

To streamline the process, we could make our own p_next()function using function(YOUR VARIABLE){YOUR CALCULATION}.

p_next_fun = function(p, omega2){
  p_next = p/(p*1 + (1-p)*omega2) 
  return(p_next) #****
}

Let’s take a pause and let you try writing your OWN function for (1) p_next (which we will use in the class) and (2) summing all elements in a vector (so you can compare the performance of your function and R’s built-in function). Try it!

Let’s try using both of functions.

#my own sum function
mysum(1:10)

[1] 55

#R's built in function
sum(1:10)

[1] 55

p_next_fun(0.1, 0.9)

Similar to using built-in R functions, if you don’t specify enough variables, R also feels upset! Try this.

p_next_fun(0.1)

With our own R functions, we could streamline our codes above.

#define a new function for the allele frequency in next generation
p_next_fun = function(p, omega2){
  p_next = p/(p*1 + (1-p)*omega2) 
  return(p_next)
}

#try different oemga2 value
ngen = 20
m = matrix(nrow = 3, ncol = ngen) 
#specifying an empty matrix with 3 rows and ngen columns
#one row will be for one omega2 value

v_omega2 = c(0.7, 0.9, 0.2) 

for(i in 1:3){
  omega2 = v_omega2[i]
  
  p = 0.05 #think about why we need to specify p in the loop?
  for(j in 1:ngen){  #you cannot use i here
    p_next = p_next_fun(p, omega2)
    m[i, j] = p_next
    p = p_next ######
  }
}

plot(1:ngen, m[1,], xlab = "generation", ylab = "allele frequency", pch = 20, ylim = c(0, 1))
points(1:ngen, m[2,], col = "red", pch = 20)
points(1:ngen, m[3,], col = "blue", pch = 20)

Think about why we need to specify p in the loop, instead of like ngen, specifying at the beginning of the code?

Diploid selection

Most the organisms you could actually “see” are diploids. So, it is equally important to understand how selection may influence the changes in allele frequencies. The concept is important in many scenarios, e.g. plant and animal breeding (e.g., how fast could farmers select corn with 110% of the yield?), insecticide resistance (e.g., how quickly will the mosquito population all become insecticide resistant), adaptation (e.g., plants being resistant to salt water!), and many other.

One major difference between diploid and haploid selection model is dominance coefficient h. Let’s look at the number of offspring produced by three genotypes. Which allele is dominant?

A1A1	A1A2	A2A2
100	100	50
100	50	50
100	75	50

The h value for the above three cases are 1 (A1 is dominant), 0 (A2 is dominant), and 0.5 (neither allele is dominant over the other allele). Note that h can take values beyond 0 and 1 (stay tuned).

With \(p\) being the frequency of A1 allele and \(q\) being the frequency of A2 allele (and remember \(p+q = 1\)), we can have a similar table for diploids (remember your Hardy-Weinberg….)

	A1A1	A1A2	A2A2
genotype frequency	\(p^2\)	\(2pq\)	\(q^2\)
absolute fitness	\(W_{11}\)	\(W_{12}\)	\(W_{22}\)
relative fitness	\(\omega_{11}=1\)	\(\omega_{12} = 1-hs\)	\(\omega_{22} = 1-s\)

How do we go from observations to numbers that we could plug into equations? Consider the following example.

	A11	A12	A22
number of individuals	25	950	9025
genotype frequency	?	?	?
rate of survival	100%	100%	80%
number of offspring	100	70	50
absolute fitness	?	?	?
relative fitness	?	?	?

What values are \(h\) and \(s\) ? Make sure you know how to do it as it will appear again in the plant report!

How to predict allele frequencies changes under selection in diploid? Stay tuned! (Week 8)