Population Genetics

M. Drew LaMar
March 30, 2022

Evolution

Definition: Evolution is the process of change in the gene pool.

Population Genetics and Evolution

Question: What are the major forces of evolution?

Answer: There are four major forces:

  • Natural selection
  • Drift
  • Gene flow
  • Mutation

Population Genetics and Evolution

Our goal is to develop models of individual forces and explore them to develop expectations of evolutionary behavior.

Let's follow the ODD protocol, even though it was designed for agent-based models.

Purpose: To understand the effects of natural selection in the absence of all other evolutionary forces.

Natural Selection - ODD

Entities, State Variables and Scales

Entities

Definition: A locus is a chromosomal or genomic location of a gene.

Definition: An allele is an alternate form of a gene at a given locus.

Definition: Gene frequency is the relative proportion of an allele.

Definition: The gene pool is the set of all alleles in a population at all loci.

Entities, State Variables and Scales

Assumptions and Simplifications

Discuss: What assumptions do we need to make to remove the forces of drift, gene flow, and mutation?

Answer:

  • Infinite population (Removes drift)
  • Closed system (Removes gene flow)
  • Fixed number of alleles (Removes mutation)

Entities, State Variables and Scales

Assumptions and Simplifications

Discuss: In terms of the entities (loci and alleles), what simplifications would be helpful at this stage?

Answer: First focus on one loci and two alleles.

Discuss: Before adding selection, do we have a model of evolution to start with?

Answer: Yes! We can start from Hardy-Weinberg Equilibrium.

Entities, State Variables and Scales

Assumptions and Simplifications

Model: Hardy-Weinberg Equilibrium states that allele and genotype frequencies in a population will remain constant from generation to generation in the absence of other evolutionary influences.

State variables

Suppose our population is diploid.

Let \( A_{1} \) and \( A_{2} \) denote the two alleles at our focus locus.

Let \( p \) and \( q \) denote the proportion of \( A_{1} \) and \( A_{2} \), respectively (Note: \( p + q = 1 \))

Entities, State Variables and Scales

Assumptions and Simplifications

State variables - Time

Discuss: What is our unit for time? Is it discrete or continuous?

Answer: Time is measured in generations, which is discrete.

Entities, State Variables and Scales

Under the assumption of random mating, what are the genotype frequencies for \( A_{1}A_{1} \), \( A_{1}A_{2} \), and \( A_{2}A_{2} \)?

\( A_{1}A_{1} \) \( A_{1}A_{2} \) \( A_{2}A_{2} \)
\( p^2 \) \( 2pq \) \( q^2 \)

This follows since the number of \( A_{1} \) alleles in a genotype is a binomial random variable.

Entities, State Variables and Scales

For a variable to be a binomial random variable, ALL of the following conditions must be met:

  • There are a fixed number of trials \( N \) (N=2 chromosomes)
  • On each trial, there are two possible outcomes - “success” or “failure” (\( A_{1} \) or \( \ A_{2} \))
  • The probability of each outcome is the same on each trial (\( p \) and \( q=1-p \))
  • Trials are independent of one another (random mating)

The probability of having \( k \) successes in \( N \) trials is

\[ \textrm{Prob[k successes in N trials]} = \left(\begin{array}{c}N \\ k\end{array}\right)p^{k}q^{N-k} \]

Entities, State Variables and Scales

 
Prob[\( A_{1}A_{1} \)] Prob[k = 2] \( \left(\begin{array}{c}2 \\ 2\end{array}\right)p^{2}q^{2-2} = p^2 \)
Prob[\( A_{1}A_{2} \)] Prob[k = 1] \( \left(\begin{array}{c}2 \\ 1\end{array}\right)p^{1}q^{2-1} = 2pq \)
Prob[\( A_{2}A_{2} \)] Prob[k = 0] \( \left(\begin{array}{c}2 \\ 0\end{array}\right)p^{0}q^{2-0} = q^2 \)

Question: How do allele frequencies change over generations?

\[ p_{0} \rightarrow p_{1} \rightarrow p_{2} \rightarrow \cdots \]

\[ p_{t+1} = F(p_{t}), t \geq 0 \]

Note: Subscript corresponds to generation number.

Entities, State Variables and Scales

Hardy-Weinberg

\[ p_{0} \rightarrow p_{0} \rightarrow p_{0} \rightarrow \cdots \]

\[ p_{t+1} = p_{t} \]

How to construct \( F \)? To introduce selection, we need to model fitness, which for a diploid population occurs on the genotype and not on the allele. We will model relative fitness - i.e., we only care about the fitness of each genotype relative to the other genotypes.

Entities, State Variables and Scales

Selection

Genotype \( A_{1}A_{1} \) \( A_{1}A_{2} \) \( A_{2}A_{2} \)
Fitness \( w_{11} \) \( w_{12} \) \( w_{22} \)
Frequency \( p^2 \) \( 2pq \) \( q^2 \)

Note: Rather than model the change of allele \( p \), we will model the change of allele \( q \), as we are interpreting this model in the case of mutation, where \( A_{2} \) is the new mutant allele. This just changes perspective, not the resulting biological interpretation.

Entities, State Variables and Scales

Selection

For model derivation purposes, it helps to imagine a finite population of individuals \( N_{t} \).

Genotype \( A_{1}A_{1} \) \( A_{1}A_{2} \) \( A_{2}A_{2} \)
Fitness \( w_{11} \) \( w_{12} \) \( w_{22} \)
Frequency\( _{t} \) \( p_{t}^2 \) \( 2p_{t}q_{t} \) \( q_{t}^2 \)
Count\( _{t} \) \( p_{t}^2N_{t} \) \( 2p_{t}q_{t}N_{t} \) \( q_{t}^2N_{t} \)

Discuss: We have an implicit assumption here that we haven’t mentioned yet. Can you spot it?

Answer: Fitness is independent of time (parameters).

Entities, State Variables and Scales

Selection

Genotype \( A_{1}A_{1} \) \( A_{1}A_{2} \) \( A_{2}A_{2} \)
Fitness \( w_{11} \) \( w_{12} \) \( w_{22} \)
Frequency\( _{t} \) \( p_{t}^2 \) \( 2p_{t}q_{t} \) \( q_{t}^2 \)
Count\( _{t} \) \( p_{t}^2N_{t} \) \( 2p_{t}q_{t}N_{t} \) \( q_{t}^2N_{t} \)

Discuss: Thinking of the weights \( w_{ij} \) as multiplicative growth factors, what would Count\( _{t+1} \) be?

Entities, State Variables and Scales

Selection

Genotype \( A_{1}A_{1} \) \( A_{1}A_{2} \) \( A_{2}A_{2} \)
Fitness \( w_{11} \) \( w_{12} \) \( w_{22} \)
Frequency\( _{t} \) \( p_{t}^2 \) \( 2p_{t}q_{t} \) \( q_{t}^2 \)
Count\( _{t} \) \( p_{t}^2N_{t} \) \( 2p_{t}q_{t}N_{t} \) \( q_{t}^2N_{t} \)
Count\( _{t+1} \) \( p_{t}^2N_{t}w_{11} \) \( 2p_{t}q_{t}N_{t}w_{12} \) \( q_{t}^2N_{t}w_{22} \)

Discuss: How many \( A_{2} \) alleles are there in the \( t+1 \) generation? How many total alleles are there in the \( t+1 \) generation?

Entities, State Variables and Scales

Selection

Genotype \( A_{1}A_{1} \) \( A_{1}A_{2} \) \( A_{2}A_{2} \)
Fitness \( w_{11} \) \( w_{12} \) \( w_{22} \)
Frequency\( _{t} \) \( p_{t}^2 \) \( 2p_{t}q_{t} \) \( q_{t}^2 \)
Count\( _{t} \) \( p_{t}^2N_{t} \) \( 2p_{t}q_{t}N_{t} \) \( q_{t}^2N_{t} \)
Count\( _{t+1} \) \( p_{t}^2N_{t}w_{11} \) \( 2p_{t}q_{t}N_{t}w_{12} \) \( q_{t}^2N_{t}w_{22} \)
Answer: \[ \begin{align} N_{2,t+1} & = 2p_{t}q_{t}N_{t}w_{12} + 2q_{t}^2N_{t}w_{22} \\ N_{t+1} & = 2\bigl(p_{t}^2N_{t}w_{11} + 2p_{t}q_{t}N_{t}w_{12} + q_{t}^2N_{t}w_{22}\bigr) \end{align} \]

Discuss: So what is the frequency of allele \( A_{2} \) in generation \( t+1 \)?

Entities, State Variables and Scales

\[ \begin{align} N_{2,t+1} & = 2p_{t}q_{t}N_{t}w_{12} + 2q_{t}^2N_{t}w_{22} \\ N_{t+1} & = 2\bigl(p_{t}^2N_{t}w_{11} + 2p_{t}q_{t}N_{t}w_{12} + q_{t}^2N_{t}w_{22}\bigr) \end{align} \]

Answer: \[ \begin{align} q_{t+1} & = \frac{N_{2,t+1}}{N_{t+1}} \\ & = \frac{2p_{t}q_{t}N_{t}w_{12} + 2q_{t}^2N_{t}w_{22}}{2\bigl(p_{t}^2N_{t}w_{11} + 2p_{t}q_{t}N_{t}w_{12} + q_{t}^2N_{t}w_{22}\bigr)} \\ & = \frac{p_{t}q_{t}w_{12} + q_{t}^2w_{22}}{p_{t}^2w_{11} + 2p_{t}q_{t}w_{12} + q_{t}^2w_{22}} \end{align} \]

Entities, State Variables and Scales

Replace \( p_{t} = 1-q_{t} \) to arrive at

\[ q_{t+1} = \frac{(1-q_{t})q_{t}w_{12} + q_{t}^2w_{22}}{(1-q_{t})^2w_{11} + 2(1-q_{t})q_{t}w_{12} + q_{t}^2w_{22}}. \]

Since we are concerned with relative fitness, we have

Genotype \( A_{1}A_{1} \) \( A_{1}A_{2} \) \( A_{2}A_{2} \)
Fitness \( 1 \) \( 1+f(s,t) \) \( 1+g(s,t) \)

Model Design

We consider 5 common types of selection:

  • Codominance (genic selection):
Genotype \( A_{1}A_{1} \) \( A_{1}A_{2} \) \( A_{2}A_{2} \)
Fitness \( 1 \) \( 1+s \) \( 1+2s \)
  • Complete dominance (\( A_{2} \) is dominant):
Genotype \( A_{1}A_{1} \) \( A_{1}A_{2} \) \( A_{2}A_{2} \)
Fitness \( 1 \) \( 1+s \) \( 1+s \)

Model Design

  • Complete recessiveness (\( A_{2} \) is recessive):
Genotype \( A_{1}A_{1} \) \( A_{1}A_{2} \) \( A_{2}A_{2} \)
Fitness \( 1 \) \( 1 \) \( 1+s \)
  • Overdominance (\( s > 0 \), \( s > t \), heterozygote highest fitness):
Genotype \( A_{1}A_{1} \) \( A_{1}A_{2} \) \( A_{2}A_{2} \)
Fitness \( 1 \) \( 1+s \) \( 1+t \)
  • Underdominance (\( s < 0 \), \( s < t \), heterozygote lowest fitness):
Genotype \( A_{1}A_{1} \) \( A_{1}A_{2} \) \( A_{2}A_{2} \)
Fitness \( 1 \) \( 1+s \) \( 1+t \)

Example: Codominance

Codominance (genic selection):

Genotype \( A_{1}A_{1} \) \( A_{1}A_{2} \) \( A_{2}A_{2} \)
Fitness \( 1 \) \( 1+s \) \( 1+2s \)

\[ \begin{align} q_{t+1} & = \frac{(1-q_{t})q_{t}(1+s) + q_{t}^2(1+2s)}{(1-q_{t})^2 + 2(1-q_{t})q_{t}(1+s) + q_{t}^2(1+2s)} \\ & = F(q_{t},s) \end{align} \]

Example: Codominance

Simulate: \( s = 0.01 \), \( q_{0} = 0.04 \) plot of chunk unnamed-chunk-1

Example: Codominance

Simulate: \( s = 0.01 \), \( q_{0} = 0.04 \)

N <- 3000
s <- 0.01
q <- rep(0.04,N)
for (i in 1:(N-1)) {
  qi <- q[i] # For readability
  q[i+1] <- ((1-qi)*qi*(1+s) + qi*qi*(1+2*s))/((1-qi)*(1-qi) + 2*(1-qi)*qi*(1+s) + qi*qi*(1+2*s))
}
plot(1:N,q)

Directional selection

Codominance (left); Dominant (center); Recessive (right) plot of chunk unnamed-chunk-3

Recessive deleterious alleles

What happens when \( A_{2} \) is recessive and deleterious? Suppose \( s = -0.01 \).

plot of chunk unnamed-chunk-4

The proportion of \( A_{2} \) in the population after 3000 generations is 0.0184496!!! That's only a drop of 54%!

Balancing selection

Definition [Wikipedia]: Balancing selection refers to a number of selective processes by which multiple alleles (different versions of a gene) are actively maintained in the gene pool of a population at frequencies longer than expected from genetic drift alone.

Overdominance (\( s > 0 \), \( s > t \), heterozygote highest fitness):

Genotype \( A_{1}A_{1} \) \( A_{1}A_{2} \) \( A_{2}A_{2} \)
Fitness \( 1 \) \( 1+s \) \( 1+t \)

Overdominance

Simulate: \( s = 0.04 \), \( t = 0.02 \)

plot of chunk unnamed-chunk-5

Underdominance

Underdominance (\( s < 0 \), \( s < t \), heterozygote lowest fitness):

Genotype \( A_{1}A_{1} \) \( A_{1}A_{2} \) \( A_{2}A_{2} \)
Fitness \( 1 \) \( 1+s \) \( 1+t \)

Simulate: \( s = -0.02 \), \( t = -0.01 \)

plot of chunk unnamed-chunk-6

Cobweb Plots

Symbolic Math

Question: How can we find fixed points algebraically and determine their stability?

Answer: We are going to use symbolic math! Symbolic math is a computer algebra system that will rigorously solve algebraic equations (and perform calculus) using a computer.

We will be using SymPy, a Python library for doing symbolic mathematics. In reality, we can use the package in R by using the rSymPy package.

Symbolic Math: Set up variables

library(rSymPy)

q <- Var("q")
w11 <- Var("w11")
w12 <- Var("w12")
w22 <- Var("w22")
s <- Var("s")
t <- Var("t")

Symbolic Math: Co-Dominance

\[ \begin{align} q_{t+1} & = \frac{(1-q_{t})q_{t}(1+s) + q_{t}^2(1+2s)}{(1-q_{t})^2 + 2(1-q_{t})q_{t}(1+s) + q_{t}^2(1+2s)} \\ & = F(q_{t},s) \end{align} \]

sympy("F = ((1-q)*q*w12 + w22*q**2)/(w11*(1-q)**2 + 2*(1-q)*q*w12 + w22*q**2)")

Substitute weights and simplify:

(CD <- sympy("CD = simplify(F.subs([(w11, 1), (w12, 1+s), (w22, 1+2*s)]))"))
[1] "(q + q*s + s*q**2)/(1 + 2*q*s)"

Symbolic Math: Co-Dominance

[1] "(q + q*s + s*q**2)/(1 + 2*q*s)"

\[ F(q_{t},s) = \frac{q_{t}(1 + s) + sq_{t}^2}{1+2sq_{t}} \]

Where are the fixed points? \( F(q_{t},s) = q_{t} \Rightarrow F(q_{t},s) - q_{t} = 0 \)

sympy("solve(CD - q, q)")  # Solve CD - q = 0 for q
[1] "[1, 0]"

Symbolic Math: Co-Dominance

Let's determine stability by finding the derivative \( \frac{dF}{dq_{t}}(q_{t},s) \)

sympy("dF = simplify(diff(CD, q))")
[1] "(1 + s + 2*q*s + 2*q**2*s**2)/(1 + 4*q*s + 4*q**2*s**2)"

\[ \frac{dF}{dq_{t}}(q_{t},s) = \frac{1 + s + 2sq + 2s^2q^2}{1 + 4sq + 4s^2q^2} \]

Symbolic Math: Co-Dominance

Now evaluate at fixed points:

sympy("simplify(dF.subs(q, 0))")  # Substitute q = 0 and simplify
[1] "1 + s"
sympy("simplify(dF.subs(q, 1))")  # Substitute q = 1 and simplify
[1] "(1 + s)/(1 + 2*s)"

Symbolic Math: Co-Dominance

Now evaluate at fixed points:

\[ \begin{align} \frac{dF}{dq_{t}}(0,s) & = 1+s \\ \frac{dF}{dq_{t}}(1,s) & = \frac{1+s}{1+2s} \end{align} \]

This, if \( s > 0 \) (advantageous), then 0 is unstable and 1 is stable, whereas if \( -1 < s < 0 \) (deleterious), then 0 is stable and 1 is unstable.

Exercise: Perform the same analysis yourself with the remaining 4 types of selection.