---
title: "The Hardy-Weinberg Principle"
output:
  html_notebook:
    fig_caption: yes
    toc: yes
    toc_depth: 3
    toc_float: yes
  pdf_document:
    toc: yes
    toc_depth: '3'
  html_document:
    keep_md: TRUE
---

## Nathan Stewart 811847789

# 1. Initiate the Project
Remember to re-run this code every time you re-open this R Notebook.
```{r}
#Set working directory
set_wd <- function() {
  library(rstudioapi)
  current_path <- getActiveDocumentContext()$path 
  setwd(dirname(current_path ))
  print( getwd() )
}
set_wd()

#Load libraries
library(ggplot2)
```

## 1.1. Some Advice Before Getting Started
In this exercise you will conduct a number of simple calculations. R recognizes all basic mathematical operators and functions for addition/subtraction (+, -), multiplication/division (*, /), and exponents/logarithms (^, log10). R can also apply these operations to symbols, as in basic algebra, as long as you first assign numeric values to the symbols. So, you can solve the multiplication of 11 and 13 in two ways:

Specific solution:
```{r}
11*13
```

General solution:
```{r}
x <- 11
y <- 13
x*y
```

The general solution seems more cumbersome at first. However, it lets you write more complex code that processes data: if you want to do the same thing with different data, you don't have to rewrite the code, you just redefine the input variables.
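
As a small illustration of that point, the sketch below wraps the calculation in a function, so new inputs only need to be passed in rather than retyped. The name `multiply_values` is made up for this example and is not part of the assignment.

```{r}
# Hypothetical helper (name chosen for this example): the calculation is written once
multiply_values <- function(x, y) {
  x * y
}

# Reuse the same code with different inputs
multiply_values(11, 13)
multiply_values(0.8, 0.2)
```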

# 2. A Null Model of Evolution: the Hardy-Weinberg Principle
The [Hardy–Weinberg principle](https://www.nature.com/scitable/definition/hardy-weinberg-equilibrium-122/) states that allele and genotype frequencies in a population will remain constant from generation to generation in the absence of evolutionary influences. Evolutionary influences include genetic drift, natural selection, mutation, and migration. In the simplest case of a single locus with two alleles denoted A and a with frequencies f(A) = p and f(a) = q, respectively, the expected genotype frequencies under random mating are f(AA) = p^2 for the AA homozygotes, f(aa) = q^2 for the aa homozygotes, and f(Aa) = 2pq for the heterozygotes (see figure below for the modified Punnett square). Note that p^2 + 2pq + q^2 always equals 1, just as p + q equals 1 at biallelic loci. In the absence of selection, mutation, genetic drift, or other evolutionary forces, allele frequencies p and q are constant between generations, so equilibrium is reached.

![](HW_frequency.png)
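
A minimal sketch of why the frequencies stay constant, assuming random mating and no other evolutionary forces: start from arbitrary genotype frequencies, compute the allele frequency, form the next generation's genotypes, and recompute the allele frequency. The starting values and all variable names (`start_AA`, `p_start`, and so on) are invented for illustration.

```{r}
# Arbitrary starting genotype frequencies (made-up values that sum to 1)
start_AA <- 0.50
start_Aa <- 0.20
start_aa <- 0.30

# Allele frequency of A: all alleles of AA plus half the alleles of Aa
p_start <- start_AA + start_Aa / 2
q_start <- 1 - p_start

# Genotype frequencies after one generation of random mating
next_AA <- p_start^2
next_Aa <- 2 * p_start * q_start
next_aa <- q_start^2

# Allele frequency of A in the next generation
p_next <- next_AA + next_Aa / 2

c(p_start = p_start, p_next = p_next)
```

The genotype frequencies change in this first generation, but the allele frequency does not, because p^2 + pq = p(p + q) = p. Further rounds of random mating would therefore return the same genotype frequencies.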

The power of the principle is that it allows for a simple test of whether evolutionary influences are acting on a specific locus. We can go out into natural populations, measure genotype frequencies, and infer allele frequencies from them. In the absence of any evolutionary forces, the measured genotype frequencies should equal the idealized genotype frequencies predicted by the Hardy-Weinberg principle. If the measured and idealized genotype frequencies do not match, the population is in Hardy-Weinberg disequilibrium and some evolutionary force must be acting on that particular locus, causing some genotypes to be over- or underrepresented.

![](HW_principle.jpg)

# 3. Calculating Idealized Genotype Frequencies
Before we apply the Hardy-Weinberg principle in practice, we should think more clearly about the theoretical predictions. For any given allele frequency of p at a biallelic locus, there is a clear mathematical prediction of what the genotype frequencies should be. Here, you will explore the relationship between p and the frequency of all genotypes. 

## 3.1. Basic Calculations
Imagine a locus with two alleles, A (with frequency p) and a (with frequency q). In this section you will calculate the frequency of genotypes for values of p between 0 and 1. 

```{r}
#Make a table with the allele frequency of A (p) between 0 and 1 (in 0.01-step increments) using the seq command
data <- as.data.frame(seq(0, 1, by = 0.01))

#Add column name
colnames(data) <- "p"

#Calculate alternate allele frequency (q)
data$q <- 1-data$p

#Calculate the theoretical frequencies for genotypes AA, Aa, and aa
data$AA <- data$p^2
data$Aa <- 2*data$p*data$q
data$aa <- data$q^2

#Calculate the total genotype frequency (sum of all genotype frequencies)
data$total <- (data$p^2)+(2*data$p*data$q)+(data$q^2)

#write results into a table
#Show table
data
```

## 3.2. Visualizing Genotype Frequencies
Just to practice our visualization skills, let's use ggplot to visualize these data using a line graph (geom_line) with p on the x axis and the genotype frequencies on the y axis. You can plot the values for AA, Aa, aa, and the sum of all three into a single plot.

```{r}
#Visualize the expected genotype frequencies as a function of p
#Note that you can add additional elements (lines) by just adding additional geoms, with the aesthetics specified within the parentheses
#Line colors: AA = black, Aa = blue, aa = red, total = purple
ggplot(data, aes(x=p, y=AA)) +
  geom_line() +
  geom_line(aes(x=p, y=Aa), color="blue") +
  geom_line(aes(x=p, y=aa), color="red") +
  geom_line(aes(x=p, y=total), color="purple") +
  xlab("p (allele frequency)") +
  ylab("genotype frequency") +
  theme_classic()
```
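
One feature of the plot can also be read off numerically: the heterozygote curve peaks where both alleles are equally common. The sketch below rebuilds the grid of p values so that it runs on its own; `p_grid`, `het_freq`, and `peak_index` are names chosen for this example.

```{r}
# Rebuild the grid of allele frequencies (same 0.01 steps as above)
p_grid <- seq(0, 1, by = 0.01)

# Expected heterozygote frequency 2pq at each value of p
het_freq <- 2 * p_grid * (1 - p_grid)

# Position of the maximum and the corresponding values
peak_index <- which.max(het_freq)
c(p = p_grid[peak_index], Aa = het_freq[peak_index])
```

Under Hardy-Weinberg proportions, heterozygotes therefore never make up more than half of the population at a biallelic locus.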

In your own words, can you explain the Hardy-Weinberg principle and why the sum of all genotypes always equals 1?

*The Hardy-Weinberg principle states that allele and genotype frequencies in a population stay constant from generation to generation unless an evolutionary force disturbs them. Mutation, for example, disrupts the equilibrium by introducing new alleles. Other factors that can disrupt the equilibrium are natural selection, genetic drift, and migration. The genotype frequencies always sum to 1 because AA, Aa, and aa are the only possible genotypes at a biallelic locus, so together they account for the whole population.*
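
The algebra behind the second part of that answer: because p + q = 1, squaring both sides gives (p + q)^2 = p^2 + 2pq + q^2 = 1. The short check below confirms this numerically over a grid of p values; `p_values`, `q_values`, and `genotype_sum` are names invented for this sketch.

```{r}
# Grid of allele frequencies and the complementary frequencies
p_values <- seq(0, 1, by = 0.01)
q_values <- 1 - p_values

# Sum of the three genotype frequencies at each value of p
genotype_sum <- p_values^2 + 2 * p_values * q_values + q_values^2

# TRUE if every sum equals 1 (up to floating-point rounding)
isTRUE(all.equal(genotype_sum, rep(1, length(p_values))))
```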

# 4. Testing for HW Equilibrium
## 4.1. Variation in Eye Color
Fruit flies of the genus *Drosophila* are a workhorse for genetic and evolutionary studies. A frequently studied trait is the eye color polymorphism that some populations exhibit. Wild-type flies have a bright red eye coloration. A mutation in a single gene controlling the expression of eye color, however, can turn the eyes white. The allele causing the white-eye phenotype (we call it e) is recessive; the wild-type allele (we call it E) is therefore dominant.

![](eye_color.jpg)

You collect 1000 flies in a natural population and bring your samples back to the laboratory for genotyping. You find that 720 flies are homozygous for the E allele, 120 flies are homozygous for the e allele, and 160 are heterozygous.

What are the relative frequencies (f) of each genotype in the population?

```{r}
#Calculate observed genotype frequencies (genotype count divided by the 1000 flies sampled)
f_EE = 720/1000
f_Ee = 160/1000
f_ee = 120/1000

#Write results into a table
#Make a list of possible genotypes
genotype <- c("EE", "Ee", "ee")

#Make a list of observed genotype frequencies
f_observed <- c(f_EE, f_Ee, f_ee)

#Merge the two lists into a table
results1 <- data.frame(genotype, f_observed)
results1
```

Based on the measured genotype frequencies, you can calculate allele frequencies for E and e (hint: flies are diploid):

```{r}
#Calculate allele frequency for E (p) and e (q) based on observed genotype frequencies
p = ((2*720)+160)/2000
q = ((2*120)+160)/2000
```
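
The same calculation can be written in the general form recommended in section 1.1. This is only a sketch: `allele_freqs()` and its argument names are hypothetical, not part of the assignment, and it assumes a diploid, biallelic locus.

```{r}
# Hypothetical helper: allele frequencies from genotype counts at a biallelic locus
allele_freqs <- function(n_hom1, n_het, n_hom2) {
  # Each diploid individual carries two alleles
  n_alleles <- 2 * (n_hom1 + n_het + n_hom2)
  freq1 <- (2 * n_hom1 + n_het) / n_alleles
  freq2 <- (2 * n_hom2 + n_het) / n_alleles
  c(freq1 = freq1, freq2 = freq2)
}

# Eye color counts from above: 720 EE, 160 Ee, 120 ee
allele_freqs(720, 160, 120)
```

Passing the counts as arguments means the function can be reused for any other locus without editing the arithmetic.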

Based on the allele frequencies, we can apply the HW principle and calculate idealized genotype frequencies:

```{r}
#Calculate idealized (i.e., theoretically predicted) genotype frequencies based on allele frequencies
f_EEi = p^2
f_Eei = 2*p*q
f_eei = q^2

#Make a list of idealized genotype frequencies
f_idealized <- c(f_EEi, f_Eei, f_eei)

#Add list to the results table
results2 <- cbind(results1, f_idealized)
results2
```

What do you observe? Is the population in HW equilibrium? If it is not, what could explain the difference between the observed and the predicted genotype frequencies?

*The population is not in equilibrium: the observed genotype frequencies (0.72 EE, 0.16 Ee, 0.12 ee) differ from the idealized ones (0.64, 0.32, 0.04). Heterozygotes are underrepresented and both homozygotes are overrepresented. This could be due to many factors, such as natural selection, genetic drift, mutation, or migration. It is also possible that the flies collected do not represent the actual population.*
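
The comparison can be made more formal with a chi-square goodness-of-fit test, which is an addition to the exercise rather than part of it. The sketch below assumes the counts given above and uses one degree of freedom (three genotype classes, minus one, minus one for the allele frequency estimated from the data). Names such as `eye_counts` and `chi_sq` are chosen for illustration.

```{r}
# Observed genotype counts (EE, Ee, ee) from the sample of 1000 flies
eye_counts <- c(EE = 720, Ee = 160, ee = 120)
n_flies <- sum(eye_counts)

# Allele frequency of E estimated from the counts
p_eye <- unname((2 * eye_counts["EE"] + eye_counts["Ee"]) / (2 * n_flies))
q_eye <- 1 - p_eye

# Expected counts under Hardy-Weinberg proportions
eye_expected <- n_flies * c(EE = p_eye^2, Ee = 2 * p_eye * q_eye, ee = q_eye^2)

# Chi-square statistic and p-value with 1 degree of freedom
chi_sq <- sum((eye_counts - eye_expected)^2 / eye_expected)
p_value <- pchisq(chi_sq, df = 1, lower.tail = FALSE)

c(chi_sq = chi_sq, p_value = p_value)
```

A large statistic with a very small p-value would suggest that the deviation is unlikely to be explained by sampling error alone.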

## 4.2. Variation in Wing Morphology
Using the same flies, you also quantify genotype frequencies at the bithorax locus, which has two alleles, G and g, and causes abnormal development of the fly's halteres. You find that 10 flies are homozygous for the G allele, 810 flies are homozygous for the g allele, and 180 are heterozygous.

![](wing_morphology.png)

```{r}
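#Calculate observed genotype frequencies (genotype count divided by the 1000 flies sampled)
f_GG = 10/1000
f_Gg = 180/1000
f_gg = 810/1000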

#Write results into a table
genotype <- c("GG", "Gg", "gg")
f_observed <- c(f_GG, f_Gg, f_gg)
results3 <- data.frame(genotype, f_observed)
results3

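#Calculate allele frequency for G (p) and g (q) based on observed genotype counts
p = (2*10+180)/2000
q = (2*810+180)/2000

#Calculate idealized genotype frequencies based on allele frequencies
f_GGi = p^2
f_Ggi = 2*p*q
f_ggi = q^2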

f_idealized <- c(f_GGi, f_Ggi, f_ggi)
results4 <- cbind(results3, f_idealized)
results4
```

Is the population in HW equilibrium? If it is not, what could explain the difference between the observed and the predicted genotype frequencies?

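*The population appears to be in equilibrium: the observed genotype frequencies (0.01 GG, 0.18 Gg, 0.81 gg) match the idealized frequencies calculated from p = 0.1 and q = 0.9.*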
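
Because both loci go through the same steps, the whole comparison can be wrapped in one function. This is a sketch under the same assumptions as the calculations above (a diploid, biallelic locus); `hw_compare()` and its arguments are hypothetical names, not something provided with the exercise.

```{r}
# Hypothetical helper: observed vs. HW-expected genotype frequencies from counts
# counts: numeric vector in the order (homozygote 1, heterozygote, homozygote 2)
# labels: genotype names in the same order
hw_compare <- function(counts, labels) {
  n <- sum(counts)
  freq1 <- (2 * counts[1] + counts[2]) / (2 * n)  # frequency of the first allele
  freq2 <- 1 - freq1
  data.frame(
    genotype = labels,
    f_observed = counts / n,
    f_idealized = c(freq1^2, 2 * freq1 * freq2, freq2^2)
  )
}

# Both loci from this section
hw_compare(c(720, 160, 120), c("EE", "Ee", "ee"))
hw_compare(c(10, 180, 810), c("GG", "Gg", "gg"))
```

Returning a data frame keeps the output in the same shape as the results tables built by hand above, so the two loci can be compared side by side.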

# 5. Resources
## 5.1. Data References
This exercise does not contain original data.

## 5.2. Resources You Consulted
Consulting additional resources to solve this assignment is absolutely allowed, but failure to disclose those resources is plagiarism. Please list any collaborators you worked with and resources you used below or state that you have not used any.

*I used the screenshots and other questions on the discussion board to troubleshoot my equations.*
