Lab 1 Basics

QUESTION 1: In roughly 2–3 sentences (maximum), what excites you about Population Genetics and why did you take my course?

Populations are the basic unit of conservation and being able to interpret data or research seems to always be the first step. The class counts as an elective for the Wildlife, Fisheries,and Conservation major and to learn more about genetics.

#### Section 1 ####

1+1 #Notice how R produces a single object as an answer [1]. Also notice how it populates the answer inside that vector object; the answer is 2

## [1] 2

2+2

## [1] 4

#Creating objects to represent the genome of a human (_Homo sapiens_)
#notice your environment population
Diploid <-2         #we are indeed diploid organisms
Human_genome <-23   #we have 23 chromosomes

#How many chromosomes does a zygote have?
Zygote<- Diploid*Human_genome 
Zygote #this should give me the answer for 2 x 23

## [1] 46

#Considering Mendel's law of Segregation and Independent Assortment only,
#how many unique gametes can a human produce? The equation for this question is: diploidy raised to the power of the number of chromosome pairs, for humans: 2^23
unique1 <- Diploid^Human_genome
unique1

## [1] 8388608

?format() #this is a reminder to check out the help files

## starting httpd help server ... done

format(unique1, big.mark=",") #explain what this does

## [1] "8,388,608"

#How do humans compare to the Indian Muntjac? #how many unique gametes can an Indian Muntjac produce? (NOTE: This is the smallest genome of any mammal on earth)
IM_genome <-3
unique2 <- Diploid^IM_genome
unique2

## [1] 8

#How about the Red Vizcacha Rat? #how many unique gametes can a Red Vizcacha Rat produce? (NOTE: This is the largest genome of any mammal on earth)
RVR_genome <-51
unique3 <- Diploid^RVR_genome
unique3 #I'm not smart enough to understand this answer. This is an answer for computers, not humans. Lets fix it

## [1] 2.2518e+15

format(unique3, big.mark=",", scientific = F) #That's better, although still difficult to comprehend

## [1] "2,251,799,813,685,248"

QUESTION 2: What (if anything) scares you or excites you the most about learning R and RStudio within the framework of population genetics?

The warning about the labs being time intensive before the semester started. Getting stuck and not being able to find answers in the help function or other sources. Errors suck.

QUESTION 3: Can you think of any other benefits of commenting code? Any cons?

Commenting code can help you record estimates or inferences that you wouldn’t want included with the final publication. Even a comment just to remember to check a fact or figure later for accuracy.

QUESTION 4: Ultimately, you will comment your code in a way that makes sense to you. Or you won’t, and you’ll be disappointed later. Observe the commenting examples below. Why do you think I am using comments in different ways?

#### Section heading ####

#This general comment could document the following related lines
1+1   #this comment could describe the specific line

## [1] 2

2+2   #this comment could describe the specific line

## [1] 4

#This general comment could document the following related lines
A<-1  #this comment could describe the specific line
A+A   #this comment could describe the specific line

## [1] 2

B<-2  #this comment could describe the specific line
B+B   #this comment could describe the specific line

## [1] 4

#### Section 2 Plots ####
#Why am I doing this 4-hashtag commenting? Have you figured it out?
#### Makes the r script condense at that section title. Good space saver####

?c()
#combine. Can return a vector, create columns, and concatenate or combine multiple vectors.
genomes<-c(Human_genome, IM_genome, RVR_genome)
#genomes get combined variables of human genome, IM genome, and RVR genome.
uniques<-c(unique1, unique2, unique3)
#uniques get combined variables of unique 1, unique 2, and unique 3
uniques2<-c(log10(unique1), log10(unique2), log10(unique3))
#uniques2 gets combined log10 values of unique 1, 2, and 3.
?data.frame()
#creates a collection of variables like a matrix or list
df<-data.frame(genomes, uniques, uniques2)
#df now reresents that data frame with genomics, uniques and uniques2
df

##   genomes      uniques uniques2
## 1      23 8.388608e+06  6.92369
## 2       3 8.000000e+00  0.90309
## 3      51 2.251800e+15 15.35253

#If you don't understand what I just did, read the help files, ask a fellow student, then ask me

#
plot(df$genomes, df$uniques,
     #plot the data frame of genomes and the data frame of uniques
     xlab = "Haploid genome size", ylab = "Unique gamete combinations")
#customizes the x axis with the title of "Haploid genome size: and the y axis with unique gamete combinations
#
abline(lm(df$uniques~df$genomes), col="blue")

#Adds 1 or more straight lines through the plot of a linear regression model with a data set of uniques and genomes
#
plot(df$genomes, df$uniques2,
#plot the data frame of genomes and uniques2
     xlab = "Haploid genome size", ylab = "log10(Unique gamete combinations)")
abline(lm(df$uniques2~df$genomes), col="blue")

# abline(lm(df$uniques2~df$genomes), col="blue")c. Just figured out that Ctrl/Shift/C makes anything into a comment. Good for faulty code.

#### Section 3 Allele Frequencies ####

#Manually creating an allele frequency of allele "A" 20% and allele "a" 80%
allele <- c("A","A","a","a","a","a","a","a","a","a")
allele

##  [1] "A" "A" "a" "a" "a" "a" "a" "a" "a" "a"

#Manually creating an allele frequency of allele "A" 20% and allele "a" 80%
allele <- c("A","A","a","a","a","a","a","a","a","a")
#allele gets a combined ("A","A","a","a","a","a","a","a","a","a")
allele

##  [1] "A" "A" "a" "a" "a" "a" "a" "a" "a" "a"

allele2 <- c(rep("a",8), rep("A",2))
#allele 2 gets a combined repeat of variable "A" 2 times.
allele2

##  [1] "a" "a" "a" "a" "a" "a" "a" "a" "A" "A"

#annotate this code to the best of your ability, line by line. Use ? help files when needed
popsize <-100
#population size gets 100
pop <- matrix(nrow=popsize, ncol=2)
#population gets matrix for the number of rows equal to population size, number of columns is 2.
head(pop)

##      [,1] [,2]
## [1,]   NA   NA
## [2,]   NA   NA
## [3,]   NA   NA
## [4,]   NA   NA
## [5,]   NA   NA
## [6,]   NA   NA

#retrieve the head or first part of the value in parenthesis.
tail(pop)

##        [,1] [,2]
##  [95,]   NA   NA
##  [96,]   NA   NA
##  [97,]   NA   NA
##  [98,]   NA   NA
##  [99,]   NA   NA
## [100,]   NA   NA

#retrieve the tail or last part of the value in parenthesis.

colnames(pop) <-c("allele1", "allele2")
#column names of population get a combination of allele 1 and allele 2
head(pop)

##      allele1 allele2
## [1,]      NA      NA
## [2,]      NA      NA
## [3,]      NA      NA
## [4,]      NA      NA
## [5,]      NA      NA
## [6,]      NA      NA

#retrieve the first part of the population data set.

rownames(pop)<-paste("ind", 1:popsize,sep="")
#rownames for the population get pasted from individual with a population size of 1., and separate the columns into multiples
rownames(pop)

##   [1] "ind1"   "ind2"   "ind3"   "ind4"   "ind5"   "ind6"   "ind7"   "ind8"  
##   [9] "ind9"   "ind10"  "ind11"  "ind12"  "ind13"  "ind14"  "ind15"  "ind16" 
##  [17] "ind17"  "ind18"  "ind19"  "ind20"  "ind21"  "ind22"  "ind23"  "ind24" 
##  [25] "ind25"  "ind26"  "ind27"  "ind28"  "ind29"  "ind30"  "ind31"  "ind32" 
##  [33] "ind33"  "ind34"  "ind35"  "ind36"  "ind37"  "ind38"  "ind39"  "ind40" 
##  [41] "ind41"  "ind42"  "ind43"  "ind44"  "ind45"  "ind46"  "ind47"  "ind48" 
##  [49] "ind49"  "ind50"  "ind51"  "ind52"  "ind53"  "ind54"  "ind55"  "ind56" 
##  [57] "ind57"  "ind58"  "ind59"  "ind60"  "ind61"  "ind62"  "ind63"  "ind64" 
##  [65] "ind65"  "ind66"  "ind67"  "ind68"  "ind69"  "ind70"  "ind71"  "ind72" 
##  [73] "ind73"  "ind74"  "ind75"  "ind76"  "ind77"  "ind78"  "ind79"  "ind80" 
##  [81] "ind81"  "ind82"  "ind83"  "ind84"  "ind85"  "ind86"  "ind87"  "ind88" 
##  [89] "ind89"  "ind90"  "ind91"  "ind92"  "ind93"  "ind94"  "ind95"  "ind96" 
##  [97] "ind97"  "ind98"  "ind99"  "ind100"

#rownames of the population

head(pop)

##      allele1 allele2
## ind1      NA      NA
## ind2      NA      NA
## ind3      NA      NA
## ind4      NA      NA
## ind5      NA      NA
## ind6      NA      NA

#generated the first part of the population data.
tail(pop)

##        allele1 allele2
## ind95       NA      NA
## ind96       NA      NA
## ind97       NA      NA
## ind98       NA      NA
## ind99       NA      NA
## ind100      NA      NA

#generated the last part of the population data.

#Make sure you understand what sample() does
sample(allele,1)

## [1] "a"

#takes a sample from the allele data with a size of 1.

for(i in 1:popsize){
#for(variable i in a population size of 1.)
  pop[i,1] <- sample(allele,1)
#population i with a size of 1 gets a sample from allele one with a size of 1.
  pop[i,2] <- sample(allele,1)
#population i with a total size of 2 gets a sample from allele 1 with a size of 1.
}
head(pop)

##      allele1 allele2
## ind1 "a"     "A"    
## ind2 "a"     "a"    
## ind3 "A"     "a"    
## ind4 "a"     "a"    
## ind5 "a"     "a"    
## ind6 "a"     "a"

#generate the first part of the data set.
tail(pop)

##        allele1 allele2
## ind95  "a"     "a"    
## ind96  "a"     "a"    
## ind97  "a"     "a"    
## ind98  "a"     "a"    
## ind99  "a"     "a"    
## ind100 "a"     "a"

#generates the last part of the data set.

QUESTION 5: Which plot do you believe is better and why? What is abline? What is lm?

The adjusted plot with abline and the linear model function was better. The abline added a line to the plot while the linear model made the plot fit better for our data set.

QUESTION 6: I expect you to comment each line and figure out the purpose of each function using the ? function. The easiest way to figure this out is to copy and paste the code to your R.script, run it, comment it. Provide comments in the R code chunks. To figure out a function, try ?c() and ?cbind(). We’ll use those a lot. Provide all your code for the rest of the lab.

colnames(pop) <-c("allele1", "allele2")
#column name population gets a combination of allele 1 and allele 2.
head(pop)

##      allele1 allele2
## ind1 "a"     "A"    
## ind2 "a"     "a"    
## ind3 "A"     "a"    
## ind4 "a"     "a"    
## ind5 "a"     "a"    
## ind6 "a"     "a"

#generates the first part of the population data set.

rownames(pop)<-paste("ind", 1:popsize, sep="")
#rownames for the population get pasted from the individual with a population size of 1 and seperated into multiple columns.
rownames(pop)

##   [1] "ind1"   "ind2"   "ind3"   "ind4"   "ind5"   "ind6"   "ind7"   "ind8"  
##   [9] "ind9"   "ind10"  "ind11"  "ind12"  "ind13"  "ind14"  "ind15"  "ind16" 
##  [17] "ind17"  "ind18"  "ind19"  "ind20"  "ind21"  "ind22"  "ind23"  "ind24" 
##  [25] "ind25"  "ind26"  "ind27"  "ind28"  "ind29"  "ind30"  "ind31"  "ind32" 
##  [33] "ind33"  "ind34"  "ind35"  "ind36"  "ind37"  "ind38"  "ind39"  "ind40" 
##  [41] "ind41"  "ind42"  "ind43"  "ind44"  "ind45"  "ind46"  "ind47"  "ind48" 
##  [49] "ind49"  "ind50"  "ind51"  "ind52"  "ind53"  "ind54"  "ind55"  "ind56" 
##  [57] "ind57"  "ind58"  "ind59"  "ind60"  "ind61"  "ind62"  "ind63"  "ind64" 
##  [65] "ind65"  "ind66"  "ind67"  "ind68"  "ind69"  "ind70"  "ind71"  "ind72" 
##  [73] "ind73"  "ind74"  "ind75"  "ind76"  "ind77"  "ind78"  "ind79"  "ind80" 
##  [81] "ind81"  "ind82"  "ind83"  "ind84"  "ind85"  "ind86"  "ind87"  "ind88" 
##  [89] "ind89"  "ind90"  "ind91"  "ind92"  "ind93"  "ind94"  "ind95"  "ind96" 
##  [97] "ind97"  "ind98"  "ind99"  "ind100"

head(pop)

##      allele1 allele2
## ind1 "a"     "A"    
## ind2 "a"     "a"    
## ind3 "A"     "a"    
## ind4 "a"     "a"    
## ind5 "a"     "a"    
## ind6 "a"     "a"

#header of the population is generated
tail(pop)

##        allele1 allele2
## ind95  "a"     "a"    
## ind96  "a"     "a"    
## ind97  "a"     "a"    
## ind98  "a"     "a"    
## ind99  "a"     "a"    
## ind100 "a"     "a"

#tail of the population data is generated.

QUESTION 7: Run the following code 20 times and figure out what it does. How many times did you get “A”? how about “a”? Do you remember what the vector allele is? What do you expect to get? Did you get what you expected?

Instead of taking a sample we kept getting the same result rather than what is typically thought of as a sample since the size of the sample was reduced to 1.

#### Section 4 Initial Figure ####
#Make sure you understand what sample() does
?c(sample())
sample(allele,1)

## [1] "a"

#take a sample of allele with a size of 1.
for(i in 1:popsize){
#for variable 1 in a population size of 1
  pop[i,1] <- sample(allele,1)
#for the population of variable i with a size of 1 gets a sample of the allele data with a size of 1.
  pop[i,2] <- sample(allele,1)
#for the population of variable i with a size of 2 gets a sample of the allele data with a size of 1.
}
head(pop)

##      allele1 allele2
## ind1 "a"     "a"    
## ind2 "a"     "a"    
## ind3 "a"     "a"    
## ind4 "A"     "a"    
## ind5 "a"     "a"    
## ind6 "a"     "a"

#generates the first part of the population data set
tail(pop)

##        allele1 allele2
## ind95  "a"     "a"    
## ind96  "a"     "a"    
## ind97  "a"     "a"    
## ind98  "a"     "a"    
## ind99  "a"     "a"    
## ind100 "a"     "a"

#generates the last part of the population data set

QUESTION 8: Explain the code using comments. Make sure the figures are being produced. This a random sampling process, so it is unlikely you will get the same answer as your neighbor.

A_count<-0
#the count of A gets zero

for(i in 1:popsize){
#for the variable i in a population size of 1
  if(pop[i,1]=="A"){
#if population i with a size of 1 is equal to A
    A_count <- A_count+1
#Count of A gets count of A plus 1
  }
  if(pop[i,2]=="A"){
#if population i with a size of 2 is equal to A
    A_count <- A_count+1
#Count of A gets count of A plus 1
  }
}
A_count

## [1] 32

#we created the chart that explains how Godfrey Hardy and Wilhelm Weinberg arrived at the conclusion that all genotypes must add up to 1.
total_alleles<-popsize*2
#total alleles gets the population size doubled.
A_freq<-A_count/total_alleles
#frequency of A is calculated from the count of A over total alleles.
A_freq

## [1] 0.16

#A frequency
a_freq<-1-A_freq
#a frequency gets 1-A_frequency
a_freq

## [1] 0.84

#a frequency
#What did we just calculate?
#the frequencies of A and a within our data set. Multiplied pop by 2 because each individual has two alleles, created a ratio of all A alleles versus the total alleles to determine A frequency which was then used to determine a frequency as both will add up to 1.

Het_count <- 0
#heterozygous count gets 0
AA_count <- 0
#homozygous dominant count gets 0

for(i in 1:popsize){
#for variable i in a population size of 1
  if(pop[i,1]=="A"){
#if population i with a size of 1 is equal to A 
    if(pop[i,2]=="a"){
#if population i with a size of 2 is equal to a
      Het_count <- Het_count+1
#heterozygous count gets heterozygous count plus 1
    }else{
      #?cbind(else) ?c(else)
#creates a conditional statement like "if else"
      AA_count <- AA_count+1
#homozygous dominant count gets homozygous count plus 1
    }
  }
  if(pop[i,1]=="a"){
#if population i with a size of 1 is equal to a
    if(pop[i,2]=="A"){
#if population i wiht a size of 2 is equal to A
      Het_count <- Het_count+1
#heterozygous count gets heterozygous count plus 1
    }
  }
}
#Explain the for-loop
#through the if statements the i population is evaluated for if the allele is dominant or recessive. If it finds the proper alleles it will add 1 to the count of heterozygous or homozygous dominant.

Het_count

## [1] 24

#heterozygous count
AA_count

## [1] 4

#homozygous dominant count

Het_freq<-Het_count/popsize
#heterozygous frequency is calculated from the heterozygous count divided by population size.
Het_freq

## [1] 0.24

#Heterozygous frequency
Homo_freq<-1-Het_freq
#Homozygous frequency gets 1-heterozygous frequency
Homo_freq

## [1] 0.76

#Homozygous frequency
AA_freq<-AA_count/popsize
#Homozygous dominant frequency is calculated from homozgyous dominant count divided by population size.
AA_freq

## [1] 0.04

#Homozygous dominant frequency

aa_freq<-1-(Het_freq+AA_freq)
#Homozygous recessive frequency is calculated from 1- heterozygous frequency and homozygous dominant frequency.
aa_freq

## [1] 0.72

#Homozygous recessive frequency

print(c(paste("AA", AA_freq), "+", paste("Aa", Het_freq), "+", paste("aa", aa_freq), "=", sum(AA_freq+Het_freq+aa_freq)))

## [1] "AA 0.04" "+"       "Aa 0.24" "+"       "aa 0.72" "="       "1"

#print the combination of pasted AA,Aa, and aa value and frequency. Calculate the sum. 

#you are about to plot Hardy-Weinberg expectations. Does your simulation model meet the assumptions of Hardy-Weinberg? In other words, do the dots fall on the expected curves? Yes they follow a pattern of equaling 1 between all alleles.

#explain the plots
plot(A_freq, Het_freq, xlab="allele frequency of A", ylab="",
       ylim=c(0, 1), xlim=c(0, 1), col="blue")
#plotof the dominant frequency, heterozygous frequency, x axis equals "allele frequency of A" and y axis is blank. The plot shows a scenario (which is how we calculate allelic frequencies in genetics) to explain that all of the allelic frequencies will add up to 1. 
par(new=TRUE)
#new parameters that are true.
plot(A_freq, AA_freq, xlab="", ylab = "",
#plot the dominant, homozygous dominant frequencies with x and y axis blank.
     ylim = c(0,1), xlim = c(0,1), col="green")
#place a limit on the y axis combined from (0,1) and an x limit of (0,1). Color it green.
par(new=T)
#new paramater T
plot(A_freq, aa_freq, xlab = "",
#plot dominant frequency, homozygous recessive frequency and the x axis is blank.
     ylab = "genotype frequency", ylim = c(0,1), xlim = c(0,1),
#y axis equals the genotype frequency and the y limit is combined at (0,1) and the x limit is combined at (0,1) in the color red.The heterozygous value must be restricted to make the model work and conform to the equation of p + 2pq + q = 1.
     col="red")

#hint: 2pq
curve(2*x*(1-x), 0, 1, add=T, ylab=NULL,
#create a curve of 2 times 1 minus x. add points at 0, 1, add variable T, No y axis label
      ylim=c(0, 1),lwd = 2, col="darkblue")
#limit of y is collected between 0 and 1. Line width default at 2. Colored dark blue.
text(0.5, 0.7, "Aa", col = "blue")
#text lables of 0.5 and 0.7 for the heterozygous frequency in the color blue.
#hint: p^2
curve(x^2, 0, 1, add=TRUE, ylab=NULL, lwd=2,
#add a curve of x squared, 0 and 1, add true values, y axis is null, and line width depth of 2.
      ylim=c(0, 1), col="darkgreen")
#y limit is collected from 0 to 1 in the color dark green.
text(0.9, 0.7, "AA", col = "green")
#add the text 0.0,0.7,"AA", col="green".
#hint: q^2
curve((1-x)^2, 0, 1, add=TRUE, ylab=NULL, lwd=2,
#curve is 1-x squared, 0, and 1. Add true values and the y axis is null. Line width of 2.
      ylim=c(0, 1), col="darkred")
#y limit is collected from 0 to 1 and in dark red.
text(0.1, 0.7, "aa", col = "red")

#text added at 0.1, 0.7, homozygous recessive, and is colored red.

QUESTION 9: Provide a figure caption for the figure. Provide as much detail as possible. You must mention the x and y axes, you must mention each color. You must mention the relationship between x and y. Make it clear why we are only monitoring 1 allele.

Figure 1: The relationship between the allele frequency of A and genotypic frequency overall. The x values are the A frequency and the y values are the genotype frequency. The homozygous dominant green line has a exponential pattern. The heterozygous frequency has a bell shaped curve or negative quadratic form. The homozygous recessive has a negative exponential pattern. We only evaluate the A allele frequency because we know that all alleles within our data set have to add up to 1. Therefore, we can use the total number of A alleles to determine the genotype frequency of the other alleles. The relationship with homozygous recessive alleles is reciprocal to the homozygous dominant frequency. Meanwhile, the heterozygous curve is quadratic because the calculation uses 2 pq.

QUESTION 10: In addition to the learning objectives and question you have already answered, is there anything major that you take away from today’s lab?

Don’t get too frustrated with R and give up! Things get faster as you acclimate to the software and learn the general functions. Modeling out scientific concepts in R can make things a lot clearer in understanding the theory behind them.

Click the Knit button and save the file as LASTNAME_LAB01_submit. Make sure you completed the lab and are happy with the look of the knitted .Rmd file into the html. Submit the html file to kgustafson@astate.edu with the subject line “PopGen Lab 01”.

Lab 1 Basics

Hopkins_LAB01_submit

2024-09-08