QUESTION 1: In roughly 2–3 sentences (maximum), what excites you about Population Genetics and why did you take my course?
#### Section 1 ####
1+1 #Notice how R produces a single object as an answer [1]. Also notice how it populates the answer inside that vector object; the answer is 2
## [1] 2
2+2
## [1] 4
#Creating objects to represent the genome of a human (_Homo sapiens_)
#notice your environment population
Diploid <-2 #we are indeed diploid organisms
Human_genome <-23 #we have 23 chromosomes
#How many chromosomes does a zygote have?
Zygote<- Diploid*Human_genome
Zygote #this should give me the answer for 2 x 23
## [1] 46
#Considering Mendel's law of Segregation and Independent Assortment only,
#how many unique gametes can a human produce? The equation for this question is: diploidy raised to the power of the number of chromosome pairs, for humans: 2^23
unique1 <- Diploid^Human_genome
unique1
## [1] 8388608
?format() #this is a reminder to check out the help files
## starting httpd help server ... done
format(unique1, big.mark=",") #explain what this does
## [1] "8,388,608"
#How do humans compare to the Indian Muntjac? #how many unique gametes can an Indian Muntjac produce? (NOTE: This is the smallest genome of any mammal on earth)
IM_genome <-3
unique2 <- Diploid^IM_genome
unique2
## [1] 8
#How about the Red Vizcacha Rat? #how many unique gametes can a Red Vizcacha Rat produce? (NOTE: This is the largest genome of any mammal on earth)
RVR_genome <-51
unique3 <- Diploid^RVR_genome
unique3 #I'm not smart enough to understand this answer. This is an answer for computers, not humans. Lets fix it
## [1] 2.2518e+15
format(unique3, big.mark=",", scientific = F) #That's better, although still difficult to comprehend
## [1] "2,251,799,813,685,248"
QUESTION 2: What (if anything) scares you or excites you the most about learning R and RStudio within the framework of population genetics?
QUESTION 3: Can you think of any other benefits of commenting code? Any cons?
QUESTION 4: Ultimately, you will comment your code in a way that makes sense to you. Or you won’t, and you’ll be disappointed later. Observe the commenting examples below. Why do you think I am using comments in different ways?
#### Section heading ####
#This general comment could document the following related lines
1+1 #this comment could describe the specific line
## [1] 2
2+2 #this comment could describe the specific line
## [1] 4
#This general comment could document the following related lines
A<-1 #this comment could describe the specific line
A+A #this comment could describe the specific line
## [1] 2
B<-2 #this comment could describe the specific line
B+B #this comment could describe the specific line
## [1] 4
#### Section 2 Plots ####
#Why am I doing this 4-hashtag commenting? Have you figured it out?
#### Makes the r script condense at that section title. Good space saver####
?c()
#combine. Can return a vector, create columns, and concatenate or combine multiple vectors.
genomes<-c(Human_genome, IM_genome, RVR_genome)
#genomes get combined variables of human genome, IM genome, and RVR genome.
uniques<-c(unique1, unique2, unique3)
#uniques get combined variables of unique 1, unique 2, and unique 3
uniques2<-c(log10(unique1), log10(unique2), log10(unique3))
#uniques2 gets combined log10 values of unique 1, 2, and 3.
?data.frame()
#creates a collection of variables like a matrix or list
df<-data.frame(genomes, uniques, uniques2)
#df now reresents that data frame with genomics, uniques and uniques2
df
## genomes uniques uniques2
## 1 23 8.388608e+06 6.92369
## 2 3 8.000000e+00 0.90309
## 3 51 2.251800e+15 15.35253
#If you don't understand what I just did, read the help files, ask a fellow student, then ask me
#
plot(df$genomes, df$uniques,
#plot the data frame of genomes and the data frame of uniques
xlab = "Haploid genome size", ylab = "Unique gamete combinations")
#customizes the x axis with the title of "Haploid genome size: and the y axis with unique gamete combinations
#
abline(lm(df$uniques~df$genomes), col="blue")
#Adds 1 or more straight lines through the plot of a linear regression model with a data set of uniques and genomes
#
plot(df$genomes, df$uniques2,
#plot the data frame of genomes and uniques2
xlab = "Haploid genome size", ylab = "log10(Unique gamete combinations)")
abline(lm(df$uniques2~df$genomes), col="blue")
# abline(lm(df$uniques2~df$genomes), col="blue")c. Just figured out that Ctrl/Shift/C makes anything into a comment. Good for faulty code.
#### Section 3 Allele Frequencies ####
#Manually creating an allele frequency of allele "A" 20% and allele "a" 80%
allele <- c("A","A","a","a","a","a","a","a","a","a")
allele
## [1] "A" "A" "a" "a" "a" "a" "a" "a" "a" "a"
#Manually creating an allele frequency of allele "A" 20% and allele "a" 80%
allele <- c("A","A","a","a","a","a","a","a","a","a")
#allele gets a combined ("A","A","a","a","a","a","a","a","a","a")
allele
## [1] "A" "A" "a" "a" "a" "a" "a" "a" "a" "a"
allele2 <- c(rep("a",8), rep("A",2))
#allele 2 gets a combined repeat of variable "A" 2 times.
allele2
## [1] "a" "a" "a" "a" "a" "a" "a" "a" "A" "A"
#annotate this code to the best of your ability, line by line. Use ? help files when needed
popsize <-100
#population size gets 100
pop <- matrix(nrow=popsize, ncol=2)
#population gets matrix for the number of rows equal to population size, number of columns is 2.
head(pop)
## [,1] [,2]
## [1,] NA NA
## [2,] NA NA
## [3,] NA NA
## [4,] NA NA
## [5,] NA NA
## [6,] NA NA
#retrieve the head or first part of the value in parenthesis.
tail(pop)
## [,1] [,2]
## [95,] NA NA
## [96,] NA NA
## [97,] NA NA
## [98,] NA NA
## [99,] NA NA
## [100,] NA NA
#retrieve the tail or last part of the value in parenthesis.
colnames(pop) <-c("allele1", "allele2")
#column names of population get a combination of allele 1 and allele 2
head(pop)
## allele1 allele2
## [1,] NA NA
## [2,] NA NA
## [3,] NA NA
## [4,] NA NA
## [5,] NA NA
## [6,] NA NA
#retrieve the first part of the population data set.
rownames(pop)<-paste("ind", 1:popsize,sep="")
#rownames for the population get pasted from individual with a population size of 1., and separate the columns into multiples
rownames(pop)
## [1] "ind1" "ind2" "ind3" "ind4" "ind5" "ind6" "ind7" "ind8"
## [9] "ind9" "ind10" "ind11" "ind12" "ind13" "ind14" "ind15" "ind16"
## [17] "ind17" "ind18" "ind19" "ind20" "ind21" "ind22" "ind23" "ind24"
## [25] "ind25" "ind26" "ind27" "ind28" "ind29" "ind30" "ind31" "ind32"
## [33] "ind33" "ind34" "ind35" "ind36" "ind37" "ind38" "ind39" "ind40"
## [41] "ind41" "ind42" "ind43" "ind44" "ind45" "ind46" "ind47" "ind48"
## [49] "ind49" "ind50" "ind51" "ind52" "ind53" "ind54" "ind55" "ind56"
## [57] "ind57" "ind58" "ind59" "ind60" "ind61" "ind62" "ind63" "ind64"
## [65] "ind65" "ind66" "ind67" "ind68" "ind69" "ind70" "ind71" "ind72"
## [73] "ind73" "ind74" "ind75" "ind76" "ind77" "ind78" "ind79" "ind80"
## [81] "ind81" "ind82" "ind83" "ind84" "ind85" "ind86" "ind87" "ind88"
## [89] "ind89" "ind90" "ind91" "ind92" "ind93" "ind94" "ind95" "ind96"
## [97] "ind97" "ind98" "ind99" "ind100"
#rownames of the population
head(pop)
## allele1 allele2
## ind1 NA NA
## ind2 NA NA
## ind3 NA NA
## ind4 NA NA
## ind5 NA NA
## ind6 NA NA
#generated the first part of the population data.
tail(pop)
## allele1 allele2
## ind95 NA NA
## ind96 NA NA
## ind97 NA NA
## ind98 NA NA
## ind99 NA NA
## ind100 NA NA
#generated the last part of the population data.
#Make sure you understand what sample() does
sample(allele,1)
## [1] "a"
#takes a sample from the allele data with a size of 1.
for(i in 1:popsize){
#for(variable i in a population size of 1.)
pop[i,1] <- sample(allele,1)
#population i with a size of 1 gets a sample from allele one with a size of 1.
pop[i,2] <- sample(allele,1)
#population i with a total size of 2 gets a sample from allele 1 with a size of 1.
}
head(pop)
## allele1 allele2
## ind1 "a" "A"
## ind2 "a" "a"
## ind3 "A" "a"
## ind4 "a" "a"
## ind5 "a" "a"
## ind6 "a" "a"
#generate the first part of the data set.
tail(pop)
## allele1 allele2
## ind95 "a" "a"
## ind96 "a" "a"
## ind97 "a" "a"
## ind98 "a" "a"
## ind99 "a" "a"
## ind100 "a" "a"
#generates the last part of the data set.
QUESTION 5: Which plot do you believe is better and why? What is
abline? What islm?
QUESTION 6: I expect you to comment each line and figure out the purpose of each function using the
?function. The easiest way to figure this out is to copy and paste the code to your R.script, run it, comment it. Provide comments in the R code chunks. To figure out a function, try?c()and?cbind(). We’ll use those a lot. Provide all your code for the rest of the lab.
colnames(pop) <-c("allele1", "allele2")
#column name population gets a combination of allele 1 and allele 2.
head(pop)
## allele1 allele2
## ind1 "a" "A"
## ind2 "a" "a"
## ind3 "A" "a"
## ind4 "a" "a"
## ind5 "a" "a"
## ind6 "a" "a"
#generates the first part of the population data set.
rownames(pop)<-paste("ind", 1:popsize, sep="")
#rownames for the population get pasted from the individual with a population size of 1 and seperated into multiple columns.
rownames(pop)
## [1] "ind1" "ind2" "ind3" "ind4" "ind5" "ind6" "ind7" "ind8"
## [9] "ind9" "ind10" "ind11" "ind12" "ind13" "ind14" "ind15" "ind16"
## [17] "ind17" "ind18" "ind19" "ind20" "ind21" "ind22" "ind23" "ind24"
## [25] "ind25" "ind26" "ind27" "ind28" "ind29" "ind30" "ind31" "ind32"
## [33] "ind33" "ind34" "ind35" "ind36" "ind37" "ind38" "ind39" "ind40"
## [41] "ind41" "ind42" "ind43" "ind44" "ind45" "ind46" "ind47" "ind48"
## [49] "ind49" "ind50" "ind51" "ind52" "ind53" "ind54" "ind55" "ind56"
## [57] "ind57" "ind58" "ind59" "ind60" "ind61" "ind62" "ind63" "ind64"
## [65] "ind65" "ind66" "ind67" "ind68" "ind69" "ind70" "ind71" "ind72"
## [73] "ind73" "ind74" "ind75" "ind76" "ind77" "ind78" "ind79" "ind80"
## [81] "ind81" "ind82" "ind83" "ind84" "ind85" "ind86" "ind87" "ind88"
## [89] "ind89" "ind90" "ind91" "ind92" "ind93" "ind94" "ind95" "ind96"
## [97] "ind97" "ind98" "ind99" "ind100"
head(pop)
## allele1 allele2
## ind1 "a" "A"
## ind2 "a" "a"
## ind3 "A" "a"
## ind4 "a" "a"
## ind5 "a" "a"
## ind6 "a" "a"
#header of the population is generated
tail(pop)
## allele1 allele2
## ind95 "a" "a"
## ind96 "a" "a"
## ind97 "a" "a"
## ind98 "a" "a"
## ind99 "a" "a"
## ind100 "a" "a"
#tail of the population data is generated.
QUESTION 7: Run the following code 20 times and figure out what it does. How many times did you get “A”? how about “a”? Do you remember what the vector
alleleis? What do you expect to get? Did you get what you expected?
- Instead of taking a sample we kept getting the same result rather than what is typically thought of as a sample since the size of the sample was reduced to 1.
#### Section 4 Initial Figure ####
#Make sure you understand what sample() does
?c(sample())
sample(allele,1)
## [1] "a"
#take a sample of allele with a size of 1.
for(i in 1:popsize){
#for variable 1 in a population size of 1
pop[i,1] <- sample(allele,1)
#for the population of variable i with a size of 1 gets a sample of the allele data with a size of 1.
pop[i,2] <- sample(allele,1)
#for the population of variable i with a size of 2 gets a sample of the allele data with a size of 1.
}
head(pop)
## allele1 allele2
## ind1 "a" "a"
## ind2 "a" "a"
## ind3 "a" "a"
## ind4 "A" "a"
## ind5 "a" "a"
## ind6 "a" "a"
#generates the first part of the population data set
tail(pop)
## allele1 allele2
## ind95 "a" "a"
## ind96 "a" "a"
## ind97 "a" "a"
## ind98 "a" "a"
## ind99 "a" "a"
## ind100 "a" "a"
#generates the last part of the population data set
QUESTION 8: Explain the code using comments. Make sure the figures are being produced. This a random sampling process, so it is unlikely you will get the same answer as your neighbor.
A_count<-0
#the count of A gets zero
for(i in 1:popsize){
#for the variable i in a population size of 1
if(pop[i,1]=="A"){
#if population i with a size of 1 is equal to A
A_count <- A_count+1
#Count of A gets count of A plus 1
}
if(pop[i,2]=="A"){
#if population i with a size of 2 is equal to A
A_count <- A_count+1
#Count of A gets count of A plus 1
}
}
A_count
## [1] 32
#we created the chart that explains how Godfrey Hardy and Wilhelm Weinberg arrived at the conclusion that all genotypes must add up to 1.
total_alleles<-popsize*2
#total alleles gets the population size doubled.
A_freq<-A_count/total_alleles
#frequency of A is calculated from the count of A over total alleles.
A_freq
## [1] 0.16
#A frequency
a_freq<-1-A_freq
#a frequency gets 1-A_frequency
a_freq
## [1] 0.84
#a frequency
#What did we just calculate?
#the frequencies of A and a within our data set. Multiplied pop by 2 because each individual has two alleles, created a ratio of all A alleles versus the total alleles to determine A frequency which was then used to determine a frequency as both will add up to 1.
Het_count <- 0
#heterozygous count gets 0
AA_count <- 0
#homozygous dominant count gets 0
for(i in 1:popsize){
#for variable i in a population size of 1
if(pop[i,1]=="A"){
#if population i with a size of 1 is equal to A
if(pop[i,2]=="a"){
#if population i with a size of 2 is equal to a
Het_count <- Het_count+1
#heterozygous count gets heterozygous count plus 1
}else{
#?cbind(else) ?c(else)
#creates a conditional statement like "if else"
AA_count <- AA_count+1
#homozygous dominant count gets homozygous count plus 1
}
}
if(pop[i,1]=="a"){
#if population i with a size of 1 is equal to a
if(pop[i,2]=="A"){
#if population i wiht a size of 2 is equal to A
Het_count <- Het_count+1
#heterozygous count gets heterozygous count plus 1
}
}
}
#Explain the for-loop
#through the if statements the i population is evaluated for if the allele is dominant or recessive. If it finds the proper alleles it will add 1 to the count of heterozygous or homozygous dominant.
Het_count
## [1] 24
#heterozygous count
AA_count
## [1] 4
#homozygous dominant count
Het_freq<-Het_count/popsize
#heterozygous frequency is calculated from the heterozygous count divided by population size.
Het_freq
## [1] 0.24
#Heterozygous frequency
Homo_freq<-1-Het_freq
#Homozygous frequency gets 1-heterozygous frequency
Homo_freq
## [1] 0.76
#Homozygous frequency
AA_freq<-AA_count/popsize
#Homozygous dominant frequency is calculated from homozgyous dominant count divided by population size.
AA_freq
## [1] 0.04
#Homozygous dominant frequency
aa_freq<-1-(Het_freq+AA_freq)
#Homozygous recessive frequency is calculated from 1- heterozygous frequency and homozygous dominant frequency.
aa_freq
## [1] 0.72
#Homozygous recessive frequency
print(c(paste("AA", AA_freq), "+", paste("Aa", Het_freq), "+", paste("aa", aa_freq), "=", sum(AA_freq+Het_freq+aa_freq)))
## [1] "AA 0.04" "+" "Aa 0.24" "+" "aa 0.72" "=" "1"
#print the combination of pasted AA,Aa, and aa value and frequency. Calculate the sum.
#you are about to plot Hardy-Weinberg expectations. Does your simulation model meet the assumptions of Hardy-Weinberg? In other words, do the dots fall on the expected curves? Yes they follow a pattern of equaling 1 between all alleles.
#explain the plots
plot(A_freq, Het_freq, xlab="allele frequency of A", ylab="",
ylim=c(0, 1), xlim=c(0, 1), col="blue")
#plotof the dominant frequency, heterozygous frequency, x axis equals "allele frequency of A" and y axis is blank. The plot shows a scenario (which is how we calculate allelic frequencies in genetics) to explain that all of the allelic frequencies will add up to 1.
par(new=TRUE)
#new parameters that are true.
plot(A_freq, AA_freq, xlab="", ylab = "",
#plot the dominant, homozygous dominant frequencies with x and y axis blank.
ylim = c(0,1), xlim = c(0,1), col="green")
#place a limit on the y axis combined from (0,1) and an x limit of (0,1). Color it green.
par(new=T)
#new paramater T
plot(A_freq, aa_freq, xlab = "",
#plot dominant frequency, homozygous recessive frequency and the x axis is blank.
ylab = "genotype frequency", ylim = c(0,1), xlim = c(0,1),
#y axis equals the genotype frequency and the y limit is combined at (0,1) and the x limit is combined at (0,1) in the color red.The heterozygous value must be restricted to make the model work and conform to the equation of p + 2pq + q = 1.
col="red")
#hint: 2pq
curve(2*x*(1-x), 0, 1, add=T, ylab=NULL,
#create a curve of 2 times 1 minus x. add points at 0, 1, add variable T, No y axis label
ylim=c(0, 1),lwd = 2, col="darkblue")
#limit of y is collected between 0 and 1. Line width default at 2. Colored dark blue.
text(0.5, 0.7, "Aa", col = "blue")
#text lables of 0.5 and 0.7 for the heterozygous frequency in the color blue.
#hint: p^2
curve(x^2, 0, 1, add=TRUE, ylab=NULL, lwd=2,
#add a curve of x squared, 0 and 1, add true values, y axis is null, and line width depth of 2.
ylim=c(0, 1), col="darkgreen")
#y limit is collected from 0 to 1 in the color dark green.
text(0.9, 0.7, "AA", col = "green")
#add the text 0.0,0.7,"AA", col="green".
#hint: q^2
curve((1-x)^2, 0, 1, add=TRUE, ylab=NULL, lwd=2,
#curve is 1-x squared, 0, and 1. Add true values and the y axis is null. Line width of 2.
ylim=c(0, 1), col="darkred")
#y limit is collected from 0 to 1 and in dark red.
text(0.1, 0.7, "aa", col = "red")
#text added at 0.1, 0.7, homozygous recessive, and is colored red.
QUESTION 9: Provide a figure caption for the figure. Provide as much detail as possible. You must mention the x and y axes, you must mention each color. You must mention the relationship between x and y. Make it clear why we are only monitoring 1 allele.
- Figure 1: The relationship between the allele frequency of A and genotypic frequency overall. The x values are the A frequency and the y values are the genotype frequency. The homozygous dominant green line has a exponential pattern. The heterozygous frequency has a bell shaped curve or negative quadratic form. The homozygous recessive has a negative exponential pattern. We only evaluate the A allele frequency because we know that all alleles within our data set have to add up to 1. Therefore, we can use the total number of A alleles to determine the genotype frequency of the other alleles. The relationship with homozygous recessive alleles is reciprocal to the homozygous dominant frequency. Meanwhile, the heterozygous curve is quadratic because the calculation uses 2 pq.
QUESTION 10: In addition to the learning objectives and question you have already answered, is there anything major that you take away from today’s lab?
- Don’t get too frustrated with R and give up! Things get faster as you acclimate to the software and learn the general functions. Modeling out scientific concepts in R can make things a lot clearer in understanding the theory behind them.
Click the Knit button and save the file as
LASTNAME_LAB01_submit. Make sure you completed the lab
and are happy with the look of the knitted .Rmd file into the
html. Submit the html file to kgustafson@astate.edu with the
subject line “PopGen Lab 01”.