R Notebook files (*.Rmd) allow you to combine text elements with snippets of code and the outputs generated from said code. There are three main parts to an R Notebook file, which are introduced in this section. In addition, each R Notebook automatically generates an *.html file that provides you with the formatted document including all components, which you will ultimately submit for your homework.
The header, which you can see at the beginning of this document, is
delineated with three dashes (---) at the beginning and the
end. It includes some code that is important for the formatting of
output files, so I would recommend not altering that section. In
general, there should be no reason for you to change the header for any
exercises in this course. However, if you would like to learn more about
the different header options, you can find a good tutorial here.
R code chunks are delineated with three ticks (''') at
the beginning and the end, and {r} after the first set of
ticks lets your computer know that you will be using the R programming
language. You can always add a code chunk by clicking “Insert > Code
Chunk > R” above or by clicking the “+C” icon, although we usually
already created all the chunks you will need in the template. Any text
within a code chunk, if written correctly, represents executable code,
which the computer can interpret as a command to perform certain tasks.
You can make your computer execute the code in a chunk by pressing the
small, green play arrow on the top right corner of each chunk, or you
can just highlight the code and press command+enter (control+enter on
PC). When you execute the code, the output will automatically appear
below a chunk. Sometimes you will find us using hash tags
(#) within code chunks. Hash tags “silence” the text that
follows on the same line, such that the computer jumps over that section
when executing the code. That is useful for code annotation, and you
will frequently see us using the hash tags to add further descriptions
or explanations within code chunks.
Pro tip: If you want to execute all code chunks in a document automatically, you can click “Run > Run All” in the RStudio menu.
The text in between code snippets is just that: text. We will use these sections to provide you with background information and discussion prompts, and you will use these sections to respond to questions and offer your interpretations of data. Sections where you need to write something are always highlighted in italics. You can use a variety of prompts to format your text if you are working with basic Markdown (see here for a cheat sheet). Most of you, however, will prefer the text editor that is implemented in R Studio to format text with the click of a button.
Pro tip: You can toggle back and forth between source code (with Markdown formatting) and the WYSIWYG editor (with text formatting through clicking) by using the Source/Visual buttons in the RStudio menu
As already mentioned, your R Notebook (including text, code chunks, and the outputs from your code) can be automatically knitted into an *.html file. You can click “Preview > Preview Notebook” or “Preview > Knit to HTML” to see the live html version as you are working on your R Notebook (just make sure to save to update), and you can find the shareable *.html file in the same folder as your *.Rmd file (same file name with .nb appended).
Note: Sometime R will prompt you to update some packages in the Console before you can knit the html file. If it is not working on the first try, make sure to check for prompts in the Console.
Having a well-organized file structure is critical to avoid issues with coding, because you will frequently read in data files, and you need to make sure that R knows where to look for those files. To facilitate this process, we will provide you with all the necessary files in a single zipped folder: *.zip. If you are working through this, you have already found the first file. We recommend that you unzip that folder and all of its contents in the location where you want it (e.g., your folder for this course) before starting.
Note that an easy way to set your working directory is to simply create a new project in R Studio using the icon with a blue box and a green plus sign (located at the upper left of your R Studio environment). When you are prompted, selected “Existing Directory” and choose the folder you downloaded from Canvas for this lab: “Introduction to R and Darwinian Natural Selection R Tutorial”. Another advantage of creating an R Studio project is that it will save not only your R Markdown file, but all associated files and preferences you have set up during your work session.
The folder containing the files for a particular exercise is called a
“Working Directory”, and opening an *.Rmd file automatically sets the
working directory to the directory of that R Notebook file. So after
unzipping, it is important not to move any files out of the folder we
provide you with, unless you want to manually tell R where to look for
readable files. If so, you can use the setwd() command to
point R toward the location of your files. Additionally, if you don’t
know what your working directory is, you can always find it using the
getwd() command, which you can type directly into your R
Console; then, you can simply copy that path and paste it into your
setwd() command.
When you install R, your computer can understand and execute a number
of commands. This is what is known as “Base R”. The power of R, however,
is that you can expand the number of commands your computer understands
by installing and loading additional R packages (also called libraries).
There are R packages specialized for pretty much any area of biology,
providing the capability to analyze data from the level of genes and
genomes to ecosystem level processes. We will frequently use a package
called ggplot2, which allows for plotting data. Depending
on the module, you will need to install additional libraries. To
download and install new R packages, go to “Tools > Install
Packages…” and type in the name of the package you want to install.
Alternatively, you can use the install.packages() function.
Fore example, execute the following code chunk to install
ggplot2:
#Install ggplot2
install.packages("ggplot2")
Installing package into ‘C:/Users/emmal/AppData/Local/R/win-library/4.4’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.4/ggplot2_3.5.1.zip'
Content type 'application/zip' length 5021717 bytes (4.8 MB)
downloaded 4.8 MB
package ‘ggplot2’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\emmal\AppData\Local\Temp\Rtmp0ILsFU\downloaded_packages
Note that you only need to install every package once (unless you
reinstall R). I recommend deleting the code chunk above after you run it
successfully, or you can silence it by a hash tag in the beginning of
install.packages("ggplot2"). Failure to do so can cause
problems during the export (knitting) of your R Notebook as an *.html
file.
To make use of installed packages, you also need to load the packages
every time you use R (i.e., every time you restart the
program). You can do this with the library() command, and
you will find a code snippet prompting you to load all needed libraries
at the beginning of each R Notebook (in a section that is typically
called dependencies). You can try it here by executing the code chunk
below to load ggplot2:
#Note that loading a library does not lead to an output
library(ggplot2)
One of the reasons we’re working through the coding basics here is of
course that you will work with actual data. To do that, you will need to
import data into R. With every exercise, we will provide you with one or
more data sets. These data sets will mostly come as *.csv files (which
stands for comma-separated values). They are essentially text files
containing data tables, and you can also open these files in Excel or
other programs. To import data, we will use the read.csv()
function. In the code chunk below, you can import a simple test data set
(“test_data.csv”) that includes the variables sex, length, and mass for
a population of an animal. Note that the fileEncoding
argument simply indicates that I generated the input files on a Mac,
which will prevent some import issues for those of you that use a
PC.
#The line of code simply prompts the computer to read the "test_data.csv" file and generate a data.frame called test.data
test.data <- read.csv("test_data.csv", fileEncoding = 'UTF-8-BOM')
If this worked correctly, you should now see this data set as
test.data in your global environment (top right panel). You
can double click it to view it. There should be three columns: sex,
length, and mass.
A key learning objective of this course is that you learn to visualize data in different ways to facilitate data interpretation in the context of different evolutionary hypotheses. In the following sections, I will explain step by step (that is code line by code line) how to make a simple graph with our sample data set. Let’s aim to make a scatter plot showing the relationship between length and mass in our species. The process is not much different than sketching a graph by hand and layering different parts of the graph on top of each other, just that you use words (code) to make the computer draw.
The first step of making any graph is to define the axes and
establish the coordinate grid that allows for the plotting of the data.
You can do this by calling the ggplot() function within
which you first need to specify the data source (in our case the data
frame we just created, called test.data) and then the so
called aesthetics—aes()—that contain information about what
variables define the x and y axes. In practice, this is accomplished
with the following line of code:
#This line of code calls for the ggplot function (a plotting function) and make a grid based on the test.data data frame, using length as the x axis and mass as the y axis
ggplot(test.data, aes(x=length, y=mass))
The second step is to draw the data into the established coordinate
system. To do so, you just need to tell the program what kind of graph
you want to draw. Different graph types in ggplot2 are
referred to as geoms (geometries), and a scatter plot is designated as
geom_point. You can just add that to your existing code
with a plus sign. For an overview of some of the graph types (geoms)
ggplot2 offers, check the appendix
of our textbook.
ggplot(test.data, aes(x=length, y=mass)) +
geom_point()
Whenever we look at the relationship between two variables, we may
want to add a trendline. You can add a trendline by adding the
geom_smooth() function to your existing code, and
method="lm" designates that your trendline should be
linear. The se argument designates whether or not you want
to draw an error estimate around your trendline.
#The code within the brackets of the geom_smooth command specified some additional options, namely that we want to draw a straight line (method="lm") and that we do not want to show the confidence interval (se=FALSE). Set the se=TRUE and see what happens.
ggplot(test.data, aes(x=length, y=mass)) +
geom_point() +
geom_smooth(method="lm", se=FALSE)
The variable names in the data set do not always provide the clearest
description of what a variable means. We can modify the x and y axis
labels using the xlab() and ylab() functions,
respectively. The actual titles need to be written within quotation
marks:
#Simply add the new label text in quotation marks
ggplot(test.data, aes(x=length, y=mass)) +
geom_point() +
geom_smooth(method="lm", se=FALSE) +
xlab("Body length in cm") +
ylab("Body mass in kg")
I honestly hate the default theme of ggplot with its
gray background. But you can quickly alter the look of the graph by
switching to a number of other possible themes. I personally like the
theme_classic(), but you can customize the look of your
graph with themes listed here.
ggplot(test.data, aes(x=length, y=mass)) +
geom_point() +
geom_smooth(method="lm", se=FALSE) +
xlab("Body length in cm") +
ylab("Body mass in kg") +
theme_classic()
Et voilà! You got yourself a perfectly good graph! As you exercise building graphs throughout the semester, make sure to check the “Practical Skills” sections of individual chapters refer to the appendix of the book as needed.
To get additional advice on how to work with different color schemes
in gglot(), including the use of colorblind-friendly
palettes, please check the corresponding
textbook section.
One of the most iconic study systems in evolutionary biology are Darwin’s finches on the Galapagos Islands. Rosemary and Peter Grant spent much of their lives devoted to the study of these bird, examining how their traits change in response to major ecological perturbations. To do so, they collected a massive, long-term data set on different traits of the medium ground finch (Geospiza fortis) population on Daphne Major Island. For this exercise, we will take a look at their beak size data from 1972-1994.
The beak size data can be found in file called “finches.csv”. The file includes three variables: year, the average relative beak size (rel.beak.size), and the standard error (st.err) that describes the variability of beak size in any given year.
finch <- read.csv("finches.csv", fileEncoding = 'UTF-8-BOM')
The following code chunk provides the base code to make a scatter plot as above. You will only have to specify the x and y variables and label the axes correctly.
ggplot(finch, aes(x=year, y=rel.beak.size)) +
geom_point() +
xlab("Year") +
ylab("Relative Beak Size") +
theme_classic()
There are two graphical elements that we can add to facilitate the interpretation of the data:
geom_line().geom_errorbar(). Make sure to specify the x
and y axes variables as aboveggplot(finch, aes(x=year, y=rel.beak.size)) +
geom_point() +
geom_line() +
geom_errorbar(aes(ymin=rel.beak.size-st.err, ymax=rel.beak.size+st.err)) +
xlab("Year") +
ylab("Relative Beak Size") +
theme_classic()
Based on the graphs you just made, what do you observe? How do you interpret the data if I told you that 1977 was a massive drought year?
Variability was high before the 1970s, when food availability was broad. During a drought, food sources shift to larger, harder seeds that can withstand the drought. This shift added pressure to the survival of birds with smaller beaks, leading in a rise of survival and reproduction among larger beaked birds.
Do you think these data reflect evolutionary change through time? What is a potential alternative explanation? What additional information would you need to either accept or reject the hypothesis that these patterns reflect evolutionary change?
#This command loads required packages
library(ggplot2)
To wrap our head around some of the basic observations that led Darwin to infer natural selection, we will spend a little bit of time with the cave molly. The cave molly (Poecilia mexicana) is a small species of livebearing fish that occurs in a couple of small caves in Southern Mexico. One of the caves, the Cueva Luna Azufre, has a wetted area of only 39 square-meters. Even though the available habitat is really small, there has been an isolated population of cave mollies in this cave for several thousand years. Interestingly, mollies also occur in adjacent surface habitats. In the picture below, you can see the a male and a female of the surface (top two pictures) and the cave form (bottom two pictures) side by side.
The first set of observations that led Darwin to infer the process of natural selection related to the imbalance of organisms’ reproductive power and limitations of resource availability. Quantifying the effective reproductive output and resource availability in nature can be difficult. However, what we can do is to measure proxies for these traits and then use simple mathematical models to test whether our predictions and inferences are valid. Here, we use exponential and logistic population growth models to explore whether there is really a struggle for existence in cave mollies.
Even large animals with long generation times have an incredible reproductive potential. Cave mollies—as many other cave organisms—have a comparatively low fecundity, and females only give birth to one or two fully developed young at a time. Life history analyses based on female longevity and fecundity have revealed that the average female gives birth to about 3 offspring over her life; not exactly what you would call huge reproductive potential, right? But in reality, it is not the reproductive potential of individuals that counts, but the reproductive potential of populations. To illustrate this point, we want you to model population growth for a hypothetical population of cave mollies. Specifically, use the code below to simulate and graph the population growth of an initial cave molly population of 2 individuals (the initial colonizers of the cave).
How many generations would it take for the population to grow to a million? Under what circumstances might you see population growth like this? Do you think Darwin’s observation that “species have great potential fertility” holds true for cave mollies?
#Choose an initial population size
N0 = 2
#Choose the average number of offspring
b = 3
#Choose a range of generations you want to estimate population size for; default is generation 0 to 15
t = 0:15
#Calculate the population size for each generation
N = N0*b^t
#Merge the results of the simulation into a single table
final.results <- as.data.frame(cbind(t,N))
#You can view the results by just calling the data frame
print(final.results)
#Plot the results, make sure you properly label the axes
ggplot(final.results, aes(x=t, y=N)) +
geom_point() +
xlab("Generations") +
ylab("Population Size") +
theme_classic()
Around 12 generations
Exponential growth only occurs in very specific circumstances. In a cave that is only the fraction of the size of a football field, you would obviously never find a cave molly population of a million. The logistic model more accurately describes population growth in nature. Based on our past analyses, we estimate the population growth coefficient (𝛌) to be around 1.3 and the carrying capacity (K) of the cave around 360 individuals.
How long would it take for the population to reach the carrying capacity if there were two initial colonizers? What do you think determines K for the population of cave mollies in the Cueva Luna Azufre?
#Choose an initial population size
N0 = 2
#Choose population growth rate
lamda = 1.3
#Choose a range of generations you want to estimate population size for
t = 0:15
#Choose a carrying capacity
K = 360
#Calculate the population size for each generation
N = (N0*K)/(N0+(K-N0)*exp(-lamda*t))
#Merge the results of the simulation into a single table
final.results <- as.data.frame(cbind(t,N))
#Use the ggplot function to plot the results, make sure you properly label the axes
ggplot(final.results, aes(x=t, y=N)) +
geom_point() +
xlab("Generations") +
ylab("Population Size") +
theme_classic()
The population reaches carrying capacity after around 7 generations. Carrying capacity of cave mollies depends on limiting factors within the cave like food availability, space, predation, and dissolved oxygen.
Compare the two models (exponential and logistic) that were ran with the same initial parameters. What do the different outcomes mean for individual offspring that are born in any given generation? How might this discrepancy important in the context of evolution?
A smaller population size makes it more likely for evolutionary change to occur. If there was an infinite spawn of these fish, limiting factors and environmental factors would apply no evolutionary forces.
Another of Darwin’s key observations was just how variable individuals of the same species are. Let’s explore some of that variation in cave mollies. To do that, we first need to load some data into R. These data were collected as part of my dissertation and include the following variables: habitat (cave or surface), sex (male or female), standard length (in mm, from the snout to the caudal fin base), eye diameter (in mm), head length (in mm), head width (in mm), predorsal length (in mm, from the snout to the insertion of the dorsal fin), and gape width (in mm, from one corner of the mouth to the other).
#Use the read.csv function to import a dataset; take a look at the data structure once you imported the file!
morph.data <- read.csv("morphological_variation.csv", fileEncoding = 'UTF-8-BOM')
A simple way to compare variation within and between populations is
to plot a frequency histogram (which represents the raw counts) along
with a density plot (which represents the approximated statistical
distribution). You can generate a histogram with the
geom_histogram() function and designate any trait you may
want as the x axis. You can calculate the density with
aes(y=..density..) within geom_histogram() and
then plot it with geom_density(). Note that when you have
more than two groups (in our case we have samples from a cave and a
surface population), you can visualize them separately by designating a
different color for each group in the aesthetics
(fill=Habitat).
When you visualize body size variation in this manner what do you observe? Is there more variation within or between populations?
#Use the ggplot function to graph the histogram (see: http://www.sthda.com/english/wiki/ggplot2-histogram-plot-quick-start-guide-r-software-and-data-visualization)
ggplot(morph.data, aes(x=Standard.length, fill=Habitat)) +
geom_histogram(aes(y=..density..)) +
geom_density(alpha=0.5)+
xlab("standard length") +
ylab("density") +
theme_classic()
The surface population appears to have more variation in length than teh cave fish. The modes of length seem similar between habitats. Cave species are more abundant at smaller lengths than surface species.
Let’s also compare a second trait, predorsal length. With the
previous graph you hopefully saw how variable overall body size is
within populations. If we want to compare other traits, we have to
account for that. We want to know whether variation in predorsal length
is due to variation in size (small fish have small predorsal lengths) or
whether other patterns might be at play. To do so, we can calculate the
residual predorsal length as from a regression between predorsal and
standard length using the lm(y ~ x, data) and
residuals() functions:
#Calculating regression line
fit1 <- lm(Predorsal.length ~ Standard.length, data = morph.data)
#Extract residuals and create a new variable res.predorsal in the morph.data data frame
morph.data$res.predorsal <- residuals(fit1)
You can then use the new variable to plot the residual predorsal length, which is corrected for body size:
##Use the ggplot function to graph the histogram and color data based on habitat
ggplot(morph.data, aes(x=res.predorsal, fill=Habitat)) +
geom_histogram(aes(y=..density..)) +
geom_density(alpha=0.5)+
xlab("Pre-dorsal length (corrected for body size)") +
ylab("Density") +
theme_classic()
When you plot relative predorsal length, what do you observe? How does variation in predorsal length vary within and between populations, and how does it compare to variation in standard length?
Using the same approach as for predorsal variation, compare variation in relative eye diameter:
#Your code goes here
fit1 <- lm(Eye.diameter ~ Standard.length, data = morph.data)
#Extract residuals and create a new variable res.predorsal in the morph.data data frame
morph.data$res.eye<- residuals(fit1)
##Use the ggplot function to graph the histogram and color data based on habitat
ggplot(morph.data, aes(x=res.eye, fill=Habitat)) + geom_histogram(aes(y=..density..)) + geom_density(alpha=0.5)+ xlab("Relative Eye Diameter (corrected for body size)")
What do you observe? How does variation in eye diameter vary within and between populations, and how does it compare to variation in the other traits?
Unlike the other histograms, the variation in Eye diameter is more dependent on habitat. The Cave habitat is distributed to the left, with a generally smaller eye diameter.
An avid breeder of fancy pigeons, Darwin observed that specific traits are passed from parents to offspring, even though he had no clue how this might actually work (genetics was not really a thing yet). Even without an ability to conduct molecular genetic analyses, we can estimate heritability of traits by comparing the traits of offspring to the traits of the parents.
Let’s load some data that compares parent and offspring traits in cave mollies. To do this, we brought cave mollies into the lab and bred them under standardized conditions. Data represent the average trait values of the mother and father and of all offspring from a specific brood. The easiest way to compare parent and offspring traits is through a scatter plot, which we already used in Exercise 1. If a trait is heritable, we would expect to see a correlation between parent and offspring traits (e.g., parents with small eyes should have offspring with small eyes).
The following dataset includes measurements of parental and offspring standard length as well as eye size.
#Use the read.csv function to import a dataset; take a look at the data structure once you imported the file!
heritability <- read.csv("heritability.csv", fileEncoding = 'UTF-8-BOM')
First, let us explore whether there is evidence for heritability in standard length.
What do you observe? Is standard length a heritable trait?
ggplot(heritability, aes(x=parent.standard.length, y=offspring.standard.length)) +
geom_point() +
geom_smooth(method = "lm") +
xlab("Parent Size") +
ylab("Offspring Size") +
theme_classic()
Now let us explore whether there is any heritability in eye size. Remember, there is substantial variation in body size, and in such cases, we want to control for body size by calculating residual eye size first.
##Calculate residual eye sizes for the parents and the offspring
#Your code goes here:
fit1 <- lm(parent.eye.size ~ parent.standard.length, data = heritability)
#extract
heritability$res.parent.eye <-residuals(fit1)
fit1 <-lm(offspring.eye.size ~ offspring.standard.length, data = heritability)
heritability$res.offspring.eye <- residuals(fit1)
#Plot the results
ggplot(heritability, aes(x=res.parent.eye, y=res.offspring.eye)) +
geom_point() +
geom_smooth(method = "lm") +
xlab("Parent Relative Eye Diameter (Corrected for Body Size)") +
ylab("Offspring Relative Eye Diameter (Correcred for Body Size)") +
theme_classic()
What do you observe? Is standard length a heritable trait? #is this a typo?
These two values seem to have a strong positive correlation. I would observe that eye size is heritable.
I am not sure if I did this section correctly because it does not properly answer the questions above or below, so here is another graph showing the relationship of size accounting for eye size which I also do not know if that makes sense.
##Calculate residual eye sizes for the parents and the offspring
#Your code goes here:
fit1 <- lm(parent.standard.length ~ parent.eye.size, data = heritability)
#extract
heritability$res.parent.standard.length <-residuals(fit1)
fit1 <-lm(offspring.standard.length ~ offspring.eye.size, data = heritability)
heritability$res.offspring.standard.length <- residuals(fit1)
#Plot the results
ggplot(heritability, aes(x=res.parent.standard.length, y=res.offspring.standard.length)) +
geom_point() +
geom_smooth(method = "lm") +
xlab("Parent Relative Body Size (Corrected for Eye Size)") +
ylab("Offspring Relative Body Size (Correcred for Eye Size)") +
theme_classic()
This is interesting. I would not think as eye size as something that may explain body size. But this correlation does appear much more strong and positive accounting for eye size.
Imagine for a moment that smaller fish have a higher likelihood of survival in the cave. Would you expect evolution of body size upon cave colonization?
Imagine for a moment that fish with smaller eyes have a higher likelihood of survival in the cave. Would you expect evolution of eye size upon cave colonization? Justify your response.
This tutorial was adapted from an open-source textbook that teaches Evolution in R:
Data on beak size variation in Darwin’s finches came from the following publication:
The eye size data was published in the following paper. Other measurements are unpublished data by M. Tobler.
Consulting additional resources to solve this assignment is absolutely allowed, but failure to disclose those resources is plagiarism. Please list any collaborators you worked with and resources you used below or state that you have not used any.
I did not use any alternative resources.