Grace Strobbe
1. The General Structure of an R Notebook
R Notebook files (*.Rmd) allow you to combine text elements with
snippets of code and the outputs generated from said code. There are
three main parts to an R Notebook file, which are introduced in this
section. In addition, each R Notebook automatically generates an *.html
file that provides you with the formatted document including all
components, which you will ultimately submit for your homework.
1.2. Code Chunks
R code chunks are delineated with three backticks (the ` character, typed three times) at
the beginning and the end, and {r} after the first set of
backticks lets your computer know that you will be using the R programming
language. You can always add a code chunk by clicking “Insert > Code
Chunk > R” above or by clicking the “+C” icon, although we usually
already created all the chunks you will need in the template. Any text
within a code chunk, if written correctly, represents executable code,
which the computer can interpret as a command to perform certain tasks.
You can make your computer execute the code in a chunk by pressing the
small, green play arrow on the top right corner of each chunk, or you
can just highlight the code and press command+enter (control+enter on
PC). When you execute the code, the output will automatically appear
below the chunk. Sometimes you will find us using hash signs
(#) within code chunks. A hash sign “silences” the text that
follows on the same line, such that the computer skips that text
when executing the code. That is useful for code annotation, and you
will frequently see us using hash signs to add further descriptions
or explanations within code chunks.
Pro tip: If you want to execute all code chunks in a document
automatically, you can click “Run > Run All” in the RStudio menu.
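As a minimal sketch of how comments and executable code sit side by side in a chunk (the variable name total is made up for illustration), consider:

```r
# This entire line is a comment, so R skips it
total <- 2 + 3   # a comment can also follow code on the same line

# print(total * 100)   this line is silenced and never runs
print(total)
```

Only the last line produces output, because every other call to print() is hidden behind a hash sign.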
1.3. Text
The text in between code snippets is just that: text. We will use
these sections to provide you with background information and discussion
prompts, and you will use these sections to respond to questions and
offer your interpretations of data. Sections where you need to write
something are always highlighted in italics. You can use a
variety of prompts to format your text if you are working with basic
Markdown (see here
for a cheat sheet). Most of you, however, will prefer the text editor
that is built into RStudio to format text with the click of a
button.
Pro tip: You can toggle back and forth between source code (with
Markdown formatting) and the WYSIWYG editor (with text formatting
through clicking) by using the Source/Visual buttons in the RStudio
menu.
1.4. HTML Preview and Output
As already mentioned, your R Notebook (including text, code chunks,
and the outputs from your code) can be automatically knitted into an
*.html file. You can click “Preview > Preview Notebook” or “Preview
> Knit to HTML” to see the live html version as you are working on
your R Notebook (just make sure to save to update), and you can find the
shareable *.html file in the same folder as your *.Rmd file (same file
name, but ending in .nb.html).
Note: Sometimes R will prompt you to update some packages in the
Console before you can knit the html file. If it is not working on the
first try, make sure to check for prompts in the Console.
2. Getting Started
2.1. Setting Your Working Directory
Having a well-organized file structure is critical to avoid issues
with coding, because you will frequently read in data files, and you
need to make sure that R knows where to look for those files. To
facilitate this process, we will provide you with all the necessary
files in a zipped folder (if you are working through this, you have
already found the first file). We recommend that you move that *.zip
file to the location where you want it (e.g., your folder for this
course) before unzipping.
The folder containing the files for a particular exercise is called a
“Working Directory”, and opening an *.Rmd file automatically sets the
working directory to the directory of that R Notebook file. So after
unzipping, it is important not to move any files out of the folder we
provide you with, unless you want to manually tell R where to look for
readable files. If so, you can use the setwd() command to
point R toward the location of your files (see
textbook for details).
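If R cannot find a file, it can help to check where it is currently looking. The following is a small sketch using base R functions; the object names current.dir and csv.files are made up for illustration, and "test_data.csv" is simply the file used later in this notebook.

```r
# Store and show the folder R currently treats as the working directory
current.dir <- getwd()
print(current.dir)

# List the *.csv files R can see in that folder
csv.files <- list.files(pattern = "\\.csv$")
print(csv.files)

# Check for one specific file; FALSE means R cannot find it in this folder
file.exists("test_data.csv")
```

If the last line returns FALSE, the file is probably in a different folder than the *.Rmd file.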
2.2. Loading Your Libraries
When you install R, your computer can understand and execute a number
of commands. This is what is known as “Base R”. The power of R, however,
is that you can expand the number of commands your computer understands
by installing and loading additional R packages (also called libraries).
There are R packages specialized for pretty much any area of biology,
providing the capability to analyze data from the level of genes and
genomes to ecosystem level processes. We will frequently use a package
called ggplot2, which allows for plotting data. Depending
on the module, you will need to install additional libraries. To
download and install new R packages, go to “Tools > Install
Packages…” and type in the name of the package you want to install.
Alternatively, you can use the install.packages() function.
For example, execute the following code chunk to install
ggplot2:
#Install ggplot2
install.packages("ggplot2")
Note that you only need to install each package once (unless you
reinstall R). I recommend deleting the code chunk above after you run it
successfully, or you can silence it by placing a hash sign at the beginning of the
install.packages("ggplot2") line. Failure to do so can cause
problems during the export (knitting) of your R Notebook as an *.html
file.
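One possible way to avoid reinstalling by accident is to check first whether a package is already present. This is only a sketch, not part of the course template; the object names needed, installed, and not.installed are made up, and the install line is deliberately left silenced.

```r
# Packages this notebook needs (extend the vector as required)
needed <- c("ggplot2")

# requireNamespace() returns TRUE if a package is installed and can be loaded
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
not.installed <- needed[!installed]

if (length(not.installed) > 0) {
  message("Not yet installed: ", paste(not.installed, collapse = ", "))
  # install.packages(not.installed)   # remove the leading hash sign, run once, then silence again
}
```

Because nothing is installed unless you remove the hash sign, a chunk like this should be safe to leave in place when knitting.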
To make use of installed packages, you also need to load the packages
every time you use R (i.e., every time you restart the
program). You can do this with the library() command, and
you will find a code snippet prompting you to load all needed libraries
at the beginning of each R Notebook (in a section that is typically
called dependencies). You can try it here by executing the code chunk
below to load ggplot2:
#Note that loading a library does not lead to an output
library(ggplot2)
2.3. Importing Data
One of the reasons we’re working through the coding basics here is of
course that you will work with actual data. To do that, you will need to
import data into R. With every exercise, we will provide you with one or
more data sets. These data sets will mostly come as *.csv files (which
stands for comma-separated values). They are essentially text files
containing data tables, and you can also open these files in Excel or
other programs. To import data, we will use the read.csv()
function. In the code chunk below, you can import a simple test data set
(“test_data.csv”) that includes the variables sex, length, and mass for
a population of an animal. Note that the fileEncoding
argument tells R how the text in the file is encoded. I generated the
input files on a Mac, and specifying the encoding prevents some import
issues for those of you who use a PC.
#The line of code simply prompts the computer to read the "test_data.csv" file and generate a data.frame called test.data
test.data <- read.csv("test_data.csv", fileEncoding = 'UTF-8-BOM')
If this worked correctly, you should now see this data set as
test.data in your global environment (top right panel). You
can double click it to view it. There should be three columns: sex,
length, and mass.
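Besides double clicking the data frame, you can also inspect it with code. The sketch below does not use the real test_data.csv; it writes a tiny, made-up stand-in file to a temporary location (the values and the name demo.data are invented) so that the example runs on its own.

```r
# Write a small, made-up stand-in for test_data.csv to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("sex,length,mass",
             "female,10.2,1.5",
             "male,11.0,1.8",
             "female,9.6,1.3"), tmp)

# Import it the same way as above
demo.data <- read.csv(tmp, fileEncoding = "UTF-8-BOM")

names(demo.data)   # column names
str(demo.data)     # type of each column and the first few values
head(demo.data)    # first rows of the table
```

Running names(), str(), or head() on test.data in the same way should let you confirm the three expected columns.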
3. Making Figures
A key learning objective of this course is that you learn to
visualize data in different ways to facilitate data interpretation in
the context of different evolutionary hypotheses. In the following
sections, I will explain step by step (that is, one line of code at a time)
how to make a simple graph with our sample data set. Let’s aim to make a
scatter plot showing the relationship between length and mass in our
species. The process is not much different than sketching a graph by
hand and layering different parts of the graph on top of each other,
just that you use words (code) to make the computer draw.
3.1. Define the Axes and Coordinate System
The first step of making any graph is to define the axes and
establish the coordinate grid that allows for the plotting of the data.
You can do this by calling the ggplot() function, within
which you first specify the data source (in our case the data
frame we just created, called test.data) and then the
so-called aesthetics, aes(), which contain information about which
variables define the x and y axes. In practice, this is accomplished
with the following line of code:
#This line of code calls the ggplot function (a plotting function) and makes a grid based on the test.data data frame, using length as the x axis and mass as the y axis
ggplot(test.data, aes(x=length, y=mass))
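The layering idea can also be sketched by storing the empty coordinate system in an object and adding to it later. The data frame demo.data and the object name base.plot below are made up for illustration and stand in for test.data.

```r
library(ggplot2)

# Made-up stand-in for test.data
demo.data <- data.frame(length = c(9.6, 10.2, 11.0, 12.1),
                        mass   = c(1.3, 1.5, 1.8, 2.2))

# Store the plot in an object instead of drawing it right away
base.plot <- ggplot(demo.data, aes(x = length, y = mass))

# Typing the object's name draws it: axes and grid only, no data yet
base.plot

# Layers can be added to the stored object with a plus sign
base.plot + geom_point()
```

Storing the base plot is optional; writing the whole plot as one chain of lines joined by plus signs, as in the rest of this notebook, gives the same result.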
3.2. Adding a Layer with Data Points
The second step is to draw the data into the established coordinate
system. To do so, you just need to tell the program what kind of graph
you want to draw. Different graph types in ggplot2 are
referred to as geoms (geometries), and a scatter plot is designated as
geom_point. You can just add that to your existing code
with a plus sign. For an overview of some of the graph types (geoms)
ggplot2 offers, check the appendix
of our textbook.
ggplot(test.data, aes(x=length, y=mass)) +
geom_point()
3.3. Adding a Trendline
Whenever we look at the relationship between two variables, we may
want to add a trendline. You can add a trendline by adding the
geom_smooth() function to your existing code, and
method="lm" designates that your trendline should be
linear. The se argument designates whether or not you want
to draw an error estimate around your trendline.
#The code within the brackets of the geom_smooth command specifies some additional options, namely that we want to draw a straight line (method="lm") and that we do not want to show the confidence interval (se=FALSE). Set se=TRUE and see what happens.
ggplot(test.data, aes(x=length, y=mass)) +
geom_point() +
geom_smooth(method="lm", se=FALSE)
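To see what the linear trendline represents numerically, you can fit the same kind of model yourself with the base R function lm(). The sketch below uses made-up data that fall exactly on a known line, so the fitted intercept and slope are easy to verify; the names demo.data and fit are invented for this example.

```r
# Made-up data that fall exactly on the line mass = 1 + 2 * length
demo.data <- data.frame(length = c(1, 2, 3, 4, 5))
demo.data$mass <- 1 + 2 * demo.data$length

# Fit a linear model of mass on length, the kind of line method="lm" draws
fit <- lm(mass ~ length, data = demo.data)

# Intercept and slope of the fitted line (here 1 and 2, up to rounding)
coef(fit)
```

With real data the points scatter around the line, and the slope tells you how much mass changes on average per unit of length.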
3.4. Changing the Axes Labels
The variable names in the data set do not always provide the clearest
description of what a variable means. We can modify the x and y axis
labels using the xlab() and ylab() functions,
respectively. The actual titles need to be written within quotation
marks:
#Simply add the new label text in quotation marks
ggplot(test.data, aes(x=length, y=mass)) +
geom_point() +
geom_smooth(method="lm", se=FALSE) +
xlab("Body length in cm") +
ylab("Body mass in kg")
3.5. Change the Theme
I honestly hate the default theme of ggplot2 with its
gray background. But you can quickly alter the look of the graph by
switching to one of a number of other themes. I personally like
theme_classic(), but you can customize the look of your
graph with the themes listed here.
ggplot(test.data, aes(x=length, y=mass)) +
geom_point() +
geom_smooth(method="lm", se=FALSE) +
xlab("Body length in cm") +
ylab("Body mass in kg") +
theme_classic()
Et voilà! You got yourself a perfectly good graph! As you practice
building graphs throughout the semester, make sure to check the
“Practical Skills” sections of individual chapters and refer to the appendix
of the book as needed.
To get additional advice on how to work with different color schemes
in ggplot(), including the use of colorblind-friendly
palettes, please check the corresponding
textbook section.
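As one possible illustration, which may differ from the approach the textbook recommends, the sketch below colors points by sex and applies the viridis palette that ships with ggplot2 and is designed to remain readable for viewers with common forms of color blindness. The data frame demo.data and the object name color.plot are made up.

```r
library(ggplot2)

# Made-up stand-in for test.data
demo.data <- data.frame(sex    = c("female", "male", "female", "male"),
                        length = c(9.6, 10.2, 11.0, 12.1),
                        mass   = c(1.3, 1.5, 1.8, 2.2))

# Mapping color to a variable inside aes() colors the points by group
color.plot <- ggplot(demo.data, aes(x = length, y = mass, color = sex)) +
  geom_point() +
  scale_color_viridis_d() +   # discrete viridis color scale
  theme_classic()

color.plot
```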
4. Your First Data Set: Darwin’s Finches
One of the most iconic study systems in evolutionary biology is
Darwin’s finches on the Galapagos Islands. Rosemary and Peter Grant
devoted much of their lives to the study of these birds, examining
how their traits change in response to major ecological perturbations.
To do so, they collected a massive, long-term data set on different
traits of the medium ground finch (Geospiza fortis) population
on Daphne Major Island. For this exercise, we will take a look at their
beak size data from 1972-1994.

4.1. Import data
The beak size data can be found in a file called “finches.csv”. The
file includes three variables: year, the average relative beak size
(rel.beak.size), and the standard error (st.err) that describes the
variability of beak size in any given year.
finch <- read.csv("finches.csv", fileEncoding = 'UTF-8-BOM')
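For orientation, a standard error of the mean is conventionally computed as the standard deviation divided by the square root of the sample size. The sketch below shows that calculation on made-up measurements; whether st.err in finches.csv was computed in exactly this way is an assumption, and the names beak, beak.mean, and beak.se are invented.

```r
# Made-up measurements for one hypothetical year
beak <- c(9, 10, 11)

beak.mean <- mean(beak)                      # average of the measurements
beak.se   <- sd(beak) / sqrt(length(beak))   # standard error of the mean

beak.mean
beak.se
```

In the finch file this work has already been done for you: each row holds one year's mean and its standard error.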
4.2. Plotting the Data
The following code chunk provides the base code to make a scatter
plot as above. You will only have to specify the x and y variables and
label the axes correctly.
ggplot(finch, aes(x=year, y=rel.beak.size)) +
geom_point() +
xlab("year") +
ylab("beak size") +
theme_classic()
4.3. Adding Additional Graphical Elements
There are two graphical elements that we can add to facilitate the
interpretation of the data:
- Since this is a time series, it makes sense to connect the dots
representing the means from year to year. You can do this by simply
adding another geom: geom_line().
- We want to know how much the average beak size changes relative to
the variability in the population. If variability is high, year-to-year
variation in the mean may be negligible. But if variability is low, changes
across years may actually be substantial. You can show this by adding
another geom: geom_errorbar(). Make sure to specify the x
and y axis variables as above.
ggplot(finch, aes(x=year, y=rel.beak.size)) +
geom_point() +
geom_line() +
geom_errorbar(aes(ymin=rel.beak.size-st.err, ymax=rel.beak.size+st.err)) +
xlab("year") +
ylab("beak size in cm") +
theme_classic()
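The ymin and ymax arguments above are plain arithmetic on two columns: the mean minus and plus one standard error. A minimal sketch with made-up numbers (the data frame demo.finch and its values are invented and are not taken from the finch data) shows what geom_errorbar() receives:

```r
# Made-up stand-in with the same column names as the finch data
demo.finch <- data.frame(year          = c(2001, 2002, 2003),
                         rel.beak.size = c(0.10, 0.12, 0.50),
                         st.err        = c(0.05, 0.04, 0.06))

# Lower and upper ends of each error bar
demo.finch$ymin <- demo.finch$rel.beak.size - demo.finch$st.err
demo.finch$ymax <- demo.finch$rel.beak.size + demo.finch$st.err

demo.finch
```

Each error bar therefore spans two standard errors in total, centered on the yearly mean.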
4.4. Interpretation
4.4.1. General patterns
Based on the graphs you just made, what do you observe? How would you
interpret the data if I told you that 1977 was a massive drought
year?
Based on the graph, the birds' beak sizes vary heavily from year to
year, which is most likely caused by environmental change. In the year
1977 there is a noticeable increase in the size of the finches' beaks.
This is most likely due to the need for long beaks for
digging or rooting for food during the drought period. In other words,
the birds with longer beaks were the only ones surviving and
reproducing.
4.4.2. Evolution… or Not?
Do you think these data reflect evolutionary change through time?
What is a potential alternative explanation? What additional information
would you need to either accept or reject the hypothesis that these
patterns reflect evolutionary change?
The graph does in fact reflect evolutionary change over a
period of time, as it shows major patterns of different traits presented
between the generations of the finches. Useful additional information
would be weather reports and other resources on the types of
environments influencing the changes in beak size between the finches.
5. Resources
5.1. Data References
Data on beak size variation in Darwin’s finches came from the
following publication:
5.2. Resources You Consulted
Consulting additional resources to solve this assignment is
absolutely allowed, but failure to disclose those resources is
plagiarism. Please list any collaborators you worked with and resources
you used below or state that you have not used any.
No outside resources
