Grace Strobbe


1. The General Structure of an R Notebook

R Notebook files (*.Rmd) allow you to combine text elements with snippets of code and the outputs generated from said code. There are three main parts to an R Notebook file, which are introduced in this section. In addition, each R Notebook automatically generates an *.html file that provides you with the formatted document including all components, which you will ultimately submit for your homework.


1.1. The Header

The header, which you can see at the beginning of this document, is delineated with three dashes (---) at the beginning and the end. It includes some code that is important for the formatting of output files, so I would recommend not altering that section. In general, there should be no reason for you to change the header for any exercises in this course. However, if you would like to learn more about the different header options, you can find a good tutorial here.


1.2. Code Chunks

R code chunks are delineated with three backticks (```) at the beginning and the end, and {r} after the first set of backticks lets your computer know that you will be using the R programming language. You can always add a code chunk by clicking “Insert > Code Chunk > R” above or by clicking the “+C” icon, although we have usually already created all the chunks you will need in the template. Any text within a code chunk, if written correctly, represents executable code, which the computer can interpret as a command to perform certain tasks. You can make your computer execute the code in a chunk by pressing the small, green play arrow in the top right corner of each chunk, or you can just highlight the code and press command+enter (control+enter on PC). When you execute the code, the output will automatically appear below the chunk. Sometimes you will find us using hash tags (#) within code chunks. Hash tags “silence” the text that follows on the same line, such that the computer skips over that text when executing the code. That is useful for code annotation, and you will frequently see us using hash tags to add further descriptions or explanations within code chunks.
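
As a small illustration of how hash tags behave, the sketch below mixes commented lines with executable code. The variable names (n_birds, n_doubled) are made up for this example and do not appear in the course templates.

```r
# This whole line is a comment, so R skips it when the code runs
n_birds <- 12            # everything after the hash tag on this line is skipped too
n_doubled <- n_birds * 2

# print(n_birds)         # a "silenced" command: it only runs if the leading # is removed
print(n_doubled)
```

Removing the hash tag in front of print(n_birds) would turn that line back into executable code.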

Pro tip: If you want to execute all code chunks in a document automatically, you can click “Run > Run All” in the RStudio menu.


1.3. Text

The text in between code snippets is just that: text. We will use these sections to provide you with background information and discussion prompts, and you will use these sections to respond to questions and offer your interpretations of data. Sections where you need to write something are always highlighted in italics. You can use a variety of prompts to format your text if you are working with basic Markdown (see here for a cheat sheet). Most of you, however, will prefer the text editor that is implemented in R Studio to format text with the click of a button.

Pro tip: You can toggle back and forth between source code (with Markdown formatting) and the WYSIWYG editor (with text formatting through clicking) by using the Source/Visual buttons in the RStudio menu.


1.4. HTML Preview and Output

As already mentioned, your R Notebook (including text, code chunks, and the outputs from your code) can be automatically knitted into an *.html file. You can click “Preview > Preview Notebook” or “Preview > Knit to HTML” to see the live HTML version as you are working on your R Notebook (just make sure to save to update), and you can find the shareable *.html file in the same folder as your *.Rmd file (same file name, with .nb added before the .html extension).

Note: Sometimes R will prompt you to update some packages in the Console before you can knit the HTML file. If it is not working on the first try, make sure to check for prompts in the Console.


2. Getting Started


2.1. Setting Your Working Directory

Having a well-organized file structure is critical to avoid issues with coding, because you will frequently read in data files, and you need to make sure that R knows where to look for those files. To facilitate this process, we will provide you with all the necessary files in a zipped folder (if you are working through this, you have already found the first file). We recommend that you move that *.zip file to the location where you want it (e.g., your folder for this course) before unzipping.

The folder containing the files for a particular exercise is called a “Working Directory”, and opening an *.Rmd file automatically sets the working directory to the directory of that R Notebook file. So after unzipping, it is important not to move any files out of the folder we provide you with, unless you want to manually tell R where to look for readable files. If so, you can use the setwd() command to point R toward the location of your files (see textbook for details).
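
If you ever need to check or change the working directory by hand, a minimal sketch could look like the one below. It uses tempdir() purely as a stand-in for your exercise folder, and the object names old_dir and exercise_dir are invented for this example; you would substitute the actual path to your unzipped folder.

```r
# Remember where R is currently looking for files
old_dir <- getwd()
print(old_dir)

# tempdir() is only a placeholder; replace it with the path to your exercise folder
exercise_dir <- tempdir()
setwd(exercise_dir)

# List the files R can now find without a full path
print(list.files())

# Switch back so the rest of the session is unaffected
setwd(old_dir)
```

Calling getwd() first is a quick way to confirm whether a "file not found" error is simply caused by R looking in the wrong folder.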


2.2. Loading Your Libraries

When you install R, your computer can understand and execute a number of commands. This is what is known as “Base R”. The power of R, however, is that you can expand the number of commands your computer understands by installing and loading additional R packages (also called libraries). There are R packages specialized for pretty much any area of biology, providing the capability to analyze data from the level of genes and genomes to ecosystem-level processes. We will frequently use a package called ggplot2, which allows for plotting data. Depending on the module, you will need to install additional libraries. To download and install new R packages, go to “Tools > Install Packages…” and type in the name of the package you want to install. Alternatively, you can use the install.packages() function. For example, execute the following code chunk to install ggplot2:

#Install ggplot2
install.packages("ggplot2")

Note that you only need to install each package once (unless you reinstall R). I recommend deleting the code chunk above after you run it successfully, or you can silence it by placing a hash tag at the beginning of the install.packages("ggplot2") line. Failure to do so can cause problems during the export (knitting) of your R Notebook as an *.html file.
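
One possible alternative to deleting or silencing the install line is to make the installation conditional, so that it only runs when the package is missing. This is a sketch rather than part of the course template; the object name ggplot2_installed is made up for illustration.

```r
# Check whether ggplot2 is already installed, without loading it
ggplot2_installed <- requireNamespace("ggplot2", quietly = TRUE)

# Only download and install the package when it is missing
if (!ggplot2_installed) {
  install.packages("ggplot2")
}
```

With this pattern the chunk can stay in the notebook, because it does nothing once the package is present.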

To make use of installed packages, you also need to load the packages every time you use R (i.e., every time you restart the program). You can do this with the library() command, and you will find a code snippet prompting you to load all needed libraries at the beginning of each R Notebook (in a section that is typically called dependencies). You can try it here by executing the code chunk below to load ggplot2:

#Note that loading a library does not lead to an output
library(ggplot2)

2.3. Importing Data

One of the reasons we’re working through the coding basics here is of course that you will work with actual data. To do that, you will need to import data into R. With every exercise, we will provide you with one or more data sets. These data sets will mostly come as *.csv files (which stands for comma-separated values). They are essentially text files containing data tables, and you can also open these files in Excel or other programs. To import data, we will use the read.csv() function. In the code chunk below, you can import a simple test data set (“test_data.csv”) that includes the variables sex, length, and mass for a population of an animal. Note that the fileEncoding argument simply indicates that I generated the input files on a Mac, which will prevent some import issues for those of you that use a PC.

#The line of code simply prompts the computer to read the "test_data.csv" file and generate a data.frame called test.data
test.data <- read.csv("test_data.csv", fileEncoding = 'UTF-8-BOM')

If this worked correctly, you should now see this data set as test.data in your global environment (top right panel). You can double click it to view it. There should be three columns: sex, length, and mass.
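
Besides double clicking the data set, you can also inspect an imported data frame with a few base R commands. The sketch below is self-contained: it writes a tiny stand-in file with invented values (not the contents of the real test_data.csv) to a temporary location and reads it back, so the object names example.data, tmp_file, and check.data are assumptions made for this example.

```r
# Invented stand-in values; the real test_data.csv contains different data
example.data <- data.frame(
  sex    = c("F", "M", "F"),
  length = c(10.2, 12.5, 11.1),
  mass   = c(1.1, 1.6, 1.3)
)

# Write the stand-in table to a temporary *.csv file and import it again
tmp_file <- tempfile(fileext = ".csv")
write.csv(example.data, tmp_file, row.names = FALSE)
check.data <- read.csv(tmp_file, fileEncoding = "UTF-8-BOM")

# Three quick ways to confirm the import worked
print(names(check.data))   # column names
str(check.data)            # column types and number of rows
print(head(check.data))    # first rows of the table
```

Running the same three commands on test.data should list the columns sex, length, and mass.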


3. Making Figures

A key learning objective of this course is that you learn to visualize data in different ways to facilitate data interpretation in the context of different evolutionary hypotheses. In the following sections, I will explain step by step (that is, code line by code line) how to make a simple graph with our sample data set. Let’s aim to make a scatter plot showing the relationship between length and mass in our species. The process is not much different from sketching a graph by hand and layering different parts of the graph on top of each other, except that you use words (code) to make the computer draw.


3.1. Define the Axes and Coordinate System

The first step of making any graph is to define the axes and establish the coordinate grid that allows for the plotting of the data. You can do this by calling the ggplot() function within which you first need to specify the data source (in our case the data frame we just created, called test.data) and then the so called aesthetics—aes()—that contain information about what variables define the x and y axes. In practice, this is accomplished with the following line of code:

#This line of code calls the ggplot function (a plotting function) and makes a grid based on the test.data data frame, using length as the x axis and mass as the y axis
ggplot(test.data, aes(x=length, y=mass))

3.2. Adding a Layer with Data Points

The second step is to draw the data into the established coordinate system. To do so, you just need to tell the program what kind of graph you want to draw. Different graph types in ggplot2 are referred to as geoms (geometries), and a scatter plot is designated as geom_point. You can just add that to your existing code with a plus sign. For an overview of some of the graph types (geoms) ggplot2 offers, check the appendix of our textbook.

ggplot(test.data, aes(x=length, y=mass)) +
  geom_point()
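
The layering idea can also be made explicit by storing the plot in an object and adding to it. The following is a minimal sketch under assumptions: demo.data is an invented stand-in for test.data, and base_plot and point_plot are names chosen for this example.

```r
library(ggplot2)

# Invented stand-in for test.data; the values are for illustration only
demo.data <- data.frame(
  length = c(10.2, 11.1, 12.5, 13.0, 14.4),
  mass   = c(1.1, 1.3, 1.6, 1.7, 2.0)
)

# Step 1: the empty coordinate system
base_plot <- ggplot(demo.data, aes(x = length, y = mass))

# Step 2: the same plot with a layer of points added via the plus sign
point_plot <- base_plot + geom_point()

# Typing the object name draws the plot
base_plot
point_plot
```

Comparing the two outputs shows that the plus sign adds a layer on top of the existing grid rather than starting a new graph.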

3.3. Adding a Trendline

Whenever we look at the relationship between two variables, we may want to add a trendline. You can add a trendline by adding the geom_smooth() function to your existing code, and method="lm" designates that your trendline should be linear. The se argument designates whether or not you want to draw an error estimate around your trendline.

#The code within the brackets of the geom_smooth command specifies some additional options, namely that we want to draw a straight line (method="lm") and that we do not want to show the confidence interval (se=FALSE). Set se=TRUE and see what happens.
ggplot(test.data, aes(x=length, y=mass)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE)
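
To see what a linear trendline represents, it can help to fit the same kind of model directly. With its default settings, method="lm" fits a straight line of y on x, which corresponds to the base R call lm(mass ~ length). The sketch below assumes an invented data frame (demo.data) in place of test.data, and the object name fit is made up for this example.

```r
# Invented stand-in for test.data; the values are for illustration only
demo.data <- data.frame(
  length = c(10.2, 11.1, 12.5, 13.0, 14.4),
  mass   = c(1.1, 1.3, 1.6, 1.7, 2.0)
)

# Fit a straight line: mass as a function of length
fit <- lm(mass ~ length, data = demo.data)

# The two coefficients are the intercept and the slope of the trendline
print(coef(fit))
```

The slope tells you how much mass changes, on average, for each one-unit increase in length.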


3.4. Changing the Axes Labels

The variable names in the data set do not always provide the clearest description of what a variable means. We can modify the x and y axis labels using the xlab() and ylab() functions, respectively. The actual titles need to be written within quotation marks:

#Simply add the new label text in quotation marks
ggplot(test.data, aes(x=length, y=mass)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE) +
  xlab("Body length in cm") +
  ylab("Body mass in kg")

3.5. Change the Theme

I honestly hate the default theme of ggplot with its gray background. But you can quickly alter the look of the graph by switching to a number of other possible themes. I personally like the theme_classic(), but you can customize the look of your graph with themes listed here.

ggplot(test.data, aes(x=length, y=mass)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE) +
  xlab("Body length in cm") +
  ylab("Body mass in kg") +
  theme_classic()

Et voilà! You got yourself a perfectly good graph! As you practice building graphs throughout the semester, make sure to check the “Practical Skills” sections of individual chapters and refer to the appendix of the book as needed.

To get additional advice on how to work with different color schemes in ggplot2, including the use of colorblind-friendly palettes, please check the corresponding textbook section.
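
As one hedged example of a colorblind-friendly option, ggplot2 ships with the viridis color scales. The sketch below colors points by sex using scale_color_viridis_d(); the textbook may recommend a different palette, and demo.data and color_plot are invented for this example.

```r
library(ggplot2)

# Invented stand-in for test.data; the values are for illustration only
demo.data <- data.frame(
  sex    = c("F", "M", "F", "M", "F"),
  length = c(10.2, 11.1, 12.5, 13.0, 14.4),
  mass   = c(1.1, 1.3, 1.6, 1.7, 2.0)
)

# Map sex to point color and use the discrete viridis palette
color_plot <- ggplot(demo.data, aes(x = length, y = mass, color = sex)) +
  geom_point() +
  scale_color_viridis_d() +
  theme_classic()

color_plot
```

Because color is set inside aes(), ggplot2 adds a legend for sex automatically.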


4. Your First Data Set: Darwin’s Finches

One of the most iconic study systems in evolutionary biology is Darwin’s finches on the Galapagos Islands. Rosemary and Peter Grant devoted much of their lives to the study of these birds, examining how their traits change in response to major ecological perturbations. To do so, they collected a massive, long-term data set on different traits of the medium ground finch (Geospiza fortis) population on Daphne Major Island. For this exercise, we will take a look at their beak size data from 1972 to 1994.


4.1. Import data

The beak size data can be found in a file called “finches.csv”. The file includes three variables: year, the average relative beak size (rel.beak.size), and the standard error (st.err) that describes the variability of beak size in any given year.

finch <- read.csv("finches.csv", fileEncoding = 'UTF-8-BOM')

4.2. Plotting the Data

The following code chunk provides the base code to make a scatter plot as above. You will only have to specify the x and y variables and label the axes correctly.

ggplot(finch, aes(x=year, y=rel.beak.size)) +
  geom_point() +
  xlab("year") +
  ylab("beak size") +
  theme_classic()

4.3. Adding Additional Graphical Elements

There are two graphical elements that we can add to facilitate the interpretation of the data:

  1. Since this is a time series, it makes sense to connect the dots representing the means from year to year. You can do this by simply adding another geom: geom_line().
  2. We want to know how much the average beak size changes relative to the variability in the population. If variability is high, year-to-year variation may be negligible. But if variability is low, changes across years may actually be substantial. You can show the variability by adding another geom: geom_errorbar(). Make sure to specify the x and y axis variables as above.

ggplot(finch, aes(x=year, y=rel.beak.size)) +
  geom_point() +
  geom_line() +
  geom_errorbar(aes(ymin=rel.beak.size-st.err, ymax=rel.beak.size+st.err))  +
  xlab("year") +
  ylab("beak size in cm") +
  theme_classic()
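
The limits of each error bar are simply the mean minus and plus the standard error. The sketch below makes that arithmetic explicit by computing the limits as separate columns first. It uses a small invented data frame (demo.finch, with made-up years and values that are not the Grants’ data), and the column names lower and upper as well as the width setting are choices made for this example.

```r
library(ggplot2)

# Invented values for illustration only; this is NOT the finches.csv data
demo.finch <- data.frame(
  year          = c(2001, 2002, 2003, 2004),
  rel.beak.size = c(0.20, 0.25, 0.40, 0.35),
  st.err        = c(0.05, 0.04, 0.06, 0.05)
)

# Lower and upper ends of each error bar: mean minus/plus one standard error
demo.finch$lower <- demo.finch$rel.beak.size - demo.finch$st.err
demo.finch$upper <- demo.finch$rel.beak.size + demo.finch$st.err

# width controls how wide the horizontal caps of the error bars are drawn
error_plot <- ggplot(demo.finch, aes(x = year, y = rel.beak.size)) +
  geom_point() +
  geom_line() +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.2) +
  theme_classic()

error_plot
```

If the error bars of two consecutive years overlap heavily, the difference between those yearly means is small relative to the variability in the population.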

4.4. Interpretation


4.4.1. General patterns

Based on the graphs you just made, what do you observe? How do you interpret the data if I told you that 1977 was a massive drought year?

Based on the graph, the birds’ beak sizes vary heavily from year to year, which is most likely caused by environmental change. In the year 1977, there is a noticeable increase in the size of the finches’ beaks. This is most likely due to the need for long beaks for digging or rooting for food during the drought period. In other words, the birds with longer beaks were the only ones surviving and reproducing.


4.4.2. Evolution… or Not?

Do you think these data reflect evolutionary change through time? What is a potential alternative explanation? What additional information would you need to either accept or reject the hypothesis that these patterns reflect evolutionary change?

The graph does in fact reflect evolutionary change over a period of time, as it shows major differences in traits between generations of the finches. Useful additional information would be weather reports and other resources on the types of environments influencing the changes in beak size among the finches.


5. Resources


5.1. Data References

Data on beak size variation in Darwin’s finches came from the following publication:


5.2. Resources You Consulted

Consulting additional resources to solve this assignment is absolutely allowed, but failure to disclose those resources is plagiarism. Please list any collaborators you worked with and resources you used below or state that you have not used any.

No outside resources

