EEB319: Intro to R

What is R?

R is a computer programming language and environment for statistical computing and graphics. R is open-source, which means you can use it for free, and is maintained and updated by its users, who can write and upload “packages”, which are units of sharable code that extend R’s basic toolkit. As of June 2019, CRAN (Comprehensive R Archive Network), hosts over 14,000 packages, which means that if you’re working on a problem, another R user likely has a solution! For example, you can use a package called “lme4” to fit linear or generalized linear mixed-effects models, “popbio” to construct and analyse projection matrix models, “ggmap” to visualize spatial data on maps, and “wesanderson” to make your plots look like they were taken from the set of Moonrise Kingdom!

Resources:
R Project
Big Book of R

Downloading R and R Studio

DO THIS BEFORE LAB 1 STARTS, PLEASE!
Download and install R onto your computer here. After downloading/installing R, you’ll want to download/install RStudio, which is just R with a graphical user interface (GUI). Try opening R Studio to make sure it works. You should see four panes: the top-left pane is your editor, where you can write, run, and save scripts (Fig. 1). The bottom-left pane is your console/terminal, where you can run code without saving it in a script. The top-right pane is your environment, where you can find any objects you’ve defined. The bottom-right pane is where you can navigate between files in your working directory, view plots you’ve generated, or read package documentation. Don’t panic! We’ll try to go through all of this in detail during lab.

Figure 1: Here’s an example of my RStudio window, as I write this lab!

You might notice that my RStudio window looks different than yours; my panes are arranged differently, and my editor theme sports a cooler colour combination. You can change your pane layout and editor theme by going to the top-left corner of your screen, and clicking RStudio -> Preferences… -> Appearance/Pane Layout.

R Markdown

This document, along with most of the labs in this course, and your assignment submissions, will be written using R Markdown. R Markdown is a file format for making dynamic documents (documents that can be continually edited and updated) with R. R Markdown documents are written in markdown (an easy-to-write plain text format, similar to HTML and LaTex (pronounced “law-tech”)) and contain chunks of embedded R code (Fig. 2). More generally, writing in markdown allows you to create documents that smoothly integrate chunks of text, mathametical equations, hyperlinks, chunks of code, and so on. You can open and edit R Markdown files (.Rmd) in R Studio. To learn more about R Markdown, go here, or here.

Figure 2: An example of how we integrate chunks of text (dark blue) with chunks of code (light blue).

AN IMPORTANT NOTE: When we write Markdown documents, we can export .Rmd files to PDF, Word, or HTML via a process called “knitting”. You should be able to knit this file (“EEB319IntroR.Rmd”) to HTML by clicking “knit” at the top left of this page. Once R has finished knitting this file to HTML, a preview window will pop up. You can view the knitted document in HTML by clicking “Open in Browser”. Opening this document in a browser will also show you how the formatting of a .Rmd file maps onto the knitted version of the document.

ANOTHER IMPORTANT NOTE: As previously mentioned, Markdown documents are neat because they can integrate chunks of text with chunks of code. To learn how we integrate a chunk of code, check of the example we use to set a working directory, below. Take a look at the “eval” argument. If we set eval = TRUE, R runs that chunk of code when knitting a document. If we set eval = FALSE, R will include the chunk in the knitted document, but won’t run the code. When you submit assignments, you should make sure your code chunks are path independent (can be run on computers other than your own (you shouldn’t have to worry about this if you’ve set your working directory correctly, which we’ll go over in just a minute)), and set eval = TRUE, so that we can run your script quickly. You can check if your script runs smoothly from top to bottom by clicking the drop-down menu,“Run”, in the top-right of the editor window (top-left window in default layout), and finding “Run All”. In the “setwd()” example below, we’ve set eval = FALSE because we have different computers, and as such, the line we’ve written won’t run on your computer. In upcoming labs, you’ll notice that we’ve set eval = FALSE in some of our examples, and have included an empty code chunk, where we’re set eval = TRUE, for you to write your own code below.

Making Your Own Markdown Document

You can make your own Markdown document by clicking File > New File > R Markdown. After you’ve opened that blank Markdown document, you can start typing, using this document, and the following resources as a guide.

Don’t forget to fill in your name and student number at the top of the document, next to “author:”!

Let’s Get Started!

Setting Your Working Directory:

We use working directories to tell R where it should look for and store files on our computers. We can start by creating a folder called “R_Projects”, and a subfolder called “EEB319” on our desktops. If you right-click on that subfolder, you should be able to see its pathname (e.g. /Users/mjarviscross/Desktop/R_Projects/EEB319). As a side-note, your working directory can be thought of similarly to how you’d think about any folder. For example, let’s download the course files from Quercus into our “EEB319” folders, and create subfolders for different labs (/…EEB319/Lab1/…, etc.).

There are two ways to set your working directory:

setwd():

Use setwd() to set your working directory by including the pathname of your EEB319 folder:

setwd("/Users/mjarviscross/Desktop/R_Projects/EEB319")

This is the easier option, but not the best…

Start a Project:

If you set your working directory using setwd(), you’ll be able to save all your files in the same place, but won’t be able to save any of the objects or outputs you create. If you set your working directory by creating an R Project, you’ll be able to save all your files in the same place and save all your objects and outputs. This will save you a lot of time in the long-run, and make your work more reproducible.

Go to the top-right corner of your R Studio window and click “Project: (None)”. Then, click the first option in the drop-down menu, “New Project”. A new window will pop-up, and give you three options: “New Directory”, “Existing Directory”, and “Version Control”. Click “Existing Directory”, and use the browse button to navigate to your “EEB319” folder. Finally, click “Create Project” in the bottom-right of the pop-up window, and you’re ready to go! As a side-note, you should be able to see all the files and subfolders in your working directory in the “Files” tab of your “Files, Plots, Packages, Help, and Viewer” window of R Studio (bottom-right window in default layout).

NOTE: You can check if you’ve set your working directory correctly by typing “getwd()” in the console and clicking “return”.

An Example for Today’s Lab: Within-Host Disease Dynamics

To get more comfortable with R, let’s go through an example! Very broadly, population ecologists are interested in asking how (a)biotic factors affect populations, and how populations will change over time. For example, we often use population ecology to think about how a predator population interacts with its prey population. Similarly, over the past few decades, disease ecologists have co-opted consumer-resource frameworks to describe within-host disease dynamics, given that a host’s immune system can seek out and destroy infecting parasites as a predator would its prey (see: Anderson and May, 1979).

In their 1994 paper, “Within-host population dynamics and the evolution and maintenance of microparasite virulence”, Antia et al. presented a mathematical model describing the within-host population dynamics of an invading microparasite and its host’s immune response.

\[\begin{align} &\frac{dP}{dt} = rP - kPI \text{, if P < D and P ➝ 0 if P > D}&\\ &\frac{dI}{dt} = ρI\frac{P}{P + ϕ}&\\ \end{align}\]

Where:

Table 1
Parameters	Values
r, Intrinsic growth rate of parasite	0.1-2.0
k, Rate of destruction of parasite by host immune system	0.001
ρ, Maximum growth rate of the immune system	1
𝜙, Parasite density at which growth rate of host immune system is half its maximum	1000
D, Lethal within-host parasite density	10⁹

This is an ODE (ordinary differential equation) model, and is comprised of a series of differential equations. The equations in this model represent the parasite (P) and host immune cell (I) populations through time. The first term in the first equation denotes the intrinsic growth of the parasite population, while the second term denotes the destruction of the parasite by the host’s immune system. This model assumes that the rate of proliferation of the immune cell population is proportional to the size of the parasite population when the parasite population is small, and saturates when the parasite population grows sufficiently large. Try thinking about how the second equation denotes this relationship mathematically.

Question: Physiologically, why might the authors make this assumption?

For today’s lab, we’ll be simulating a version of this model forward in time. Antia et al.’s original model is written in continuous time, but we’ll be using a version of this model written in discrete time.

\[\begin{align} &P_{t+1} = (1 + r⋅ts - (1-e^{-k⋅I_t⋅ts}))P_t&\\ &I_{t+1} = (1 + ρ\frac{P}{P + ϕ}ts)I_t&\\ \end{align}\]

NOTE: We’ve added a parameter, ts, to the discretized version of the model. ts is our time step, or, step size, and denotes how often we evaluate our model. Today, we’ll be using a step size of 0.1.

Question: What differentiates continuous and discrete time models? Why might someone choose to use a continuous time model over a discrete time model, or vice versa? Finally, name and explain one advantage and one disadvantage to using a smaller step size.

Now, to simulate this model forward in time, open the Lab 1 Excel sheet; come back here when you’ve finished simulating your data!

NOTE: As referenced in Table 1, values for the parameter “r” can range between 0.1 and 2.0. Play around with a few different values to see how your time series changes!

…

Welcome back! The Excel workbook asked you to save your simulations as .csv files. Let’s bring those data into R!

Importing Data

Remember to make sure your .csv files are stored in your working directory (above). You can copy the pathname of a given file by right-clicking a file, and holding “option”/“shift” (Mac/PC) and clicking “copy”_________" as Pathname". You can import your data using read.csv():

Make sure to name your data sets so you can call on them later. We’ve included an example wherein we name our first data set “InfectData_r03” to specify (1) that our data chronicle an infection, and (2) that in this hypothetical simulation, we assigned “r” a value of 0.3. We use “<-” (“option” + “-”) as an assignment operator, which assigns a value to an object. Here, the value is our data set (.csv file), and the object is “InfecData_r03”.

# NOTE 1: We can put a "#" at the beginning of a line of code if we don't want R to run that line. These lines are called "comments" and can be used to leave yourself notes explaining what your code does, why you chose to do something a certain way, etc.! It's good practice to provide clear comments so anybody (e.g. your TA) can pick up your code and understand what you're trying to do.

# NOTE 2: To run a line or chunk of code in your editor, use "command + Enter" on Mac, and "Ctrl + Enter" on Windows. If you're working in your console (bottom-left window in default layout), you can just use "Enter".

InfecData_r03 <- read.csv("ModelSimulation_r03.csv") # NOTE: that you need to include the entire pathname for your .csv within the quotation marks.

Now, you try! Remember, you should be importing three .csv files!

# YOUR WORK HERE!

In R, we can store different variables and objects in different ways. It’s important to know the “class” of a variable or object because depending on its class, we’ll have to treat the variable/object differently. We can see what kind of object our data are stored in using class():

class(InfecData_r03)

## [1] "data.frame"

Now, you try!

# YOUR WORK HERE!

Our data are stored in a data.frame! A data frame is the most common way of storing data, and is the data structure we most often use when analysing or visualizing data. Technically, data frames are just lists of equal-length vectors, making them easy to manipulate and play around in.

Exploring Data Using Functions

In R, functions are commands you use to tell R to perform some task. The good news is, you’ve already used a couple functions (“read.csv()”, “class()”). To use functions, you type out the function’s name and open brackets, “()”. Inside your brackets, you’ll include your arguments, which tell R where and how you’re using a function.

Let’s take a look at our data! You can view your data sets in their entirety by using the function, View():

View(InfecData_r03)

Now, you try!

# YOUR WORK HERE!

Do your data sets look as they should? You should have 501 rows and 3 columns (time, parasite abundance, and host immune cell abundace) per simulation. Next, let’s use some more functions to explore our data!

# If you're not sure what a function does or how to make it work, you can ask R by using "?":
?head # The help tab in your bottom-right window will tell you what "head()" does and how to use it!

# We can see the first/last few lines of our data set using head() and tail():
head(InfecData_r03)

##   time parasites immune_cells
## 1  0.0  1.000000     1.000000
## 2  0.1  1.029900     1.000100
## 3  0.2  1.060694     1.000203
## 4  0.3  1.092409     1.000309
## 5  0.4  1.125072     1.000418
## 6  0.5  1.158711     1.000530

tail(InfecData_r03)

##     time   parasites immune_cells
## 496 49.5 2.33597e-14     2522.598
## 497 49.6 1.88523e-14     2522.598
## 498 49.7 1.52146e-14     2522.598
## 499 49.8 1.22788e-14     2522.598
## 500 49.9 9.90954e-15     2522.598
## 501 50.0 7.99742e-15     2522.598

# And we can specify how many lines we'd like to see by adding an argument for "n":
head(InfecData_r03, n = 3)

##   time parasites immune_cells
## 1  0.0  1.000000     1.000000
## 2  0.1  1.029900     1.000100
## 3  0.2  1.060694     1.000203

# We can summarize our data using summary():
summary(InfecData_r03)

##       time        parasites         immune_cells     
##  Min.   : 0.0   Min.   :   0.000   Min.   :   1.000  
##  1st Qu.:12.5   1st Qu.:   0.003   1st Qu.:   1.135  
##  Median :25.0   Median :  12.599   Median :  22.428  
##  Mean   :25.0   Mean   : 428.546   Mean   : 979.919  
##  3rd Qu.:37.5   3rd Qu.: 311.399   3rd Qu.:2522.593  
##  Max.   :50.0   Max.   :3264.206   Max.   :2522.598

# We can apply these same functions to a subset of our data set using "$" to specify the column we're interested in:
max(InfecData_r03$parasites) # Shows us the maximum abundance of parasites present in the host at a given time

## [1] 3264.206

which(InfecData_r03$parasites == max(InfecData_r03$parasites)) # Reads: Where in the "Parasites" column of the the data set (InfecData_r03$Parasites) is the data point equal to (==) the maximum abundance of parasites (max(InfecData_r03$Parasites)); shows us when the parasite population peaks.

## [1] 290

Now, you try!

# YOUR WORK HERE!

The functions used above are often termed “base functions” because they come pre-installed for our use. To access more expansive and specific sets of functions, we can install what’s called a “package” A package is a set of pre-written functions designed to help us accomplish a task, e.g. data cleaning, solving mathematical systems, etc.

# To use a function from a package, we first have to install and load the package:
# install.packages("cowsay") # Installing
library(cowsay) # Loading
?cowsay # Exploring the package
?say # Exploring the function
say("This is one of R's many user-generated useless-but-fun packages!") # Using the package's function

## 
##  -------------- 
## This is one of R's many user-generated useless-but-fun packages! 
##  --------------
##     \
##       \
##         \
##             |\___/|
##           ==) ^Y^ (==
##             \  ^  /
##              )=*=(
##             /     \
##             |     |
##            /| | | |\
##            \| | |_|/\
##       jgs  //_// ___/
##                \_)
##

say("Maddie's writing this in October, so I'm a ghost!", by = "ghost") # Adding an argument to change how we're using the function

## 
##  ----- 
## Maddie's writing this in October, so I'm a ghost! 
##  ------ 
##     \   
##      \
##      .-.
##     (o o)
##     | O \
##      \   \
##       `~~~' [nosig]
##

Plotting Data

We can plot our data using the function, plot(). We need to specify the data we want to use for the x-axis first, followed by the data we want to use for the y-axis. Here, we want “time” on the x-axis, and the abundances of parasites and immunce cells on the y-axis. In the example below, the first line plots the population abundances of the parasite through time, and the second line adds the population abundances of the immune cells through time to the plot. The third line adds a legend!

We can use a few additional arguments to make aesthetic changes to our plot:

type changes the line type
col changes the colours of our data
lwd changes the line width
xlab and ylab change the names of our axes

plot(InfecData_r03$time, InfecData_r03$parasites, type = "l", col = "goldenrod1", lwd = 4, xlab="Time", ylab="Abundance", main = "Within-Host Population Dynamics, r = 0.3", axes = F)
axis(side = 1, at = seq(0,50,10)); axis(side = 2, at = seq(0,max(InfecData_r03$parasites), 500))
lines(InfecData_r03$time, InfecData_r03$immune_cells, col = "cornflowerblue", lwd = 4)
legend("topleft", legend = c("Parasites", "Immune Cells"), col = c("goldenrod1", "cornflowerblue"), lty = 1, lwd  =4, bty = "n")

Now, try to plot all three of your simulations like the example above, with some aesthetic changes! To start, try changing: (1) the title of your plot, (2) the colour of your data, and (3) the spacing of your axis values. Finally, if you’re feeling comfortable, try figuring out how to add a vertical line to your plot, denoting the time at which the parasite population reaches its maximum value. You can use STHDA and Towards Data Science as references!

# YOUR WORK HERE!

Question: How might we interpret these data? Do these data seem to represent a chronic infection? An infection that can be cleared? An infection that ends with the death of the host? Explain.

Another Question: How does the value of “r” change your time series? How might you interpret these dynamical changes physiologically?

A P.S. on plotting: If you like data visualization, check out a package called ggplot2! ggplot2 is part of what’s called the “tidyverse”. The tidyverse is a collection of R packages designed for data science (data manipulation, exploration, and visualization), that share underlying design philosophy, grammar, and data structures. Lots of people have lots of opinions about the merits of using base R (what we’re using here, at the moment) vs. the tidyverse, but which is better usually comes down to personal preference. ggplot2 can be a great tool if you’re looking to visualize your data in unique ways, or if you’re looking to make your plots interactive (package: “plotly”). Check out some examples here. If you’re interested, you can read more about ggplot2 here, and more about the tidyverse more generally, here.

After Getting Started…

For Loops

For loops are used to repeat a chunk of code some number of times. For loops are easy to write once you get the hang of them, and can be used to save you a lot of time! The general syntax for a for loop is as follows:

for (i in sequence){
  perform some operation 
}

Here, “i” is what we call the “iterator”, and “perform some operation” is the “iterable”.

Example:

vec <- c() # Makes an empty vector to store the outputs of our for loop in
for (i in 1:10){
  vec[i] <- i + 1 # The value at position i in our vector is i + 1
}
# Means that the value at position 1 in our vector is 1 + 1, the value at position 2 in our vectos is 2 + 1, etc.

print(vec) # To show our resulting vector

##  [1]  2  3  4  5  6  7  8  9 10 11

if… else Statements

if… else statements are conditional statements that are used to evaluate an input, and perform different tasks based on whether the input passes some test or not (a kind of boolean (TRUE/FALSE) test). The general syntax for an if… else statement is as follows:

if (test_expression){
  statement
}

if (test_expression){
  statement
} else{
  statement2
}

Example:

x <- 7 # Assign x some value
if (x > 2){
  print("It's a bones day")
}

## [1] "It's a bones day"

And another one…

if (x == 5){
  print("It's a bones day!")
} else {
  print("It's a no bones day!")
}

## [1] "It's a no bones day!"

Combining For Loops and if… else Statements

For loops and if… else statements are often used together by nesting the if… else statement within the for loop.

vec2 <- c(seq(1:10)) # Makes a vector with values 1-10
vec3 <- c() # Another empty vector to store the outputs of our for loop in 
for (i in vec2){
  if (vec2[i] <= 5){
    vec3[i] <- "It's a bones day!"
  } else {
    vec3[i] <- "It's a no bones day!"
  }
}
print(vec3)

##  [1] "It's a bones day!"    "It's a bones day!"    "It's a bones day!"   
##  [4] "It's a bones day!"    "It's a bones day!"    "It's a no bones day!"
##  [7] "It's a no bones day!" "It's a no bones day!" "It's a no bones day!"
## [10] "It's a no bones day!"

You can also nest for loops inside of other for loops and so on. BUT! A few layers of for loops can really slow down your computation. When that happens, we’re better off writing a function.

Writing Functions

In R, functions are self-contained modules of code that accomplish a specific task. To use a function, you give its required arguments, the function completes its task, and gives you back an output. The general syntax for a function is as follows:

function_name <- function(argument){
  statement
}

Example:

Eqn_Func <- function(x){
  y <- 2*x + 4
  return(y)
}

# The function needs us to give a value of x. If we say x = 2, then the function calculates y according to the statment we wrote, y = 2x + 4
Eqn_Func(2)

## [1] 8

We can also put a function inside a for loop (and so on, and so on…).

Let’s think about our Within-Host Disease Dynamics example from ealier… We simulated our model forward in time using Excel, but we can do the same thing in R by writing function, and implementing that function within a for loop. The output of this code chunk should resemble the plot you generated earlier!

NOTE: There’s a lot going on in this example! It’s okay if this is a bit confusing!

# This function takes initial parasite and immune cell population abundances, and, according to the model we've been using, determines the population abundances at the next time step
Sim_Func <- function(Para_Last, Immune_Last){
  # Parameter values
  ts = 0.1
  r = 0.3; k = 0.001; p = 1; o = 1000
  
  # A discrete model of within-host infectious disease dynamics
  Para_Next = (1 + r*ts - (1-exp(-k*Immune_Last*ts)))*Para_Last
  Immune_Next = (1 + p*(Para_Last/(Para_Last + o))*ts)*Immune_Last
  
  Para_Last <- Para_Next
  Immune_Last <- Immune_Next
  
  return(c(Para_Last, Immune_Last))
}

# To project this model into the future, we need to implement the function within a for loop that can feed the function the return values from the previous iteration of the loop. 
Abundances <- data.frame() # Empty data frame to hold our outputs
Para_Last <- 1; Immune_Last <- 1 # Initial population abundances
for (i in 1:500){ # Loop will run the function 500 times, but with new "initial" abundances every time to simulate the populations forwards in time
  Output = Sim_Func(Para_Last, Immune_Last)
  Para_Last = Output[1]
  Immune_Last = Output[2]
  Addition <- c(Para_Last, Immune_Last)
  Abundances <- rbind(Abundances, Addition)
  colnames(Abundances) <- c("Parasites", "Immune_Cells")
}

# Let's see if that worked:
head(Abundances)

##   Parasites Immune_Cells
## 1  1.029900     1.000100
## 2  1.060694     1.000203
## 3  1.092409     1.000309
## 4  1.125072     1.000418
## 5  1.158711     1.000530
## 6  1.193357     1.000646

# Let's plot our output!
Inits <- data.frame("1", "1")
colnames(Inits) <- c("Parasites", "Immune_Cells")
Abundances <- data.frame(rbind(Inits, Abundances))
Time <- seq(0,50,0.1)
df <- cbind(Time, Abundances)
plot(df$Time, df$Parasites, type = "l", col = "goldenrod1", lwd = 4, xlab="Time", ylab="Abundance")
lines(df$Time, df$Immune_Cells, col = "cornflowerblue", lwd = 4)
legend("topleft", legend=c("Parasites", "Immune Cells"),
       col=c("goldenrod1", "cornflowerblue"), lty = 1, lwd = 4)

Learning Resources:

Introduction to R by DataCamp
- A free interactive course with modules to teach you the basics, vectors, matricies, factors, data frames, and lists
- 4 hour time committment
Tips and Tricks in R Studio and R Markdown
- A really great article from Towards Data Science (generally a fantastic resource!) with some tips on how to write R scripts more efficiently
Stack Overflow
- A Q&A website for asking and answering coding questions
- If you’re having trouble figuring out how to do something, or can’t figure out why your code isn’t running, check Stack Overflow! It’s likely that someone’s already asked and answered your question!

Lab 1 Assignment and Marking:

Assignment:
1. Plot with new parameters??? And some esthetic changes
2. Answer all interpretation questions in EEB319IntroR.Rmd (highlighted in red)

Submit your assignment as an R Markdown file on Quercus.

Marking scheme:
* Code with clear and concise commenting: 40%
* Does your code run? 10%
* Plotting: 25%
* Interpretation questions: 25%

EEB319: Intro to R

M. Jarvis-Cross, Last Updated: 1 February 2022

What is R?

Downloading R and R Studio

R Markdown

Making Your Own Markdown Document

Let’s Get Started!

Setting Your Working Directory:

setwd():

Start a Project:

An Example for Today’s Lab: Within-Host Disease Dynamics

Importing Data

Exploring Data Using Functions

Plotting Data

After Getting Started…

For Loops

if… else Statements

Combining For Loops and if… else Statements

Writing Functions

Learning Resources:

Lab 1 Assignment and Marking: