A Brief Introduction to R

Disclaimer

This is a document intended to highlight basic skills needed to get started with R.

I am not a trained statistician, nor am I a computer scientist. In fact, I have never taken a formal class in any coding language and, as far as I am concerned, computers are still magic black boxes. So please forgive me if I make mistakes or if I forget to cover something. However, I hope that you will find some of what I discuss useful for your own work. These lessons are an amalgamation of tricks I’ve learned from taking a variety of statistics classes during my graduate schooling, reading a variety of really great books that cover R’s capabilities in much greater detail, and spending hours poring over Stack Exchange forums looking for answers to my error messages.

If you are interested in further reading, I strongly recommend the following texts:

R in Action: Data Analysis and Graphics with R by Robert I. Kabacoff - This book covers the basics in depth. If you are new to R, this book is a must-have.

Environmental and Ecological Statistics with R by Song S. Qian - Song was my statistics professor at the University of Toledo for my M.S. program. In addition to basic statistics, this book will cover many advanced topics including GLMs, GAMs, and multi-level models.

Statistical Rethinking: A Bayesian Course with Examples in R and Stan by Richard McElreath - Probably the best introduction to Bayesian statistics I can possibly recommend.

Using R

Why do ecologists use R? R is a powerful language specifically designed for doing statistics. One of the major benefits of using R over other statistical software, like SPSS, SAS, or Excel, is that R is FREE! Yes, FREE! R also has a sizeable community of users who develop and publish packages. Packages are collections of functions developed by users for specific applications. If you need to use a particular method that requires advanced coding skills, there is a good chance someone else has already written a function to do what you want to do and published it as part of an R package. R also has some fantastic data visualization capabilities. You will be able to visualize your data and make publication-quality graphics using some great R packages like ggplot2 and cowplot. Finally, working in a command-line program like R, rather than the simpler point-and-click interfaces you may already be familiar with (e.g., Excel), has the major advantage of making your work reproducible. Don’t remember what you did to get that result? Fortunately, when you write your commands in an R script, you have a record of every step, and you (or anyone else) can rerun it to reproduce your work. Point-and-click programs may be more intuitive and their learning curves aren’t as steep, but you gain much more in terms of reproducibility and transparency in your analyses when you use R, or other programming languages, for your professional work. There is very little that R cannot do, and if you plan to stay in ecology, it’s definitely a skill to focus on so you will be as marketable as possible when it’s time to hunt for jobs!

How to get started

Before we can use R, we need to download it. R is freely available through the Comprehensive R Archive Network (CRAN) here. If you don’t have R installed on your computer yet, go to this link and download the appropriate version for your computer following the instructions on the CRAN website.

I also strongly recommend downloading RStudio. Although technically optional, RStudio will make your life much, much easier. As powerful as R is, it is also extremely picky and it can be frustrating when your code doesn’t work as planned. I find working directly in the R Console to be very frustrating, largely because it does not have an integrated text editor. So when you run a command and it doesn’t work, you must find out why it didn’t work and then retype the whole thing into the command line to run it again. You can just use a regular text editor, like Notepad, but that requires you to have two programs running instead of just one. RStudio is an integrated development environment (IDE), which offers a nice user interface for working in R. Some of its features include the R Console (where the code is executed), a source editor for writing and saving your code, tools for plotting, debugging assistance, and panes that show your command history, the objects saved in your current session and what they are called, your current working directory, and much more. RStudio makes it easy to document your code while you edit and run it. RStudio is also free and it can be downloaded here.

Housekeeping

Before we get to any statistics, let’s cover some useful features in R. You can leave yourself a comment in R where you write your code. All you have to do is use the hashtag symbol (pound sign if you were born before 1990). This is helpful for leaving yourself, and others, notes about what your code means.

#I am adding two numbers together
2+2

## [1] 4

Any line of code that has the # sign in front of it will not run in the R command line. Therefore, # is a special symbol that should only be used to make comments. It is also a handy way to keep a certain line of code around when you just don’t want to run it right now. For example, maybe you are trying to decide how best to complete a task and you aren’t ready to delete a certain line of code yet, but you want to try another way first. You can use the # symbol to make sure that line of code does not run while you figure out the best way.

#I can answer my question in several different ways
#2+2
2+1+1

## [1] 4

Comments can also be used to provide helpful information about the current version of your script and what it is for.

#Title: A Brief Introduction to R
#Author: Me
#Date: Today

Working directories

An important concept for working with R, and any coding language, is your working directory. When you set your working directory, you tell R where to read files from and where to save all of your work. Whenever you start a new project in R, it is good practice to set the working directory so that you do not become confused as to where your files are being saved.

#Setting a working directory
setwd("C:/Users/jakek/OneDrive/Documents/PhD_work/R_coding_corner/R_skills")

#Checking your current working directory
getwd()

## [1] "C:/Users/jakek/OneDrive/Documents/PhD_work/R_coding_corner/R_skills"

In RStudio, you can also set the working directory by going to “Session” -> “Set Working Directory” -> “Choose Directory...” and picking the folder in your file explorer.
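If you want to try setwd() without touching your real folders, here is a small sketch. It assumes a throwaway folder inside tempdir() (the scratch folder R creates for each session), and the names old_wd, project_dir, and notes.txt are made up for illustration. It also shows file.path(), which builds paths with / for you.

```r
# Remember the current working directory so it can be restored afterwards
old_wd <- getwd()

# A throwaway folder that stands in for a real project folder (hypothetical name)
project_dir <- file.path(tempdir(), "R_skills_demo")
dir.create(project_dir, showWarnings = FALSE)

# Point R at the new folder and confirm the change
setwd(project_dir)
getwd()

# Relative file names are now resolved against project_dir
writeLines("hello", "notes.txt")
file.exists(file.path(project_dir, "notes.txt"))

# Put things back the way they were
setwd(old_wd)
```

Saving the old working directory first is just a habit I find useful, so that a script does not leave you somewhere unexpected when it finishes.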

Packages

R has many packages available to it. But it cannot use those packages unless they are installed and then loaded from your library. To install a package, you need to download it, usually from CRAN. You can easily install packages using the install.packages() command.

#This line tells R which CRAN mirror to download packages from. It is optional,
#but it helps make sure you're getting packages from your preferred mirror.
#I've set mine to the US CRAN mirror site.
options(repos = c(CRAN = "http://cran.us.r-project.org"))
install.packages("MASS")

## Installing package into 'C:/Users/jakek/OneDrive/Documents/R/win-library/3.6'
## (as 'lib' is unspecified)

## 
##   There is a binary version available but the source version is later:
##      binary source needs_compilation
## MASS 7.3-52 7.3-53              TRUE
## 
##   Binaries will be installed
## package 'MASS' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\jakek\AppData\Local\Temp\RtmpAz8AJQ\downloaded_packages

Now I have the MASS package installed, which contains a collection of statistical functions and example datasets. However, I still must load the MASS package from my library with library() before I can use the functions in it.

library(MASS)

Sometimes packages depend on other packages. By default, install.packages() also installs the packages that the one you asked for strictly requires. Adding dependencies = TRUE goes a step further and also installs the optional packages it suggests, which you may need in order to use all of its capabilities. (You will often see T used as shorthand for TRUE. It works, but spelling it out is safer because T can be overwritten.)

install.packages("ggplot2", dependencies = TRUE)
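Since installing takes a while, I sometimes check whether a package is already installed first. The sketch below is one way to do that, not the only one. is_installed() is a helper name I made up, and I use the stats package as the example only because it ships with R, so the install step is skipped.

```r
# Hypothetical helper (not part of base R): TRUE if the package is installed
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# stats ships with R, so this should come back TRUE
is_installed("stats")

# Install a package only when it is missing, then attach it
pkg <- "stats"  # swap in the package you actually want, e.g. "MASS"
if (!is_installed(pkg)) {
  install.packages(pkg)
}
library(pkg, character.only = TRUE)
```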

You can load multiple packages at once with a few lines of code if you know which ones you like to use. I find this helpful because there are several packages I almost always want whenever I start a new R session.

x <- c("ggplot2", "cowplot", "dplyr", "MASS", "car", "reshape2")
lapply(x, require, character.only = TRUE)

What I’ve done here is save the names of packages that I’ve already installed in my version of R as a character vector called x. I then ran lapply(x, require, character.only = TRUE), which tells R to take every element in vector x and execute the require() command on it. The require() command is similar to library() in that it attaches the packages you want to use. The difference is that require() returns FALSE with a warning, rather than stopping with an error, when a package cannot be found. Finally, character.only = TRUE tells R to treat each element of your vector as a character string holding a package name (yes, you do want that).
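Because require() quietly returns FALSE for a missing package, it is easy to miss that something did not load. Here is a rough sketch of how you could check. The vector name pkgs and the package name notARealPackage123 are made up, the latter so that one entry fails on purpose.

```r
# Two packages that ship with R, plus a deliberately fake name (hypothetical)
pkgs <- c("stats", "utils", "notARealPackage123")

# vapply() works like lapply() but returns a named logical vector here
loaded <- suppressWarnings(
  vapply(pkgs, require, logical(1), character.only = TRUE, quietly = TRUE)
)
loaded

# Names of the packages that failed to load
pkgs[!loaded]
```

I used vapply() instead of lapply() only because a named TRUE/FALSE vector is easier to read than a list.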

When you encounter a new command and you aren’t sure what it does, you can call up its help page using the ? operator. For example, try entering ?require in your console.
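A couple of other base R tools are handy when you only half remember a function. This is a small sketch, and hits is a name I made up for the example.

```r
# Show a function's arguments without opening the full help page
args(rnorm)

# Find functions on the search path whose names start with "read."
hits <- apropos("^read\\.")
hits

# These open help pages, so they are commented out here:
# help("require")          # same page as ?require
# ??"working directory"    # searches all installed help pages for a phrase
```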

Objects

Before going any further, let’s talk about objects.

R is often described as an object-based language: nearly everything you create or use in R is an object. For our purposes, this just means that R works by storing information in objects that you name and assign yourself. You can then use those objects to perform any number of complex operations. An object can be many different things including data, functions, or even graphs.

Several of the object types for storing data you will use in R include vectors, matrices, arrays, and data frames. I will not cover arrays here but just know that an array is basically a matrix with more than two dimensions.

Scalars

A scalar is the simplest kind of object. Scalars hold a single value and are typically used to name and store constants. Strictly speaking, R stores a scalar as a vector of length one.

A <- 15
A

## [1] 15
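A quick sketch of what you can do with a scalar once it is stored. The names radius and area are made up for the example.

```r
A <- 15

# A scalar is a vector with exactly one element
length(A)      # 1
is.vector(A)   # TRUE

# Use the stored value in arithmetic
A * 2          # 30

# Store the result of a calculation in a new object
radius <- 3
area <- pi * radius^2
area
```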

Vectors

A vector is the second simplest kind of object: a one-dimensional collection of values. A vector can hold numeric data, character data, or logical (TRUE/FALSE) data, but only one type at a time.

#A vector containing a set of numbers
a <- c(1, 2, 3)
a

## [1] 1 2 3

#You might choose to give your objects more descriptive names
states <- c("Michigan", "Ohio", "Washington", "Alaska", "Oregon", "Rhode Island")
states

## [1] "Michigan"     "Ohio"         "Washington"   "Alaska"       "Oregon"      
## [6] "Rhode Island"
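Here is a short sketch of what happens when you try to mix types in one vector, plus how to pull out a single element. The names mixed and birds are made up for the example.

```r
# Mixing types forces everything to the most flexible type (character here)
mixed <- c(1, "two", TRUE)
mixed          # "1" "two" "TRUE"
class(mixed)   # "character"

# Logical values mixed with numbers become 1 and 0
c(TRUE, FALSE, 3)   # 1 0 3

# Square brackets pull out elements by position (R counts from 1)
birds <- c("loon", "heron", "osprey")
birds[2]       # "heron"
length(birds)  # 3
```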

You can also easily create sequences in R and store them in vectors in several different ways:

b <- 4:7
b

## [1] 4 5 6 7

c <- seq(4, 7, by = 1) 
c

## [1] 4 5 6 7

#'by = 1' tells R what interval to create your sequence by. 
#You can change it to a different interval, which results in a new vector.
d <- seq(4,7, by = 0.25)
d

##  [1] 4.00 4.25 4.50 4.75 5.00 5.25 5.50 5.75 6.00 6.25 6.50 6.75 7.00
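A few more ways to build regular vectors, as a quick sketch. All of these are base R functions.

```r
# Ask for a number of evenly spaced values instead of a step size
seq(4, 7, length.out = 4)     # 4 5 6 7

# Sequences can count down with a negative step
seq(10, 0, by = -2.5)         # 10.0 7.5 5.0 2.5 0.0

# Repeat a pattern
rep(c("a", "b"), times = 3)   # "a" "b" "a" "b" "a" "b"

# Count from 1 up to n
seq_len(5)                    # 1 2 3 4 5
```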

Note that R uses the <- symbol sort of analogously to the = symbol you may have encountered if you have ever used other programming languages, like Python. You can also use the = symbol to name objects in R. But = is also used inside function calls to set arguments (for example, by = 1 above). So I tend to stick with <- for naming objects just to avoid confusion. Note that you can also reverse the arrow -> and name objects like this:

c(1, 2, 3) -> a
a

## [1] 1 2 3

But nobody does this and your coding friends will make fun of you if you do this. For readability, it is standard to put your object name on the left, followed by the <-, and then the list of information you are storing into that object.

Also note that whenever you are grouping several values together in R, you must use the c() function (short for “combine”) with your values enclosed in parentheses. This tells R to combine each of the elements inside the parentheses into a single vector.

Matrices

A matrix is a two-dimensional data object in which each value has an address corresponding to its row number and column number. Matrices can take on all the same data types as scalars and vectors (numeric, character, logical). However, you cannot mix data types in the same matrix. For example, a matrix cannot contain both numbers and characters.

x <- rnorm(10) #rnorm() randomly samples n times from a normal distribution. Try ?rnorm() for more details.

y <- matrix(x, nrow = 5, ncol = 2)
y

##             [,1]       [,2]
## [1,] -0.05613275  0.9971035
## [2,] -0.33003768 -0.8877445
## [3,] -0.39471793 -1.2689348
## [4,] -0.58241646 -1.0149301
## [5,]  1.56400214 -1.6255710

The matrix() command takes three basic inputs. First is a vector, in my case called x, which contains the elements for the matrix. Next, nrow specifies the number of rows the matrix will contain. Finally, ncol specifies the number of columns. By default, matrix() fills the matrix column by column, so the first five values of x end up in the first column. You can also name your columns and rows with the dimnames argument.

rnames <- c("R1", "R2", "R3", "R4", "R5")
cnames <- c("C1", "C2")
y <- matrix(x, nrow = 5, ncol = 2, dimnames = list(rnames, cnames))
y

##             C1         C2
## R1 -0.05613275  0.9971035
## R2 -0.33003768 -0.8877445
## R3 -0.39471793 -1.2689348
## R4 -0.58241646 -1.0149301
## R5  1.56400214 -1.6255710

#Equivalently, you can use colnames() and rownames() separately
y <- matrix(x, nrow = 5, ncol = 2)
colnames(y) <- cnames
rownames(y) <- rnames
y

##             C1         C2
## R1 -0.05613275  0.9971035
## R2 -0.33003768 -0.8877445
## R3 -0.39471793 -1.2689348
## R4 -0.58241646 -1.0149301
## R5  1.56400214 -1.6255710
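Because rnorm() gives different numbers every time, here is a small sketch with fixed values so you can see exactly where each element lands. The names m_col and m_row are made up, and byrow = TRUE is the matrix() argument that switches the fill order.

```r
# Default: values fill down the columns
m_col <- matrix(1:6, nrow = 2, ncol = 3)
m_col

# byrow = TRUE: values fill across the rows instead
m_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
m_row

# Each value has a [row, column] address
m_col[1, 2]   # 3
m_row[1, 2]   # 2

# Leave one index blank to get a whole row or column
m_col[, 3]    # 5 6

# Number of rows, then number of columns
dim(m_col)    # 2 3
```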

Data frames

Data frames are probably the most common type of object in R. Like a matrix, data frames contain columns and rows, but you can have more than one data type within a data frame. You can either build a data frame in R directly or you can load one in from a .txt or .csv file, for example. Let’s just start by building a simple data frame using some basic information about the five Great Lakes.

#First thing to do is to create a series of vectors containing the information for my data frame
Lake <- c("Ontario", "Erie", "Huron", "Michigan", "Superior")
SurfArea <- c(7340, 9910, 23000, 22300, 31700) #Sq. miles
AvgDepth <- c(283, 210, 195, 279, 483) #Feet
Vol <- c(393, 116, 850, 1180, 2900) #Cu miles
Visited <- c(TRUE, TRUE, TRUE, TRUE, FALSE) #A TRUE/FALSE vector indicating whether or not I have visited the lake

#Now I can combine these vectors into a data frame. Note that for this to work, the vectors must all be of the same length.

GreatLakes <- data.frame(Lake, SurfArea, AvgDepth, Vol, Visited)
GreatLakes

Once you have a data frame built, you can inspect the structure of the data frame with the str() command. This is important for verifying that the data in each column is in the correct format.

str(GreatLakes)

## 'data.frame':    5 obs. of  5 variables:
##  $ Lake    : Factor w/ 5 levels "Erie","Huron",..: 4 1 2 3 5
##  $ SurfArea: num  7340 9910 23000 22300 31700
##  $ AvgDepth: num  283 210 195 279 483
##  $ Vol     : num  393 116 850 1180 2900
##  $ Visited : logi  TRUE TRUE TRUE TRUE FALSE

OK, let’s look at the output from the str() command. This tells us that GreatLakes is a data frame containing 5 observations of 5 variables. Lake is a factor with 5 levels. (This output comes from R 3.6. In R 4.0.0 and later, data.frame() no longer converts strings to factors by default, so Lake would show up as chr unless you add stringsAsFactors = TRUE.) SurfArea, AvgDepth, and Vol are numeric data. Finally, Visited is a logical column containing TRUE/FALSE information indicating whether or not I have visited that lake. Looks like everything is how I want it. If needed, I can change the class of a variable in the data frame. For example, maybe I want R to read the Lake column as a character instead of a factor.

GreatLakes$Lake <- as.character(GreatLakes$Lake)
str(GreatLakes)

## 'data.frame':    5 obs. of  5 variables:
##  $ Lake    : chr  "Ontario" "Erie" "Huron" "Michigan" ...
##  $ SurfArea: num  7340 9910 23000 22300 31700
##  $ AvgDepth: num  283 210 195 279 483
##  $ Vol     : num  393 116 850 1180 2900
##  $ Visited : logi  TRUE TRUE TRUE TRUE FALSE

Now the Lake variable is a character vector instead of a factor. Ensuring you have the correct internal data structure will be important for running statistical analyses later on, especially for more complex analyses where you might want to include data of multiple different classes, like a mixed effects model where you would regress some continuous variables (numeric) across different groups (factors).
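Here is a small sketch of going back and forth between factors and characters. It uses a cut-down, made-up data frame called lakes, and it sets stringsAsFactors = TRUE explicitly so that it behaves the same way regardless of which version of R you have.

```r
# Build a small data frame and ask for strings to be stored as factors
lakes <- data.frame(
  Lake = c("Ontario", "Erie", "Huron"),
  AvgDepth = c(283, 210, 195),
  stringsAsFactors = TRUE
)

class_before <- class(lakes$Lake)
class_before          # "factor"
levels(lakes$Lake)    # "Erie" "Huron" "Ontario" (levels are sorted alphabetically)

# Convert the column to plain character strings
lakes$Lake <- as.character(lakes$Lake)
class(lakes$Lake)     # "character"

# Convert back to a factor when a grouping variable is needed
lakes$Lake <- factor(lakes$Lake)
nlevels(lakes$Lake)   # 3
```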

The $ operator is an important concept in R. It is essentially a very basic subsetting operation. It tells R to extract elements by name from a named list or data frame.

#The command 'GreatLakes$Lake' tells R to return only the 'Lake' column from the 'GreatLakes' data frame. 

GreatLakes$Lake

## [1] "Ontario"  "Erie"     "Huron"    "Michigan" "Superior"
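The $ operator is not the only way to subset. The sketch below shows a few base R alternatives on a trimmed-down copy of the Great Lakes data. I named it gl (a made-up name) so it does not overwrite the GreatLakes object above.

```r
# A trimmed-down copy of the Great Lakes data
gl <- data.frame(
  Lake = c("Ontario", "Erie", "Huron", "Michigan", "Superior"),
  AvgDepth = c(283, 210, 195, 279, 483),  # feet
  Visited = c(TRUE, TRUE, TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Double brackets return the same column as gl$AvgDepth
gl[["AvgDepth"]]

# Keep only the rows where a condition is TRUE (note the comma)
gl[gl$AvgDepth > 250, ]   # Ontario, Michigan, Superior

# Combine $ with a logical column: lakes I have not visited
gl$Lake[!gl$Visited]      # "Superior"

# Extracted columns can go straight into other functions
mean(gl$AvgDepth)         # 290
```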

By using the $ operator, you can pull important elements from your data frame to perform specific actions on them. For example, I could cross-tabulate the GreatLakes data by Lake and Visited to easily see the Visited status of each lake. This seems silly to do on such a small data frame, but when working with very large data frames this can be very helpful.

table(GreatLakes$Lake, GreatLakes$Visited)

##           
##            FALSE TRUE
##   Erie         0    1
##   Huron        0    1
##   Michigan     0    1
##   Ontario      0    1
##   Superior     1    0

Now let’s move on to how to read data from a .txt or .csv file since this is most likely how you will use R for your own data analysis, rather than building the data frames in R yourself.

Reading in data

In your emails, I sent you a copy of the Fisher/Anderson iris data. The iris dataset might be the most famous (or infamous) dataset known to coding and statistics students the world over. It was published by Ronald Fisher in a 1936 paper as an example of linear discriminant analysis, although the data were actually collected by Edgar Anderson. The iris dataset contains a collection of morphological measurements on three species of iris and is extremely useful for teaching a wide variety of statistical concepts.

However, this will be the last time we use the iris dataset for the purposes of these lessons. For one thing, there are plenty of fresh and interesting datasets freely available for us to use and we don’t need to be beholden to Fisher and Anderson. Second, there is a push in science, and society more broadly, to recognize past injustices directed at historically victimized communities and to embrace more diverse representation moving forward. Fisher was a prominent eugenicist, an outspoken racist, and a defender of big tobacco. Therefore, I find it best to recognize him for his enormous contributions to science and statistics, but to promptly move past him. So from here forward, I will try to use datasets in my examples that recognize a broader collection of researchers and their contributions to science in their respective fields.

OK, enough history. Let’s read in the iris data. To do this, you will need to find the file path in your file explorer where you saved your data. In my case it is in C:\Users\jakek\OneDrive\Documents\PhD_work\R_coding_corner\R_skills. If you are on Windows, your file explorer likely uses \ in its file paths. In R, a backslash inside a quoted string is an escape character, so a path pasted straight from Windows will not be read correctly. To get the file path to play nicely in R, replace every \ with / (or double each one as \\) when you write the file path in R. We can do this within the read.table() command.

read.table("C:/Users/jakek/OneDrive/Documents/PhD_work/R_coding_corner/R_skills/iris.txt", header = TRUE, sep = "")

OK, so let’s break down what that command does. Notice that at the end of the file path, I wrote in the name of the data file as it is saved on my computer. I also had to append the file type .txt to the end of the name. If you forget to append the file type, R will not find your file. Next, I added header = TRUE to let R know that the first row in the data contains my column headers. Finally, I added sep = "" to tell R that my data are separated by white space. Had my data been comma separated, I would have written sep = ",". Alternatively, if my data were saved in a .csv format, I would simply use read.csv() and change the name of the dataset to iris.csv when reading it in.
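If you don’t have the file handy, you can still practice reading data. The sketch below writes a tiny made-up data frame (demo) to temporary .txt and .csv files and then reads both back in. The object names and the temporary file paths are assumptions for illustration only.

```r
# A tiny made-up dataset to write out and read back in
demo <- data.frame(
  Species = c("setosa", "virginica"),
  Sepal.Length = c(5.1, 6.3)
)

# Temporary files stand in for files saved on your computer
txt_path <- tempfile(fileext = ".txt")
csv_path <- tempfile(fileext = ".csv")

# write.table() separates values with a space unless told otherwise
write.table(demo, txt_path, row.names = FALSE, quote = FALSE)
write.csv(demo, csv_path, row.names = FALSE)

# Whitespace-separated text file with a header row
from_txt <- read.table(txt_path, header = TRUE, sep = "")

# read.csv() already assumes a header row and comma separators
from_csv <- read.csv(csv_path)

str(from_txt)
str(from_csv)

# Clean up the temporary files
unlink(c(txt_path, csv_path))
```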

Just reading in the data does us no good, though. We want to be able to do stuff with it. So I am going to store the dataset in an object called iris.dat and inspect the structure with str() to make sure everything looks good.

iris.dat <- read.table("C:/Users/jakek/OneDrive/Documents/PhD_work/R_coding_corner/R_skills/iris.txt", header = TRUE, sep = "")
str(iris.dat)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

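Looks good to me. (In R 4.0.0 and later, Species will show up as chr rather than Factor unless you add stringsAsFactors = TRUE to read.table().) Here are a couple of other useful tools for inspecting your data just to make sure everything read in properly: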

head(iris.dat) #Returns the first few rows of the data rather than printing the whole thing. Especially useful for large datasets. 
tail(iris.dat) #Returns the last few rows
View(iris.dat) #Opens the whole dataframe in a new window
colnames(iris.dat) #Returns the column names
dim(iris.dat) #Returns the dimensions of your dataframe first by rows, then columns.
is.na(iris.dat) #Checks for NAs in your data. We will discuss how to handle NAs later on.
summary(iris.dat) #Generates simple summary statistics on the columns
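On a big dataset, is.na() prints a TRUE/FALSE for every single cell, which is hard to read. Here is a sketch of a few ways to summarize it instead. It uses the copy of iris that ships with R (datasets::iris) rather than the file, and I blank out two values myself so there is something to find. The name dat is made up.

```r
# Copy the built-in iris data and blank out two values on purpose
dat <- datasets::iris
dat[c(3, 10), "Sepal.Width"] <- NA

# Rows, then columns
dim(dat)                   # 150 5

# Is there at least one NA anywhere?
anyNA(dat)                 # TRUE

# Count the NAs in each column
colSums(is.na(dat))

# Number of rows with no missing values at all
sum(complete.cases(dat))   # 148
```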

That’s all I have for now. Thanks for reading my post! Next week I will introduce you to the tidyverse.

Stay tuned!