The wonderful world of R: Introduction, distributions, basic data simulation

Welcome to R! R is one of several simple programming language used globally for data analysis across many fields of study. It can seem daunting at first, but the upfront work it totally worth it. It makes it easy to share work with collaborators and perform tasks with large data that would take forever in excel or google sheets. This intro is in no way comprehensive and it meant to give some basic code that will be handy for this course. Feel free to reference this throughout the class.

You will notice in R that there are often MANY ways to accomplish the same goal. In the spirit of this, some of the code below uses the base R language (which I am not as good with) and the tidyverse (with which I am much better). They will usually accomplish the same thing. That said, it can be rewarding to minimize packages, that way you really understand what is going on.

For those interested, I learned most of my R knowledge from the book “R for Data Science” by Hadley Wickham. The book can be found here: https://r4ds.had.co.nz/index.html. Full disclosure: it is a tidyverse book.

OK! Let’s get started!

Installing packages and calling libraries

Libraries are collections of functions that you can call. When you fire up R some are already loaded. It is possible to do a ton in R without ever deviating from these base libraries. That said, often we want to run functions that are not in R and we need to install packages. To do this we use the function install.packages(). Here is an example of using install packages to install the package “tidyverse”, an especially useful package in data science. The tidyverse are a set of packages that are use for data processing, plotting, etc. Most of these functions can also be performed using base functions, but I like tidyverse.

install.packages("tidyverse") #sorry, tidyverse is kinda a lot

You only have to install a package once (although you may need to update them occasionally). Once package has been installed you can call it using the function library(). Require() can also be used for this.

library("tidyverse")

## ── Attaching packages ────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.0     ✓ purrr   0.3.3
## ✓ tibble  3.0.0     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0

## ── Conflicts ───────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

#require("tidyverse")

Setting your working directory

This is a critical step and EVERY script you write should include this. The working directory is where the R will look for files you want to use and where it will save files. After my libraries, the next line in all of my scripts is setwd(). Here I am setting my working directory as the folder “test_folder” with my “Documents” folder. Set your working directory to wherever you saved the “test_data.csv” file.

Tip: you can use press tab in the process of typing your working directory and it will auto-fill with options. You can also set your working directory by using the “session” dropdown menu.

setwd("~/Documents/test_folder")

Importing data

Data to be read into R should be a csv. If you are saving data from excel with the intention of using it in R, it should be saved as a csv. We can import data using the read.csv() function.

Like before you can us tab complete to more easily find files. It tries to matches file names from your working directory

dat <- read.csv("test_data.csv")

We need a name for the data we imported so I called it “dat”. You can save objects using <- or =.

Some R basics

R saves objects in different formats. The item “dat” is a data frame, which is essentially a list of vectors. A vector stores data of the same type.

Col1 is an integer vector.

dat$Col1

##  [1]  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5
## [26]  6  7  8  9 10

Col2 is a character vector.

dat$Col2

##  [1] Species A Species A Species A Species A Species A Species A Species A
##  [8] Species A Species A Species A Species B Species B Species B Species B
## [15] Species B Species B Species B Species B Species B Species B Species C
## [22] Species C Species C Species C Species C Species C Species C Species C
## [29] Species C Species C
## Levels: Species A Species B Species C

You can access elements of a vector using brackets. This will access the second element of the vector:

dat$Col2[2]

## [1] Species A
## Levels: Species A Species B Species C

These brackets also apply to the entire data.frame. Since it is two dimensional we need to add a command. dat[1,2] accesses the value in the first row and the second column.

dat[1,2]

## [1] Species A
## Levels: Species A Species B Species C

If you want to access an entire row or column just put in a comma.

dat[1,] #first row

##   Col1      Col2     Col3
## 1    1 Species A 49.90378

dat[,1] #first column

##  [1]  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5
## [26]  6  7  8  9 10

Examining data

Now that we have imported the data, let’s take a look at it! There are many useful ways to look at data, too many to describe here, so I have chosen a few I use all the time. These are:

colnames()
length()
nrow()
ncol()
unique()
head()
tail()
as.character(), as.factor(), as.numeric()

Before going any further its worth pointing out that functions require arguments. These will be different for every functions depending on what it does. If you ever need help with a function put a question mark in front of it. Tab is also useful.

?colnames

colnames()

Returns the name of all of the columns in the data.

colnames(dat)

## [1] "Col1" "Col2" "Col3"

This function can also be used rename columns. We can to this be creating a vector of names. To make a vector of values we need to enclose them in c()

colnames(dat) <- c("plotID", "species", "height")
colnames(dat)

## [1] "plotID"  "species" "height"

length()

Returns the number of elements in a vector. A vector is a basic element that contains values of the same type. The code below creates a new vector named “spec” that contains the values in the species column. length returns the number of values in that vector.

spec <- dat$species
length(spec)

## [1] 30

nrow() and ncol()

Return the number of rows and the number of columns.

nrow(dat)

## [1] 30

ncol(dat)

## [1] 3

unique()

Return all of the unique values in a vector (or column). I often pair it with length() to determine the number of levels in a categorical variable. Let’s examine the species category.

unique(dat$species)

## [1] Species A Species B Species C
## Levels: Species A Species B Species C

length(unique(dat$species))

## [1] 3

head() and tail()

Returns the first 6 rows and the last 6 rows in a dataframe respectively. Very handy for taking a quick look at data. You can modify n argument in this function to change the number of rows shown.

head(dat)

##   plotID   species   height
## 1      1 Species A 49.90378
## 2      2 Species A 47.31901
## 3      3 Species A 50.87232
## 4      4 Species A 49.41498
## 5      5 Species A 51.29793
## 6      6 Species A 50.82202

as.character(), as.factor(), as.numeric()

These functions are used to convert columns to a different data type. Sometimes data can read in weird and sometimes stats functions need data in a certain format. We can use these to move between them.

dat$species <- as.character(dat$species) #makes species a character
dat$species <- as.factor(dat$species) #makes it a factor again

Wrangle data

Data wrangling is the term for re formatting your data. There are many reasons you might want to do this. Maybe a particular function needs your data a certain way. Maybe you want to remove NAs. There are too many to count. Often in EECB related fields that have lots of data, this is the part of analysis that takes the most time. Here are some useful functions for data wrangling.

filter() and/or subset()
select()
gather() and spread()
na.omit()

filter() and/or subset()

Filter and subset are two ways of accomplishing the same thing. Filter is a tidyverse function while subset() is a base R function. Filter will only work if you have called the library “tidyverse” or “dplyr.” They allow you to look at portions of your data. For example, let only look at species A.

SpeciesA.1 <- filter(dat, species == "Species A")
SpeciesA.2 <- subset(dat, species == "Species A")

== means “is equal to”. You can also use > (greater than), < (less than), >= (greater than or equal to), <= (less than or equal to), != (not equal to).

select()

Select() is used to remove columns. Let’s remove “Plot ID” from dat. You can do this by keeping the columns you want or removing .

dat2 <- select(dat, species, height)
dat2 <- select(dat, -plotID)

You can also remove columns using the brackets we used earlier. Since Plot ID is the first column we can also do this.

dat2 <- dat[,-1]

dat2 <- dat[-1,] would remove the first row.

gather() and spread()

Gather and spread are VERY handy for reformatting data frames. Essentially these functions transition data between “long” and “wide” forms. Long dataframes tend to have more rows and a fewer columns, while wide dataframes have more columns and fewer rows. Let’s take a look at what these look like. Long dataframes are nice for data wrangling and for running some models, while wide dataframes are needed for analyses like hill numbers and ordination. The functions melt() and cast() in the reshape package are similar.

View(dat)

These data are long (although since it is only a small amount of data it isn’t a great example). Imagine you have 300 or 3000 observations (rows). Then the term “long” would make more sense.

We can use spread() to make it wider. We want a column for each species (which is the “key”) and those columns to be filled with the height measurement (the “value” column),

dat_spread <- spread(dat, key = species, value = height)
head(dat_spread)

##   plotID Species A Species B Species C
## 1      1  49.90378  67.43697  90.37914
## 2      2  47.31901  70.33240  91.13306
## 3      3  50.87232  70.89353  92.75686
## 4      4  49.41498  70.39009  89.08342
## 5      5  51.29793  71.12169  89.56514
## 6      6  50.82202  71.29156  91.64644

Now “Species A”, “Species B”, and “Species C” each have their own column.

We can reverse this using gather(). We need to specify that we only want to gather columns 2:4. In gather(), key is the name for the new column created from the former column names and value is the name for the values.

dat_gathered <- gather(dat_spread, 2:4, key = "Species", value = "height")
head(dat_gathered)

##   plotID   Species   height
## 1      1 Species A 49.90378
## 2      2 Species A 47.31901
## 3      3 Species A 50.87232
## 4      4 Species A 49.41498
## 5      5 Species A 51.29793
## 6      6 Species A 50.82202

pipes

The tidyverse functions are nice because they allow us to string together a chain of commands using pipes. A pipe is basically take these data, run a function, then take the output and immediately run another function, and so on. The pipe in R is %>%. There is a keyboard shortcut for this, but it varies between PC and mac.

In this example we will take “dat” and first filter and then remove a column. We specify the data we will use at the beginning and so we do not need to include it in subsequent functions. We want to first only keep rows where height is greater than or equal to 50 and then remove the PlotID column.

piped_dat <- dat %>% 
  filter(height >= 50) %>% 
  select(-plotID)

head(piped_dat)

##     species   height
## 1 Species A 50.87232
## 2 Species A 51.29793
## 3 Species A 50.82202
## 4 Species A 50.86813
## 5 Species A 51.78562
## 6 Species B 67.43697

group_by() and summarise()

These functions are always used in conjunction to summarize data based on a grouping variable. These are best used with pipes. In the example below we will calculate the mean and standard deviation of height in each group. We make the new columns “av” and “std_dev.”

summ <- dat %>% 
  group_by(species) %>% 
  summarise(av = mean(height), std_dev = sd(height))
head(summ)

## # A tibble: 3 x 3
##   species      av std_dev
##   <fct>     <dbl>   <dbl>
## 1 Species A  49.8    1.57
## 2 Species B  70.0    1.36
## 3 Species C  90.1    1.54

Here you can see that a new data frame is created. It has a species column, a column with the mean height of each species and a column for the standard deviation of height in each species.

na.omit()

na.omit() is used to remove nas that appear in your data. It is important to know that it works by row. So if an na appears in an column of a row, that row is removed. Sometimes this is totally fine. Other times you may only care about nas in a particular column only.

dat_noNA <- na.omit(dat)

You can see that there is no difference, because there were no NAs in the original data.

For loop

These are a little tricky to get the hang of but are SUPER useful. For loops allow us to iterate tasks. The best part about this is you can set them to run and then go off to grab a beverage and still feel like you are working!

for loops start off like this, where it is looping through the “vector” you provide.

for (i in vector)

for (i in 1:30)

In this, the for loop will run through the vector 1:30, which is just 1,2,3,… 30. In each iteration i changes value. In the first run i=1, on the second run i=2, and so on. It is important to note that i is the standard notation, but this can be anything. You could say for (FrankHerbert in 1:30) and it would still work.

Let’s look at a full for loop. Here we saying to print the value of i for each iteration. So we press go and i=1, then it prints 1, then it reaches the end of the loop so i becomes 2, then it prints 2, and so on. I write loops all the time and almost always include print(i) at the very end just to keep track of where it is at.

for (i in 1:30) {
  print(i)
}

## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
## [1] 11
## [1] 12
## [1] 13
## [1] 14
## [1] 15
## [1] 16
## [1] 17
## [1] 18
## [1] 19
## [1] 20
## [1] 21
## [1] 22
## [1] 23
## [1] 24
## [1] 25
## [1] 26
## [1] 27
## [1] 28
## [1] 29
## [1] 30

The vector you provide doesn’t have to be numbers! Check this out:

vec <- c("I have heard", "it from the mouths", "of the Edema Ruh,", "so I know it to be true")
for (i in vec) {
  print(i)
}

## [1] "I have heard"
## [1] "it from the mouths"
## [1] "of the Edema Ruh,"
## [1] "so I know it to be true"

In this case i does not equal a number, but rather the elements of the character vector.

When working with data I always prefer to have i equal a number rather than a character, even if I am looping through a character vector. When doing this, i is a nice progress tracker and i can be useful elsewhere in the loop. We can recreate the previous loop this way.

vec <- c("I have heard", "it from the mouths", "of the Edema Ruh,", "so I know it to be true")
for (i in 1:length(vec)) {
  print(vec[i])
}

## [1] "I have heard"
## [1] "it from the mouths"
## [1] "of the Edema Ruh,"
## [1] "so I know it to be true"

This is a little convoluted for this example, but when working with large data I find it VERY handy. This is saying for i in 1:4 (because length(vec) = 4), print the first element of the vector, then the second, and so on.

For a more applied example, I often use for loops to run models. I work with data that encompass over 100 butterfly species. I often want to run a model for many of these species, so I will run a loop that takes a species name, subsets the data to only include that species, runs a model, saves the important values, and repeats. Meanwhile I am already at the bar.

Simulation

R is also super handy for simulating data! Simulating data is a super important part of statistics. In order to make sure your code and analyses makes sense it is important to simulate data first and check. If you know the “answer”, which you generate by simulating data, your statistical test should reflect this. If it doesn’t there is something wrong.

When simulating data in R, you simulate data from a distribution. There are many types of distributions. Here are a few. Distributions can be grouped into discrete and continuous. The Normal and gamma distributions are continuous because they predict continuous values, while the Poisson and binomial distributions are discrete, because they predict whole numbers. The wiki page for these is pretty good!

Let’s focus on the normal distribution because it is the easiest.

We can simulate data using the rnorm function(). This functions simulates points from a normal distribution. The first command is the number of points, the second is the mean of the distribution, and the third command it the standard deviation.

Here is a distribution with 1000, points, a mean of 10, and a standard deviation of 2.

sam <- rnorm(1000, 10, 2)
plot(density(sam))

Let’s look at the process of simulating data to check a statistical test. The t-test compares the distributions of two groups. Let simulate two distributions:

sam1 <- rnorm(50, 20, 1)
sam2 <- rnorm(50, 15, 1)

plot(density(sam1), xlim = c(10,25) ,ylim = c(0,1))
lines(density(sam2), col ="red")

You’ll notice that these distributions are bumpier, this is because we simulated fewer points. The more points we simulate the more individual outliers even out. You’re distributions will also probably look different than these, because the random points you got are different than mine! We can use the function set.seed() to fix this. I’ll use this later.

Now lets look at a t-test

t.test(sam1, sam2)

## 
##  Welch Two Sample t-test
## 
## data:  sam1 and sam2
## t = 22.737, df = 97.98, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  4.594022 5.472619
## sample estimates:
## mean of x mean of y 
##  19.86271  14.82939

This looks right! The means match what we simulated and it is VERY unlikely we obtained these differences by chance (we did simulate them to be different after all). Lets look at what happens if we simulate them to be the same.

set.seed(666)

sam1 <- rnorm(50, 15, 1)
sam2 <- rnorm(50, 15, 1)

plot(density(sam1), xlim = c(10,25) ,ylim = c(0,1))
lines(density(sam2), col ="red")

t.test(sam1, sam2)

## 
##  Welch Two Sample t-test
## 
## data:  sam1 and sam2
## t = 0.018033, df = 94.266, p-value = 0.9857
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.4071033  0.4145660
## sample estimates:
## mean of x mean of y 
##  14.93520  14.93147

This also looks right! The distributions were simulated to be the same and the t-test supports this! Remember that they are not identical because they are composed of random numbers. The more values we generate the closer they will appear. Also, notice that your distribution should look the same as mine. That is because we set the seed first so our different computers generated the same random numbers.

We can also use rnorm() to simulate relationships between variables. This is super handy for confirming that analyses are doing what you think they are. Say we want to simulate the effect of height (from the test dataset) on water use, which is a variable we will make up.

We can do something like this. Here we are specifying the relationship between water growth and height. We are saying that an increase in 3 units of plant height changes water growth by 1 unit. We use rnorm to simulate the data with and without noise. Obviously data are messy so we should simulate our data with noise (I use n=30 for rnorm because we have 30 data points).

water_growth_nonoise <- 3 * dat$height
water_growth_noise <- 3 * dat$height + rnorm(30, 0, 10)

Let’s look at the differences

plot(dat$height, water_growth_nonoise, main = "No Noise")

plot(dat$height, water_growth_noise, main = "Noise")

Let’s quickly examine if the relationship we simulated.

coef(glm(water_growth_noise ~ dat$height))

## (Intercept)  dat$height 
##  -10.792341    3.129803

It is!