Welcome to R! R is one of several programming languages used globally for data analysis across many fields of study. It can seem daunting at first, but the upfront work is totally worth it: it makes it easy to share work with collaborators and to perform tasks with large data that would take forever in Excel or Google Sheets. This intro is in no way comprehensive and is meant to give some basic code that will be handy for this course. Feel free to reference it throughout the class.
You will notice in R that there are often MANY ways to accomplish the same goal. In that spirit, some of the code below uses the base R language and some uses the tidyverse package; they will usually accomplish the same thing. That said, it can be rewarding to minimize the packages you rely on, so that you really understand what is going on.
For those interested, a great resource for learning R is “R for Data Science” by Hadley Wickham. The book can be found here: https://r4ds.had.co.nz/index.html. If you already have some experience with R, I can also recommend “Advanced R”: https://adv-r.hadley.nz/
Libraries are collections of functions and data that you can load into your scripts. When you fire up R some are already loaded, and it is possible to do a ton in R without ever deviating from these base libraries. That said, often we want to run functions that are not in base R, so we need to install packages. To do this we use the function install.packages(). Here is an example of using install.packages() to install the package “tidyverse”, an especially useful package in data science. The tidyverse contains a set of packages used for data processing, plotting, etc. Most of these tasks can also be performed with base functions, but there are usually differences in speed, memory usage, and ease of writing and reading the code.
install.packages("tidyverse") #sorry, tidyverse is kinda a lot
## Installing package into 'C:/Users/Ari/AppData/Local/R/win-library/4.2'
## (as 'lib' is unspecified)
## package 'tidyverse' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Ari\AppData\Local\Temp\Rtmp2hndhm\downloaded_packages
You only have to install a package once (although you may need to update packages occasionally). Once a package has been installed you can load it using the function library(). require() is a similar function; it returns FALSE instead of throwing an error if the package is missing.
library("tidyverse")
## Warning: package 'tidyverse' was built under R version 4.2.2
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.4
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## Warning: package 'ggplot2' was built under R version 4.2.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
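As an aside, a handy (optional) pattern uses require()’s return value to install a package only when it is missing. A minimal sketch, assuming you want the tidyverse:
#install tidyverse only if it isn't already available, then load it
if (!require("tidyverse")) {
  install.packages("tidyverse")
  library("tidyverse")
}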
The working directory is where R will look for files you want to use and where it will save files. Always make sure you know which directory you are working in when you try to access data in R. Here I check my current working directory, then set my working directory to my “Documents” folder. Set your working directory to wherever you saved the “test_data.csv” file.
Tip: you can press Tab while typing a path and it will auto-fill with options. You can also set your working directory using the “Session” dropdown menu.
getwd()
## [1] "C:/Users/Ari/Documents/scripts/Research Design labs/Lab 1"
setwd("C:/Users/Ari/Documents/")
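If you are not sure what is actually in your working directory, list.files() will tell you:
list.files() #prints the names of the files in the current working directory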
R stores different types of data in variables. We can assign data to a new variable using “<-”, “->”, and “=”
x <- 1 #arrow assignment. also works as 1 -> x
y = 2 #equals assignment
z <- (a <- 1) + 2 #compound assignment. This sets "a" equal to 1, and "z" equal to "a + 2"
A vector stores data of the same type. Since x, y, z, and a are all numbers, we can create a numeric vector using the c() function like so:
c(x,y,z,a)
## [1] 1 2 3 1
We can compare data using the operators ==, >, >=, <, <=, and !=. R uses ! to mean “not”, so != means “not equal to”, and !TRUE returns FALSE.
a == x
## [1] TRUE
a >= 1
## [1] TRUE
a <= 0
## [1] FALSE
a != x
## [1] FALSE
We can use if/else statements to run code when certain conditions are met:
if (a == x) {print("a equals x")}
## [1] "a equals x"
if (a < 0) {print("a is negative")} else {print("a is not negative")}
## [1] "a is not negative"
Loops allow us to iterate tasks. They are a little tricky to get the hang of but are SUPER useful. The best part is that you can set a loop running, go off to grab a beverage, and still feel like you are working!
The basic loops in R are “for”, “while”, and “repeat”, but “for” is by far the most commonly used.
A for loop starts off like this, looping through the vector you provide:
for (i in 1:30) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
## [1] 11
## [1] 12
## [1] 13
## [1] 14
## [1] 15
## [1] 16
## [1] 17
## [1] 18
## [1] 19
## [1] 20
## [1] 21
## [1] 22
## [1] 23
## [1] 24
## [1] 25
## [1] 26
## [1] 27
## [1] 28
## [1] 29
## [1] 30
In this example, the for loop runs through the vector 1:30, which is just 1, 2, 3, … 30. In each iteration i changes value: on the first run i = 1, on the second run i = 2, and so on. It is important to note that i is standard notation, but the loop variable can be named anything. You could say “for (FrankHerbert in 1:30)” and it would still work.
The vector you provide doesn’t have to be numbers! Check this out:
vec <- c("I have heard", "it from the mouths", "of the Edema Ruh,", "so I know it to be true")
for (i in vec) {
  print(i)
}
## [1] "I have heard"
## [1] "it from the mouths"
## [1] "of the Edema Ruh,"
## [1] "so I know it to be true"
In this case i does not equal a number, but rather the elements of the character vector.
Also note that each item of a vector can be accessed using an index number. It can be useful to loop through a character vector like this:
vec <- c("I have heard", "it from the mouths", "of the Edema Ruh,", "so I know it to be true")
for (i in 1:length(vec)) {
  print(vec[i])
}
## [1] "I have heard"
## [1] "it from the mouths"
## [1] "of the Edema Ruh,"
## [1] "so I know it to be true"
While for loops have their uses, if you need your loop to return data it’s almost always better to use one of the suite of functions centered around “apply”. These functions are often faster than for loops, but can be a little harder to write and read.
Compare this for loop, which generates a list of vectors, each containing a number and its square:
data <- list()
for (i in 1:10) {
  data[[i]] <- c(i, i^2)
}
data
## [[1]]
## [1] 1 1
##
## [[2]]
## [1] 2 4
##
## [[3]]
## [1] 3 9
##
## [[4]]
## [1] 4 16
##
## [[5]]
## [1] 5 25
##
## [[6]]
## [1] 6 36
##
## [[7]]
## [1] 7 49
##
## [[8]]
## [1] 8 64
##
## [[9]]
## [1] 9 81
##
## [[10]]
## [1] 10 100
to the function lapply() (here \(i) is shorthand for function(i), available in R 4.1 and later):
data <- lapply(1:10, \(i) c(i, i^2))
data
## [[1]]
## [1] 1 1
##
## [[2]]
## [1] 2 4
##
## [[3]]
## [1] 3 9
##
## [[4]]
## [1] 4 16
##
## [[5]]
## [1] 5 25
##
## [[6]]
## [1] 6 36
##
## [[7]]
## [1] 7 49
##
## [[8]]
## [1] 8 64
##
## [[9]]
## [1] 9 81
##
## [[10]]
## [1] 10 100
This may seem a little convoluted for this example, but when working with large data and complex functions I find it VERY handy.
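A close relative is sapply(), which works like lapply() but tries to simplify the result into a vector or matrix. A quick sketch (the object name is just an example):
#returns a plain numeric vector of squares rather than a list
squares <- sapply(1:10, \(i) i^2)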
It’s easiest to read data into R as a CSV. If you are saving data from Excel with the intention of using it in R, save it as a CSV to avoid calling extra packages. We can import data using the read.csv() function.
Like before, you can use tab completion to more easily find files; it matches file names from your working directory.
test_data <- read.csv("test_data.csv")
We need a name for the data we imported so I called it “test_data”.
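The counterpart for saving is write.csv(). A minimal sketch (the file name here is just an example):
#writes a data frame to a CSV in the working directory; row.names = FALSE drops the row index column
write.csv(test_data, "test_data_copy.csv", row.names = FALSE)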
R saves objects in different formats. The object “test_data” is a data frame, which is essentially a list of vectors of equal length, one per column. We can access those vectors using the “$” operator. Col1 is an integer vector.
test_data$Col1
## [1] 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5
## [26] 6 7 8 9 10
Col2 is a character vector.
test_data$Col2
## [1] "Species A" "Species A" "Species A" "Species A" "Species A" "Species A"
## [7] "Species A" "Species A" "Species A" "Species A" "Species B" "Species B"
## [13] "Species B" "Species B" "Species B" "Species B" "Species B" "Species B"
## [19] "Species B" "Species B" "Species C" "Species C" "Species C" "Species C"
## [25] "Species C" "Species C" "Species C" "Species C" "Species C" "Species C"
You can access elements of a vector using brackets. This will access the second element of the vector:
test_data$Col2[2]
## [1] "Species A"
These brackets also apply to the entire data.frame. Since it is two dimensional we need to use two index numbers. test_data[1,2] accesses the value in the first row and the second column.
test_data[1,2]
## [1] "Species A"
test_data[1,]
## Col1 Col2 Col3
## 1 1 Species A 49.90378
test_data[,1]
## [1] 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5
## [26] 6 7 8 9 10
Now that we have imported the data, let’s take a look at it! There are many useful ways to look at data, too many to describe here, so I have chosen a few I use all the time. These are:
colnames() (also see rownames() and names())
length()
nrow()
ncol()
unique()
head()
tail()
as.character(), as.factor(), as.numeric()
Before going any further it’s worth pointing out that most functions require input arguments, which differ for every function depending on what it does. If you ever need help with a function, put a question mark in front of it to load the help page. Tab is also useful.
?colnames
## starting httpd help server ... done
colnames()
Returns the names of all of the columns in the data.
colnames(test_data)
## [1] "Col1" "Col2" "Col3"
This function can also be used to rename columns. We do this by assigning a vector of names.
colnames(test_data) <- c("plotID", "species", "height")
colnames(test_data)
## [1] "plotID" "species" "height"
length()
Returns the number of elements in a vector. A vector is a basic data structure that contains values of the same type. The code below creates a new vector named “spec” that contains the values in the species column; length() returns the number of values in that vector.
spec <- test_data$species
length(spec)
## [1] 30
nrow() and ncol()
Return the number of rows and the number of columns.
nrow(test_data)
## [1] 30
ncol(test_data)
## [1] 3
unique()
Returns all of the unique values in a vector (or unique rows in a data.frame). For example, we can pair it with length() to determine the number of levels in a categorical variable. Let’s examine the species category.
unique(test_data$species)
## [1] "Species A" "Species B" "Species C"
length(unique(test_data$species))
## [1] 3
head() and tail()
Return the first 6 rows and the last 6 rows of a data.frame, respectively. Very handy for taking a quick look at data. You can modify the “n” argument to change the number of rows shown.
head(test_data)
## plotID species height
## 1 1 Species A 49.90378
## 2 2 Species A 47.31901
## 3 3 Species A 50.87232
## 4 4 Species A 49.41498
## 5 5 Species A 51.29793
## 6 6 Species A 50.82202
tail(test_data,n=2)
## plotID species height
## 29 9 Species C 90.90490
## 30 10 Species C 88.14087
as.character(), as.factor(), as.numeric()
These functions are used to convert columns to a different data type. Sometimes data are read in as the wrong type, and sometimes stats functions need data in a certain format. We can use these to move between types.
test_data$species <- as.character(test_data$species) #makes species a character
test_data$species <- as.factor(test_data$species) #makes it a factor again
Data wrangling is the term for reformatting your data. There are many reasons you might want to do this: maybe a particular function needs your data in a certain shape, maybe you want to remove NAs; there are too many to count. In EECB-related fields with lots of data, this is often the part of analysis that takes the most time. Here are some useful functions for data wrangling.
with(), within() and transform()
filter() and/or subset()
select()
gather() and spread()
na.omit()
with(), within() and transform()
R stores information in different environments. Right now we’re in an environment where R knows about the variable “test_data”, but doesn’t know about the columns it contains. That’s why we need to use the “$” operator to access the environment within “test_data”.
We can jump into the environment of the data.frame so we don’t have to use test_data$ to access variables:
test_data <- transform(test_data, species = as.character(species)) #this sets species to a character vector
#within allows any amount of code to be written between the curly brackets. This is VERY useful
test_data <- within(test_data, {species <- as.factor(species)})
with(test_data, aggregate(height, list(species), mean)) #returns the mean height for each species
## Group.1 x
## 1 Species A 49.83775
## 2 Species B 69.97579
## 3 Species C 90.09803
filter() and/or subset()
filter() and subset() are two ways of accomplishing the same thing: filter() is a tidyverse function, while subset() is a base R function. filter() will only work if you have loaded the library “tidyverse” or “dplyr”. Both let you look at portions of your data. For example, let’s look only at species A.
SpeciesA.subset <- subset(test_data, species == "Species A")
SpeciesA.filter <- filter(test_data, species == "Species A")
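Both functions also accept compound conditions. A sketch combining two criteria (the object name is just an example):
#keeps only the Species A rows whose height exceeds 50
tall_A <- subset(test_data, species == "Species A" & height > 50)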
select()
select() is a tidyverse function used to keep or remove columns. Let’s remove “plotID” from test_data. You can do this either by keeping the columns you want or by removing the ones you don’t want.
test_data.select <- select(test_data, species, height)
test_data.select <- select(test_data, -plotID)
The base R equivalents can be a little clunkier:
test_data.select <- test_data[c('species', 'height')]
test_data.select <- test_data[-which(names(test_data) == 'plotID')]
You can also remove columns using the brackets we used earlier. Since plotID is the first column, we can also do this:
test_data_removed <- test_data[,-1]
test_data_removed <- test_data[-1,] #would remove the first row.
gather() and spread()
Gather and spread are VERY handy for reformatting data frames. Essentially these functions transition data between “long” and “wide” forms. Long data.frames tend to have more rows and fewer columns, while wide data.frames have more columns and fewer rows. Long data.frames are nice for data wrangling and for running some models, while wide data.frames are needed for analyses like hill numbers and ordination. Let’s take a look at what these look like. The functions melt() and cast() in the reshape package are similar.
View(test_data)
These data are long (although since it is only a small amount of data it isn’t a great example). Imagine you have 300 or 3000 observations (rows). Then the term “long” would make more sense.
We can use spread() to make it wider. We want a column for each species (which is the “key”) and those columns to be filled with the height measurement (the “value” column).
test_data.spread <- spread(test_data, key = species, value = height)
head(test_data.spread)
## plotID Species A Species B Species C
## 1 1 49.90378 67.43697 90.37914
## 2 2 47.31901 70.33240 91.13306
## 3 3 50.87232 70.89353 92.75686
## 4 4 49.41498 70.39009 89.08342
## 5 5 51.29793 71.12169 89.56514
## 6 6 50.82202 71.29156 91.64644
Now “Species A”, “Species B”, and “Species C” each have their own column.
We can reverse this using gather(). We need to specify that we only want to gather columns 2:4. In gather(), key is the name for the new column created from the former column names and value is the name for the values.
test_data.gathered <- gather(test_data.spread, 2:4, key = "Species", value = "height")
head(test_data.gathered)
## plotID Species height
## 1 1 Species A 49.90378
## 2 2 Species A 47.31901
## 3 3 Species A 50.87232
## 4 4 Species A 49.41498
## 5 5 Species A 51.29793
## 6 6 Species A 50.82202
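Note that newer versions of tidyr keep gather() and spread() around but recommend pivot_longer() and pivot_wider() instead. A rough sketch of the equivalents (object names here are just examples, and the argument names differ from the older functions):
#wide: one column per species, filled with height
test_data.wide <- pivot_wider(test_data, names_from = species, values_from = height)
#long: stack the species columns back into name/value pairs
test_data.long <- pivot_longer(test_data.wide, cols = 2:4, names_to = "species", values_to = "height")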
A base R equivalent is “reshape”, although there are some differences in output.
test_data.spread <- reshape(test_data, idvar = "plotID", timevar = "species", direction = "wide")
head(test_data.spread)
## plotID height.Species A height.Species B height.Species C
## 1 1 49.90378 67.43697 90.37914
## 2 2 47.31901 70.33240 91.13306
## 3 3 50.87232 70.89353 92.75686
## 4 4 49.41498 70.39009 89.08342
## 5 5 51.29793 71.12169 89.56514
## 6 6 50.82202 71.29156 91.64644
test_data.gathered <- reshape(test_data.spread, direction = "long")
head(test_data.gathered)
## plotID species height.Species A
## 1.Species A 1 Species A 49.90378
## 2.Species A 2 Species A 47.31901
## 3.Species A 3 Species A 50.87232
## 4.Species A 4 Species A 49.41498
## 5.Species A 5 Species A 51.29793
## 6.Species A 6 Species A 50.82202
Pipes
More recent versions of R allow us to string together a chain of commands using pipes. A pipe takes input data, runs a function, then passes the output to another function, and so on. The pipe in base R is |> (available in R 4.1 and later) and the tidyverse pipe is %>%. I tend to use the base pipe, but there are some differences; see ?`%>%`.
In this example we will take “test_data”, first filter it, and then remove a column. We specify the data at the beginning of the chain, so we do not need to include it in subsequent functions. We want to keep only rows where height is greater than or equal to 50 and then remove the plotID column.
piped_dat <- test_data |>
  filter(height >= 50) |>
  select(-plotID)
head(piped_dat)
## species height
## 1 Species A 50.87232
## 2 Species A 51.29793
## 3 Species A 50.82202
## 4 Species A 50.86813
## 5 Species A 51.78562
## 6 Species B 67.43697
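For comparison, the same chain with the tidyverse pipe looks nearly identical:
#the %>% version of the pipeline above
piped_dat <- test_data %>%
  filter(height >= 50) %>%
  select(-plotID)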
group_by() and summarise()
These functions are used in conjunction to summarize data based on a grouping variable, and they are best used with pipes. There are no base R functions that perfectly mirror what they do (group_by() can also be dramatically faster than naive base R equivalents), but I find ave() and aggregate() to be useful alternatives. In the example below we calculate the mean and standard deviation of height in each group, making the new columns “av” and “std_dev”.
summary_data <- test_data |>
  group_by(species) |>
  summarise(av = mean(height), std_dev = sd(height))
head(summary_data)
## # A tibble: 3 × 3
## species av std_dev
## <fct> <dbl> <dbl>
## 1 Species A 49.8 1.57
## 2 Species B 70.0 1.36
## 3 Species C 90.1 1.54
Here you can see that a new data frame is created. It has a species column, a column with the mean height of each species and a column for the standard deviation of height in each species.
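For comparison, a base R sketch of the same summary using aggregate()’s formula interface (the output format differs a bit):
#mean and standard deviation of height for each species
aggregate(height ~ species, data = test_data, FUN = mean)
aggregate(height ~ species, data = test_data, FUN = sd)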
na.omit()
na.omit() is used to remove NAs that appear in your data. It is important to know that it works by row: if an NA appears in any column of a row, that entire row is removed. Sometimes this is totally fine; other times you may only care about NAs in a particular column.
test_data[2,2] <- NA #let's add a single NA value to test_data
head(test_data)
## plotID species height
## 1 1 Species A 49.90378
## 2 2 <NA> 47.31901
## 3 3 Species A 50.87232
## 4 4 Species A 49.41498
## 5 5 Species A 51.29793
## 6 6 Species A 50.82202
head(na.omit(test_data))
## plotID species height
## 1 1 Species A 49.90378
## 3 3 Species A 50.87232
## 4 4 Species A 49.41498
## 5 5 Species A 51.29793
## 6 6 Species A 50.82202
## 7 7 Species A 50.86813
Notice the entire second row has been removed to get rid of the NA value.
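If you only want to drop rows with an NA in one particular column, a sketch using is.na() (the object name is just an example):
#keeps rows where species is not NA, ignoring NAs in other columns
test_data_clean <- test_data[!is.na(test_data$species), ]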
R is also super handy for simulating data! Simulating data is an important part of statistics. To make sure your code and analyses make sense, it is important to simulate data first and check: if you know the “answer”, which you generate by simulating the data, your statistical test should reflect it. If it doesn’t, something is wrong.
For consistent results in this class I am setting the random seed to 100. This will tell R to generate the same random numbers across different computers.
set.seed(100)
When simulating data in R, you simulate from a distribution, and there are many types of distributions. Distributions can be grouped into continuous and discrete: the normal and gamma distributions are continuous because they generate continuous values, while the Poisson and binomial distributions are discrete because they generate whole numbers. The wiki pages for these are pretty good!
Let’s focus on the normal distribution because it is the easiest.
We can simulate random data using the rnorm() function, which simulates points from a normal distribution. The first argument is the number of points, the second is the mean of the distribution, and the third is the standard deviation.
Here is a distribution with 1000 points, a mean of 10, and a standard deviation of 2.
rand_points <- rnorm(1000, 10, 2)
plot(density(rand_points))
Let’s look at the process of simulating data to check a statistical test. The t-test compares the distributions of two groups. Let’s simulate two distributions:
samples_high <- rnorm(50, 20, 1)
samples_low <- rnorm(50, 15, 1)
plot(density(samples_high), xlim = c(10,25) ,ylim = c(0,1))
lines(density(samples_low), col ="red")
You’ll notice that these distributions are bumpier; this is because we simulated fewer points. The more points we simulate, the more individual outliers even out.
Now let’s look at a t-test:
t.test(samples_high, samples_low)
##
## Welch Two Sample t-test
##
## data: samples_high and samples_low
## t = 25.812, df = 97.46, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 4.730324 5.518294
## sample estimates:
## mean of x mean of y
## 20.14918 15.02487
This looks right! The means match what we simulated, and it is VERY unlikely we obtained these differences by chance (we did simulate them to be different, after all). Let’s look at what happens if we simulate them to be the same.
samples1 <- rnorm(50, 15, 1)
samples2 <- rnorm(50, 15, 1)
plot(density(samples1), xlim = c(10,25) ,ylim = c(0,1))
lines(density(samples2), col ="red")
t.test(samples1, samples2)
##
## Welch Two Sample t-test
##
## data: samples1 and samples2
## t = 0.30385, df = 92.446, p-value = 0.7619
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.3287644 0.4475381
## sample estimates:
## mean of x mean of y
## 15.03307 14.97369
This also looks right! The distributions were simulated to be the same and the t-test supports this! Remember that they are not identical because they are composed of random numbers; the more values we generate, the closer they will appear. Also notice that your distributions should look the same as mine. That is because we set the seed first, so our different computers generated the same random numbers.
We can also use rnorm() to simulate relationships between variables. This is super handy for confirming that analyses are doing what you think they are. Say we want to simulate the effect of height (from the test dataset) on water use, which is a variable we will make up.
We can do something like this. Here we are specifying the relationship between water use and height: an increase of 1 unit in plant height changes water use by 3 units. We use rnorm() to simulate the data with and without noise. Real data are messy, so we should simulate ours with noise (I use n = 30 for rnorm because we have 30 data points).
water_use_nonoise <- 3 * test_data$height
water_use_noise <- 3 * test_data$height + rnorm(30, 0, 10)
Let’s look at the differences
plot(test_data$height, water_use_nonoise, main = "No Noise")
plot(test_data$height, water_use_noise, main = "Noise")
Let’s quickly examine whether we recover the relationship we simulated.
model <- glm(water_use_noise ~ test_data$height)
summary(model)
##
## Call:
## glm(formula = water_use_noise ~ test_data$height)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -26.753 -7.837 -2.323 9.552 24.966
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.0174 9.6607 -0.105 0.917
## test_data$height 3.0076 0.1344 22.381 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 147.4461)
##
## Null deviance: 77984.1 on 29 degrees of freedom
## Residual deviance: 4128.5 on 28 degrees of freedom
## AIC: 238.87
##
## Number of Fisher Scoring iterations: 2
coef(model)
## (Intercept) test_data$height
## -1.017400 3.007614
The estimated slope (~3.01) matches the value of 3 that we simulated.
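As a final sanity check (a quick sketch), fitting the noiseless version recovers the simulated coefficients essentially exactly, since there is no error for the model to absorb:
#with no noise the fit is exact: intercept ~0 and slope ~3
coef(glm(water_use_nonoise ~ test_data$height))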