R provides some shortcuts that can save you days of your life. Excel is a wonderful tool, but sometimes you need more.
A few examples of what you can do in R for data management:
Find the means (or other summary stats) of groups in your data (e.g. average plant height among your treatments).
Rearrange data, including quickly making matrices (like for ordination).
Subset data (in many sensible ways).
Sometimes you need to find means, medians, standard deviations, standard errors, etc. for groups within your data. This can take considerable time to do by hand or in Excel, and mistakes are easy to make.
Never fear, the package plyr is here! (There are related packages that you should check out too, such as dplyr and the other packages in the tidyverse.)
We will once again use the iris dataset. Below, I view the data and then add another factor variable (just for use in this example).
View(iris) #view data
v<-c("wet","dry","mesic") #makes object 'v' that is 3 text strings (wet, dry, and mesic)
iris$habitat<-v #makes a new variable 'habitat' in the iris dataset. The values are wet, dry, mesic repeating down the column
iris$habitat<-as.factor(iris$habitat) #ensures that the variable 'habitat' is read as a factor
View(iris) #View the data once again to see the new variable
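The assignment above relies on R recycling the three-element vector v down all 150 rows of iris, which only works because 150 is a multiple of 3. Here is a minimal sketch of the same step with the recycling made explicit. The copy iris_demo is a made-up name, used so the original data are left untouched:

```r
# Work on a copy so the built-in iris data stay unchanged
iris_demo <- iris

v <- c("wet", "dry", "mesic")

# rep() with length.out repeats v until every row has a value
iris_demo$habitat <- factor(rep(v, length.out = nrow(iris_demo)))

# Count how many rows fall in each habitat
table(iris_demo$habitat)
```

With your own data, rep(..., length.out = nrow(...)) also works when the number of rows is not an exact multiple of the vector length.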
OK, so let’s do it. We first load the package plyr. (Use install.packages("plyr") to install it the first time; once it is installed on your computer, you do not need to run install.packages() for plyr again.)
library(plyr)
## Warning: package 'plyr' was built under R version 3.4.4
Now, let’s find the mean sepal width for each species. In the function ddply, we tell R that the data is iris, that the grouping variable is Species, that we want to summarise (the function uses the British spelling), and that we want to make a new variable called mean that is the average sepal width of each group.
meanBySpecies<- ddply(iris, c("Species"), summarise,
mean = mean(Sepal.Width))
meanBySpecies
## Species mean
## 1 setosa 3.428
## 2 versicolor 2.770
## 3 virginica 2.974
We just made a new dataframe called meanBySpecies using the function ddply. Within this new dataframe, we created a variable called mean, the mean of Sepal Width for each species.
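If you would rather not load a package for a simple summary, base R can produce the same table. This is a sketch of one alternative using aggregate(), not the plyr approach used in this tutorial. The object name meanBySpeciesBase is made up for the example:

```r
# Mean sepal width for each species, using only base R
meanBySpeciesBase <- aggregate(Sepal.Width ~ Species, data = iris, FUN = mean)
meanBySpeciesBase
```

The formula reads as "Sepal.Width grouped by Species", so the grouping variable goes on the right of the ~.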
While the example above is useful, you will likely want to do a more complicated summary, such as grouping by multiple variables and producing multiple summary stats. So let’s do it!
summbySppHab<- ddply(iris, c("Species","habitat"), summarise, N = length(Sepal.Width),
mean = mean(Sepal.Width), min=min(Sepal.Width),max=max(Sepal.Width), var=var(Sepal.Width),
sd = sd(Sepal.Width), se = sd / sqrt(N))
summbySppHab
## Species habitat N mean min max var sd se
## 1 setosa dry 17 3.447059 3.0 3.9 0.08389706 0.2896499 0.07025042
## 2 setosa mesic 16 3.362500 2.3 4.1 0.20250000 0.4500000 0.11250000
## 3 setosa wet 17 3.470588 3.0 4.4 0.15970588 0.3996322 0.09692504
## 4 versicolor dry 16 2.906250 2.6 3.4 0.04329167 0.2080665 0.05201662
## 5 versicolor mesic 17 2.735294 2.2 3.3 0.12367647 0.3516767 0.08529412
## 6 versicolor wet 17 2.676471 2.0 3.2 0.10816176 0.3288796 0.07976501
## 7 virginica dry 17 3.023529 2.5 3.6 0.09566176 0.3092924 0.07501442
## 8 virginica mesic 17 2.917647 2.2 3.8 0.13404412 0.3661204 0.08879723
## 9 virginica wet 16 2.981250 2.5 3.8 0.08829167 0.2971391 0.07428478
For each combination of species and habitat, we produced a bunch of summary stats: N (sample size in each group), mean, minimum, maximum, variance, standard deviation, and standard error.
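The standard error in the last column is the standard deviation divided by the square root of the sample size. A small helper makes that arithmetic explicit. Here std_error is a made-up name, and dropping NAs before counting is an assumption you may want to change for your own data:

```r
# Standard error of the mean: sd / sqrt(n), ignoring missing values
std_error <- function(x) {
  x <- x[!is.na(x)]          # length() would otherwise count NAs
  sd(x) / sqrt(length(x))
}

# Example: standard error of sepal width for setosa
setosa_width <- iris$Sepal.Width[iris$Species == "setosa"]
std_error(setosa_width)
```

The iris data have no missing values, but with field data the NA step keeps N from being inflated by empty cells.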
From here, you can use this new dataframe as you need: run your stats on it, export it to another program, graph these summaries, etc.
If you do want to export your data, this is easy. You’ll use write.csv() to export the data to your working directory, the folder that R reads from and writes to. Remember to use setwd() to set your working directory before you export. With your own data, you likely did this at the top of your script, during set-up.
write.csv(summbySppHab,"summbySppHab.csv")
Check your working directory and see if a csv called summbySppHab has appeared. If not, close R (after saving), re-open it, check your working directory, and re-run the code.
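Two details can save some confusion when exporting: getwd() shows which folder R is currently writing to, and row.names = FALSE keeps write.csv() from adding an extra column of row numbers. Here is a minimal sketch. It writes to a temporary folder only so the example does not clutter your working directory, and the object and file names are made up:

```r
# Show the current working directory
getwd()

# A small summary table to export
speciesMeans <- aggregate(Sepal.Width ~ Species, data = iris, FUN = mean)

# Write to a temporary folder; for real work, use a file name in your working directory
outFile <- file.path(tempdir(), "speciesMeans.csv")
write.csv(speciesMeans, outFile, row.names = FALSE)

# Confirm the file exists and read it back in
file.exists(outFile)
checkMeans <- read.csv(outFile)
checkMeans
```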
It is very time consuming to rearrange data by hand in Excel. R has many options to help you out.
For many statistical methods, you need a matrix. The package reshape2 offers some easy ways to make one.
Often in plant ecology, we want to produce a summary among plots and species. In this case, let’s find the average sepal length for each species in each plot.
But first we need to make a plot variable! Let’s do this quickly with the code below. As usual, this isn’t something that you will need to do, we are just doing this for the example.
plotIDs<-c("Plot1","Plot2","Plot3","Plot4","Plot5")
iris$plotID<-plotIDs
In the first line below, we load the package reshape2. In the second, we create the matrix, called mat, using the function dcast. Within that function, we tell R that the data is iris, that the rows are plots and the columns are species (plotID~Species). With value.var, we tell R that we want the “cells” of the matrix filled with Sepal Length (i.e. this is the variable of interest in the matrix). Finally, we tell R that we want the mean of sepal length.
library(reshape2)
mat <- dcast(iris, plotID~Species,value.var= 'Sepal.Length',mean)
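For comparison, the same plot-by-species table of means can be built in base R with tapply(). This is only a sketch of an alternative, not what dcast does internally. The names iris_demo and matBase are made up for the example:

```r
# Copy iris and add the example plot variable
iris_demo <- iris
iris_demo$plotID <- rep(paste0("Plot", 1:5), length.out = nrow(iris_demo))

# Rows are plots, columns are species, cells are mean sepal length
matBase <- tapply(iris_demo$Sepal.Length,
                  list(iris_demo$plotID, iris_demo$Species),
                  mean)
matBase
```

tapply() returns a true matrix with the plot IDs already set as row names, so no clean-up step is needed for this version.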
Now we do a few clean-up things to make this a nice dataframe to work with. First, we use the function as.data.frame to, you guessed it, make sure our matrix is a dataframe. Next, we replace any NAs with 0s. This is important sometimes when working with species abundance/presence matrices, but think about whether you want NAs or 0s for missing data in your dataset; it is not always appropriate to coerce NAs to 0. Next, we assign the row names, in this case from the variable plotID. Then we drop the variable plotID, because those values are now our row names. In the last line of code, we view the result.
Note: you cannot have duplicate row names; each plot needs to be unique.
mat <- as.data.frame(mat)
mat[is.na(mat)] <- 0
row.names(mat)<-mat$plotID #This makes the row names the same as plot ID, especially useful for plot IDs for ordinations
mat<-mat[-1] #dropping first column (we already made it the row names)
mat
## setosa versicolor virginica
## Plot1 5.16 5.96 6.94
## Plot2 5.03 6.11 6.28
## Plot3 4.85 6.07 6.78
## Plot4 4.95 5.82 6.44
## Plot5 5.04 5.72 6.50
Now we have a matrix showing the mean sepal length for each species in every plot, which is useful for making tables and some statistical analyses.
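To see why the choice between NA and 0 matters, here is a small made-up example (the data frame and its values are invented purely for illustration). Replacing NA with 0 changes any summary you calculate afterwards:

```r
# Invented example: two plots, two species, with missing measurements
m <- data.frame(plotID = c("Plot1", "Plot2"),
                sppA = c(2, NA),
                sppB = c(NA, 4))

# Copy the data and coerce the NAs to 0
m0 <- m
m0[is.na(m0)] <- 0

mean(m$sppA, na.rm = TRUE)  # NA treated as "not measured"
mean(m0$sppA)               # NA treated as a true zero
```

For a table of mean measurements like mat, a 0 would claim a sepal length of zero, so NA is usually the more honest value. For counts of individuals, 0 is often what you mean.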
Use the code below to export your data into your working directory. Use setwd(), or the directions in the Getting Started With R tutorial, to set your working directory.
write.csv(mat,"sepalLengthMatrix.csv")
Often we need to make a matrix of which species are in what plots, and how many of them are there. This is your good old species matrix - needed for ordinations, cluster analysis, and other fun things.
This is a very similar process to the example above, so I’ll breeze through the code and only let you know when things are different.
We will make a species abundance matrix that shows the abundance of each species for each plot.
I have included the code below to make the plot variable again. This is just example data.
plotIDs<-c("Plot1","Plot2","Plot3","Plot4","Plot5")
iris$plotID<-plotIDs
We call the package reshape2 again to make this matrix, and we use the function dcast once again. We specify that the data is iris and that we want plotID to be the rows and Species to be the columns (plotID~Species). We set value.var to sepal length again (I’ll explain why in a moment). We do not include an aggregation function like we did last time. Let’s run it and I’ll explain.
library(reshape2)
sppAbun <- dcast(iris, plotID~Species,value.var= 'Sepal.Length')
## Aggregation function missing: defaulting to length
So we run it and we get some text that says “Aggregation function missing: defaulting to length”. Sounds like we screwed up, right? Wrong!
This is perfection! Remember that ‘length’ in R means the number of observations. That’s exactly what we want: the number of observations of each species (the abundance) in each plot. But why set value.var to sepal length? Because you have to set it to something. You could have set it to any variable, as long as there is no missing data in that variable. We just need a count, so any variable will do.
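Because all we need is a count, base R's table() gives the same numbers without choosing a value.var at all. Here is a sketch for comparison, using a made-up copy called iris_demo:

```r
# Copy iris and add the example plot variable
iris_demo <- iris
iris_demo$plotID <- rep(paste0("Plot", 1:5), length.out = nrow(iris_demo))

# Cross-tabulate plots against species to count observations
abunTable <- table(iris_demo$plotID, iris_demo$Species)
abunTable

# Convert to a dataframe with plots as row names
sppAbunBase <- as.data.frame.matrix(abunTable)
sppAbunBase
```

Combinations that never occur show up as 0 in the table, so there are no NAs to replace in this version.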
We once again clean up the matrix.
sppAbun <- as.data.frame(sppAbun)
sppAbun[is.na(sppAbun)] <- 0
row.names(sppAbun)<-sppAbun$plotID #This makes the row names the same as plot ID, especially useful for plot IDs for ordinations
sppAbun<-sppAbun[-1] #dropping first column (we already made it the row names)
sppAbun
## setosa versicolor virginica
## Plot1 10 10 10
## Plot2 10 10 10
## Plot3 10 10 10
## Plot4 10 10 10
## Plot5 10 10 10
What we get is pretty boring: there are 10 individuals of each species in each plot. But you can use this code to get your species abundances, which should be more interesting (unless your community is crazy even!).
Often we need a presence-absence matrix, not an abundance matrix though.
It is easy to convert an abundance matrix to a presence-absence matrix. Use the code below - it makes any number greater than 0 into a 1. This way, if the species was observed there, it’s a 1; if the species was absent, it’s a 0.
We see that every species was present in every plot.
sppAbun[sppAbun > 0] <- 1 #converts from abundance to P/A
sppAbun
## setosa versicolor virginica
## Plot1 1 1 1
## Plot2 1 1 1
## Plot3 1 1 1
## Plot4 1 1 1
## Plot5 1 1 1
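One thing to watch: the conversion above overwrites sppAbun, so the abundances are gone afterwards. Here is a minimal sketch of keeping both versions, using an invented two-plot matrix (abun and pa are made-up names):

```r
# Invented abundance matrix: two plots, two species
abun <- data.frame(sppA = c(3, 0),
                   sppB = c(0, 12),
                   row.names = c("Plot1", "Plot2"))

# Copy first, then convert the copy to presence/absence
pa <- abun
pa[pa > 0] <- 1

abun  # abundances are still available
pa    # 1 = present, 0 = absent
```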
R is amazing at splitting datasets up in lots of useful ways. I’m incredibly passionate about how good R is at this. If you want to share in my passion, try out the methods below and also check out this [subset link](https://www.statmethods.net/management/subset.html). Much of this tutorial is stolen directly from this link.
Let’s first use a simple index to make a new dataframe that only keeps the first variable. We get a dataset that is just the variable Sepal Length. Not very useful.
iris1 <- iris[1]
Let’s now make a new dataframe that keeps the 1st and 5th through 7th variables. This creates a dataframe with sepal length, species, habitat, and plot.
iris2<- iris[c(1,5:7)]
Drop the 3rd and 5th variables. Sometimes it is easier to exclude a variable or 2.
iris4 <- iris[c(-3,-5)]
Now let’s only include some records, keeping the first 100. Notice the new comma after the parentheses; that is what tells R we are dealing with rows, not columns. Generally, inside brackets, rows go before the comma and columns after. If there is no comma, R assumes you are indexing columns.
iris3<- iris2[c(1:100),]
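Index numbers break if the column order ever changes, so it is often safer to index by name. Here is a short sketch of the same row-and-column logic, with rows before the comma and columns after (irisSmall is a made-up name):

```r
# First 100 rows, and only the named columns
irisSmall <- iris[1:100, c("Sepal.Length", "Species")]

dim(irisSmall)   # number of rows, then number of columns
head(irisSmall)
```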
Now let’s drop the first 100 records (using the - within the brackets).
iris5<- iris2[-c(1:100),]
Counting rows can become difficult in large datasets, so there are other ways to subset data in R.
Select all I. virginica records
virginica<- subset(iris, Species=="virginica")
Select all I. virginica with a sepal length greater than 5 cm.
virginica_Greater5<- subset(iris, Species=="virginica" & Sepal.Length > 5)
Select all I. virginica and I. setosa records. In R, the “|” means OR - so any record that is virginica or setosa is kept.
virginicaSetosa<- subset(iris, Species=="virginica" | Species=="setosa")
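When the list of species to keep gets longer, chaining “|” becomes unwieldy. The %in% operator does the same job in one condition. A sketch (keepSpp and virginicaSetosa2 are names invented for the example):

```r
# Species to keep
keepSpp <- c("virginica", "setosa")

# Keep any record whose species appears in keepSpp
virginicaSetosa2 <- subset(iris, Species %in% keepSpp)
table(virginicaSetosa2$Species)
```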
Drop I. virginica records
noVirginica<- subset(iris, Species!="virginica")
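After dropping a species, R still remembers it as a factor level, which can leave empty groups in later tables and plots. droplevels() removes the unused levels. A minimal sketch:

```r
noVirginica <- subset(iris, Species != "virginica")
levels(noVirginica$Species)       # virginica is still listed as a level

noVirginica <- droplevels(noVirginica)
levels(noVirginica$Species)       # only the species that remain
```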
Use install.packages("plyr") to install it the first time, and then never use install.packages() for plyr again. Once per computer is all that you need.
library(plyr)
meanBySpecies<- ddply(iris, c("Species"), summarise,
mean = mean(Sepal.Width))
meanBySpecies
## Species mean
## 1 setosa 3.428
## 2 versicolor 2.770
## 3 virginica 2.974
The top 3 lines are only needed to run the given example. If you are doing so, remove the #s from before those three lines.
#v<-c("wet","dry","mesic") #makes object 'v' that is 3 text strings (wet, dry, and mesic)
#iris$habitat<-v #makes a new variable 'habitat' in the iris dataset. The values are wet, dry, mesic repeating down the column
#iris$habitat<-as.factor(iris$habitat) #ensures that the variable 'habitat' is read as a factor
summbySppHab<- ddply(iris, c("Species","habitat"), summarise, N = length(Sepal.Width),
mean = mean(Sepal.Width), min=min(Sepal.Width),max=max(Sepal.Width), var=var(Sepal.Width),
sd = sd(Sepal.Width), se = sd / sqrt(N))
summbySppHab
## Species habitat N mean min max var sd se
## 1 setosa dry 17 3.447059 3.0 3.9 0.08389706 0.2896499 0.07025042
## 2 setosa mesic 16 3.362500 2.3 4.1 0.20250000 0.4500000 0.11250000
## 3 setosa wet 17 3.470588 3.0 4.4 0.15970588 0.3996322 0.09692504
## 4 versicolor dry 16 2.906250 2.6 3.4 0.04329167 0.2080665 0.05201662
## 5 versicolor mesic 17 2.735294 2.2 3.3 0.12367647 0.3516767 0.08529412
## 6 versicolor wet 17 2.676471 2.0 3.2 0.10816176 0.3288796 0.07976501
## 7 virginica dry 17 3.023529 2.5 3.6 0.09566176 0.3092924 0.07501442
## 8 virginica mesic 17 2.917647 2.2 3.8 0.13404412 0.3661204 0.08879723
## 9 virginica wet 16 2.981250 2.5 3.8 0.08829167 0.2971391 0.07428478
Remember to use setwd() to set your working directory before you export. With your own data, you likely did this at the beginning. If you have questions, see the Getting Started in R tutorial.
write.csv(summbySppHab,"summbySppHab.csv")
Check your working directory and see if a csv called summbySppHab has appeared. If not, close R (after saving), re-open it, check your working directory, and re-run the code.
It is very time consuming to rearrange data by hand in Excel. R has many options to help you out.
The first 2 lines of code are only needed to run the given example. Remove the #s from before the lines.
#plotIDs<-c("Plot1","Plot2","Plot3","Plot4","Plot5")
#iris$plotID<-plotIDs
library(reshape2)
mat <- dcast(iris, plotID~Species,value.var= 'Sepal.Length',mean)
Note: you cannot have duplicate row names, each plot needs to be unique.
mat <- as.data.frame(mat)
mat[is.na(mat)] <- 0
row.names(mat)<-mat$plotID #This makes the row names the same as plot ID, especially useful for plot IDs for ordinations
mat<-mat[-1] #dropping first column (we already made it the row names)
mat
## setosa versicolor virginica
## Plot1 5.16 5.96 6.94
## Plot2 5.03 6.11 6.28
## Plot3 4.85 6.07 6.78
## Plot4 4.95 5.82 6.44
## Plot5 5.04 5.72 6.50
Use setwd(), or the directions in the Getting Started With R tutorial, to set your working directory.
write.csv(mat,"sepalLengthMatrix.csv")
The first 2 lines of code are only needed to run the given example. Remove the #s from before the lines.
#plotIDs<-c("Plot1","Plot2","Plot3","Plot4","Plot5")
#iris$plotID<-plotIDs
library(reshape2)
sppAbun <- dcast(iris, plotID~Species,value.var= 'Sepal.Length')
## Aggregation function missing: defaulting to length
sppAbun <- as.data.frame(sppAbun)
sppAbun[is.na(sppAbun)] <- 0
row.names(sppAbun)<-sppAbun$plotID #This makes the row names the same as plots IDs, especially useful for plot IDs for ordinations
sppAbun<-sppAbun[-1] #dropping first column (we already made it the row names)
sppAbun
## setosa versicolor virginica
## Plot1 10 10 10
## Plot2 10 10 10
## Plot3 10 10 10
## Plot4 10 10 10
## Plot5 10 10 10
sppAbun[sppAbun > 0] <- 1 #converts from abundance to P/A
sppAbun
## setosa versicolor virginica
## Plot1 1 1 1
## Plot2 1 1 1
## Plot3 1 1 1
## Plot4 1 1 1
## Plot5 1 1 1
Keep only first variable.
iris1 <- iris[1]
Keep 1st and 3rd through 5th variables.
iris2<- iris[c(1,3:5)]
Drop the 3rd and 5th variables.
iris4 <- iris[c(-3,-5)]
Keep first 100 records.
iris3<- iris2[c(1:100),]
Drop first 100 records.
iris5<- iris2[-c(1:100),]
Select all I. virginica records
virginica<- subset(iris, Species=="virginica")
Select all I. virginica with a sepal length greater than 5 cm.
virginica_Greater5<- subset(iris, Species=="virginica" & Sepal.Length > 5)
Select all I. virginica and I. setosa records. In R, the “|” means OR - so any record that is virginica or setosa is kept.
virginicaSetosa<- subset(iris, Species=="virginica" | Species=="setosa")
Drop I. virginica records
noVirginica<- subset(iris, Species!="virginica")