Basic Data Management in R

R provides shortcuts that can save you days of your life. Excel is a wonderful tool, but sometimes you need more.

A few examples of what you can do in R for data management:

Find the means (or other summary stats) of groups in your data (e.g. average plant height among your treatments).
Rearrange data, including quickly making matrices (e.g. for ordination).
Subset data (in many sensible ways).

Summarizing data

Sometimes you need to find means, medians, standard deviations, standard errors, etc. for groups within your data. This can take considerable time to do by hand or in Excel, and mistakes are easy to make.

Never fear, the package plyr is here! (There are related packages that you should check out too, such as dplyr and the other packages that make up the tidyverse.)

Setting up data for this example

We will once again use the iris dataset. Below, I view the data and then add another factor variable (just for use in this example).

View(iris) #view data

v<-c("wet","dry","mesic") #makes object 'v' that is 3 text strings (wet, dry, and mesic)
iris$habitat<-v #makes a new variable 'habitat' in the iris dataset. The values are wet, dry, mesic repeating down the column
iris$habitat<-as.factor(iris$habitat) #ensures that the variable 'habitat' is read as a factor
View(iris) #View the data once again to see the new variable
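The assignment above works because R recycles the three habitat values down all 150 rows (150 is a multiple of 3). As a minimal sketch, the same thing can be written with rep() so the recycling is explicit. The names irisDemo and habitatCounts are made up for this illustration and are not part of the tutorial's workflow.

```r
# Work on a copy so the built-in iris dataset is left untouched
irisDemo <- iris   # 'irisDemo' is a hypothetical name used only in this sketch

v <- c("wet", "dry", "mesic")

# rep() repeats v until it is as long as the data, making the recycling explicit
irisDemo$habitat <- factor(rep(v, length.out = nrow(irisDemo)))

# Count how many records fall into each species/habitat combination
habitatCounts <- table(irisDemo$Species, irisDemo$habitat)
habitatCounts
```

A quick table() like this is a cheap way to confirm that a new grouping variable was filled in the way you expected.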

Simple Summary - Averages by one grouping variable

OK, so let’s do it. We first load the package plyr. (Use install.packages("plyr") to install it the first time. You only need to install a package once per computer, so you never need to run install.packages() for plyr again.)

library(plyr)

## Warning: package 'plyr' was built under R version 3.4.4

Now, let’s find the mean sepal width for each species. In the function ddply, we tell R that the data is iris, the grouping variable is Species, we want to summarise (note the British spelling), and we want to make a new variable called mean that is the average sepal width of each group.

meanBySpecies<- ddply(iris, c("Species"), summarise,
               mean = mean(Sepal.Width))
meanBySpecies

##      Species  mean
## 1     setosa 3.428
## 2 versicolor 2.770
## 3  virginica 2.974

We just made a new dataframe called meanBySpecies using the function ddply. Within this new dataframe, we created a variable called mean, the mean of Sepal Width for each species.
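If plyr is not available, the same one-variable summary can be sketched in base R with aggregate(). This is an alternative, not the method used in the rest of this tutorial, and meanBySpeciesBase is a name made up for the example.

```r
# Base R alternative: mean sepal width for each species
# The formula reads "Sepal.Width grouped by Species"
meanBySpeciesBase <- aggregate(Sepal.Width ~ Species, data = iris, FUN = mean)
meanBySpeciesBase
```

Note that aggregate() keeps the original column name (Sepal.Width) for the summarised values instead of letting you name the new variable inside the call.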

More complex summary

While the example above is useful, you will likely want a more complicated summary, such as one with multiple grouping variables and multiple summary stats. So let’s do it!

summbySppHab <- ddply(iris, c("Species", "habitat"), summarise,
                      N = length(Sepal.Width), mean = mean(Sepal.Width),
                      min = min(Sepal.Width), max = max(Sepal.Width),
                      var = var(Sepal.Width), sd = sd(Sepal.Width),
                      se = sd / sqrt(N))
summbySppHab

##      Species habitat  N     mean min max        var        sd         se
## 1     setosa     dry 17 3.447059 3.0 3.9 0.08389706 0.2896499 0.07025042
## 2     setosa   mesic 16 3.362500 2.3 4.1 0.20250000 0.4500000 0.11250000
## 3     setosa     wet 17 3.470588 3.0 4.4 0.15970588 0.3996322 0.09692504
## 4 versicolor     dry 16 2.906250 2.6 3.4 0.04329167 0.2080665 0.05201662
## 5 versicolor   mesic 17 2.735294 2.2 3.3 0.12367647 0.3516767 0.08529412
## 6 versicolor     wet 17 2.676471 2.0 3.2 0.10816176 0.3288796 0.07976501
## 7  virginica     dry 17 3.023529 2.5 3.6 0.09566176 0.3092924 0.07501442
## 8  virginica   mesic 17 2.917647 2.2 3.8 0.13404412 0.3661204 0.08879723
## 9  virginica     wet 16 2.981250 2.5 3.8 0.08829167 0.2971391 0.07428478

For each combination of species and habitat, we produced a set of summary stats: N (sample size in each group), mean, minimum, maximum, variance, standard deviation, and standard error.

From here, you can use this new dataframe as you need - you can run your stats on it, export it to another program, graph using these summaries etc.
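The standard error in the summary is the standard deviation divided by the square root of the sample size (se = sd / sqrt(N)). As a rough check, here is a sketch that recomputes it by hand for one group, setosa in the dry habitat. The objects irisDemo and setosaDry are made up for this illustration, and the habitat variable is rebuilt so the block runs on its own.

```r
# Rebuild the example habitat variable on a copy of iris
irisDemo <- iris
irisDemo$habitat <- factor(rep(c("wet", "dry", "mesic"), length.out = nrow(irisDemo)))

# Pull out one group: setosa records in the dry habitat
setosaDry <- subset(irisDemo, Species == "setosa" & habitat == "dry")

n  <- nrow(setosaDry)             # sample size
s  <- sd(setosaDry$Sepal.Width)   # standard deviation
se <- s / sqrt(n)                 # standard error of the mean

c(N = n, sd = s, se = se)
```

These values should match the first row of the summary table above.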

Exporting your data

If you do want to export your data, this is easy. You’ll use write.csv() to export the data to your working directory, the folder that R reads from and writes to. Remember to use setwd() to set your working directory before you export. With your own data, you likely did this at the top, during set-up.

write.csv(summbySppHab,"summbySppHab.csv")

Check your working directory and see if a csv called summbySppHab has appeared. If not, close R (after saving), re-open it, check your working directory, and re-run the code.
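You can also check the export from inside R instead of browsing to the folder. The sketch below writes a small summary table to a temporary folder and reads it back in. The names summaryDemo, outFile and checkDemo are made up for the example, and tempdir() is used only so the sketch does not clutter your project folder.

```r
# A small data frame standing in for a summary table
summaryDemo <- aggregate(Sepal.Width ~ Species, data = iris, FUN = mean)

# With real data you would use a file name inside your working directory
# (getwd() shows where that is); tempdir() is just for this sketch
outFile <- file.path(tempdir(), "summaryDemo.csv")

# row.names = FALSE leaves out the 1, 2, 3... row numbers
write.csv(summaryDemo, outFile, row.names = FALSE)

file.exists(outFile)             # TRUE if the file was written
checkDemo <- read.csv(outFile)   # read it back in to confirm the contents
checkDemo
```

By default write.csv() writes the row names as an unnamed first column. That is handy for a matrix with plot IDs as row names, but usually unwanted for a summary table like this one.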

Rearrange data

It is very time consuming to rearrange data by hand in Excel. R has many options to help you out.

Summary matrix

For many statistical methods, you need a matrix. The package reshape2 offers some easy ways to do it.

Often in plant ecology, we want to produce a summary among plots and species. In this case, let’s find the average sepal length for each species in each plot.

But first we need to make a plot variable! Let’s do this quickly with the code below. As usual, this isn’t something that you will need to do, we are just doing this for the example.

plotIDs<-c("Plot1","Plot2","Plot3","Plot4","Plot5")
iris$plotID<-plotIDs

In the first line below, we call the package reshape2. In the second, we create the matrix, called mat, using the function dcast. Within that function, we tell R that the data is iris, that the rows are plots and the columns are species (plotID~Species). With value.var, we tell R that we want the “cells” of the matrix filled with Sepal Length (i.e. this is the variable of interest in the matrix). Finally, we tell R that we want the mean of sepal length.

Making a matrix

library(reshape2)
mat <- dcast(iris, plotID~Species,value.var= 'Sepal.Length',mean)
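To see what dcast is doing here, it can help to build the same plot-by-species table in base R. This is only a sketch of an equivalent approach using tapply(), not part of the reshape2 workflow. The names irisDemo and matBase are made up, and the plot variable is rebuilt so the block runs on its own.

```r
# Rebuild the example plot variable on a copy of iris
irisDemo <- iris
irisDemo$plotID <- rep(c("Plot1", "Plot2", "Plot3", "Plot4", "Plot5"),
                       length.out = nrow(irisDemo))

# Rows = plots, columns = species, cells = mean sepal length
matBase <- tapply(irisDemo$Sepal.Length,
                  list(irisDemo$plotID, irisDemo$Species),
                  mean)
round(matBase, 2)
```

Unlike dcast, tapply() returns a true matrix with the plot IDs already in place as row names, and any plot/species combination with no records is filled with NA.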

Cleaning up the matrix

Now we do a few clean-up things, to make this a nice dataframe to work with. First, we use the function as.data.frame to, you guessed it, make our matrix a dataframe. Next, we put 0s in to replace any NAs - this is important sometimes when working with species abundance/presence matrices. Think about if you want NAs or 0s for missing data in your dataset. It is not always appropriate to coerce NAs to 0. Next we assign row.names, in this case the variable plotID. Then we drop the variable plotID, because those are now our row names. Then in the last line of code, we view it.

Note: you cannot have duplicate row names; each plot ID needs to be unique.

mat <- as.data.frame(mat)
mat[is.na(mat)] <- 0
row.names(mat)<-mat$plotID #This makes the row names the same as plot ID, especially useful for plot IDs for ordinations
mat<-mat[-1] #dropping first column (we already made it the row names)
mat

##       setosa versicolor virginica
## Plot1   5.16       5.96      6.94
## Plot2   5.03       6.11      6.28
## Plot3   4.85       6.07      6.78
## Plot4   4.95       5.82      6.44
## Plot5   5.04       5.72      6.50
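The note above about duplicate row names is mostly a concern when plot IDs come straight from a spreadsheet rather than from dcast, which already gives one row per plot. As a small sketch on made-up data (plotDemo and dupIDs are hypothetical names), you can check for repeats before assigning row names:

```r
# Made-up table in which Plot2 was accidentally entered twice
plotDemo <- data.frame(plotID = c("Plot1", "Plot2", "Plot2"),
                       setosa = c(5.1, 4.9, 5.0))

# anyDuplicated() returns 0 when every value is unique,
# otherwise the position of the first repeat
anyDuplicated(plotDemo$plotID)

# List the offending IDs so they can be fixed in the raw data
dupIDs <- unique(plotDemo$plotID[duplicated(plotDemo$plotID)])
dupIDs
```

If anyDuplicated() returns anything other than 0, assigning those IDs with row.names() will fail, so fix or aggregate the duplicates first.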

Exporting your data

Now we have a matrix showing the mean sepal length for each species in every plot - useful for making tables and some statistical analysis.

Use the code below to export your data into your working directory. Use setwd(), or the directions in the Getting Started With R tutorial, to set your working directory.

write.csv(mat,"sepalLengthMatrix.csv") #the matrix holds mean sepal length, so the file is named to match

Making a Species Abundance matrix

Often we need to make a matrix of which species are in what plots, and how many of them are there. This is your good old species matrix - needed for ordinations, cluster analysis, and other fun things.

This is a very similar process to the example above, so I’ll breeze through the code and only let you know when things are different.

Species Abundance matrix

We will make a species abundance matrix that shows the abundance of each species for each plot.

I have included the code below to make the plot variable again. This is just example data.

plotIDs<-c("Plot1","Plot2","Plot3","Plot4","Plot5")
iris$plotID<-plotIDs

We call the package reshape2 again to make this matrix. We use the function dcast once again. We specify that the data is iris and that we want plotID to be rows and Species to be columns (plotID~Species). We set the value.var to sepal length again (I’ll explain why in a moment). We do not include an aggregate method like we did last time - let’s run it and I’ll explain.

Making a matrix

library(reshape2)
sppAbun <- dcast(iris, plotID~Species,value.var= 'Sepal.Length')

## Aggregation function missing: defaulting to length

So we run it and we get some text that says “Aggregation function missing: defaulting to length”. Sounds like we screwed up, right? Wrong!

This is perfection! Remember that ‘length’ in R means the number of observations. That’s exactly what we want - the number of observations of each species (the abundance) in each plot. But why set value.var to sepal length? Because you have to set it to something - you could have set it to any variable, as long as there is no missing data in that variable. We just need a count, so any variable will do.
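Because all we need is a count, the same abundance table can also be sketched in base R with table(), which counts records and needs no value variable at all. This is an alternative shown for comparison, not the reshape2 approach used in this tutorial. The names irisDemo, abunBase and abunMat are made up, and the plot variable is rebuilt so the block runs on its own.

```r
# Rebuild the example plot variable on a copy of iris
irisDemo <- iris
irisDemo$plotID <- rep(c("Plot1", "Plot2", "Plot3", "Plot4", "Plot5"),
                       length.out = nrow(irisDemo))

# table() counts the records for every plot/species combination
abunBase <- table(irisDemo$plotID, irisDemo$Species)
abunBase

# Convert the table to a data frame with plots as row names
abunMat <- as.data.frame.matrix(abunBase)
abunMat
```

Combinations with no records come out as 0 rather than NA, so there is nothing to replace afterwards.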

Cleaning up the matrix

We once again clean up the matrix.

sppAbun <- as.data.frame(sppAbun)
sppAbun[is.na(sppAbun)] <- 0
row.names(sppAbun)<-sppAbun$plotID #This makes the row names the same as plot ID, especially useful for plot IDs for ordinations
sppAbun<-sppAbun[-1] #dropping first column (we already made it the row names)
sppAbun

##       setosa versicolor virginica
## Plot1     10         10        10
## Plot2     10         10        10
## Plot3     10         10        10
## Plot4     10         10        10
## Plot5     10         10        10

What we get is pretty boring: there are 10 individuals of each species in each plot. But you can use this code to get your own species abundances, which should be more interesting (unless your community is crazy even!).

Presence absence matrix

Often we need a presence-absence matrix, not an abundance matrix though.

It is easy to convert an abundance matrix to a presence-absence matrix. Use the code below - it makes any number greater than 0 into a 1. This way, if the species was observed there, it’s a 1; if the species was absent, it’s a 0.

We see that every species was present in every plot.

sppAbun[sppAbun > 0] <- 1 #converts from abundance to P/A
sppAbun

##       setosa versicolor virginica
## Plot1      1          1         1
## Plot2      1          1         1
## Plot3      1          1         1
## Plot4      1          1         1
## Plot5      1          1         1
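Note that the code above overwrites sppAbun, so the abundances are gone once it has run. A minimal sketch of a version that keeps both, using a small made-up abundance table (abunDemo and paDemo are hypothetical names):

```r
# A small made-up abundance matrix with some zeros
abunDemo <- data.frame(setosa     = c(10, 0, 3),
                       versicolor = c(0, 0, 7),
                       row.names  = c("Plot1", "Plot2", "Plot3"))

# Copy first, so the abundance data survive the conversion
paDemo <- abunDemo
paDemo[paDemo > 0] <- 1   # any count above 0 becomes a 1
paDemo

# Species richness per plot is then just the row sum
rowSums(paDemo)
```

Keeping the two objects separate means you can still run abundance-based analyses after making the presence-absence matrix.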

Subsetting data

R is amazing at splitting datasets up in lots of useful ways. I’m incredibly passionate about how good R is at this. If you want to share in my passion, try out the methods below and also check out this subset link (https://www.statmethods.net/management/subset.html). Much of this tutorial is borrowed directly from that page.

‘Indexing’ - dropping/selecting variables and records

Let’s first use a simple index to make a new dataframe that only keeps the first variable. We get a dataset that is just the variable Sepal Length. Not very useful.

iris1 <- iris[1]

Let’s now make a new dataframe that keeps the 1st and 5th through 7th variables. This creates a dataframe with sepal length, species, habitat, and plot.

iris2<- iris[c(1,5:7)]

Drop the 3rd and 5th variables. Sometimes it is easier to exclude a variable or 2.

iris4 <- iris[c(-3,-5)]

Now let’s only include some records, keeping the first 100. Notice the new comma after the parentheses: that is what tells R we are dealing with rows, not columns. Inside brackets, rows go before the comma and columns after. If there is no comma, R assumes you are indexing columns.

iris3<- iris2[c(1:100),]

Now let’s drop the first 100 records (using the - within the brackets).

iris5<- iris2[-c(1:100),] #named iris5 so it does not overwrite iris4 from above
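Counting column positions breaks as soon as a column is added or moved, so it is often safer to index by name. The following is a sketch of the same ideas using column names; the object names irisByName, irisDropped and irisCorner are made up for the example.

```r
# Keep variables by name rather than by position
irisByName <- iris[c("Sepal.Length", "Species")]

# Drop variables by name: keep every column whose name is NOT in the list
irisDropped <- iris[!(names(iris) %in% c("Petal.Length", "Species"))]

# Rows and columns together: first 5 records, two named variables
irisCorner <- iris[1:5, c("Sepal.Length", "Species")]
irisCorner
```

The last line uses both sides of the comma at once: rows before it, columns after it.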

Selecting records by name

Counting rows can become difficult in large datasets, so there are other ways to subset data in R.

Select all I. virginica records

virginica<- subset(iris, Species=="virginica")

Select all I. virginica with a sepal length greater than 5 cm.

virginica_Greater5<- subset(iris, Species=="virginica" & Sepal.Length > 5)

Select all I. virginica and I. setosa records. In R, the “|” means OR - so any record that is virginica or setosa is kept.

virginicaSetosa<- subset(iris, Species=="virginica" | Species=="setosa")

Drop I. virginica records

noVirginica<- subset(iris, Species!="virginica")
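Two related idioms are worth knowing. As a sketch (the object names ending in 2 and 3 are made up for the example): %in% is a compact way to write several ORs, the same selection can be done with bracket indexing, and droplevels() removes factor levels that no longer occur after subsetting.

```r
# %in% keeps any record whose species is in the list
virginicaSetosa2 <- subset(iris, Species %in% c("virginica", "setosa"))

# The same selection with bracket indexing (rows go before the comma)
virginicaSetosa3 <- iris[iris$Species %in% c("virginica", "setosa"), ]

# Subsetting keeps unused factor levels; droplevels() removes them
levels(virginicaSetosa2$Species)   # still lists versicolor
virginicaSetosa2 <- droplevels(virginicaSetosa2)
levels(virginicaSetosa2$Species)   # versicolor is gone
```

Dropping unused levels matters when you summarise or plot the subset, because empty groups can otherwise show up in the output.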

Now with your data

Summarizing data

Simple Summary - Averages by one grouping variable

Use install.packages("plyr") to install it the first time, and then never use install.packages() for plyr again. Once per computer is all that you need.

library(plyr)
meanBySpecies<- ddply(iris, c("Species"), summarise,
               mean = mean(Sepal.Width))
meanBySpecies

##      Species  mean
## 1     setosa 3.428
## 2 versicolor 2.770
## 3  virginica 2.974

More complex summary

The top 3 lines are only needed to run the given example. To run them, remove the # from the start of each of those three lines.

#v<-c("wet","dry","mesic") #makes object 'v' that is 3 text strings (wet, dry, and mesic)
#iris$habitat<-v #makes a new variable 'habitat' in the iris dataset. The values are wet, dry, mesic repeating down the column
#iris$habitat<-as.factor(iris$habitat) #ensures that the variable 'habitat' is read as a factor
summbySppHab <- ddply(iris, c("Species", "habitat"), summarise,
                      N = length(Sepal.Width), mean = mean(Sepal.Width),
                      min = min(Sepal.Width), max = max(Sepal.Width),
                      var = var(Sepal.Width), sd = sd(Sepal.Width),
                      se = sd / sqrt(N))
summbySppHab

##      Species habitat  N     mean min max        var        sd         se
## 1     setosa     dry 17 3.447059 3.0 3.9 0.08389706 0.2896499 0.07025042
## 2     setosa   mesic 16 3.362500 2.3 4.1 0.20250000 0.4500000 0.11250000
## 3     setosa     wet 17 3.470588 3.0 4.4 0.15970588 0.3996322 0.09692504
## 4 versicolor     dry 16 2.906250 2.6 3.4 0.04329167 0.2080665 0.05201662
## 5 versicolor   mesic 17 2.735294 2.2 3.3 0.12367647 0.3516767 0.08529412
## 6 versicolor     wet 17 2.676471 2.0 3.2 0.10816176 0.3288796 0.07976501
## 7  virginica     dry 17 3.023529 2.5 3.6 0.09566176 0.3092924 0.07501442
## 8  virginica   mesic 17 2.917647 2.2 3.8 0.13404412 0.3661204 0.08879723
## 9  virginica     wet 16 2.981250 2.5 3.8 0.08829167 0.2971391 0.07428478

Exporting your data

Remember to use setwd() to set your working directory before you export. With your own data, you likely did this at the beginning. If you have questions, see the Getting Started in R tutorial.

write.csv(summbySppHab,"summbySppHab.csv")

Check your working directory and see if a csv called summbySppHab has appeared. If not, close R (after saving), re-open it, check your working directory, and re-run the code.

Rearrange data

It is very time consuming to rearrange data by hand in Excel. R has many options to help you out.

Summary matrix

The first 2 lines of code are only needed to run the given example. To run them, remove the # from the start of each line.

Make a matrix

#plotIDs<-c("Plot1","Plot2","Plot3","Plot4","Plot5")
#iris$plotID<-plotIDs
library(reshape2)
mat <- dcast(iris, plotID~Species,value.var= 'Sepal.Length',mean)

Cleaning up the matrix

Note: you cannot have duplicate row names; each plot ID needs to be unique.

mat <- as.data.frame(mat)
mat[is.na(mat)] <- 0
row.names(mat)<-mat$plotID #This makes the row names the same as plot ID, especially useful for plot IDs for ordinations
mat<-mat[-1] #dropping first column (we already made it the row names)
mat

##       setosa versicolor virginica
## Plot1   5.16       5.96      6.94
## Plot2   5.03       6.11      6.28
## Plot3   4.85       6.07      6.78
## Plot4   4.95       5.82      6.44
## Plot5   5.04       5.72      6.50

Exporting your matrix

Use setwd(), or the directions in the Getting Started With R tutorial, to set your working directory.

write.csv(mat,"sepalLengthMatrix.csv") #the matrix holds mean sepal length, so the file is named to match

Species Abundance matrix

The first 2 lines of code are only needed to run the given example. To run them, remove the # from the start of each line.

#plotIDs<-c("Plot1","Plot2","Plot3","Plot4","Plot5")
#iris$plotID<-plotIDs
library(reshape2)
sppAbun <- dcast(iris, plotID~Species,value.var= 'Sepal.Length')

## Aggregation function missing: defaulting to length

Cleaning up the matrix

sppAbun <- as.data.frame(sppAbun)
sppAbun[is.na(sppAbun)] <- 0
row.names(sppAbun)<-sppAbun$plotID #This makes the row names the same as plot ID, especially useful for plot IDs for ordinations
sppAbun<-sppAbun[-1] #dropping first column (we already made it the row names)
sppAbun

##       setosa versicolor virginica
## Plot1     10         10        10
## Plot2     10         10        10
## Plot3     10         10        10
## Plot4     10         10        10
## Plot5     10         10        10

Presence absence matrix from abundance matrix

sppAbun[sppAbun > 0] <- 1 #converts from abundance to P/A
sppAbun

##       setosa versicolor virginica
## Plot1      1          1         1
## Plot2      1          1         1
## Plot3      1          1         1
## Plot4      1          1         1
## Plot5      1          1         1

Subsetting data

‘Indexing’ - dropping/selecting variables and records

Keep only first variable.

iris1 <- iris[1]

Keep 1st and 3rd through 5th variables.

iris2<- iris[c(1,3:5)]

Drop the 3rd and 5th variables.

iris4 <- iris[c(-3,-5)]

Keep first 100 records.

iris3<- iris2[c(1:100),]

Drop first 100 records.

iris5<- iris2[-c(1:100),] #named iris5 so it does not overwrite iris3 from above

Selecting records by name

Select all I. virginica records

virginica<- subset(iris, Species=="virginica")

Select all I. virginica with a sepal length greater than 5 cm.

virginica_Greater5<- subset(iris, Species=="virginica" & Sepal.Length > 5)

Select all I. virginica and I. setosa records. In R, the “|” means OR - so any record that is virginica or setosa is kept.

virginicaSetosa<- subset(iris, Species=="virginica" | Species=="setosa")

Drop I. virginica records

noVirginica<- subset(iris, Species!="virginica")

Basic Data Management

Michael Sinclair

July 1, 2019
