species_x_site_matrix_tidyverse

Author

Jamie Brewster

Introduction

Certain analyses, such as NMDS, require your data to appear in “wide” format referred to as a “species x site matrix”. Here, each row is a sample and each column is a unique species. Within the cells are values denoting the number of individuals recorded.

However, it is not practical to enter data in this format, whether in the field or the lab. A more typical form is “long” format, with a column for “species name” or similar, and a column for “total”. In this format, only species encountered in each sample are recorded.

Additionally, your record sheet may have columns/variables not required for analysis and sub-samples that need summing at a replicate/plot level.

Conversion to wide format and the streamlining/summarising of your data set can easily be done within R using the Tidyverse package. Using the pipe operator (%>%, or the shortcut Shift+Ctrl+M) you can string together functions to reduce the number of intermediary objects.

Running Code

If you have not done already, install Tidyverse and call it with the library function.

install.packages("tidyverse", repos = "http://cran.us.r-project.org")
library(tidyverse)

This tutorial will use a dummy record sheet. Head to https://github.com/brewstej/tutorials to download or use the following code to automatically load the record sheet into an object called “mydata”.

urlfile = "https://raw.githubusercontent.com/brewstej/tutorials/refs/heads/main/example_species_record.csv"

mydata <- read_csv(url(urlfile))

head(mydata)

# A tibble: 6 × 8
  plot.id sub.samp date       habitat species.name  male female total
  <chr>   <chr>    <chr>      <chr>   <chr>        <dbl>  <dbl> <dbl>
1 P1      sub1     07/03/2025 grass   A.species       10      2    12
2 P1      sub1     07/03/2025 grass   B.species        6      4    10
3 P1      sub1     07/03/2025 grass   C.species        2      3     5
4 P1      sub2     07/03/2025 grass   D.species        1      2     3
5 P1      sub2     07/03/2025 grass   A.species       12      4    16
6 P2      sub1     07/03/2025 grass   C.species       12     12    24

Have a look at the variable titles and first few lines of data in the object “mydata”. You will notice that each plot (our unit of replication in this example) comprises two sub samples labelled “sub1” and “sub2” here. These could be quadrats, pitfall traps, repeated timed counts. Regardless, we want to pool these data and display a single row per species and a summed count of the individuals at a plot level.

We can use the “group_by” and “summarise” functions to do this. Give the function the variables you want to retain and use for grouping. Any variables not selected will be omitted from the output or summarised based on the subsequent “summarise” function. Here, we are creating a new variable called “new.total” and performing the function “sum” on the existing variable “total”.

The output will then be presented according to the grouping variables specified.

mydata_summarised <- mydata %>% 
  group_by(plot.id, habitat, species.name) %>% 
  summarise(new.total = sum(total))

head(mydata_summarised)

# A tibble: 6 × 4
# Groups:   plot.id, habitat [2]
  plot.id habitat species.name new.total
  <chr>   <chr>   <chr>            <dbl>
1 P1      grass   A.species           28
2 P1      grass   B.species           10
3 P1      grass   C.species            5
4 P1      grass   D.species            3
5 P10     heath   A.species           21
6 P10     heath   B.species           18

You should now see a new object, with all previous sub samples summed and presented as a single row of data per species, while retaining plot and habitat grouping.

We can now create the pivot table as follows:

mydata_matrix <- mydata_summarised %>% 
  pivot_wider(names_from = species.name, values_from = new.total, values_fill = 0)

head(mydata_matrix)

# A tibble: 6 × 8
# Groups:   plot.id, habitat [6]
  plot.id habitat A.species B.species C.species D.species E.species F.species
  <chr>   <chr>       <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
1 P1      grass          28        10         5         3         0         0
2 P10     heath          21        18        14        11         6         9
3 P2      grass          33         0        24        25        15        10
4 P3      grass          11        22        10         0        24         0
5 P4      grass          39         0        20         0         0         0
6 P5      grass           8        12         0         9         0        20

Pivoting wider introduces NA values (you didn’t enter zeros in your record sheet!). These are transformed into “0” with the values_fill argument.

All in one code chunk

The various stages have been outlined separately, to aid understanding and allow description. However, these can be combined and run as one:

mydata_matrix2 <- mydata %>% 
  group_by(plot.id, habitat, species.name) %>% 
  summarise(new.total = sum(total)) %>% 
  pivot_wider(names_from = species.name, values_from = new.total, values_fill = 0)

head(mydata_matrix2)

# A tibble: 6 × 8
# Groups:   plot.id, habitat [6]
  plot.id habitat A.species B.species C.species D.species E.species F.species
  <chr>   <chr>       <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
1 P1      grass          28        10         5         3         0         0
2 P10     heath          21        18        14        11         6         9
3 P2      grass          33         0        24        25        15        10
4 P3      grass          11        22        10         0        24         0
5 P4      grass          39         0        20         0         0         0
6 P5      grass           8        12         0         9         0        20

You can now swap out the dataset and variables for your own!

Bonus

P.S. if you wish to filter out your data based on certain variables, you can easily do this as part of the same process. Say, for example, I only wanted to look at a species x site matrix for “heath” habitat. I can use the “filter” function and specify that “habitat” must be exactly equal to (==) “heath”. There are many other operators you can use here (for numerical values etc.).

While you no longer need to group by “habitat” as there is only one, omitting it at this stage would remove it from the resulting object.

mydata_matrix_heath <- mydata %>%
  filter(habitat == "heath") %>% 
  group_by(plot.id, habitat, species.name) %>% 
  summarise(new.total = sum(total)) %>% 
  pivot_wider(names_from = species.name, values_from = new.total, values_fill = 0)

head(mydata_matrix_heath)

# A tibble: 5 × 8
# Groups:   plot.id, habitat [5]
  plot.id habitat A.species B.species C.species D.species E.species F.species
  <chr>   <chr>       <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
1 P10     heath          21        18        14        11         6         9
2 P6      heath           0        16         0         9         0         0
3 P7      heath           3         0         0         0         0        22
4 P8      heath          17         0        41         0         0         4
5 P9      heath           0         0         0         0        15         0