Sampling publications from COMPADRE via R

Introduction

This document illustrates how to sample publications from the COMPADRE and COMADRE databases via R.

You will need to load the packages tidyverse and Rcompadre. If you don’t already have them installed, you can install them from CRAN with the command install.packages("PACKAGE NAME").
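If you are not sure whether the packages are already installed, a minimal base R check along the following lines may help. The names needed, is_installed and missing_pkgs are my own, chosen only for illustration.

```r
# Packages this document relies on.
needed <- c("tidyverse", "Rcompadre")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise.
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Names of any packages that still need installing.
missing_pkgs <- needed[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment the next line to install them from CRAN.
  # install.packages(missing_pkgs)
}
```

The install.packages call is left commented out so that nothing is downloaded unless you choose to do so.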

library(tidyverse)
library(Rcompadre)

Downloading the database

COMPADRE (plants) and COMADRE (animals) are two separate databases that have the same structure.

Downloading them is easy with Rcompadre, using the cdb_fetch function. Here I am downloading the animal matrix database to a new object called db.

db <- cdb_fetch("comadre")

## This is COMADRE version 4.21.8.0 (release date Aug_20_2021)
## See user agreement at https://compadre-db.org/Help/UserAgreement
## See how to cite at https://compadre-db.org/Help/HowToCite

Note that every time you run this command, R is fetching the entire database. Therefore, it might be a good idea to save the data out to a file so that you can use it without needing to access the Internet. You can save the object db to your working directory like this:

save(db, file = "comadre_20230315.Rdata")

Then you can load it again like this:

load("comadre_20230315.Rdata")
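The save-then-load steps can be folded into one "fetch only if there is no local copy" helper. The sketch below is one possible way to do that, not part of Rcompadre: fetch_cached and demo_path are hypothetical names, and it uses saveRDS/readRDS rather than save/load so that the helper can return the object directly. The demonstration uses a small stand-in data frame so that it runs without Internet access.

```r
# Return the cached object if the file exists; otherwise call fetch_fun(),
# save its result to the file, and return it.
fetch_cached <- function(path, fetch_fun) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  obj <- fetch_fun()
  saveRDS(obj, path)
  obj
}

# Demonstration with a stand-in "download" (no network needed).
demo_path <- tempfile(fileext = ".rds")
first <- fetch_cached(demo_path, function() data.frame(x = 1:3))

# The second call reads the file, so the fetch function is never run.
second <- fetch_cached(demo_path, function() stop("not called: cache exists"))
```

With Rcompadre loaded, the same helper could presumably be used as db <- fetch_cached("comadre_cache.rds", function() cdb_fetch("comadre")), where the file name is again just an example.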

Examining the database

You can get an idea of the contents of the database by asking for the column names or by asking for the dimensions with dim.

names(db)

##  [1] "mat"                    "MatrixID"               "SpeciesAuthor"         
##  [4] "SpeciesAccepted"        "CommonName"             "Kingdom"               
##  [7] "Phylum"                 "Class"                  "Order"                 
## [10] "Family"                 "Genus"                  "Species"               
## [13] "Infraspecies"           "InfraspeciesType"       "OrganismType"          
## [16] "DicotMonoc"             "AngioGymno"             "Authors"               
## [19] "Journal"                "SourceType"             "OtherType"             
## [22] "YearPublication"        "DOI_ISBN"               "AdditionalSource"      
## [25] "StudyDuration"          "StudyStart"             "StudyEnd"              
## [28] "ProjectionInterval"     "MatrixCriteriaSize"     "MatrixCriteriaOntogeny"
## [31] "MatrixCriteriaAge"      "MatrixPopulation"       "NumberPopulations"     
## [34] "Lat"                    "Lon"                    "Altitude"              
## [37] "Country"                "Continent"              "Ecoregion"             
## [40] "StudiedSex"             "MatrixComposite"        "MatrixSeasonal"        
## [43] "MatrixTreatment"        "MatrixCaptivity"        "MatrixStartYear"       
## [46] "MatrixStartSeason"      "MatrixStartMonth"       "MatrixEndYear"         
## [49] "MatrixEndSeason"        "MatrixEndMonth"         "CensusType"            
## [52] "MatrixSplit"            "MatrixFec"              "Observations"          
## [55] "MatrixDimension"        "SurvivalIssue"          "_Database"             
## [58] "_PopulationStatus"      "_PublicationStatus"

dim(db)

## [1] 3317   59

You can see that there’s a lot of content in the database. You can also see that there are 3317 rows of data and 59 columns.

The database (db) is stored in a special format called CompadreDB that was specifically created for this purpose. This makes handling the data a little bit complicated. For our purposes, we are only interested in the metadata (the data about the data, rather than the actual matrices themselves). We can extract the metadata to an ordinary data.frame like this:

db_metadata <- cdb_metadata(db)
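Once the metadata is an ordinary data frame, the usual dplyr tools apply. As a rough illustration, the sketch below counts rows per taxonomic class. It uses a small toy data frame (toy_metadata, with invented values) in place of db_metadata so that it runs on its own; only the column name Class is taken from the database.

```r
library(dplyr)

# Toy stand-in for db_metadata; all values are invented for illustration.
toy_metadata <- data.frame(
  Class = c("Mammalia", "Mammalia", "Aves", "Reptilia", "Aves", "Mammalia"),
  SpeciesAccepted = c("Species a", "Species a", "Species b",
                      "Species c", "Species d", "Species e"),
  stringsAsFactors = FALSE
)

# Number of rows (matrices) per taxonomic class, largest first.
class_counts <- toy_metadata %>%
  count(Class, sort = TRUE)

class_counts
```

Replacing toy_metadata with db_metadata should give the same kind of summary for the real data.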

Selecting columns and filtering data

We are usually not interested in all of the data, so we can simplify things by selecting only the columns that we need.
Let’s say we are mainly interested in the publication information and the species. We could select columns using the select function from the dplyr package. There are often multiple matrices for each species-paper combination, so we can get rid of duplicates by using unique.

In addition, I filter the data so it only includes mammals and birds:

db_metadata_2 <- db_metadata %>%
  select(Class, SpeciesAccepted, Authors, Journal, YearPublication, DOI_ISBN) %>%
  unique() %>%
  filter(Class %in% c("Mammalia", "Aves"))

head(db_metadata_2)

## # A tibble: 6 × 6
##   Class    SpeciesAccepted         Authors      Journal YearPublication DOI_ISBN
##   <chr>    <chr>                   <chr>        <chr>   <chr>           <chr>   
## 1 Mammalia Alces alces             Ballard; Wh… Wildli… 1991            https:/…
## 2 Mammalia Alouatta seniculus      Wiederholt;… Ecol M… 2010            10.1016…
## 3 Aves     Ara glaucogularis       Maestri; Fe… Ecol M… 2017            10.1016…
## 4 Mammalia Bos taurus primigenius  Lesnoff; Co… Ecol M… 2012            10.1016…
## 5 Mammalia Brachyteles hypoxanthus Morris; Pfi… Ecology 2008            10.1890…
## 6 Mammalia Canis lupus             Smith; Cuba… Popul … 2014            10.1111…
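The table above has one row per species-paper combination, so a paper that covers several species appears more than once. If the unit of interest is the publication itself, one option is to de-duplicate on the publication columns only. The sketch below shows the idea on a toy data frame with invented values (toy and toy_publications are illustrative names); several columns are used as the key because DOI_ISBN can be missing.

```r
library(dplyr)

# Toy stand-in for db_metadata_2; all values are invented.
toy <- data.frame(
  Class = c("Aves", "Aves", "Mammalia", "Mammalia"),
  SpeciesAccepted = c("Species a", "Species b", "Species c", "Species d"),
  Authors = c("Author1; Author2", "Author1; Author2", "Author3", "Author4"),
  Journal = c("Journal X", "Journal X", NA, NA),
  YearPublication = c("2001", "2001", "2005", "2010"),
  DOI_ISBN = c("doi-1", "doi-1", NA, NA),
  stringsAsFactors = FALSE
)

# One row per publication: keep only the publication columns and
# drop repeated combinations.
toy_publications <- toy %>%
  distinct(Authors, Journal, YearPublication, DOI_ISBN)

nrow(toy)               # 4 species-paper rows
nrow(toy_publications)  # 3 publications
```

Note that this drops Class and SpeciesAccepted, so it suits a sample of papers rather than a sample stratified by taxonomic class.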

Randomly selecting publications

It is often useful to draw a random sample from the database. Here I show how to randomly select publications. To make this repeatable, I start by using set.seed. Basically, set.seed makes random sampling repeatable if the same “seed” number is used.

set.seed(42)

mySample <- db_metadata_2 %>%
  slice_sample(n = 4)

mySample

## # A tibble: 4 × 6
##   Class    SpeciesAccepted      Authors         Journal YearPublication DOI_ISBN
##   <chr>    <chr>                <chr>           <chr>   <chr>           <chr>   
## 1 Aves     Anas platyrhynchos   Hoekman; Mills… J Wild… 2002            10.2307…
## 2 Mammalia Dugong dugon         Heinsohn; Lacy… Anim C… 2004            10.1017…
## 3 Aves     Dendragapus obscurus Schumaker; Ern… Ecol A… 2004            10.1890…
## 4 Mammalia Rattus fuscipes      Lindenmayer; L… Biol C… 2002            10.1016…
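To see what set.seed does in isolation, here is a small base R sketch using sample on the numbers 1 to 100 rather than on the database. The object names first_draw and second_draw are just for illustration.

```r
# Draw four numbers after setting the seed.
set.seed(42)
first_draw <- sample(1:100, 4)

# Reset the same seed and draw again: the result is the same.
set.seed(42)
second_draw <- sample(1:100, 4)

identical(first_draw, second_draw)  # TRUE
```

The same logic applies to slice_sample, provided that the rows of the input are in the same order each time. Results may still differ between R versions, because the default sampling algorithm changed in R 3.6.0.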

If I want to stratify my random sample so that I take the same number from both mammals and birds, I can do that by first using group_by to group the data by taxonomic class:

set.seed(42)

mySample <- db_metadata_2 %>%
  group_by(Class) %>%
  slice_sample(n = 4)

mySample

## # A tibble: 8 × 6
## # Groups:   Class [2]
##   Class    SpeciesAccepted         Authors      Journal YearPublication DOI_ISBN
##   <chr>    <chr>                   <chr>        <chr>   <chr>           <chr>   
## 1 Aves     Falco peregrinus        Deines; Pet… Ecol A… 2007            10.1890…
## 2 Aves     Cyanistes caeruleus     Koons        <NA>    2005            <NA>    
## 3 Aves     Bonasa umbellus         Tirpak; Giu… Biol C… 2006            10.1016…
## 4 Aves     Puffinus tenuirostris   Yearsley; F… Math B… 2002            10.1016…
## 5 Mammalia Urocitellus columbianus Dobson; Oli  Am Nat  2001            10.1086…
## 6 Mammalia Phascolarctos cinereus  Rhodes; Ng;… Biol C… 2011            10.1016…
## 7 Mammalia Canis lupus             Miller; Jen… Ecol M… 2002            10.1016…
## 8 Mammalia Procyon lotor           Schumaker; … Ecol A… 2004            10.1890…
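The output above is still grouped by Class (see the "Groups" line), which can affect later dplyr steps. A possible follow-up, sketched here on toy data with invented values, is to call ungroup after sampling and then confirm the number of rows drawn per class. The names toy and toy_sample are illustrative only.

```r
library(dplyr)

# Toy stand-in for db_metadata_2: 10 birds and 6 mammals, invented names.
toy <- data.frame(
  Class = rep(c("Aves", "Mammalia"), times = c(10, 6)),
  SpeciesAccepted = paste("Species", 1:16),
  stringsAsFactors = FALSE
)

set.seed(42)
toy_sample <- toy %>%
  group_by(Class) %>%
  slice_sample(n = 4) %>%
  ungroup()  # drop the grouping so later steps treat the table as a whole

# Confirm that each class contributes the same number of rows.
count(toy_sample, Class)
```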

Now you have a random sample stratified across birds and mammals. If you have COMPADRINO access privileges, you can find PDFs at www.compadre-db.org; otherwise, you can search on Google Scholar or similar services.

You could save this sample to a CSV file for later use like this:

write_csv(mySample, file = "mySample.csv")
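When you read the CSV file back in, read_csv guesses column types, so a column such as YearPublication (stored as text in the tables above) would likely come back as a number. One way to avoid that, sketched below on a toy data frame and a temporary file, is to ask for all columns as character. The names toy_sample, csv_path and toy_reloaded are illustrative, and the values are invented.

```r
library(readr)

# Toy stand-in for mySample; values are invented.
toy_sample <- data.frame(
  SpeciesAccepted = c("Species a", "Species b"),
  YearPublication = c("2001", "2005"),
  stringsAsFactors = FALSE
)

# Write to a temporary file so the example leaves nothing behind.
csv_path <- tempfile(fileext = ".csv")
write_csv(toy_sample, file = csv_path)

# Read every column back as character so that YearPublication stays text.
toy_reloaded <- read_csv(csv_path, col_types = cols(.default = col_character()))

class(toy_reloaded$YearPublication)
```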