This document illustrates how to sample publications from the COMPADRE and COMADRE databases via R.
You will need to load the packages tidyverse and
Rcompadre. If you don’t already have them installed, you
can install them from CRAN with the command
install.packages("PACKAGE NAME").
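If you are not sure which of the packages are already installed, a minimal sketch like the following can check first. The helper name `missing_packages` is made up for this illustration and is not part of any package; the install call is left commented out so nothing is installed by accident.

```r
# Hypothetical helper: return the names of packages that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages used in this document
to_install <- missing_packages(c("tidyverse", "Rcompadre"))
to_install

# Uncomment to install whatever is missing from CRAN
# if (length(to_install) > 0) install.packages(to_install)
```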
library(tidyverse)
library(Rcompadre)
COMPADRE (plants) and COMADRE (animals) are two separate databases that have the same structure.
Downloading them is easy with Rcompadre, using the
cdb_fetch function. Here I am downloading the animal matrix
database to a new object called db.
db <- cdb_fetch("comadre")
## This is COMADRE version 4.21.8.0 (release date Aug_20_2021)
## See user agreement at https://compadre-db.org/Help/UserAgreement
## See how to cite at https://compadre-db.org/Help/HowToCite
Note that every time you run this command, R fetches the entire
database. Therefore, it is a good idea to save the data to a
file so that you can use it without needing to access the Internet. You
can save the object db to your working directory like
this:
save(db, file = "comadre_20230315.Rdata")
Then you can load it again like this:
load("comadre_20230315.Rdata")
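The fetch-once-then-reuse idea can also be wrapped in a small function. This is only a sketch under some assumptions: `load_or_fetch` and the file name `comadre_cache.rds` are hypothetical names chosen for illustration, and it uses `saveRDS`/`readRDS` rather than the `save`/`load` pair shown above, so the object comes back as a return value instead of being restored under its old name.

```r
# Hypothetical helper: read the database from a local file if it exists,
# otherwise fetch it and store a copy for next time.
# `fetch` defaults to downloading COMADRE with Rcompadre.
load_or_fetch <- function(path,
                          fetch = function() Rcompadre::cdb_fetch("comadre")) {
  if (file.exists(path)) {
    readRDS(path)        # reuse the local copy, no Internet needed
  } else {
    db <- fetch()        # download the database
    saveRDS(db, path)    # keep a local copy
    db
  }
}

# Example call (file name is an assumption):
# db <- load_or_fetch("comadre_cache.rds")
```

Because the cached copy is never refreshed automatically, you would need to delete the file to pick up a newer database release.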
You can get an idea of the contents of the database by asking for the
column names or by asking for the dimensions with
dim.
names(db)
## [1] "mat" "MatrixID" "SpeciesAuthor"
## [4] "SpeciesAccepted" "CommonName" "Kingdom"
## [7] "Phylum" "Class" "Order"
## [10] "Family" "Genus" "Species"
## [13] "Infraspecies" "InfraspeciesType" "OrganismType"
## [16] "DicotMonoc" "AngioGymno" "Authors"
## [19] "Journal" "SourceType" "OtherType"
## [22] "YearPublication" "DOI_ISBN" "AdditionalSource"
## [25] "StudyDuration" "StudyStart" "StudyEnd"
## [28] "ProjectionInterval" "MatrixCriteriaSize" "MatrixCriteriaOntogeny"
## [31] "MatrixCriteriaAge" "MatrixPopulation" "NumberPopulations"
## [34] "Lat" "Lon" "Altitude"
## [37] "Country" "Continent" "Ecoregion"
## [40] "StudiedSex" "MatrixComposite" "MatrixSeasonal"
## [43] "MatrixTreatment" "MatrixCaptivity" "MatrixStartYear"
## [46] "MatrixStartSeason" "MatrixStartMonth" "MatrixEndYear"
## [49] "MatrixEndSeason" "MatrixEndMonth" "CensusType"
## [52] "MatrixSplit" "MatrixFec" "Observations"
## [55] "MatrixDimension" "SurvivalIssue" "_Database"
## [58] "_PopulationStatus" "_PublicationStatus"
dim(db)
## [1] 3317 59
You can see that there’s a lot of content in the database. You can also see that there are 3317 rows of data and 59 columns.
The database (db) is stored in a special format called
CompadreDB that was specifically created for this purpose.
This makes handling the data a little bit complicated. For our purposes,
we are only interested in the metadata (the data about the data, rather
than the actual matrices themselves). We can extract the metadata to an
ordinary data.frame like this:
db_metadata <- cdb_metadata(db)
We are usually not interested in all of the data, so we can simplify the
data by selecting only the columns that we need.
Let’s say we are mainly interested in the publication information and
the species. We could select columns using the select
function from the dplyr package. There are often multiple
matrices for each species-paper combination, so we can get rid of
duplicates by using unique.
In addition, I filter the data so it only includes
mammals and birds:
db_metadata_2 <- db_metadata %>%
select(Class, SpeciesAccepted, Authors, Journal, YearPublication, DOI_ISBN) %>%
unique() %>%
filter(Class %in% c("Mammalia", "Aves"))
head(db_metadata_2)
## # A tibble: 6 × 6
## Class SpeciesAccepted Authors Journal YearPublication DOI_ISBN
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Mammalia Alces alces Ballard; Wh… Wildli… 1991 https:/…
## 2 Mammalia Alouatta seniculus Wiederholt;… Ecol M… 2010 10.1016…
## 3 Aves Ara glaucogularis Maestri; Fe… Ecol M… 2017 10.1016…
## 4 Mammalia Bos taurus primigenius Lesnoff; Co… Ecol M… 2012 10.1016…
## 5 Mammalia Brachyteles hypoxanthus Morris; Pfi… Ecology 2008 10.1890…
## 6 Mammalia Canis lupus Smith; Cuba… Popul … 2014 10.1111…
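Because SpeciesAccepted is among the selected columns, each row here is a species-paper combination, so a paper that covers several species appears more than once. If you wanted one row per publication instead, one possible approach is to deduplicate on the publication columns only. The sketch below uses a small made-up data frame (the species, authors and DOIs are placeholders, not real records) so that it runs without the database.

```r
library(dplyr)

# Made-up stand-in for db_metadata_2: one paper covers two species
toy_metadata <- tibble(
  SpeciesAccepted = c("Species a", "Species b", "Species c"),
  Authors         = c("Author X", "Author X", "Author Y"),
  YearPublication = c("2001", "2001", "2005"),
  DOI_ISBN        = c("doi-1", "doi-1", "doi-2")
)

# Keep one row per publication, ignoring the species column
publications <- toy_metadata %>%
  distinct(Authors, YearPublication, DOI_ISBN)

nrow(publications)
```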
It is often useful to select a random sample from the database.
Here I show how to randomly select publications. To make this repeatable,
I start by calling set.seed. Basically, set.seed
makes random sampling repeatable if the same “seed” number is used.
set.seed(42)
mySample <- db_metadata_2 %>%
slice_sample(n = 4)
mySample
## # A tibble: 4 × 6
## Class SpeciesAccepted Authors Journal YearPublication DOI_ISBN
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Aves Anas platyrhynchos Hoekman; Mills… J Wild… 2002 10.2307…
## 2 Mammalia Dugong dugon Heinsohn; Lacy… Anim C… 2004 10.1017…
## 3 Aves Dendragapus obscurus Schumaker; Ern… Ecol A… 2004 10.1890…
## 4 Mammalia Rattus fuscipes Lindenmayer; L… Biol C… 2002 10.1016…
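To see what set.seed does in isolation, here is a small base R sketch that does not need the database. It draws the same kind of random sample twice, resetting the seed before each draw; the variable names are only for illustration.

```r
# Draw five numbers from 1 to 100, resetting the seed before each draw
set.seed(42)
first_draw <- sample(1:100, 5)

set.seed(42)
second_draw <- sample(1:100, 5)

# Same seed, same sample
identical(first_draw, second_draw)
```

If the second call to set.seed is left out, the two draws will generally differ.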
If I want to stratify my random sample so that I
take the same number from both mammals and birds, I can do that by first
using group_by to group the data by taxonomic class:
set.seed(42)
mySample <- db_metadata_2 %>%
group_by(Class) %>%
slice_sample(n = 4)
mySample
## # A tibble: 8 × 6
## # Groups: Class [2]
## Class SpeciesAccepted Authors Journal YearPublication DOI_ISBN
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Aves Falco peregrinus Deines; Pet… Ecol A… 2007 10.1890…
## 2 Aves Cyanistes caeruleus Koons <NA> 2005 <NA>
## 3 Aves Bonasa umbellus Tirpak; Giu… Biol C… 2006 10.1016…
## 4 Aves Puffinus tenuirostris Yearsley; F… Math B… 2002 10.1016…
## 5 Mammalia Urocitellus columbianus Dobson; Oli Am Nat 2001 10.1086…
## 6 Mammalia Phascolarctos cinereus Rhodes; Ng;… Biol C… 2011 10.1016…
## 7 Mammalia Canis lupus Miller; Jen… Ecol M… 2002 10.1016…
## 8 Mammalia Procyon lotor Schumaker; … Ecol A… 2004 10.1890…
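Two details are worth checking after a stratified sample like this. First, the result is still grouped, so it can be sensible to call ungroup before further work. Second, as far as I understand the dplyr documentation, slice_sample silently returns fewer rows for a group that has fewer than n rows. The sketch below illustrates both with a made-up data frame in which one class has only three rows.

```r
library(dplyr)

# Made-up data: 3 birds and 10 mammals
toy_metadata <- tibble(
  Class = rep(c("Aves", "Mammalia"), times = c(3, 10)),
  id    = 1:13
)

set.seed(42)
toy_sample <- toy_metadata %>%
  group_by(Class) %>%
  slice_sample(n = 4) %>%
  ungroup()            # drop the grouping before further use

# Check how many rows each class actually contributed
count(toy_sample, Class)
```

Running count on your real sample is a quick way to confirm that every stratum contributed the number of rows you expected.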
Now you have a random sample stratified across birds and mammals. If you have COMPADRINO access privileges, you can find the PDFs at www.compadre-db.org; otherwise, you can search for the papers on Google Scholar or a similar service.
You can save this sample to a CSV file for later use like this:
write_csv(mySample, file = "mySample.csv")
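When the CSV file is read back in, readr guesses the column types, so a column such as YearPublication, which is stored as text in the metadata, may come back as a number. If you want to keep everything as text, one option is to set the column types explicitly. This sketch uses a one-row made-up data frame and a temporary file rather than the real sample.

```r
library(readr)

# Made-up stand-in for mySample, with the year stored as text
toy_sample <- data.frame(Class = "Aves", YearPublication = "2002")

# Write to a temporary file so nothing in the working directory is touched
path <- tempfile(fileext = ".csv")
write_csv(toy_sample, file = path)

# Default: readr guesses the column types
guessed <- read_csv(path, show_col_types = FALSE)

# Explicit: read every column as character
as_text <- read_csv(path, col_types = cols(.default = col_character()))

class(guessed$YearPublication)
class(as_text$YearPublication)
```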