Add taxonomic information to the DASCO v2.5 dataset
The RAVeM workflow uses data from the DASCO dataset, which provides coordinates of alien species occurrences worldwide. The DASCO dataset was produced by applying the DASCO workflow (https://doi.org/10.5281/zenodo.5841930) using the SInAS database (https://doi.org/10.5281/zenodo.10038256), version 2.5. The workflow imports checklists of alien species such as those stored in SInAS, and extracts coordinates for the alien regions (according to SInAS) from GBIF and OBIS.
The original DASCO data (version 2.5) can be found at https://zenodo.org/records/10054162. In this report we show how to further format the DASCO data in order to incorporate additional metadata (including taxonomic data) that can be used to extract data subsets for specific groups. The code below should be used only if you want to update the DASCO data to a newer version. If you are using DASCO v2.5, you can skip this section.
1 Load DASCO data
1.1 Load packages
```r
# install packages if not installed
if (!requireNamespace("sketchy", quietly = TRUE)) {
  install.packages("sketchy")
}

packages <- c("data.table", "rgbif", "pbapply", "dplyr")

# install/load packages
sketchy::load_packages(packages = packages)
```
1.2 Download/read data
We will download the .csv file containing version 2.5 of the DASCO data from the DASCO repository (https://zenodo.org/records/10054162). If you plan to use a newer version of the data, please skip this step. The following code downloads the file to a local folder defined by the user:
```r
local_folder <- "PATH TO YOUR LOCAL FOLDER" # e.g. "C:/Users/RAVe-M_2025/"

# The file is large, so raise R's default 60-second download timeout
options(timeout = max(3600, getOption("timeout")))

download.file(
  "https://zenodo.org/record/10054162/files/DASCO_AlienCoordinates_SInAS_2.5.csv?download=1",
  destfile = file.path(local_folder, "DASCO_AlienCoordinates_SInAS_2.5.csv"),
  mode = "wb"
)
```
To work on a more recent version of the DASCO data, replace the name and path of the file in the code below. Now we can read the downloaded file. We will use the fread function from the data.table package, which is optimized for fast reading of large files:
```r
# Load DASCO data
DASCO_v2.5 <- fread(file.path(local_folder, "DASCO_AlienCoordinates_SInAS_2.5.csv"))
```
2 Format data
2.1 Select non-marine observations
We first keep only the non-marine observations in the DASCO dataset, that is, the terrestrial and freshwater records:
```r
DASCO_v2.5 <- DASCO_v2.5[DASCO_v2.5$Realm != "marine", ]
```
2.2 Add GBIF taxonomy
Now we extract taxonomic details for all species in DASCO from GBIF’s backbone taxonomy (GBIF Secretariat, 2023) using the function rgbif::name_backbone() (this can take several hours to run):
```r
# Extract all species names in DASCO (18292 species names)
sppDASCO <- unique(DASCO_v2.5$taxon)

# Extract all GBIF info for those spp
GBIFtax_list <- pblapply(sppDASCO, name_backbone)
names(GBIFtax_list) <- sppDASCO

# Create a single DF
GBIFtax <- bind_rows(GBIFtax_list)
```
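Because the lookup can run for several hours, a single failed request would otherwise stop the whole loop. The following is a minimal sketch of one way to guard against that, not part of the original workflow. The helper `safe_lookup()` and the stand-in function `fake_backbone()` are hypothetical names introduced here so the example runs offline.

```r
# Hypothetical helper: wrap a lookup function so that a failed request
# returns a one-row placeholder instead of stopping the whole loop.
safe_lookup <- function(name, lookup_fun) {
  tryCatch(
    lookup_fun(name),
    error = function(e) {
      data.frame(canonicalName = NA_character_, matchType = "ERROR",
                 stringsAsFactors = FALSE)
    }
  )
}

# Stand-in for rgbif::name_backbone() so the sketch runs without a network
# connection; it fails on purpose for one name to show the fallback.
fake_backbone <- function(name) {
  if (name == "Bad name") stop("request failed")
  data.frame(canonicalName = name, matchType = "EXACT",
             stringsAsFactors = FALSE)
}

spp <- c("Rattus rattus", "Bad name", "Sus scrofa")
res_list <- lapply(spp, safe_lookup, lookup_fun = fake_backbone)
names(res_list) <- spp

# Names whose lookup failed and could be retried later
is_failed <- vapply(res_list, function(x) identical(x$matchType, "ERROR"), logical(1))
failed <- spp[is_failed]
failed
```

With the real data, `fake_backbone` would presumably be replaced by `name_backbone` and `lapply` by `pblapply`. Saving the resulting list with `saveRDS()` would also avoid repeating the lookup in a later session.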
We then add higher taxonomy to all records in DASCO v2.5 by merging the taxonomic information extracted from GBIF with the DASCO dataset:
```r
# Add a "taxon" column to the taxonomic dataset to match the names in the DASCO dataset
GBIFtax$taxon <- GBIFtax$canonicalName

# Merge taxonomic information with DASCO dataset
DASCO_tax <- merge(
  DASCO_v2.5,
  GBIFtax[, c("taxon", "speciesKey", "kingdom", "phylum", "class", "order", "family")],
  by = "taxon"
)
```
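By default `merge()` keeps only rows whose `taxon` value appears in both tables, so records are silently dropped when a name has no matching canonical name. The sketch below shows one way to check how many records are lost. It uses small made-up tables (the column `record_id` and all values are illustrative assumptions, not DASCO fields) and base data frames so that it runs on its own.

```r
# Toy stand-ins for DASCO_v2.5 and GBIFtax (values are made up for illustration)
dasco_toy <- data.frame(
  taxon = c("Rattus rattus", "Rattus rattus", "Sus scrofa", "Unknownus plantus"),
  record_id = 1:4,
  stringsAsFactors = FALSE
)
gbif_toy <- data.frame(
  taxon = c("Rattus rattus", "Sus scrofa"),
  kingdom = c("Animalia", "Animalia"),
  stringsAsFactors = FALSE
)

# Inner join, as in the merge above: unmatched names are dropped
merged_toy <- merge(dasco_toy, gbif_toy, by = "taxon")

# Names present in the occurrence table but absent from the taxonomy table
unmatched <- setdiff(unique(dasco_toy$taxon), gbif_toy$taxon)

# Number of occurrence records lost in the merge
n_lost <- nrow(dasco_toy) - nrow(merged_toy)

unmatched
n_lost
```

If unmatched records should be kept with missing taxonomy rather than dropped, `merge()` accepts `all.x = TRUE`.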
3 Save results
3.1 Save DASCO + taxonomy in a single file
We can save the DASCO data with taxonomic information as a .csv file:
```r
# Save the updated DASCO data with taxonomy
fwrite(DASCO_tax, file.path(local_folder, "DASCO_v2.5_withTaxonomy.csv"))
```
3.2 Save DASCO + taxonomy as data subsets by groups of interest
We can also split the dataset by kingdom (e.g. Animalia, Plantae, Fungi) and save each subset as a separate .csv file:
```r
# save a csv file for each subset by kingdom
for (i in unique(DASCO_tax$kingdom)) {
  subset_data <- DASCO_tax[DASCO_tax$kingdom == i, ]
  fwrite(subset_data, file.path(local_folder, paste0("DASCO_v2.5_", i, ".csv")))
}
```
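The same per-kingdom split can also be expressed with `split()`, which returns a named list of subsets. This is only an alternative sketch on a made-up table. The object names `toy`, `out_dir` and `by_kingdom` are illustrative, and the files go to a temporary folder rather than `local_folder`.

```r
# Toy table standing in for DASCO_tax (values are made up)
toy <- data.frame(
  taxon = c("Rattus rattus", "Acacia dealbata", "Sus scrofa"),
  kingdom = c("Animalia", "Plantae", "Animalia"),
  stringsAsFactors = FALSE
)

out_dir <- tempdir() # stand-in for local_folder

# split() returns a named list with one data frame per kingdom
by_kingdom <- split(toy, toy$kingdom)

# Write one file per list element, naming each file after its kingdom
out_files <- file.path(out_dir, paste0("toy_", names(by_kingdom), ".csv"))
for (k in seq_along(by_kingdom)) {
  write.csv(by_kingdom[[k]], out_files[k], row.names = FALSE)
}

basename(out_files)
```

Note that `split()` drops rows whose grouping value is `NA`, so records without a kingdom would not be written to any file.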
The output files can be found in the folder defined by the user in the local_folder variable.
The files saved above can be quite large (> 3 GB) and may be difficult to handle in some software. Therefore, we recommend selecting a subset that includes only the groups of interest. For instance, the following code chunks show how to extract vertebrates from the DASCO dataset, which is done by first extracting tetrapods and then fish.
To extract tetrapod species (amphibians, reptiles, birds and mammals) we can use the following code:
```r
# Extract DASCO records for tetrapod species
# (reptiles are selected here through Squamata, Crocodylia and Testudines)
DASCO_Tetrap <- subset(
  DASCO_tax,
  class %in% c("Amphibia", "Aves", "Squamata", "Crocodylia", "Testudines", "Mammalia")
)
```
The following code can be used to extract fish species (Actinopterygii):
```r
# Orders of Actinopterygii
Actinop <- c(
  "Acipenseriformes", "Albuliformes", "Amiiformes", "Anguilliformes",
  "Atheriniformes", "Aulopiformes", "Beloniformes", "Beryciformes",
  "Characiformes", "Clupeiformes", "Cypriniformes", "Cyprinodontiformes",
  "Elopiformes", "Esociformes", "Gadiformes", "Gasterosteiformes",
  "Gonorynchiformes", "Lepisosteiformes", "Mugiliformes", "Osmeriformes",
  "Osteoglossiformes", "Perciformes", "Percopsiformes", "Pleuronectiformes",
  "Salmoniformes", "Scorpaeniformes", "Siluriformes", "Synbranchiformes",
  "Syngnathiformes", "Tetraodontiformes"
)

# Extract subset
DASCO_Fish <- subset(DASCO_tax, order %in% Actinop)

# Add class name
DASCO_Fish$class <- "Actinopterygii"
```
We can now combine all vertebrates and save them as a single .csv file:
```r
# Combine all vertebrates
DASCO_Verts <- rbind(DASCO_Tetrap, DASCO_Fish)

# save csv
fwrite(DASCO_Verts, file.path(local_folder, "DASCO_v2.5_Vertebrates.csv"))
```
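Because the saved files can be several gigabytes in size, it may help to read back only the columns needed for a given analysis. The sketch below illustrates this with the `select` argument of `fread()` on a small made-up file. The file name `toy_DASCO_withTaxonomy.csv` and the `notes` column are assumptions for illustration only.

```r
library(data.table)

# Write a small toy file standing in for one of the large .csv outputs
toy_file <- file.path(tempdir(), "toy_DASCO_withTaxonomy.csv")
fwrite(
  data.table(
    taxon = c("Rattus rattus", "Sus scrofa"),
    class = c("Mammalia", "Mammalia"),
    family = c("Muridae", "Suidae"),
    notes = c("a", "b")
  ),
  toy_file
)

# Read only the columns needed; the others are skipped while parsing
small <- fread(toy_file, select = c("taxon", "class"))
names(small)
```

The same call pattern should apply to the files written above, with `toy_file` replaced by the path to the saved .csv file.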