Create GeoLocator DP from SOI database

Online version: https://rpubs.com/rafnuss/geolocator_create_from_soi

In this vignette, we cover the steps involved in the creation of the Core GeoLocator Data Package (i.e., before analysis) with the Swiss Ornithological Institute (SOI) geolocator dataset.

library(frictionless)
library(GeoLocatoR)
library(tidyverse)
library(writexl)
library(readxl)

Illustration of the entire pipeline of data for SOI. This script deals with steps 1, 2, 4, 5 and 6.

1 Create Project

A project will result in a single data package, but usually consists of multiple orders. As such, the creation of project metadata is only performed once.

First, define the path where you are working (typically on the Z: drive) and your project name.

dp_path <- "Z:/DOM_Forschung/UNIT_Vogelzug/40 Data/20 Geolocator/"
# dp_path <- '/Volumes/Daten/DOM_Forschung/UNIT_Vogelzug/40 Data/20 Geolocator/' # Mac User
project_name <- "test"

1.1 Create the project folder

You can create a new RStudio project with

usethis::create_project(path = glue::glue("{dp_path}/Datapackage/{project_name}"))
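Because `dp_path` above ends with a slash, pasting it in front of `/Datapackage/...` yields a doubled separator. The sketch below shows one way to build the same folder path with a single separator and to check that the shared drive is reachable before creating the project. `project_dir()` is a hypothetical helper written for this illustration, not part of GeoLocatoR or usethis.

```r
# Hypothetical helper: build the project directory without doubled separators.
project_dir <- function(dp_path, project_name) {
  # Drop any trailing "/" or "\" so that file.path() inserts exactly one separator
  base <- sub("[/\\\\]+$", "", dp_path)
  file.path(base, "Datapackage", project_name)
}

path <- project_dir("Z:/DOM_Forschung/UNIT_Vogelzug/40 Data/20 Geolocator/", "test")
print(path)

# Warn early if the parent folder cannot be found (e.g. drive not mounted)
if (!dir.exists(dirname(path))) {
  message("Folder not found (is the drive mounted?): ", dirname(path))
}
```

The resulting `path` can then be passed to `usethis::create_project(path = path)`.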

Add the data Scientific Collaboration Agreement to this folder.

1.2 Initiate DataPackage with `datapackage.json`

Start by setting the most basic metadata. The required metadata (marked with a * below) should be defined in the data agreement, while optional metadata can be added later. These metadata make up the datapackage.json file:

title*
contributors*: including email to define who has access to the private Zenodo during embargo
embargo*: default is no embargo with a date defined in the past (e.g. 1970-01-01)
licenses*: CC-BY-4.0 by default.
description
relatedIdentifiers
grants
keywords

pkg <- create_gldp(
  title = "Geolocator study of {species_name} in {location}", # required
  contributors = list( # required
    list(
      title = "Raphaël Nussbaumer",
      roles = c("ContactPerson", "DataCurator", "ProjectLeader"),
      email = "raphael.nussbaumer@vogelwarte.ch",
      path = "https://orcid.org/0000-0002-8185-1020",
      organization = "Swiss Ornithological Institute"
    ),
    list(
      title = "Yann Rime",
      roles = c("Researcher"),
      email = "yann.rimme@vogelwarte.ch",
      path = "https://orcid.org/0009-0005-7264-6753",
      organization = "Swiss Ornithological Institute"
    )
  ),
  # licenses = The default licenses should be ok in most cases
  embargo = "2028-01-01"
)
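As noted in the metadata list, "no embargo" is expressed by an embargo date in the past. A minimal sketch of that convention follows; `is_embargoed()` is a hypothetical helper for illustration only and is not a GeoLocatoR function.

```r
# Hypothetical helper: is a package still under embargo on a given day?
is_embargoed <- function(embargo, today = Sys.Date()) {
  as.Date(embargo) > as.Date(today)
}

# A date in the past means no embargo
print(is_embargoed("1970-01-01"))

# The embargo used in this vignette, evaluated on a fixed day
print(is_embargoed("2028-01-01", today = "2025-01-13"))
```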

Read the datapackage specification to learn about all recommended metadata that can be added. They can be added in create_gldp() or updated manually on pkg directly as below:

# Description is really important to provide some textual background information on the project.
pkg$description <- "Geolocator study of Mangrove Kingfisher and Red-capped Robin-chat on the coast of Kenya"

# You can also add keywords:
pkg$keywords <- c("Mangrove Kingfisher", "Red-capped Robin Chat", "multi-sensor geolocator")

# Funding sources
pkg$grants <- c("Swiss Ornithological Intitute")

# Identifiers of resources related to the package (e.g. papers, project pages, derived datasets, APIs, etc.).
pkg$relatedIdentifiers <- list(
  list(
    relationType = "IsDescribedBy",
    relatedIdentifier = "10.13140/RG.2.2.34477.10721",
    relatedIdentifierType = "DOI"
  )
)

2 Create the pre-filled `tags.csv` and `observations.csv`

The new geolocators need to be sent together with pre-filled tags.csv and observations.csv. Here is how to create them.

We first read the entire Access database with read_gdl().

Caution

On an SOI Windows PC, you need to install the 64-bit Microsoft Access driver, which can be downloaded from the Microsoft Download Center. Run the installer from the terminal with the /quiet option so that it can be installed alongside the existing 32-bit version:

`Downloads\accessdatabaseengine_X64.exe /quiet`

gdl0 <- read_gdl(
  access_file = "../../40 Database/GDL_Data.accdb", # Assuming you're in the created RStudio project
  filter_col = FALSE
)

Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)
One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat)

For this example, we will take the geolocator data from the order OtuScoES23 and add them to the pkg created earlier.

pkg <- pkg %>%
  add_gldp_soi(
    gdl = gdl0 %>% filter(str_detect(OrderName, "OtuScoES23")),
    directory_data = "." # !!! THIS IS A WRONG PATH, USED AS A TRICK TO MAKE THE CODE THINK THAT WE DON'T HAVE DATA.
    # directory_data = "../../10 Raw data" # USE THIS LINE FOR INCLUDING THE DATA (See below)
  )
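The filter above keeps every order whose name contains the given string, whereas a comparison with `==` keeps only exact matches. The base R sketch below illustrates the difference on a toy table. The `GDL_ID` values and the order name `OtuScoES23b` are hypothetical and do not come from the SOI database.

```r
# Toy stand-in for the order table (GDL_ID values and "OtuScoES23b" are hypothetical)
orders <- data.frame(
  GDL_ID = c("A1", "A2", "A3"),
  OrderName = c("OtuScoES23", "OtuScoES23b", "OtuScoES24"),
  stringsAsFactors = FALSE
)

# Partial match, comparable to str_detect(OrderName, "OtuScoES23")
partial <- orders[grepl("OtuScoES23", orders$OrderName, fixed = TRUE), ]

# Exact match, comparable to OrderName == "OtuScoES23"
exact <- orders[orders$OrderName == "OtuScoES23", ]

print(partial$GDL_ID) # A1 and A2
print(exact$GDL_ID)   # A1 only
```

If order names can share a prefix, the exact comparison is the safer choice.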

Before creating the files, make sure to update the version of the package.

pkg$version <- "v0.1.0" # v(0=not analysed),(1=first year of the project),(0=empty table)
write_package(pkg, directory = pkg$version)
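The version string follows the three-part scheme given in the comment: analysis status, year of the project, and revision of the tables. The sketch below shows how such strings could be assembled and bumped consistently. Both helpers are hypothetical and written only for this illustration.

```r
# Hypothetical helper: assemble a version string from its three parts.
# analysed: 0 = not analysed; year: year of the project; revision: 0 = empty tables
make_version <- function(analysed, year, revision) {
  sprintf("v%d.%d.%d", as.integer(analysed), as.integer(year), as.integer(revision))
}

# Hypothetical helper: increase the last number (the revision) by one.
bump_revision <- function(version) {
  parts <- as.integer(strsplit(sub("^v", "", version), ".", fixed = TRUE)[[1]])
  stopifnot(length(parts) == 3, !anyNA(parts))
  make_version(parts[1], parts[2], parts[3] + 1L)
}

print(make_version(0, 1, 0))   # "v0.1.0"
print(bump_revision("v0.1.0")) # "v0.1.1"
```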

Caution

Make sure you share instructions with ringers/collaborators on how to fill in these csv files by sending them the link to the documentation of tags.csv and observations.csv:

Do not modify column headers, and do not remove (or add) columns.
Make sure to fill in all required fields (marked by * in the documentation).
Each tag should be present only once in tags.csv and at least once in observations.csv with an observation_type with value equipment.
Remind them that you can only send back the geolocator if these tables are completely and correctly filled and returned to you.
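Some of the rules above can be checked quickly when a file comes back. The following base R sketch tests the column headers, the uniqueness of `tag_id` in the tags table, and the presence of an `equipment` observation for every tag. `check_returned()` is a hypothetical helper and not a replacement for `validate_gldp()`. It does not check required fields, and the tag and ring identifiers in the toy data are made up.

```r
# Hypothetical pre-check of returned tables.
# sent_tags: the tags table you sent out; tags/observations: the tables you got back.
check_returned <- function(sent_tags, tags, observations) {
  problems <- character(0)
  if (!identical(names(sent_tags), names(tags))) {
    problems <- c(problems, "tags: column headers were modified")
  }
  dup <- unique(tags$tag_id[duplicated(tags$tag_id)])
  if (length(dup) > 0) {
    problems <- c(problems, paste("tags: duplicated tag_id:", paste(dup, collapse = ", ")))
  }
  # %in% treats a missing observation_type as "not equipment"
  equipped <- observations$tag_id[observations$observation_type %in% "equipment"]
  missing <- setdiff(tags$tag_id, equipped)
  if (length(missing) > 0) {
    problems <- c(problems, paste("observations: no equipment row for:", paste(missing, collapse = ", ")))
  }
  problems
}

# Toy example with made-up identifiers: T2 is duplicated and has no equipment row
sent <- data.frame(tag_id = character(0), ring_number = character(0), stringsAsFactors = FALSE)
tags_back <- data.frame(
  tag_id = c("T1", "T2", "T2"),
  ring_number = c("R1", "R2", "R2"),
  stringsAsFactors = FALSE
)
obs_back <- data.frame(tag_id = "T1", observation_type = "equipment", stringsAsFactors = FALSE)
print(check_returned(sent, tags_back, obs_back))
```

An empty character vector means that none of the three checks found a problem.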

Tip

As an alternative to the .csv files, you might want to create .xlsx spreadsheets, which preserve the column classes and might be easier for the ringers to work with.

write_xlsx(tags(pkg), file.path(pkg$version, "tags.xlsx"))
write_xlsx(observations(pkg), file.path(pkg$version, "observations.xlsx"))

3 Processing returned data

3.1 Process geolocator data

The geolocator data can be extracted and stored as usual on the Z: drive.

3.2 Store returned `tags.csv` and `observations.csv`

For every new tags.csv and/or observations.csv file returned, I suggest creating a new folder with a bumped version number. Copy the files (including datapackage.json) from the previous version and overwrite them with the returned files. For the first file returned, the version number should be:

version <- "v0.1.1" # v(0=not analysed),(1=first year of the project),(1=first version returned)
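Copying the previous version into a new folder can also be scripted. The sketch below is one possible way to do it with base R, demonstrated in a temporary directory so that nothing on the shared drive is touched. `start_new_version()` is a hypothetical helper, and the file contents in the demonstration are placeholders.

```r
# Hypothetical helper: start a new version folder from the previous one.
start_new_version <- function(previous, new, root = ".") {
  from <- file.path(root, previous)
  to <- file.path(root, new)
  # Refuse to overwrite an existing version folder
  stopifnot(dir.exists(from), !dir.exists(to))
  dir.create(to)
  files <- list.files(from, full.names = TRUE)
  files <- files[!dir.exists(files)] # copy regular files only
  stopifnot(all(file.copy(files, to)))
  invisible(to)
}

# Demonstration in a temporary directory with placeholder files
root <- tempfile("gldp_")
dir.create(file.path(root, "v0.1.0"), recursive = TRUE)
writeLines("{}", file.path(root, "v0.1.0", "datapackage.json"))
writeLines("tag_id", file.path(root, "v0.1.0", "tags.csv"))

start_new_version("v0.1.0", "v0.1.1", root = root)
print(list.files(file.path(root, "v0.1.1")))
```

The returned tags.csv and observations.csv can then be placed in the new folder, replacing the copied ones.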

3.3 Create DataPackage with data

We can now use the function read_gldp() to read the datapackage from the previous version and add the three core resources to pkg with add_gldp_soi().

pkg <- read_gldp(file = file.path(version, "datapackage.json")) # Read the datapackage from the folder just created

# Add data by selection
pkg <- pkg %>% add_gldp_soi(
  gdl = gdl0 %>% filter((GDL_ID %in% tags(pkg)$tag_id) & (OrderName == "OtuScoES23")),
  directory_data = "../../10 Raw data"
)


# Bump the version
pkg$version <- "v0.1.2"
# v(0=not analysed),(1=first year of the project),(2=new version of the package with the same list of tag_id)

# Display the package
print(pkg)

── A GeoLocator Data Package (v0.2)

• title: "Geolocator study of {species_name} in {location}"

• contributors:

  Raphaël Nussbaumer ('raphael.nussbaumer@vogelwarte.ch') (ContactPerson,
  DataCurator, ProjectLeader) - <https://orcid.org/0000-0002-8185-1020>

  Yann Rime ('yann.rimme@vogelwarte.ch') (Researcher) -
  <https://orcid.org/0009-0005-7264-6753>

• embargo: 2028-01-01

• licenses: Creative Commons Attribution 4.0 (CC-BY-4.0) -
  <https://creativecommons.org/licenses/by/4.0/>

• description: "Geolocator study of Mangrove Kingfisher and Red-capped
  Robin-chat on the coast of Kenya"

• version: "v0.1.2"

• relatedIdentifiers:

  IsDescribedBy <10.13140/RG.2.2.34477.10721>

• grants: "Swiss Ornithological Intitute"

• keywords: "Mangrove Kingfisher", "Red-capped Robin Chat", and "multi-sensor
  geolocator"

• created: 2025-01-13 10:57:54

• spatial: Polygon and c(-3.840505, -3.794914, -3.794914, -3.840505, -3.840505,
  43.455195, 43.455195, 43.485319, 43.485319, 43.455195)

• temporal: "2023-07-01" to "2024-08-20"

• taxonomic: "Otus scops"

• numberTags:

  tags: 22

  measurements: 5

  light: 5

  pressure: 5

  activity: 5

  magnetic: 5

── 3 resources:

• tags
• observations
• measurements
Use `unclass()` to print the Geolocator Data Package as a list.

As we’ve just added the three core resources, some additional metadata of the package were automatically computed. You can now see the total number of tags.

3.4 Validate datapackage

Before publishing/sharing your data, it is essential to validate your GeoLocator Data Package.

validate_gldp(pkg)

── Check GeoLocator DataPackage profile ──

✔ title is valid.

✔ contributors is valid.

✔ contributors[[1]] is valid.

✔ contributors[[2]] is valid.

✔ embargo is valid.

✔ licenses is valid.

✔ licenses[[1]] is valid.

✔ created is valid.

✔ $schema is valid.

✔ resources is valid.

✔ resources[[1]] is valid.

✔ resources[[2]] is valid.

✔ resources[[3]] is valid.

✔ description is valid.

✔ keywords is valid.

✔ keywords[[1]] is valid.

✔ keywords[[2]] is valid.

✔ keywords[[3]] is valid.

✔ grants is valid.

✔ grants[[1]] is valid.

✔ relatedIdentifiers is valid.

✔ relatedIdentifiers[[1]] is valid.

✔ taxonomic is valid.

✔ taxonomic[[1]] is valid.

✔ numberTags is valid.

✔ numberTags$tags is valid.

✔ numberTags$measurements is valid.

✔ numberTags$light is valid.

✔ numberTags$pressure is valid.

✔ numberTags$activity is valid.

✔ numberTags$temperature_external is valid.

✔ numberTags$temperature_internal is valid.

✔ numberTags$magnetic is valid.

✔ numberTags$wet_count is valid.

✔ numberTags$conductivity is valid.

✔ spatial is valid.

! spatial cannot be validated (external schema).

✔ version is valid.

✔ temporal is valid.

✔ temporal$start is valid.

✔ temporal$end is valid.

✔ Package is consistent with the profile.

── Check GeoLocator DataPackage Resources

── Check GeoLocator DataPackage Resources tags ──

✔ tags$tag_id is valid.

✔ tags$ring_number is valid.

✔ tags$scientific_name is valid.

✔ tags$manufacturer is valid.

✔ tags$model is valid.

✔ tags$firmware is valid.

✔ tags$weight is valid.

✔ tags$attachment_type is valid.

✔ tags$readout_method is valid.

✔ tags$tag_comments is valid.

✔ Table tags is consistent with the schema.

── Check GeoLocator DataPackage Resources observations ──

✔ observations$ring_number is valid.

✔ observations$tag_id is valid.

✔ observations$observation_type is valid.

✔ observations$datetime is valid.

✔ observations$latitude is valid.

✔ observations$longitude is valid.

✔ observations$location_name is valid.

✔ observations$device_status is valid.

✔ observations$observer is valid.

✔ observations$catching_method is valid.

✔ observations$age_class is valid.

✔ observations$sex is valid.

✔ observations$condition is valid.

✔ observations$mass is valid.

✔ observations$wing_length is valid.

✔ observations$additional_metric is valid.

✔ observations$observation_comments is valid.

✔ Table observations is consistent with the schema.

── Check GeoLocator DataPackage Resources measurements ──

✔ measurements$tag_id is valid.

✔ measurements$sensor is valid.

✔ measurements$datetime is valid.

✔ measurements$value is valid.

✔ measurements$label is valid.

✔ Table measurements is consistent with the schema.

✔ Package's ressources are valid.

── Check GeoLocator DataPackage Coherence

✔ Package is internally coherent.

── Check Observations Coherence

✔ observations table is coherent.

✔ Package is valid.

4 Create the Zenodo repository

4.1 Option 1: Manually

First, create a new deposit on Zenodo and reserve the DOI to be able to define the package id.

The package id should be the concept DOI, that is, the one that doesn’t change with new versions. The DOI displayed on Zenodo is actually the DOI of the first version, but you can retrieve the concept DOI by subtracting 1 from your ID number:

pkg$id <- "https://doi.org/10.5281/zenodo.{ZENODO_ID - 1}"
# e.g. "10.5281/zenodo.14620590" for a DOI reserved as 10.5281/zenodo.14620591

# Update the bibliographic citation with this new DOI
pkg <- pkg %>% update_gldp_bibliographic_citation()
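The subtraction described above can be written as a small function. This is a sketch that relies entirely on the rule stated in this section (concept DOI number = reserved DOI number minus 1). `concept_id()` is a hypothetical helper, not part of GeoLocatoR or zen4R.

```r
# Hypothetical helper: derive the package id from the DOI reserved for the first version.
# Relies on the rule described above: concept DOI number = reserved number - 1.
concept_id <- function(version_doi) {
  m <- regmatches(version_doi, regexec("^(10\\.5281/zenodo\\.)([0-9]+)$", version_doi))[[1]]
  if (length(m) != 3) stop("Not a Zenodo DOI: ", version_doi)
  # Integer arithmetic avoids scientific notation when pasting the number
  paste0("https://doi.org/", m[2], as.integer(m[3]) - 1L)
}

print(concept_id("10.5281/zenodo.14620591"))
```

The returned string can be assigned to `pkg$id`.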

Now, we can write the datapackage to file with

write_package(pkg, directory = pkg$version)

The content of the folder created can now be uploaded on your Zenodo deposit.

Tip

You can populate all other fields on Zenodo with the information provided in datapackage.json! Note that the data package's contributors correspond to creators on Zenodo, not to Zenodo's contributors.

4.2 Option 2: Programmatically

A more efficient solution is to create a deposit on Zenodo using the API. For this, you first need to create a token and save it to your keyring with:

keyring::key_set_with_value("ZENODO_PAT", password = "{your_zenodo_token}")

This will allow us to create a ZenodoManager object which will become useful later.

zenodo <- zen4R::ZenodoManager$new(token = keyring::key_get(service = "ZENODO_PAT"))

✔ Successfully connected to Zenodo with user token

You can create a zen4R::ZenodoRecord object from pkg.

z <- gldp2zenodoRecord(pkg)

✔ Successfully connected to Zenodo with user token

✔ Successfully fetched resourcetype 'dataset'

✔ Successfully fetched list of affiliations!

Warning: ! Zenodo's creator can only have a single role.
→ Only the first role will be kept

✔ Successfully fetched list of affiliations!

✔ Successfully fetched license 'cc-by-4.0'

✔ Successfully fetched list of funders!

print(z)

<ZenodoRecord>
....|-- created: <NULL>
....|-- updated: <NULL>
....|-- revision_id: <NULL>
....|-- is_draft: <NULL>
....|-- is_published: <NULL>
....|-- status: <NULL>
....|-- versions: <NULL>
....|-- access: 
........|-- record: public
........|-- files: restricted
........|-- embargo: 
............|-- active: TRUE
............|-- until: 2028-01-01
............|-- reason: 
....|-- files: <NULL>
....|-- id: <NULL>
....|-- links: <NULL>
....|-- metadata: 
........|-- resource_type: 
............|-- id: dataset
........|-- publisher: Zenodo
........|-- title: Geolocator study of {species_name} in {location}
........|-- creators: 
........|-- rights: 
........|-- description: Geolocator study of Mangrove Kingfisher and Red-capped Robin-chat on the coast of Kenya
........|-- version: v0.1.2
........|-- related_identifiers: 
........|-- subjects: 
........|-- publication_date: 2025-01-13
....|-- parent: <NULL>
....|-- pids: <NULL>
....|-- stats: <NULL>

Tip

Learn more about the zen4R package!

You can now create the deposit on Zenodo. We reserve the DOI but do not publish the record yet, since it does not contain any data.

z <- zenodo$depositRecord(z, reserveDOI = TRUE, publish = FALSE)

✔ Successful record deposition

✔ Successful reserved DOI for record 15004303

You can now open this record in your browser using its self_html link:

print(z$links$self_html)

[1] "https://zenodo.org/uploads/15004303"

We can retrieve the concept DOI to build the pkg id

pkg$id <- paste0("https://doi.org/", z$getConceptDOI())

We can now upload the data to the deposit with (or do it manually from the website)

write_package(pkg, directory = pkg$version)
for (f in list.files(pkg$version)) {
  zenodo$uploadFile(file.path(pkg$version, f), z)
}
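Before running the upload loop, it can be useful to confirm that the folder contains what you expect to send. The sketch below lists the files and stops if one of the expected files is missing. `files_to_upload()` is a hypothetical helper, and its default list of expected files is an assumption based on the file names used in this vignette.

```r
# Hypothetical pre-upload check: return the files that the loop would send,
# or stop if an expected file is missing.
files_to_upload <- function(directory,
                            expected = c("datapackage.json", "tags.csv", "observations.csv")) {
  found <- list.files(directory)
  missing <- setdiff(expected, found)
  if (length(missing) > 0) {
    stop("Missing in ", directory, ": ", paste(missing, collapse = ", "))
  }
  file.path(directory, found)
}

# Demonstration in a temporary directory with empty placeholder files
dir_demo <- tempfile("upload_")
dir.create(dir_demo)
for (f in c("datapackage.json", "tags.csv", "observations.csv")) {
  writeLines("", file.path(dir_demo, f))
}
print(basename(files_to_upload(dir_demo)))
```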

Warning

At this stage, the Zenodo record is still not published. This is intentionally not done automatically so that you can check the record before publishing.

Tip

A nice feature of Zenodo is that you can share the record BEFORE publication with others (e.g., co-authors) allowing them to check everything before publication.

If any modifications to the metadata are made on Zenodo, you can overwrite pkg’s metadata with

z_updated <- zenodo$getDepositionByConceptDOI(z$getConceptDOI())
pkg <- zenodoRecord2gldp(z_updated, pkg)

5 Following years

In the following years, the same two operations will need to be performed with a subset of the geolocators.

5.1 Placing order

The creation of tags.csv and observations.csv follows the same procedure. Create the data package from datapackage.json and read the corresponding gdl (using OtuScoES24 for the new year). add_gldp_soi() gives priority to the existing tags and observations and only generates empty rows for the new geolocators.

pkg <- read_gldp("v0.1.2/datapackage.json") %>%
  add_gldp_soi(
    gdl = gdl0 %>% filter(str_detect(OrderName, "OtuScoES24")),
    directory_data = "../../data"
  )

pkg$version <- "v0.2.0" # v(0=not analysed),(2=second year of the project),(0=empty table)
write_package(pkg, directory = pkg$version)

The tags.csv and observations.csv created will therefore contain a combination of last year’s and the current year’s data.
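The idea of keeping existing rows and appending empty rows for new geolocators can be illustrated with base R. This sketch only mimics the behaviour described in this section on a toy table and is not the implementation of add_gldp_soi(). `merge_tags()` and the identifiers are hypothetical.

```r
# Illustration only: this is not how add_gldp_soi() is implemented.
# Existing rows are kept untouched; tag ids not yet present get an empty (NA) row.
merge_tags <- function(existing, new_ids) {
  to_add <- setdiff(new_ids, existing$tag_id)
  # Indexing with NA creates rows in which every column is NA
  empty <- existing[rep(NA_integer_, length(to_add)), , drop = FALSE]
  empty$tag_id <- to_add
  merged <- rbind(existing, empty)
  rownames(merged) <- NULL
  merged
}

# Toy table with made-up identifiers: T2 already exists, T3 and T4 are new
last_year <- data.frame(
  tag_id = c("T1", "T2"),
  ring_number = c("R1", "R2"),
  stringsAsFactors = FALSE
)
merged <- merge_tags(last_year, c("T2", "T3", "T4"))
print(merged)
```

In the result, T1 and T2 keep their ring numbers, while T3 and T4 appear with a missing ring number to be filled in by the ringers.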

5.2 Processing returned data

The newly filled tags.csv and observations.csv files that the ringers returned should now be put in the new folder version v0.2.1.

version <- "v0.2.1" # v(0=not analysed),(2=second year of the project),(1=first version returned)

pkg <- read_gldp(file = file.path("v0.2.1", "datapackage.json")) # Read the datapackage from the folder just created

# Add data by selection
pkg <- pkg %>% add_gldp_soi(
  gdl = gdl0 %>% filter((GDL_ID %in% tags(pkg)$tag_id) & (OrderName == "OtuScoES24")),
  directory_data = "../../10 Raw data"
)

Create the new version with

# Bump the version
pkg$version <- "v0.2.2"
# v(0=not analysed),(2=second year of the project),(2=new version of the package with the same list of tag_id)

write_package(pkg, directory = pkg$version)

5.3 Update Zenodo

Important

This section has not been tested yet. A new deposit probably needs to be created if the record has already been published.

We can retrieve the latest record from the concept DOI

z <- zenodo$getDepositionByConceptDOI(gsub("https://doi.org/", "", pkg$id))

If any metadata from datapackage.json has changed, we can update the deposit with

z <- gldp2zenodoRecord(pkg, z)

✔ Successfully connected to Zenodo with user token

✔ Successfully fetched resourcetype 'dataset'

✔ Successfully fetched list of affiliations!

Warning: ! Zenodo's creator can only have a single role.
→ Only the first role will be kept

✔ Successfully fetched list of affiliations!

✔ Successfully fetched license 'cc-by-4.0'

✔ Successfully fetched list of funders!

We can create a new version with

z <- zenodo$depositRecordVersion(
  z,
  delete_latest_files = TRUE,
  files = file.path(pkg$version, list.files(pkg$version)),
  publish = FALSE
)